Neural networks are powerful enough to memorize training data. Regularization prevents that and forces them to learn general patterns instead.
Give a neural network enough parameters and it can memorize the training data perfectly. Train loss: zero. Test loss: catastrophic. This is overfitting: the network has learned the training set's noise instead of the underlying pattern.
The core challenge in deep learning is generalization: performing well on data the model has never seen. Regularization is any modification to the training procedure that is intended to reduce test error, and with it the gap between training and test error, even if it slightly increases training error.
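The simplest regularizer: add a penalty on the size of the weights to the loss function. The modified objective becomes $\tilde{J}(\theta) = J(\theta) + \alpha R(\theta)$, where the hyperparameter $\alpha \ge 0$ controls the strength of regularization ($\alpha = 0$ recovers the original objective).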
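L2 regularization (weight decay) uses $R(\theta) = \tfrac{1}{2}\lVert w \rVert_2^2 = \tfrac{1}{2}\sum_i w_i^2$. The penalty grows quadratically, so large weights are penalized most. The gradient contribution is $\alpha w$, so with learning rate $\epsilon$ the update rule becomes $w \leftarrow (1 - \epsilon\alpha)\,w - \epsilon \nabla_w J$. At each step, the weights shrink by the constant factor $(1 - \epsilon\alpha)$ before the usual gradient step, hence "weight decay."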
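The update rule above can be checked numerically. The following is a minimal sketch in plain Python, assuming vanilla SGD; the function names and the toy numbers are illustrative, with `lr` standing for the learning rate ε and `alpha` for the strength α. It shows that "shrink, then step" gives the same result as adding the penalty gradient αw explicitly.

```python
def sgd_step_weight_decay(w, grad, lr, alpha):
    """One SGD step in 'weight decay' form: shrink w, then take the usual step."""
    decay = 1.0 - lr * alpha
    return [decay * wi - lr * gi for wi, gi in zip(w, grad)]


def sgd_step_explicit_penalty(w, grad, lr, alpha):
    """Same step, written as the gradient of J(w) + (alpha / 2) * ||w||^2."""
    return [wi - lr * (gi + alpha * wi) for wi, gi in zip(w, grad)]


w = [2.0, -1.0, 0.5]
grad = [0.3, -0.2, 0.0]   # hypothetical data gradient, chosen for illustration

decayed = sgd_step_weight_decay(w, grad, lr=0.1, alpha=0.5)
explicit = sgd_step_explicit_penalty(w, grad, lr=0.1, alpha=0.5)
print(decayed)
print(explicit)
```

The two forms coincide only for plain SGD; with adaptive optimizers they differ, which is the distinction the AdamW entry in the table at the end of the chapter points to.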
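L1 regularization uses $R(\theta) = \lVert w \rVert_1 = \sum_i |w_i|$. Its gradient contribution is $\alpha\,\mathrm{sign}(w)$: a push toward zero of constant size, no matter how small the weight already is. Unlike L2, L1 can therefore drive weights exactly to zero, producing sparse models. This is useful for feature selection: the weights of unimportant features become exactly zero.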
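To see the sparsity in action, here is a small sketch. Note the swap: plain (sub)gradient steps on the L1 penalty tend to hover around zero rather than land on it, so this example uses the soft-thresholding (proximal) update instead, a common way to realise L1's exact zeros that the text does not itself describe. All names and numbers are illustrative.

```python
def soft_threshold(w, threshold):
    """Move each weight toward zero by `threshold`; small weights become 0.0."""
    out = []
    for wi in w:
        if wi > threshold:
            out.append(wi - threshold)
        elif wi < -threshold:
            out.append(wi + threshold)
        else:
            out.append(0.0)   # anything within the threshold is zeroed out
    return out


def l2_shrink(w, lr, alpha):
    """L2 weight decay for comparison: multiplicative shrinkage only."""
    return [(1.0 - lr * alpha) * wi for wi in w]


w = [1.5, -0.04, 0.02, -0.8]
lr, alpha = 0.1, 0.5                      # L1 threshold = lr * alpha = 0.05

sparse = soft_threshold(w, lr * alpha)    # two small weights become exactly 0.0
shrunk = l2_shrink(w, lr, alpha)          # every weight is smaller, none is zero
print(sparse)
print(shrunk)
```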
As the regularization strength α increases, L2 shrinks every weight in proportion to its size, while L1 drives some weights to exactly zero.
The best way to reduce overfitting is more data. When you cannot collect more, you can generate more by applying label-preserving transformations to existing data. This is data augmentation.
For images: random crops, horizontal flips, rotations, color jittering, cutout (randomly erasing rectangular patches). A horizontally flipped cat is still a cat. A slightly rotated digit 7 is still a 7. These transformations teach the network invariances it should already have.
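As a rough illustration, two of these transformations can be written in a few lines of plain Python. This is a sketch only: images are assumed to be 2-D lists of pixel values, and the function names, crop size, and flip probability are all made up for the example.

```python
import random


def horizontal_flip(image):
    """Mirror an image left-to-right. `image` is a list of pixel rows."""
    return [row[::-1] for row in image]


def random_crop(image, size, rng):
    """Cut a random size x size window out of a 2-D image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)     # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def augment(image, label, rng, crop_size=3):
    """Return a randomly transformed copy; the label is left unchanged."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, crop_size, rng), label


rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 "image"
patch, label = augment(image, "cat", rng)
```

Calling `augment` afresh each epoch means the network rarely sees exactly the same pixels twice, while the label stays fixed.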
Noise injection is another form of augmentation. Adding small Gaussian noise to inputs, hidden units, or even weights forces the network to be robust to perturbations. Softening the output labels (label smoothing) prevents the model from becoming overconfident: instead of the target [0, 0, 1], use [0.033, 0.033, 0.933], which moves 0.1 of the probability mass and spreads it uniformly over the three classes.
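The smoothed target in the example can be reproduced with a short sketch. It assumes the common formulation in which a fraction `smoothing` of the probability mass is spread uniformly over all classes; the value 0.1 is inferred from the numbers in the text, and the function name is illustrative.

```python
def smooth_labels(one_hot, smoothing=0.1):
    """Mix a one-hot target with the uniform distribution over classes."""
    num_classes = len(one_hot)
    return [(1.0 - smoothing) * y + smoothing / num_classes for y in one_hot]


target = smooth_labels([0, 0, 1], smoothing=0.1)
print([round(t, 3) for t in target])   # [0.033, 0.033, 0.933]
```

The smoothed target still sums to 1, so it remains a valid probability distribution for a cross-entropy loss.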
During training, track the loss on a held-out validation set. Initially both training and validation loss decrease. At some point, training loss keeps decreasing but validation loss starts rising: the model is beginning to memorize. Early stopping means halting training once validation loss has stopped improving and keeping the parameters from the point of lowest validation loss.
Early stopping is arguably the most widely used regularizer in deep learning. It is essentially free — you need a validation set anyway for model selection, and stopping early actually saves compute.
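A minimal sketch of the stopping rule, assuming a "patience" criterion (stop after a fixed number of epochs without a new best validation loss). Patience is a common convention rather than something stated above, and the loss values are invented for illustration.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs pass without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # In a real training loop, this is where a checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical validation curve: improves until epoch 3, then rises.
val = [1.00, 0.80, 0.65, 0.60, 0.62, 0.66, 0.71, 0.80]
best_epoch, stop_epoch = early_stopping(val, patience=3)
print(best_epoch, stop_epoch)   # 3 6
```

The model that is kept is the one from `best_epoch`, not the one from the epoch at which training stopped.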
During each training step, dropout randomly sets each hidden unit to zero with probability p (commonly p = 0.5 for hidden units), so each step trains a different randomly "thinned" network. At test time, all units are active but their outputs are scaled by the keep probability (1 − p), so that each unit's expected output matches what the next layer saw during training.
Why does this work? Each hidden unit cannot rely on any particular other unit being present, so it must learn features that are independently useful. Dropout prevents co-adaptation — neurons developing complex, fragile dependencies.
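The train/test asymmetry can be sketched as follows. This follows the formulation described above (drop with probability `p` during training, scale by `1 - p` at test time); the function names and the toy activations are illustrative.

```python
import random


def dropout_train(h, p, rng):
    """Training: zero each unit independently with probability p."""
    return [0.0 if rng.random() < p else hi for hi in h]


def dropout_test(h, p):
    """Test: keep every unit, scaled by the keep probability (1 - p)."""
    return [(1.0 - p) * hi for hi in h]


rng = random.Random(0)
h = [1.0, 2.0, 3.0, 4.0]
p = 0.5

# Averaging over many random masks approaches the test-time output.
trials = 20000
average = [0.0] * len(h)
for _ in range(trials):
    for i, value in enumerate(dropout_train(h, p, rng)):
        average[i] += value / trials

expected = dropout_test(h, p)
```

Many frameworks instead use "inverted dropout": the surviving units are scaled by 1 / (1 − p) during training, so nothing needs to change at test time. The expected activations are the same either way.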
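Batch normalization (BatchNorm) normalizes each layer's activations to have zero mean and unit variance across the mini-batch, then applies a learned affine transformation:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \delta}}, \qquad y = \gamma\,\hat{x} + \beta$$
Here $\mu_B$ and $\sigma_B^2$ are the batch mean and variance, $\delta$ is a small constant added for numerical stability, and $\gamma$, $\beta$ are learned parameters that allow the network to undo the normalization if needed.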
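For a single feature, the training-time computation can be sketched in plain Python. The small constant `delta` added under the square root for numerical stability is an assumption (it is standard practice but its value is not given above), and the function name is illustrative. Running statistics for inference are left out.

```python
import math


def batch_norm(batch, gamma, beta, delta=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m       # biased batch variance
    x_hat = [(x - mean) / math.sqrt(var + delta) for x in batch]
    return [gamma * xh + beta for xh in x_hat]


batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch, gamma=1.0, beta=0.0)   # roughly zero mean, unit variance
shifted = batch_norm(batch, gamma=2.0, beta=3.0)      # roughly mean 3, standard deviation 2
```

Setting `gamma` to the batch standard deviation and `beta` to the batch mean would recover the original activations (up to `delta`), which is the sense in which the network can undo the normalization.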
BatchNorm was originally motivated as addressing internal covariate shift: the idea that the distribution of each layer's inputs keeps changing during training as earlier layers update. While this explanation is debated, the empirical benefits are clear: faster training, tolerance of higher learning rates, and reduced sensitivity to initialization.
Multi-task learning trains one network on multiple related tasks simultaneously. The shared layers learn general features useful across tasks, while task-specific heads specialize. Sharing forces the network to learn more robust representations than it would from a single task.
Model ensembles train several models independently and average their predictions. If the models make different (ideally uncorrelated) errors, those errors partly cancel in the average. Ensembles are among the most reliable ways to improve performance, at the cost of N× compute for N models.
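The averaging effect can be simulated without training anything. The sketch below makes an idealised assumption: each "model" is the true value plus independent Gaussian noise. Real ensemble members have correlated errors, so the gain in practice is smaller. All names and numbers are illustrative.

```python
import random


def ensemble_demo(n_models=10, n_points=1000, noise=1.0, seed=0):
    """Compare single-model error with the error of the averaged prediction.

    Each simulated model predicts truth + independent Gaussian noise.
    Returns (mean squared error of one model, MSE of the ensemble average).
    """
    rng = random.Random(seed)
    single_sq_err = 0.0
    ensemble_sq_err = 0.0
    for _ in range(n_points):
        truth = rng.uniform(-1.0, 1.0)
        preds = [truth + rng.gauss(0.0, noise) for _ in range(n_models)]
        single_sq_err += sum((p - truth) ** 2 for p in preds) / n_models
        ensemble_sq_err += (sum(preds) / n_models - truth) ** 2
    return single_sq_err / n_points, ensemble_sq_err / n_points


single_mse, ensemble_mse = ensemble_demo()
print(single_mse, ensemble_mse)
```

With fully independent errors, the squared error of the average falls roughly as 1/N. Even with correlated errors, the average is never worse than the members' mean squared error.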
Parameter sharing is a form of regularization built into the architecture. CNNs share convolution filter weights across spatial positions. RNNs share weights across time steps. This builds in a structural assumption (translation equivariance for CNNs, the same transition rule at every time step for RNNs) and dramatically reduces the number of independent parameters.
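A toy 1-D convolution makes the parameter saving concrete. This is a sketch with made-up numbers: one three-weight kernel is reused at every position, and its parameter count is compared with a hypothetical fully connected layer mapping the same input to the same number of outputs.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation): one kernel reused at every position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
kernel = [-1.0, 0.0, 1.0]                  # 3 shared parameters
features = conv1d(signal, kernel)
print(features)                            # [2.0, 2.0, 0.0, -2.0, -2.0]

n_in, n_out = len(signal), len(features)
conv_params = len(kernel)                  # 3, independent of the input length
dense_params = n_in * n_out                # 35: one weight per input-output pair
```

Doubling the input length would leave `conv_params` unchanged, while `dense_params` would grow roughly fourfold.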
Consider fitting a polynomial to noisy data. Without regularization, a high-degree polynomial overfits wildly, bending to pass through every noisy point. Adding a regularizer, such as an L2 penalty on the coefficients, tames the model.
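The polynomial experiment can be approximated offline with a short sketch. Everything specific here is an assumption made for illustration: the true function (x²), the noise level, the degree, and the penalty strength `alpha`. The fit uses the closed-form L2-regularized least-squares solution and, for simplicity, penalizes the constant term too.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]        # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_poly(xs, ys, degree, alpha):
    """Least squares with an L2 penalty: solve (X^T X + alpha I) w = X^T y."""
    feats = [[x ** d for d in range(degree + 1)] for x in xs]
    n = degree + 1
    xtx = [[sum(f[i] * f[j] for f in feats) + (alpha if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    xty = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(n)]
    return solve(xtx, xty)


def norm(w):
    """Euclidean length of the coefficient vector."""
    return sum(wi * wi for wi in w) ** 0.5


rng = random.Random(0)
xs = [-1.0 + 2.0 * i / 11 for i in range(12)]
ys = [x ** 2 + rng.gauss(0.0, 0.1) for x in xs]         # assumed true function: x^2

w_plain = fit_poly(xs, ys, degree=9, alpha=0.0)         # unregularized fit
w_ridge = fit_poly(xs, ys, degree=9, alpha=0.1)         # L2-regularized fit
print(norm(w_plain), norm(w_ridge))
```

The regularized coefficient vector is shorter than the unregularized one, which is the weight-shrinking effect of the L2 penalty showing up in a model small enough to inspect by hand.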
Regularization is not just one technique — it is a design philosophy. Every modification that improves generalization counts. Here is where each technique reappears:
| Technique | Where It Appears |
|---|---|
| Weight decay (L2) | Standard in virtually all training recipes (Ch 8). AdamW applies the decay separately from the adaptive gradient update. |
| Data augmentation | Essential for vision (Ch 9, 12). Cutout, Mixup, CutMix, RandAugment. |
| Early stopping | Nearly universal. Most training runs monitor validation loss to decide when to stop or which checkpoint to keep. |
| Dropout | Fully-connected layers in CNNs. Less common in modern architectures with BatchNorm. |
| Batch normalization | CNNs (Ch 9), optimization (Ch 8). LayerNorm replaces it in Transformers. |
| Ensembles | Competition winners, production systems. Snapshot ensembles, stochastic weight averaging. |
| Parameter sharing | CNNs share filters (Ch 9). RNNs share across time (Ch 10). Transformers apply the same attention and feed-forward weights at every sequence position. |
Up next: Chapter 8: Optimization — the algorithms that actually minimize the (regularized) loss function.