Goodfellow et al., Chapter 7

Regularization

Neural networks are powerful enough to memorize training data. Regularization prevents that and forces them to learn general patterns instead.

Prerequisites: Chapter 6 (feedforward networks, backprop).
9 chapters, 3+ simulations, 9 quizzes

Chapter 0: Why Regularize?

Give a neural network enough parameters and it can memorize the training data perfectly: training loss near zero, test loss far worse. This is overfitting: the network has learned the training set's noise instead of the underlying pattern.

The core challenge in deep learning is generalization: performing well on data the model has never seen. Regularization is any modification to the training procedure intended to reduce test error (and with it the gap between training and test error), even if it slightly increases training error.

The bias-variance tradeoff: A model with too few parameters underfits (high bias). A model with too many overfits (high variance). Regularization lets us use large, expressive models but constrains them to behave simply, getting the best of both worlds.

Parameter Penalties: L1, L2; keep weights small
Data Augmentation: Create more training data by transforming existing samples
Early Stopping: Stop training before the model memorizes noise
Dropout & Ensembles: Train many sub-networks, average their predictions

What is overfitting?

Chapter 1: Norm Penalties

The simplest regularizer: add a penalty on the size of the weights to the loss function. The modified objective becomes J̃(θ) = J(θ) + α R(θ), where α controls the strength of regularization.

L2 regularization (weight decay) uses R(θ) = ½ ||w||² = ½ ∑ wᵢ². The penalty's gradient is αw, so each weight is pulled toward zero in proportion to its size. With learning rate ε, the update rule becomes w ← (1 − εα)w − ε∇J. At each step the weights first shrink by the constant factor (1 − εα), hence "weight decay."

J̃(θ) = J(θ) + ½ α ||w||²
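
The update rule above can be sketched in a few lines of plain Python. The function names and the toy numbers are illustrative assumptions, not part of the original chapter; the point is only that "shrink, then step" and "add αw to the gradient" are the same update.

```python
def sgd_step_decay_form(w, grad, lr, alpha):
    """One step written as w <- (1 - lr*alpha) * w - lr * grad."""
    return [(1.0 - lr * alpha) * wi - lr * gi for wi, gi in zip(w, grad)]


def sgd_step_penalty_form(w, grad, lr, alpha):
    """Same step, written by adding the penalty gradient alpha*w to grad."""
    return [wi - lr * (gi + alpha * wi) for wi, gi in zip(w, grad)]


w = [2.0, -1.0, 0.5]
grad = [0.3, -0.1, 0.0]          # hypothetical gradient of the unregularized loss J

step_a = sgd_step_decay_form(w, grad, lr=0.1, alpha=0.5)
step_b = sgd_step_penalty_form(w, grad, lr=0.1, alpha=0.5)

# With a zero loss gradient, the weights simply shrink by (1 - lr*alpha) = 0.95.
decayed = sgd_step_decay_form(w, [0.0, 0.0, 0.0], lr=0.1, alpha=0.5)
```

The equivalence holds for plain SGD; with adaptive optimizers the two forms generally differ.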

L1 regularization uses R(θ) = ||w||₁ = ∑ |wᵢ|. Its gradient, α · sign(wᵢ), has the same magnitude however small the weight is, so unlike L2 it can drive weights exactly to zero, producing sparse models. This is useful for feature selection: unimportant weights become exactly zero.
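
To see the sparsity effect concretely, here is a minimal sketch using soft-thresholding, the proximal step commonly paired with an L1 penalty. This is a swapped-in illustration rather than anything the chapter specifies: plain subgradient steps tend to hover around zero instead of landing on it, whereas the thresholding step sets small weights to exactly zero. The function names and values are assumptions.

```python
def soft_threshold(w, threshold):
    """L1 proximal step: move every weight toward zero by `threshold`,
    and set it to exactly zero if it would cross zero."""
    out = []
    for wi in w:
        if wi > threshold:
            out.append(wi - threshold)
        elif wi < -threshold:
            out.append(wi + threshold)
        else:
            out.append(0.0)
    return out


def l2_shrink(w, factor):
    """L2-style shrinkage: multiply every weight by the same factor."""
    return [factor * wi for wi in w]


w = [3.0, -0.2, 0.05, -1.5]
l1_result = soft_threshold(w, 0.25)   # the two small weights become exactly 0.0
l2_result = l2_shrink(w, 0.75)        # every weight shrinks, none reaches zero
```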

Bayesian interpretation: L2 regularization is equivalent to MAP estimation with a zero-mean Gaussian prior on the weights, P(w) = N(0, (1/α) I). L1 regularization corresponds to a Laplace prior. In the Gaussian case the regularization strength α is the inverse of the prior variance: tighter prior = more regularization.
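
The correspondence can be written out in two lines. This sketch assumes that J is the negative log-likelihood of the data, which the text above does not state explicitly, and it parameterizes the Laplace prior so that its scale matches α.

```latex
\begin{aligned}
\text{Gaussian prior:}\quad
  P(w) \propto \exp\!\left(-\tfrac{\alpha}{2}\,\|w\|_2^2\right)
  &\;\Longrightarrow\;
  -\log P(w) = \tfrac{\alpha}{2}\,\|w\|_2^2 + \text{const} \\
\text{Laplace prior:}\quad
  P(w) \propto \exp\!\left(-\alpha\,\|w\|_1\right)
  &\;\Longrightarrow\;
  -\log P(w) = \alpha\,\|w\|_1 + \text{const} \\
w_{\text{MAP}} = \arg\max_w \, P(w \mid \text{data})
  &= \arg\min_w \,\bigl[\, J(w) - \log P(w) \,\bigr]
\end{aligned}
```

Dropping the constants, minimizing J(w) − log P(w) is the regularized objective J̃ with the matching penalty.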
[Interactive demo: L1 vs L2 Regularization. A slider sets the regularization strength α (shown at 0.5). L2 shrinks all weights proportionally; L1 drives some weights to exactly zero.]

What is the key difference between L1 and L2 regularization?

Chapter 2: Data Augmentation

The best way to reduce overfitting is more data. When you cannot collect more, you can generate more by applying label-preserving transformations to existing data. This is data augmentation.

For images: random crops, horizontal flips, rotations, color jittering, cutout (randomly erasing rectangular patches). A horizontally flipped cat is still a cat. A slightly rotated digit 7 is still a 7. These transformations teach the network invariances it should already have.
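
As a rough sketch of how such transformations are applied, the snippet below flips and crops a toy image stored as a list of rows. The helper names, the 4×4 image, and the 3×3 crop size are all illustrative assumptions; real pipelines typically use an image library instead.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2D image (a list of rows)."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, label, rng):
    """Return a randomly transformed copy of the image; the label is unchanged."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, 3, 3, rng), label


rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 "image"
aug_image, aug_label = augment(image, "cat", rng)
```

Each call to `augment` yields a different view of the same example, so the network rarely sees exactly the same input twice.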

For other domains: In NLP, augmentation includes synonym replacement, random insertion/deletion, and back-translation. In speech, speed perturbation and noise injection. For tabular data, adding Gaussian noise to features. The principle is the same: create new training examples that preserve the label.

Noise injection is another form of augmentation. Adding small Gaussian noise to inputs, hidden units, or even weights forces the network to be robust to perturbations. Softening the output labels (label smoothing) prevents the model from becoming overconfident: with 0.1 of the probability mass spread uniformly over all three classes, the target [0, 0, 1] becomes [0.033, 0.033, 0.933].
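
The label-smoothing numbers can be reproduced with a short sketch. It assumes the variant that mixes the one-hot target with a uniform distribution over all k classes; another common variant spreads the mass over the k − 1 wrong classes only and gives slightly different values. The function name and the choice eps = 0.1 are assumptions.

```python
def smooth_labels(one_hot, eps):
    """Mix a one-hot target with a uniform distribution over all classes."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]


smoothed = smooth_labels([0.0, 0.0, 1.0], eps=0.1)
rounded = [round(v, 3) for v in smoothed]
```

The smoothed target still sums to 1, so it remains a valid probability distribution for a cross-entropy loss.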

Key insight: Data augmentation encodes domain knowledge about which transformations should not affect the answer. A randomly flipped landscape photo should still be classified as "landscape." This is a form of regularization because it adds constraints based on the structure of the problem.
Why does data augmentation reduce overfitting?

Chapter 3: Early Stopping

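During training, track the loss on a held-out validation set. Initially both training and validation loss decrease. At some point, training loss keeps decreasing but validation loss starts rising: the model is beginning to memorize. Early stopping means keeping the parameters from the point of lowest validation loss. Because that minimum is only visible in hindsight, the usual procedure is to save a checkpoint whenever validation loss improves and stop once it has failed to improve for a fixed number of evaluations (the "patience").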
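
A minimal sketch of the bookkeeping involved, assuming a patience-based rule. The validation curve is made up for illustration, and a real loop would save model weights where the comment indicates.

```python
def early_stopping(val_losses, patience):
    """Scan validation losses in order; return (best_step, best_loss, stop_step)."""
    best_loss = float("inf")
    best_step = -1
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: a real training loop would checkpoint the weights here.
            best_loss, best_step = loss, step
        elif step - best_step >= patience:
            # No improvement for `patience` evaluations in a row: stop.
            return best_step, best_loss, step
    return best_step, best_loss, len(val_losses) - 1


# Hypothetical validation curve: falls, bottoms out at step 3, then rises.
val_curve = [1.00, 0.80, 0.65, 0.60, 0.62, 0.66, 0.71, 0.80]
best_step, best_loss, stopped_at = early_stopping(val_curve, patience=3)
```

Training stops at step 6, but the parameters that are kept are those from step 3.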

[Interactive demo: Early Stopping Visualization. Training and validation loss diverge as training proceeds; a vertical line marks the optimal stopping point. Slider: overfitting strength (shown at 1.5).]

Early stopping is arguably the most widely used regularizer in deep learning. It is essentially free — you need a validation set anyway for model selection, and stopping early actually saves compute.

Connection to L2: Goodfellow et al. show that for a linear model with a quadratic loss trained by gradient descent, early stopping is approximately equivalent to L2 regularization. The product of the learning rate ε and the number of training steps τ plays the role of the inverse regularization strength, roughly α ≈ 1/(τε): fewer steps = more regularization. This provides a principled way to think about when to stop.
When should you stop training according to early stopping?

Chapter 4: Dropout

During each training step, dropout randomly sets each hidden unit to zero with probability p (commonly p = 0.5). Each training step uses a different random "thin" network. At test time, all units are active but their outputs are scaled by (1 − p) to compensate.

h_train = m ⊙ h,   where mᵢ ~ Bernoulli(1 − p)
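
The masking rule and the test-time scaling can be sketched as follows. This follows the formulation in the text (mask during training, scale by 1 − p at test time); many libraries instead use "inverted" dropout, which divides by 1 − p during training and leaves the test-time pass unscaled. The function names and the all-ones activation vector are illustrative assumptions.

```python
import random


def dropout_train(h, p, rng):
    """Training: zero each unit independently with probability p."""
    mask = [1.0 if rng.random() >= p else 0.0 for _ in h]   # keep with prob 1 - p
    return [m * hi for m, hi in zip(mask, h)]


def dropout_test(h, p):
    """Test: keep every unit and scale by the keep probability (1 - p)."""
    return [(1.0 - p) * hi for hi in h]


rng = random.Random(0)
h = [1.0] * 10000                       # toy activations, all equal to 1
train_mean = sum(dropout_train(h, 0.5, rng)) / len(h)
test_mean = sum(dropout_test(h, 0.5)) / len(h)
```

The two means come out close to each other, which is the purpose of the scaling: the expected activation seen by the next layer is the same in training and at test time.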

Why does this work? Each hidden unit cannot rely on any particular other unit being present, so it must learn features that are independently useful. Dropout prevents co-adaptation — neurons developing complex, fragile dependencies.

Ensemble interpretation: Dropout trains an exponential number of sub-networks that share weights. A network with n hidden units has 2ⁿ possible dropout masks. At test time, using all units (with scaling) approximates averaging the predictions of all these sub-networks. This is a form of bagging with extreme parameter sharing.
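
For a single linear output unit the ensemble average can be checked by brute force, since the number of masks is small. The sketch below enumerates every mask for three hidden units and compares the probability-weighted average with the scaled full network. The weights and activations are made-up values, and the exact match is special to the linear case; for a nonlinear network the scaling rule is only an approximation.

```python
from itertools import product


def masked_output(weights, h, mask):
    """Linear output unit applied to the masked hidden activations."""
    return sum(w * m * hi for w, m, hi in zip(weights, mask, h))


def ensemble_average(weights, h, p):
    """Probability-weighted mean output over all 2^n dropout masks."""
    total = 0.0
    n_masks = 0
    for mask in product([0, 1], repeat=len(h)):
        kept = sum(mask)
        prob = (1 - p) ** kept * p ** (len(h) - kept)
        total += prob * masked_output(weights, h, mask)
        n_masks += 1
    return total, n_masks


weights = [0.5, -1.0, 2.0]      # hypothetical output weights
h = [1.0, 2.0, 3.0]             # hypothetical hidden activations
p = 0.5

avg, n_masks = ensemble_average(weights, h, p)
scaled = (1 - p) * masked_output(weights, h, [1, 1, 1])
```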
[Interactive demo: Dropout Effect. Each click shows a new random dropout mask; grayed-out neurons are "dropped" for that training step. Slider: drop rate p (shown at 0.50).]

Why does dropout prevent co-adaptation of neurons?

Chapter 5: Batch Normalization

Batch normalization (BatchNorm) normalizes each layer's activations to have zero mean and unit variance across the mini-batch, then applies a learned affine transformation:

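y = γ · (h − μ_B) / √(σ_B² + ε) + β

Here μ_B and σ_B² are the batch mean and variance, ε is a small constant that keeps the denominator away from zero (unrelated to the step size ε in the weight-decay update of Chapter 1), and γ, β are learned parameters that allow the network to undo the normalization if needed.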
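
The formula can be traced for a single feature over one mini-batch with a short sketch. The function name, the default ε = 1e-5, and the toy activations are assumptions; the variance is the biased (divide-by-n) batch estimate.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((h - mu) ** 2 for h in batch) / n          # biased batch variance
    return [gamma * (h - mu) / math.sqrt(var + eps) + beta for h in batch]


activations = [2.0, 4.0, 6.0, 8.0]                       # toy mini-batch
normalized = batch_norm(activations, gamma=1.0, beta=0.0)
shifted = batch_norm(activations, gamma=2.0, beta=1.0)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
shifted_mean = sum(shifted) / len(shifted)
```

With γ = 1 and β = 0 the output has mean 0 and variance just under 1 (because of ε); other values of γ and β rescale and shift it, which is how the network can undo the normalization.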

BatchNorm was originally motivated as addressing internal covariate shift — the idea that layer inputs change during training as previous layers update. While this motivation is debated, the empirical benefits are clear: faster training, higher learning rates, and reduced sensitivity to initialization.

Regularization effect: Because the normalization depends on the mini-batch statistics, each training example sees slightly different normalization (depending on which other examples are in the batch). This noise acts as a regularizer, similar to dropout. At test time, running averages of μ and σ² are used instead of batch statistics.
Why does batch normalization have a regularization effect?

Chapter 6: Multi-Task Learning & Ensembles

Multi-task learning trains one network on multiple related tasks simultaneously. The shared layers learn general features useful across tasks, while task-specific heads specialize. Sharing forces the network to learn more robust representations than it would from a single task.

Model ensembles train several models independently and average their predictions. If the models make different errors, the average reduces the total error. Ensembles are among the most reliable ways to improve performance, at the cost of N× compute for N models.

Why ensembles work: Consider k models making a binary decision, each with error rate ε < ½ and making mistakes independently. Under a majority vote, the ensemble is wrong only when more than k/2 models are wrong at once. The number of wrong models follows a binomial distribution, and this tail probability shrinks exponentially with k. In practice, models' errors are correlated, so gains are smaller but still reliable.
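
The majority-vote argument can be checked numerically under its idealized assumptions (a binary decision, independent errors, odd k). The error rate 0.3 and the ensemble sizes below are arbitrary illustrative choices.

```python
from math import comb


def majority_vote_error(k, eps):
    """P(more than half of k independent voters are wrong), for odd k."""
    return sum(
        comb(k, j) * eps ** j * (1 - eps) ** (k - j)
        for j in range(k // 2 + 1, k + 1)
    )


# Ensemble error for 1, 5, and 21 models that are each wrong 30% of the time.
errors = {k: majority_vote_error(k, 0.3) for k in (1, 5, 21)}
```

The error falls as k grows only because ε is below ½; with ε above ½ the same formula shows the vote getting worse as models are added.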

Parameter sharing is a form of regularization built into the architecture. CNNs share convolution filter weights across spatial positions. RNNs share weights across time steps. This builds in a structural assumption (the same feature detector applies at every position for CNNs, which makes convolution translation-equivariant; the same transition applies at every time step for RNNs) and dramatically reduces the number of independent parameters.
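
A back-of-the-envelope count shows the size of the reduction. The sketch compares a fully connected layer with a single shared convolution filter on a single-channel image; the 32×32 image and 3×3 filter are assumed sizes, and biases are ignored.

```python
def dense_params(in_h, in_w, out_h, out_w):
    """Weights in a fully connected layer mapping one image to another."""
    return (in_h * in_w) * (out_h * out_w)


def conv_params(kernel_h, kernel_w):
    """Weights in one single-channel filter, reused at every spatial position."""
    return kernel_h * kernel_w


dense = dense_params(32, 32, 32, 32)   # every input pixel connects to every output pixel
conv = conv_params(3, 3)               # one small filter slides over the whole image
ratio = dense // conv
```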

Why do ensembles of models perform better than individual models?

Chapter 7: Regularization Playground

Fit a polynomial to noisy data. Without regularization, a high-degree polynomial overfits wildly. Toggle different regularizers to see how they tame the model.

[Interactive demo: Polynomial Fitting with Regularization. Sliders: degree (shown at 10), L2 strength (0.0), noise level (0.30). Increase the polynomial degree to see overfitting; toggle regularizers to fix it. The true function is shown as a dashed line.]

Experiments:
(1) Set degree=15, L2=0. The curve oscillates wildly between data points: classic overfitting.
(2) Increase L2 and watch the oscillations smooth out.
(3) Set degree=2, L2=0. The model underfits (too simple).
The sweet spot is enough complexity plus enough regularization.
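
For readers without the interactive demo, the first two experiments can be approximated offline. This is a sketch under stated assumptions, not the playground's own code: the true function sin(3x), the 12 sample points, degree 9, and the penalty value are all made up, the penalty is applied as l2 · ||w||² to every coefficient including the constant term, and the linear solver is a plain Gaussian elimination written out to avoid dependencies.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]          # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_poly(xs, ys, degree, l2):
    """Minimize squared error + l2 * ||w||^2 via the normal equations."""
    feats = [[x ** d for d in range(degree + 1)] for x in xs]
    n = degree + 1
    gram = [[sum(f[i] * f[j] for f in feats) + (l2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    rhs = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(n)]
    return solve(gram, rhs)


def norm(w):
    """Euclidean norm of the coefficient vector."""
    return math.sqrt(sum(wi * wi for wi in w))


rng = random.Random(0)
xs = [-1 + 2 * i / 11 for i in range(12)]                      # 12 points in [-1, 1]
ys = [math.sin(3 * x) + rng.gauss(0, 0.3) for x in xs]         # assumed true function + noise

w_plain = fit_poly(xs, ys, degree=9, l2=0.0)    # unregularized fit
w_ridge = fit_poly(xs, ys, degree=9, l2=0.1)    # same model with an L2 penalty
```

Comparing `norm(w_plain)` with `norm(w_ridge)` shows the mechanism behind the smoother curve: the penalty keeps the high-order coefficients from growing large enough to chase the noise.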
When a degree-15 polynomial overfits noisy data, what does L2 regularization do to fix it?

Chapter 8: Connections

Regularization is not just one technique — it is a design philosophy. Every modification that improves generalization counts. Here is where each technique reappears:

Technique: Where it appears
Weight decay (L2): Standard in virtually all training recipes (Ch 8). AdamW decouples the decay from the gradient-based update.
Data augmentation: Essential for vision (Ch 9, 12). Variants include Cutout, Mixup, CutMix, and RandAugment.
Early stopping: Very widely used. Most training runs monitor validation loss and keep the best checkpoint.
Dropout: Fully connected layers in CNNs. Less common in modern architectures that use BatchNorm.
Batch normalization: CNNs (Ch 9), optimization (Ch 8). LayerNorm replaces it in Transformers.
Ensembles: Competition winners, production systems. Related methods: snapshot ensembles, stochastic weight averaging.
Parameter sharing: CNNs share filters across positions (Ch 9). RNNs share weights across time (Ch 10). Transformers apply the same weights at every sequence position.

What you should take away: Regularization works best as a combination of techniques. A common recipe: BatchNorm + weight decay + data augmentation + early stopping. No single technique dominates. The right combination depends on the dataset, model size, and compute budget.

Up next: Chapter 8: Optimization — the algorithms that actually minimize the (regularized) loss function.

Why is a combination of regularization techniques more effective than any single one?