Deep Learning Foundations

Setting Up the Data & the Loss

Architecture is ready — now make it actually trainable.

Prerequisites: Neural Networks Part 1 + Basic calculus. That's it.
10 Chapters · 6+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Setup Matters

In Part 1, we built a neural network: layers of neurons, activation functions, forward passes. The architecture is ready. But if you just throw raw data at it with random weights and hit "train," you'll likely get garbage.

Training a neural network is like tuning a complex instrument. The architecture is the instrument itself — but you also need to tune the strings (weight initialization), prepare the sheet music (data preprocessing), keep the sound balanced (batch normalization), prevent feedback loops (regularization), and choose how to score the performance (loss function).

The setup problem: Two identical architectures can produce wildly different results depending on how you preprocess data, initialize weights, regularize, and choose your loss function. Getting the setup right is often the difference between a model that trains and one that doesn't.
Same Network, Different Setup

Both networks have the same architecture (2 hidden layers, 16 neurons each). Left: poor setup (no preprocessing, zero init). Right: good setup (normalized data, He init). Watch the loss curves diverge.

This lesson covers every decision you need to make before calling your training loop: how to prepare the data, how to set initial weights, how to keep activations healthy, how to prevent overfitting, and how to measure error. These aren't minor details — they're the foundation of successful training.

Why can two networks with identical architectures produce very different results?

Chapter 1: Data Preprocessing

Before a single gradient is computed, your data needs to be cleaned up. Raw features come in all shapes: pixel intensities from 0 to 255, prices in the thousands, ages from 0 to 100. If one feature is 100x larger than another, the loss landscape becomes a long, narrow valley that gradient descent struggles to navigate.

Mean subtraction is the simplest fix. Compute the mean of each feature across your training set, then subtract it from every data point. This centers the data cloud around the origin. For images, you might subtract a mean image (the per-pixel mean across the training set) or a single global mean.

x' = x − μ_train

Normalization goes further: after centering, divide by the standard deviation. Now each feature has zero mean and unit variance. The loss landscape becomes more circular, and gradient descent converges faster.

x' = (x − μ) / σ
Critical rule: Compute μ and σ on the training set only. Then apply those same values to validation and test data. If you compute statistics on the test set, you're leaking information from the future.
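
Here is a minimal NumPy sketch of the rule above. The data is made up (three features on pixel-like, price-like, and age-like scales), and the names X_train, X_test, mu, and sigma are illustrative rather than part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three features on very different scales
scales = np.array([255.0, 5000.0, 100.0])
X_train = rng.random((200, 3)) * scales
X_test = rng.random((50, 3)) * scales

# Statistics come from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-8   # small constant guards against zero variance

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma  # reuse the training statistics, never recompute

print(X_train_norm.mean(axis=0))     # approximately 0 for every feature
print(X_train_norm.std(axis=0))      # approximately 1 for every feature
```

Note that the test set will not come out with exactly zero mean and unit variance, and that is expected: it is being measured with the training set's ruler.
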
Effect of Preprocessing on Loss Landscape

Left: raw data creates an elongated loss contour. Right: normalized data creates circular contours. The gradient points directly toward the minimum.

PCA whitening is a more aggressive transform. First, PCA rotates the data so the axes align with the directions of greatest variance. Then whitening divides each axis by the square root of its eigenvalue (the standard deviation along that axis), so all directions have equal variance. The result: a spherical data cloud with identity covariance. In practice, standard normalization is almost always sufficient; PCA whitening is rarely needed for deep networks because the first layer can learn its own rotation.
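
For the curious, the rotate-then-rescale recipe can be sketched in a few lines of NumPy. The correlated 2-D data and the small constant added before the square root are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D data
A = np.array([[3.0, 0.0],
              [1.5, 0.5]])
X = rng.standard_normal((1000, 2)) @ A.T

X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / X_centered.shape[0]

# eigh is meant for symmetric matrices such as a covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

X_rot = X_centered @ eigvecs                 # PCA: rotate onto the principal axes
X_white = X_rot / np.sqrt(eigvals + 1e-5)    # whiten: divide by the std along each axis

cov_white = X_white.T @ X_white / X_white.shape[0]
print(np.round(cov_white, 3))                # close to the identity matrix
```

The division uses the square root of each eigenvalue because an eigenvalue of the covariance matrix is a variance, and rescaling to unit variance means dividing by a standard deviation.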

Why must you compute mean and standard deviation only from the training set?

Chapter 2: Weight Initialization

You've preprocessed your data. Now, what values do the weights start at? This matters enormously. The wrong initialization can kill training before it begins.

All zeros: If every weight is zero, every neuron computes the same output. They all receive the same gradient. They all update identically. They stay identical forever. This is called the symmetry problem — the network has many neurons but they all do the same thing, as if you had just one.

Symmetry breaking: We must initialize with different random values so that each neuron starts computing something unique. Random initialization breaks the symmetry and lets the network learn diverse features.
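
The symmetry problem is easy to see numerically. The sketch below uses a toy one-hidden-layer tanh network with made-up data; the helper hidden_weight_grad is hypothetical and written only for this demonstration. It uses a constant value of 0.5 rather than exact zeros so the gradients are nonzero and the "identical columns" effect is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))   # 8 examples, 3 features (made-up data)
y = rng.standard_normal((8, 1))

def hidden_weight_grad(W1, W2):
    """Gradient of 0.5 * mean squared error w.r.t. W1 for a tanh hidden layer."""
    h = np.tanh(X @ W1)                 # (8, H) hidden activations
    out = h @ W2                        # (8, 1) predictions
    d_out = (out - y) / len(X)
    d_h = d_out @ W2.T * (1 - h ** 2)   # backprop through tanh
    return X.T @ d_h                    # (3, H): one column per hidden neuron

H = 4
g_const = hidden_weight_grad(np.full((3, H), 0.5), np.full((H, 1), 0.5))
g_rand = hidden_weight_grad(0.5 * rng.standard_normal((3, H)),
                            0.5 * rng.standard_normal((H, 1)))

same_const = np.allclose(g_const, g_const[:, :1])   # every neuron gets the same gradient
same_rand = np.allclose(g_rand, g_rand[:, :1])      # random init: gradients differ
print(same_const, same_rand)
```

With identical starting weights, every column of the gradient is the same, so every neuron takes the same step and the columns never separate.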

Small random: Initialize weights from a narrow Gaussian, say with standard deviation 0.01. This works for shallow networks, but in deep ones the activations shrink layer by layer; by layer 10, they're effectively zero. Gradients vanish. Training stalls.

Too large: Initialize with standard deviation 1.0. Now activations explode: they saturate sigmoid/tanh or blow up with ReLU. Gradients explode or vanish. Also unusable.

Xavier initialization (Glorot & Bengio, 2010): Set the variance to 1/n_in, where n_in is the number of inputs to the layer. (The original paper averages fan-in and fan-out, giving 2/(n_in + n_out); 1/n_in is the common simplified form.) This keeps the variance of activations roughly constant across layers:

w ~ N(0, 1 / n_in)

He initialization (He et al., 2015): For ReLU networks, half the activations are zeroed out, so we need to compensate by doubling the variance:

w ~ N(0, 2 / n_in)
The golden rule: Xavier for tanh/sigmoid, He for ReLU. Both aim for the same goal: keep activation variance stable across layers so gradients neither vanish nor explode.
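
To make the golden rule concrete, here is a rough sketch that pushes random data through 10 fully connected layers and reports the spread of the final activations. The helper final_activation_std, the layer width of 256, and the depth are all assumptions chosen for a quick experiment.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def final_activation_std(weight_std_fn, activation, n_layers=10, width=256, seed=0):
    """Std of the last layer's activations for a given init rule (illustrative helper)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((500, width))
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        h = activation(h @ W)
    return h.std()

small = final_activation_std(lambda n: 0.01, np.tanh)                # fixed std of 0.01
xavier = final_activation_std(lambda n: np.sqrt(1.0 / n), np.tanh)   # variance 1 / n_in
he = final_activation_std(lambda n: np.sqrt(2.0 / n), relu)          # variance 2 / n_in

print(f"small random + tanh: {small:.2e}")
print(f"Xavier + tanh:       {xavier:.3f}")
print(f"He + ReLU:           {he:.3f}")
```

The small-random run should print a number many orders of magnitude below the other two, which is the collapse described above.
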
Activation Variance Across Layers

Watch how activation variance changes layer by layer with different initializations. Ideal: the bars stay roughly the same height.

Why does initializing all weights to zero prevent learning?

Chapter 3: Batch Normalization

Even with good initialization, activations can drift during training. As weights update, the distribution of inputs to each layer shifts — a problem called internal covariate shift. Later layers must constantly adapt to a moving target.

Batch normalization (Ioffe & Szegedy, 2015) fixes this by normalizing activations within each mini-batch. For each feature in a layer, compute the batch mean and variance, then normalize:

x̂ = (x − μ_B) / √(σ_B² + ε)

But wait — if we always force activations to zero mean and unit variance, we're limiting what the network can represent. Maybe the optimal activation distribution isn't standard normal. So batch norm adds two learnable parameters: a scale γ and a shift β:

y = γ · x̂ + β

The network can learn to undo the normalization entirely (by setting γ = √(σ_B² + ε) and β = μ_B), or it can keep the normalized version, or anything in between. The key insight: the network chooses its activation distribution rather than having it imposed by upstream weight changes.

At test time: You often don't have a batch to compute statistics from (you may be predicting a single example). Instead, use running averages of μ_B and σ_B² accumulated during training. This is why you see model.eval() in PyTorch: it switches batch norm from batch statistics to running statistics.
Input batch (activations from previous layer) → Normalize: x̂ = (x − μ_B) / √(σ_B² + ε) → Scale & shift: y = γ · x̂ + β (learnable) → to activation function
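
The train/test split in behavior can be sketched as a small forward-only class. This is an illustration under assumptions, not PyTorch's implementation: the class name BatchNorm1D, the momentum of 0.1, and the exponential-moving-average update are choices made here, and there is no backward pass.

```python
import numpy as np

class BatchNorm1D:
    """Forward-only batch norm sketch; names and defaults are illustrative."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)    # learnable scale (fixed here)
        self.beta = np.zeros(num_features)    # learnable shift (fixed here)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.eps, self.momentum = eps, momentum
        self.training = True

    def __call__(self, x):
        if self.training:
            mean, var = x.mean(axis=0), x.var(axis=0)   # batch statistics
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var  # running statistics
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(4)
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
    out_train = bn(batch)

bn.training = False                       # the role model.eval() plays in PyTorch
out_eval = bn(np.full((1, 4), 5.0))       # a single example works at test time
print(bn.running_mean, out_eval)
```

After enough batches the running mean settles near the true mean of 5, so a single test input of 5 maps to roughly zero even though no batch is available.
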
Batch Normalization Effect

Without batch norm, activations drift as training progresses (left). With batch norm, they stay centered and stable (right). Drag the epoch slider to see how distributions evolve.

| Benefit | Why |
| --- | --- |
| Faster training | Allows higher learning rates without divergence |
| Reduces sensitivity to init | Normalization dampens the effect of bad starting weights |
| Slight regularization | Batch statistics add noise, acting like mini-dropout |

Why does batch norm include learnable γ and β parameters?

Chapter 4: Regularization

A neural network with enough parameters can memorize the entire training set — every image, every label, every noise artifact. It achieves near-zero training loss but fails catastrophically on new data. This is overfitting: the model has learned the training data's noise instead of its underlying patterns.

Regularization is any technique that fights overfitting by constraining the model's complexity. The most common approach: add a penalty to the loss function that discourages large weights.

L2 regularization (weight decay) adds the sum of squared weights to the loss:

L_total = L_data + λ ∑_i w_i²

The gradient of the penalty is 2λw — it pushes every weight toward zero proportionally to its magnitude. Large weights get penalized heavily. The result: the network prefers many small weights over a few large ones, which produces smoother, more generalizable functions.
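
A quick sanity check of that gradient, as a sketch with made-up weights. The function names l2_penalty and l2_penalty_grad are illustrative, and the step at the end updates on the penalty alone (no data loss) to isolate its effect.

```python
import numpy as np

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)

def l2_penalty_grad(w, lam):
    return 2.0 * lam * w

w = np.array([3.0, -0.5, 0.1])
lam, lr = 0.1, 0.5

# Numerical check of the analytic gradient with central differences
h = 1e-6
numeric = np.array([
    (l2_penalty(w + h * e, lam) - l2_penalty(w - h * e, lam)) / (2 * h)
    for e in np.eye(len(w))
])

# One step on the penalty alone multiplies every weight by (1 - 2 * lr * lam) = 0.9
w_new = w - lr * l2_penalty_grad(w, lam)
print(numeric, w_new)
```

Every weight shrinks by the same factor, which is where the name "weight decay" comes from: the largest weight loses the most in absolute terms.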

Why smaller weights help: A function with small weights changes slowly — it can't swing wildly between data points. It's forced to find the smooth trend rather than memorize every bump. Think of it as Occam's razor: among all functions that fit the data, prefer the simplest.

L1 regularization adds the sum of absolute values: λ ∑_i |w_i|. Unlike L2, L1 pushes weights all the way to exactly zero, producing sparse networks where many weights are inactive. This is useful for feature selection but less common in deep learning.

Max norm constraints clip the weight vector of each neuron if its norm exceeds a threshold c: if ||w|| > c, rescale w → c · w / ||w||. This bounds the maximum capacity directly.
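
The rescaling rule translates almost directly into NumPy. This sketch assumes each column of W holds one neuron's incoming weights; the function name apply_max_norm is hypothetical.

```python
import numpy as np

def apply_max_norm(W, c):
    """Rescale each column of W so its norm is at most c (columns = neurons, by assumption)."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # 1.0 leaves small columns untouched
    return W * scale

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])            # column norms: 5.0 and 0.5
W_clipped = apply_max_norm(W, c=2.0)
print(np.linalg.norm(W_clipped, axis=0))
```

In a training loop this would typically run right after each weight update, so the constraint is enforced as a projection rather than as a penalty term in the loss.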

Regularization Strength

A network fits noisy data. Increase λ to see L2 regularization smoothing the fit. Too little: overfitting. Too much: underfitting.

| Method | Penalty | Effect on Weights | Use Case |
| --- | --- | --- | --- |
| L2 | λ ∑ w² | Shrinks toward zero | Default for deep learning |
| L1 | λ ∑ \|w\| | Drives to exactly zero | Feature selection, sparsity |
| Max norm | Clip if ‖w‖ > c | Bounds magnitude | Stability with high LR |

What happens if you set λ too high in L2 regularization?

Chapter 5: Dropout

Dropout (Srivastava et al., 2014) is a beautifully simple idea: during each training step, randomly "turn off" each neuron with probability p (typically 0.5). Set its output to zero. Gone. The remaining neurons must learn to be useful on their own, without relying on any specific partner.

The intuition: Imagine a team project where any member might be absent on any given day. Everyone must be capable of contributing independently — no one can free-ride on a single star player. That's dropout. It prevents co-adaptation: neurons learning to depend on very specific other neurons.

During training, we create a random binary mask for each layer and multiply it element-wise with the activations. But this introduces a problem at test time: with all neurons active (no dropout), the expected activation is 1/(1−p) times larger than during training, where only a fraction (1−p) of neurons survive each step.

Inverted dropout is the standard fix: during training, divide the surviving activations by (1−p) to compensate. Now the expected value stays the same whether dropout is on or off, and at test time you simply use all neurons without modification.

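python
import numpy as np

p = 0.5                                # drop probability
h = np.random.randn(4, 8)              # example activations (batch of 4, 8 units)

# Inverted dropout during training
mask = np.random.rand(*h.shape) > p    # True = keep, with probability 1 - p
h_train = h * mask / (1 - p)           # zero the dropped units, scale up survivors

# At test time: just use h as-is. No mask, no scaling needed.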
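
To see why the rescaling matters, this small check compares plain and inverted dropout on a vector of ones. The vector size and the fixed seed are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                              # drop probability
h = np.ones(100_000)                 # pretend every activation equals 1

mask = rng.random(h.shape) > p       # keep each unit with probability 1 - p
plain = h * mask                     # standard dropout, no rescaling
inverted = h * mask / (1 - p)        # inverted dropout

plain_mean = plain.mean()            # close to 1 - p
inverted_mean = inverted.mean()      # close to 1, matching the no-dropout activations
print(plain_mean, inverted_mean)
```

Only the inverted version keeps the average activation where the test-time network expects it, which is why no correction is needed at inference.
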
Dropout in Action

A 3-layer network with dropout. Click "Drop" to randomly zero out neurons (gray). Each forward pass uses a different random subset of the network.

Ensemble interpretation: With n neurons and dropout, training samples from 2^n possible sub-networks. At test time, using all neurons with scaled weights approximates the average prediction of all those sub-networks. Dropout is a computationally cheap form of ensemble learning.
Why does inverted dropout scale activations by 1/(1−p) during training?

Chapter 6: Loss Functions

The loss function tells the network how wrong it is. Every choice of loss function encodes a different belief about what "wrong" means. Pick the wrong one, and you're optimizing for the wrong thing.

For classification, the two dominant losses are:

Softmax cross-entropy (the standard for multi-class classification): First, convert raw scores (logits) into probabilities via the softmax function. Then measure how far the predicted distribution is from the true label using cross-entropy:

L = −log(p_correct) = −log(e^{s_y} / ∑_j e^{s_j})

If the network is confident and correct (p_correct close to 1), the loss is near zero. If it's wrong or uncertain, −log(p) grows steeply. This loss has a beautiful probabilistic interpretation: it's the negative log-likelihood under a categorical distribution.
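
In code, the formula is usually computed in log space with the row maximum subtracted first, since exponentiating large logits overflows. The following is a sketch of that common trick; the function name and the example logits are made up.

```python
import numpy as np

def softmax_cross_entropy(logits, y):
    """Mean cross-entropy for integer labels y; logits has shape (N, C)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # largest entry becomes 0
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 0.0, 0.0]])   # the second row would overflow a naive exp
labels = np.array([0, 0])
loss = softmax_cross_entropy(logits, labels)

# With all-equal logits the prediction is uniform, so the loss is log(number of classes)
uniform_loss = softmax_cross_entropy(np.zeros((1, 3)), np.array([1]))
print(loss, uniform_loss)
```

Subtracting the maximum does not change the result, because softmax is unaffected by adding a constant to every logit in a row. The uniform case is also a handy sanity check: an untrained classifier with C classes should start near a loss of log(C).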

SVM / hinge loss (multi-class SVM): For each wrong class j, penalize if its score is within a margin of the correct class score:

L = ∑_{j ≠ y} max(0, s_j − s_y + 1)

This only cares about getting the ordering right with a margin — it doesn't try to push probabilities to 0 or 1. In practice, cross-entropy is almost always preferred for neural networks because its gradients are smoother and it works naturally with softmax.
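
For comparison, here is a sketch of the multi-class hinge loss on one made-up score vector. The function name multiclass_hinge and the scores are illustrative.

```python
import numpy as np

def multiclass_hinge(scores, y, margin=1.0):
    """Mean multi-class SVM loss; scores has shape (N, C), y holds integer labels."""
    idx = np.arange(len(y))
    correct = scores[idx, y][:, None]                  # score of the true class, per row
    margins = np.maximum(0.0, scores - correct + margin)
    margins[idx, y] = 0.0                              # the sum skips j == y
    return margins.sum(axis=1).mean()

scores = np.array([[3.2, 5.1, -1.7]])

# True class 0: max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = 2.9 + 0
loss_wrong = multiclass_hinge(scores, np.array([0]))

# True class 1: both other scores trail by more than the margin, so the loss is 0
loss_right = multiclass_hinge(scores, np.array([1]))
print(loss_wrong, loss_right)
```

Once the correct class wins by at least the margin, the loss is exactly zero and stops producing gradient, whereas cross-entropy would keep nudging the probabilities toward 0 and 1.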

Classification vs regression: For regression (predicting a continuous value), use L2 loss (mean squared error: (1/N) ∑ (ŷ − y)²) or L1 loss (mean absolute error: (1/N) ∑ |ŷ − y|). L2 penalizes large errors quadratically, making it sensitive to outliers. L1 is more robust, but its gradient is discontinuous at zero error.
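
A tiny made-up example shows the outlier sensitivity: three predictions are nearly perfect and one is off by 10.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 14.0])   # the last prediction is off by 10

errors = y_pred - y_true
mse = np.mean(errors ** 2)       # 25.015: dominated by the single large error
mae = np.mean(np.abs(errors))    # 2.6: the large error counts linearly
print(mse, mae)
```

The one bad prediction contributes 100 of the 100.06 total squared error, so under L2 the model would spend almost all of its effort on that single point.
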
Loss Function Comparison

Predicted score for the correct class on the x-axis, loss on the y-axis. Cross-entropy grows sharply as confidence in the wrong answer increases. Hinge loss is piecewise linear.

| Loss | Task | Gradient Behavior | When to Use |
| --- | --- | --- | --- |
| Cross-entropy | Classification | Smooth, probabilistic | Default for classification |
| Hinge | Classification | Piecewise linear | SVMs, margin-based models |
| MSE (L2) | Regression | Proportional to error | Default for regression |
| MAE (L1) | Regression | Constant magnitude | Outlier-robust regression |

Why is cross-entropy preferred over hinge loss for neural networks?

Chapter 7: Initialization Explorer

This is the payoff. We'll train the same 5-layer ReLU network on the same data, changing only the weight initialization. Watch how the distribution of activations evolves layer by layer. Healthy training keeps the histograms roughly bell-shaped with consistent spread. Bad initialization causes them to collapse to zero or explode to the extremes.

Weight Initialization Showdown

Four initialization strategies, same architecture. Each row shows activation histograms across 5 layers. Green = healthy, Red = collapsed/exploded. Click an init method, then "Forward Pass" to push random data through the network.

What to look for:
Zeros: All histograms collapse to a single spike at zero. Every neuron outputs zero. Dead network.
Small random: First layers are okay, but later layers shrink toward zero. Activations vanish.
Xavier: Designed for tanh/sigmoid. With ReLU, it's slightly too conservative — activations shrink gently.
He: Just right for ReLU. Histograms stay consistently spread across all layers.

Try clicking "Forward Pass" multiple times for each initialization. He init produces consistent, healthy distributions every time. Small random sometimes works for early layers but always collapses by layer 5. Zeros are dead on arrival.

Chapter 8: Hyperparameter Summary

Every technique we've covered comes with knobs to turn. Setting these hyperparameters — values not learned by gradient descent but chosen by you — is both art and science. Here's a practical reference.

| Hyperparameter | Typical Range | Start With | Effect of Too High | Effect of Too Low |
| --- | --- | --- | --- | --- |
| Learning rate | 1e-5 to 1e-1 | 1e-3 | Divergence, loss explodes | Training too slow, stuck |
| Reg strength λ | 1e-5 to 1e-1 | 1e-4 | Underfitting | Overfitting |
| Dropout rate p | 0.0 to 0.5 | 0.0 (no dropout) | Underfitting, noisy gradients | Overfitting |
| Batch size | 32 to 512 | 64 or 128 | Less noise, fewer updates | Noisy gradients, slow |

Tuning strategy: Start with a known-good default (e.g., Adam with lr=1e-3, no dropout, λ=1e-4, batch size 64). Train until you can overfit the training set — this confirms your model has enough capacity. Then add regularization (dropout, stronger λ, data augmentation) until the gap between training and validation loss closes.
Step 1: Overfit. No regularization. Can you reach near-zero training loss? If not, increase model size.
Step 2: Regularize. Add dropout, L2, data augmentation. Close the train/val gap.
Step 3: Tune LR. Try LR schedules (cosine, step decay). Fine-tune learning rate.
Step 4: Scale. More data, bigger model, longer training. Repeat.

Learning rate is the single most important hyperparameter. Too high and the loss oscillates or diverges. Too low and training takes forever or gets stuck in a bad local minimum. A common technique: learning rate warmup (start low, increase linearly) followed by cosine decay (gradually decrease to near zero).
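
Warmup followed by cosine decay fits in one small function. This is a sketch using only the standard library; the function name lr_at_step, the 100 warmup steps, and the 1000 total steps are assumptions picked for illustration.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay toward min_lr (illustrative helper)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, total_steps=1000) for s in range(1000)]
print(schedule[0], schedule[99], schedule[550], schedule[-1])
```

The rate climbs linearly to the base value over the first 100 steps, then follows half a cosine wave down to nearly zero by the final step.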

Batch size controls the noise in gradient estimates. Smaller batches give noisier but more frequent updates. Larger batches give more accurate gradients but fewer updates per epoch. There's evidence that moderate noise (batch size 32-256) actually helps generalization by preventing the optimizer from settling into sharp minima.

What should you do first when tuning a new model?

Chapter 9: Connections

We've covered everything between designing an architecture and training it. Here's where these techniques fit in the bigger picture:

| This Lesson | What Comes Next |
| --- | --- |
| Data preprocessing | Data augmentation (flips, crops, color jitter) |
| Weight initialization | Residual connections (skip connections fix deep init) |
| Batch normalization | Layer norm, group norm, RMS norm (transformer variants) |
| Dropout | DropPath, stochastic depth (modern regularization) |
| Loss functions | Contrastive loss, focal loss, triplet loss |
| Hyperparameter tuning | Learning rate schedules, optimizers (Adam, AdamW) |

The training pipeline: Data preprocessing → Weight init → Forward pass → Compute loss → Backward pass → Update weights (with regularization). We've covered everything except the backward pass (backpropagation) and the update rule (optimization). Those are next.

The techniques in this lesson scale remarkably well. Batch normalization was designed for convolutional networks in 2015, but its descendants (layer norm, RMS norm) are the standard normalization layers in transformers today. He initialization, together with residual connections, makes training 100-layer ResNets possible. Dropout, though less common in modern architectures, remains a reliable tool when data is scarce.

Related lessons: Neural Networks Part 1 (the architecture), Linear Classification (where it all started), Image Classification (data pipeline in practice).

"The key to artificial intelligence has always been the representation." — Jeff Hawkins. Every technique in this lesson is about ensuring your network can find the right representation: preprocessing gives it clean inputs, initialization gives it a fair start, normalization keeps it stable, regularization keeps it honest, and the loss function tells it what "right" means.