Neural Networks Part 2 — Setting Up the Data and the Loss

Chapter 0: Why Setup Matters

In Part 1, we built a neural network: layers of neurons, activation functions, forward passes. The architecture is ready. But if you just throw raw data at it with random weights and hit "train," you'll likely get garbage.

Training a neural network is like tuning a complex instrument. The architecture is the instrument itself — but you also need to tune the strings (weight initialization), prepare the sheet music (data preprocessing), keep the sound balanced (batch normalization), prevent feedback loops (regularization), and choose how to score the performance (loss function).

The setup problem: Two identical architectures can produce wildly different results depending on how you preprocess data, initialize weights, regularize, and choose your loss function. Getting the setup right is often the difference between a model that trains and one that doesn't.

Same Network, Different Setup

Both networks have the same architecture (2 hidden layers, 16 neurons each). Left: poor setup (no preprocessing, zero init). Right: good setup (normalized data, He init). Watch the loss curves diverge.

This lesson covers every decision you need to make before calling your training loop: how to prepare the data, how to set initial weights, how to keep activations healthy, how to prevent overfitting, and how to measure error. These aren't minor details — they're the foundation of successful training.

Why can two networks with identical architectures produce very different results?

• They use different programming languages
• Differences in data preprocessing, weight initialization, regularization, and loss function
• Neural networks are inherently random and can never be controlled

Chapter 1: Data Preprocessing

Before a single gradient is computed, your data needs to be cleaned up. Raw features come in all shapes: pixel intensities from 0 to 255, prices in the thousands, ages from 0 to 100. If one feature is 100x larger than another, the loss landscape becomes a long, narrow valley that gradient descent struggles to navigate.

Mean subtraction is the simplest fix. Compute the mean of each feature across your training set, then subtract it from every data point. This centers the data cloud around the origin. For images, you might subtract a mean image (one mean per pixel location, computed across all training images) or a single global mean over all pixels.

x' = x − μ_train

Normalization goes further: after centering, divide by the standard deviation. Now each feature has zero mean and unit variance. The loss landscape becomes more circular, and gradient descent converges faster.

x' = (x − μ) / σ

Critical rule: Compute μ and σ on the training set only. Then apply those same values to validation and test data. If you compute statistics on the test set, you're leaking information from the future.
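The rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a prescribed pipeline: the arrays `X_train` and `X_test`, the feature scales, and the small constant added to σ are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 features on very different scales (e.g. pixel, price, age).
scales = np.array([255.0, 5000.0, 100.0])
X_train = rng.random((1000, 3)) * scales
X_test = rng.random((200, 3)) * scales

# Statistics come from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-8  # small constant guards against zero variance

# The same mu and sigma are reused for every split.
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma

print(X_train_norm.mean(axis=0))  # approximately 0 for each feature
print(X_train_norm.std(axis=0))   # approximately 1 for each feature
```

Note that the test split will not have exactly zero mean and unit variance after this transform. That is expected: it is being measured with the training set's ruler.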

Effect of Preprocessing on Loss Landscape

Left: raw data creates an elongated loss contour. Right: normalized data creates circular contours. The gradient points directly toward the minimum.

PCA whitening is a more aggressive transform. First, PCA rotates the data so axes align with the directions of greatest variance (the eigenvectors of the covariance matrix). Then whitening divides each axis by the square root of its eigenvalue, that is, by the standard deviation along that axis, making all directions have equal variance. The result: a spherical data cloud with identity covariance. In practice, standard normalization is almost always sufficient; PCA whitening is rarely needed for deep networks because the first layer can learn its own rotation.
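For readers who want to see the two steps concretely, here is a rough NumPy sketch of PCA followed by whitening. The mixing matrix `A`, the sample count, and the `1e-5` stabilizer are assumptions chosen only to produce correlated toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D data.
A = np.array([[3.0, 0.0], [1.5, 0.5]])
X = rng.standard_normal((5000, 2)) @ A.T

X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / X_centered.shape[0]

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

X_rot = X_centered @ eigvecs               # PCA: rotate onto principal axes
X_white = X_rot / np.sqrt(eigvals + 1e-5)  # whiten: scale each axis to unit variance

cov_white = X_white.T @ X_white / X_white.shape[0]
print(np.round(cov_white, 3))  # close to the identity matrix
```

The division uses the square root of each eigenvalue because an eigenvalue of the covariance matrix is a variance, and unit variance requires dividing by the standard deviation.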

Why must you compute mean and standard deviation only from the training set?

• Using test-set statistics leaks future information into preprocessing, breaking the train/test contract
• Test-set statistics are too noisy to be useful
• It's just a convention; either way works fine

Chapter 2: Weight Initialization

You've preprocessed your data. Now, what values do the weights start at? This matters enormously. The wrong initialization can kill training before it begins.

All zeros: If every weight is zero, every neuron computes the same output. They all receive the same gradient. They all update identically. They stay identical forever. This is called the symmetry problem — the network has many neurons but they all do the same thing, as if you had just one.

Symmetry breaking: We must initialize with different random values so that each neuron starts computing something unique. Random initialization breaks the symmetry and lets the network learn diverse features.
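The symmetry problem is easy to check numerically. The sketch below uses a hypothetical two-layer tanh network with a squared-error loss and compares the gradient each hidden unit receives. It uses a constant nonzero value rather than exact zeros, because with exact zeros and tanh every gradient in this particular network would itself be zero; the symmetry argument is the same either way.

```python
import numpy as np

def hidden_weight_grad(W1, W2, X, y):
    """Gradient of a squared-error loss w.r.t. W1 for a tiny tanh network."""
    h = np.tanh(X @ W1)             # hidden activations, shape (N, H)
    pred = h @ W2                   # predictions, shape (N, 1)
    dpred = (pred - y) / len(X)     # dL/dpred for L = mean(0.5 * (pred - y)^2)
    dh = dpred @ W2.T               # backprop into the hidden layer
    return X.T @ (dh * (1 - h**2))  # dL/dW1, shape (D, H)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 3))
y = rng.standard_normal((32, 1))

# Identical weights for every hidden unit (zeros are a special case of this).
dW1_const = hidden_weight_grad(np.full((3, 4), 0.1), np.full((4, 1), 0.1), X, y)

# Random weights give each hidden unit its own gradient.
dW1_rand = hidden_weight_grad(0.1 * rng.standard_normal((3, 4)),
                              0.1 * rng.standard_normal((4, 1)), X, y)

# Each column is one hidden unit's gradient.
print(np.allclose(dW1_const, dW1_const[:, :1]))  # True: all units get the same update
print(np.allclose(dW1_rand, dW1_rand[:, :1]))    # False: symmetry is broken
```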

Small random: Initialize weights from a small Gaussian, say with mean 0 and standard deviation 0.01. This works for shallow networks, but for deep ones the activations shrink layer by layer; by layer 10, they're effectively zero. Gradients vanish. Training stalls.

Too large: Initialize from N(0, 1.0). Now activations explode — they saturate sigmoid/tanh or blow up with ReLU. Gradients explode or vanish. Also unusable.

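Xavier initialization (Glorot & Bengio, 2010): Set the variance to 1/n_in, where n_in is the number of inputs to the layer. This keeps the variance of activations roughly constant across layers. (The original paper uses 2/(n_in + n_out) to balance the forward and backward passes; 1/n_in is the common simplified form.) For a layer with n_in inputs: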

w ~ N(0, 1 / n_in)

He initialization (He et al., 2015): For ReLU networks, half the activations are zeroed out, so we need to compensate by doubling the variance:

w ~ N(0, 2 / n_in)

The golden rule: Xavier for tanh/sigmoid, He for ReLU. Both aim for the same goal: keep activation variance stable across layers so gradients neither vanish nor explode.
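Both rules amount to scaling a standard Gaussian by the right factor. The following is a minimal sketch; the function names `xavier_init` and `he_init`, the layer width, and the batch size are illustrative choices, not a reference implementation. The ReLU check tracks the mean squared activation, which is the quantity the factor of 2 is meant to hold steady.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Gaussian weights with variance 1 / n_in (simplified Xavier form)."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out):
    """Gaussian weights with variance 2 / n_in, intended for ReLU layers."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

n_in, n_out = 512, 512
x = rng.standard_normal((2000, n_in))  # unit-variance inputs

# Xavier: pre-activation variance stays near the input variance (about 1).
z_xavier = x @ xavier_init(n_in, n_out)
print(z_xavier.var())

# He + ReLU: the mean squared activation stays near the input's (about 1).
a_he = np.maximum(0, x @ he_init(n_in, n_out))
print((a_he ** 2).mean())
```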

Activation Variance Across Layers

Watch how activation variance changes layer by layer with different initializations. Ideal: the bars stay roughly the same height.

Small random: activations collapse to zero in deeper layers.

Why does initializing all weights to zero prevent learning?

• Zero weights cause division by zero errors
• All neurons compute the same output, receive the same gradient, and never differentiate
• The loss function is undefined at zero

Chapter 3: Batch Normalization

Even with good initialization, activations can drift during training. As weights update, the distribution of inputs to each layer shifts — a problem called internal covariate shift. Later layers must constantly adapt to a moving target.

Batch normalization (Ioffe & Szegedy, 2015) fixes this by normalizing activations within each mini-batch. For each feature in a layer, compute the batch mean and variance, then normalize:

x̂ = (x − μ_B) / √(σ_B² + ε)

But wait — if we always force activations to zero mean and unit variance, we're limiting what the network can represent. Maybe the optimal activation distribution isn't standard normal. So batch norm adds two learnable parameters: a scale γ and a shift β:

y = γ · x̂ + β

The network can learn to undo the normalization entirely (by setting γ = √(σ_B² + ε) and β = μ_B), or it can keep the normalized version, or anything in between. The key insight: the network chooses its activation distribution rather than having it imposed by upstream weight changes.

At test time: You don't have a batch to compute statistics from. Instead, use running averages of μ and σ² accumulated during training. This is why you see model.eval() in PyTorch — it switches batch norm from batch statistics to running statistics.
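To make the train/test distinction concrete, here is a simplified batch norm forward pass in NumPy. It is a sketch under stated assumptions: the class name `BatchNorm1d` is borrowed for familiarity but this is not PyTorch's implementation, the `momentum` value of 0.1 is an assumed running-average rate, and the backward pass is omitted.

```python
import numpy as np

class BatchNorm1d:
    """Minimal batch norm over the features of a (batch, features) array."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)   # learnable scale
        self.beta = np.zeros(num_features)   # learnable shift
        self.eps = eps
        self.momentum = momentum             # assumed running-average rate
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training):
        if training:
            mu = x.mean(axis=0)
            var = x.var(axis=0)
            # Exponential moving averages, stored for use at test time.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1d(4)
for _ in range(200):  # simulate training batches with mean 5 and std 3
    batch = 3.0 * rng.standard_normal((64, 4)) + 5.0
    out_train = bn.forward(batch, training=True)

# At test time a single example works, since no batch statistics are needed.
out_test = bn.forward(np.full((1, 4), 5.0), training=False)
print(out_train.mean(axis=0), out_train.std(axis=0))  # near 0 and near 1
print(out_test)  # near 0: the input sits at the running mean
```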

Input batch (activations from previous layer) → Normalize: x̂ = (x − μ_B) / √(σ_B² + ε) → Scale & Shift: y = γ · x̂ + β (learnable) → to activation function

Batch Normalization Effect

Without batch norm, activations drift as training progresses (left). With batch norm, they stay centered and stable (right).

Benefit	Why
Faster training	Allows higher learning rates without divergence
Reduces sensitivity to init	Normalization dampens the effect of bad starting weights
Slight regularization	Batch statistics add noise, acting like mini-dropout

Why does batch norm include learnable γ and β parameters?

• So the network can learn the optimal activation distribution, including undoing the normalization if needed
• To speed up the normalization computation
• They replace the weights and biases of the layer

Chapter 4: Regularization

A neural network with enough parameters can memorize the entire training set — every image, every label, every noise artifact. It achieves near-zero training loss but fails catastrophically on new data. This is overfitting: the model has learned the training data's noise instead of its underlying patterns.

Regularization is any technique that fights overfitting by constraining the model's complexity. The most common approach: add a penalty to the loss function that discourages large weights.

L2 regularization (weight decay) adds the sum of squared weights to the loss:

L_total = L_data + λ ∑ w_i²

The gradient of the penalty is 2λw — it pushes every weight toward zero proportionally to its magnitude. Large weights get penalized heavily. The result: the network prefers many small weights over a few large ones, which produces smoother, more generalizable functions.
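A short numerical example may help here. The sketch below computes the penalty and its gradient for a hypothetical three-weight vector, then takes one gradient step on the penalty alone to show why L2 is also called weight decay. The helper name `l2_penalty` and the values of `lam` and `lr` are assumptions for illustration.

```python
import numpy as np

def l2_penalty(w, lam):
    """L2 penalty lam * sum(w^2) and its gradient 2 * lam * w."""
    return lam * np.sum(w ** 2), 2 * lam * w

w = np.array([3.0, -0.5, 0.1])
lam = 0.1
penalty, grad = l2_penalty(w, lam)
print(penalty)  # 0.1 * (9 + 0.25 + 0.01)
print(grad)     # larger weights receive proportionally larger pulls toward zero

# One gradient step on the penalty alone shrinks every weight by the same factor.
lr = 0.5
w_new = w - lr * grad
print(w_new / w)  # each ratio equals 1 - 2 * lr * lam
```

Every weight is multiplied by the same factor (here 0.9), so in absolute terms the largest weight loses the most.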

Why smaller weights help: A function with small weights changes slowly — it can't swing wildly between data points. It's forced to find the smooth trend rather than memorize every bump. Think of it as Occam's razor: among all functions that fit the data, prefer the simplest.

L1 regularization adds the sum of absolute values: λ ∑ |w_i|. Unlike L2, L1 pushes weights all the way to exactly zero, producing sparse networks where many weights are inactive. This is useful for feature selection but less common in deep learning.

Max norm constraints clip the weight vector of each neuron if its norm exceeds a threshold c: if ||w|| > c, rescale w → c · w / ||w||. This bounds the maximum capacity directly.
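The max norm rule can be sketched as a projection applied after each weight update. This example assumes weights are stored with shape (n_in, n_out), so that each column holds one neuron's incoming weights; the function name `apply_max_norm` is hypothetical.

```python
import numpy as np

def apply_max_norm(W, c):
    """Rescale each column (one neuron's incoming weights) to have norm at most c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # 1.0 leaves a column unchanged
    return W * scale

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])  # column norms: 5.0 and 0.5
W_clipped = apply_max_norm(W, c=2.0)
print(np.linalg.norm(W_clipped, axis=0))  # first column rescaled to 2.0, second unchanged
```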

Regularization Strength

A network fits noisy data. Increasing λ strengthens L2 regularization and smooths the fit. Too little: overfitting. Too much: underfitting.

Method	Penalty	Effect on Weights	Use Case
L2	λ ∑ w²	Shrinks toward zero	Default for deep learning
L1	λ ∑ |w|	Drives to exactly zero	Feature selection, sparsity
Max norm	Clip if ||w|| > c	Bounds magnitude	Stability with high LR

What happens if you set λ too high in L2 regularization?

• The model overfits more severely
• Training becomes faster
• The model underfits: weights are pushed so close to zero it can't capture the real pattern

Chapter 5: Dropout

Dropout (Srivastava et al., 2014) is a beautifully simple idea: during each training step, randomly "turn off" each neuron with probability p (typically 0.5). Set its output to zero. Gone. The remaining neurons must learn to be useful on their own, without relying on any specific partner.

The intuition: Imagine a team project where any member might be absent on any given day. Everyone must be capable of contributing independently — no one can free-ride on a single star player. That's dropout. It prevents co-adaptation: neurons learning to depend on very specific other neurons.

During training, we create a random binary mask for each layer and multiply it element-wise with the activations. But this introduces a problem at test time: if we used all neurons (no dropout), the expected activation would be 1/(1−p) times larger than during training, because only a fraction (1−p) of the neurons were active in each training pass.

Inverted dropout is the standard fix: during training, divide the surviving activations by (1−p) to compensate. Now the expected value stays the same whether dropout is on or off, and at test time you simply use all neurons without modification.

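```python
import numpy as np

p = 0.5                    # drop probability
h = np.random.randn(4, 8)  # example activations: batch of 4, 8 hidden units

# Inverted dropout during training
mask = np.random.rand(*h.shape) > p  # True (keep) with probability 1 - p
h = h * mask / (1 - p)               # zero the dropped units, scale up survivors

# At test time: just use h as-is. No scaling needed.
```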
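To confirm that the 1/(1−p) scaling preserves the expected activation, the snippet can be wrapped in a function with a train/test switch and checked empirically. This is a sketch: the function name `dropout_forward` and the all-ones test array are assumptions chosen to make the averages easy to read.

```python
import numpy as np

def dropout_forward(h, p, training, rng):
    """Inverted dropout: drop with probability p in training, identity at test time."""
    if not training:
        return h
    mask = rng.random(h.shape) > p
    return h * mask / (1 - p)

rng = np.random.default_rng(0)
h = np.ones((1000, 1000))
p = 0.5

h_train = dropout_forward(h, p, training=True, rng=rng)
h_test = dropout_forward(h, p, training=False, rng=rng)

print(h_train.mean())         # close to 1.0: expectation is preserved
print(h_test.mean())          # exactly 1.0
print((h_train == 0).mean())  # close to p: fraction of dropped units
```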

Dropout in Action

A 3-layer network with dropout at rate 0.5. Dropped neurons are zeroed out (gray). Each forward pass uses a different random subset of the network.

Ensemble interpretation: With n neurons, dropout samples from 2ⁿ possible sub-networks during training, all sharing weights. At test time, using all neurons (with the scaling already handled by inverted dropout) approximates the average prediction of those sub-networks. Dropout is a computationally cheap form of ensemble learning.

Why does inverted dropout scale activations by 1/(1−p) during training?

• To keep the expected activation magnitude the same, so test-time predictions need no adjustment
• To make surviving neurons learn faster
• To increase the gradient signal

Chapter 6: Loss Functions

The loss function tells the network how wrong it is. Every choice of loss function encodes a different belief about what "wrong" means. Pick the wrong one, and you're optimizing for the wrong thing.

For classification, the two dominant losses are:

Softmax cross-entropy (the standard for multi-class classification): First, convert raw scores (logits) into probabilities via the softmax function. Then measure how far the predicted distribution is from the true label using cross-entropy:

L = −log(p_correct) = −log(e^(s_y) / ∑_j e^(s_j))

If the network is confident and correct (p_correct close to 1), the loss is near zero. If it's wrong or uncertain, −log(p) grows steeply. This loss has a beautiful probabilistic interpretation: it's the negative log-likelihood under a categorical distribution.
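The behavior described above can be reproduced with a small NumPy function. This is a sketch, not a library API: the name `softmax_cross_entropy` and the example scores are assumptions, and subtracting the row maximum before exponentiating is a standard numerical-stability step that leaves the result unchanged.

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """Mean cross-entropy loss for logits `scores` (N, C) and integer labels `y` (N,)."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # stability: avoids overflow in exp
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

scores = np.array([[5.0, 0.0, 0.0],   # confident and correct
                   [0.0, 0.0, 0.0],   # uncertain
                   [0.0, 5.0, 0.0]])  # confident and wrong
y = np.array([0, 0, 0])

# Evaluate each example separately to compare the three cases.
losses = [softmax_cross_entropy(s[None, :], np.array([label]))
          for s, label in zip(scores, y)]
print(losses)  # small, then log(3), then large
```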

SVM / hinge loss (multi-class SVM): For each wrong class j, penalize if its score is within a margin of the correct class score:

L = ∑_{j ≠ y} max(0, s_j − s_y + 1)

This only cares about getting the ordering right with a margin — it doesn't try to push probabilities to 0 or 1. In practice, cross-entropy is almost always preferred for neural networks because its gradients are smoother and it works naturally with softmax.
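For comparison, here is a sketch of the multi-class hinge loss from the formula above. The function name `multiclass_hinge` and the example scores are illustrative assumptions. The second example shows the point about ordering: once the margin is satisfied, a larger correct-class score earns nothing more.

```python
import numpy as np

def multiclass_hinge(scores, y, margin=1.0):
    """Mean multi-class SVM loss for `scores` (N, C) and integer labels `y` (N,)."""
    n = len(y)
    correct = scores[np.arange(n), y][:, None]          # s_y for each example
    margins = np.maximum(0, scores - correct + margin)  # max(0, s_j - s_y + margin)
    margins[np.arange(n), y] = 0                        # the sum skips j == y
    return margins.sum(axis=1).mean()

y = np.array([0])
print(multiclass_hinge(np.array([[3.0, 1.0, 0.5]]), y))   # 0.0: margin satisfied
print(multiclass_hinge(np.array([[10.0, 1.0, 0.5]]), y))  # 0.0: extra confidence earns nothing more
print(multiclass_hinge(np.array([[1.0, 0.5, 2.0]]), y))   # 2.5: classes 1 and 2 violate the margin
```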

Classification vs regression: For regression (predicting a continuous value), use L2 loss (mean squared error: (1/N) ∑(ŷ − y)²) or L1 loss (mean absolute error: (1/N) ∑|ŷ − y|). L2 penalizes large errors quadratically, making it sensitive to outliers. L1 is more robust, but its gradient is discontinuous where the error is zero.
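The outlier sensitivity of L2 is easy to see with numbers. In this sketch, the helper names `mse` and `mae` and the toy targets are assumptions; the second prediction vector differs from the first only by one large error.

```python
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

def mae(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_clean = np.array([1.5, 2.5, 3.5, 4.5])     # every error is 0.5
y_outlier = np.array([1.5, 2.5, 3.5, 14.0])  # one error of 10

print(mse(y_clean, y_true), mae(y_clean, y_true))      # 0.25 0.5
print(mse(y_outlier, y_true), mae(y_outlier, y_true))  # 25.1875 2.875
```

A single outlier multiplies the MSE by roughly 100 but the MAE by less than 6, which is why L1 is described as the more robust choice.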

Loss Function Comparison

Predicted score for the correct class on the x-axis, loss on the y-axis. Cross-entropy grows sharply as confidence in the wrong answer increases. Hinge loss is piecewise linear.

Loss	Task	Gradient Behavior	When to Use
Cross-entropy	Classification	Smooth, probabilistic	Default for classification
Hinge	Classification	Piecewise linear	SVMs, margin-based models
MSE (L2)	Regression	Proportional to error	Default for regression
MAE (L1)	Regression	Constant magnitude	Outlier-robust regression

Why is cross-entropy preferred over hinge loss for neural networks?

• Its smooth gradients work better with gradient descent, and it has a natural probabilistic interpretation
• It's faster to compute
• Hinge loss can only be used with two classes

Chapter 7: Initialization Explorer

This is the payoff. We'll push the same data through the same 5-layer ReLU network, changing only the weight initialization. Watch how the distribution of activations evolves layer by layer. Healthy initialization keeps the histograms consistently spread from layer to layer. Bad initialization causes them to collapse to zero or explode to the extremes.

Weight Initialization Showdown

Four initialization strategies, same architecture. Each row shows activation histograms across 5 layers after a forward pass of random data. Green = healthy, Red = collapsed/exploded.

What to look for:
• Zeros: All histograms collapse to a single spike at zero. Every neuron outputs zero. Dead network.
• Small random: First layers are okay, but later layers shrink toward zero. Activations vanish.
• Xavier: Designed for tanh/sigmoid. With ReLU, it's slightly too conservative — activations shrink gently.
• He: Just right for ReLU. Histograms stay consistently spread across all layers.

Repeating the forward pass with fresh random weights gives the same picture each time. He init produces consistent, healthy distributions. Small random may look fine in early layers but collapses by layer 5. Zeros are dead on arrival.
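The comparison can be approximated offline with a short script. This is a rough stand-in for the explorer, not its actual code: the function name `layer_stds`, the width of 256, the batch size, and the omission of biases are all assumptions. It reports the standard deviation of the activations at each of the 5 layers instead of drawing histograms.

```python
import numpy as np

def layer_stds(init, n_layers=5, width=256, batch=512, seed=0):
    """Std of ReLU activations at each layer for a given init strategy."""
    rng = np.random.default_rng(seed)
    scale = {
        "zeros": 0.0,
        "small": 0.01,                   # fixed standard deviation of 0.01
        "xavier": np.sqrt(1.0 / width),  # variance 1 / n_in
        "he": np.sqrt(2.0 / width),      # variance 2 / n_in
    }[init]
    h = rng.standard_normal((batch, width))
    stds = []
    for _ in range(n_layers):
        W = scale * rng.standard_normal((width, width))
        h = np.maximum(0, h @ W)         # ReLU layer, biases omitted
        stds.append(h.std())
    return stds

results = {name: layer_stds(name) for name in ["zeros", "small", "xavier", "he"]}
for name, stds in results.items():
    print(f"{name:>6}: " + "  ".join(f"{s:.2e}" for s in stds))
```

With these settings the "he" row should stay roughly flat, the "xavier" row should shrink gently, the "small" row should fall by orders of magnitude, and the "zeros" row should be zero throughout.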

Chapter 8: Hyperparameter Summary

Every technique we've covered comes with knobs to turn. Setting these hyperparameters — values not learned by gradient descent but chosen by you — is both art and science. Here's a practical reference.

Hyperparameter	Typical Range	Start With	Effect of Too High	Effect of Too Low
Learning rate	1e-5 to 1e-1	1e-3	Divergence, loss explodes	Training too slow, stuck
Reg strength λ	1e-5 to 1e-1	1e-4	Underfitting	Overfitting
Dropout rate p	0.0 to 0.5	0.0 (no dropout)	Underfitting, noisy gradients	Overfitting
Batch size	32 to 512	64 or 128	Less noise, fewer updates	Noisy gradients, slow

Tuning strategy: Start with a known-good default (e.g., Adam with lr=1e-3, no dropout, λ=1e-4, batch size 64). Train until you can overfit the training set — this confirms your model has enough capacity. Then add regularization (dropout, stronger λ, data augmentation) until the gap between training and validation loss closes.

Step 1: Overfit. No regularization. Can you reach near-zero training loss? If not, increase model size.
Step 2: Regularize. Add dropout, L2, data augmentation. Close the train/val gap.
Step 3: Tune LR. Try LR schedules (cosine, step decay). Fine-tune learning rate.
Step 4: Scale. More data, bigger model, longer training. Repeat.

Learning rate is the single most important hyperparameter. Too high and the loss oscillates or diverges. Too low and training takes forever or gets stuck in a bad local minimum. A common technique: learning rate warmup (start low, increase linearly) followed by cosine decay (gradually decrease to near zero).
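Warmup followed by cosine decay can be written as a single function of the step number. The sketch below is one common formulation; the function name `lr_at_step`, the step counts, and the peak learning rate are assumptions for illustration, and deep learning frameworks ship their own scheduler classes with different interfaces.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, total_steps=1000, warmup_steps=100, peak_lr=1e-3)
            for s in range(1000)]

# Start of warmup, end of warmup, midpoint of decay, final step.
print(schedule[0], schedule[99], schedule[550], schedule[999])
```

The rate climbs linearly to the peak over the first 100 steps, passes through half the peak at the midpoint of the decay phase, and ends near zero.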

Batch size controls the noise in gradient estimates. Smaller batches give noisier but more frequent updates. Larger batches give more accurate gradients but fewer updates per epoch. There's evidence that moderate noise (batch size 32-256) actually helps generalization by preventing the optimizer from settling into sharp minima.

What should you do first when tuning a new model?

• Add heavy regularization immediately
• Verify you can overfit the training data with no regularization, confirming sufficient model capacity
• Use the smallest possible model

Chapter 9: Connections

We've covered everything between designing an architecture and training it. Here's where these techniques fit in the bigger picture:

This Lesson	What Comes Next
Data preprocessing	Data augmentation (flips, crops, color jitter)
Weight initialization	Residual connections (skip connections fix deep init)
Batch normalization	Layer norm, group norm, RMS norm (transformer variants)
Dropout	DropPath, stochastic depth (modern regularization)
Loss functions	Contrastive loss, focal loss, triplet loss
Hyperparameter tuning	Learning rate schedules, optimizers (Adam, AdamW)

The training pipeline: Data preprocessing → Weight init → Forward pass → Compute loss → Backward pass → Update weights (with regularization). We've covered everything except the backward pass (backpropagation) and the update rule (optimization). Those are next.

The techniques in this lesson scale remarkably well. Batch normalization was introduced in 2015 and first demonstrated on convolutional image classifiers, and its descendants (layer norm, RMS norm) appear in nearly all transformers today. He initialization, together with residual connections and normalization, helps make training 100-layer ResNets possible. Dropout, though less common in modern architectures, remains a reliable tool when data is scarce.

Related lessons: Neural Networks Part 1 (the architecture), Linear Classification (where it all started), Image Classification (data pipeline in practice).

"The key to artificial intelligence has always been the representation." — Jeff Hawkins. Every technique in this lesson is about ensuring your network can find the right representation: preprocessing gives it clean inputs, initialization gives it a fair start, normalization keeps it stable, regularization keeps it honest, and the loss function tells it what "right" means.
