Deep Learning Foundations

Neural Net Case Study: From Scratch

Build a complete 2-layer neural network from raw Python — forward pass, loss, backprop, parameter update — and watch it learn a spiral.

Prerequisites: Neural Networks Part 1 + Backpropagation basics. That's it.
10 chapters · 6+ simulations · 0 assumed knowledge

Chapter 0: Why Build One from Scratch?

You've learned what neurons are. You've seen activation functions and layer diagrams. You know about backpropagation in theory. But do you really understand how a neural network learns? There's only one way to find out: build one yourself, from raw numbers, and watch it train.

We need a dataset that's simple enough to visualize but hard enough that a linear classifier fails completely. Enter the spiral dataset: three classes of points wound into interlocking spirals. No straight line — not even three straight lines — can separate them.

The plan: We'll generate spiral data, watch a linear classifier fail on it, then build a 2-layer neural net from scratch — forward pass, loss, backprop, weight update — all in plain code. By the end, you'll watch the network carve curved decision boundaries through the spirals in real time.
The Spiral Dataset

Three classes spiraling outward. Try to imagine drawing straight lines to separate them. You can't. This is the problem we'll solve.

This is the same "minimal neural network case study" from Stanford's cs231n. We'll follow the same structure: generate data, fail with a linear model, succeed with a two-layer net. Every line of code will be explained. No magic.

Why is the spiral dataset a good test for neural networks?

Chapter 1: Generating Data — The Spiral

Let's build the dataset. We want K = 3 classes, each with N = 100 points, living in 2D space. Each class spirals outward from the origin at a different angle:

python
import numpy as np

N = 100   # points per class
K = 3     # number of classes
D = 2     # dimensionality (x, y)
X = np.zeros((N*K, D))
y = np.zeros(N*K, dtype='int')

for j in range(K):
    ix = range(N*j, N*(j+1))
    r = np.linspace(0.0, 1, N)   # radius grows from 0 to 1
    t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2  # angle + noise
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j

Each class gets its own angular range of 4 radians. The r variable grows linearly from 0 to 1, so points start near the origin and spiral outward. Gaussian noise added to the angle (standard deviation 0.2 radians) makes the spirals fuzzy: realistic, not perfectly clean.

Why spirals? They're about the simplest 2D dataset that no linear classifier can separate. The arms of the three classes wrap around one another, so no arrangement of straight boundaries can put each class in its own region. You need a model that can learn curved boundaries.

The result: a matrix X of shape (300, 2) — 300 points, each with an x and y coordinate — and a label vector y of shape (300,) with values 0, 1, or 2.
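
To make the generator reusable (for example, to try different noise levels), the loop above can be wrapped in a function. This is a minimal sketch: the name make_spiral, its parameters, and the seeded random generator are additions for illustration, not part of the original code.

```python
import numpy as np

def make_spiral(n_per_class=100, n_classes=3, noise=0.2, seed=0):
    """Return (X, y) for the spiral dataset; mirrors the loop above."""
    rng = np.random.default_rng(seed)  # seeded for reproducibility (an addition)
    X = np.zeros((n_per_class * n_classes, 2))
    y = np.zeros(n_per_class * n_classes, dtype=int)
    for j in range(n_classes):
        ix = slice(n_per_class * j, n_per_class * (j + 1))
        r = np.linspace(0.0, 1.0, n_per_class)             # radius grows from 0 to 1
        t = np.linspace(j * 4, (j + 1) * 4, n_per_class)   # angular range for class j
        t = t + rng.standard_normal(n_per_class) * noise   # Gaussian noise on the angle
        X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
        y[ix] = j
    return X, y

X, y = make_spiral()
print(X.shape, y.shape)          # (300, 2) (300,)
print(np.bincount(y))            # [100 100 100]
print(np.abs(X).max() <= 1.0)    # True: the radius never exceeds 1
```

Calling make_spiral(noise=0.0) gives perfectly clean arms, and larger noise values blur them together.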

Spiral Data Generator

Adjust noise to see how it affects separability. More noise = harder problem. Points per class: 100.

In the spiral data generation, what does the variable r control?

Chapter 2: The Linear Classifier — Watch It Fail

Before we build a neural network, let's see what happens when we try a plain linear classifier: scores = x · W + b, trained with softmax cross-entropy loss. This is the approach from the linear classification lesson, applied directly to our spiral.

python
# Initialize weights
W = 0.01 * np.random.randn(D, K)   # 2x3
b = np.zeros((1, K))               # 1x3

# Compute scores
scores = np.dot(X, W) + b   # 300x3

The scores matrix has shape (300, 3) — one score per class for each of the 300 points. We convert scores to probabilities via softmax, compute the cross-entropy loss, and update W and b with gradient descent. After training, the linear classifier converges to about 49% accuracy. On a 3-class problem, random chance is 33%. So the linear model barely beats guessing.
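
The training procedure described above can be sketched end to end as follows. The step size of 1.0, the regularization strength of 1e-3, and the seeded random generator are assumptions for this sketch; the 200 steps match the demo. The exact accuracy depends on the random draw, so no specific number is promised here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 100, 3, 2

# Spiral data, generated as in Chapter 1
X = np.zeros((N * K, D))
y = np.zeros(N * K, dtype=int)
for j in range(K):
    ix = slice(N * j, N * (j + 1))
    r = np.linspace(0.0, 1.0, N)
    t = np.linspace(j * 4, (j + 1) * 4, N) + rng.standard_normal(N) * 0.2
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y[ix] = j

W = 0.01 * rng.standard_normal((D, K))
b = np.zeros((1, K))
step_size, reg = 1.0, 1e-3        # assumed hyperparameters
rows = np.arange(N * K)

for i in range(200):
    # Forward: scores -> softmax probabilities -> loss
    scores = X @ W + b
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    loss = -np.log(probs[rows, y]).mean() + 0.5 * reg * np.sum(W * W)

    # Backward: gradient of the loss with respect to W and b
    dscores = probs.copy()
    dscores[rows, y] -= 1
    dscores /= N * K
    dW = X.T @ dscores + reg * W
    db = dscores.sum(axis=0, keepdims=True)

    # Gradient descent update
    W -= step_size * dW
    b -= step_size * db

accuracy = np.mean(np.argmax(X @ W + b, axis=1) == y)
print(f"loss {loss:.3f}, training accuracy {accuracy:.2f}")
```

Running more steps does not help much: the loss flattens out because the model has only straight boundaries to work with.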

Why it fails: A linear classifier can only draw straight decision boundaries. For 3 classes, it draws three lines meeting at a point — dividing the plane into three wedges. Spirals can't be separated by wedges. The model is fundamentally incapable, no matter how long you train it.
Linear Classifier on Spirals

The colored regions show the linear classifier's decision boundaries after training. Notice: only straight lines. Many points land in the wrong region. Click "Train" to run 200 steps of gradient descent.

The straight boundary lines are the best the linear model can do. It captures the gross direction of each class but misclassifies points wherever an arm curls into another class's wedge. We need something that can bend.

A well-trained linear classifier on the spiral dataset achieves roughly what accuracy?

Chapter 3: The Score Function — Adding a Hidden Layer

The fix is beautifully simple. Instead of mapping inputs directly to class scores with one matrix, we add a hidden layer in between. The data passes through a first linear transformation, then a nonlinearity (ReLU), then a second linear transformation:

f = max(0, x · W1 + b1) · W2 + b2

Let's unpack this, treating x as a row vector to match the code. W1 has shape (D, h) where D = 2 (our input dimension) and h is the number of hidden neurons, say 100. So x · W1 + b1 gives us a vector of h numbers. Then max(0, ·) is the ReLU activation: it zeros out negatives and keeps positives. Finally, W2 has shape (h, K): it maps the h hidden features to K = 3 class scores.

Think of it this way: The first layer (W1) learns to transform the raw 2D coordinates into a new h-dimensional space where the spirals are linearly separable. The second layer (W2) is just a linear classifier in that new space. ReLU between them ensures the transformation is nonlinear. Without it, the two matrix multiplies collapse into one (x · W1 · W2 = x · W, where W = W1 · W2 is a single D × K matrix), and we're back to a linear model.
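
The collapse is easy to confirm numerically. The sketch below uses arbitrary random matrices (not trained weights) and checks that two stacked linear layers equal one linear layer, and that inserting ReLU breaks the equality.

```python
import numpy as np

rng = np.random.default_rng(0)
D, h, K = 2, 100, 3
X = rng.standard_normal((5, D))           # five arbitrary 2D points
W1, b1 = rng.standard_normal((D, h)), rng.standard_normal((1, h))
W2, b2 = rng.standard_normal((h, K)), rng.standard_normal((1, K))

# Two stacked linear layers, no activation in between
two_layer = (X @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer
W = W1 @ W2                               # (2, 3)
b = b1 @ W2 + b2                          # (1, 3)
one_layer = X @ W + b

print(np.allclose(two_layer, one_layer))  # True

# With ReLU in between, the collapse no longer holds
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))  # False
```
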
python
# Initialize parameters
h = 100  # hidden layer size
W1 = 0.01 * np.random.randn(D, h)    # 2 x 100
b1 = np.zeros((1, h))                # 1 x 100
W2 = 0.01 * np.random.randn(h, K)    # 100 x 3
b2 = np.zeros((1, K))                # 1 x 3

We initialize weights with small random numbers (not zeros — that would make all neurons identical) and biases at zero. The hidden size h is a hyperparameter — we choose it. More hidden neurons = more capacity = more complex boundaries, but also more parameters to train.
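
To put a number on "more parameters to train", here is a small helper that counts the trainable values for a given hidden size. The function name count_params is made up for this illustration.

```python
def count_params(D, h, K):
    """Number of trainable values in a D -> h -> K two-layer net."""
    return D * h + h + h * K + K   # W1, b1, W2, b2

for h in (8, 100, 1000):
    print(h, count_params(2, h, 3))
# 8 51
# 100 603
# 1000 6003
```

For this architecture the count grows linearly with h, so widening the hidden layer is cheap on a problem this small.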

Network Architecture

The two-layer network: 2 inputs → h hidden neurons with ReLU → 3 class scores. Adjust h to see the network grow.

Why is the ReLU activation essential between the two layers?

Chapter 4: Forward Pass — Computing Scores

The forward pass pushes input data through the network to produce class scores. It's three lines of NumPy:

python
# Forward pass
hidden = np.dot(X, W1) + b1      # (300, 100) raw hidden activations
hidden = np.maximum(0, hidden)   # (300, 100) ReLU: zero out negatives
scores = np.dot(hidden, W2) + b2  # (300, 3) class scores

Let's trace through a single point. Say x = [0.3, -0.5] (a 2D coordinate in our spiral).

Step 1 — Hidden pre-activations: Multiply x by W1 (a 2×h matrix) and add b1. This gives h numbers — one per hidden neuron. Each number is a weighted combination of the two input coordinates.

Step 2 — ReLU: Any negative values become zero. Positive values pass through unchanged. This is the nonlinearity that gives the network its power. After ReLU, some neurons are "off" (zero) and some are "on" (positive). The pattern of which neurons fire is what encodes useful features.

Step 3 — Output scores: Multiply the h-dimensional hidden vector by W2 (h×3) and add b2. This produces 3 scores: one per class. The highest score is the predicted class.

Dimensions check: X is (300, 2). W1 is (2, 100). So X · W1 is (300, 100). After ReLU, still (300, 100). W2 is (100, 3). So hidden · W2 is (300, 3). Three scores per point, exactly what softmax needs.
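
The three steps can be traced for the single point x = [0.3, -0.5] as follows. This sketch uses freshly initialized (untrained) weights and a hidden size of 4 so the printout stays readable; both choices are assumptions for illustration, so the printed numbers carry no meaning beyond showing the shapes and the effect of ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)
D, h, K = 2, 4, 3                      # h = 4 keeps the printout readable

W1 = 0.01 * rng.standard_normal((D, h))
b1 = np.zeros((1, h))
W2 = 0.01 * rng.standard_normal((h, K))
b2 = np.zeros((1, K))

x = np.array([[0.3, -0.5]])            # one point, shape (1, 2)

pre = x @ W1 + b1                      # Step 1: pre-activations, shape (1, 4)
hidden = np.maximum(0, pre)            # Step 2: ReLU, shape (1, 4)
scores = hidden @ W2 + b2              # Step 3: class scores, shape (1, 3)

print("pre-activations:", pre.round(4))
print("after ReLU:     ", hidden.round(4))
print("scores:         ", scores.round(6))
print("predicted class:", int(np.argmax(scores)))
```
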
Forward Pass Trace

Watch a single input flow through the network. Each hidden neuron computes a weighted sum, then ReLU kills negatives. The survivors combine into three class scores.

After the ReLU step, what happens to hidden neurons with negative pre-activation values?

Chapter 5: Computing the Loss — How Wrong Are We?

We have scores. Now we need a single number that measures how bad those scores are — the loss. We use softmax cross-entropy, the same loss from linear classification, applied to our neural net's output.

Step 1 — Softmax: Convert raw scores into probabilities. For numerical stability, subtract the max score first:

python
# Softmax: scores -> probabilities
exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # (300, 3)

Now probs[i] is a probability distribution over 3 classes for the i-th point. All values are between 0 and 1, and each row sums to 1.
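
A quick check shows why the max subtraction matters. The sketch below wraps the two lines above in a function (the name softmax and the example scores are chosen for illustration) and compares it with the naive version on scores large enough to overflow np.exp.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the max-subtraction trick."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])   # second row would overflow np.exp

probs = softmax(scores)
print(probs.sum(axis=1))                 # each row sums to 1
print(np.allclose(probs[0], probs[1]))   # True: adding a constant to a row changes nothing

# Without the shift, exp(1000) overflows to inf and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
print(np.isnan(naive[1]).all())          # True
```
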

Step 2 — Cross-entropy loss: For each point, look up the probability assigned to the correct class, take the negative log, and average over all points:

python
# Cross-entropy loss
correct_logprobs = -np.log(probs[range(N*K), y])  # (300,)
data_loss = np.sum(correct_logprobs) / (N*K)

# L2 regularization
reg = 1e-3
reg_loss = 0.5 * reg * (np.sum(W1*W1) + np.sum(W2*W2))
loss = data_loss + reg_loss

The regularization term penalizes large weights. Without it, the network could memorize the training data with extreme weight values. The hyperparameter reg controls the strength of this penalty — larger means simpler, smoother decision boundaries.

Why negative log? If the correct class has probability 1.0, −log(1.0) = 0, so there is no loss. If the correct class has probability 0.01, −log(0.01) ≈ 4.6, a large loss. The log converts "the model's confidence in the correct answer" into a penalty that grows smoothly as that confidence falls and is differentiable for every p > 0.

Loss Landscape

How −log(p) penalizes low confidence. The x-axis is the probability assigned to the correct class. As it drops toward 0, the loss rockets up.
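
A few values of the curve, computed directly. Note the entry for p = 1/3: a network that spreads its belief evenly over three classes has a loss of ln 3 ≈ 1.10, which is roughly what to expect from the data loss right after small random initialization.

```python
import numpy as np

# Loss for several probabilities assigned to the correct class
losses = {p: -np.log(p) for p in (0.99, 0.5, 1 / 3, 0.1, 0.01)}

for p, loss in losses.items():
    print(f"p = {p:.3f}  ->  loss = {loss:.3f}")
# p = 0.990  ->  loss = 0.010
# p = 0.500  ->  loss = 0.693
# p = 0.333  ->  loss = 1.099
# p = 0.100  ->  loss = 2.303
# p = 0.010  ->  loss = 4.605
```
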

What does L2 regularization do to the loss function?

Chapter 6: Backward Pass — Computing Gradients

We have a loss. Now we need the gradients — how should each weight change to reduce the loss? This is backpropagation: we work backward from the loss through each operation, applying the chain rule at every step.

Step 1 — Gradient on scores. The gradient of the softmax cross-entropy loss with respect to the scores has a beautifully simple form: it's just the probabilities, with 1 subtracted from the correct class:

python
# Gradient on scores
dscores = probs.copy()                 # (300, 3)
dscores[range(N*K), y] -= 1           # subtract 1 from correct class
dscores /= (N*K)                       # average over batch

Why this gradient is intuitive: If the network predicts [0.7, 0.2, 0.1] and the correct class is 0, then dscores (before the division by the batch size) becomes [0.7−1, 0.2, 0.1] = [−0.3, 0.2, 0.1]. Gradient descent moves each score against its gradient, so the negative entry pushes the correct class score up and the positive entries push the wrong class scores down. The gradient automatically encodes "boost the right answer, suppress the wrong ones."
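
If the "probabilities minus one-hot" formula feels too convenient, it can be checked against finite differences. The sketch below does this for one example with arbitrary scores; the helper name loss_from_scores is an assumption for illustration.

```python
import numpy as np

def loss_from_scores(scores, label):
    """Softmax cross-entropy for one example; returns (loss, probs)."""
    shifted = scores - scores.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[label]), probs

scores = np.array([2.0, 0.5, -1.0])   # arbitrary scores for one point
label = 0                             # index of the correct class

_, probs = loss_from_scores(scores, label)
analytic = probs.copy()
analytic[label] -= 1                  # probs minus one-hot, as in the text

# Central finite differences: nudge each score and watch the loss
eps = 1e-5
numeric = np.zeros_like(scores)
for k in range(scores.size):
    bump = np.zeros_like(scores)
    bump[k] = eps
    loss_plus, _ = loss_from_scores(scores + bump, label)
    loss_minus, _ = loss_from_scores(scores - bump, label)
    numeric[k] = (loss_plus - loss_minus) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```

The entries of the analytic gradient also sum to zero, since the probabilities sum to 1 and exactly 1 is subtracted.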

Step 2 — Backprop into W2 and b2. Since scores = hidden · W2 + b2, the gradients follow from matrix calculus:

python
# Backprop into W2 and b2
dW2 = np.dot(hidden.T, dscores)        # (100, 3)
db2 = np.sum(dscores, axis=0, keepdims=True)  # (1, 3)

Step 3 — Backprop into the hidden layer. We propagate the gradient through W2, then through the ReLU:

python
# Backprop into hidden layer
dhidden = np.dot(dscores, W2.T)        # (300, 100)
dhidden[hidden <= 0] = 0              # ReLU gate: zero grad where input was ≤ 0

The ReLU gate is the key line. For any hidden neuron that was "off" (pre-activation ≤ 0, so its output after ReLU is exactly 0), the gradient is zero: a small change to its input leaves its output at zero, so the loss does not change. For "on" neurons, the gradient passes through unchanged.

Step 4 — Backprop into W1 and b1:

python
# Backprop into W1 and b1
dW1 = np.dot(X.T, dhidden)             # (2, 100)
db1 = np.sum(dhidden, axis=0, keepdims=True)  # (1, 100)

# Add regularization gradient
dW2 += reg * W2
dW1 += reg * W1

Backprop summary (loss at the top, input layer at the bottom):

Loss = −log(p_correct) + regularization
  ↑ dscores = probs − 1_correct
scores = hidden · W2 + b2        dW2 = hidden^T · dscores
  ↑ dhidden = dscores · W2^T
ReLU gate                        dhidden[off] = 0
  ↑ pass through where neuron was on
X · W1 + b1                      dW1 = X^T · dhidden

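
A standard way to gain confidence in hand-written backprop is a gradient check: compare every analytic gradient entry with a finite-difference estimate. The sketch below does this on a tiny random problem. The sizes, the random data, and the function name loss_and_grads are assumptions for illustration; the forward and backward code follows the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, h, K, reg = 6, 2, 5, 3, 1e-3          # a tiny problem keeps the check fast
X = rng.standard_normal((n, D))
y = rng.integers(0, K, size=n)
params = {
    "W1": rng.standard_normal((D, h)), "b1": 0.1 * rng.standard_normal((1, h)),
    "W2": rng.standard_normal((h, K)), "b2": 0.1 * rng.standard_normal((1, K)),
}

def loss_and_grads(p):
    # Forward pass and loss
    hidden = np.maximum(0, X @ p["W1"] + p["b1"])
    scores = hidden @ p["W2"] + p["b2"]
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(n), y]).mean()
    loss += 0.5 * reg * (np.sum(p["W1"] ** 2) + np.sum(p["W2"] ** 2))

    # Backward pass, same steps as in the text
    dscores = probs.copy()
    dscores[np.arange(n), y] -= 1
    dscores /= n
    dhidden = dscores @ p["W2"].T
    dhidden[hidden <= 0] = 0
    grads = {
        "W2": hidden.T @ dscores + reg * p["W2"],
        "b2": dscores.sum(axis=0, keepdims=True),
        "W1": X.T @ dhidden + reg * p["W1"],
        "b1": dhidden.sum(axis=0, keepdims=True),
    }
    return loss, grads

_, grads = loss_and_grads(params)
eps, max_err = 1e-6, 0.0
for name, value in params.items():
    for idx in np.ndindex(value.shape):
        old = value[idx]
        value[idx] = old + eps
        loss_plus, _ = loss_and_grads(params)
        value[idx] = old - eps
        loss_minus, _ = loss_and_grads(params)
        value[idx] = old                      # restore the parameter
        numeric = (loss_plus - loss_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - grads[name][idx]))

print(f"largest analytic-vs-numeric gap: {max_err:.2e}")
```

A tiny gap means the chain-rule code agrees with the definition of the derivative. The check can misfire if a pre-activation sits almost exactly at zero, where ReLU has a kink, but that is unlikely with random values.
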
During backprop through ReLU, what happens to the gradient of hidden neurons that were "off" (output was zero)?

Chapter 7: Live Training (Showcase)

Everything comes together. Forward pass, loss, backward pass, weight update — repeated thousands of times. Watch the network learn the spiral in real time. The decision boundary starts as random noise and gradually sculpts itself to wrap around each spiral arm.
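
For readers following along without the interactive demo, the whole loop can be sketched in NumPy by stitching together the pieces from Chapters 1 through 6 and adding the one step not yet shown in code, the parameter update. The step size of 1.0, regularization of 1e-3, and 10000 steps are borrowed from the PyTorch comparison in Chapter 9; the seeded generator is an addition. The final accuracy depends on the random draw.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D, h = 100, 3, 2, 100

# Spiral data, as in Chapter 1
X = np.zeros((N * K, D))
y = np.zeros(N * K, dtype=int)
for j in range(K):
    ix = slice(N * j, N * (j + 1))
    r = np.linspace(0.0, 1.0, N)
    t = np.linspace(j * 4, (j + 1) * 4, N) + rng.standard_normal(N) * 0.2
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y[ix] = j

# Parameters, as in Chapter 3
W1 = 0.01 * rng.standard_normal((D, h))
b1 = np.zeros((1, h))
W2 = 0.01 * rng.standard_normal((h, K))
b2 = np.zeros((1, K))

step_size, reg, num_steps = 1.0, 1e-3, 10000
rows = np.arange(N * K)
history = []

for i in range(num_steps):
    # Forward pass (Chapter 4)
    hidden = np.maximum(0, X @ W1 + b1)
    scores = hidden @ W2 + b2

    # Loss (Chapter 5)
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    data_loss = -np.log(probs[rows, y]).mean()
    loss = data_loss + 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
    history.append(loss)

    # Backward pass (Chapter 6)
    dscores = probs.copy()
    dscores[rows, y] -= 1
    dscores /= N * K
    dW2 = hidden.T @ dscores + reg * W2
    db2 = dscores.sum(axis=0, keepdims=True)
    dhidden = dscores @ W2.T
    dhidden[hidden <= 0] = 0
    dW1 = X.T @ dhidden + reg * W1
    db1 = dhidden.sum(axis=0, keepdims=True)

    # Parameter update: step each parameter against its gradient
    W1 -= step_size * dW1
    b1 -= step_size * db1
    W2 -= step_size * dW2
    b2 -= step_size * db2

    if i % 1000 == 0:
        print(f"step {i:5d}  loss {loss:.4f}")

hidden = np.maximum(0, X @ W1 + b1)
accuracy = np.mean(np.argmax(hidden @ W2 + b2, axis=1) == y)
print(f"training accuracy: {accuracy:.2f}")
```

Changing step_size or h here mirrors the sliders in the demo: a much smaller step size slows the drop in loss, and a very small h limits how tightly the boundary can follow the arms.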

Controls: Click Step for one gradient descent step, or Auto to train continuously. Adjust the learning rate to see how it affects convergence — too high and it oscillates, too low and it creeps. Change hidden size to see how capacity affects the boundary. Click Reset to start fresh with new random weights.
Live Training: 2-Layer Net on Spiral Data

Left: decision boundary evolving over the data. Right: loss curve. Watch the network learn.

Chapter 8: Why It Works — What the Hidden Layer Learns

The showcase demonstrated that it works. Now let's understand why. The hidden layer transforms the 2D input into a 100-dimensional space. In this new space, the spirals become linearly separable — and the second layer is just a linear classifier in that space.

Each hidden neuron computes w · x + b and then applies ReLU. Geometrically, each neuron draws a line in the 2D input space. Points on one side of the line activate the neuron (positive value); on the other side the neuron outputs zero. With 100 neurons, you get 100 different lines, creating a patchwork of regions. Each region has a unique pattern of which neurons are on and which are off.
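
The patchwork can be counted directly. The sketch below uses a handful of random (untrained, hypothetical) neurons, evaluates their on/off pattern over a dense grid, and counts how many distinct patterns appear. The grid range and the choice of 5 neurons are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
h = 5                                          # a few neurons so the count stays small
W1 = rng.standard_normal((2, h))               # hypothetical weights, not trained
b1 = rng.standard_normal((1, h))

# A dense grid of points covering the square [-3, 3] x [-3, 3]
axis = np.linspace(-3, 3, 301)
gx, gy = np.meshgrid(axis, axis)
grid = np.c_[gx.ravel(), gy.ravel()]           # (90601, 2)

# Neuron k is on wherever its pre-activation is positive: one side of a line
on = (grid @ W1 + b1) > 0                      # (90601, h) boolean on/off pattern

patterns = np.unique(on, axis=0)
max_regions = 1 + h + h * (h - 1) // 2         # most regions h lines can cut the plane into
print(f"{len(patterns)} distinct on/off patterns (at most {max_regions} for {h} lines)")
```

Within each region the network is a plain linear function of x; the curved-looking boundary comes from stitching many such linear pieces together.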

The key insight: The hidden layer doesn't learn the final classification. It learns a coordinate system — a new representation where the data is easy to classify. The first layer does the hard work (untangling spirals); the second layer does the easy work (drawing straight boundaries in the new space).
Hidden Neuron Features

Each tile shows what one hidden neuron "sees" — the region of 2D space where it activates (bright) vs. where it's off (dark). The first layer learns to tile the plane with these half-spaces. Click "Train & Show" to train a small network and visualize its learned features.

This is the deep learning recipe in miniature: learn features, then classify on those features. In a deep network with many layers, each layer transforms the representation slightly, building up from raw pixels to edges to textures to object parts to full objects. Our two-layer net does this in one jump — from raw coordinates to spiral-aware features.

What does the hidden layer learn to do with the spiral data?

Chapter 9: Connections — From Scratch to Frameworks

You've just built a complete neural network from scratch: data generation, forward pass, loss computation, backward pass, and parameter update. Every deep learning framework (PyTorch, JAX, TensorFlow) runs the same steps, just with automatic differentiation and GPU acceleration.

Here's our entire training loop translated to PyTorch, for comparison:

python
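import torch

# Convert the NumPy arrays from Chapter 1 into tensors
X_tensor = torch.from_numpy(X).float()   # (300, 2) float32 inputs
y_tensor = torch.from_numpy(y).long()    # (300,) int64 class labels

# Same architecture
model = torch.nn.Sequential(
    torch.nn.Linear(2, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 3)
)
# weight_decay adds 1e-3 * param to every gradient, biases included
# (the NumPy version regularizes only W1 and W2)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, weight_decay=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()    # softmax + mean negative log-likelihood

for step in range(10000):
    scores = model(X_tensor)                    # forward pass (automatic)
    loss = loss_fn(scores, y_tensor)            # loss (automatic)
    optimizer.zero_grad()                       # clear gradients from the previous step
    loss.backward()                             # backward pass (automatic)
    optimizer.step()                            # parameter update (automatic)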

Four lines replace dozens. But now you know what each of those four lines does under the hood.

What we did | Framework equivalent
np.dot(X, W1) + b1, ReLU, np.dot(hidden, W2) + b2 | model(X)
Softmax + −log + regularization | CrossEntropyLoss + weight_decay
Manual dW1, db1, dW2, db2 via chain rule | loss.backward()
W -= lr * dW | optimizer.step()

Where to go next: You've built the foundation. The jump from here to modern deep learning is scaling: more layers, bigger datasets, cleverer optimizers (Adam, learning rate schedules), and specialized architectures (CNNs for images, Transformers for sequences). But the core loop (forward, loss, backward, update) never changes.

"What I cannot create, I do not understand." — Richard Feynman

In a modern framework like PyTorch, which operation replaces our manual chain-rule gradient computation?