Neural Network Case Study — Building a Net from Scratch

Chapter 0: Why Build One from Scratch?

You've learned what neurons are. You've seen activation functions and layer diagrams. You know about backpropagation in theory. But do you really understand how a neural network learns? There's only one way to find out: build one yourself, from raw numbers, and watch it train.

We need a dataset that's simple enough to visualize but hard enough that a linear classifier fails completely. Enter the spiral dataset: three classes of points wound into interlocking spirals. No straight line — not even three straight lines — can separate them.

The plan: We'll generate spiral data, watch a linear classifier fail on it, then build a 2-layer neural net from scratch — forward pass, loss, backprop, weight update — all in plain code. By the end, you'll watch the network carve curved decision boundaries through the spirals in real time.

The Spiral Dataset

Three classes spiraling outward. Try to imagine drawing straight lines to separate them. You can't. This is the problem we'll solve.

This is the same "minimal neural network case study" from Stanford's cs231n. We'll follow the same structure: generate data, fail with a linear model, succeed with a two-layer net. Every line of code will be explained. No magic.

Why is the spiral dataset a good test for neural networks?

It has too many data points for k-NN
The classes are interleaved so no straight decision boundary can separate them
The data is high-dimensional

Chapter 1: Generating Data — The Spiral

Let's build the dataset. We want K = 3 classes, each with N = 100 points, living in 2D space. Each class spirals outward from the origin at a different angle:

python
import numpy as np

N = 100   # points per class
K = 3     # number of classes
D = 2     # dimensionality (x, y)
X = np.zeros((N*K, D))
y = np.zeros(N*K, dtype='int')

for j in range(K):
    ix = range(N*j, N*(j+1))
    r = np.linspace(0.0, 1, N)   # radius grows from 0 to 1
    t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2  # angle + noise
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j

Each class gets its own angular range. The r variable grows linearly from 0 to 1, so points start near the origin and spiral outward. A small amount of Gaussian noise (0.2) makes the spirals fuzzy — realistic, not perfectly clean.

Why spirals? They're a minimal 2D dataset that no linear classifier can fit well. A linear classifier gives each class a convex region bounded by straight lines, and the arms wrap around one another, so a convex region big enough to hold one curved arm also swallows stretches of the others. You need a model that can learn curved boundaries.

The result: a matrix X of shape (300, 2) — 300 points, each with an x and y coordinate — and a label vector y of shape (300,) with values 0, 1, or 2.
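
As a quick check of the shapes just described, the generator can be wrapped in a function. This is a minimal sketch: the helper name `make_spiral`, its `noise` and `seed` parameters, and the seeded `np.random.default_rng` generator are assumptions added here for repeatability, not part of the listing above.

```python
import numpy as np

def make_spiral(N=100, K=3, noise=0.2, seed=0):
    """Hypothetical helper: K spiral arms with N points each, in 2D."""
    rng = np.random.default_rng(seed)     # seeded so the result is repeatable
    X = np.zeros((N * K, 2))
    y = np.zeros(N * K, dtype=int)
    for j in range(K):
        ix = slice(N * j, N * (j + 1))    # rows belonging to class j
        r = np.linspace(0.0, 1.0, N)      # radius grows from 0 to 1
        t = np.linspace(j * 4, (j + 1) * 4, N) + rng.standard_normal(N) * noise
        X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
        y[ix] = j
    return X, y

X, y = make_spiral()
print(X.shape, y.shape)   # (300, 2) (300,)
print(np.bincount(y))     # [100 100 100]
```

Passing a larger `noise` value here plays the same role as the noise slider described in the widget caption.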

Spiral Data Generator

Adjust noise to see how it affects separability. More noise = harder problem. Points per class: 100.

In the spiral data generation, what does the variable r control?

The class label for each point
The distance from the origin: how far out each point sits
The amount of noise added

Chapter 2: The Linear Classifier — Watch It Fail

Before we build a neural network, let's see what happens when we try a plain linear classifier: scores = W · x + b, trained with softmax cross-entropy loss. This is the approach from the linear classification lesson, applied directly to our spiral.

python
# Initialize weights
W = 0.01 * np.random.randn(D, K)   # 2x3
b = np.zeros((1, K))               # 1x3

# Compute scores
scores = np.dot(X, W) + b   # 300x3

The scores matrix has shape (300, 3) — one score per class for each of the 300 points. We convert scores to probabilities via softmax, compute the cross-entropy loss, and update W and b with gradient descent. After training, the linear classifier converges to about 49% accuracy. On a 3-class problem, random chance is 33%. So the linear model barely beats guessing.

Why it fails: A linear classifier can only draw straight decision boundaries. For 3 classes, it draws three lines meeting at a point — dividing the plane into three wedges. Spirals can't be separated by wedges. The model is fundamentally incapable, no matter how long you train it.
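
The training procedure described above (softmax, cross-entropy, gradient descent on W and b) can be sketched end to end as follows. The step size of 1.0, the regularization strength of 1e-3, the 200 iterations, and the fixed seed are assumptions chosen for illustration; the exact accuracy you see depends on the random noise in the data.

```python
import numpy as np

rng = np.random.default_rng(0)           # assumed seed, for repeatability
N, K, D = 100, 3, 2

# Spiral data, generated as in Chapter 1
X = np.zeros((N * K, D))
y = np.zeros(N * K, dtype=int)
for j in range(K):
    ix = slice(N * j, N * (j + 1))
    r = np.linspace(0.0, 1.0, N)
    t = np.linspace(j * 4, (j + 1) * 4, N) + rng.standard_normal(N) * 0.2
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y[ix] = j

W = 0.01 * rng.standard_normal((D, K))
b = np.zeros((1, K))
step_size, reg = 1.0, 1e-3               # assumed hyperparameters
rows = np.arange(N * K)
losses = []

for i in range(200):
    # Forward: scores -> softmax probabilities
    scores = X @ W + b
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

    # Cross-entropy loss plus L2 penalty on W
    loss = -np.log(probs[rows, y]).mean() + 0.5 * reg * np.sum(W * W)
    losses.append(loss)

    # Backward: gradient of the loss with respect to scores, then W and b
    dscores = probs.copy()
    dscores[rows, y] -= 1
    dscores /= N * K
    dW = X.T @ dscores + reg * W
    db = dscores.sum(axis=0, keepdims=True)

    # Gradient descent update
    W -= step_size * dW
    b -= step_size * db

accuracy = np.mean(np.argmax(X @ W + b, axis=1) == y)
print(f"final loss {losses[-1]:.3f}, training accuracy {accuracy:.2f}")
```

The loss falls for a while and then flattens out well above zero, which is the numerical signature of a model that has run out of capacity rather than out of training time.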

Linear Classifier on Spirals

The colored regions show the linear classifier's decision boundaries after training. Notice: only straight lines. Many points land in the wrong region. Click "Train" to run 200 steps of gradient descent.

The straight boundary lines are the best the linear model can do. It captures the gross direction of each class but misclassifies everything near the spiral arms. We need something that can bend.

A well-trained linear classifier on the spiral dataset achieves roughly what accuracy?

95%: almost perfect
~49%: barely better than random
33%: exactly random chance

Chapter 3: The Score Function — Adding a Hidden Layer

The fix is beautifully simple. Instead of mapping inputs directly to class scores with one matrix, we add a hidden layer in between. The data passes through a first linear transformation, then a nonlinearity (ReLU), then a second linear transformation:

f = W₂ · max(0, W₁ · x + b₁) + b₂

Let's unpack this. W₁ has shape (D, h), where D = 2 (our input dimension) and h is the number of hidden neurons, say 100. The code stores each point as a row vector, so the product is computed as x · W₁ (np.dot(X, W1)); with b₁ added, that gives a vector of h numbers. Then max(0, ·) is the ReLU activation: it zeros out negatives and keeps positives. Finally, W₂ has shape (h, K) and maps the h hidden features to K = 3 class scores.

Think of it this way: The first layer (W₁) learns to transform the raw 2D coordinates into a new h-dimensional space where the spirals are linearly separable. The second layer (W₂) is just a linear classifier in that new space. ReLU between them ensures the transformation is nonlinear. Without it, the two matrix multiplies collapse into one (the product of W₁ and W₂ is itself a single matrix mapping D inputs to K scores), and we're back to a linear model.
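
The collapse is easy to confirm numerically. The sketch below uses random matrices (the shapes match this lesson, but the values and the seed are arbitrary assumptions) and the row-vector convention from the code, where a point is multiplied on the left.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 2))          # 5 sample points in 2D
W1 = rng.standard_normal((2, 100))
b1 = rng.standard_normal((1, 100))
W2 = rng.standard_normal((100, 3))
b2 = rng.standard_normal((1, 3))

# Two stacked linear layers with no activation in between
two_layer = (x @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer
W = W1 @ W2                              # shape (2, 3)
b = b1 @ W2 + b2                         # shape (1, 3)
one_layer = x @ W + b

print(np.allclose(two_layer, one_layer))   # True

# With ReLU in between, no single (W, b) pair reproduces the output
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))   # False
```

Note that the combined matrix W is only 2 × 3: stacking linear layers adds parameters without adding any expressive power.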

python
# Initialize parameters
h = 100  # hidden layer size
W1 = 0.01 * np.random.randn(D, h)    # 2 x 100
b1 = np.zeros((1, h))                # 1 x 100
W2 = 0.01 * np.random.randn(h, K)    # 100 x 3
b2 = np.zeros((1, K))                # 1 x 3

We initialize weights with small random numbers (not zeros — that would make all neurons identical) and biases at zero. The hidden size h is a hyperparameter — we choose it. More hidden neurons = more capacity = more complex boundaries, but also more parameters to train.

Network Architecture

The two-layer network: 2 inputs → h hidden neurons with ReLU → 3 class scores. Adjust h to see the network grow.

Why is the ReLU activation essential between the two layers?

Without it, two linear layers collapse into one linear operation: no nonlinearity, no curved boundaries
ReLU makes the network train faster
ReLU prevents overfitting

Chapter 4: Forward Pass — Computing Scores

The forward pass pushes input data through the network to produce class scores. It's three lines of NumPy:

python
# Forward pass
hidden = np.dot(X, W1) + b1      # (300, 100) raw hidden activations
hidden = np.maximum(0, hidden)   # (300, 100) ReLU: zero out negatives
scores = np.dot(hidden, W2) + b2  # (300, 3) class scores

Let's trace through a single point. Say x = [0.3, -0.5] (a 2D coordinate in our spiral).

Step 1 — Hidden pre-activations: Multiply x by W₁ (a 2×h matrix) and add b₁. This gives h numbers — one per hidden neuron. Each number is a weighted combination of the two input coordinates.

Step 2 — ReLU: Any negative values become zero. Positive values pass through unchanged. This is the nonlinearity that gives the network its power. After ReLU, some neurons are "off" (zero) and some are "on" (positive). The pattern of which neurons fire is what encodes useful features.

Step 3 — Output scores: Multiply the h-dimensional hidden vector by W₂ (h×3) and add b₂. This produces 3 scores: one per class. The highest score is the predicted class.

Dimensions check: X is (300, 2). W₁ is (2, 100). So X · W₁ is (300, 100). After ReLU, still (300, 100). W₂ is (100, 3). So hidden · W₂ is (300, 3). Three scores per point, exactly what softmax needs.
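
To make the three steps concrete, here is the same trace for the single point x = [0.3, -0.5]. It is a sketch with assumptions: the hidden layer is shrunk to h = 4 so the numbers are readable, and the weights are random draws from a seeded generator rather than trained values.

```python
import numpy as np

rng = np.random.default_rng(0)           # assumed seed
h = 4                                    # tiny hidden layer, for readability only
W1 = rng.standard_normal((2, h))
b1 = np.zeros((1, h))
W2 = rng.standard_normal((h, 3))
b2 = np.zeros((1, 3))

x = np.array([[0.3, -0.5]])              # one point as a row vector, shape (1, 2)

pre = x @ W1 + b1                        # Step 1: pre-activations, shape (1, 4)
hidden = np.maximum(0, pre)              # Step 2: ReLU zeroes the negatives
scores = hidden @ W2 + b2                # Step 3: class scores, shape (1, 3)

print("pre-activations:", pre.round(3))
print("after ReLU:     ", hidden.round(3))
print("scores:         ", scores.round(3))
print("predicted class:", int(np.argmax(scores)))
```

Comparing the first two printed rows shows exactly which neurons are "off" for this input: every negative entry in the first row becomes 0 in the second.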

Forward Pass Trace

Watch a single input flow through the network. Each hidden neuron computes a weighted sum, then ReLU kills negatives. The survivors combine into three class scores.

After the ReLU step, what happens to hidden neurons with negative pre-activation values?

They become exactly zero: they're "off" for this input
They flip to positive
They stay negative but get scaled down

Chapter 5: Computing the Loss — How Wrong Are We?

We have scores. Now we need a single number that measures how bad those scores are — the loss. We use softmax cross-entropy, the same loss from linear classification, applied to our neural net's output.

Step 1 — Softmax: Convert raw scores into probabilities. For numerical stability, subtract the max score first:

python
# Softmax: scores -> probabilities
exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # (300, 3)

Now probs[i] is a probability distribution over 3 classes for the i-th point. All values are between 0 and 1, and each row sums to 1.
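
Why subtract the max? Softmax depends only on the differences between scores, so shifting every score in a row by the same amount leaves the probabilities unchanged while keeping the exponentials from overflowing. A small sketch (the function names `softmax_naive` and `softmax_stable` are made up for this illustration):

```python
import numpy as np

def softmax_naive(scores):
    e = np.exp(scores)                   # overflows for large scores
    return e / e.sum(axis=1, keepdims=True)

def softmax_stable(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])   # same gaps, shifted by 999

# For moderate scores, both versions agree
print(np.allclose(softmax_naive(small), softmax_stable(small)))    # True

# The stable version gives the same answer for the shifted scores
print(np.allclose(softmax_stable(small), softmax_stable(large)))   # True

# The naive version overflows: exp(1000) is inf, and inf / inf is nan
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)
print(np.isnan(naive_large).all())                                 # True
```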

Step 2 — Cross-entropy loss: For each point, look up the probability assigned to the correct class, take the negative log, and average over all points:

python
# Cross-entropy loss
correct_logprobs = -np.log(probs[range(N*K), y])  # (300,)
data_loss = np.sum(correct_logprobs) / (N*K)

# L2 regularization
reg = 1e-3
reg_loss = 0.5 * reg * (np.sum(W1*W1) + np.sum(W2*W2))
loss = data_loss + reg_loss

The regularization term penalizes large weights. Without it, the network could memorize the training data with extreme weight values. The hyperparameter reg controls the strength of this penalty — larger means simpler, smoother decision boundaries.

Why negative log? If the correct class has probability 1.0, −log(1.0) = 0 — no loss. If the correct class has probability 0.01, −log(0.01) = 4.6 — huge loss. The log function converts "the model's confidence in the correct answer" into a smoothly increasing penalty that's differentiable everywhere.
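
A few values make the shape of the penalty concrete. The probabilities below are arbitrary picks for illustration, apart from 1/3, which is what a 3-class model outputs when it has no preference; that case gives log(3) ≈ 1.10, the loss you should expect to see at the start of training with tiny random weights.

```python
import numpy as np

# Probability assigned to the correct class -> cross-entropy loss for that point
probabilities = [1.0, 0.9, 0.5, 1 / 3, 0.1, 0.01]
point_losses = {p: float(np.log(1.0 / p)) for p in probabilities}   # log(1/p) = -log(p)

for p, loss in point_losses.items():
    print(f"p = {p:.3f}  ->  loss = {loss:.3f}")
```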

Loss Landscape

How −log(p) penalizes low confidence. The x-axis is the probability assigned to the correct class. As it drops toward 0, the loss rockets up.

What does L2 regularization do to the loss function?

It only affects the backward pass, not the loss
It makes the network train faster by reducing the number of weights
It adds a penalty proportional to the squared weight magnitudes, discouraging extreme values

Chapter 6: Backward Pass — Computing Gradients

We have a loss. Now we need the gradients — how should each weight change to reduce the loss? This is backpropagation: we work backward from the loss through each operation, applying the chain rule at every step.

Step 1 — Gradient on scores. The gradient of the softmax cross-entropy loss with respect to the scores has a beautifully simple form: it's just the probabilities, with 1 subtracted from the correct class:

python
# Gradient on scores
dscores = probs.copy()                 # (300, 3)
dscores[range(N*K), y] -= 1           # subtract 1 from correct class
dscores /= (N*K)                       # average over batch

Why this gradient is intuitive: If the network predicts [0.7, 0.2, 0.1] and the correct class is 0, then dscores becomes [0.7−1, 0.2, 0.1] = [−0.3, 0.2, 0.1]. The negative value pushes the correct class score up, and the positive values push the wrong classes down. The gradient automatically encodes "boost the right answer, suppress the wrong ones."
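
If you'd rather not take the formula on faith, it can be checked against finite differences. The sketch below is an illustration on a made-up batch of 4 points with random scores; the helper name `ce_loss` and the step `eps` are assumptions, not part of the lesson's code.

```python
import numpy as np

def ce_loss(scores, y):
    """Mean softmax cross-entropy over the batch (hypothetical helper)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 3))     # 4 points, 3 classes
y = np.array([0, 2, 1, 0])

# Analytic gradient: probabilities, minus 1 at the correct class, averaged
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
dscores = probs.copy()
dscores[np.arange(4), y] -= 1
dscores /= 4

# Numerical gradient: nudge each score up and down, watch the loss change
eps = 1e-5
numeric = np.zeros_like(scores)
for i in range(scores.shape[0]):
    for k in range(scores.shape[1]):
        up = scores.copy()
        up[i, k] += eps
        down = scores.copy()
        down[i, k] -= eps
        numeric[i, k] = (ce_loss(up, y) - ce_loss(down, y)) / (2 * eps)

print("max difference:", np.max(np.abs(numeric - dscores)))
```

The two gradients should agree to many decimal places. Each row of dscores also sums to zero, which reflects the fact that raising all scores of a point by the same amount doesn't change its probabilities.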

Step 2 — Backprop into W₂ and b₂. Since scores = hidden · W₂ + b₂, the gradients follow from matrix calculus:

python
# Backprop into W2 and b2
dW2 = np.dot(hidden.T, dscores)        # (100, 3)
db2 = np.sum(dscores, axis=0, keepdims=True)  # (1, 3)

Step 3 — Backprop into the hidden layer. We propagate the gradient through W₂, then through the ReLU:

python
# Backprop into hidden layer
dhidden = np.dot(dscores, W2.T)        # (300, 100)
dhidden[hidden <= 0] = 0              # ReLU gate: zero grad where input was ≤ 0

The ReLU gate is the key line. For any hidden neuron that was "off" (its pre-activation was ≤ 0, so its output after ReLU was 0), the gradient is zero: nudging its input slightly doesn't change its output, so it can't change the loss. For "on" neurons, the gradient passes through unchanged.

Step 4 — Backprop into W₁ and b₁:

python
# Backprop into W1 and b1
dW1 = np.dot(X.T, dhidden)             # (2, 100)
db1 = np.sum(dhidden, axis=0, keepdims=True)  # (1, 100)

# Add regularization gradient
dW2 += reg * W2
dW1 += reg * W1
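
Hand-derived gradients are easy to get subtly wrong, so it is worth checking all four steps together against finite differences. The sketch below wraps the forward and backward pass in one function and compares every gradient entry numerically. The function name `loss_and_grads`, the tiny random problem (10 points, 5 hidden neurons), and the step `eps` are assumptions made for this illustration.

```python
import numpy as np

def loss_and_grads(W1, b1, W2, b2, X, y, reg):
    """Forward and backward pass of the two-layer net (hypothetical helper)."""
    n = X.shape[0]
    hidden = np.maximum(0, X @ W1 + b1)
    scores = hidden @ W2 + b2
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(n), y]).mean()
    loss += 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))

    dscores = probs.copy()
    dscores[np.arange(n), y] -= 1
    dscores /= n
    dW2 = hidden.T @ dscores + reg * W2
    db2 = dscores.sum(axis=0, keepdims=True)
    dhidden = dscores @ W2.T
    dhidden[hidden <= 0] = 0             # ReLU gate
    dW1 = X.T @ dhidden + reg * W1
    db1 = dhidden.sum(axis=0, keepdims=True)
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))         # small made-up batch
y = rng.integers(0, 3, size=10)
params = [rng.standard_normal((2, 5)), rng.standard_normal((1, 5)),
          rng.standard_normal((5, 3)), rng.standard_normal((1, 3))]
reg = 1e-3

_, analytic = loss_and_grads(*params, X, y, reg)

# Central differences: perturb one parameter entry at a time, in place
eps = 1e-5
max_err = 0.0
for p, g in zip(params, analytic):
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        loss_plus, _ = loss_and_grads(*params, X, y, reg)
        p[idx] = old - eps
        loss_minus, _ = loss_and_grads(*params, X, y, reg)
        p[idx] = old                     # restore the original value
        numeric = (loss_plus - loss_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - g[idx]))

print(f"max abs difference: {max_err:.2e}")
```

One caveat: ReLU has a kink at zero, so if a pre-activation happens to sit within eps of zero the numerical estimate can disagree with the analytic one even when the code is right. With random data this is rare.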

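Backprop flow (forward pass reads bottom to top; each ↑ line is the gradient handed back across that step):

Loss = −log(p_correct) + regularization
↑ dscores = probs − 1_correct
scores = hidden · W₂ + b₂, giving dW₂ = hidden^T · dscores
↑ dhidden = dscores · W₂^T
ReLU gate: dhidden[off] = 0
↑ pass through where neuron was on
X · W₁ + b₁, giving dW₁ = X^T · dhidden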

During backprop through ReLU, what happens to the gradient of hidden neurons that were "off" (output was zero)?

The gradient is set to zero: "off" neurons don't contribute, so they receive no gradient signal
The gradient is doubled to compensate
The gradient passes through unchanged

Chapter 7: SHOWCASE — Live Training

Everything comes together. Forward pass, loss, backward pass, weight update — repeated thousands of times. Watch the network learn the spiral in real time. The decision boundary starts as random noise and gradually sculpts itself to wrap around each spiral arm.
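
The one step not yet shown in code is the weight update: subtract the step size times the gradient from each parameter. Here is a sketch of the whole loop in NumPy. It assumes a step size of 1.0, regularization of 1e-3, and 10,000 iterations (the same values the PyTorch listing in Chapter 9 uses), plus a fixed seed; none of these are tuned, and the final accuracy will vary a little with the random data.

```python
import numpy as np

rng = np.random.default_rng(0)           # assumed seed, for repeatability
N, K, D, h = 100, 3, 2, 100

# Spiral data, generated as in Chapter 1
X = np.zeros((N * K, D))
y = np.zeros(N * K, dtype=int)
for j in range(K):
    ix = slice(N * j, N * (j + 1))
    r = np.linspace(0.0, 1.0, N)
    t = np.linspace(j * 4, (j + 1) * 4, N) + rng.standard_normal(N) * 0.2
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y[ix] = j

# Parameters, initialized as in Chapter 3
W1 = 0.01 * rng.standard_normal((D, h))
b1 = np.zeros((1, h))
W2 = 0.01 * rng.standard_normal((h, K))
b2 = np.zeros((1, K))

step_size, reg = 1.0, 1e-3               # assumed hyperparameters
n = N * K
rows = np.arange(n)
losses = []

for i in range(10000):
    # Forward pass (Chapter 4)
    hidden = np.maximum(0, X @ W1 + b1)
    scores = hidden @ W2 + b2

    # Loss (Chapter 5)
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    data_loss = -np.log(probs[rows, y]).mean()
    reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
    losses.append(data_loss + reg_loss)
    if i % 1000 == 0:
        print(f"step {i}: loss {losses[-1]:.4f}")

    # Backward pass (Chapter 6)
    dscores = probs.copy()
    dscores[rows, y] -= 1
    dscores /= n
    dW2 = hidden.T @ dscores + reg * W2
    db2 = dscores.sum(axis=0, keepdims=True)
    dhidden = dscores @ W2.T
    dhidden[hidden <= 0] = 0
    dW1 = X.T @ dhidden + reg * W1
    db1 = dhidden.sum(axis=0, keepdims=True)

    # Parameter update: step against the gradient
    W1 -= step_size * dW1
    b1 -= step_size * db1
    W2 -= step_size * dW2
    b2 -= step_size * db2

hidden = np.maximum(0, X @ W1 + b1)
accuracy = np.mean(np.argmax(hidden @ W2 + b2, axis=1) == y)
print(f"training accuracy: {accuracy:.2f}")
```

Changing `h` or `step_size` here mirrors the experiments suggested for the interactive demo.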

Controls: Click Step for one gradient descent step, or Auto to train continuously. Adjust the learning rate to see how it affects convergence — too high and it oscillates, too low and it creeps. Change hidden size to see how capacity affects the boundary. Click Reset to start fresh with new random weights.

Live Training: 2-Layer Net on Spiral Data

Left: decision boundary evolving over the data. Right: loss curve. Watch the network learn.

Things to try:

Set hidden neurons to 10. The boundary gets chunky — not enough neurons to carve the fine curves. Accuracy tops out around 80-85%.
Set hidden neurons to 200. The boundary becomes silky smooth. The network has more capacity than it needs.
Crank the learning rate to 3.0. Watch the loss explode — the steps are too big and overshoot the minimum.
Drop the learning rate to 0.01. Training still works, just painfully slow.

Chapter 8: Why It Works — What the Hidden Layer Learns

The showcase demonstrated that it works. Now let's understand why. The hidden layer transforms the 2D input into a 100-dimensional space. In this new space, the spirals become linearly separable — and the second layer is just a linear classifier in that space.

Each hidden neuron computes w · x + b and then applies ReLU. Geometrically, each neuron draws a line in the 2D input space. Points on one side of the line activate the neuron (positive value); points on the other side are zero. With 100 neurons, you get 100 different lines, creating a patchwork of regions. Each region has a unique pattern of which neurons are on and which are off.
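
The "patchwork of regions" can be counted directly. The sketch below uses a deliberately tiny layer of 5 neurons with random (untrained) weights, evaluates which neurons are on at each point of a dense grid, and counts the distinct on/off patterns. The grid resolution, the bias scale, and the seed are arbitrary assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)           # assumed seed
h = 5                                    # small on purpose, so the count is easy to reason about
W1 = rng.standard_normal((2, h))
b1 = 0.3 * rng.standard_normal((1, h))

# Dense grid over the square [-1, 1] x [-1, 1]
xs = np.linspace(-1, 1, 200)
gx, gy = np.meshgrid(xs, xs)
grid = np.c_[gx.ravel(), gy.ravel()]     # shape (40000, 2)

# On/off pattern of the h neurons at every grid point
on = ((grid @ W1 + b1) > 0).astype(np.uint8)
patterns = np.unique(on, axis=0)         # distinct patterns seen on the grid

# h lines in general position cut the plane into at most 1 + h + h(h-1)/2 regions
max_regions = 1 + h + h * (h - 1) // 2
print(len(patterns), "distinct patterns; at most", max_regions, "possible")
```

The upper bound grows quadratically with the number of neurons, which is why 100 neurons give the second layer such a fine-grained set of regions to work with.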

The key insight: The hidden layer doesn't learn the final classification. It learns a coordinate system — a new representation where the data is easy to classify. The first layer does the hard work (untangling spirals); the second layer does the easy work (drawing straight boundaries in the new space).

Hidden Neuron Features

Each tile shows what one hidden neuron "sees" — the region of 2D space where it activates (bright) vs. where it's off (dark). The first layer learns to tile the plane with these half-spaces. Click "Train & Show" to train a small network and visualize its learned features.

This is the deep learning recipe in miniature: learn features, then classify on those features. In a deep network with many layers, each layer transforms the representation slightly, building up from raw pixels to edges to textures to object parts to full objects. Our two-layer net does this in one jump — from raw coordinates to spiral-aware features.

What does the hidden layer learn to do with the spiral data?

It transforms the data into a new coordinate system where the spirals become linearly separable
It memorizes every training point individually
It directly outputs the class labels

Chapter 9: Connections — From Scratch to Frameworks

You've just built a complete neural network from scratch: data generation, forward pass, loss computation, backward pass, and parameter update. Every deep learning framework — PyTorch, JAX, TensorFlow — does exactly what we did, just with automatic differentiation and GPU acceleration.

Here's our entire training loop translated to PyTorch, for comparison:

python
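import torch

# Convert the NumPy arrays from the earlier chapters to tensors
X_tensor = torch.from_numpy(X).float()   # (300, 2) float32 inputs
y_tensor = torch.from_numpy(y).long()    # (300,) int64 class labels
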
# Same architecture
model = torch.nn.Sequential(
    torch.nn.Linear(2, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 3)
)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, weight_decay=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(10000):
    scores = model(X_tensor)                    # forward pass (automatic)
    loss = loss_fn(scores, y_tensor)            # loss (automatic)
    optimizer.zero_grad()
    loss.backward()                              # backward pass (automatic)
    optimizer.step()                             # parameter update (automatic)

Four lines replace dozens. But now you know what each of those four lines does under the hood.

What we did	Framework equivalent
np.dot(X, W1) + b1, ReLU, np.dot(hidden, W2) + b2	`model(X)`
Softmax + −log + regularization	`CrossEntropyLoss + weight_decay`
Manual dW1, db1, dW2, db2 via chain rule	`loss.backward()`
W -= lr * dW	`optimizer.step()`

Where to go next: You've built the foundation. The jump from here to modern deep learning is scaling — more layers, bigger datasets, cleverer optimizers (Adam, learning rate schedules), and specialized architectures (CNNs for images, Transformers for sequences). But the core loop — forward, loss, backward, update — never changes.

Related lessons:

Neural Networks Part 1 — architecture, activations, and the universal approximation theorem
Neural Networks Part 2 — data preprocessing, weight initialization, batch normalization
Optimization & Backpropagation — gradient descent, learning rate, momentum
Linear Classification — the linear model we extended here

"What I cannot create, I do not understand." — Richard Feynman

In a modern framework like PyTorch, which operation replaces our manual chain-rule gradient computation?

model.forward()
loss.backward(): automatic differentiation computes all gradients
optimizer.step()
