Build a complete 2-layer neural network in plain NumPy (forward pass, loss, backprop, parameter update) and watch it learn a spiral.
You've learned what neurons are. You've seen activation functions and layer diagrams. You know about backpropagation in theory. But do you really understand how a neural network learns? There's only one way to find out: build one yourself, from raw numbers, and watch it train.
We need a dataset that's simple enough to visualize but hard enough that a linear classifier fails completely. Enter the spiral dataset: three classes of points wound into interlocking spirals. No straight line — not even three straight lines — can separate them.
Three classes spiraling outward. Try to imagine drawing straight lines to separate them. You can't. This is the problem we'll solve.
This is the same "minimal neural network case study" from Stanford's cs231n. We'll follow the same structure: generate data, fail with a linear model, succeed with a two-layer net. Every line of code will be explained. No magic.
Let's build the dataset. We want K = 3 classes, each with N = 100 points, living in 2D space. Each class spirals outward from the origin at a different angle:
```python
import numpy as np

N = 100  # points per class
K = 3    # number of classes
D = 2    # dimensionality (x, y)

X = np.zeros((N*K, D))
y = np.zeros(N*K, dtype='int')

for j in range(K):
    ix = range(N*j, N*(j+1))
    r = np.linspace(0.0, 1, N)  # radius grows from 0 to 1
    t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2  # angle + noise
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j
```
Each class gets its own angular range: class j sweeps the angle t from 4j to 4(j+1) radians. The r variable grows linearly from 0 to 1, so points start near the origin and spiral outward. Gaussian noise with standard deviation 0.2 radians is added to each angle, which makes the spirals fuzzy: realistic, not perfectly clean.
The result: a matrix X of shape (300, 2) — 300 points, each with an x and y coordinate — and a label vector y of shape (300,) with values 0, 1, or 2.
Adjust noise to see how it affects separability. More noise = harder problem.
Before we build a neural network, let's see what happens when we try a plain linear classifier: scores = W · x + b, trained with softmax cross-entropy loss. This is the approach from the linear classification lesson, applied directly to our spiral.
```python
# Initialize weights
W = 0.01 * np.random.randn(D, K)  # 2x3
b = np.zeros((1, K))              # 1x3

# Compute scores
scores = np.dot(X, W) + b         # 300x3
```
The scores matrix has shape (300, 3): one score per class for each of the 300 points. We convert scores to probabilities via softmax, compute the cross-entropy loss, and update W and b with gradient descent. After training, the linear classifier converges to about 49% accuracy. On a 3-class problem, random chance is 33%, so the linear model beats guessing but still mislabels roughly half the points.
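To make that procedure concrete, here is a minimal sketch of the linear training loop. It regenerates the spiral with a fixed seed so it runs on its own. The seed, `step_size = 1.0`, and `reg = 1e-3` are assumptions chosen for illustration, and the exact accuracy will shift a little with the random noise.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability

# Spiral data, same recipe as the dataset code
N, K, D = 100, 3, 2
X = np.zeros((N * K, D))
y = np.zeros(N * K, dtype=int)
for j in range(K):
    ix = np.arange(N * j, N * (j + 1))
    r = np.linspace(0.0, 1.0, N)
    t = np.linspace(j * 4, (j + 1) * 4, N) + rng.standard_normal(N) * 0.2
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y[ix] = j

W = 0.01 * rng.standard_normal((D, K))
b = np.zeros((1, K))
step_size, reg = 1.0, 1e-3  # assumed hyperparameters
num = X.shape[0]

losses = []
for i in range(200):
    # Forward: scores -> softmax probabilities -> loss
    scores = X @ W + b
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    data_loss = -np.log(probs[np.arange(num), y]).mean()
    losses.append(data_loss + 0.5 * reg * np.sum(W * W))

    # Backward: gradient of the loss w.r.t. scores, then W and b
    dscores = probs.copy()
    dscores[np.arange(num), y] -= 1
    dscores /= num
    dW = X.T @ dscores + reg * W
    db = dscores.sum(axis=0, keepdims=True)

    # Gradient descent update
    W -= step_size * dW
    b -= step_size * db

linear_acc = np.mean(np.argmax(X @ W + b, axis=1) == y)
print(f"final loss {losses[-1]:.3f}, training accuracy {linear_acc:.2f}")
```

The loss should fall from its starting value and then flatten out well above zero, which is the signature of a model that has run out of capacity rather than out of training time.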
The colored regions show the linear classifier's decision boundaries after training. Notice: only straight lines. Many points land in the wrong region. Click "Train" to run 200 steps of gradient descent.
The straight boundary lines are the best the linear model can do. It captures the gross direction of each class but misclassifies many points wherever a spiral arm curls across a boundary. We need something that can bend.
The fix is beautifully simple. Instead of mapping inputs directly to class scores with one matrix, we add a hidden layer in between. The data passes through a first linear transformation, then a nonlinearity (ReLU), then a second linear transformation:
scores = W2 · max(0, W1 · x + b1) + b2
Let's unpack this. W1 has shape (D, h) where D = 2 (our input dimension) and h is the number of hidden neurons — say 100. So W1 · x + b1 gives us a vector of h numbers. Then max(0, ·) is the ReLU activation: it zeros out negatives and keeps positives. Finally, W2 has shape (h, K) — it maps the h hidden features to K = 3 class scores.
```python
# Initialize parameters
h = 100                            # hidden layer size
W1 = 0.01 * np.random.randn(D, h)  # 2 x 100
b1 = np.zeros((1, h))              # 1 x 100
W2 = 0.01 * np.random.randn(h, K)  # 100 x 3
b2 = np.zeros((1, K))              # 1 x 3
```
We initialize weights with small random numbers (not zeros — that would make all neurons identical) and biases at zero. The hidden size h is a hyperparameter — we choose it. More hidden neurons = more capacity = more complex boundaries, but also more parameters to train.
The two-layer network: 2 inputs → h hidden neurons with ReLU → 3 class scores. Adjust h to see the network grow.
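To see how quickly the network grows with h, the parameter count can be worked out directly from the four shapes. The helper name `count_params` below is hypothetical, introduced only for this sketch.

```python
def count_params(D, h, K):
    """Total trainable numbers in a D -> h -> K two-layer net."""
    # W1 is (D x h), b1 has h entries, W2 is (h x K), b2 has K entries
    return D * h + h + h * K + K

for h in (10, 100, 1000):
    print(f"h = {h:4d}  ->  {count_params(2, h, 3)} parameters")
```

For our setting (D = 2, K = 3) the count is 6h + 3, so h = 100 gives 603 parameters: growth is linear in the hidden size.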
The forward pass pushes input data through the network to produce class scores. It's three lines of NumPy:
```python
# Forward pass
hidden = np.dot(X, W1) + b1          # (300, 100) raw hidden activations
hidden = np.maximum(0, hidden)       # (300, 100) ReLU: zero out negatives
scores = np.dot(hidden, W2) + b2     # (300, 3) class scores
```
Let's trace through a single point. Say x = [0.3, -0.5] (a 2D coordinate in our spiral).
Step 1 — Hidden pre-activations: Multiply x by W1 (a 2×h matrix) and add b1. This gives h numbers — one per hidden neuron. Each number is a weighted combination of the two input coordinates.
Step 2 — ReLU: Any negative values become zero. Positive values pass through unchanged. This is the nonlinearity that gives the network its power. After ReLU, some neurons are "off" (zero) and some are "on" (positive). The pattern of which neurons fire is what encodes useful features.
Step 3 — Output scores: Multiply the h-dimensional hidden vector by W2 (h×3) and add b2. This produces 3 scores: one per class. The highest score is the predicted class.
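The three steps can be traced in code for the single point x = [0.3, -0.5]. This is a sketch with assumptions: it uses a tiny hidden layer (h = 4) and random, untrained weights with a fixed seed, so the numbers it prints are illustrative only and the predicted class is meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption
D, h, K = 2, 4, 3               # small h so every value fits on one line

# Random untrained weights, larger than 0.01 so the values are easy to read
W1 = rng.standard_normal((D, h)); b1 = np.zeros((1, h))
W2 = rng.standard_normal((h, K)); b2 = np.zeros((1, K))

x = np.array([[0.3, -0.5]])      # one point, shape (1, 2)

pre = x @ W1 + b1                # Step 1: h weighted sums, shape (1, 4)
hidden = np.maximum(0, pre)      # Step 2: ReLU, negatives become 0
scores = hidden @ W2 + b2        # Step 3: one score per class, shape (1, 3)
pred = int(np.argmax(scores, axis=1)[0])

print("pre-activations:", pre.round(3))
print("after ReLU:     ", hidden.round(3))
print("class scores:   ", scores.round(3))
print("predicted class:", pred)
```

Comparing the first two printed rows shows exactly which neurons are "off" for this input: every negative pre-activation appears as 0 after ReLU.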
Watch a single input flow through the network. Each hidden neuron computes a weighted sum, then ReLU kills negatives. The survivors combine into three class scores.
We have scores. Now we need a single number that measures how bad those scores are — the loss. We use softmax cross-entropy, the same loss from linear classification, applied to our neural net's output.
Step 1 — Softmax: Convert raw scores into probabilities. For numerical stability, subtract the max score first:
```python
# Softmax: scores -> probabilities
exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # (300, 3)
```
Now probs[i] is a probability distribution over 3 classes for the i-th point. All values are between 0 and 1, and each row sums to 1.
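A quick way to see why the max is subtracted: shifting every score in a row by the same constant leaves the probabilities unchanged, but it keeps `np.exp` from overflowing. The sketch below wraps the same two lines in a hypothetical helper named `softmax` and feeds it made-up scores.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the max-subtraction trick."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

# Two rows that differ only by a constant offset of 1000
scores = np.array([[1.0, 2.0, 3.0],
                   [1001.0, 1002.0, 1003.0]])
probs = softmax(scores)

print(probs.round(4))        # both rows are identical
print(probs.sum(axis=1))     # each row sums to 1
```

Without the shift, `np.exp(1003.0)` overflows to infinity and the second row would come out as NaN.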
Step 2 — Cross-entropy loss: For each point, look up the probability assigned to the correct class, take the negative log, and average over all points:
```python
# Cross-entropy loss
correct_logprobs = -np.log(probs[range(N*K), y])  # (300,)
data_loss = np.sum(correct_logprobs) / (N*K)

# L2 regularization
reg = 1e-3
reg_loss = 0.5 * reg * (np.sum(W1*W1) + np.sum(W2*W2))
loss = data_loss + reg_loss
```
The regularization term penalizes large weights. Without it, the network could memorize the training data with extreme weight values. The hyperparameter reg controls the strength of this penalty — larger means simpler, smoother decision boundaries.
How −log(p) penalizes low confidence. The x-axis is the probability assigned to the correct class. As it drops toward 0, the loss rockets up.
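A few sample values make the shape of that curve concrete. The probabilities below are picked purely for illustration.

```python
import math

# Probability assigned to the correct class -> per-example loss
sample_losses = {p: -math.log(p) for p in (0.9, 0.5, 1 / 3, 0.1, 0.01)}

for p, loss in sample_losses.items():
    print(f"p = {p:.3f}  ->  loss = {loss:.3f}")
```

The p = 1/3 row is a useful reference: a 3-class model that spreads its probability uniformly, as a network with tiny initial weights roughly does, has a data loss of about log(3) ≈ 1.099. Seeing that number at step 0 is a cheap sanity check.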
We have a loss. Now we need the gradients — how should each weight change to reduce the loss? This is backpropagation: we work backward from the loss through each operation, applying the chain rule at every step.
Step 1 — Gradient on scores. The gradient of the softmax cross-entropy loss with respect to the scores has a beautifully simple form: it's just the probabilities, with 1 subtracted from the correct class:
```python
# Gradient on scores
dscores = probs.copy()            # (300, 3)
dscores[range(N*K), y] -= 1       # subtract 1 from correct class
dscores /= (N*K)                  # average over batch
```
Step 2 — Backprop into W2 and b2. Since scores = hidden · W2 + b2, the gradients follow from matrix calculus:
```python
# Backprop into W2 and b2
dW2 = np.dot(hidden.T, dscores)                 # (100, 3)
db2 = np.sum(dscores, axis=0, keepdims=True)    # (1, 3)
```
Step 3 — Backprop into the hidden layer. We propagate the gradient through W2, then through the ReLU:
```python
# Backprop into hidden layer
dhidden = np.dot(dscores, W2.T)   # (300, 100)
dhidden[hidden <= 0] = 0          # ReLU gate: zero grad where input was ≤ 0
```
The ReLU gradient is the key line. For any hidden neuron that was "off" (its output was 0 after ReLU, meaning its pre-activation was ≤ 0), the gradient is zero: nudging its input slightly would not change its output, so it cannot change the loss. For "on" neurons, the gradient passes through unchanged.
Step 4 — Backprop into W1 and b1:
```python
# Backprop into W1 and b1
dW1 = np.dot(X.T, dhidden)                      # (2, 100)
db1 = np.sum(dhidden, axis=0, keepdims=True)    # (1, 100)

# Add regularization gradient
dW2 += reg * W2
dW1 += reg * W1
```
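Hand-derived gradients are easy to get subtly wrong, so a common safeguard is to compare them against numerical estimates from central differences. The sketch below does this on a tiny random problem. The problem size, the seed, the `params` dictionary, and the helper `loss_and_grads` are all assumptions made for this check and are not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, h, K = 10, 2, 5, 3                 # tiny problem keeps the check fast
X = rng.standard_normal((n, D))
y = rng.integers(0, K, size=n)
reg = 1e-3
params = {
    "W1": 0.5 * rng.standard_normal((D, h)), "b1": np.zeros((1, h)),
    "W2": 0.5 * rng.standard_normal((h, K)), "b2": np.zeros((1, K)),
}

def loss_and_grads(p):
    # Forward pass and loss, same formulas as in the lesson
    hidden = np.maximum(0, X @ p["W1"] + p["b1"])
    scores = hidden @ p["W2"] + p["b2"]
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(n), y]).mean()
    loss += 0.5 * reg * (np.sum(p["W1"] ** 2) + np.sum(p["W2"] ** 2))

    # Backward pass, same four steps as in the lesson
    dscores = probs.copy()
    dscores[np.arange(n), y] -= 1
    dscores /= n
    dhidden = dscores @ p["W2"].T
    dhidden[hidden <= 0] = 0
    grads = {
        "W2": hidden.T @ dscores + reg * p["W2"],
        "b2": dscores.sum(axis=0, keepdims=True),
        "W1": X.T @ dhidden + reg * p["W1"],
        "b1": dhidden.sum(axis=0, keepdims=True),
    }
    return loss, grads

_, grads = loss_and_grads(params)
eps = 1e-5
max_err = {}
for name, value in params.items():
    numeric = np.zeros_like(value)
    for idx in np.ndindex(*value.shape):
        old = value[idx]
        value[idx] = old + eps           # nudge one parameter up
        loss_plus, _ = loss_and_grads(params)
        value[idx] = old - eps           # and down
        loss_minus, _ = loss_and_grads(params)
        value[idx] = old                 # restore it
        numeric[idx] = (loss_plus - loss_minus) / (2 * eps)
    max_err[name] = float(np.max(np.abs(numeric - grads[name])))
    print(f"{name}: max abs difference {max_err[name]:.2e}")
```

If the analytic gradients are right, every difference should be many orders of magnitude smaller than the gradients themselves. A large discrepancy in one parameter usually points at the backprop step that produced it.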
Everything comes together. Forward pass, loss, backward pass, weight update — repeated thousands of times. Watch the network learn the spiral in real time. The decision boundary starts as random noise and gradually sculpts itself to wrap around each spiral arm.
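Assembled into one script, the pieces look roughly like this. It is a sketch under stated assumptions: a fixed seed, a step size of 1.0, 10,000 iterations of full-batch gradient descent, and a plain `param -= step_size * grad` update. The final accuracy depends on the random noise, so treat any printed number as one sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability

# Spiral data, same recipe as the dataset code
N, K, D, h = 100, 3, 2, 100
X = np.zeros((N * K, D))
y = np.zeros(N * K, dtype=int)
for j in range(K):
    ix = np.arange(N * j, N * (j + 1))
    r = np.linspace(0.0, 1.0, N)
    t = np.linspace(j * 4, (j + 1) * 4, N) + rng.standard_normal(N) * 0.2
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y[ix] = j

W1 = 0.01 * rng.standard_normal((D, h)); b1 = np.zeros((1, h))
W2 = 0.01 * rng.standard_normal((h, K)); b2 = np.zeros((1, K))
step_size, reg = 1.0, 1e-3      # assumed hyperparameters
num = X.shape[0]
rows = np.arange(num)

history = []
for i in range(10000):
    # Forward pass
    hidden = np.maximum(0, X @ W1 + b1)
    scores = hidden @ W2 + b2
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

    # Loss
    loss = -np.log(probs[rows, y]).mean()
    loss += 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
    if i % 1000 == 0:
        history.append(loss)
        print(f"iteration {i}: loss {loss:.4f}")

    # Backward pass
    dscores = probs.copy()
    dscores[rows, y] -= 1
    dscores /= num
    dW2 = hidden.T @ dscores + reg * W2
    db2 = dscores.sum(axis=0, keepdims=True)
    dhidden = dscores @ W2.T
    dhidden[hidden <= 0] = 0
    dW1 = X.T @ dhidden + reg * W1
    db1 = dhidden.sum(axis=0, keepdims=True)

    # Parameter update: step against the gradient
    W1 -= step_size * dW1; b1 -= step_size * db1
    W2 -= step_size * dW2; b2 -= step_size * db2

hidden = np.maximum(0, X @ W1 + b1)
net_acc = np.mean(np.argmax(hidden @ W2 + b2, axis=1) == y)
print(f"training accuracy: {net_acc:.2f}")
```

Note that this measures accuracy on the training points themselves; there is no held-out set in this toy setup, so the number says the network can fit the spiral, not how well it would generalize.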
Left: decision boundary evolving over the data. Right: loss curve. Watch the network learn.
The showcase demonstrated that it works. Now let's understand why. The hidden layer transforms the 2D input into a 100-dimensional space. In this new space, the spirals become (very nearly) linearly separable, and the second layer is just a linear classifier in that space.
Each hidden neuron computes w · x + b and then applies ReLU. Geometrically, each neuron draws a line in the 2D input space. Points on one side of the line activate the neuron (positive value); points on the other side are zero. With 100 neurons, you get 100 different lines, creating a patchwork of regions. Each region has a unique pattern of which neurons are on and which are off.
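The "patchwork of regions" can be counted directly. The sketch below uses random, untrained weights (an assumption: a trained network would place its lines differently), evaluates the on/off pattern of all 100 neurons on a 50 × 50 grid over the square [-1, 1] × [-1, 1], and counts how many distinct patterns appear.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption
h = 100
W1 = rng.standard_normal((2, h))   # random, untrained weights
b1 = rng.standard_normal((1, h))

# A 50 x 50 grid of points covering the square [-1, 1] x [-1, 1]
xs = np.linspace(-1, 1, 50)
grid = np.array([[gx, gy] for gx in xs for gy in xs])   # (2500, 2)

# on[i, k] is 1 if neuron k is active (pre-activation > 0) at grid point i
on = ((grid @ W1 + b1) > 0).astype(np.uint8)            # (2500, 100)

# Each distinct row is one region's on/off signature
patterns = np.unique(on, axis=0)
print(f"{len(patterns)} distinct activation patterns across {len(grid)} grid points")
```

Each distinct pattern corresponds to one cell of the patchwork. Within a single cell the set of active neurons is fixed, so the network behaves as a plain linear function there; the bends in the decision boundary happen only where cells meet.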
Each tile shows what one hidden neuron "sees" — the region of 2D space where it activates (bright) vs. where it's off (dark). The first layer learns to tile the plane with these half-spaces. Click "Train & Show" to train a small network and visualize its learned features.
This is the deep learning recipe in miniature: learn features, then classify on those features. In a deep network with many layers, each layer transforms the representation slightly, building up from raw pixels to edges to textures to object parts to full objects. Our two-layer net does this in one jump — from raw coordinates to spiral-aware features.
You've just built a complete neural network from scratch: data generation, forward pass, loss computation, backward pass, and parameter update. Every deep learning framework (PyTorch, JAX, TensorFlow) performs these same steps, with automatic differentiation replacing our hand-written gradients and GPU acceleration speeding up the matrix math.
Here's our entire training loop translated to PyTorch, for comparison:
```python
import torch

# Convert the NumPy data to tensors (float32 inputs, int64 labels)
X_tensor = torch.from_numpy(X).float()
y_tensor = torch.from_numpy(y).long()

# Same architecture
model = torch.nn.Sequential(
    torch.nn.Linear(2, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 3)
)

# Note: weight_decay here also decays the biases, unlike our manual version
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, weight_decay=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(10000):
    scores = model(X_tensor)            # forward pass (automatic)
    loss = loss_fn(scores, y_tensor)    # loss (automatic)
    optimizer.zero_grad()
    loss.backward()                     # backward pass (automatic)
    optimizer.step()                    # parameter update (automatic)
```
Four annotated calls replace dozens of lines. But now you know what each of them does under the hood.
| What we did | Framework equivalent |
|---|---|
| np.dot(X, W1) + b1, ReLU, np.dot(hidden, W2) + b2 | model(X) |
| Softmax + −log + regularization | CrossEntropyLoss + weight_decay |
| Manual dW1, db1, dW2, db2 via chain rule | loss.backward() |
| W -= lr * dW | optimizer.step() |
"What I cannot create, I do not understand." — Richard Feynman