Deep Learning Foundations

Neural Networks From Zero

From a single neuron to a universal function approximator — in ten chapters.

Prerequisites: Linear classification + Basic calculus. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Neural Networks?

In the linear classification lesson, we learned that a linear classifier computes f(x) = Wx + b. It draws straight decision boundaries. And for many problems, straight lines are enough.

But what happens when the data isn't linearly separable? Consider the classic XOR problem: two classes of points arranged so that no single straight line can separate them. Class A sits at (0,0) and (1,1). Class B sits at (0,1) and (1,0). Try drawing a line. You can't.
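To make the XOR claim concrete, here is a minimal sketch that brute-forces a grid of linear classifiers on the four points and counts how many classify all of them correctly. The grid range, the step size, and the helper name `count_separating` are illustrative assumptions, not part of the lesson. A grid search is only a sanity check, not a proof.

```python
import numpy as np

# The four XOR points: Class A (label 0) at (0,0) and (1,1),
# Class B (label 1) at (0,1) and (1,0).
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_xor = np.array([0, 0, 1, 1])

# For contrast: AND-style labels (only (1,1) is positive), which are separable.
y_and = np.array([0, 1, 0, 0])


def count_separating(X, y, grid):
    """Count grid triples (w1, w2, b) whose rule `w.x + b > 0` matches y on every point."""
    w1, w2, b = np.meshgrid(grid, grid, grid, indexing="ij")
    weights = np.stack([w1.ravel(), w2.ravel()], axis=1)  # (N, 2)
    biases = b.ravel()[:, None]                           # (N, 1)
    scores = weights @ X.T + biases                       # (N, 4): one row per classifier
    preds = (scores > 0).astype(int)
    return int(np.sum(np.all(preds == y, axis=1)))


grid = np.linspace(-5.0, 5.0, 41)  # hypothetical search range, step 0.25
print("linear classifiers that solve XOR:", count_separating(X, y_xor, grid))
print("linear classifiers that solve AND:", count_separating(X, y_and, grid))
```

The XOR count stays at zero however fine the grid is. Class B requires w2 + b > 0 and w1 + b > 0, Class A requires b ≤ 0 and w1 + w2 + b ≤ 0, and adding the first two inequalities contradicts the last two.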

The fundamental limit: Linear classifiers can only draw straight boundaries. Real data — faces, speech, language — has curved, tangled, deeply nonlinear structure. We need a model that can bend.
The XOR Problem

Orange and teal dots cannot be separated by any straight line. Drag the line to try. A neural network solves this effortlessly by learning curved boundaries.

The solution: stack simple linear operations with nonlinear activation functions between them. Each layer warps the space a little. Stack enough layers, and you can sculpt any decision boundary you want. That's a neural network.

Linear Classifier: f = Wx + b → straight boundary
↓ add nonlinearity
Two-Layer Net: f = W2 · max(0, W1x + b1) + b2 → curved boundary
↓ add more layers
Deep Network: arbitrarily complex boundaries
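As a small illustration of the two-layer formula, the sketch below uses hand-picked weights (chosen for this example, not learned and not taken from the lesson) so that f = W2 · max(0, W1x + b1) + b2 separates the XOR points. The 0.5 decision threshold is also an assumption made for readability.

```python
import numpy as np


def two_layer_net(x, W1, b1, W2, b2):
    """f = W2 · max(0, W1 x + b1) + b2 for a single input vector x."""
    h = np.maximum(0.0, W1 @ x + b1)  # hidden layer with ReLU
    return W2 @ h + b2                # linear output score


# Hand-picked weights, for illustration only.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    score = two_layer_net(np.array(point, dtype=float), W1, b1, W2, b2)
    label = "B" if score > 0.5 else "A"
    print(point, "score =", score, "-> class", label)
```

The first hidden unit computes x1 + x2 and the second computes max(0, x1 + x2 − 1). Subtracting twice the second from the first gives 0, 1, 1, 0 on the four points, which is exactly the XOR labeling.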
Why can't a linear classifier solve the XOR problem?

Chapter 1: Modeling Neurons — Biological Inspiration

The name "neural network" comes from the brain. A biological neuron receives electrical signals through its dendrites, integrates them in the cell body, and if the total signal exceeds a threshold, it fires an output pulse down its axon to the next neuron's dendrites.

The mathematical analogy: inputs x_i arrive along "dendrites," each multiplied by a weight w_i (the "synapse strength"). The cell body sums them: w · x + b. If the result is large enough, the activation function "fires," producing a nonzero output.

Biological vs Artificial Neuron

Left: a simplified biological neuron. Right: its mathematical model. Signals arrive, get weighted, sum together, pass through an activation function.

Important caveat: Real neurons are staggeringly complex. They use precise spike timing, have thousands of connection types, release dozens of neurotransmitters, and compute with analog chemistry. The artificial neuron is a cartoon sketch — just the loosest inspiration. Don't take the analogy too far. Modern neural networks succeed because of math and engineering, not because they faithfully copy biology.

This simplification is useful, not accurate. The mathematical model gives us a composable building block: something small and simple that, when wired together in large numbers, produces powerful behavior. That's the real insight — not the biology.

One more historical note: the perceptron, proposed by Frank Rosenblatt in 1958, was one of the first artificial neurons that could learn from data (the McCulloch-Pitts neuron of 1943 came earlier but had no learning rule). It could only learn linearly separable patterns. When Minsky and Papert analyzed this limitation in 1969, interest in neural networks declined for over a decade, a period often called the first "AI winter." The fix, as we'll see, was adding hidden layers.

Why should we be cautious about the biological neuron analogy?

Chapter 2: A Single Neuron — It's a Linear Classifier

Let's formalize the single neuron. It takes n inputs x_1, x_2, ..., x_n, computes a weighted sum, adds a bias, and passes the result through an activation function σ:

output = σ(w_1 x_1 + w_2 x_2 + ... + w_n x_n + b) = σ(w · x + b)

If we use a sigmoid activation, the output is between 0 and 1. That's a probability. A single neuron with a sigmoid is literally logistic regression — binary classification with a linear decision boundary. If we use a threshold activation (output 1 if sum > 0, else 0), that's the original perceptron.
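A minimal sketch of that single neuron, with the activation passed in as a function so the same code covers both the sigmoid (logistic regression) and the threshold (perceptron) cases. The weight, bias, and input values are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    """Squash z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return np.where(z > 0, 1.0, 0.0)


def neuron(x, w, b, activation):
    """Single neuron: weighted sum plus bias, passed through an activation."""
    return activation(np.dot(w, x) + b)


# Illustrative values (assumptions, not taken from the lesson).
w = np.array([1.0, 1.0])
b = -1.0
x = np.array([0.8, 0.7])

print("sigmoid neuron (logistic regression):", neuron(x, w, b, sigmoid))
print("threshold neuron (perceptron):", neuron(x, w, b, step))
```

Both neurons share the same decision boundary, the line w · x + b = 0. The sigmoid version reports how far the point is from that line as a value in (0, 1), while the threshold version only reports which side it is on.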

Key insight: A single neuron = a linear classifier. It draws one straight line (or hyperplane in higher dimensions) through the input space. Everything above the line maps to one class, everything below to the other. We already know this from the linear classification lesson — one neuron is just f = Wx + b with an activation on top.
Single Neuron Classifier

A single neuron with two inputs classifies 2D points. Adjust the weights and bias with the sliders to move the decision boundary. The boundary is always a straight line.


Notice: no matter how you adjust the sliders, the boundary is always a straight line. That's the limit of one neuron. To get curves, we need layers of neurons.

A single neuron with a sigmoid activation is equivalent to:

Chapter 3: Activation Functions — Sigmoid, Tanh, ReLU

The activation function is what makes a neural network nonlinear. Without it, stacking layers would be pointless: a linear function of a linear function is still linear. W2(W1x) = (W2W1)x = W'x. You'd just have a single linear layer with extra steps.
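The collapse is easy to check numerically. The sketch below uses random weights (an arbitrary choice for illustration) and includes biases, in which case the merged layer is W' = W2W1 with bias b' = W2b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
X = rng.normal(size=(2, 1000))  # 1000 input vectors stored as columns

# Two stacked linear layers, no activation in between.
stacked = W2 @ (W1 @ X + b1) + b2

# The equivalent single linear layer.
merged = (W2 @ W1) @ X + (W2 @ b1 + b2)

# The same two layers with a ReLU in between.
stacked_relu = W2 @ np.maximum(0.0, W1 @ X + b1) + b2

print("stacked linear == merged linear:", np.allclose(stacked, merged))
print("stacked with ReLU == merged linear:", np.allclose(stacked_relu, merged))
```

The first comparison holds for any weights. The second fails as soon as the ReLU zeroes out at least one negative pre-activation, which is what keeps the two layers from collapsing into one.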

Why activations matter: Without nonlinear activation functions, a 100-layer network collapses to a single matrix multiply. The activation is what gives neural networks their power to learn curved, complex decision boundaries.

Sigmoid: σ(z) = 1 / (1 + e^(−z)). Squashes any input to the range (0, 1). Historically popular because it looks like a neuron "firing rate." But it has serious problems: gradients vanish when inputs are very large or very small (the curve is flat), and the output is not zero-centered.

Tanh: tanh(z) = 2σ(2z) − 1. Squashes input to (−1, 1). Zero-centered, which is better than sigmoid. But it still saturates — gradients vanish for large |z|.

ReLU: f(z) = max(0, z). Dead simple: output z if positive, zero if negative. Doesn't saturate for positive inputs. Computationally trivial. This one function revolutionized deep learning. Proposed for neural networks by Nair and Hinton in 2010, it made training deep networks drastically faster.
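The three activations and their derivatives fit in a few lines of NumPy. This is a sketch for building intuition. The sample inputs are arbitrary, and the ReLU derivative at exactly z = 0 is set to 0 by convention here.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # peaks at 0.25 when z = 0


def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2   # peaks at 1.0 when z = 0


def relu(z):
    return np.maximum(0.0, z)


def relu_grad(z):
    return (z > 0).astype(float)   # 1 for positive inputs, 0 otherwise


z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# The identity tanh(z) = 2σ(2z) − 1 from the text.
print("tanh identity holds:", np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))

for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("relu", relu_grad)]:
    print(f"{name:8s} gradient at z = {z.tolist()}: {grad(z)}")
```

At z = ±10 the sigmoid and tanh gradients are close to zero (saturation), while the ReLU gradient is still exactly 1 on the positive side.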

Activation Functions Compared

Three activation functions plotted. Hover to see output values. Notice how sigmoid and tanh flatten out at the extremes (vanishing gradients) while ReLU stays linear for positive inputs.

Function | Range   | Zero-Centered | Saturates          | Speed
Sigmoid  | (0, 1)  | No            | Yes                | Slow (exp)
Tanh     | (−1, 1) | Yes           | Yes                | Slow (exp)
ReLU     | [0, ∞)  | No            | No (positive side) | Fast (max)
The ReLU revolution: Before ReLU, training networks deeper than 2-3 layers was agonizingly slow because of vanishing gradients. ReLU's constant gradient for positive inputs lets gradients flow through many layers without shrinking. Krizhevsky et al. (2012), who used ReLU in AlexNet, reported that a four-layer convolutional network with ReLU reached 25% training error on CIFAR-10 about six times faster than the same network with tanh.
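A back-of-envelope calculation shows why saturation hurts with depth. During backpropagation the gradient picks up one activation-derivative factor per layer. The sigmoid derivative never exceeds 0.25, while the ReLU derivative is 1 for active units. The sketch below deliberately ignores the weight matrices, which also scale the gradient, so treat it as an upper bound on the activation factor only. The function name is hypothetical.

```python
def max_activation_factor(max_slope, num_layers):
    """Largest possible product of activation derivatives across num_layers layers."""
    return max_slope ** num_layers


for depth in (1, 3, 10, 20):
    sigmoid_bound = max_activation_factor(0.25, depth)  # sigmoid slope is at most 0.25
    relu_factor = max_activation_factor(1.0, depth)     # active ReLU slope is exactly 1
    print(f"depth {depth:2d}: sigmoid <= {sigmoid_bound:.3e}, ReLU (active path) = {relu_factor}")
```

Even in the best case, ten sigmoid layers shrink the gradient by roughly a factor of a million, while a path of active ReLU units passes it through unchanged.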
Why did ReLU accelerate deep learning?

Chapter 4: ReLU and Friends — The Dead Neuron Problem

ReLU has one nasty failure mode. If a neuron's pre-activation is always negative (say, because a large gradient update pushed the weights too far), the output is permanently zero. Zero output means zero gradient flowing back. The neuron will never update again. It's dead.

The dead neuron problem: A ReLU neuron that outputs zero for all inputs in the training set will never recover. Its gradient is zero, so it can never learn again. With high learning rates, up to 40% of neurons in a network can die.
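The sketch below shows a dead ReLU neuron in isolation. The inputs, weights, and the large negative bias are invented for illustration, and the upstream gradient is simply set to 1 for every example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))  # 100 training inputs in [-1, 1]^2

w = np.array([0.5, -0.3])
b = -5.0  # a bias pushed far negative, e.g. by one oversized update

z = X @ w + b                 # pre-activations: at most 0.5 + 0.3 - 5 = -4.2
a = np.maximum(0.0, z)        # ReLU outputs

upstream = np.ones_like(a)    # pretend dL/da = 1 for every example
local = (z > 0).astype(float) # ReLU derivative: 0 wherever z <= 0

grad_w = X.T @ (upstream * local)
grad_b = np.sum(upstream * local)

print("fraction of inputs with zero output:", np.mean(a == 0.0))
print("gradient w.r.t. w:", grad_w)
print("gradient w.r.t. b:", grad_b)
```

Every pre-activation is negative, so the output and both gradients are exactly zero. No gradient step can move `w` or `b`, which is why the neuron stays dead.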

Leaky ReLU: f(z) = z if z > 0, else αz (typically α = 0.01). Instead of outputting zero for negative inputs, it outputs a small negative value, αz, so the function keeps a slope of α on the negative side. Neurons cannot die in the same way because the gradient is never exactly zero.

ELU (Exponential Linear Unit): f(z) = z if z > 0, else α(e^z − 1). Smoother than Leaky ReLU. Outputs approach −α for large negative inputs, which pushes the mean activation closer to zero.

GELU (Gaussian Error Linear Unit): f(z) = z · Φ(z), where Φ is the standard Gaussian CDF. Used in BERT and GPT. It's smooth, and unlike ReLU, it doesn't have a hard kink at zero — it gradually "turns on." Think of it as a soft gate: inputs that are very negative get zeroed out, very positive pass through, and inputs near zero get a smooth blend.
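The three variants can be sketched directly from their formulas. The GELU here is the exact form z · Φ(z), computed with `math.erf` from the standard library. Deep learning frameworks often use a faster approximation instead, which is not shown. The default α values are assumptions.

```python
import math

import numpy as np


def leaky_relu(z, alpha=0.01):
    """z for positive inputs, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z)


def elu(z, alpha=1.0):
    """z for positive inputs, alpha * (e^z - 1) otherwise."""
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))


def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard Gaussian CDF."""
    phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return z * phi


z = np.array([-20.0, -1.0, 0.0, 1.0, 20.0])
print("leaky_relu:", leaky_relu(z))
print("elu:       ", elu(z))
print("gelu:      ", gelu(z))
```

On the negative side, Leaky ReLU keeps shrinking linearly, ELU levels off near −α, and GELU returns to zero for very negative inputs after a small dip.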

ReLU Variants

Compare ReLU, Leaky ReLU, ELU, and GELU. Notice how the variants handle negative inputs differently. Use the slider to adjust α for Leaky ReLU and ELU.

Practical advice (cs231n): Use ReLU. Be careful with learning rates. If you see many dead neurons, try Leaky ReLU. Avoid sigmoid in hidden layers, and expect tanh to work worse than ReLU. In transformers such as BERT and GPT, GELU is the usual choice.
What is the "dead neuron" problem?

Chapter 5: Layers — Stacking Neurons Together

A single neuron draws one boundary. To draw complex boundaries, we wire neurons together in layers. The most common arrangement is the fully-connected layer (also called a dense layer): every neuron in one layer connects to every neuron in the next.

Architecture terminology: A neural network has an input layer (your data), one or more hidden layers (where the computation happens), and an output layer (the final prediction). We don't count the input layer, so a "2-layer network" has one hidden layer and one output layer.

For a 2-layer network, the forward pass looks like this:

h = f(W1 · x + b1)
s = W2 · h + b2

Where f is an activation function (like ReLU), W1 maps inputs to hidden units, W2 maps hidden units to output scores, and h is the hidden representation — the network's internal encoding of the input.

For a 3-layer network, we add another layer:

h1 = f(W1 · x + b1)
h2 = f(W2 · h1 + b2)
s = W3 · h2 + b3
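The same pattern extends to any depth. Below is a minimal sketch that stores the network as a list of (W, b) pairs and applies ReLU after every layer except the last. The helper names `forward` and `init_layers`, the layer sizes, and the 0.01 initialization scale are illustrative assumptions.

```python
import numpy as np


def forward(x, layers):
    """Forward pass through a fully-connected net given as a list of (W, b) pairs."""
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(0.0, W @ h + b)  # hidden layer with ReLU
    W_out, b_out = layers[-1]
    return W_out @ h + b_out            # output layer: raw scores, no activation


def init_layers(sizes, rng, scale=0.01):
    """Small random weights and zero biases for consecutive layer sizes."""
    return [
        (rng.normal(size=(n_out, n_in)) * scale, np.zeros((n_out, 1)))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


rng = np.random.default_rng(0)
layers = init_layers([2, 4, 4, 3], rng)  # 3-layer net: two hidden layers, one output layer
x = np.array([[0.5], [-0.3]])            # one 2D input as a column vector
scores = forward(x, layers)
print(scores.shape)  # (3, 1): one score per class
```

With `sizes = [2, 4, 3]` the same code reproduces the 2-layer case.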
Network Architecture Builder

A fully-connected network. Each circle is a neuron. Each line is a weight. Click the buttons to change the number of hidden layers and neurons. Watch how the parameter count explodes.
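The parameter count of a fully-connected network follows directly from the layer sizes. Each layer contributes one weight per connection plus one bias per neuron. The sketch below uses a hypothetical helper `count_params`, and the example sizes are made up.

```python
def count_params(sizes):
    """Total weights and biases for layer sizes [input, hidden..., output]."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


print(count_params([2, 4, 3]))       # (2*4 + 4) + (4*3 + 3) = 27
print(count_params([2, 4, 4, 3]))    # 12 + (4*4 + 4) + 15 = 47
print(count_params([2, 64, 64, 3]))  # 192 + 4160 + 195 = 4547
```

Widening the hidden layers from 4 to 64 neurons multiplies the count by almost 100. The weights between two hidden layers grow with the product of their widths.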

python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # 2-layer neural network forward pass
    h = np.maximum(0, W1 @ x + b1)  # ReLU activation
    scores = W2 @ h + b2              # output layer (no activation)
    return scores

# Example: 2D input, 4 hidden neurons, 3 classes
W1 = np.random.randn(4, 2) * 0.01
b1 = np.zeros((4, 1))
W2 = np.random.randn(3, 4) * 0.01
b2 = np.zeros((3, 1))

x = np.array([[0.5], [-0.3]])
scores = forward(x, W1, b1, W2, b2)
A "3-layer neural network" has how many hidden layers?

Chapter 6: Representational Power — Universal Approximation

How powerful are neural networks? A stunning mathematical result tells us: very. The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) proves that a neural network with just one hidden layer and enough neurons can approximate any continuous function to any desired accuracy.

Universal Approximation Theorem: A 2-layer network (one hidden layer) with a sufficient number of neurons can approximate any continuous function on a bounded (compact) domain to any desired accuracy. In that sense the network is a universal function approximator.

This sounds like "just use one hidden layer with lots of neurons and you're done." But there's a catch. The theorem says a solution exists; it doesn't say how many neurons it takes or that gradient descent will find it. A very wide, shallow network might need an astronomically large number of neurons, while for some families of functions a deeper network can represent the same function with exponentially fewer neurons.
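For intuition about the "enough neurons" part, here is a constructive sketch in one dimension. A single hidden layer of ReLU units can reproduce the piecewise-linear interpolant of a function exactly, so adding hidden units shrinks the error. This is a hand-built construction rather than a trained network. It illustrates the idea only in 1D, not the general theorem, and the helper name `relu_interpolant` and the choice of sin as the target are assumptions.

```python
import numpy as np


def relu_interpolant(f, a, b, num_hidden):
    """One-hidden-layer ReLU net that linearly interpolates f at evenly spaced knots.

    The returned function computes g(x) = y0 + sum_i c_i * max(0, x - t_i).
    """
    knots = np.linspace(a, b, num_hidden + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)  # slope on each segment
    coeffs = np.diff(slopes, prepend=0.0)      # output weights: change in slope at each knot
    shifts = knots[:-1]                        # one hidden unit per segment

    def g(x):
        x = np.asarray(x, dtype=float)
        hidden = np.maximum(0.0, x[..., None] - shifts)  # hidden activations
        return values[0] + hidden @ coeffs               # linear output layer

    return g


xs = np.linspace(-np.pi, np.pi, 2001)
errors = {}
for width in (8, 64):
    g = relu_interpolant(np.sin, -np.pi, np.pi, width)
    errors[width] = float(np.max(np.abs(g(xs) - np.sin(xs))))
    print(f"{width:3d} hidden units: max error = {errors[width]:.5f}")
```

The maximum error of linear interpolation scales with the square of the knot spacing, so each doubling of the width cuts the error by about a factor of four. The construction says nothing about whether gradient descent would find these weights.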

Think of it like building with LEGO. A one-layer-deep structure can theoretically build anything, but you'd need millions of pieces all laid flat. With depth, you can stack and reuse pieces, building the same thing far more efficiently. Deep networks compose features hierarchically: edges → textures → parts → objects.

Width vs Depth

Fitting a wavy target function (gray). A wide shallow network (orange) and a narrow deep network (teal) both approximate it, but with different efficiency. Adjust neurons and layers to see the tradeoff.

Depth wins in practice: While a single hidden layer can approximate any continuous function, deeper networks tend to learn better features from data. Very deep architectures such as ResNets, with 150+ layers, far outperform shallow networks on image recognition. Depth enables hierarchical feature learning, and that's the real insight.
The Universal Approximation Theorem guarantees that a wide enough 2-layer network can approximate any continuous function. Why do we still use deep networks?

Chapter 7: Playground — Neural Network Classifier

Time to see everything in action. Below is a 2D classification task. You control the network architecture — how many hidden layers, how many neurons per layer, and which activation function to use. Hit "Train" and watch the decision boundary evolve in real time as the network learns.

What to try: Start with 1 hidden layer, 4 neurons, ReLU. Train on the spiral dataset. Then try 0 hidden layers (linear classifier) — watch it fail. Add layers. Add neurons. Switch to sigmoid and see how much slower training is. Break it. Fix it. Build intuition.
Neural Network Playground

Choose a dataset and architecture, then train. The background color shows the network's decision boundary. Orange = class A, teal = class B.


Notice how a linear classifier (0 hidden layers) can only draw a straight line — it fails on circles and spirals. Adding just one hidden layer with a few ReLU neurons lets the network bend the boundary. More layers and neurons let it carve out increasingly intricate regions.

Chapter 8: How Many Neurons? — Capacity and Overfitting

With more neurons, the network can represent more complex functions. So why not use a massive network for every problem? Because of overfitting: a network with too much capacity memorizes the training data — including its noise — instead of learning the underlying pattern.

Capacity vs Generalization

Fitting noisy data with different network sizes. Too few neurons: underfits (can't capture the pattern). Too many neurons: overfits (memorizes every noisy point). Adjust the neuron count to find the sweet spot.


The cs231n notes make a counterintuitive recommendation: don't reduce overfitting by using fewer neurons. Smaller networks have fewer local minima, but those minima tend to be worse (higher loss), and the final loss depends heavily on the random initialization. Larger networks have many more local minima, but these tend to be roughly equivalent and have lower loss, so the result varies much less from run to run.

The cs231n rule: Use as many neurons as you can afford computationally, then control overfitting with regularization (L2 penalty, dropout, etc.) — not by shrinking the network. Regularization gives you the best of both worlds: high capacity with controlled complexity.
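As a rough sketch of what an L2 penalty looks like in code, the total loss adds λ times the sum of squared weights to the data loss, and each weight matrix receives an extra gradient term 2λW that pulls it toward zero. The function names, the λ value, and the placeholder data loss below are assumptions for illustration. Biases are left unregularized here, which is a common but not universal choice.

```python
import numpy as np


def l2_penalty(weights, lam):
    """lam * sum of squared entries over all weight matrices."""
    return lam * sum(float(np.sum(W * W)) for W in weights)


def l2_grad(W, lam):
    """Gradient of the penalty with respect to one weight matrix."""
    return 2.0 * lam * W


def regularized_loss(data_loss, weights, lam):
    """Total loss = data loss + L2 penalty."""
    return data_loss + l2_penalty(weights, lam)


# Toy weights for a 2-layer network (made-up numbers).
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
lam = 0.1        # hypothetical regularization strength
data_loss = 0.8  # placeholder for the loss measured on the training data

print("penalty:", l2_penalty([W1, W2], lam))                    # 0.1 * (5.25 + 10.0)
print("total loss:", regularized_loss(data_loss, [W1, W2], lam))
print("extra gradient on W2:", l2_grad(W2, lam))
```

Because the penalty grows with the square of each weight, it discourages a few very large weights more than many small ones. This is how a large network can keep its capacity while being steered toward smoother functions.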
Analogy: Think of a network's neurons like employees. Having too few employees means the team can't handle complex tasks (underfitting). Having too many idle employees seems wasteful (overfitting risk). But the right approach isn't to fire people — it's to give them clear guidelines and structure (regularization). A large, well-managed team outperforms a tiny, overworked one.
According to cs231n, what's the recommended way to fight overfitting?

Chapter 9: Beyond — From Architecture to Training

You now know how to set up a neural network: choose the number of layers, the number of neurons per layer, and the activation function. You understand why ReLU is the default, why depth matters, and why regularization beats architecture shrinkage for controlling overfitting.

But architecture is only half the story. A beautifully designed network with random weights is useless. The next step is training: using gradient descent and backpropagation to find good weights. That's where the real magic happens.

Topic                | This Lesson                                | Next Steps
Architecture         | Layers, neurons, activations               | ConvNets, ResNets, Transformers
Training             | Not covered                                | Optimization & Backprop
Regularization       | Mentioned (L2, dropout)                    | Dropout, BatchNorm, data augmentation
Activation Functions | Sigmoid, tanh, ReLU, Leaky ReLU, ELU, GELU | Swish, Mish (in practice)
The story so far: Image Classification taught us the problem. Linear Classification gave us f = Wx + b. This lesson stacked those linear classifiers with nonlinearities to build neural networks. Next: how do we actually find the right weights? That's optimization and backpropagation.

Feynman: "What I cannot create, I do not understand." You've now built a neural network from a single neuron up. You understand every piece. Next, you'll learn to train it.
Which component makes a neural network more powerful than a linear classifier?