From a single neuron to a universal function approximator — in ten chapters.
In the linear classification lesson, we learned that a linear classifier computes f(x) = Wx + b. It draws straight decision boundaries. And for many problems, straight lines are enough.
But what happens when the data isn't linearly separable? Consider the classic XOR problem: two classes of points arranged so that no single straight line can separate them. Class A sits at (0,0) and (1,1). Class B sits at (0,1) and (1,0). Try drawing a line. You can't.
Orange and teal dots cannot be separated by any straight line. Drag the line to try. A neural network solves this effortlessly by learning curved boundaries.
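If you'd rather not take "you can't" on faith, here is a small brute-force check. It is only a sketch: the search grid (weights and bias from -2 to 2 in steps of 0.25) is an arbitrary choice made for illustration, and the label encoding (0 for class A, 1 for class B) is an assumption. It tries every line on that grid and reports the best number of XOR points any of them classifies correctly.

```python
import itertools
import numpy as np

# XOR layout from the text: class A (label 0) at (0,0) and (1,1),
# class B (label 1) at (0,1) and (1,0).
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

def count_correct(w1, w2, b):
    """How many XOR points the line w1*x1 + w2*x2 + b = 0 classifies correctly."""
    pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
    return int((pred == y).sum())

# Illustrative search grid (an assumption, not part of the lesson).
grid = np.arange(-2.0, 2.01, 0.25)
best = max(count_correct(w1, w2, b)
           for w1, w2, b in itertools.product(grid, repeat=3))
print(best)  # 3: no line on the grid gets all four points right
```

The best any line manages is three out of four, which is exactly what "not linearly separable" means for this dataset.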
The solution: stack simple linear operations with nonlinear activation functions between them. Each layer warps the space a little. Stack enough layers, and you can sculpt any decision boundary you want. That's a neural network.
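To make this concrete, here is a tiny network that solves XOR with one hidden layer of two ReLU units. The weights are picked by hand for illustration (a trained network would find its own), and the names `W1`, `b1`, `w2`, and `xor_net` are made up for this sketch.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hand-picked (not learned) weights: 2 inputs -> 2 hidden units -> 1 output.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

def xor_net(x):
    h = relu(W1 @ x + b1)   # hidden layer: linear map, then nonlinearity
    return float(w2 @ h)    # output: 1.0 for class B, 0.0 for class A

for point in [(0, 0), (1, 1), (0, 1), (1, 0)]:
    print(point, xor_net(np.array(point, dtype=float)))
# (0, 0) 0.0
# (1, 1) 0.0
# (0, 1) 1.0
# (1, 0) 1.0
```

The second hidden unit only switches on when both inputs are 1, and the output layer subtracts it twice. That "bend" is something no single linear layer can produce.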
The name "neural network" comes from the brain. A biological neuron receives electrical signals through its dendrites, integrates them in the cell body, and if the total signal exceeds a threshold, it fires an output pulse down its axon to the next neuron's dendrites.
The mathematical analogy: inputs x_i arrive along "dendrites," each multiplied by a weight w_i (the "synapse strength"). The cell body sums them and adds a bias: w · x + b. If the result is large enough, the activation function "fires," producing a nonzero output.
Left: a simplified biological neuron. Right: its mathematical model. Signals arrive, get weighted, sum together, pass through an activation function.
This simplification is useful, not accurate. The mathematical model gives us a composable building block: something small and simple that, when wired together in large numbers, produces powerful behavior. That's the real insight — not the biology.
One more historical note: the perceptron, introduced by Frank Rosenblatt in 1958, was one of the first artificial neurons that could learn its weights from data. A single perceptron can only learn linearly separable patterns. After Minsky and Papert analyzed this limitation in their 1969 book Perceptrons, interest in neural networks dropped sharply for over a decade, a period often described as the first "AI winter." The fix, as we'll see, was adding hidden layers.
Let's formalize the single neuron. It takes n inputs x_1, x_2, ..., x_n, computes a weighted sum, adds a bias, and passes the result through an activation function σ:

y = σ(w_1 x_1 + w_2 x_2 + ... + w_n x_n + b) = σ(w · x + b)

If we use a sigmoid activation, the output lies between 0 and 1 and can be read as a probability. A single neuron with a sigmoid is exactly the model used in logistic regression: binary classification with a linear decision boundary. If we use a threshold activation (output 1 if the sum is greater than 0, else 0), we get the original perceptron.
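Here is the single neuron as a minimal sketch. The weights, bias, and input are illustrative values, and the helper names (`neuron`, `step`) are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(z):
    """Threshold activation of the original perceptron."""
    return 1.0 if z > 0 else 0.0

def neuron(x, w, b, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    return activation(np.dot(w, x) + b)

# Illustrative values: w . x + b = 2*1.0 + (-1)*0.5 - 0.5 = 1.0
w = np.array([2.0, -1.0])
b = -0.5
x = np.array([1.0, 0.5])

print(neuron(x, w, b, sigmoid))  # about 0.731, read as P(class = 1)
print(neuron(x, w, b, step))     # 1.0, the perceptron "fires"
```

Swapping the activation is the only difference between logistic regression and the perceptron here; the weighted sum is identical.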
f = Wx + b with an activation on top.
A single neuron with two inputs classifies 2D points. Adjust the weights and bias with the sliders to move the decision boundary. The boundary is always a straight line.
Notice: no matter how you adjust the sliders, the boundary is always a straight line. That's the limit of one neuron. To get curves, we need layers of neurons.
The activation function is what makes a neural network nonlinear. Without it, stacking layers would be pointless: a linear function of a linear function is still linear. W2(W1x) = (W2W1)x = W'x. You'd just have a single linear layer with extra steps.
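You can check the collapse numerically. The matrices below are arbitrary illustrative values chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Arbitrary example weights: 2 inputs -> 4 hidden units -> 3 outputs.
W1 = np.array([[ 1.0, -1.0],
               [ 2.0,  0.0],
               [ 0.0,  1.0],
               [-1.0, -1.0]])
W2 = np.array([[1.0, 0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0, 1.0],
               [1.0, 1.0, 1.0, 1.0]])
x = np.array([1.0, 2.0])

two_layers = W2 @ (W1 @ x)              # two stacked linear layers
one_layer = (W2 @ W1) @ x               # a single merged linear layer
with_relu = W2 @ np.maximum(0, W1 @ x)  # same weights, ReLU in between

print(np.allclose(two_layers, one_layer))  # True: stacking added nothing
print(one_layer)                           # [ 1. -1.  0.]
print(with_relu)                           # [2. 2. 4.]
```

With the ReLU in place, no single matrix W' reproduces the mapping for every input, which is what gives depth its power.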
Sigmoid: σ(z) = 1 / (1 + e^(−z)). Squashes any input to the range (0, 1). Historically popular because it resembles a neuron's "firing rate." But it has serious problems: gradients vanish when the input is large in magnitude, whether positive or negative (the curve is flat there), and the output is not zero-centered.
Tanh: tanh(z) = 2σ(2z) − 1. Squashes input to (−1, 1). Zero-centered, which is better than sigmoid. But it still saturates — gradients vanish for large |z|.
ReLU: f(z) = max(0, z). Dead simple: output z if positive, zero otherwise. Doesn't saturate for positive inputs. Computationally trivial. This one function revolutionized deep learning. Popularized for neural networks by Nair and Hinton in 2010, it made training deep networks substantially faster than with sigmoid or tanh.
Three activation functions plotted. Hover to see output values. Notice how sigmoid and tanh flatten out at the extremes (vanishing gradients) while ReLU stays linear for positive inputs.
| Function | Range | Zero-Centered | Saturates | Speed |
|---|---|---|---|---|
| Sigmoid | (0, 1) | No | Yes | Slow (exp) |
| Tanh | (−1, 1) | Yes | Yes | Slow (exp) |
| ReLU | [0, ∞) | No | No (positive side) | Fast (max) |
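The "Saturates" column is easiest to see in the derivatives. The sketch below implements the three functions and their gradients in NumPy; the function names are chosen for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    return (z > 0).astype(float)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid_grad(z))  # tiny at both extremes, 0.25 at z = 0
print(tanh_grad(z))     # tiny at both extremes, 1.0 at z = 0
print(relu_grad(z))     # [0. 0. 1.]: no saturation on the positive side

# The identity from the text: tanh(z) = 2*sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))  # True
```

At z = 10 the sigmoid gradient is already below 1e-4, so almost no learning signal passes through a saturated unit.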
ReLU has one nasty failure mode. If a neuron's pre-activation is always negative (say, because a large gradient update pushed the weights too far), the output is permanently zero. Zero output means zero gradient flowing back. The neuron will never update again. It's dead.
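Here is what a dead neuron looks like numerically. This is a sketch with made-up numbers: the bias of -100 stands in for "a large gradient update pushed the weights too far," and the upstream gradient of all ones is just a placeholder for whatever the loss would send back.

```python
import numpy as np

# A small batch of three 2D inputs (illustrative values).
X = np.array([[ 0.5,  1.0],
              [ 1.5, -0.5],
              [-1.0,  2.0]])
w = np.array([1.0, 1.0])
b = -100.0                          # hypothetical bias after a bad update

z = X @ w + b                       # pre-activations: all strongly negative
a = np.maximum(0, z)                # ReLU outputs: all zero

upstream = np.ones_like(z)          # placeholder gradient from the loss
local_grad = (z > 0).astype(float)  # ReLU derivative: 0 wherever z <= 0
grad_w = X.T @ (upstream * local_grad)
grad_b = float(np.sum(upstream * local_grad))

print(a)               # [0. 0. 0.]
print(grad_w, grad_b)  # all zeros: gradient descent cannot move w or b
```

Because every gradient is exactly zero, no update can ever pull this neuron's pre-activation back above zero for these inputs.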
Leaky ReLU: f(z) = z if z > 0, else αz (typically α = 0.01). Instead of outputting zero for negative inputs, it keeps a small slope α there, so negative inputs produce small negative outputs. Because the gradient is never exactly zero, a neuron can always recover instead of dying.
ELU (Exponential Linear Unit): f(z) = z if z > 0, else α(e^z − 1). Smoother than Leaky ReLU. Outputs approach −α for large negative inputs, which pushes the mean activation closer to zero.
GELU (Gaussian Error Linear Unit): f(z) = z · Φ(z), where Φ is the standard Gaussian CDF. Used in BERT and GPT. It's smooth, and unlike ReLU, it doesn't have a hard kink at zero — it gradually "turns on." Think of it as a soft gate: inputs that are very negative get zeroed out, very positive pass through, and inputs near zero get a smooth blend.
Compare ReLU, Leaky ReLU, ELU, and GELU. Notice how the variants handle negative inputs differently. Use the slider to adjust α for Leaky ReLU and ELU.
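The three variants fit in a few lines of NumPy. This is a sketch of the formulas above, not any particular library's implementation; GELU uses the exact Gaussian CDF via `math.erf`, whereas some frameworks use a tanh-based approximation instead.

```python
import math
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # Clamp before exp so large positive z cannot overflow in the unused branch.
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def gelu(z):
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) is the standard Gaussian CDF.
    phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return z * phi

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(z))  # small negative outputs instead of zeros
print(elu(z))         # negative side flattens out toward -alpha
print(gelu(z))        # near zero for very negative z, close to z for large z
```

Notice that GELU at z = 1 is about 0.841, a little less than the input: the "gate" is only partly open near zero.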
A single neuron draws one boundary. To draw complex boundaries, we wire neurons together in layers. The most common arrangement is the fully-connected layer (also called a dense layer): every neuron in one layer connects to every neuron in the next.
For a 2-layer network, the forward pass looks like this:

h = f(W1 x + b1)
s = W2 h + b2
Where f is an activation function (like ReLU), W1 maps inputs to hidden units, W2 maps hidden units to output scores, and h is the hidden representation — the network's internal encoding of the input.
For a 3-layer network, we add another layer:

h1 = f(W1 x + b1)
h2 = f(W2 h1 + b2)
s = W3 h2 + b3
A fully-connected network. Each circle is a neuron. Each line is a weight. Click the buttons to change the number of hidden layers and neurons. Watch how the parameter count explodes.
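To see how quickly the count grows, here is a small helper (the name `count_parameters` is made up for this sketch). Each fully-connected layer contributes one weight per input-output pair plus one bias per output neuron.

```python
def count_parameters(layer_sizes):
    """Total weights and biases in a fully-connected network.

    layer_sizes lists the neurons per layer, input first: [2, 4, 3]
    means 2 inputs, 4 hidden neurons, 3 outputs.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total

print(count_parameters([2, 4, 3]))       # 27
print(count_parameters([2, 64, 64, 3]))  # 4547
```

Most of the 4547 parameters sit in the 64-by-64 hidden-to-hidden matrix: widening a layer grows the count quadratically once layers feed into each other.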
```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # 2-layer neural network forward pass
    h = np.maximum(0, W1 @ x + b1)  # ReLU activation
    scores = W2 @ h + b2            # output layer (no activation)
    return scores

# Example: 2D input, 4 hidden neurons, 3 classes
W1 = np.random.randn(4, 2) * 0.01
b1 = np.zeros((4, 1))
W2 = np.random.randn(3, 4) * 0.01
b2 = np.zeros((3, 1))
x = np.array([[0.5], [-0.3]])
scores = forward(x, W1, b1, W2, b2)
```
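The same pattern extends to any depth. Below is one possible generalization, assuming ReLU on every hidden layer and no activation on the output; `forward_deep` and the layer sizes are hypothetical names and values for this sketch.

```python
import numpy as np

def forward_deep(x, weights, biases):
    """Forward pass through any number of fully-connected layers."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0, W @ h + b)    # hidden layer with ReLU
    return weights[-1] @ h + biases[-1]  # output layer (no activation)

# Example: a 3-layer network with 2 inputs, two hidden layers of 4, 3 classes.
rng = np.random.default_rng(0)
sizes = [2, 4, 4, 3]
weights = [rng.standard_normal((n_out, n_in)) * 0.01
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

x = np.array([[0.5], [-0.3]])
scores = forward_deep(x, weights, biases)
print(scores.shape)  # (3, 1): one score per class
```

Keeping the weights in a list means adding a layer is a one-number change to `sizes`, with no change to the forward function.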
How powerful are neural networks? A striking mathematical result tells us: very. The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) shows that a neural network with just one hidden layer, a suitable nonlinear activation, and enough neurons can approximate any continuous function on a bounded input region to any desired accuracy.
This sounds like "just use one hidden layer with lots of neurons and you're done." But there's a catch. The theorem says a solution exists; it doesn't say how many neurons it takes or that gradient descent will find it. A very wide, shallow network might need an astronomically large number of neurons, while for some families of functions a deeper network can represent the same function with exponentially fewer parameters.
Think of it like building with LEGO. A one-layer-deep structure can theoretically build anything, but you'd need millions of pieces all laid flat. With depth, you can stack and reuse pieces, building the same thing far more efficiently. Deep networks compose features hierarchically: edges → textures → parts → objects.
Fitting a wavy target function (gray). A wide shallow network (orange) and a narrow deep network (teal) both approximate it, but with different efficiency. Adjust neurons and layers to see the tradeoff.
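You can get a feel for the "enough neurons" part without any training loop. The sketch below is a shortcut, not how networks are normally trained: it fixes random hidden ReLU units and solves only for the output weights with least squares. The target sin(2x), the seed, and the widths are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(2 * x)                 # the wavy function to approximate

# One hidden layer of ReLU units with random, fixed slopes and kink positions.
n_hidden = 200
slopes = rng.choice([-1.0, 1.0], size=n_hidden)
kinks = rng.uniform(-np.pi, np.pi, size=n_hidden)

def hidden_features(x, width):
    """ReLU activations of the first `width` hidden units, plus a bias column."""
    H = np.maximum(0, slopes[:width] * (x[:, None] - kinks[:width]))
    return np.hstack([H, np.ones((len(x), 1))])

def fit_rms_error(width):
    """Fit only the output weights by least squares; return the RMS error."""
    H = hidden_features(x, width)
    out_weights, *_ = np.linalg.lstsq(H, target, rcond=None)
    return float(np.sqrt(np.mean((H @ out_weights - target) ** 2)))

for width in (5, 20, 200):
    print(width, fit_rms_error(width))
```

The error shrinks as the hidden layer widens, since each extra unit adds one more "kink" the piecewise-linear fit can use. The comparison with a deep network in the figure is about how many units each design needs, which this sketch does not measure.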
Time to see everything in action. Below is a 2D classification task. You control the network architecture — how many hidden layers, how many neurons per layer, and which activation function to use. Hit "Train" and watch the decision boundary evolve in real time as the network learns.
Choose a dataset and architecture, then train. The background color shows the network's decision boundary. Orange = class A, teal = class B.
Notice how a linear classifier (0 hidden layers) can only draw a straight line — it fails on circles and spirals. Adding just one hidden layer with a few ReLU neurons lets the network bend the boundary. More layers and neurons let it carve out increasingly intricate regions.
With more neurons, the network can represent more complex functions. So why not use a massive network for every problem? Because of overfitting: a network with too much capacity memorizes the training data — including its noise — instead of learning the underlying pattern.
Fitting noisy data with different network sizes. Too few neurons: underfits (can't capture the pattern). Too many neurons: overfits (memorizes every noisy point). Adjust the neuron count to find the sweet spot.
The cs231n notes make a counterintuitive recommendation: don't reduce overfitting by using fewer neurons. Smaller networks have fewer local minima, but many of those minima are poor, and which one training lands in depends heavily on the random initialization. Larger networks have many more local minima, but these tend to be roughly equivalent and have lower loss, so the final loss varies much less between runs. To control overfitting, keep the larger network and add regularization such as L2 weight decay or dropout.
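One common way to rein in a large network without shrinking it is an L2 penalty on the weights. Here is a minimal sketch, assuming the usual form (data loss plus λ times the sum of squared weights); the function names, the value of `lam`, and the example numbers are all illustrative.

```python
import numpy as np

def l2_penalty(weights, lam):
    """lam times the sum of squared weights over all layers (biases excluded)."""
    return lam * sum(float(np.sum(W ** 2)) for W in weights)

def regularized_loss(data_loss, weights, lam=1e-3):
    """Total loss = how badly we fit the data + how large the weights are."""
    return data_loss + l2_penalty(weights, lam)

# Illustrative weights: squared sums are 5.25 and 10.0.
W1 = np.array([[1.0, -2.0],
               [0.5,  0.0]])
W2 = np.array([[3.0, 1.0]])

total = regularized_loss(0.8, [W1, W2], lam=0.01)
print(total)  # 0.8 + 0.01 * 15.25, about 0.9525
```

Raising `lam` makes large weights more expensive, which pushes the network toward smoother decision boundaries while leaving its capacity intact.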
You now know how to set up a neural network: choose the number of layers, the number of neurons per layer, and the activation function. You understand why ReLU is the default, why depth matters, and why regularization beats architecture shrinkage for controlling overfitting.
But architecture is only half the story. A beautifully designed network with random weights is useless. The next step is training: using gradient descent and backpropagation to find good weights. That's where the real magic happens.
| Topic | This Lesson | Next Steps |
|---|---|---|
| Architecture | Layers, neurons, activations | ConvNets, ResNets, Transformers |
| Training | Not covered | Optimization & Backprop |
| Regularization | Mentioned (L2, dropout) | Dropout, BatchNorm, data augmentation |
| Activation Functions | Sigmoid, tanh, ReLU, Leaky ReLU, ELU, GELU | Swish, Mish (in practice) |