EE269 Lecture 23 — Mert Pilanci, Stanford

Neural Networks

From linear classifiers to universal function approximators — one hidden layer at a time.

Prerequisites: Regression (normal equations) + Gradient descent. That's it.
8
Chapters
7+
Simulations
0
Assumed Knowledge

Chapter 0: From Linear to Nonlinear

You're building a classifier for two-class data. You've mastered linear methods: a weight vector w and a bias that define a separating hyperplane wTx + b = 0. For linearly separable data, this works beautifully.

But what about the data below? Two classes arranged in a spiral. No single line (or hyperplane) can separate them. Every linear classifier will misclassify at least a quarter of the points.

This isn't a contrived example. Real-world signals — speech phonemes, EEG states, radar returns — routinely have nonlinear decision boundaries. Linear classifiers hit a hard wall.

The limitation of linearity. A linear model computes f(x) = wTx + b. This defines a hyperplane in feature space. If the true decision boundary is curved, no amount of training data or regularization will fix the fundamental model mismatch. We need nonlinear basis functions.

The idea is simple: transform x through a set of nonlinear functions, then apply a linear classifier to the transformed features. If the transformations are chosen well, data that was tangled in the original space becomes separable in the new space.

Neural networks learn these transformations from data, rather than requiring us to design them by hand.

Linear vs. Nonlinear Boundary

Two spirals that no linear classifier can separate. The orange line shows the best linear attempt. Click "Add Hidden Layer" to see what a neural network can learn.

Linear classifier
Why can't a linear classifier separate two interleaved spirals?

Chapter 1: The Single Hidden Layer Architecture

A single hidden layer neural network maps an input vector x ∈ Rp to an output through two stages. First, M hidden units compute nonlinear features. Then, the output layer combines those features linearly.

Stage 1: Hidden Layer

Each hidden unit m = 1, ..., M computes:

zm = σ(αmT x) = σ(αm0 + αm1x1 + ... + αmpxp)

where αm is the weight vector for hidden unit m (including bias αm0), and σ is a nonlinear activation function (sigmoid, tanh, or ReLU — we'll explore these in Chapter 2).

Think of each hidden unit as a "feature detector." The weights αm define which direction in input space this unit responds to, and the activation function makes the response nonlinear.

Stage 2: Output Layer

For regression (predicting a continuous value), output k is:

yk = gkkT z) = gkk0 + βk1z1 + ... + βkMzM)

where gk(u) = u (the identity function) for regression, and g = softmax for classification.

Two-stage nonlinear feature extraction. The hidden layer creates M new features z1, ..., zM that are nonlinear functions of the input. The output layer is just a linear model on these learned features. The entire network is differentiable, so we can learn both α (feature detectors) and β (output weights) jointly by gradient descent.

Counting Parameters

With p inputs, M hidden units, and K outputs:

LayerParametersCount
Hidden (α)M vectors of size p+1M(p + 1)
Output (β)K vectors of size M+1K(M + 1)
TotalM(p+1) + K(M+1)

For p = 10 inputs, M = 50 hidden units, K = 3 outputs: 50(11) + 3(51) = 550 + 153 = 703 parameters. Small enough to fit on a napkin, yet powerful enough to approximate virtually any smooth function.

Input x ∈ Rp
Raw features (pixels, signal samples, ...)
↓ αmTx
Hidden z ∈ RM
zm = σ(αmTx) — learned nonlinear features
↓ βkTz
Output y ∈ RK
yk = gkkTz) — prediction
Network Architecture Visualizer

Adjust the number of hidden units M to see how the network structure changes. Each line is a learnable weight.

Hidden units M 4
In a single hidden layer network with p=5 inputs, M=20 hidden units, and K=1 output, how many total parameters are there?

Chapter 2: Activation Functions

The activation function σ is what makes neural networks nonlinear. Without it, stacking linear layers just gives another linear function: W2(W1x) = (W2W1)x. The choice of σ has major practical implications for training speed and gradient flow.

Sigmoid

σ(u) = 1 / (1 + e−u)

Maps any real number to (0, 1). Historically the first activation used. Derivative: σ'(u) = σ(u)(1 − σ(u)). Maximum derivative of 0.25 at u = 0. Problem: for large |u|, the derivative is near zero — vanishing gradients. During backprop, gradients get multiplied through layers, and tiny derivatives kill the signal.

Tanh

tanh(u) = (eu − e−u) / (eu + e−u)

Maps to (−1, 1). Zero-centered, so outputs can be positive or negative. Derivative: tanh'(u) = 1 − tanh2(u). Maximum of 1.0 at u = 0. Still suffers vanishing gradients in the tails, but less severely than sigmoid since the max derivative is 4× larger.

ReLU (Rectified Linear Unit)

ReLU(u) = max(0, u)

Dead simple. For positive inputs, the gradient is exactly 1 — no vanishing. For negative inputs, the gradient is 0. This creates dead neurons (units that never activate), but in practice ReLU trains much faster than sigmoid or tanh for deep networks.

Why ReLU won. The constant gradient of 1 for positive inputs means backprop signals pass through without shrinking. This is why deep networks (10+ layers) became practical — the gradient can flow undiminished through many ReLU layers. Sigmoid networks rarely worked beyond 2-3 layers.
ActivationRangeMax |σ'|Vanishing?Use
Sigmoid(0, 1)0.25YesOutput for binary classification
Tanh(−1, 1)1.0ModerateRNNs, small nets
ReLU[0, ∞)1.0NoDefault for hidden layers
Activation Functions & Their Derivatives

See each activation (solid) and its derivative (dashed). Notice how sigmoid's derivative nearly vanishes for |u| > 3.

Function Sigmoid
Why did ReLU largely replace sigmoid in hidden layers of deep networks?

Chapter 3: Universal Approximation

Here's a remarkable fact: a single hidden layer network with enough hidden units can approximate any continuous function on a compact set to arbitrary accuracy. This is the Universal Approximation Theorem (Cybenko 1989, Hornik 1991).

More precisely: for any continuous f: [a,b]p → R and any ε > 0, there exists M hidden units with weights αm and βm such that:

|f(x) − ∑m=1M βm σ(αmT x)| < ε   for all x ∈ [a,b]p

provided σ is a non-constant, bounded, continuous function (sigmoid qualifies).

Intuition: How Does This Work?

Consider 1D. A sigmoid σ(αx + b) is a smooth step function. By making α large, it becomes nearly a sharp step at x = −b/α. Now, two sigmoids with opposite signs create a "bump":

bump(x) = σ(α(x − a)) − σ(α(x − b))

This bump is approximately 1 on [a, b] and 0 elsewhere. By summing many bumps of different heights and positions, we can approximate any function — like building a histogram that matches the target function's shape.

Existence, not efficiency. The theorem says such a network exists. It says nothing about how many hidden units you need (could be astronomically many), or whether gradient descent can find the weights. It's a theoretical guarantee, not a practical recipe. Real networks work well for reasons beyond this theorem.

Worked Example: Approximating |x|

The function f(x) = |x| is continuous but not differentiable at 0. With M = 2 ReLU units:

h(x) = ReLU(x) + ReLU(−x) = max(0, x) + max(0, −x) = |x|

Two hidden units, exact representation. With smooth activations like sigmoid, we'd need more units for the sharp corner at zero, but we can get arbitrarily close.

Universal Approximation Demo

A 1D target function (teal) is approximated by a sum of sigmoid bumps (orange). Add more hidden units to get a tighter fit.

Hidden units M 4
The Universal Approximation Theorem guarantees that a single hidden layer network can:

Chapter 4: Backpropagation

We have a network with parameters θ = {α1, ..., αM, β}. Training means finding θ that minimizes a loss L(θ). Gradient descent needs ∂L/∂θ for every parameter. Backpropagation computes all these gradients efficiently using the chain rule, in one backward pass through the network.

Setup: Concrete 2-Layer Network

Let's derive backprop for a concrete case. One input x, one hidden layer with M units, one output ŷ. Regression with MSE loss.

zm = σ(αm x + bm)   for m = 1,...,M
ŷ = ∑m=1M βm zm + β0
L = ½(y − ŷ)2

Forward Pass (Compute Everything)

With concrete numbers. Let x = 2.0, y = 1.0 (the target). Two hidden units (M = 2). Sigmoid activation. Initial weights:

ParameterValue
α1, b10.5, −0.3
α2, b2−0.4, 0.2
β1, β2, β00.6, −0.8, 0.1

Hidden unit 1: α1x + b1 = 0.5(2) − 0.3 = 0.7. σ(0.7) = 1/(1 + e−0.7) = 0.668. So z1 = 0.668.

Hidden unit 2: α2x + b2 = −0.4(2) + 0.2 = −0.6. σ(−0.6) = 1/(1 + e0.6) = 0.354. So z2 = 0.354.

Output: ŷ = 0.6(0.668) + (−0.8)(0.354) + 0.1 = 0.401 − 0.283 + 0.1 = 0.218.

Loss: L = 0.5(1.0 − 0.218)2 = 0.5(0.782)2 = 0.306.

Backward Pass (Chain Rule)

Now we compute gradients, starting from the loss and working backwards. This is why it's called backpropagation.

Step 1: Loss → output.

∂L/∂ŷ = −(y − ŷ) = −(1.0 − 0.218) = −0.782

Step 2: Output → output weights.

∂L/∂βm = (∂L/∂ŷ)(∂ŷ/∂βm) = (−0.782) · zm
∂L/∂β1 = −0.782 × 0.668 = −0.522
∂L/∂β2 = −0.782 × 0.354 = −0.277

Step 3: Output → hidden activations.

∂L/∂zm = (∂L/∂ŷ) · βm
∂L/∂z1 = −0.782 × 0.6 = −0.469
∂L/∂z2 = −0.782 × (−0.8) = 0.626

Step 4: Through the activation function. Since zm = σ(am) and σ'(a) = σ(a)(1 − σ(a)):

∂L/∂am = (∂L/∂zm) · σ'(am)
σ'(a1) = 0.668(1 − 0.668) = 0.222
∂L/∂a1 = −0.469 × 0.222 = −0.104
σ'(a2) = 0.354(1 − 0.354) = 0.229
∂L/∂a2 = 0.626 × 0.229 = 0.143

Step 5: Hidden activations → input weights. Since am = αmx + bm:

∂L/∂αm = (∂L/∂am) · x
∂L/∂α1 = −0.104 × 2.0 = −0.208
∂L/∂α2 = 0.143 × 2.0 = 0.287
Backprop is just the chain rule, applied systematically. Each layer's gradient depends on the gradient from the layer above (the "error signal" flowing backwards) multiplied by the local derivative. Forward pass: compute and cache all intermediate values. Backward pass: multiply cached values with incoming gradients, layer by layer. Cost: one backward pass = roughly 2× one forward pass.

The General Pattern

Forward: compute am, zm, ŷ, L
Cache all intermediate values
Backward: ∂L/∂ŷ
−(y − ŷ) for MSE
↓ × zm
∂L/∂βm
Gradient for output weights
↓ × βm × σ'
∂L/∂αm
Gradient for input weights
Step-by-Step Forward + Backward Pass

Click "Step" to walk through the forward pass, then the backward pass. Watch numbers flow through the network.

Ready — click Step
During backpropagation, the gradient of the loss with respect to a hidden weight αm requires:

Chapter 5: Loss Functions

The loss function L measures how wrong the network's prediction is. Different tasks need different losses. The choice of loss also determines the output activation function.

Regression: Mean Squared Error

For predicting a continuous value y ∈ R:

LMSE = (1/N) ∑i=1N (yi − ŷi)2

Output activation: g(u) = u (identity). The gradient is simple: ∂L/∂ŷi = −2(yi − ŷi)/N. MSE penalizes large errors quadratically, making it sensitive to outliers.

Binary Classification: Binary Cross-Entropy

For predicting y ∈ {0, 1}. The output ŷ = σ(βTz) ∈ (0,1) represents P(y = 1 | x).

LBCE = −(1/N) ∑i=1N [yi log(ŷi) + (1 − yi) log(1 − ŷi)]

When y = 1 and ŷ is near 0, log(ŷ) → −∞, creating a huge loss. This forces the network to be confident about correct predictions. The gradient has a beautiful form: ∂L/∂ŷ = (ŷ − y)/(ŷ(1 − ŷ)), which cancels the sigmoid derivative during backprop, leaving just ŷ − y.

Multi-Class Classification: Softmax + Cross-Entropy

For K classes, the output layer uses softmax:

k = exp(ak) / ∑j=1K exp(aj)

This converts K raw scores (logits) a1,...,aK into a valid probability distribution: all ŷk > 0 and they sum to 1.

LCE = −(1/N) ∑i=1Nk=1K yik log(ŷik)

where yik is 1 if sample i has true class k, 0 otherwise (one-hot encoding).

The loss-activation pairing.
• Regression: identity output + MSE
• Binary classification: sigmoid output + binary cross-entropy
• Multi-class: softmax output + cross-entropy
In each case, the gradient at the output simplifies to ŷ − y. This is not a coincidence — these are the canonical link functions from generalized linear models.

Worked Example: Softmax

Three classes. Raw logits: a = [2.0, 1.0, 0.5]. Compute softmax:

exp(a) = [7.389, 2.718, 1.649]. Sum = 11.756.

ŷ = [0.629, 0.231, 0.140].

True class is k = 0 (one-hot: [1, 0, 0]). Cross-entropy loss:

L = −log(0.629) = 0.464.

If the network were more confident (logits [5.0, 1.0, 0.5]), ŷ0 would be 0.973, and L = −log(0.973) = 0.027. Higher confidence in the correct class → lower loss.

Softmax Visualization

Adjust the three logits and see how softmax converts them to probabilities. The loss is computed for the highlighted class.

Logit a1 2.0
Logit a2 1.0
Logit a3 0.5
For a 3-class classification problem, the output activation should be:

Chapter 6: SGD & Training

We know how to compute gradients (backprop) and what to minimize (loss function). Now: how do we actually run the optimization?

Batch Gradient Descent

Compute the gradient using ALL N training samples, then update:

θt+1 = θt − η · (1/N) ∑i=1Nθ L(xi, yi; θt)

where η is the learning rate. This gives the exact gradient but costs O(N) per step. For N = 1 million images, each update is extremely expensive.

Stochastic Gradient Descent (SGD)

Use a single random sample to estimate the gradient:

θt+1 = θt − η · ∇θ L(xit, yit; θt)

where it is randomly chosen. The gradient estimate is noisy (high variance) but unbiased: E[∇L(xi)] = (1/N)∑∇L(xi). Each update is O(1) in dataset size, so we can make many more updates per second.

Mini-Batch SGD

The practical compromise. Use a random subset B of B samples:

θt+1 = θt − η · (1/B) ∑i ∈ Bθ L(xi, yi; θt)

Typical batch sizes: B = 32, 64, 128, 256. The variance of the gradient estimate scales as σ2/B, so increasing B reduces noise. But the cost also scales linearly. B = 32 is a sweet spot for many problems.

SGD noise is a feature, not a bug. The stochasticity helps escape sharp local minima and saddle points. Networks trained with small batches often generalize better than those trained with very large batches. The noise acts as implicit regularization, steering the optimizer toward flatter minima that transfer better to unseen data.

Learning Rate: The Most Important Hyperparameter

Too large: the updates overshoot the minimum, loss oscillates or diverges. Too small: training takes forever and gets stuck in poor local minima.

Common schedule: start with η = 0.01 or 0.001, reduce by a factor of 10 when loss plateaus. More advanced: Adam optimizer automatically adapts per-parameter learning rates.

An Epoch of Training

One epoch = one pass through the entire training set. With N = 50,000 samples and B = 50, one epoch = 1,000 mini-batch updates. Typical training runs: 50–300 epochs.

SGD on a Loss Surface

Watch SGD (noisy, orange) vs. batch gradient descent (smooth, teal) descend the same loss surface. SGD bounces but gets there faster per unit of compute.

Learning rate η 0.050
Batch size B 1
Compared to batch gradient descent, mini-batch SGD with B=32:

Chapter 7: Mastery

You've built the complete neural network story from the ground up: the architecture (Chapter 1), the nonlinearity (Chapter 2), the theoretical guarantee (Chapter 3), the training algorithm (Chapters 4-6). Let's consolidate.

ComponentFormulaRole
Hidden layerzm = σ(αmTx)Learned nonlinear features
Output (regression)ŷ = βTzLinear combination of features
Output (classification)ŷ = softmax(βTz)Class probabilities
MSE loss½(y − ŷ)2Regression objective
Cross-entropy loss−y log(ŷ)Classification objective
BackpropChain rule layer by layerCompute all gradients
SGD updateθ ← θ − η∇LMove toward minimum

What Comes Next

Single hidden layer networks are powerful in theory, but in practice, deep networks (many layers) learn hierarchical features more efficiently. The next lecture explores why depth helps and introduces convolutional neural networks.

TopicThis LectureNext Lecture
Depth1 hidden layerL hidden layers
ArchitectureFully connectedConvolutional (CNNs)
Regularization(not covered)BatchNorm, Dropout
ConnectionsFeed-forward onlyResidual (skip) connections
Related lessons.
Lecture 22: Adaptive Filters — LMS is SGD on a linear model
Lecture 24: Deep Learning & CNNs — depth + convolution
Lecture 25: Attention & Transformers — the modern attention mechanism

"What I cannot create, I do not understand." — Richard Feynman