EE269 Lecture 23 — Neural Networks

Chapter 0: From Linear to Nonlinear

You're building a classifier for two-class data. You've mastered linear methods: a weight vector w and a bias b that define a separating hyperplane w^Tx + b = 0. For linearly separable data, this works beautifully.

But what about the data below? Two classes arranged in interleaved spirals. No single line (or hyperplane) can separate them, so every linear classifier misclassifies a large fraction of the points.

This isn't a contrived example. Real-world signals — speech phonemes, EEG states, radar returns — routinely have nonlinear decision boundaries. Linear classifiers hit a hard wall.

The limitation of linearity. A linear model computes f(x) = w^Tx + b. This defines a hyperplane in feature space. If the true decision boundary is curved, no amount of training data or regularization will fix the fundamental model mismatch. We need nonlinear basis functions.

The idea is simple: transform x through a set of nonlinear functions, then apply a linear classifier to the transformed features. If the transformations are chosen well, data that was tangled in the original space becomes separable in the new space.
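As a minimal sketch of this idea, consider a simpler case than the spirals: two classes on concentric rings. The toy data, the hand-picked feature x₁² + x₂², and the threshold below are illustrative assumptions, not part of the lecture's example.

```python
import math

# Toy data (assumed for illustration): class 0 on a circle of radius 1,
# class 1 on a circle of radius 3. No straight line separates the two rings.
def ring(radius, n):
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner = ring(1.0, 12)  # class 0
outer = ring(3.0, 12)  # class 1

# Hand-designed nonlinear feature: squared distance from the origin.
def phi(point):
    x1, x2 = point
    return x1 * x1 + x2 * x2

# In the transformed space, a linear rule (a threshold on phi) separates the classes.
threshold = 4.0

def classify(point):
    return 1 if phi(point) > threshold else 0

errors = (sum(classify(p) != 0 for p in inner)
          + sum(classify(p) != 1 for p in outer))
print("misclassified points:", errors)
```

Here the transformation was chosen by hand because the geometry is obvious. For spirals no such simple feature is apparent, which is the motivation for learning the features instead.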

Neural networks learn these transformations from data, rather than requiring us to design them by hand.

Figure: Linear vs. Nonlinear Boundary. Two spirals that no linear classifier can separate. The orange line shows the best linear attempt; adding a hidden layer shows what a neural network can learn.

Why can't a linear classifier separate two interleaved spirals?

• Not enough training data
• The decision boundary is nonlinear: no single hyperplane can separate the classes
• The learning rate is too small

Chapter 1: The Single Hidden Layer Architecture

A single hidden layer neural network maps an input vector x ∈ R^p to an output through two stages. First, M hidden units compute nonlinear features. Then, the output layer combines those features linearly.

Stage 1: Hidden Layer

Each hidden unit m = 1, ..., M computes:

z_m = σ(α_m^T x) = σ(α_m0 + α_m1x₁ + ... + α_mpx_p)

where α_m is the weight vector for hidden unit m (including bias α_m0), and σ is a nonlinear activation function (sigmoid, tanh, or ReLU — we'll explore these in Chapter 2).

Think of each hidden unit as a "feature detector." The weights α_m define which direction in input space this unit responds to, and the activation function makes the response nonlinear.

Stage 2: Output Layer

Output k = 1, ..., K is:

y_k = g_k(β_k^T z) = g_k(β_k0 + β_k1z₁ + ... + β_kMz_M)

where g_k(u) = u (the identity function) for regression (predicting a continuous value). For classification, g is the softmax, which normalizes across all K outputs rather than acting on each output separately (see Chapter 5).

Two-stage nonlinear feature extraction. The hidden layer creates M new features z₁, ..., z_M that are nonlinear functions of the input. The output layer is just a linear model on these learned features. The entire network is differentiable, so we can learn both α (feature detectors) and β (output weights) jointly by gradient descent.
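The two stages can be sketched in a few lines of plain Python. This is a minimal illustration assuming a sigmoid hidden layer and an identity output; the function name `forward`, the bias-first weight layout, and the numeric weights are hypothetical choices made for this example.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, alpha, beta, g=lambda u: u):
    """Single-hidden-layer forward pass.

    x     : list of p inputs
    alpha : M rows, each [alpha_m0, alpha_m1, ..., alpha_mp] (bias first)
    beta  : K rows, each [beta_k0, beta_k1, ..., beta_kM] (bias first)
    g     : function applied to each output (identity for regression)
    """
    # Stage 1: hidden features z_m = sigma(alpha_m^T x)
    z = [sigmoid(a[0] + sum(w * xi for w, xi in zip(a[1:], x))) for a in alpha]
    # Stage 2: outputs y_k = g(beta_k^T z)
    y = [g(b[0] + sum(w * zm for w, zm in zip(b[1:], z))) for b in beta]
    return z, y

# Hypothetical weights: p = 2 inputs, M = 3 hidden units, K = 1 output.
alpha = [[0.1, 0.5, -0.2], [0.0, -0.3, 0.8], [-0.4, 0.2, 0.2]]
beta = [[0.05, 1.0, -1.0, 0.5]]
z, y = forward([1.0, 2.0], alpha, beta)
print("hidden features:", z)
print("output:", y)
```

The per-output `g` covers the identity and sigmoid cases; softmax would need to be applied to the whole output vector at once.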

Counting Parameters

With p inputs, M hidden units, and K outputs:

Layer	Parameters	Count
Hidden (α)	M vectors of size p+1	M(p + 1)
Output (β)	K vectors of size M+1	K(M + 1)
Total		M(p+1) + K(M+1)

For p = 10 inputs, M = 50 hidden units, K = 3 outputs: 50(11) + 3(51) = 550 + 153 = 703 parameters. That is tiny by modern standards, yet the same architecture with enough hidden units can approximate any continuous function on a compact set (Chapter 3).
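The count is easy to check in code. The helper name `count_parameters` is a hypothetical one used only for this sketch.

```python
def count_parameters(p, M, K):
    """Parameters in a single-hidden-layer network, biases included."""
    hidden = M * (p + 1)   # M weight vectors of size p + 1
    output = K * (M + 1)   # K weight vectors of size M + 1
    return hidden + output

print(count_parameters(p=10, M=50, K=3))  # 703, matching the worked example
```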

Input x ∈ R^p: raw features (pixels, signal samples, ...)
↓ α_m^T x
Hidden z ∈ R^M: z_m = σ(α_m^T x), learned nonlinear features
↓ β_k^T z
Output y ∈ R^K: y_k = g_k(β_k^T z), prediction

Figure: Network Architecture Visualizer. The network structure for a chosen number of hidden units M (M = 4 shown). Each line is a learnable weight.

In a single hidden layer network with p=5 inputs, M=20 hidden units, and K=1 output, how many total parameters are there?

• 100
• 141, since M(p+1) + K(M+1) = 20(6) + 1(21)
• 120

Chapter 2: Activation Functions

The activation function σ is what makes neural networks nonlinear. Without it, stacking linear layers just gives another linear function: W₂(W₁x) = (W₂W₁)x. The choice of σ has major practical implications for training speed and gradient flow.
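A quick numerical check of this collapse, using two small weight matrices chosen only for illustration (the helper names `matvec` and `matmul` are hypothetical):

```python
def matvec(W, x):
    """Matrix-vector product W x for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product A B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Illustrative weights (not from the lecture).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [-2.0, 3.0]]
x = [3.0, 4.0]

two_layers = matvec(W2, matvec(W1, x))   # W2 (W1 x), no activation in between
one_layer = matvec(matmul(W2, W1), x)    # (W2 W1) x, a single linear map
print(two_layers, one_layer)
```

Both expressions give the same vector, so without σ the second layer adds no representational power.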

Sigmoid

σ(u) = 1 / (1 + e^−u)

Maps any real number to (0, 1). Historically the standard activation in early neural networks. Derivative: σ'(u) = σ(u)(1 − σ(u)), with a maximum of 0.25 at u = 0. Problem: for large |u|, the derivative is near zero (vanishing gradients). During backprop, gradients are multiplied through the layers, so a product of factors no larger than 0.25 shrinks the signal quickly.

Tanh

tanh(u) = (e^u − e^−u) / (e^u + e^−u)

Maps to (−1, 1). Zero-centered, so outputs can be positive or negative. Derivative: tanh'(u) = 1 − tanh²(u). Maximum of 1.0 at u = 0. Still suffers vanishing gradients in the tails, but less severely than sigmoid since the max derivative is 4× larger.

ReLU (Rectified Linear Unit)

ReLU(u) = max(0, u)

Simple and cheap to compute. For positive inputs, the gradient is exactly 1, so it does not vanish. For negative inputs, the gradient is 0. A unit whose pre-activation is negative for every training input receives no gradient and stops learning (a "dead neuron"), but in practice ReLU trains much faster than sigmoid or tanh in deep networks.

Why ReLU won. The constant gradient of 1 for positive inputs means backprop signals pass through active units without shrinking. This is one key reason deep networks (10+ layers) became practical: the gradient can flow through many ReLU layers without the repeated attenuation that sigmoid causes. Sigmoid networks rarely trained well beyond 2-3 layers.

Activation	Range	Max |σ'|	Vanishing?	Use
Sigmoid	(0, 1)	0.25	Yes	Output for binary classification
Tanh	(−1, 1)	1.0	Moderate	RNNs, small nets
ReLU	[0, ∞)	1.0	No	Default for hidden layers
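The three activations and their derivatives can be written directly from the formulas above. This is a small sketch; taking the ReLU derivative at exactly u = 0 to be 0 is a convention assumed here, since the function is not differentiable at that point.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def d_sigmoid(u):
    s = sigmoid(u)
    return s * (1.0 - s)

def d_tanh(u):
    return 1.0 - math.tanh(u) ** 2

def relu(u):
    return max(0.0, u)

def d_relu(u):
    # Convention (assumed): derivative is 0 at u = 0.
    return 1.0 if u > 0 else 0.0

# Compare derivatives in the center and in the tails.
for u in (-5.0, 0.0, 5.0):
    print(u, round(d_sigmoid(u), 4), round(d_tanh(u), 4), d_relu(u))
```

At u = 5 the sigmoid and tanh derivatives are both far below their maxima, while the ReLU derivative is still 1.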

Figure: Activation Functions & Their Derivatives. Each activation (solid) and its derivative (dashed). Notice how sigmoid's derivative nearly vanishes for |u| > 3.

Why did ReLU largely replace sigmoid in hidden layers of deep networks?

• ReLU has a constant gradient of 1 for positive inputs, avoiding vanishing gradients in deep networks
• ReLU is smoother than sigmoid
• ReLU outputs are bounded, preventing exploding activations

Chapter 3: Universal Approximation

Here's a remarkable fact: a single hidden layer network with enough hidden units can approximate any continuous function on a compact set to arbitrary accuracy. This is the Universal Approximation Theorem (Cybenko 1989, Hornik 1991).

More precisely: for any continuous f: [a,b]^p → R and any ε > 0, there exist a number of hidden units M and weights α_m and β_m such that:

|f(x) − ∑_m=1^M β_m σ(α_m^T x)| < ε for all x ∈ [a,b]^p

provided σ is a non-constant, bounded, continuous function (sigmoid qualifies).

Intuition: How Does This Work?

Consider 1D. A sigmoid σ(αx + c) is a smooth step function. By making α large, it becomes nearly a sharp step at x = −c/α. Now, the difference of two shifted sigmoids creates a "bump":

bump(x) = σ(α(x − a)) − σ(α(x − b))

For a < b and large α, this bump is approximately 1 on [a, b] and 0 elsewhere. By summing many bumps of different heights and positions, we can approximate any continuous function, like building a histogram that matches the target function's shape.
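The histogram construction can be sketched as follows. The steepness values, the choice of sin on [0, π] as the target, and the helper name `approximate` are assumptions made for illustration. Each bump uses two sigmoids, so n bumps correspond to M = 2n hidden units.

```python
import math

def sigmoid(u):
    # Numerically safe form for large |u|.
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def bump(x, a, b, alpha=50.0):
    """Difference of two shifted sigmoids: about 1 on [a, b], about 0 elsewhere."""
    return sigmoid(alpha * (x - a)) - sigmoid(alpha * (x - b))

def approximate(f, x, lo, hi, n_bumps, alpha=200.0):
    """Sum of bumps on equal-width bins, each scaled by f at the bin center."""
    width = (hi - lo) / n_bumps
    total = 0.0
    for k in range(n_bumps):
        a = lo + k * width
        b = a + width
        total += f((a + b) / 2.0) * bump(x, a, b, alpha)
    return total

print(bump(0.5, 0.0, 1.0), bump(1.5, 0.0, 1.0))
for x in (1.0, 2.0):
    print(x, math.sin(x), approximate(math.sin, x, 0.0, math.pi, n_bumps=50))
```

The steepness must be large relative to 1/width for the bumps to look like bins; increasing the number of bumps tightens the fit at the cost of more hidden units.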

Existence, not efficiency. The theorem says such a network exists. It says nothing about how many hidden units you need (could be astronomically many), or whether gradient descent can find the weights. It's a theoretical guarantee, not a practical recipe. Real networks work well for reasons beyond this theorem.

Worked Example: Approximating |x|

The function f(x) = |x| is continuous but not differentiable at 0. With M = 2 ReLU units:

h(x) = ReLU(x) + ReLU(−x) = max(0, x) + max(0, −x) = |x|

Two hidden units, exact representation. With smooth activations like sigmoid, we'd need more units for the sharp corner at zero, but we can get arbitrarily close.
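Written as a network, this is M = 2 hidden units with input weights +1 and −1, zero biases, and output weights both equal to 1. A short check, with `abs_via_relu` as a hypothetical name for the resulting function:

```python
def relu(u):
    return max(0.0, u)

def abs_via_relu(x):
    # Hidden unit 1 responds to positive x, hidden unit 2 to negative x.
    z1 = relu(1.0 * x)
    z2 = relu(-1.0 * x)
    # Output layer: both output weights are 1, output bias is 0.
    return 1.0 * z1 + 1.0 * z2

for x in (-2.5, -1.0, 0.0, 0.3, 4.0):
    print(x, abs_via_relu(x))
```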

Figure: Universal Approximation Demo. A 1D target function (teal) is approximated by a sum of sigmoid bumps (orange), shown for M = 4 hidden units. Adding more hidden units gives a tighter fit.

The Universal Approximation Theorem guarantees that a single hidden layer network can:

• Always be trained efficiently with gradient descent
• Approximate any continuous function to arbitrary accuracy, given enough hidden units
• Generalize perfectly to unseen data

Chapter 4: Backpropagation

We have a network with parameters θ = {α₁, ..., α_M, β}. Training means finding θ that minimizes a loss L(θ). Gradient descent needs ∂L/∂θ for every parameter. Backpropagation computes all these gradients efficiently using the chain rule, in one backward pass through the network.

Setup: Concrete 2-Layer Network

Let's derive backprop for a concrete case. One input x, one hidden layer with M units, one output ŷ. Regression with MSE loss.

z_m = σ(α_m x + b_m) for m = 1,...,M

ŷ = ∑_m=1^M β_m z_m + β₀

L = ½(y − ŷ)²

Forward Pass (Compute Everything)

With concrete numbers. Let x = 2.0, y = 1.0 (the target). Two hidden units (M = 2). Sigmoid activation. Initial weights:

Parameter	Value
α₁, b₁	0.5, −0.3
α₂, b₂	−0.4, 0.2
β₁, β₂, β₀	0.6, −0.8, 0.1

Hidden unit 1: α₁x + b₁ = 0.5(2) − 0.3 = 0.7. σ(0.7) = 1/(1 + e^−0.7) = 0.668. So z₁ = 0.668.

Hidden unit 2: α₂x + b₂ = −0.4(2) + 0.2 = −0.6. σ(−0.6) = 1/(1 + e^0.6) = 0.354. So z₂ = 0.354.

Output: ŷ = 0.6(0.668) + (−0.8)(0.354) + 0.1 = 0.401 − 0.283 + 0.1 = 0.218.

Loss: L = 0.5(1.0 − 0.218)² = 0.5(0.782)² = 0.306.
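The forward pass can be reproduced in a few lines using the parameter values from the table. The variable names are choices made for this sketch. Because the code does not round intermediate values, its results can differ from the hand calculation in the third decimal place.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Values from the worked example.
x, y = 2.0, 1.0
alpha = [0.5, -0.4]   # input weights alpha_1, alpha_2
b = [-0.3, 0.2]       # hidden biases b_1, b_2
beta = [0.6, -0.8]    # output weights beta_1, beta_2
beta0 = 0.1           # output bias

# Forward pass: pre-activations, hidden activations, output, loss.
a = [alpha[m] * x + b[m] for m in range(2)]
z = [sigmoid(am) for am in a]
y_hat = sum(beta[m] * z[m] for m in range(2)) + beta0
loss = 0.5 * (y - y_hat) ** 2

print("a =", a)
print("z =", z)
print("y_hat =", y_hat)
print("loss =", loss)
```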

Backward Pass (Chain Rule)

Now we compute gradients, starting from the loss and working backwards. This is why it's called backpropagation.

Step 1: Loss → output.

∂L/∂ŷ = −(y − ŷ) = −(1.0 − 0.218) = −0.782

Step 2: Output → output weights.

∂L/∂β_m = (∂L/∂ŷ)(∂ŷ/∂β_m) = (−0.782) · z_m

∂L/∂β₁ = −0.782 × 0.668 = −0.522

∂L/∂β₂ = −0.782 × 0.354 = −0.277

Step 3: Output → hidden activations.

∂L/∂z_m = (∂L/∂ŷ) · β_m

∂L/∂z₁ = −0.782 × 0.6 = −0.469

∂L/∂z₂ = −0.782 × (−0.8) = 0.626

Step 4: Through the activation function. Write a_m = α_m x + b_m for the pre-activation, so that z_m = σ(a_m) and σ'(a) = σ(a)(1 − σ(a)):

∂L/∂a_m = (∂L/∂z_m) · σ'(a_m)

σ'(a₁) = 0.668(1 − 0.668) = 0.222

∂L/∂a₁ = −0.469 × 0.222 = −0.104

σ'(a₂) = 0.354(1 − 0.354) = 0.229

∂L/∂a₂ = 0.626 × 0.229 = 0.143

Step 5: Pre-activations → input weights. Since a_m = α_mx + b_m:

∂L/∂α_m = (∂L/∂a_m) · x

∂L/∂α₁ = −0.104 × 2.0 = −0.208

∂L/∂α₂ = 0.143 × 2.0 = 0.286

Backprop is just the chain rule, applied systematically. Each layer's gradient depends on the gradient from the layer above (the "error signal" flowing backwards) multiplied by the local derivative. Forward pass: compute and cache all intermediate values. Backward pass: multiply cached values with incoming gradients, layer by layer. Cost: one backward pass = roughly 2× one forward pass.
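A compact way to gain confidence in a hand derivation is to compare it against finite differences. The sketch below implements the five steps for the 2-unit example and checks every gradient numerically. The function names and the flat parameter ordering are assumptions made for this example. The bias gradients, which the steps above do not list, follow from the same chain rule (∂L/∂b_m = ∂L/∂a_m and ∂L/∂β₀ = ∂L/∂ŷ) and are included for completeness.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def loss_fn(params, x, y):
    a1, b1, a2, b2, w1, w2, w0 = params
    y_hat = w1 * sigmoid(a1 * x + b1) + w2 * sigmoid(a2 * x + b2) + w0
    return 0.5 * (y - y_hat) ** 2

def backprop(params, x, y):
    """Gradients in the same order as params:
    [alpha_1, b_1, alpha_2, b_2, beta_1, beta_2, beta_0]."""
    a1, b1, a2, b2, w1, w2, w0 = params
    # Forward pass (cache z1, z2, y_hat).
    z1 = sigmoid(a1 * x + b1)
    z2 = sigmoid(a2 * x + b2)
    y_hat = w1 * z1 + w2 * z2 + w0
    # Step 1: loss -> output.
    d_yhat = -(y - y_hat)
    # Step 2: output weights (and output bias).
    d_w1, d_w2, d_w0 = d_yhat * z1, d_yhat * z2, d_yhat
    # Step 3: hidden activations.
    d_z1, d_z2 = d_yhat * w1, d_yhat * w2
    # Step 4: through the sigmoid, using sigma' = z (1 - z).
    d_pre1 = d_z1 * z1 * (1.0 - z1)
    d_pre2 = d_z2 * z2 * (1.0 - z2)
    # Step 5: input weights (times x) and hidden biases (times 1).
    return [d_pre1 * x, d_pre1, d_pre2 * x, d_pre2, d_w1, d_w2, d_w0]

params = [0.5, -0.3, -0.4, 0.2, 0.6, -0.8, 0.1]
x, y = 2.0, 1.0
grads = backprop(params, x, y)

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(len(params)):
    up, dn = params[:], params[:]
    up[i] += eps
    dn[i] -= eps
    numeric.append((loss_fn(up, x, y) - loss_fn(dn, x, y)) / (2 * eps))

for g, n in zip(grads, numeric):
    print(round(g, 4), round(n, 4))
```

The two columns should agree to several decimal places; a mismatch in one row points to the step where the derivation went wrong.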

The General Pattern

Forward: compute a_m, z_m, ŷ, L and cache all intermediate values
↓
Backward: ∂L/∂ŷ = −(y − ŷ) for MSE
→ × z_m gives ∂L/∂β_m (gradient for output weights)
→ × β_m × σ'(a_m) × x gives ∂L/∂α_m (gradient for input weights)

Figure: Step-by-Step Forward + Backward Pass. A walk through the forward pass, then the backward pass, showing the numbers flowing through the network.

During backpropagation, the gradient of the loss with respect to a hidden weight α_m requires:

• Only the local activation z_m
• Only the output weight β_m
• The chain of derivatives from the loss through the output layer, activation derivative, and the input x

Chapter 5: Loss Functions

The loss function L measures how wrong the network's prediction is. Different tasks need different losses. The choice of loss also determines the output activation function.

Regression: Mean Squared Error

For predicting a continuous value y ∈ R:

L_MSE = (1/N) ∑_i=1^N (y_i − ŷ_i)²

Output activation: g(u) = u (identity). The gradient is simple: ∂L/∂ŷ_i = −2(y_i − ŷ_i)/N. MSE penalizes large errors quadratically, making it sensitive to outliers.

Binary Classification: Binary Cross-Entropy

For predicting y ∈ {0, 1}. The output ŷ = σ(β^Tz) ∈ (0,1) represents P(y = 1 | x).

L_BCE = −(1/N) ∑_i=1^N [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)]

When y = 1 and ŷ is near 0, log(ŷ) → −∞, creating a huge loss. Confident wrong predictions are therefore penalized heavily. For a single sample the gradient has a convenient form: ∂L/∂ŷ = (ŷ − y)/(ŷ(1 − ŷ)). The denominator cancels the sigmoid derivative ŷ(1 − ŷ) during backprop, leaving just ŷ − y as the gradient with respect to the pre-activation β^Tz.
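The cancellation can be checked numerically for a single sample. In this sketch `u` stands for the output pre-activation β^Tz, and the specific values of `u` and `y` are arbitrary choices for illustration.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def bce(y, y_hat):
    """Binary cross-entropy for one sample."""
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

def grad_via_chain_rule(y, u):
    y_hat = sigmoid(u)
    dL_dyhat = (y_hat - y) / (y_hat * (1.0 - y_hat))  # loss derivative
    dyhat_du = y_hat * (1.0 - y_hat)                  # sigmoid derivative
    return dL_dyhat * dyhat_du

u, y = 0.8, 1.0                      # arbitrary pre-activation and label
chain = grad_via_chain_rule(y, u)
simplified = sigmoid(u) - y          # the claimed closed form

# Independent check with central finite differences on the loss.
eps = 1e-6
numeric = (bce(y, sigmoid(u + eps)) - bce(y, sigmoid(u - eps))) / (2 * eps)
print(chain, simplified, numeric)
```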

Multi-Class Classification: Softmax + Cross-Entropy

For K classes, the output layer uses softmax:

ŷ_k = exp(a_k) / ∑_j=1^K exp(a_j)

This converts K raw scores (logits) a₁,...,a_K into a valid probability distribution: all ŷ_k > 0 and they sum to 1.

L_CE = −(1/N) ∑_i=1^N ∑_k=1^K y_ik log(ŷ_ik)

where y_ik is 1 if sample i has true class k, 0 otherwise (one-hot encoding).

The loss-activation pairing.
• Regression: identity output + MSE
• Binary classification: sigmoid output + binary cross-entropy
• Multi-class: softmax output + cross-entropy
In each case, the gradient with respect to the output pre-activation simplifies to ŷ − y (for MSE, using the per-sample ½(y − ŷ)² form from Chapter 4). This is not a coincidence: these are the canonical link functions from generalized linear models.

Worked Example: Softmax

Three classes. Raw logits: a = [2.0, 1.0, 0.5]. Compute softmax:

exp(a) = [7.389, 2.718, 1.649]. Sum = 11.756.

ŷ = [0.629, 0.231, 0.140].

True class is k = 1 (one-hot: [1, 0, 0]). Cross-entropy loss:

L = −log(0.629) = 0.464.

If the network were more confident (logits [5.0, 1.0, 0.5]), ŷ₁ would be 0.971, and L = −log(0.971) = 0.029. Higher confidence in the correct class → lower loss.
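These numbers can be reproduced with a short script. Subtracting the largest logit before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result. Python lists are 0-indexed, so the first class has index 0 in the code.

```python
import math

def softmax(logits):
    m = max(logits)                            # stability shift (assumed addition)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Cross-entropy for one sample with a one-hot target."""
    return -math.log(probs[true_index])

probs = softmax([2.0, 1.0, 0.5])
loss = cross_entropy(probs, 0)

confident = softmax([5.0, 1.0, 0.5])
confident_loss = cross_entropy(confident, 0)

print([round(p, 3) for p in probs], round(loss, 3))
print(round(confident[0], 3), round(confident_loss, 3))
```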

Figure: Softmax Visualization. Softmax converts three logits (a₁ = 2.0, a₂ = 1.0, a₃ = 0.5 shown) to probabilities. The loss is computed for the highlighted class.

For a 3-class classification problem, the output activation should be:

• Sigmoid on each output independently
• ReLU on each output
• Softmax across all outputs, so they form a valid probability distribution

Chapter 6: SGD & Training

We know how to compute gradients (backprop) and what to minimize (loss function). Now: how do we actually run the optimization?

Batch Gradient Descent

Compute the gradient using ALL N training samples, then update:

θ_t+1 = θ_t − η · (1/N) ∑_i=1^N ∇_θ L(x_i, y_i; θ_t)

where η is the learning rate. This gives the exact gradient but costs O(N) per step. For N = 1 million images, each update is extremely expensive.

Stochastic Gradient Descent (SGD)

Use a single random sample to estimate the gradient:

θ_t+1 = θ_t − η · ∇_θ L(x_{i_t}, y_{i_t}; θ_t)

where i_t is randomly chosen. The gradient estimate is noisy (high variance) but unbiased: E[∇L(x_i)] = (1/N)∑∇L(x_i). Each update is O(1) in dataset size, so we can make many more updates per second.

Mini-Batch SGD

The practical compromise. Use a random subset ℬ of B samples:

θ_t+1 = θ_t − η · (1/B) ∑_{i ∈ ℬ} ∇_θ L(x_i, y_i; θ_t)

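Typical batch sizes: B = 32, 64, 128, 256. The variance of the gradient estimate scales as σ²/B (here σ² is the per-sample gradient variance, not the activation function), so increasing B reduces noise. But the cost per update also scales linearly with B. B = 32 is a common starting point for many problems.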
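A minimal sketch of the mini-batch update, applied to a one-dimensional linear model rather than a neural network to keep it short. The synthetic data, the per-sample loss ½(y − ŷ)², the function name `minibatch_sgd`, and the choice to shuffle once per epoch and walk through the data in consecutive batches are all assumptions made for illustration.

```python
import random

def minibatch_sgd(xs, ys, batch_size, lr, epochs, seed=0):
    """Fit y = w x + b by mini-batch SGD on the loss 0.5 * (y - y_hat)^2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                      # new random batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the per-sample gradients over the mini-batch.
            residuals = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = sum(r * xs[i] for r, i in zip(residuals, batch)) / len(batch)
            grad_b = sum(residuals) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic, noise-free data from y = 2x + 1 (assumed for illustration).
xs = [i / 50.0 - 1.0 for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys, batch_size=10, lr=0.1, epochs=200)
print(w, b)
```

With N = 100 and B = 10, each epoch performs 10 updates, so 200 epochs amount to 2,000 parameter updates.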

SGD noise is a feature, not a bug. The stochasticity helps escape sharp local minima and saddle points. Networks trained with small batches often generalize better than those trained with very large batches. The noise acts as implicit regularization, steering the optimizer toward flatter minima that transfer better to unseen data.

Learning Rate: The Most Important Hyperparameter

Too large: the updates overshoot the minimum, and the loss oscillates or diverges. Too small: training is very slow and can get stuck in poor local minima.
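Both failure modes show up even on the simplest possible objective. The sketch below runs plain gradient descent on f(θ) = θ², a toy function chosen for illustration, with three hypothetical learning rates. Each step multiplies θ by (1 − 2η), so the behavior can be predicted by hand.

```python
def gradient_descent(lr, steps=50, theta=1.0):
    """Minimize f(theta) = theta^2 starting from theta = 1."""
    for _ in range(steps):
        grad = 2.0 * theta       # f'(theta)
        theta -= lr * grad
    return theta

good = gradient_descent(lr=0.1)       # factor 0.8 per step: converges
too_large = gradient_descent(lr=1.1)  # factor -1.2 per step: oscillates and diverges
too_small = gradient_descent(lr=0.001)  # factor 0.998 per step: barely moves

print(good, too_large, too_small)
```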

Common schedule: start with η = 0.01 or 0.001, and reduce it by a factor of 10 when the loss plateaus. More advanced: the Adam optimizer adapts a separate step size for each parameter automatically.

An Epoch of Training

One epoch = one pass through the entire training set. With N = 50,000 samples and B = 50, one epoch = 1,000 mini-batch updates. Typical training runs: 50–300 epochs.

Figure: SGD on a Loss Surface. SGD (noisy, orange) vs. batch gradient descent (smooth, teal) descending the same loss surface, shown for learning rate η = 0.050 and batch size B = 1. SGD bounces but gets there faster per unit of compute.

Compared to batch gradient descent, mini-batch SGD with B=32:

• Has noisier gradient estimates but makes many more updates per second, often converging faster in wall-clock time
• Always converges to a better minimum
• Uses more memory per update

Chapter 7: Mastery

You've built the complete neural network story from the ground up: the architecture (Chapter 1), the nonlinearity (Chapter 2), the theoretical guarantee (Chapter 3), the training algorithm (Chapters 4-6). Let's consolidate.

Component	Formula	Role
Hidden layer	z_m = σ(α_m^Tx)	Learned nonlinear features
Output (regression)	ŷ = β^Tz	Linear combination of features
Output (classification)	ŷ = softmax(β^Tz)	Class probabilities
MSE loss	½(y − ŷ)²	Regression objective
Cross-entropy loss	−∑_k y_k log(ŷ_k)	Classification objective
Backprop	Chain rule layer by layer	Compute all gradients
SGD update	θ ← θ − η∇L	Move toward minimum

What Comes Next

Single hidden layer networks are powerful in theory, but in practice, deep networks (many layers) learn hierarchical features more efficiently. The next lecture explores why depth helps and introduces convolutional neural networks.

Topic	This Lecture	Next Lecture
Depth	1 hidden layer	L hidden layers
Architecture	Fully connected	Convolutional (CNNs)
Regularization	(not covered)	BatchNorm, Dropout
Connections	Feed-forward only	Residual (skip) connections

Related lessons.
• Lecture 22: Adaptive Filters — LMS is SGD on a linear model
• Lecture 24: Deep Learning & CNNs — depth + convolution
• Lecture 25: Attention & Transformers — the modern attention mechanism

"What I cannot create, I do not understand." — Richard Feynman