Stack linear layers with nonlinear activations and, given enough hidden units, you can approximate any continuous function on a bounded domain. Add convolutions for images, attention for sequences, and you get modern deep learning.
Logistic regression draws a single hyperplane. What if the decision boundary is a spiral? A checkerboard? A face versus not-a-face? No single linear function can capture these patterns.
The fix is deceptively simple: stack multiple linear layers with nonlinear activations between them. Each layer transforms the representation, making previously inseparable patterns separable. This is a deep neural network.
An MLP is a sequence of fully connected layers. Each layer computes:

$$z_l = W_l h_{l-1} + b_l, \qquad h_l = \varphi(z_l)$$

where $h_0 = x$ is the input, $z_l$ is the pre-activation, and $h_l$ is the activation (post-nonlinearity). The final layer produces the output: for classification, a softmax; for regression, a linear output.
Consider a network with one hidden layer of H units for binary classification:

$$p(y = 1 \mid x) = \sigma\left(w_2^\top \varphi(W_1 x + b_1) + b_2\right)$$
The first layer maps x from D dimensions to H dimensions (learning features). The second layer is logistic regression on those learned features. Each additional layer learns more abstract features from the previous layer's output.
| Component | Role | Parameters |
|---|---|---|
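| Input layer | Raw features $x \in \mathbb{R}^{D}$ | None |
| Hidden layer $l$ | $\varphi(W_l h_{l-1} + b_l)$ | $W_l \in \mathbb{R}^{H_l \times H_{l-1}}$, $b_l \in \mathbb{R}^{H_l}$ |
| Output layer | Softmax or linear | $W_L \in \mathbb{R}^{C \times H_{L-1}}$, $b_L \in \mathbb{R}^{C}$ |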
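The one-hidden-layer network described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sizes `D = 4` and `H = 8`, the ReLU hidden activation, and the random weights are all assumptions chosen for the example.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, w2, b2):
    """One-hidden-layer MLP for binary classification."""
    z1 = W1 @ x + b1        # pre-activation, shape (H,)
    h1 = relu(z1)           # learned features, shape (H,)
    logit = w2 @ h1 + b2    # logistic regression on the learned features
    return sigmoid(logit)   # p(y = 1 | x)

rng = np.random.default_rng(0)
D, H = 4, 8                 # illustrative sizes (assumed)
W1 = rng.normal(size=(H, D)) * 0.5
b1 = np.zeros(H)
w2 = rng.normal(size=H) * 0.5
b2 = 0.0
x = rng.normal(size=D)

p = mlp_forward(x, W1, b1, w2, b2)
print(p)
```

With all first-layer weights set to zero, the hidden features vanish and the output falls back to `sigmoid(b2)`, which is one quick way to sanity-check the wiring.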
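The activation function φ is what makes neural networks nonlinear. Without it, a stack of any depth collapses to a single linear map with weight matrix $W' = W_L W_{L-1} \cdots W_1$ (an affine map once biases are included), so depth adds no expressive power. Murphy (13.2.3) surveys the major choices.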
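The collapse is easy to check numerically. The sketch below uses three random weight matrices with arbitrary shapes and omits biases for brevity (both are assumptions of the example); it compares the layered product with the single collapsed matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(4, 5))
W3 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Three "layers" with no activation between them.
deep_linear = W3 @ (W2 @ (W1 @ x))

# The same computation as one matrix of shape (2, 3).
W_collapsed = W3 @ W2 @ W1
shallow = W_collapsed @ x

# Inserting ReLU between layers breaks the equivalence in general.
deep_relu = W3 @ np.maximum(0.0, W2 @ np.maximum(0.0, W1 @ x))

print(np.allclose(deep_linear, shallow))
print(deep_relu, shallow)
```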
The classic sigmoid $\sigma(a) = 1/(1+e^{-a})$ squashes inputs to (0, 1). Its cousin tanh(a) maps to (−1, 1). Both suffer from saturation: for large |a|, the gradient is nearly zero, which stalls learning in deep networks.
| Activation | Formula | Range | Key Property |
|---|---|---|---|
| Sigmoid | $1/(1+e^{-a})$ | (0, 1) | Saturates at both ends |
| Tanh | $(e^{a}-e^{-a})/(e^{a}+e^{-a})$ | (−1, 1) | Zero-centered, saturates |
| ReLU | max(0, a) | [0, ∞) | No saturation for a > 0, but zero gradient ("dead") for a < 0 |
| Leaky ReLU | max(αa, a), α≈0.01 | (−∞, ∞) | No dead neurons |
| ELU | a if a > 0, else $\alpha(e^{a}-1)$ | (−α, ∞) | Smooth near zero |
| Swish/SiLU | a · σ(a) | ≈[−0.28, ∞) | Smooth, non-monotonic |
| GELU | a · Φ(a) | ≈[−0.17, ∞) | Used in Transformers |
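The formulas in the table translate directly into NumPy. This is a sketch under a few assumptions: the default `alpha` values are common choices rather than anything fixed by the text, and GELU uses the exact Gaussian CDF via `math.erf` instead of the faster approximations often used in practice.

```python
import math
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, alpha=0.01):
    return np.where(a > 0, a, alpha * a)

def elu(a, alpha=1.0):
    # expm1 on the clipped input avoids overflow for large positive a.
    return np.where(a > 0, a, alpha * np.expm1(np.minimum(a, 0.0)))

def swish(a):
    return a * sigmoid(a)

_erf = np.vectorize(math.erf)

def gelu(a):
    # a * Phi(a), with Phi the standard normal CDF.
    return a * 0.5 * (1.0 + _erf(a / math.sqrt(2.0)))

a = np.linspace(-10.0, 10.0, 2001)
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                 ("leaky_relu", leaky_relu), ("elu", elu),
                 ("swish", swish), ("gelu", gelu)]:
    out = fn(a)
    print(f"{name:10s} min={out.min():+.3f} max={out.max():+.3f}")
```

Evaluating each function on a grid like this is a quick way to confirm the ranges listed in the table.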
For output layers, the choice depends on the task: softmax for multi-class classification, sigmoid for binary or multi-label, and linear (identity) for regression. The hidden layer activation and output activation serve fundamentally different purposes.
How do we compute gradients in a network with millions of parameters across dozens of layers? By the chain rule, applied systematically. This is backpropagation (Rumelhart, Hinton, Williams, 1986).
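Consider a loss $\mathcal{L} = \mathrm{NLL}(f(x; \theta), y)$. The forward pass computes $h_1, h_2, \ldots, h_L$, and the loss. The backward pass propagates gradients in reverse:

$$\delta_L = \frac{\partial \mathcal{L}}{\partial z_L}, \qquad \delta_l = \left(W_{l+1}^\top \delta_{l+1}\right) \odot \varphi'(z_l)$$

$$\frac{\partial \mathcal{L}}{\partial W_l} = \delta_l h_{l-1}^\top, \qquad \frac{\partial \mathcal{L}}{\partial b_l} = \delta_l$$

The vector $\delta_l$ is the error signal at layer $l$. It tells us how much each pre-activation contributes to the loss. We compute it by multiplying the downstream error by the transposed weight matrix and then, elementwise, by the local activation derivative.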
The backward pass costs roughly twice as much as the forward pass, so a full training step costs about three forward passes. Memory is often the tighter constraint: we must store all intermediate activations for the backward pass (or recompute them, trading time for memory via gradient checkpointing).
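To make the error-signal recursion concrete, here is a hand-written backward pass for a tiny two-layer network, checked against a finite-difference estimate. The choices of tanh hidden units, a sigmoid output with binary NLL, and the sizes `D = 3`, `H = 5` are assumptions made for the sketch; with a sigmoid output and NLL loss the output error signal simplifies to `p - y`.

```python
import numpy as np

def forward(x, params):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h1 = np.tanh(z1)
    z2 = W2 @ h1 + b2                   # shape (1,)
    p = 1.0 / (1.0 + np.exp(-z2))       # p(y = 1 | x)
    return z1, h1, z2, p

def nll(p, y):
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).sum())

def backward(x, y, params):
    W1, b1, W2, b2 = params
    z1, h1, z2, p = forward(x, params)
    delta2 = p - y                              # dL/dz2 for sigmoid + NLL
    dW2 = np.outer(delta2, h1)
    db2 = delta2
    delta1 = (W2.T @ delta2) * (1.0 - h1 ** 2)  # tanh'(z1) = 1 - tanh(z1)^2
    dW1 = np.outer(delta1, x)
    db1 = delta1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
D, H = 3, 5
params = [rng.normal(size=(H, D)), rng.normal(size=H),
          rng.normal(size=(1, H)), rng.normal(size=1)]
x, y = rng.normal(size=D), 1.0

dW1, db1, dW2, db2 = backward(x, y, params)

# Central finite difference on a single entry of W1.
eps = 1e-6
W1_plus, W1_minus = params[0].copy(), params[0].copy()
W1_plus[2, 1] += eps
W1_minus[2, 1] -= eps
loss_plus = nll(forward(x, [W1_plus, *params[1:]])[3], y)
loss_minus = nll(forward(x, [W1_minus, *params[1:]])[3], y)
numeric = (loss_plus - loss_minus) / (2 * eps)

print(dW1[2, 1], numeric)
```

A gradient check like this is a cheap safeguard whenever a backward pass is written by hand rather than produced by an autodiff library.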
Training a DNN is harder than training logistic regression. The loss landscape is non-convex with many local minima and saddle points. Murphy (13.4) covers the essential tricks.
Learning rate scheduling: Too large and training diverges. Too small and it takes forever. Common schedules include step decay, cosine annealing, and warmup followed by decay. The Adam optimizer adapts a per-parameter step size using running estimates of the first and second moments of the gradient.
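As an illustration, the sketch below combines linear warmup with cosine decay and shows a single Adam update. The function names `lr_schedule` and `adam_step` are hypothetical, and the defaults (`warmup_steps=100`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are common values assumed for the example rather than taken from the text.

```python
import math
import numpy as np

def lr_schedule(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def adam_step(theta, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment: running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
grad = np.array([2.0, -0.5])
theta, m, v = adam_step(theta, grad, m, v, t=1, lr=0.1)
print(theta)
print([round(lr_schedule(s, 1000), 6) for s in (0, 50, 99, 500, 1000)])
```

On the first step the bias-corrected update has magnitude close to `lr` for every parameter regardless of gradient scale, which is the per-parameter adaptation at work.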
Residual connections (He et al. 2016): Instead of computing $h_l = f(h_{l-1})$, compute $h_l = h_{l-1} + f(h_{l-1})$. The gradient can flow through the identity shortcut without being multiplied by any weight matrix, which greatly mitigates the vanishing gradient problem. This enabled networks with hundreds of layers.
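One way to see the effect is to compare the input-to-output Jacobian of a deep stack with and without shortcuts. The sketch below assumes 50 layers of width 16 with deliberately small tanh layers `f(h) = tanh(W h)`; these sizes and the weight scale are illustrative choices, picked so that the plain stack clearly exhibits vanishing gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 16, 50
# Small weights: each plain layer shrinks the signal and its gradient.
Ws = [rng.normal(size=(n, n)) * 0.1 / np.sqrt(n) for _ in range(depth)]
h0 = rng.normal(size=n)

def input_output_jacobian(residual):
    """Jacobian of the final hidden state with respect to the input h0."""
    h, J = h0.copy(), np.eye(n)
    for W in Ws:
        f = np.tanh(W @ h)
        Jf = (1.0 - f ** 2)[:, None] * W      # Jacobian of tanh(W h) w.r.t. h
        if residual:
            h = h + f
            J = (np.eye(n) + Jf) @ J          # identity term keeps the product alive
        else:
            h = f
            J = Jf @ J
    return J

plain_norm = np.linalg.norm(input_output_jacobian(residual=False))
resid_norm = np.linalg.norm(input_output_jacobian(residual=True))
print(f"plain: {plain_norm:.3e}  residual: {resid_norm:.3e}")
```

Backpropagation multiplies by the transposes of these same Jacobians, so a vanishing product here means a vanishing gradient at the early layers.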
Batch normalization (Ioffe & Szegedy 2015) normalizes each layer's pre-activations to zero mean and unit variance within each mini-batch, then applies a learned scale and shift. It stabilizes training, allows higher learning rates, and acts as a mild regularizer.
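A minimal sketch of the training-mode forward pass follows. It assumes a mini-batch laid out as `(batch, features)` and omits the running statistics that are used at inference time; `eps` is the usual small constant for numerical stability.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = Z.mean(axis=0)                      # per-feature batch mean
    var = Z.var(axis=0)                      # per-feature batch variance
    Z_hat = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * Z_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(64, 10))   # badly scaled pre-activations
out = batchnorm_forward(Z, gamma=np.ones(10), beta=np.zeros(10))

print(out.mean(axis=0).round(6))
print(out.std(axis=0).round(4))
```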
| Problem | Solution | Murphy Section |
|---|---|---|
| Vanishing gradients | ReLU, residual connections, normalization | 13.4.2–13.4.4 |
| Overfitting | Dropout, weight decay, early stopping, data augmentation | 13.5 |
| Slow convergence | Adam, learning rate warmup, BatchNorm | 13.4.1 |
| Poor initialization | He init (ReLU), Xavier/Glorot init (tanh) | 13.4.5 |
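The two initialization schemes in the last row differ only in the variance of the weights. The sketch below uses the Gaussian variants, with variance 2/fan_in for He and 2/(fan_in + fan_out) for Xavier/Glorot; the function names and layer sizes are hypothetical choices for the example.

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    """He initialization, suited to ReLU layers: Var[w] = 2 / fan_in."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def xavier_init(fan_in, fan_out, rng):
    """Xavier/Glorot initialization, suited to tanh: Var[w] = 2 / (fan_in + fan_out)."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_he = he_init(512, 512, rng)
W_xavier = xavier_init(512, 256, rng)

print(W_he.std(), np.sqrt(2.0 / 512))
print(W_xavier.std(), np.sqrt(2.0 / (512 + 256)))
```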
Images have spatial structure: nearby pixels are correlated, and patterns (edges, textures) can appear anywhere. An MLP ignores this, treating each pixel as an independent feature. A convolutional neural network (CNN) exploits it.
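A convolutional layer applies a small kernel (filter) $K$ of size $k \times k$ across the image $X$, computing a dot product at each spatial position:

$$Y[i, j] = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} K[u, v]\, X[i + u, j + v]$$

Strictly speaking, this is a cross-correlation rather than a convolution, because the kernel is not flipped (Murphy 14.2.1). The same kernel is applied at every position, which is called weight sharing. This has two key benefits: the layer needs far fewer parameters than a fully connected layer, and a pattern is detected wherever it appears in the image (translation equivariance).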
A typical conv layer has multiple kernels (say 64), each producing one feature map. The input has C channels (e.g., RGB = 3), so each kernel is actually k×k×C, and we have F such kernels, giving F output feature maps. Total parameters: F × k × k × C weights plus F biases.
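A direct, loop-based sketch of a multi-channel convolutional layer makes the shapes and the parameter count explicit. It assumes no padding and stride 1, and stores the kernels as an array of shape `(F, C, k, k)`; real libraries use much faster algorithms, so this is for illustration only.

```python
import numpy as np

def conv2d(X, K, b):
    """Valid cross-correlation. X: (C, H, W), K: (F, C, k, k), b: (F,)."""
    F, C, k, _ = K.shape
    H_out, W_out = X.shape[1] - k + 1, X.shape[2] - k + 1
    Y = np.zeros((F, H_out, W_out))
    for f in range(F):
        for i in range(H_out):
            for j in range(W_out):
                # Dot product of the kernel with the k x k x C patch at (i, j).
                Y[f, i, j] = np.sum(K[f] * X[:, i:i + k, j:j + k]) + b[f]
    return Y

def conv_param_count(F, k, C):
    return F * k * k * C + F

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8, 8))        # small RGB "image"
K = rng.normal(size=(64, 3, 3, 3))    # 64 kernels of size 3 x 3 x 3
b = np.zeros(64)

Y = conv2d(X, K, b)
print(Y.shape)                        # (64, 6, 6)
print(conv_param_count(64, 3, 3))     # 1792
```

For comparison, a fully connected layer mapping the same 3×8×8 input to a 64×6×6 output would need 192 × 2304 weights, which is where the savings from weight sharing come from.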
Pooling layers (Murphy 14.2.2) downsample feature maps, reducing spatial resolution. Max pooling takes the maximum over a small window (e.g., 2×2). This provides a degree of translation invariance and reduces computation.
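Max pooling with non-overlapping windows can be written as a reshape followed by a max. The sketch assumes the window size equals the stride and silently crops any rows or columns that do not fill a complete window.

```python
import numpy as np

def max_pool2d(X, s=2):
    """Non-overlapping s x s max pooling. X: (C, H, W)."""
    C, H, W = X.shape
    H_out, W_out = H // s, W // s
    X = X[:, :H_out * s, :W_out * s]              # crop to a multiple of s
    # Split each spatial axis into (window index, offset within window).
    return X.reshape(C, H_out, s, W_out, s).max(axis=(2, 4))

X = np.arange(16).reshape(1, 4, 4)
pooled = max_pool2d(X, s=2)
print(pooled)
# [[[ 5  7]
#   [13 15]]]
```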
| Layer Type | Purpose | Parameters |
|---|---|---|
| Conv2D(k, F) | Detect local patterns | $F \times k^2 \times C_{\text{in}} + F$ |
| MaxPool(s) | Downsample, add invariance | 0 |
| BatchNorm | Normalize activations | 2 × F (scale + shift) |
| Global AvgPool | Collapse spatial dims | 0 |
The history of CNNs is a story of going deeper and wider while managing gradients. Murphy (14.3) traces the key milestones.
LeNet-5 (LeCun et al. 1998): Two conv layers, two pooling layers, three FC layers. Only ~60K parameters. Designed for handwritten digit recognition. The template for the CNNs that followed.
AlexNet (Krizhevsky et al. 2012): Scaled up the LeNet design to about 60M parameters, used ReLU instead of sigmoid, applied dropout and data augmentation. Won ImageNet 2012 by a huge margin, launching the deep learning revolution.
| Architecture | Year | Depth | Key Innovation |
|---|---|---|---|
| LeNet-5 | 1998 | 5 | Conv + pool template |
| AlexNet | 2012 | 8 | ReLU, dropout, GPU training |
| VGGNet | 2014 | 16–19 | Small 3×3 filters throughout |
| GoogLeNet | 2014 | 22 | Inception modules, 1×1 bottleneck |
| ResNet | 2015 | 50–152 | Residual connections |
| DenseNet | 2017 | 121+ | Dense connections (each layer receives all earlier layers' feature maps) |
The pattern: Early layers learn low-level features (edges, colors). Middle layers learn textures and parts. Deep layers learn objects and scenes. This hierarchical feature learning is the core power of deep CNNs.
Sequences have temporal structure: the meaning of a word depends on what came before. CNNs can handle fixed-size inputs, but sentences and time series have variable length. Recurrent neural networks (RNNs) process sequences one step at a time, maintaining a hidden state:

$$h_t = \varphi(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

The hidden state $h_t$ is a compressed summary of the sequence so far. At each step, it combines the previous state with the new input; $\varphi$ is typically tanh. The same weights ($W_{hh}$, $W_{xh}$) are used at every time step, which is weight sharing across time.
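A vanilla RNN forward pass is a short loop. The sketch below assumes a tanh nonlinearity, a zero initial state, and a bias term `b_h`; it runs the same weights over two sequences of different lengths to show that nothing in the model depends on sequence length.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over xs of shape (T, D); returns states of shape (T, H)."""
    h = np.zeros(W_hh.shape[0])           # initial hidden state (assumed zero)
    states = []
    for x_t in xs:
        # Combine the previous state with the new input, using shared weights.
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
D, H = 3, 6
W_xh = rng.normal(size=(H, D)) * 0.5
W_hh = rng.normal(size=(H, H)) * 0.5
b_h = np.zeros(H)

short_seq = rng.normal(size=(5, D))
long_seq = np.concatenate([short_seq, rng.normal(size=(7, D))])

states_short = rnn_forward(short_seq, W_xh, W_hh, b_h)
states_long = rnn_forward(long_seq, W_xh, W_hh, b_h)
print(states_short.shape, states_long.shape)   # (5, 6) (12, 6)
```

Because the state at step t depends only on inputs up to t, the first five states of the longer run match the shorter run exactly.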
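Because the same recurrent weights are applied at every step, gradients backpropagated through many time steps tend to vanish or explode, so vanilla RNNs struggle to learn long-range dependencies. LSTMs (Hochreiter & Schmidhuber 1997) address this with a cell state $c_t$ that flows through time with minimal modification. Three gates control information flow: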
| Gate | Formula | Role |
|---|---|---|
| Forget ($f_t$) | $\sigma(W_f [h_{t-1}, x_t] + b_f)$ | What to erase from cell state |
| Input ($i_t$) | $\sigma(W_i [h_{t-1}, x_t] + b_i)$ | What to write to cell state |
| Output ($o_t$) | $\sigma(W_o [h_{t-1}, x_t] + b_o)$ | What to expose as hidden state |
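The three gates fit together in a single step function. The table above lists only the gates, so the candidate update `g` and the state updates `c = f * c_prev + i * g` and `h = o * tanh(c)` used here are the standard LSTM formulation, stated as an assumption; the parameter names `Wc` and `bc` are likewise hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    hx = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(p["Wf"] @ hx + p["bf"])       # forget gate: what to erase
    i = sigmoid(p["Wi"] @ hx + p["bi"])       # input gate: what to write
    o = sigmoid(p["Wo"] @ hx + p["bo"])       # output gate: what to expose
    g = np.tanh(p["Wc"] @ hx + p["bc"])       # candidate values (assumed standard form)
    c = f * c_prev + i * g                    # cell state update
    h = o * np.tanh(c)                        # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4
p = {name: rng.normal(size=(H, H + D)) * 0.1 for name in ("Wf", "Wi", "Wo", "Wc")}
p.update({name: np.zeros(H) for name in ("bf", "bi", "bo", "bc")})

# Force the forget gate open and the input gate shut: the cell state is carried over.
p["bf"] = np.full(H, 50.0)
p["bi"] = np.full(H, -50.0)

x_t, h_prev, c_prev = rng.normal(size=D), np.zeros(H), rng.normal(size=H)
h, c = lstm_step(x_t, h_prev, c_prev, p)
print(np.allclose(c, c_prev))
```

With the forget gate saturated near 1 and the input gate near 0, the cell state passes through unchanged, which is the mechanism that lets information survive across many time steps.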
RNNs process sequences left-to-right, compressing everything into a fixed-size hidden state. For long sequences, early information gets washed out. Attention (Bahdanau et al. 2014) fixes this by letting the model look back at all previous states.
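The Transformer (Vaswani et al. 2017) replaces recurrence entirely with self-attention. Each token attends to every other token in parallel:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $Q = X W_Q$, $K = X W_K$, $V = X W_V$ are linear projections of the input sequence $X$. The $\sqrt{d_k}$ scaling keeps the dot products from growing with the key dimension $d_k$, which would otherwise push the softmax into saturated regions with tiny gradients.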
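Single-head scaled dot-product self-attention is only a few lines of NumPy. The sketch assumes a sequence of `T = 5` tokens with model dimension 8 and key/value dimension 4, no masking, and random projection matrices; all of these are illustrative choices.

```python
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))       # (T, T) attention weights
    return A @ V, A

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4
X = rng.normal(size=(T, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

out, A = self_attention(X, W_Q, W_K, W_V)
print(out.shape, A.shape)      # (5, 4) (5, 5)
print(A.sum(axis=1))           # each row sums to 1
```

Row i of `A` holds the weights token i places on every token in the sequence, and the whole computation is a handful of matrix multiplies with no loop over time.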
Positional encoding: Self-attention on its own is permutation-equivariant: permuting the input tokens simply permutes the outputs, so the model cannot tell word order. We add positional information using sinusoidal or learned embeddings: $x_i' = x_i + \mathrm{PE}(i)$.
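The sinusoidal variant from Vaswani et al. (2017) uses sines on even dimensions and cosines on odd dimensions, with wavelengths growing geometrically. The sketch assumes an even `d_model`; the sequence length and model dimension are arbitrary example values.

```python
import numpy as np

def sinusoidal_pe(T, d_model):
    """Sinusoidal positional encodings of shape (T, d_model); d_model must be even."""
    pos = np.arange(T)[:, None]                       # positions 0..T-1
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

T, d_model = 10, 16
pe = sinusoidal_pe(T, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))      # token embeddings (random stand-ins)
X_with_pos = X + pe                    # x_i' = x_i + PE(i)
print(pe[0])                           # position 0: alternating 0 and 1
```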
A Transformer block stacks: (1) multi-head self-attention, (2) LayerNorm + residual, (3) feed-forward MLP, (4) LayerNorm + residual. GPT stacks these with causal (left-to-right) masking for language generation. BERT uses bidirectional attention for understanding.
| Architecture | RNN | Transformer |
|---|---|---|
| Parallelism | Sequential (slow) | Fully parallel (fast) |
| Long-range deps | Vanish with distance | Direct attention to any position |
| Memory | O(1) per step | O(T²) for self-attention |
| Training speed | Slow (no parallelism) | Fast (matrix multiply) |
Interactive demo: a 2-layer MLP with 16 hidden units and ReLU activations is trained by gradient descent to separate user-placed points from two classes. A heatmap shows the learned decision boundary.
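The demo can be approximated offline with a short NumPy script. This is a sketch under stated assumptions: the hand-placed points are replaced by synthetic XOR-style data, training uses full-batch gradient descent with a learning rate of 0.3 for 2000 steps, and the initialization scales are illustrative. Only the architecture (16 hidden ReLU units, sigmoid output) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hand-placed points: XOR-style quadrants in [-1, 1]^2.
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

H = 16
W1 = rng.normal(size=(H, 2))                    # He-style scale sqrt(2 / fan_in) = 1 for 2 inputs
b1 = np.zeros(H)
w2 = rng.normal(size=H) * np.sqrt(1.0 / H)
b2 = 0.0

def forward(X):
    Z1 = X @ W1.T + b1                          # (N, H) pre-activations
    H1 = np.maximum(0.0, Z1)                    # ReLU features
    p = 1.0 / (1.0 + np.exp(-(H1 @ w2 + b2)))   # p(y = 1 | x)
    return Z1, H1, p

def nll(p, y):
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

lr, losses = 0.3, []
for step in range(2000):
    Z1, H1, p = forward(X)
    losses.append(nll(p, y))
    d2 = (p - y) / len(y)                       # output error signal, averaged over the batch
    gw2, gb2 = H1.T @ d2, d2.sum()
    D1 = np.outer(d2, w2) * (Z1 > 0)            # backpropagate through ReLU
    gW1, gb1 = D1.T @ X, D1.sum(axis=0)
    W1 -= lr * gW1
    b1 -= lr * gb1
    w2 -= lr * gw2
    b2 -= lr * gb2

accuracy = float(np.mean((forward(X)[2] > 0.5) == (y > 0.5)))
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, train accuracy {accuracy:.2f}")
```

Evaluating `forward` on a dense grid of points and plotting the probabilities would reproduce the decision-boundary heatmap.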
Interactive demo: a 1D convolution extracts features from a signal. The kernel slides across the input, computing a dot product at each position, and different kernels detect different patterns. The plot shows the input signal, the current kernel window, and the convolution output (feature map).
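The sliding dot product is straightforward to reproduce. The sketch below uses a step signal and two hand-picked kernels, an edge detector and a moving average, as assumed examples; as in the 2D case, the kernel is not flipped, so this is a cross-correlation.

```python
import numpy as np

def conv1d(signal, kernel):
    """Slide the kernel over the signal and take a dot product at each position."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

signal = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])   # a step edge
edge_kernel = np.array([-1.0, 1.0])                 # responds to rising edges
smooth_kernel = np.ones(3) / 3.0                    # moving average

edges = conv1d(signal, edge_kernel)
smoothed = conv1d(signal, smooth_kernel)
print(edges)       # [0. 0. 1. 0. 0.]
print(smoothed)
```

The edge kernel fires only where the signal jumps, while the averaging kernel turns the sharp step into a gradual ramp, so the same operation yields different features depending on the kernel.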
DNNs are the backbone of modern machine learning. Much of the later material in Murphy builds on them.
| Concept from this chapter | Where it leads |
|---|---|
| MLPs | Building block for every architecture; encoder/decoder in autoencoders (Ch 20) |
| Backpropagation | Trains all models: GANs, VAEs, transformers, RL policies |
| CNNs | Feature extractors for GPs (Ch 17), deep metric learning (Ch 16) |
| Residual connections | Used in ResNets, Transformers, diffusion models |
| Self-attention | Core of GPT, BERT, Vision Transformers (ViT) |
| Dropout / weight decay | Dropout can be interpreted as approximate inference in a Bayesian neural network (Gal & Ghahramani 2016) |
| Encoder-decoder | Autoencoders (Ch 20), seq2seq translation |
| Softmax output | Same as logistic regression (Ch 10), used in clustering (Ch 21) |
"What I cannot create, I do not understand." — Richard Feynman