Murphy, Chapters 13–15

Deep Neural Networks: MLPs, CNNs & Transformers

Stack linear layers with nonlinear activations and you can approximate any function. Add convolutions for images, attention for sequences, and you get modern deep learning.

Prerequisites: Linear Models (Ch 10–11). You need logistic regression, SGD, cross-entropy, and regularization.
12
Chapters
5
Simulations
12
Quizzes

Chapter 0: Why Go Deep?

Logistic regression draws a single hyperplane. What if the decision boundary is a spiral? A checkerboard? A face versus not-a-face? No single linear function can capture these patterns.

The fix is deceptively simple: stack multiple linear layers with nonlinear activations between them. Each layer transforms the representation, making previously inseparable patterns separable. This is a deep neural network.

The big picture: A DNN computes f(x) = fL(fL−1(…f1(x)…)), where each fl(z) = φ(Wlz + bl). The weight matrices Wl are learned, and φ is a nonlinear activation. Without φ, the composition collapses to a single linear map. With φ, the network can approximate any continuous function (universal approximation theorem).
MLPs (Ch 13)
Fully connected layers, backpropagation, training tricks, regularization
CNNs (Ch 14)
Convolutional layers, pooling, LeNet to ResNet, image classification
RNNs (Ch 15)
Recurrence, gating, LSTMs, GRUs, sequence-to-sequence
Transformers (Ch 15)
Self-attention, multi-head attention, positional encoding, LLMs
Why can't stacking multiple linear layers (without activations) learn nonlinear functions?

Chapter 1: Multilayer Perceptrons

An MLP is a sequence of fully connected layers. Each layer computes:

zl = Wl hl−1 + bl    hl = φ(zl)

where h0 = x is the input, zl is the pre-activation, and hl is the activation (post-nonlinearity). The final layer produces the output: for classification, a softmax; for regression, a linear output.

The XOR problem: A single linear layer cannot learn XOR (Murphy 13.2.1). But a 2-layer MLP with 2 hidden units can, by first transforming the inputs into a space where they become linearly separable. This is the core insight: hidden layers learn useful representations.

Consider a network with one hidden layer of H units for binary classification:

p(y=1 | x) = σ(w2T φ(W1x + b1) + b2)

The first layer maps x from D dimensions to H dimensions (learning features). The second layer is logistic regression on those learned features. Each additional layer learns more abstract features from the previous layer's output.

Universal approximation: A single hidden layer with enough units can approximate any continuous function on a compact set (Cybenko 1989, Hornik 1991). But "enough" may be exponentially many. Deeper networks can represent the same functions with exponentially fewer units — depth is more efficient than width.
ComponentRoleParameters
Input layerRaw features x ∈ RDNone
Hidden layer lφ(Wlhl−1 + bl)Wl ∈ RHl×Hl−1, bl ∈ RHl
Output layerSoftmax or linearWL ∈ RC×HL−1, bL ∈ RC
What role does the hidden layer play in an MLP?

Chapter 2: Activation Functions

The activation function φ is what makes neural networks nonlinear. Without it, any depth of layers collapses to W' = WLWL−1…W1. Murphy (13.2.3) surveys the major choices.

The classic sigmoid σ(a) = 1/(1+e−a) squashes inputs to [0,1]. Its cousin tanh(a) maps to [−1, 1]. Both suffer from saturation: for large |a|, the gradient is nearly zero, killing learning in deep networks.

The ReLU revolution: The Rectified Linear Unit, ReLU(a) = max(0, a), solved the vanishing gradient problem (Glorot et al. 2011). Its gradient is exactly 1 for a > 0 and 0 for a < 0. No saturation. Faster to compute. It became the default activation for hidden layers.
ActivationFormulaRangeKey Property
Sigmoid1/(1+e−a)[0, 1]Saturates both ends
Tanh(ea−e−a)/(ea+e−a)[−1, 1]Zero-centered, saturates
ReLUmax(0, a)[0, ∞)No saturation for a > 0, but "dead" if a < 0
Leaky ReLUmax(αa, a), α≈0.01(−∞, ∞)No dead neurons
ELUa if a>0, α(ea−1) else[−α, ∞)Smooth near zero
Swish/SiLUa · σ(a)≈[−0.28, ∞)Smooth, non-monotonic
GELUa · Φ(a)≈[−0.17, ∞)Used in Transformers
Dead ReLU problem: If a neuron's pre-activation is always negative (due to a bad weight initialization or large learning rate), its gradient is always zero and it never updates. Leaky ReLU, ELU, and Swish fix this by allowing small gradients for negative inputs.

For output layers, the choice depends on the task: softmax for multi-class classification, sigmoid for binary or multi-label, and linear (identity) for regression. The hidden layer activation and output activation serve fundamentally different purposes.

Why did ReLU largely replace sigmoid/tanh in hidden layers of deep networks?

Chapter 3: Backpropagation

How do we compute gradients in a network with millions of parameters across dozens of layers? By the chain rule, applied systematically. This is backpropagation (Rumelhart, Hinton, Williams, 1986).

Consider a loss L = NLL(f(x; θ), y). The forward pass computes h1, h2, …, hL, and the loss. The backward pass propagates gradients in reverse:

δl = ∂L / ∂zl    ∂L / ∂Wl = δl hl−1T    δl−1 = WlT δl ⊙ φ'(zl−1)

The vector δl is the error signal at layer l. It tells us how much each pre-activation contributes to the loss. We compute it by multiplying the downstream error by the transposed weight matrix and the local activation derivative.

Forward vs reverse mode: Murphy (13.3.1) distinguishes forward-mode and reverse-mode differentiation. Forward mode computes one column of the Jacobian per pass (cost: O(D) passes for D inputs). Reverse mode computes one row per pass (cost: O(1) passes for scalar loss). Since the loss is scalar and inputs are high-dimensional, reverse mode (backprop) is exponentially more efficient.
Computation graphs: Modern frameworks (PyTorch, JAX) generalize backprop to arbitrary computation graphs. Each operation records its inputs and local gradient. The backward pass traverses this graph in reverse topological order, accumulating gradients via the chain rule. This is automatic differentiation — you write the forward pass and get gradients for free.

The cost of backprop is roughly 2–3x the cost of the forward pass. The memory cost is higher: we must store all intermediate activations for the backward pass (or recompute them, trading time for memory via gradient checkpointing).

Why is reverse-mode differentiation (backpropagation) more efficient than forward-mode for training neural networks?

Chapter 4: Training & Regularization

Training a DNN is harder than training logistic regression. The loss landscape is non-convex with many local minima and saddle points. Murphy (13.4) covers the essential tricks.

Learning rate scheduling: Too large and training diverges. Too small and it takes forever. Common schedules include step decay, cosine annealing, and warmup followed by decay. Adam optimizer adapts per-parameter learning rates using first and second moment estimates.

Vanishing and exploding gradients (13.4.2): In deep networks, gradients are products of many Jacobians. If eigenvalues are < 1, gradients vanish exponentially. If > 1, they explode. Solutions: (1) careful initialization (He/Xavier), (2) non-saturating activations (ReLU), (3) normalization layers (BatchNorm, LayerNorm), (4) residual connections.

Residual connections (He et al. 2016): Instead of computing hl = f(hl−1), compute hl = hl−1 + f(hl−1). The gradient flows through the identity shortcut without any multiplication, solving the vanishing gradient problem. This enabled networks with hundreds of layers.

Weight decay (ℓ2): Same as ridge regression (Ch 11). Adds λ||w||² to the loss. Prevents any single weight from dominating. In SGD, equivalent to multiplying weights by (1 − ηλ) each step.
Dropout (13.5.4): Randomly set each hidden unit to zero with probability p during training. At test time, scale by (1−p). Forces the network to learn redundant representations. Approximately equivalent to an ensemble of 2H sub-networks.

Batch normalization (Ioffe & Szegedy 2015) normalizes each layer's pre-activations to zero mean and unit variance within each mini-batch, then applies a learned scale and shift. It stabilizes training, allows higher learning rates, and acts as a mild regularizer.

ProblemSolutionMurphy Section
Vanishing gradientsReLU, residual connections, normalization13.4.2–13.4.4
OverfittingDropout, weight decay, early stopping, data augmentation13.5
Slow convergenceAdam, learning rate warmup, BatchNorm13.4.1
Poor initializationHe init (ReLU), Xavier/Glorot init (tanh)13.4.5
How do residual connections help train very deep networks?

Chapter 5: Convolutional Layers

Images have spatial structure: nearby pixels are correlated, and patterns (edges, textures) can appear anywhere. An MLP ignores this, treating each pixel as an independent feature. A convolutional neural network (CNN) exploits it.

A convolutional layer applies a small kernel (filter) K of size k×k across the image, computing a dot product at each spatial position:

y[i, j] = ∑mn K[m, n] · x[i+m, j+n] + b

This is a cross-correlation (Murphy 14.2.1). The same kernel is applied at every position — weight sharing. This has two key benefits:

Translation equivariance: If the input shifts by (dx, dy), the output shifts by the same amount. The network detects patterns regardless of where they appear in the image.
Parameter efficiency: A 3×3 kernel has only 9 parameters regardless of image size. An MLP connecting every pixel to every hidden unit would need millions.

A typical conv layer has multiple kernels (say 64), each producing one feature map. The input is C channels (e.g., RGB = 3). So the kernel is actually k×k×C, and we have F such kernels, giving F output feature maps. Total parameters: F × k × k × C + F.

Pooling layers (Murphy 14.2.2) downsample feature maps, reducing spatial resolution. Max pooling takes the maximum over a small window (e.g., 2×2). This provides a degree of translation invariance and reduces computation.

Layer TypePurposeParameters
Conv2D(k, F)Detect local patternsF × k² × Cin + F
MaxPool(s)Downsample, add invariance0
BatchNormNormalize activations2 × F (scale + shift)
Global AvgPoolCollapse spatial dims0
What is the main advantage of weight sharing in convolutional layers?

Chapter 6: CNN Architectures

The history of CNNs is a story of going deeper and wider while managing gradients. Murphy (14.3) traces the key milestones.

LeNet-5 (LeCun 1998): Two conv layers, two pooling layers, three FC layers. Only ~60K parameters. Designed for handwritten digit recognition. The template for all CNNs that followed.

AlexNet (Krizhevsky 2012): Scaled up LeNet to 60M parameters, used ReLU instead of sigmoid, applied dropout and data augmentation. Won ImageNet 2012 by a huge margin, launching the deep learning revolution.

GoogLeNet/Inception (2014): Instead of choosing a kernel size, use all of them. An Inception module applies 1×1, 3×3, and 5×5 convolutions in parallel, then concatenates the results. 1×1 convolutions reduce channel dimensionality first (bottleneck), keeping computation manageable. 22 layers, only 5M parameters.
ResNet (He et al. 2015): The residual connection: y = x + F(x). Instead of learning the mapping directly, the network learns the residual (how to adjust the input). This is easier when the optimal mapping is close to identity. ResNet-152 has 152 layers — deeper than was thought possible before skip connections.
ArchitectureYearDepthKey Innovation
LeNet-519985Conv + pool template
AlexNet20128ReLU, dropout, GPU training
VGGNet201416–19Small 3×3 filters throughout
GoogLeNet201422Inception modules, 1×1 bottleneck
ResNet201550–152Residual connections
DenseNet2017121+Dense connections (all layers to all)

The pattern: Early layers learn low-level features (edges, colors). Middle layers learn textures and parts. Deep layers learn objects and scenes. This hierarchical feature learning is the core power of deep CNNs.

What is the key innovation in ResNet that enabled training networks with 100+ layers?

Chapter 7: Recurrent Neural Networks

Sequences have temporal structure: the meaning of a word depends on what came before. CNNs can handle fixed-size inputs, but sentences and time series have variable length. Recurrent neural networks (RNNs) process sequences one step at a time, maintaining a hidden state.

ht = φ(Whh ht−1 + Wxh xt + bh)

The hidden state ht is a compressed summary of the sequence so far. At each step, it combines the previous state with the new input. The same weights (Whh, Wxh) are used at every time step — this is weight sharing across time.

The vanishing gradient problem returns (15.2.6): Backpropagation through time (BPTT) unrolls the RNN across T steps. The gradient includes products of T Jacobians. If their spectral norm is < 1, gradients vanish. If > 1, they explode. Vanilla RNNs struggle with sequences longer than ~20 steps.

LSTMs (Hochreiter & Schmidhuber 1997) solve this with a cell state ct that flows through time with minimal modification. Three gates control information flow:

GateFormulaRole
Forget (ft)σ(Wf[ht−1, xt] + bf)What to erase from cell state
Input (it)σ(Wi[ht−1, xt] + bi)What to write to cell state
Output (ot)σ(Wo[ht−1, xt] + bo)What to expose as hidden state
ct = ft ⊙ ct−1 + it ⊙ tanh(Wc[ht−1, xt] + bc)
GRUs (Cho et al. 2014) simplify the LSTM by combining the forget and input gates into a single update gate, and merging the cell and hidden states. Fewer parameters, similar performance in many tasks.
What problem do LSTM gates solve that vanilla RNNs cannot?

Chapter 8: Attention & Transformers

RNNs process sequences left-to-right, compressing everything into a fixed-size hidden state. For long sequences, early information gets washed out. Attention (Bahdanau et al. 2014) fixes this by letting the model look back at all previous states.

Attention as soft dictionary lookup (15.4.1): Given a query q and a set of key-value pairs (ki, vi), attention computes a weighted sum of values: Attn(q, K, V) = ∑i αi vi, where αi = softmax(qTki / √d). The weights αi are high for keys similar to the query. Think of it as a soft, differentiable lookup table.

The Transformer (Vaswani et al. 2017) replaces recurrence entirely with self-attention. Each token attends to every other token in parallel:

Attention(Q, K, V) = softmax(QKT / √dk) V

where Q = XWQ, K = XWK, V = XWV are linear projections of the input sequence X. The √dk scaling prevents the dot products from becoming too large before softmax.

Multi-head attention (15.5.2): Instead of one attention function, use H parallel heads, each with its own Q, K, V projections of dimension d/H. Concatenate the outputs and project: MultiHead(X) = Concat(head1, …, headH)WO. Different heads can attend to different aspects (syntax, semantics, position).

Positional encoding: Self-attention is permutation-invariant — it cannot tell word order. We add positional information using sinusoidal or learned embeddings: xi' = xi + PE(i).

A Transformer block stacks: (1) multi-head self-attention, (2) LayerNorm + residual, (3) feed-forward MLP, (4) LayerNorm + residual. GPT stacks these for language generation. BERT uses bidirectional attention for understanding.

ArchitectureRNNTransformer
ParallelismSequential (slow)Fully parallel (fast)
Long-range depsVanish with distanceDirect attention to any position
MemoryO(1) per stepO(T²) for self-attention
Training speedSlow (no parallelism)Fast (matrix multiply)
Why does scaled dot-product attention divide by √dk?

Chapter 9: MLP Playground

Watch a 2-layer MLP learn to separate nonlinear data. Click to place points from two classes, then train the network. The heatmap shows the learned decision boundary.

MLP: 2D Nonlinear Classifier

Click to place points (toggle class below). Hit Train to run gradient descent on a 2-layer MLP with 16 hidden units and ReLU activations.

0 points
Hidden units16
What to try: Place points in a circle pattern (class 1 inside, class 0 outside). A linear model cannot separate them, but the MLP can learn a circular decision boundary. Try increasing hidden units to see how the boundary becomes smoother.
What enables the MLP to learn nonlinear decision boundaries that logistic regression cannot?

Chapter 10: CNN Feature Explorer

See how a 1D convolution extracts features from a signal. The kernel slides across the input, computing a dot product at each position. Different kernels detect different patterns.

1D Convolution: Kernel Sliding

The gray line is the input signal. The orange region is the kernel window. The teal line is the convolution output (feature map). Use the slider to move the kernel.

Kernel position20
Observe: The edge detection kernel produces large outputs where the signal changes rapidly. The smoothing kernel averages nearby values, blurring the signal. The sharpening kernel amplifies differences. In a CNN, the network learns which kernels are useful for the task.
In a CNN for image classification, what do early convolutional layers typically learn to detect?

Chapter 11: Connections

DNNs are the backbone of modern machine learning. Every subsequent chapter in Murphy builds on them.

Concept from this chapterWhere it leads
MLPsBuilding block for every architecture; encoder/decoder in autoencoders (Ch 20)
BackpropagationTrains all models: GANs, VAEs, transformers, RL policies
CNNsFeature extractors for GPs (Ch 17), deep metric learning (Ch 16)
Residual connectionsUsed in ResNets, Transformers, diffusion models
Self-attentionCore of GPT, BERT, Vision Transformers (ViT)
Dropout / weight decayBayesian neural networks approximate dropout (Gal & Ghahramani 2016)
Encoder-decoderAutoencoders (Ch 20), seq2seq translation
Softmax outputSame as logistic regression (Ch 10), used in clustering (Ch 21)
What we covered: MLPs with fully connected layers, activation functions (sigmoid, ReLU, GELU), backpropagation via reverse-mode autodiff, training tricks (BatchNorm, residual connections, Adam, dropout), convolutional layers with weight sharing and pooling, CNN architectures from LeNet to ResNet, RNNs with LSTM/GRU gating, and the Transformer with multi-head self-attention.
What comes next: Chapters 16–17 explore models that don't fix a parametric form at all. KNN, kernel methods, and Gaussian processes let the data speak directly. SVMs find the maximum margin boundary using the kernel trick. These nonparametric methods complement DNNs and sometimes outperform them on small datasets.

"What I cannot create, I do not understand." — Richard Feynman

What is the relationship between the Transformer and RNNs for sequence modeling?