Ch 13–15: Deep Neural Networks

Chapter 0: Why Go Deep?

Logistic regression draws a single hyperplane. What if the decision boundary is a spiral? A checkerboard? A face versus not-a-face? No single linear function can capture these patterns.

The fix is deceptively simple: stack multiple linear layers with nonlinear activations between them. Each layer transforms the representation, making previously inseparable patterns separable. This is a deep neural network.

The big picture: A DNN computes f(x) = f_L(f_{L−1}(…f₁(x)…)), where each f_l(z) = φ(W_l z + b_l). The weight matrices W_l and biases b_l are learned, and φ is a nonlinear activation. Without φ, the composition collapses to a single affine map. With φ, a sufficiently wide network can approximate any continuous function on a compact set to arbitrary accuracy (universal approximation theorem).
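A quick numeric check makes the collapse concrete. This is a minimal sketch in plain Python; the 2×2 weights are hypothetical values chosen for illustration and biases are omitted for brevity.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Multiply a matrix by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def relu(v):
    return [max(0.0, vi) for vi in v]

# Hypothetical weights for a 2-layer network with no biases
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

two_linear_layers = matvec(W2, matvec(W1, x))   # W2 (W1 x)
single_layer = matvec(matmul(W2, W1), x)        # (W2 W1) x, one linear map
with_relu = matvec(W2, relu(matvec(W1, x)))     # nonlinearity in between

print(two_linear_layers, single_layer, with_relu)
```

The first two results agree, so the second linear layer added nothing. Inserting ReLU between the layers changes the output, which is why the activation is essential.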

MLPs (Ch 13)

Fully connected layers, backpropagation, training tricks, regularization

↓

CNNs (Ch 14)

Convolutional layers, pooling, LeNet to ResNet, image classification

↓

RNNs (Ch 15)

Recurrence, gating, LSTMs, GRUs, sequence-to-sequence

↓

Transformers (Ch 15)

Self-attention, multi-head attention, positional encoding, LLMs

Why can't stacking multiple linear layers (without activations) learn nonlinear functions?

Because the composition of linear functions is still linear: W₂W₁x = W'x
Because linear layers have too few parameters
Because gradient descent cannot optimize multiple layers

Chapter 1: Multilayer Perceptrons

An MLP is a sequence of fully connected layers. Each layer computes:

z_l = W_l h_{l−1} + b_l
h_l = φ(z_l)

where h₀ = x is the input, z_l is the pre-activation, and h_l is the activation (post-nonlinearity). The final layer produces the output: for classification, a softmax; for regression, a linear output.

The XOR problem: A single linear layer cannot learn XOR (Murphy 13.2.1). But a 2-layer MLP with 2 hidden units can, by first transforming the inputs into a space where they become linearly separable. This is the core insight: hidden layers learn useful representations.
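To see this concretely, here is a sketch of a 2-2-1 ReLU network that computes XOR. The weights are hand-picked for illustration, not learned and not taken from Murphy; many other weight settings work equally well.

```python
def relu(a):
    return max(0.0, a)

def xor_mlp(x1, x2):
    """2-2-1 MLP with hand-picked (not learned) weights that computes XOR."""
    h1 = relu(x1 + x2)         # fires when at least one input is on
    h2 = relu(x1 + x2 - 1.0)   # fires only when both inputs are on
    return h1 - 2.0 * h2       # linear read-out in the hidden space

outputs = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
print(outputs)
```

The hidden layer maps the four inputs to (0, 0), (1, 0), (1, 0), and (2, 1). In that space a single linear function, h1 − 2·h2, separates the two classes.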

Consider a network with one hidden layer of H units for binary classification:

p(y=1 | x) = σ(w₂^T φ(W₁x + b₁) + b₂)

The first layer maps x from D dimensions to H dimensions (learning features). The second layer is logistic regression on those learned features. Each additional layer learns more abstract features from the previous layer's output.

Universal approximation: A single hidden layer with enough units can approximate any continuous function on a compact set to arbitrary accuracy (Cybenko 1989, Hornik 1991). But "enough" may be exponentially many. For some function families, deeper networks can represent the same function with exponentially fewer units, so depth can be far more efficient than width.

Component	Role	Parameters
Input layer	Raw features x ∈ R^D	None
Hidden layer l	φ(W_l h_{l−1} + b_l)	W_l ∈ R^{H_l × H_{l−1}}, b_l ∈ R^{H_l}
Output layer	Softmax or linear	W_L ∈ R^{C × H_{L−1}}, b_L ∈ R^C

What role does the hidden layer play in an MLP?

It learns a nonlinear feature transformation that makes the data linearly separable for the output layer
It stores the training data
It reduces the dimensionality of the input

Chapter 2: Activation Functions

The activation function φ is what makes neural networks nonlinear. Without it, any number of stacked layers collapses to a single linear map W' = W_L W_{L−1} … W₁. Murphy (13.2.3) surveys the major choices.

The classic sigmoid σ(a) = 1/(1+e^−a) squashes inputs to (0, 1). Its cousin tanh(a) maps to (−1, 1). Both suffer from saturation: for large |a|, the gradient is nearly zero, which stalls learning in deep networks.

The ReLU revolution: The Rectified Linear Unit, ReLU(a) = max(0, a), greatly reduces the vanishing gradient problem (Glorot et al. 2011). Its gradient is exactly 1 for a > 0 and 0 for a < 0, so it never saturates for positive inputs. It is also cheaper to compute than sigmoid or tanh. It became the default activation for hidden layers.

Activation	Formula	Range	Key Property
Sigmoid	1/(1+e^−a)	(0, 1)	Saturates both ends
Tanh	(e^a−e^−a)/(e^a+e^−a)	(−1, 1)	Zero-centered, saturates
ReLU	max(0, a)	[0, ∞)	No saturation for a > 0, but "dead" if a < 0
Leaky ReLU	max(αa, a), α≈0.01	(−∞, ∞)	No dead neurons
ELU	a if a>0, α(e^a−1) else	(−α, ∞)	Smooth near zero
Swish/SiLU	a · σ(a)	≈[−0.28, ∞)	Smooth, non-monotonic
GELU	a · Φ(a)	≈[−0.17, ∞)	Used in Transformers

Dead ReLU problem: If a neuron's pre-activation is always negative (due to a bad weight initialization or large learning rate), its gradient is always zero and it never updates. Leaky ReLU, ELU, and Swish fix this by allowing small gradients for negative inputs.
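The saturation and dead-unit behavior described above shows up directly in the derivatives. The sketch below evaluates a few of them at sample points; the function names are illustrative and the derivative formulas are the standard closed forms.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)          # peaks at 0.25 when a = 0

def tanh_grad(a):
    return 1.0 - math.tanh(a) ** 2

def relu_grad(a):
    return 1.0 if a > 0 else 0.0  # zero for negative inputs: the "dead" regime

def leaky_relu_grad(a, alpha=0.01):
    return 1.0 if a > 0 else alpha

for a in (-10.0, -1.0, 1.0, 10.0):
    print(a, sigmoid_grad(a), tanh_grad(a), relu_grad(a), leaky_relu_grad(a))
```

At a = 10 the sigmoid and tanh gradients are tiny while the ReLU gradient is still 1. At a = −1 the ReLU gradient is 0 but the leaky variant still passes a small signal.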

For output layers, the choice depends on the task: softmax for multi-class classification, sigmoid for binary or multi-label, and linear (identity) for regression. The hidden layer activation and output activation serve fundamentally different purposes.

Why did ReLU largely replace sigmoid/tanh in hidden layers of deep networks?

Because ReLU outputs are always positive
Because ReLU does not saturate for positive inputs, avoiding the vanishing gradient problem
Because ReLU is differentiable everywhere

Chapter 3: Backpropagation

How do we compute gradients in a network with millions of parameters across dozens of layers? By the chain rule, applied systematically. This is backpropagation (Rumelhart, Hinton, Williams, 1986).

Consider a loss L = NLL(f(x; θ), y). The forward pass computes h₁, h₂, …, h_L, and the loss. The backward pass propagates gradients in reverse:

δ_l = ∂L / ∂z_l
∂L / ∂W_l = δ_l h_{l−1}^T
δ_{l−1} = (W_l^T δ_l) ⊙ φ'(z_{l−1})

The vector δ_l is the error signal at layer l. It tells us how much each pre-activation contributes to the loss. We compute it by multiplying the downstream error by the transposed weight matrix and the local activation derivative.
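The recursion can be checked against finite differences on a tiny network. This sketch makes some simplifying assumptions that are not in the text: one tanh hidden layer, a scalar linear output, and a squared-error loss in place of the NLL. All parameter values are hypothetical.

```python
import math

def forward(W1, b1, w2, b2, x):
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h1 = [math.tanh(z) for z in z1]
    z2 = sum(w * h for w, h in zip(w2, h1)) + b2
    return z1, h1, z2

def loss_fn(W1, b1, w2, b2, x, y):
    _, _, z2 = forward(W1, b1, w2, b2, x)
    return 0.5 * (z2 - y) ** 2

def backward(W1, b1, w2, b2, x, y):
    _, h1, z2 = forward(W1, b1, w2, b2, x)
    delta2 = z2 - y                                    # dL/dz2 for squared error
    grad_w2 = [delta2 * h for h in h1]                 # delta2 times h1^T
    delta1 = [w * delta2 * (1.0 - h * h)               # (W2^T delta2) * tanh'(z1)
              for w, h in zip(w2, h1)]
    grad_W1 = [[d * xi for xi in x] for d in delta1]   # delta1 times x^T
    return grad_W1, delta1, grad_w2, delta2            # bias gradients equal the deltas

# Hypothetical parameters and a single training example
W1 = [[0.5, -0.3], [0.8, 0.2]]
b1 = [0.1, -0.1]
w2 = [1.0, -1.5]
b2 = 0.05
x, y = [1.0, 2.0], 1.0

grad_W1, grad_b1, grad_w2, grad_b2 = backward(W1, b1, w2, b2, x, y)

# Central finite-difference check on the single entry W1[0][1]
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (loss_fn(W1_plus, b1, w2, b2, x, y)
           - loss_fn(W1_minus, b1, w2, b2, x, y)) / (2 * eps)
print(grad_W1[0][1], numeric)
```

The analytic and numeric values should agree to several decimal places. A gradient check like this is a common way to catch sign or indexing mistakes in hand-written backward passes.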

Forward vs reverse mode: Murphy (13.3.1) distinguishes forward-mode and reverse-mode differentiation. Forward mode computes one column of the Jacobian per pass, so it needs O(D) passes for D inputs (here, the parameters). Reverse mode computes one row per pass, so a scalar loss needs only one pass. Since the loss is scalar and the parameters are high-dimensional, reverse mode (backprop) is cheaper by a factor of roughly D.

Computation graphs: Modern frameworks (PyTorch, JAX) generalize backprop to arbitrary computation graphs. Each operation records its inputs and local gradient. The backward pass traverses this graph in reverse topological order, accumulating gradients via the chain rule. This is automatic differentiation — you write the forward pass and get gradients for free.
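The idea fits in a few dozen lines for scalars. The `Value` class below is a toy sketch written for this note; it is not the API of PyTorch or JAX, and it supports only addition, multiplication, and ReLU.

```python
class Value:
    """Scalar node in a computation graph (toy sketch, not a framework API)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # inputs of the op that produced this node
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Build a topological order, then walk it in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, g in zip(node.parents, node.local_grads):
                parent.grad += g * node.grad   # chain rule, accumulated

w, x, b = Value(2.0), Value(3.0), Value(-1.0)
y = (w * x + b).relu()   # the forward pass records the graph
y.backward()             # the reverse pass fills in .grad
print(y.data, w.grad, x.grad, b.grad)
```

Only the forward expression is written by hand. The gradients dy/dw = x, dy/dx = w, and dy/db = 1 come out of the reverse traversal.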

The cost of backprop is roughly 2–3x the cost of the forward pass. The memory cost is higher: we must store all intermediate activations for the backward pass (or recompute them, trading time for memory via gradient checkpointing).

Why is reverse-mode differentiation (backpropagation) more efficient than forward-mode for training neural networks?

Because the loss is a scalar, so reverse mode computes all parameter gradients in a single backward pass, while forward mode would need one pass per parameter
Because forward mode requires more memory
Because reverse mode does not use the chain rule

Chapter 4: Training & Regularization

Training a DNN is harder than training logistic regression. The loss landscape is non-convex with many local minima and saddle points. Murphy (13.4) covers the essential tricks.

Learning rate scheduling: Too large and training diverges. Too small and it takes forever. Common schedules include step decay, cosine annealing, and warmup followed by decay. The Adam optimizer adapts per-parameter step sizes using running estimates of the gradient's first and second moments.
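As an illustration, warmup followed by cosine annealing can be written as a small function of the step index. The function name, the default values, and the choice of linear warmup are assumptions made for this sketch, not settings from Murphy.

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr (illustrative defaults)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_schedule(step, total_steps=1000))
```

The rate ramps up during the first 100 steps, peaks at base_lr, and then follows half a cosine wave down to min_lr at the final step.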

Vanishing and exploding gradients (13.4.2): In deep networks, gradients are products of many Jacobians. If the Jacobians' singular values are consistently < 1, gradients vanish exponentially with depth. If they are consistently > 1, gradients explode. Solutions: (1) careful initialization (He/Xavier), (2) non-saturating activations (ReLU), (3) normalization layers (BatchNorm, LayerNorm), (4) residual connections.

Residual connections (He et al. 2016): Instead of computing h_l = f(h_{l−1}), compute h_l = h_{l−1} + f(h_{l−1}). The layer Jacobian becomes I + ∂f/∂h_{l−1}, so the gradient always has a path through the identity shortcut that is not multiplied by any weight matrix. This greatly alleviates the vanishing gradient problem and enabled networks with hundreds of layers.

Weight decay (ℓ₂): Same as ridge regression (Ch 11). Adds (λ/2)||w||² to the loss, which penalizes large weights. With plain SGD, this is equivalent to multiplying the weights by (1 − ηλ) before each gradient step.
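The equivalence between the penalty and the shrink factor is easy to verify numerically. The sketch assumes the penalty is written as (λ/2)||w||², so its gradient is λw, and uses plain SGD; the weight and gradient values are hypothetical.

```python
def sgd_step_l2(w, grad, lr, lam):
    """SGD on loss + (lam/2)*||w||^2: the penalty adds lam*w to the gradient."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]

def sgd_step_decay(w, grad, lr, lam):
    """The same update written as: shrink the weights, then take a gradient step."""
    return [(1.0 - lr * lam) * wi - lr * gi for wi, gi in zip(w, grad)]

w = [1.0, -2.0, 0.5]
grad = [0.1, 0.3, -0.2]   # hypothetical gradient of the data loss
a = sgd_step_l2(w, grad, lr=0.1, lam=0.01)
b = sgd_step_decay(w, grad, lr=0.1, lam=0.01)
print(a, b)
```

Both functions return the same weights up to floating-point rounding. The identity relies on the plain SGD update; for adaptive optimizers such as Adam the two formulations generally differ.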

Dropout (13.5.4): Randomly set each hidden unit to zero with probability p during training. At test time, scale by (1−p). Forces the network to learn redundant representations. Approximately equivalent to an ensemble of 2^H sub-networks.
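A minimal sketch of the train and test behavior described above, using Python's `random` module with a fixed seed. The function names are illustrative.

```python
import random

def dropout_train(h, p, rng):
    """Zero each unit independently with probability p (training time)."""
    return [0.0 if rng.random() < p else hi for hi in h]

def dropout_test(h, p):
    """Keep every unit but scale by the keep probability (1 - p)."""
    return [(1.0 - p) * hi for hi in h]

rng = random.Random(0)
h = [1.0] * 10000         # hypothetical activations, all equal to 1
p = 0.3
train_mean = sum(dropout_train(h, p, rng)) / len(h)
test_mean = sum(dropout_test(h, p)) / len(h)
print(train_mean, test_mean)
```

The two means are close, which is the point of the test-time scaling: the next layer sees inputs of the same expected size in both modes. Many implementations instead divide by (1 − p) during training ("inverted dropout") so that nothing needs to change at test time.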

Batch normalization (Ioffe & Szegedy 2015) normalizes each layer's pre-activations to zero mean and unit variance within each mini-batch, then applies a learned scale and shift. It stabilizes training, allows higher learning rates, and acts as a mild regularizer.
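For a single unit, the training-time computation looks roughly like this. The sketch handles one feature across a mini-batch; `gamma`, `beta`, and `eps` are the conventional names for the learned scale, learned shift, and a small stabilizing constant, and the input values are hypothetical. The running statistics used at test time are left out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

z = [2.0, 4.0, 6.0, 8.0]   # hypothetical pre-activations of one unit
out = batch_norm(z)
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out, out_mean, out_var)
```

With gamma = 1 and beta = 0 the output has mean 0 and variance very close to 1. The learned scale and shift let the network undo the normalization where that helps.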

Problem	Solution	Murphy Section
Vanishing gradients	ReLU, residual connections, normalization	13.4.2–13.4.4
Overfitting	Dropout, weight decay, early stopping, data augmentation	13.5
Slow convergence	Adam, learning rate warmup, BatchNorm	13.4.1
Poor initialization	He init (ReLU), Xavier/Glorot init (tanh)	13.4.5

How do residual connections help train very deep networks?

They provide an identity shortcut for gradients to flow through, preventing vanishing gradients regardless of depth
They reduce the number of parameters
They make the loss function convex

Chapter 5: Convolutional Layers

Images have spatial structure: nearby pixels are correlated, and patterns (edges, textures) can appear anywhere. An MLP ignores this, treating each pixel as an independent feature. A convolutional neural network (CNN) exploits it.

A convolutional layer applies a small kernel (filter) K of size k×k across the image, computing a dot product at each spatial position:

y[i, j] = ∑_m ∑_n K[m, n] · x[i+m, j+n] + b

This is a cross-correlation (Murphy 14.2.1). The same kernel is applied at every position — weight sharing. This has two key benefits:

Translation equivariance: If the input shifts by (dx, dy), the output shifts by the same amount. The network detects patterns regardless of where they appear in the image.

Parameter efficiency: A 3×3 kernel has only 9 parameters regardless of image size. An MLP connecting every pixel to every hidden unit would need millions.
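The cross-correlation formula and the equivariance property can both be checked on a toy image. This sketch assumes no padding and stride 1; the 2×2 edge kernel and the images are hypothetical.

```python
def cross_correlate2d(x, K, b=0.0):
    """Valid (no padding, stride 1) cross-correlation of image x with kernel K."""
    kh, kw = len(K), len(K[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(K[m][n] * x[i + m][j + n] for m in range(kh) for n in range(kw)) + b
             for j in range(out_w)]
            for i in range(out_h)]

# Hypothetical 2x2 vertical-edge kernel: right column minus left column
kernel = [[-1, 1], [-1, 1]]

image = [[0, 0, 1, 1, 1]] * 4     # vertical edge between columns 1 and 2
shifted = [[0, 0, 0, 1, 1]] * 4   # the same edge moved one pixel to the right

feature_map = cross_correlate2d(image, kernel)
shifted_map = cross_correlate2d(shifted, kernel)
print(feature_map[0], shifted_map[0])
```

The response peak moves one position to the right when the edge does, and the same four kernel weights are reused at every location.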

A typical conv layer has multiple kernels (say 64), each producing one feature map. The input is C channels (e.g., RGB = 3). So the kernel is actually k×k×C, and we have F such kernels, giving F output feature maps. Total parameters: F × k × k × C + F.

Pooling layers (Murphy 14.2.2) downsample feature maps, reducing spatial resolution. Max pooling takes the maximum over a small window (e.g., 2×2). This provides a degree of translation invariance and reduces computation.
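Max pooling with a 2×2 window can be sketched as follows, assuming non-overlapping windows (stride equal to the window size). The feature map values are made up.

```python
def max_pool2d(x, size=2):
    """Non-overlapping max pooling (stride equals window size)."""
    return [[max(x[i + m][j + n] for m in range(size) for n in range(size))
             for j in range(0, len(x[0]) - size + 1, size)]
            for i in range(0, len(x) - size + 1, size)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
pooled = max_pool2d(fmap)
print(pooled)
```

The 4×4 map shrinks to 2×2. Moving a strong activation by one pixel inside its window leaves the pooled output unchanged, which is where the small amount of invariance comes from.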

Layer Type	Purpose	Parameters
Conv2D(k, F)	Detect local patterns	F × k² × C_in + F
MaxPool(s)	Downsample, add invariance	0
BatchNorm	Normalize activations	2 × F (scale + shift)
Global AvgPool	Collapse spatial dims	0

What is the main advantage of weight sharing in convolutional layers?

It makes training faster by reducing computation
The same kernel detects the same pattern everywhere in the image, giving translation equivariance with far fewer parameters than a fully connected layer
It prevents overfitting by limiting the number of layers

Chapter 6: CNN Architectures

The history of CNNs is a story of going deeper and wider while managing gradients. Murphy (14.3) traces the key milestones.

LeNet-5 (LeCun et al. 1998): Two conv layers, two pooling layers, three FC layers. Only ~60K parameters. Designed for handwritten digit recognition. The template for all CNNs that followed.

AlexNet (Krizhevsky et al. 2012): Scaled up LeNet to 60M parameters, used ReLU instead of sigmoid, applied dropout and data augmentation. Won ImageNet 2012 by a huge margin, launching the deep learning revolution.

GoogLeNet/Inception (2014): Instead of choosing a kernel size, use all of them. An Inception module applies 1×1, 3×3, and 5×5 convolutions in parallel, then concatenates the results. 1×1 convolutions reduce channel dimensionality first (bottleneck), keeping computation manageable. 22 layers, only 5M parameters.

ResNet (He et al. 2015): The residual connection: y = x + F(x). Instead of learning the mapping directly, the network learns the residual (how to adjust the input). This is easier when the optimal mapping is close to identity. ResNet-152 has 152 layers — deeper than was thought possible before skip connections.

Architecture	Year	Depth	Key Innovation
LeNet-5	1998	5	Conv + pool template
AlexNet	2012	8	ReLU, dropout, GPU training
VGGNet	2014	16–19	Small 3×3 filters throughout
GoogLeNet	2014	22	Inception modules, 1×1 bottleneck
ResNet	2015	50–152	Residual connections
DenseNet	2017	121+	Dense connections (each layer receives all earlier feature maps in its block)

The pattern: Early layers learn low-level features (edges, colors). Middle layers learn textures and parts. Deep layers learn objects and scenes. This hierarchical feature learning is the core power of deep CNNs.

What is the key innovation in ResNet that enabled training networks with 100+ layers?

Skip (residual) connections that let the network learn residuals y = x + F(x), allowing gradients to flow through the identity shortcut
Using larger convolutional kernels
Removing all pooling layers

Chapter 7: Recurrent Neural Networks

Sequences have temporal structure: the meaning of a word depends on what came before. CNNs can handle fixed-size inputs, but sentences and time series have variable length. Recurrent neural networks (RNNs) process sequences one step at a time, maintaining a hidden state.

h_t = φ(W_hh h_t−1 + W_xh x_t + b_h)

The hidden state h_t is a compressed summary of the sequence so far. At each step, it combines the previous state with the new input. The same weights (W_hh, W_xh) are used at every time step — this is weight sharing across time.
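The recurrence takes only a few lines of code. The sketch uses tanh for φ and small hypothetical weights (2 hidden units, 1-dimensional inputs); the same `rnn_step` is applied at every time step.

```python
import math

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One step of h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    pre = [a + b + c for a, b, c in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t), b_h)]
    return [math.tanh(p) for p in pre]

def rnn_forward(xs, h0, W_hh, W_xh, b_h):
    """Apply the same weights along a sequence of any length."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(h, x_t, W_hh, W_xh, b_h)
        states.append(h)
    return states

# Hypothetical weights: 2 hidden units, 1-dimensional inputs
W_hh = [[0.5, -0.2], [0.1, 0.3]]
W_xh = [[1.0], [-1.0]]
b_h = [0.0, 0.0]
states = rnn_forward([[1.0], [0.0], [-1.0]], [0.0, 0.0], W_hh, W_xh, b_h)
print(states)
```

Nothing in `rnn_forward` depends on the sequence length, which is how a fixed set of weights handles variable-length inputs.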

The vanishing gradient problem returns (15.2.6): Backpropagation through time (BPTT) unrolls the RNN across T steps. The gradient includes products of T Jacobians. If their spectral norm is < 1, gradients vanish. If > 1, they explode. Vanilla RNNs struggle with sequences longer than ~20 steps.

LSTMs (Hochreiter & Schmidhuber 1997) solve this with a cell state c_t that flows through time with minimal modification. Three gates control information flow:

Gate	Formula	Role
Forget (f_t)	σ(W_f[h_t−1, x_t] + b_f)	What to erase from cell state
Input (i_t)	σ(W_i[h_t−1, x_t] + b_i)	What to write to cell state
Output (o_t)	σ(W_o[h_t−1, x_t] + b_o)	What to expose as hidden state

c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c[h_{t−1}, x_t] + b_c)
h_t = o_t ⊙ tanh(c_t)
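Putting the gates and the cell update together gives one LSTM step. The sketch below also uses the standard output equation h_t = o_t ⊙ tanh(c_t). The `params` dictionary layout and all weight values are assumptions made for the example, chosen so that the forget gate stays open and the input gate stays closed.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, b, v):
    return [sum(w * vi for w, vi in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(h_prev, c_prev, x_t, params):
    """One LSTM step; params maps 'f', 'i', 'o', 'c' to (W, b) pairs."""
    hx = h_prev + x_t                                      # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], hx)]     # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], hx)]     # input gate
    o = [sigmoid(a) for a in affine(*params["o"], hx)]     # output gate
    g = [math.tanh(a) for a in affine(*params["c"], hx)]   # candidate values
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Hypothetical 1-unit LSTM: large forget bias (remember), very negative input bias (ignore input)
zeros = [[0.0, 0.0]]
params = {"f": (zeros, [20.0]), "i": (zeros, [-20.0]),
          "o": (zeros, [0.0]), "c": (zeros, [0.0])}

h, c = [0.0], [1.0]
for _ in range(100):
    h, c = lstm_step(h, c, [0.5], params)
print(h, c)
```

After 100 steps the cell state is still essentially 1. Because the update is additive and f_t is close to 1, both the stored value and the gradient along c_t survive over long spans.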

GRUs (Cho et al. 2014) simplify the LSTM by combining the forget and input gates into a single update gate, and merging the cell and hidden states. Fewer parameters, similar performance in many tasks.

What problem do LSTM gates solve that vanilla RNNs cannot?

They allow the cell state to carry information across many time steps without the gradient vanishing, by using gates to control what is remembered and forgotten
They make the network deeper
They reduce the number of parameters

Chapter 8: Attention & Transformers

RNNs process sequences left-to-right, compressing everything into a fixed-size hidden state. For long sequences, early information gets washed out. Attention (Bahdanau et al. 2014) fixes this by letting the model look back at all previous states.

Attention as soft dictionary lookup (15.4.1): Given a query q and a set of key-value pairs (k_i, v_i), attention computes a weighted sum of values: Attn(q, K, V) = ∑_i α_i v_i, where the weights α are a softmax over i of the scores q^T k_i / √d. The weights are nonnegative, sum to 1, and are largest for the keys most similar to the query. Think of it as a soft, differentiable lookup table.

The Transformer (Vaswani et al. 2017) replaces recurrence entirely with self-attention. Each token attends to every other token in parallel:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q = XW_Q, K = XW_K, V = XW_V are linear projections of the input sequence X. The √d_k scaling prevents the dot products from becoming too large before softmax.
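Here is the formula as a short plain-Python sketch. For simplicity it takes Q, K, and V directly as lists of row vectors and skips the learned projections W_Q, W_K, W_V; the example vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
Q = [[5.0, 0.0]]               # a query aligned with the first key
result = attention(Q, K, V)
print(result)
```

The output is dominated by the first value vector because the query matches the first key, but the second value still contributes a little. That soft mixing is what keeps the lookup differentiable.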

Multi-head attention (15.5.2): Instead of one attention function, use H parallel heads, each with its own Q, K, V projections of dimension d/H. Concatenate the outputs and project: MultiHead(X) = Concat(head₁, …, head_H)W_O. Different heads can attend to different aspects (syntax, semantics, position).

Positional encoding: Self-attention is permutation-equivariant: shuffling the input tokens just shuffles the outputs the same way, so on its own it cannot tell word order. We add positional information using sinusoidal or learned embeddings: x_i' = x_i + PE(i).
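The sinusoidal variant can be sketched as below, following the formula from Vaswani et al. (2017): even dimensions use a sine, odd dimensions a cosine, with wavelengths growing geometrically. The function names and the toy embeddings are illustrative.

```python
import math

def sinusoidal_pe(num_positions, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d)), PE[pos][2i+1] = cos of the same angle."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):        # i runs over the even indices 0, 2, 4, ...
            angle = pos / (10000 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        pe.append(row[:d_model])
    return pe

def add_positions(X, pe):
    """x_i' = x_i + PE(i) for each token embedding x_i."""
    return [[x + p for x, p in zip(row, pe_row)] for row, pe_row in zip(X, pe)]

pe = sinusoidal_pe(3, 4)
X = [[1.0, 1.0, 1.0, 1.0]] * 3   # three identical hypothetical token embeddings
X_pos = add_positions(X, pe)
print(X_pos)
```

The three tokens start out identical, yet each row of the result is different. After adding PE, the attention layers can tell positions apart.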

A Transformer block stacks: (1) multi-head self-attention, (2) LayerNorm + residual, (3) feed-forward MLP, (4) LayerNorm + residual. GPT stacks these for language generation. BERT uses bidirectional attention for understanding.

Property	RNN	Transformer
Parallelism	Sequential (slow)	Fully parallel (fast)
Long-range deps	Vanish with distance	Direct attention to any position
Memory	O(1) per step	O(T²) for self-attention
Training speed	Slow (no parallelism)	Fast (matrix multiply)

Why does scaled dot-product attention divide by √d_k?

To prevent dot products from growing too large with high dimensionality, which would push softmax into saturation (near-one-hot) and kill gradients
To normalize the output to unit variance
To reduce the number of parameters

Chapter 9: MLP Playground

Watch a 2-layer MLP learn to separate nonlinear data. Click to place points from two classes, then train the network. The heatmap shows the learned decision boundary.

MLP: 2D Nonlinear Classifier

Click to place points (toggle class below). Hit Train to run gradient descent on a 2-layer MLP with 16 hidden units and ReLU activations.

What to try: Place points in a circle pattern (class 1 inside, class 0 outside). A linear model cannot separate them, but the MLP can learn a circular decision boundary. Try increasing hidden units to see how the boundary becomes smoother.

What enables the MLP to learn nonlinear decision boundaries that logistic regression cannot?

The hidden layer with ReLU activations learns a nonlinear feature transformation, making the data linearly separable in the hidden space
The MLP uses a different loss function
The MLP uses more training data

Chapter 10: CNN Feature Explorer

See how a 1D convolution extracts features from a signal. The kernel slides across the input, computing a dot product at each position. Different kernels detect different patterns.

1D Convolution: Kernel Sliding

The gray line is the input signal. The orange region is the kernel window. The teal line is the convolution output (feature map). Use the slider to move the kernel.

Observe: The edge detection kernel produces large outputs where the signal changes rapidly. The smoothing kernel averages nearby values, blurring the signal. The sharpening kernel amplifies differences. In a CNN, the network learns which kernels are useful for the task.
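The same behavior can be reproduced offline in a few lines. The signal and the two kernels below are hypothetical stand-ins chosen for illustration; they are not necessarily the ones used in the interactive demo.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: slide the kernel and take dot products."""
    k = len(kernel)
    return [sum(kernel[m] * signal[i + m] for m in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0, 0, 0, 4, 4, 4, 0, 0]            # hypothetical step-like signal
edge = conv1d(signal, [-1, 0, 1])            # responds where the signal changes
smooth = conv1d(signal, [0.25, 0.5, 0.25])   # weighted local average
print(edge)
print(smooth)
```

The edge kernel is positive on the rising step, negative on the falling step, and zero on the flat top. The smoothing kernel turns the sharp step into a gentle ramp.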

In a CNN for image classification, what do early convolutional layers typically learn to detect?

Simple patterns like edges, gradients, and color blobs, which are combined into more complex features by deeper layers
Entire objects like faces and cars
Random noise patterns

Chapter 11: Connections

DNNs are the backbone of modern machine learning. Every subsequent chapter in Murphy builds on them.

Concept from this chapter	Where it leads
MLPs	Building block for every architecture; encoder/decoder in autoencoders (Ch 20)
Backpropagation	Trains all models: GANs, VAEs, transformers, RL policies
CNNs	Feature extractors for GPs (Ch 17), deep metric learning (Ch 16)
Residual connections	Used in ResNets, Transformers, diffusion models
Self-attention	Core of GPT, BERT, Vision Transformers (ViT)
Dropout / weight decay	Dropout can be read as approximate Bayesian inference in neural networks (Gal & Ghahramani 2016)
Encoder-decoder	Autoencoders (Ch 20), seq2seq translation
Softmax output	Same as logistic regression (Ch 10), used in clustering (Ch 21)

What we covered: MLPs with fully connected layers, activation functions (sigmoid, ReLU, GELU), backpropagation via reverse-mode autodiff, training tricks (BatchNorm, residual connections, Adam, dropout), convolutional layers with weight sharing and pooling, CNN architectures from LeNet to ResNet, RNNs with LSTM/GRU gating, and the Transformer with multi-head self-attention.

What comes next: Chapters 16–17 explore models that don't fix a parametric form at all. KNN, kernel methods, and Gaussian processes let the data speak directly. SVMs find the maximum margin boundary using the kernel trick. These nonparametric methods complement DNNs and sometimes outperform them on small datasets.

"What I cannot create, I do not understand." — Richard Feynman

What is the relationship between the Transformer and RNNs for sequence modeling?

Transformers replace sequential recurrence with parallel self-attention, achieving better long-range dependencies and faster training at the cost of O(T²) memory
Transformers are a type of RNN
RNNs are always better than Transformers
