Szeliski, Chapter 5

Deep Learning

From perceptrons to ResNets: supervised learning, neural network mechanics, CNNs, and generative models.

Prerequisites: Chapters 3-4 + basic probability. Some linear algebra helps.
10 chapters · 7+ simulations · 0 assumed ML knowledge

Chapter 0: Why Learning?

For decades, computer vision was dominated by hand-crafted features: edge detectors designed by humans, gradient histograms tuned by experts, and matching rules coded by hand. These worked for narrow tasks but failed to generalize.

The deep learning revolution changed everything. Instead of designing features, we learn them from data. Given millions of labeled images, a neural network discovers what patterns matter for the task — edges, textures, shapes, objects — all automatically.

The result: since 2012, deep learning has come to dominate nearly every major vision benchmark. For image classification, object detection, segmentation, depth estimation, and image generation, state-of-the-art systems are now almost all deep networks.

The paradigm shift: Hand-crafted features → learned features. Instead of telling the computer what to look for, you show it millions of examples and let it figure out what matters. Strikingly, the feature hierarchy a deep network learns resembles what neuroscientists observe in the visual cortex.
Feature Learning Hierarchy

A deep network learns increasingly abstract features. Early layers detect edges, middle layers detect textures and parts, deep layers detect whole objects.

What is the key advantage of learned features over hand-crafted features?

Chapter 1: Classical Machine Learning

Before deep learning, the standard pipeline was: extract hand-crafted features, then train a classifier on those features. Understanding these methods helps appreciate what deep learning replaces.

| Method | Key Idea | Limitation |
| --- | --- | --- |
| Nearest Neighbors | Classify by finding the most similar training example | Slow at test time, needs good distance metric |
| Logistic Regression | Linear decision boundary with probabilistic output | Cannot learn nonlinear patterns |
| SVMs | Maximum-margin linear separator, kernel trick for nonlinearity | Does not scale to millions of images |
| Decision Trees / Forests | Axis-aligned splits, ensemble for robustness | Features must be hand-designed |

K-means and PCA: Unsupervised methods also played key roles. K-means was used to build visual vocabularies (Bag of Words). PCA reduced dimensionality (eigenfaces for face recognition). These ideas still appear inside modern systems as layers or preprocessing steps.
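
To make the nearest-neighbor method listed above concrete, here is a minimal sketch. It assumes Euclidean distance and a tiny made-up training set; the feature values and the labels "cat" and "dog" are purely illustrative, not from the book.

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training example closest to `query`.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical 2D features for two classes.
train = [((0.0, 0.0), "cat"), ((0.2, 0.1), "cat"),
         ((1.0, 1.0), "dog"), ((0.9, 1.2), "dog")]

print(nearest_neighbor(train, (0.1, 0.3)))  # nearest point is (0.2, 0.1)
print(nearest_neighbor(train, (1.1, 0.9)))  # nearest point is (1.0, 1.0)
```

Every query scans the whole training set, which is the "slow at test time" limitation, and the answer depends entirely on how meaningful the distance is in the chosen feature space.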
Decision Boundaries

Compare how different classifiers separate two classes in 2D feature space.

What was the main bottleneck of the classical feature + classifier pipeline?

Chapter 2: Neurons and Layers

A neural network is built from simple units: neurons. Each neuron computes a weighted sum of its inputs, adds a bias, and passes the result through a nonlinear activation function:

$$y = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$
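
As a rough sketch, the neuron equation above maps directly to a few lines of plain Python. The sigmoid is used here as one possible choice of σ, and the weights and inputs are made-up numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, x, b, activation=sigmoid):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# z = 0.5*2 + (-1)*1 + 0 = 0, and sigmoid(0) = 0.5
y = neuron(w=[0.5, -1.0], x=[2.0, 1.0], b=0.0)
print(y)
```

A layer is just many such neurons sharing the same input x, each with its own w and b.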

Stack neurons into layers, stack layers into a network. The key activation functions:

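| Function | Formula | Notes |
| --- | --- | --- |
| Sigmoid | $\sigma(z) = 1/(1 + e^{-z})$ | Squashes to (0, 1). Saturates, causing vanishing gradients. |
| Tanh | $\tanh(z)$ | Squashes to (−1, 1). Zero-centered but still saturates. |
| ReLU | $\max(0, z)$ | No saturation for z > 0. Zero gradient for z < 0 can leave neurons "dead". |
| GELU | $z \cdot \Phi(z)$ | Smooth ReLU-like curve ($\Phi$ is the standard normal CDF). Common default in transformers. |
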
Why nonlinearity? Without activation functions, stacking linear layers collapses into a single linear transformation, since W2(W1x) = (W2W1)x by associativity of matrix multiplication. The nonlinearity is what gives networks their power: with enough neurons and nonlinear activations, a network can approximate any continuous function on a bounded domain to arbitrary accuracy (universal approximation theorem).
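
The collapse of stacked linear layers can be checked numerically. The sketch below uses two made-up 2×2 weight matrices and omits biases for brevity; the helper names `matvec` and `matmul` are illustrative.

```python
def matvec(W, x):
    """Matrix-vector product for nested-list matrices."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for nested-list matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, -1.0], [2.0, 0.0]]
W2 = [[0.0, 1.0], [1.0, 1.0]]
x = [1.0, 3.0]

two_layers = matvec(W2, matvec(W1, x))   # W2 (W1 x): two linear layers
one_layer = matvec(matmul(W2, W1), x)    # (W2 W1) x: one equivalent layer

relu = lambda v: [max(0.0, vi) for vi in v]
with_relu = matvec(W2, relu(matvec(W1, x)))  # nonlinearity in between

print(two_layers, one_layer, with_relu)
```

The first two results agree, so the extra linear layer added nothing. Putting a ReLU between the layers gives a different output that this single matrix W2·W1 does not reproduce.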
Activation Functions

Compare activation function shapes. Notice how ReLU avoids saturation for positive inputs.
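
Saturation can also be seen in numbers rather than shapes. This sketch implements the four activations from the table together with a few derivatives, using the standard closed forms; GELU is written with the exact normal CDF via `math.erf`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def d_relu(z):
    return 1.0 if z > 0 else 0.0

def gelu(z):
    # z * Phi(z), with Phi the standard normal CDF written via erf.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-5.0, 0.0, 5.0):
    print(f"z={z:+.1f}  sigmoid'={d_sigmoid(z):.4f}  "
          f"tanh'={d_tanh(z):.4f}  relu'={d_relu(z):.1f}  gelu={gelu(z):.4f}")
```

At z = 5 the sigmoid and tanh derivatives are close to zero, while the ReLU derivative is still 1. At z = −5 all three derivatives are near or exactly zero, which is where the "dead neuron" issue for ReLU comes from.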

Why are activation functions essential in neural networks?

Chapter 3: Backpropagation

How does a network learn? By adjusting its weights to minimize a loss function. The loss measures the gap between the network's predictions and the true labels.

$$L = -\sum_i y_i \log(p_i)$$

This is the cross-entropy loss for classification. To minimize it, we need the gradient of the loss with respect to every weight in the network. Backpropagation computes these gradients efficiently using the chain rule, propagating error signals backward from the output to the input.
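
A small worked example may help. The sketch below assumes the probabilities p_i come from a softmax over raw scores (logits), which the text does not state explicitly; the logit values are made up. With that assumption, the gradient of the loss with respect to the logits takes the simple form p − y.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, y):
    """L = -sum_i y_i * log(p_i); terms with y_i = 0 contribute nothing."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

logits = [2.0, 1.0, 0.1]   # hypothetical network outputs for 3 classes
y = [1.0, 0.0, 0.0]        # one-hot label: class 0 is correct

p = softmax(logits)
loss = cross_entropy(p, y)
grad_logits = [pi - yi for pi, yi in zip(p, y)]  # dL/dlogits under softmax

print(p, loss, grad_logits)
```

The gradient pushes the correct class's logit up (its entry is negative) and the others down, in proportion to how much probability they wrongly received.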

The update rule (gradient descent):

w ← w − η · ∂L/∂w

where η is the learning rate. In practice, we use stochastic gradient descent (SGD) on mini-batches, and modern optimizers like Adam adapt the learning rate per parameter.
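
Here is the update rule as a minimal sketch on a one-dimensional toy problem. The loss L(w) = (w − 3)² is a made-up example chosen because its minimum is known; real networks apply the same step to every weight at once.

```python
def gradient_descent(grad, w0, lr, steps):
    """Repeat w <- w - lr * grad(w) and return all visited points."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - lr * grad(w)
        path.append(w)
    return path

# Hypothetical loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
grad = lambda w: 2.0 * (w - 3.0)

good = gradient_descent(grad, w0=0.0, lr=0.1, steps=50)
bad = gradient_descent(grad, w0=0.0, lr=1.1, steps=50)

print(good[-1])  # close to the minimizer w = 3
print(bad[-1])   # far from 3: the step size is too large and the iterates diverge
```

On this quadratic, each step multiplies the distance to the minimum by (1 − 2·lr), so lr = 0.1 shrinks it by 0.8 per step while lr = 1.1 grows it by a factor of 1.2 in magnitude.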

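Backprop is just the chain rule. For a composition f(g(h(x))), the derivative is f'(g(h(x))) · g'(h(x)) · h'(x): the product of each function's local derivative, evaluated at its own input. Backprop applies this systematically across millions of parameters, reusing intermediate values from the forward pass. The backward pass costs roughly 2x the forward pass, which is remarkably cheap for computing millions of gradients at once.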
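
The chain rule can be traced by hand on a small composition. In this sketch the functions h(x) = 3x + 1, g(u) = u², and f(v) = sin(v) are arbitrary choices for illustration, and the result is checked against a finite-difference estimate.

```python
import math

def forward_backward(x):
    """Compute y = sin((3x + 1)^2) and dy/dx by the chain rule."""
    # Forward pass: keep intermediates for reuse in the backward pass.
    u = 3.0 * x + 1.0        # h(x)
    v = u ** 2               # g(u)
    y = math.sin(v)          # f(v)
    # Backward pass: multiply local derivatives from output to input.
    dy_dv = math.cos(v)      # f'(v)
    dv_du = 2.0 * u          # g'(u)
    du_dx = 3.0              # h'(x)
    return y, dy_dv * dv_du * du_dx

def numerical_derivative(fn, x, eps=1e-6):
    """Central finite difference, used only as an independent check."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)

x = 0.5
y, dy_dx = forward_backward(x)
approx = numerical_derivative(lambda t: forward_backward(t)[0], x)
print(dy_dx, approx)
```

The two numbers should agree to several decimal places. Note that the backward pass reuses u and v from the forward pass, which is why backprop needs only a small constant multiple of the forward cost.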
Gradient Descent

Watch gradient descent descend a loss landscape. The learning rate controls step size.

What does backpropagation compute?

Chapter 4: Convolutional Neural Networks

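A fully connected network ignores spatial layout: every pixel gets its own independent weight, so nothing in the architecture knows which pixels are neighbors. But images have spatial structure: nearby pixels are related. Convolutional neural networks (CNNs) exploit this with three key ideas: local receptive fields, weight sharing, and pooling. A convolutional layer computes: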

$$\text{output}(x, y) = \sigma\Big(\sum_{i,j,k} \text{input}(x+i,\, y+j,\, k) \cdot \text{filter}(i, j, k) + b\Big)$$

A CNN is just learned convolution (the filtering operation from Szeliski Chapter 3) followed by a nonlinearity, repeated many times. Early layers learn edge detectors. Middle layers learn texture and part detectors. Deep layers learn object-level features.
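
The convolution formula above can be written as a naive loop. This is a sketch for a single filter with no padding and stride 1, using ReLU as the nonlinearity; the image is indexed as image[row][column][channel], and the toy image and kernel are made up. Real libraries compute the same thing with heavily optimized kernels.

```python
def conv2d(image, kernel, bias=0.0):
    """Single-filter 'valid' convolution (no padding, stride 1) plus ReLU.

    image:  H x W x C nested lists;  kernel: kh x kw x C nested lists.
    As in the formula, the kernel is not flipped (cross-correlation).
    """
    H, W, C = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(H - kh + 1):
        row = []
        for x in range(W - kw + 1):
            total = bias
            for i in range(kh):
                for j in range(kw):
                    for k in range(C):
                        total += image[y + i][x + j][k] * kernel[i][j][k]
            row.append(max(0.0, total))  # ReLU
        out.append(row)
    return out

# Toy 3x4 single-channel image with a dark-to-bright vertical edge.
image = [[[0.0], [0.0], [1.0], [1.0]] for _ in range(3)]
# 1x2 kernel that responds to a left-to-right increase in brightness.
kernel = [[[-1.0], [1.0]]]

print(conv2d(image, kernel))  # each row is [0.0, 1.0, 0.0]
```

The same two kernel weights are reused at every position, and the output is non-zero only where the edge sits.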

The parameter savings: A 224×224 RGB image has 150,528 input values (224 × 224 × 3). A fully connected layer to 1000 neurons needs about 150 million weights. A 3×3 convolutional layer with 64 filters needs only 1,728 weights (3 × 3 × 3 × 64), a reduction of roughly 87,000×, or about five orders of magnitude. Weight sharing is the secret.
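
The arithmetic behind these counts, as a quick check. Biases are left out of both counts to match the figures in the text.

```python
# Fully connected: every input value connects to every output neuron.
inputs = 224 * 224 * 3             # 150,528 input values
fc_weights = inputs * 1000         # 150,528,000 weights

# Convolution: one small filter per output channel, shared across positions.
conv_weights = 3 * 3 * 3 * 64      # kernel_h * kernel_w * in_channels * filters

ratio = fc_weights / conv_weights
print(fc_weights, conv_weights, round(ratio))  # 150528000 1728 87111
```

The ratio is on the order of 10^5. The convolutional count does not depend on the image size at all, which is exactly what weight sharing buys.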
CNN Architecture

Watch data flow through conv → relu → pool → conv → relu → pool → fully connected.

What is the key insight of convolutional layers over fully connected layers for images?

Chapter 5: Architectures

The history of CNN architectures is a story of going deeper and finding ways to make depth work:

| Architecture | Year | Key Innovation | Depth |
| --- | --- | --- | --- |
| LeNet-5 | 1998 | First practical CNN (digit recognition) | 5 |
| AlexNet | 2012 | ReLU, dropout, GPU training. Won ImageNet. | 8 |
| VGGNet | 2014 | Small 3×3 filters stacked deep | 19 |
| GoogLeNet | 2014 | Inception modules (parallel filter sizes) | 22 |
| ResNet | 2015 | Skip connections solve vanishing gradients | 152 |
| U-Net | 2015 | Encoder-decoder with skip connections for dense prediction | ~23 |

ResNet's insight: Instead of learning a mapping H(x), learn the residual F(x) = H(x) − x. Adding a skip connection (y = F(x) + x) means the network only needs to learn what to change, not the entire transformation. This allows training networks hundreds of layers deep.
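
The skip connection itself is a one-line idea. In this sketch `residual_fn` stands in for the learned layers F; the two example functions are hypothetical placeholders rather than real convolutional layers.

```python
def residual_block(x, residual_fn):
    """Return y = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]

# If F outputs zeros, the block passes its input through unchanged.
zero_fn = lambda x: [0.0 for _ in x]
# A hypothetical small correction learned by F.
nudge_fn = lambda x: [0.1 * xi for xi in x]

x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_fn))   # identical to x
print(residual_block(x, nudge_fn))  # x plus a small change
```

Because doing nothing is easy (F = 0 gives the identity), adding more blocks should not make the network worse, and the "+ x" path gives gradients a direct route back to earlier layers.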

Beyond CNNs, Vision Transformers (ViT) split images into patches and process them with self-attention. They now match or exceed CNNs on large datasets, showing that the inductive bias of convolution is not strictly necessary — just helpful when data is limited.
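
The patch-splitting step can be sketched as follows. This covers only the tokenization, not the self-attention that follows, and it assumes a single-channel image whose sides are multiples of the patch size; the function name `patchify` is illustrative.

```python
def patchify(image, patch):
    """Split an H x W image (nested lists) into flattened patch x patch tiles.

    Assumes H and W are multiples of `patch`. Tiles are returned in
    row-major order, each flattened to a list of patch * patch values.
    """
    H, W = len(image), len(image[0])
    tokens = []
    for top in range(0, H, patch):
        for left in range(0, W, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image
tokens = patchify(image, patch=2)
print(len(tokens), tokens[0])  # 4 tokens; the first is [0, 1, 4, 5]
```

By the same arithmetic, a 224×224 image cut into 16×16 patches yields 14 × 14 = 196 tokens. Each token is then treated like a word in a sentence, with no built-in assumption that neighboring patches are related.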

What problem do ResNet's skip connections solve?

Chapter 6: Training Techniques

Training a deep network well requires more than just backprop. Key techniques:

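Regularization: dropout and data augmentation.

Optimization: mini-batch SGD, adaptive optimizers such as Adam, and batch normalization.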

Batch normalization explained: During training, each mini-batch has slightly different statistics. BatchNorm normalizes each layer's inputs to zero mean and unit variance over the mini-batch, then applies a learned affine transform (scale and shift). It was introduced to reduce the "internal covariate shift" problem: each layer sees more stable inputs, which allows faster and more stable training.
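
A minimal sketch of the training-time computation for a single feature. The scale `gamma` and shift `beta` would normally be learned; here they are fixed, and the small constant `eps` (a conventional safeguard against division by zero) is an assumption not mentioned in the text.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift.

    batch: list of scalar activations for a single feature.
    gamma, beta: affine parameters (learned in practice, fixed here).
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n  # variance over the batch
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

out = batch_norm([2.0, 4.0, 6.0, 8.0])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out, out_mean, out_var)
```

The output has mean close to 0 and variance close to 1 regardless of the input scale. At inference time, implementations typically use running averages of the mean and variance instead of per-batch statistics.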
Why is data augmentation often the most effective regularizer?

Chapter 7: Generative Models

So far, networks have been discriminative: input → label. Generative models learn to create new data. They model the data distribution and can sample from it.

| Model | Key Idea |
| --- | --- |
| Autoencoders | Compress data to a bottleneck, then reconstruct. The bottleneck learns a compact representation. |
| VAEs | Like autoencoders but with a probabilistic latent space. Sample from it to generate new data. |
| GANs | Two networks compete: a generator creates fakes, a discriminator detects them. Both improve. |
| Diffusion models | Gradually add noise, then learn to reverse the process. State of the art for image generation. |

The GAN game: The generator G tries to fool the discriminator D. D tries to distinguish real from fake. At equilibrium, G produces data indistinguishable from real. This adversarial training produces remarkably sharp, realistic images — but can be unstable (mode collapse, training oscillation).
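
The two sides of the game can be written as losses. This sketch assumes the standard log-likelihood form of the GAN objective and, for the generator, the commonly used non-saturating variant; neither formula is spelled out in the text. D's outputs are passed in directly as probabilities, so no actual networks are involved.

```python
import math

def discriminator_loss(d_real, d_fake):
    """D wants d_real near 1 and d_fake near 0.

    d_real: D's probabilities on real samples; d_fake: on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Non-saturating form: G wants D to output values near 1 on its fakes."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# A confident, correct discriminator: low D loss, high G loss.
print(discriminator_loss([0.9, 0.95], [0.1, 0.05]), generator_loss([0.1, 0.05]))

# Equilibrium: D cannot tell real from fake and outputs 0.5 everywhere.
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))  # equals 2 * ln(2)
```

The two losses pull in opposite directions on the same quantity, D's output on generated samples, which is one way to see why training can oscillate rather than settle.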
Latent Space Interpolation

Drag between two points in latent space. A good generative model produces smooth transitions.
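
The interpolation itself is simple. The sketch below assumes straight-line (linear) interpolation between two made-up latent vectors; in a real system each interpolated vector would then be passed through the model's decoder or generator, which is not shown here.

```python
def lerp(z0, z1, t):
    """Linear interpolation between latent vectors: (1 - t) * z0 + t * z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

z0, z1 = [0.0, 2.0], [4.0, -2.0]  # hypothetical 2D latent codes
for t in (0.0, 0.5, 1.0):
    print(t, lerp(z0, z1, t))
```

At t = 0 the result is z0, at t = 1 it is z1, and values in between trace the straight line connecting them.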

What makes GANs different from autoencoders?

Chapter 8: Showcase — Network Playground

Let's visualize a small neural network learning to classify 2D points. Watch how the decision boundary evolves during training.

Neural Network Learning

A 2-layer network learns to separate two spiraling classes. Watch the decision boundary form.

Depth vs width: A network with only 2 hidden units can form only very simple boundaries, built from two linear cuts. With 8 or more, it can form complex curved boundaries. Going deeper (more layers) is often more parameter-efficient than going wider (more units per layer) for learning hierarchical features.

Chapter 9: Connections

Deep learning has become the default approach across all of computer vision:

| Concept | Used In |
| --- | --- |
| CNNs / feature extraction | Ch 6 (Recognition), Ch 7 (Learned features), Ch 9 (FlowNet) |
| ResNet / skip connections | Ch 6 (Detection backbones), Ch 12 (Stereo networks) |
| U-Net encoder-decoder | Ch 6 (Segmentation), Ch 12 (Depth estimation), Ch 10 (Super-resolution) |
| GANs | Ch 10 (Image synthesis), Ch 14 (Neural rendering) |
| Backpropagation | Every chapter that uses learned models |
| Batch normalization | Nearly all modern architectures |
| Vision Transformers | Ch 6 (Classification), Ch 12 (Dense prediction) |

Szeliski's perspective: "Deep learning is not a single technique but a toolkit. The combination of differentiable building blocks (convolution, pooling, attention, normalization) with gradient-based optimization on massive datasets has proven to be the most powerful approach to nearly every problem in computer vision."
Which deep learning architecture is the basis of most modern object detection and segmentation systems?