Deep Learning — Szeliski, Chapter 5

Chapter 0: Why Learning?

For decades, computer vision was dominated by hand-crafted features: edge detectors designed by humans, gradient histograms tuned by experts, and matching rules coded by hand. These worked for narrow tasks but failed to generalize.

The deep learning revolution changed everything. Instead of designing features, we learn them from data. Given millions of labeled images, a neural network discovers what patterns matter for the task — edges, textures, shapes, objects — all automatically.

The result: since 2012, deep learning has come to dominate nearly every major vision benchmark. In image classification, object detection, segmentation, depth estimation, and image generation, state-of-the-art systems are now almost all deep networks.

The paradigm shift: Hand-crafted features → learned features. Instead of telling the computer what to look for, you show it millions of examples and let it figure out what matters. Interestingly, the feature hierarchy a deep network learns turns out to resemble what neuroscientists observe in the visual cortex.

Feature Learning Hierarchy

A deep network learns increasingly abstract features. Early layers detect edges, middle layers detect textures and parts, deep layers detect whole objects.

What is the key advantage of learned features over hand-crafted features?

Learned features are always faster to compute
They automatically discover what patterns matter for the task from data, instead of requiring manual design
They use less memory

Chapter 1: Classical Machine Learning

Before deep learning, the standard pipeline was: extract hand-crafted features, then train a classifier on those features. Understanding these methods helps appreciate what deep learning replaces.

Method	Key Idea	Limitation
Nearest Neighbors	Classify by finding the most similar training example	Slow at test time, needs good distance metric
Logistic Regression	Linear decision boundary with probabilistic output	Cannot learn nonlinear patterns
SVMs	Maximum-margin linear separator, kernel trick for nonlinearity	Kernel versions scale poorly to millions of images
Decision Trees / Forests	Axis-aligned splits, ensemble for robustness	Features must be hand-designed
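
To make the first row of the table concrete, here is a minimal sketch of a 1-nearest-neighbor classifier. The function name `nearest_neighbor_predict`, the squared Euclidean distance, and the toy data are all illustrative assumptions, not something prescribed by the text.

```python
def nearest_neighbor_predict(train_points, train_labels, query):
    """Return the label of the training point closest to `query`."""
    best_label, best_dist = None, float("inf")
    for point, label in zip(train_points, train_labels):
        # Squared Euclidean distance; the choice of metric is an assumption.
        dist = sum((p - q) ** 2 for p, q in zip(point, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy 2D feature vectors, made up for illustration.
train_points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2)]
train_labels = ["a", "a", "b", "b"]
print(nearest_neighbor_predict(train_points, train_labels, (0.8, 0.9)))
```

Every prediction scans the whole training set and depends entirely on the distance function, which is where both limitations in the table come from.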

K-means and PCA: Unsupervised methods also played key roles. K-means was used to build visual vocabularies (Bag of Words). PCA reduced dimensionality (eigenfaces for face recognition). These ideas still appear inside modern systems as layers or preprocessing steps.

Decision Boundaries

Compare how different classifiers separate two classes in 2D feature space.

What was the main bottleneck of the classical feature + classifier pipeline?

The features had to be hand-designed by domain experts, limiting the system's ability to discover useful patterns from data
The classifiers were too slow
There was not enough training data

Chapter 2: Neurons and Layers

A neural network is built from simple units: neurons. Each neuron computes a weighted sum of its inputs, adds a bias, and passes the result through a nonlinear activation function:

y = σ(w^Tx + b)
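
As a rough illustration of this formula, a single neuron can be sketched in a few lines. The names `neuron` and `sigmoid` and the example numbers are assumptions chosen for the demo.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, activation=sigmoid):
    """Compute activation(w . x + b) for one neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    return activation(z)


# Here w . x + b = 0.5 - 0.5 + 0 = 0, so the sigmoid sits at its midpoint.
y = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
print(y)
```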

Stack neurons into layers, stack layers into a network. The key activation functions:

Function	Formula	Notes
Sigmoid	σ(z) = 1/(1+e^(−z))	Squashes to (0, 1). Saturates → vanishing gradients.
Tanh	tanh(z)	Squashes to (−1, 1). Zero-centered but still saturates.
ReLU	max(0, z)	No saturation for z>0. Zero gradient for z<0 can leave neurons "dead".
GELU	z · Φ(z)	Smooth ReLU (Φ is the standard normal CDF). Common default in transformers.
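
The four functions in the table can be written out directly. This is a plain-Python sketch using only the standard library; Φ is expressed through the error function, which is one standard way to compute the normal CDF.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def gelu(z):
    # Phi(z): standard normal CDF written via the error function.
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z * phi


for z in (-3.0, 0.0, 3.0):
    print(z, sigmoid(z), tanh(z), relu(z), gelu(z))
```

At z = ±3 the sigmoid and tanh outputs are already close to their limits, which is the saturation the table refers to.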

Why nonlinearity? Without activation functions, stacking linear layers collapses into a single linear transformation (matrix multiplication is associative). The nonlinearity is what gives networks their power — with enough neurons and nonlinear activations, a network can approximate any continuous function (universal approximation theorem).
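
The collapse is easy to check numerically. The sketch below uses two small made-up weight matrices without biases (an assumption to keep it short; with biases the stack is affine and still collapses) and compares two stacked linear layers against the single merged matrix.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # layer 1: 2 inputs -> 3 units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]    # layer 2: 3 units -> 2 outputs
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))      # W2 (W1 x)
one_layer = matvec(matmul(W2, W1), x)       # (W2 W1) x, a single linear map

# Inserting a ReLU between the layers breaks the equivalence.
with_relu = matvec(W2, [max(0.0, h) for h in matvec(W1, x)])

print(two_layers, one_layer, with_relu)
```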

Activation Functions

Compare activation function shapes. Notice how ReLU avoids saturation for positive inputs.

Why are activation functions essential in neural networks?

They make the network faster
Without nonlinearity, stacked linear layers collapse into a single linear transformation, unable to learn complex patterns
They prevent the weights from growing too large

Chapter 3: Backpropagation

How does a network learn? By adjusting its weights to minimize a loss function. The loss measures the gap between the network's predictions and the true labels.

L = − ∑_i y_i log(p_i)

This is the cross-entropy loss for classification, where y_i is 1 for the true class and 0 otherwise, and p_i is the predicted probability of class i. To minimize it, we need the gradient of the loss with respect to every weight in the network. Backpropagation computes these gradients efficiently using the chain rule, propagating error signals backward from the output to the input.
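
A small sketch of the loss computation follows. It assumes the probabilities come from a softmax over raw scores, which the text does not state but is the usual setup; the `eps` guard against log(0) and the example scores are also assumptions.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot, eps=1e-12):
    """L = -sum_i y_i log(p_i); eps guards against log(0)."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))


probs = softmax([2.0, 1.0, 0.1])        # made-up scores for 3 classes
loss = cross_entropy(probs, [1, 0, 0])  # true class is index 0
print(probs, loss)
```

The loss is near zero when the true class gets probability close to 1 and grows as that probability shrinks.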

The update rule (gradient descent):

w ← w − η · ∂L/∂w

where η is the learning rate. In practice, we use stochastic gradient descent (SGD) on mini-batches, and modern optimizers like Adam adapt the learning rate per parameter.
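
The update rule can be tried on a one-dimensional toy loss. The quadratic L(w) = (w − 3)² and the function name `gradient_descent` are assumptions for illustration; this is plain (full-batch) gradient descent, not SGD or Adam.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly apply w <- w - lr * dL/dw."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


# Toy loss L(w) = (w - 3)^2 with gradient 2 (w - 3); the minimum is at w = 3.
def toy_grad(w):
    return 2.0 * (w - 3.0)


w_final = gradient_descent(toy_grad, w0=0.0, lr=0.1, steps=100)
w_diverged = gradient_descent(toy_grad, w0=0.0, lr=1.1, steps=50)
print(w_final, w_diverged)
```

With a learning rate of 0.1 the iterate settles at the minimum; with 1.1 every step overshoots by more than it corrects, so the iterate moves away from it.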

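Backprop is just the chain rule. For a composition f(g(h(x))), the derivative is f'(g(h(x))) · g'(h(x)) · h'(x): each factor is evaluated at the value computed during the forward pass. Backprop applies this systematically across millions of parameters, reusing intermediate results instead of recomputing them. The backward pass costs roughly 2x the forward pass, which is remarkably cheap for computing millions of gradients simultaneously.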
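
Here is a hand-written chain rule for the smallest possible "network": one sigmoid neuron with a squared-error loss. The squared error is an assumption made to keep the derivatives short (the text uses cross-entropy), and a finite-difference estimate is included as a sanity check.

```python
import math


def forward(w, b, x, t):
    z = w * x + b                       # h: linear step
    p = 1.0 / (1.0 + math.exp(-z))      # g: sigmoid
    loss = (p - t) ** 2                 # f: squared error
    return z, p, loss


def backward(w, b, x, t):
    """Multiply local derivatives from the loss back to the parameters."""
    _, p, _ = forward(w, b, x, t)
    dloss_dp = 2.0 * (p - t)
    dp_dz = p * (1.0 - p)
    dloss_dz = dloss_dp * dp_dz
    return dloss_dz * x, dloss_dz       # dL/dw, dL/db


def numeric_grad_w(w, b, x, t, h=1e-6):
    # Central finite difference, used only to check the analytic gradient.
    return (forward(w + h, b, x, t)[2] - forward(w - h, b, x, t)[2]) / (2 * h)


dw, db = backward(0.7, -0.2, 1.5, 1.0)
print(dw, numeric_grad_w(0.7, -0.2, 1.5, 1.0))
```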

Gradient Descent

Watch gradient descent descend a loss landscape. The learning rate controls step size.

What does backpropagation compute?

The gradient of the loss with respect to every weight, using the chain rule propagated backward through the network
The forward pass predictions
The optimal learning rate

Chapter 4: Convolutional Neural Networks

A fully connected network treats every pixel as an unrelated input and ignores where it sits in the image. But images have spatial structure: nearby pixels are related. Convolutional neural networks (CNNs) exploit this with three key ideas:

Local connectivity: Each neuron connects to a small local region (receptive field), not the whole image
Weight sharing: The same filter is applied at every position, dramatically reducing parameters
Pooling: Downsample feature maps to build translation invariance and reduce computation

output(x, y) = σ(∑_{i,j,k} input(x+i, y+j, k) · filter(i, j, k) + b)

A CNN is just learned convolution (see Chapter 3 of the book) followed by nonlinearity, repeated many times. Early layers learn edge detectors. Middle layers learn texture and part detectors. Deep layers learn object-level features.
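
The convolution formula above can be sketched as nested loops for a single filter. The `image[y][x][k]` and `kernel[j][i][k]` layouts, the "valid" border handling, the use of ReLU as σ, and the hand-picked edge filter are all assumptions for illustration; in a real CNN the filter values are learned.

```python
def conv2d_single_filter(image, kernel, bias=0.0):
    """Slide one filter over image[y][x][k]; no padding, ReLU as sigma."""
    height, width = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    channels = len(kernel[0][0])
    out = []
    for y in range(height - kh + 1):
        row = []
        for x in range(width - kw + 1):
            total = bias
            for j in range(kh):
                for i in range(kw):
                    for k in range(channels):
                        # The same kernel weights are reused at every (x, y).
                        total += image[y + j][x + i][k] * kernel[j][i][k]
            row.append(max(0.0, total))
        out.append(row)
    return out


# 3x4 single-channel image with a dark-to-bright vertical edge.
image = [[[0.0], [0.0], [1.0], [1.0]] for _ in range(3)]
# 1x2 horizontal difference filter, hand-picked here rather than learned.
kernel = [[[-1.0], [1.0]]]
edges = conv2d_single_filter(image, kernel)
print(edges)
```

The output is nonzero only where the intensity jumps, at every row, using just two weights.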

The parameter savings: A 224×224 RGB image has 150,528 input values (224 × 224 × 3). A fully connected layer to 1000 neurons needs about 150 million weights. A 3×3 convolutional layer with 64 filters needs only 1,728 weights (3 × 3 × 3 × 64, ignoring biases), a reduction by a factor of roughly 87,000. Weight sharing, together with local connectivity, is the secret.
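
The arithmetic behind these counts is short enough to check directly. The helper names below are hypothetical, and biases are excluded by default to match the figures in the text.

```python
def fc_params(in_features, out_features, bias=False):
    return in_features * out_features + (out_features if bias else 0)


def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=False):
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


inputs = 224 * 224 * 3            # values in a 224x224 RGB image
fc = fc_params(inputs, 1000)      # fully connected layer to 1000 neurons
conv = conv_params(3, 3, 3, 64)   # 64 filters of size 3x3 over 3 channels
ratio = fc / conv

print(inputs, fc, conv, round(ratio))
```

The ratio comes out near 87,000, that is, on the order of 10^5. Note that the two layers do not produce outputs of the same shape, so this compares parameter counts rather than equivalent layers.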

CNN Architecture

Watch data flow through conv → relu → pool → conv → relu → pool → fully connected.

What is the key insight of convolutional layers over fully connected layers for images?

They are always faster
Weight sharing exploits spatial structure: the same filter detects the same pattern everywhere, with far fewer parameters
They do not need activation functions

Chapter 5: Architectures

The history of CNN architectures is a story of going deeper and finding ways to make depth work:

Architecture	Year	Key Innovation	Depth
LeNet-5	1998	First practical CNN (digit recognition)	5
AlexNet	2012	ReLU, dropout, GPU training. Won the ImageNet challenge (ILSVRC 2012).	8
VGGNet	2014	Small 3×3 filters stacked deep	19
GoogLeNet	2014	Inception modules (parallel filter sizes)	22
ResNet	2015	Skip connections ease gradient flow, making very deep networks trainable	152
U-Net	2015	Encoder-decoder with skip connections for dense prediction	~23

ResNet's insight: Instead of learning a mapping H(x), learn the residual F(x) = H(x) − x. Adding a skip connection (y = F(x) + x) means the network only needs to learn what to change, not the entire transformation. This allows training networks hundreds of layers deep.
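
A residual block can be sketched with plain lists. Here F is assumed to be two small fully connected layers with a ReLU in between, purely for brevity; actual ResNet blocks use convolutions and normalization, which are not shown.

```python
def linear(W, b, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu_vec(v):
    return [max(0.0, vi) for vi in v]


def residual_block(x, W1, b1, W2, b2):
    """y = F(x) + x, where F is a small two-layer transform (assumed form)."""
    h = relu_vec(linear(W1, b1, x))
    f = linear(W2, b2, h)
    return [fi + xi for fi, xi in zip(f, x)]  # the skip connection


x = [1.0, -2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]
# With all-zero weights F(x) = 0, so the block passes x through unchanged.
y = residual_block(x, zeros, [0.0] * 3, zeros, [0.0] * 3)
print(y)
```

Because an all-zero F already gives the identity, extra blocks start out harmless and only need to learn a correction to x.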

Beyond CNNs, Vision Transformers (ViT) split images into patches and process them with self-attention. They now match or exceed CNNs on large datasets, showing that the inductive bias of convolution is not strictly necessary — just helpful when data is limited.
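
The patch-splitting step is simple to sketch; the self-attention layers that process the patches are not shown. The function name `split_into_patches`, the single-channel `image[y][x]` layout, and the row-major patch order are assumptions.

```python
def split_into_patches(image, patch):
    """Cut image[y][x] into non-overlapping patch x patch tiles, each flattened."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append(
                [image[top + dy][left + dx] for dy in range(patch) for dx in range(patch)]
            )
    return patches


# 4x4 toy image whose pixel values are 0..15 in reading order.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)
print(patches)
```

Each flattened patch then plays the role of one token in the sequence that attention operates on.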

What problem do ResNet's skip connections solve?

The vanishing gradient problem: skip connections provide gradient highways, enabling training of very deep networks
Overfitting on small datasets
Slow inference speed

Chapter 6: Training Techniques

Training a deep network well requires more than just backprop. Key techniques:

Regularization:

Dropout: Randomly zero out neurons during training. Forces redundant representations. Like training an ensemble.
Weight decay: L2 regularization on weights (the same idea as in the book's Chapter 4, applied to neural nets).
Batch normalization: Normalize activations within each mini-batch. Stabilizes training, allows higher learning rates.
Data augmentation: Random crops, flips, color jitter. Cheapest regularizer — more virtual training data for free.
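
As one example from the list above, dropout can be sketched as follows. This is the "inverted" variant, which rescales surviving activations during training so that nothing needs to change at test time; that variant, the function name, and the `rng` parameter are assumptions of this sketch.

```python
import random


def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Zero each activation with probability p_drop during training."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    # Inverted dropout: divide survivors by keep so the expected value is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the demo is repeatable
out = dropout([1.0] * 10000, p_drop=0.5, rng=rng)
print(sum(out) / len(out))
```

About half of the units are zeroed on each call, yet the average activation stays close to its original value.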

Optimization:

SGD with momentum: Accumulates gradient direction over time, smoothing out noise
Adam: Adaptive per-parameter learning rates based on first and second moment estimates
Learning rate scheduling: Start high, decay over training (cosine, step, warmup)
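
The two optimizers above differ only in how they turn a gradient into a step. The sketch below shows one scalar update for each, written from their standard formulations; the function names and default hyperparameters are assumptions, and learning rate scheduling is not shown.

```python
import math


def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    v = beta * v + grad            # running accumulation of gradient direction
    return w - lr * v, v


def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Toy loss L(w) = (w - 3)^2, gradient 2 (w - 3).
w, v = 0.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, 2.0 * (w - 3.0), lr=0.1)
print(w)

w_adam, m1, v1 = adam_step(0.0, 0.0, 0.0, grad=-6.0, t=1)
print(w_adam)
```

On the very first Adam step the gradient is divided by its own magnitude, so the step size is about `lr` regardless of how large the gradient is.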

Batch normalization explained: During training, each mini-batch has slightly different statistics. BatchNorm normalizes each layer's inputs to zero mean and unit variance over the mini-batch, then applies a learned affine transform (scale and shift). It was introduced to reduce "internal covariate shift", the drift in each layer's input distribution as earlier layers change. Whatever the exact mechanism, each layer sees more stable inputs, allowing faster and more stable training.
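
The training-time computation for a single feature can be sketched like this. The names `gamma` and `beta` for the learned scale and shift, and the `eps` value, are assumptions; the running statistics used at inference time are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then apply gamma * x_hat + beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # eps keeps the division safe when the batch variance is tiny.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


out = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)
print(out, out_mean, out_var)
```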

Why is data augmentation often the most effective regularizer?

It effectively increases the training set size by creating plausible variations, teaching the model invariances for free
It makes training faster
It reduces the number of parameters

Chapter 7: Generative Models

So far, networks have been discriminative: input → label. Generative models learn to create new data. They model the data distribution and can sample from it.

Model	Key Idea
Autoencoders	Compress data to a bottleneck, then reconstruct. The bottleneck learns a compact representation.
VAEs	Like autoencoders but with a probabilistic latent space. Sample from it to generate new data.
GANs	Two networks compete: a generator creates fakes, a discriminator detects them. Both improve.
Diffusion models	Gradually add noise, then learn to reverse the process. State of the art for image generation.

The GAN game: The generator G tries to fool the discriminator D. D tries to distinguish real from fake. At equilibrium, G produces data indistinguishable from real. This adversarial training produces remarkably sharp, realistic images — but can be unstable (mode collapse, training oscillation).
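
One common way to write this game is the minimax objective of the original GAN formulation. The symbols p_data (the real data distribution) and p_z (the noise distribution fed to G) are notation introduced here, not taken from the text.

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

D(x) is the discriminator's estimate of the probability that x is real: D pushes the objective up by scoring real samples high and fakes low, while G pushes it down by making D(G(z)) large.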

Latent Space Interpolation

Drag between two points in latent space. A good generative model produces smooth transitions.
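
The interpolation itself is a one-liner. The sketch below assumes straight-line interpolation between two latent vectors and uses made-up 2D points; the decoder that would turn each interpolated vector into an image is hypothetical and not included.

```python
def lerp(z0, z1, t):
    """Point a fraction t of the way from z0 to z1 (t = 0 gives z0, t = 1 gives z1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


z_start = [0.0, 2.0]  # made-up latent vectors
z_end = [4.0, 6.0]
for step in range(5):
    t = step / 4
    print(t, lerp(z_start, z_end, t))
```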

What makes GANs different from autoencoders?

GANs are faster
GANs use adversarial training: a generator and discriminator compete, producing sharper results than reconstruction-based methods
GANs do not need training data

Chapter 8: Showcase — Network Playground

Let's visualize a small neural network learning to classify 2D points. Watch how the decision boundary evolves during training.

Neural Network Learning

A 2-layer network learns to separate two spiraling classes. Watch the decision boundary form.

Depth vs width: A network with only 2 hidden units can carve out only very simple boundaries, far too simple to separate two interleaved spirals. With 8 or more, it can form complex curved boundaries. But going deeper (more layers) is generally more parameter-efficient than going wider (more units per layer) for learning hierarchical features.

Chapter 9: Connections

Deep learning has become the default approach across all of computer vision:

Concept	Used In
CNNs / feature extraction	Ch 6 (Recognition), Ch 7 (Learned features), Ch 9 (FlowNet)
ResNet / skip connections	Ch 6 (Detection backbones), Ch 12 (Stereo networks)
U-Net encoder-decoder	Ch 6 (Segmentation), Ch 12 (Depth estimation), Ch 10 (Super-resolution)
GANs	Ch 10 (Image synthesis), Ch 14 (Neural rendering)
Backpropagation	Every chapter that uses learned models
Batch normalization	Nearly all modern architectures
Vision Transformers	Ch 6 (Classification), Ch 12 (Dense prediction)

Szeliski's perspective: "Deep learning is not a single technique but a toolkit. The combination of differentiable building blocks (convolution, pooling, attention, normalization) with gradient-based optimization on massive datasets has proven to be the most powerful approach to nearly every problem in computer vision."

Which deep learning architecture is the basis of most modern object detection and segmentation systems?

CNN-based architectures (ResNet, U-Net) used as backbones for feature extraction
Autoencoders
K-nearest neighbors