From perceptrons to ResNets: supervised learning, neural network mechanics, CNNs, and generative models.
For decades, computer vision was dominated by hand-crafted features: edge detectors designed by humans, gradient histograms tuned by experts, and matching rules coded by hand. These worked for narrow tasks but failed to generalize.
Deep learning changed this. Instead of designing features, we learn them from data. Given millions of labeled images, a neural network discovers which patterns matter for the task (edges, textures, shapes, objects) without hand-engineering.
The result: since AlexNet's ImageNet win in 2012, deep learning has come to dominate the major vision benchmarks. In image classification, object detection, segmentation, depth estimation, and image generation, the state-of-the-art systems are now deep networks.
A deep network learns increasingly abstract features. Early layers detect edges, middle layers detect textures and parts, deep layers detect whole objects.
Before deep learning, the standard pipeline was: extract hand-crafted features, then train a classifier on those features. Understanding these methods helps appreciate what deep learning replaces.
| Method | Key Idea | Limitation |
|---|---|---|
| Nearest Neighbors | Classify by finding the most similar training example | Slow at test time, needs good distance metric |
| Logistic Regression | Linear decision boundary with probabilistic output | Cannot learn nonlinear patterns |
| SVMs | Maximum-margin linear separator, kernel trick for nonlinearity | Kernel SVMs scale poorly to millions of images |
| Decision Trees / Forests | Axis-aligned splits, ensemble for robustness | Features must be hand-designed |
Compare how different classifiers separate two classes in 2D feature space.
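To make the first row of the table concrete, here is a minimal sketch of a k-nearest-neighbors classifier on 2D points. The function name `knn_predict`, the toy data, and the choice of Euclidean distance are illustrative assumptions, not something the chapter prescribes.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Compute the Euclidean distance from the query to every training point.
    # This full scan is why nearest neighbors is slow at test time.
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): class "a" near the origin, class "b" near (5, 5).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

pred_near_origin = knn_predict(points, labels, (0.5, 0.5))
pred_far = knn_predict(points, labels, (5.5, 5.5))
```

The prediction depends entirely on the distance function, which is the "needs good distance metric" limitation from the table: on raw pixels, Euclidean distance rarely reflects semantic similarity.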
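A neural network is built from simple units: neurons. Each neuron computes a weighted sum of its inputs, adds a bias, and passes the result through a nonlinear activation function:
$$z = \sum_i w_i x_i + b, \qquad a = f(z)$$
Here $x_i$ are the inputs, $w_i$ the weights, $b$ the bias, and $f$ the activation function.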
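A single neuron's computation can be sketched in a few lines of plain Python. The names `neuron` and `relu` and the example numbers are assumptions chosen for illustration.

```python
def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical example values.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.25]
b = 0.1

# Pre-activation is 0.5 - 0.5 + 0.75 + 0.1 = 0.85, which ReLU leaves unchanged.
out = neuron(x, w, b, relu)
```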
Stack neurons into layers, stack layers into a network. The key activation functions:
| Function | Formula | Notes |
|---|---|---|
| Sigmoid | σ(z) = 1/(1+e^(−z)) | Squashes to (0, 1). Saturates for large \|z\| → vanishing gradients. |
| Tanh | tanh(z) | Squashes to (−1, 1). Zero-centered but still saturates. |
| ReLU | max(0, z) | No saturation for z>0. Zero gradient for z<0, so units can "die". |
| GELU | z · Φ(z), with Φ the standard normal CDF | Smooth ReLU. Common default in transformers. |
Compare activation function shapes. Notice how ReLU avoids saturation for positive inputs.
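The four activations in the table can be written directly with Python's `math` module. This is a sketch for scalar inputs; `sigmoid_grad` and `relu_grad` are helper names introduced here to show the saturation effect numerically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def gelu(z):
    # Exact GELU: z times the standard normal CDF, written with the error function.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s). Peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

# At z = 10 the sigmoid gradient has nearly vanished, while ReLU's is still 1.
saturated = sigmoid_grad(10.0)
unsaturated = relu_grad(10.0)
```

Multiplying many small sigmoid gradients through a deep stack of layers is what drives the vanishing-gradient problem noted in the table.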
How does a network learn? By adjusting its weights to minimize a loss function. The loss measures the gap between the network's predictions and the true labels.
$$L = -\sum_{c} y_c \log \hat{y}_c$$
This is the cross-entropy loss for classification, where $y_c$ is 1 for the true class and 0 otherwise, and $\hat{y}_c$ is the predicted probability of class $c$. To minimize it, we need the gradient of the loss with respect to every weight in the network. Backpropagation computes these gradients efficiently using the chain rule, propagating error signals backward from the output to the input.
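As a small worked example, the sketch below computes the cross-entropy loss for one sample from raw scores (logits) and the gradient that backpropagation would start from. It assumes the class probabilities come from a softmax, which the text does not state explicitly; the function names and logit values are illustrative.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])

def grad_wrt_logits(probs, label):
    """Gradient of softmax + cross-entropy with respect to the logits: p - y."""
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

logits = [2.0, 1.0, 0.1]   # hypothetical network outputs for 3 classes
label = 0                  # index of the true class

probs = softmax(logits)
loss = cross_entropy(probs, label)
grad = grad_wrt_logits(probs, label)
```

The gradient is simply predicted probability minus the one-hot target, so the true class gets a negative gradient (its logit should increase) and every other class gets a positive one.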
The update rule (gradient descent):
$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$
where η is the learning rate. In practice, we use stochastic gradient descent (SGD) on mini-batches, and modern optimizers like Adam adapt the step size per parameter.
Watch gradient descent descend a loss landscape. The learning rate controls step size.
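The effect of the learning rate can be seen on a one-dimensional toy problem. The sketch below minimizes the hypothetical loss L(w) = (w − 3)², chosen only because its gradient is easy to write by hand; `gradient_descent` is an illustrative helper.

```python
def gradient_descent(grad_fn, w0, lr, steps):
    """Repeatedly step against the gradient; return every visited value of w."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * grad_fn(w)
        history.append(w)
    return history

def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)

# A small learning rate converges steadily toward the minimum.
small = gradient_descent(grad, w0=0.0, lr=0.1, steps=50)

# A learning rate that is too large overshoots further on every step and diverges.
large = gradient_descent(grad, w0=0.0, lr=1.1, steps=50)
```

With `lr=0.1` the distance to the minimum shrinks by a factor of 0.8 per step; with `lr=1.1` it grows by a factor of 1.2 per step while flipping sign.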
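A fully connected network ignores spatial layout: every pixel connects to every unit, so the network has no built-in notion that nearby pixels are related, and it needs a very large number of weights. Convolutional neural networks (CNNs) exploit spatial structure with three key ideas: local connectivity (each unit looks at a small patch), weight sharing (the same filter is applied at every position), and pooling (downsampling that adds tolerance to small shifts).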
A CNN is just learned convolution (Chapter 3) followed by nonlinearity, repeated many times. Early layers learn edge detectors. Middle layers learn texture and part detectors. Deep layers learn object-level features.
Watch data flow through conv → relu → pool → conv → relu → pool → fully connected.
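One conv → relu → pool stage can be sketched in NumPy as follows. The filter here is a hand-set vertical edge detector standing in for weights that a real CNN would learn; the function names and the 6×6 test image are assumptions for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' sliding-window filter (cross-correlation, as used in CNN libraries)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel is applied to every patch: weight sharing.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """Keep the maximum of each non-overlapping 2x2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A 6x6 image with a vertical edge: dark on the left, bright on the right.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-set vertical edge filter (a CNN would learn such filters from data).
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

response = conv2d(image, kernel)          # 4x4 map, strong where the edge is
features = max_pool2x2(relu(response))    # 2x2 map after pooling
```

The convolution output is nonzero only where the window straddles the edge, and pooling keeps that response while halving the resolution.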
The history of CNN architectures is a story of going deeper and finding ways to make depth work:
| Architecture | Year | Key Innovation | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN (digit recognition) | 5 |
| AlexNet | 2012 | ReLU, dropout, GPU training. Won ImageNet. | 8 |
| VGGNet | 2014 | Small 3×3 filters stacked deep | 19 |
| GoogLeNet | 2014 | Inception modules (parallel filter sizes) | 22 |
| ResNet | 2015 | Skip connections ease gradient flow, making very deep networks trainable | 152 |
| U-Net | 2015 | Encoder-decoder with skip connections for dense prediction | ~23 |
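The skip connection listed for ResNet amounts to adding a block's input back to its output, so the block only has to learn a residual. The sketch below uses two plain linear layers for the residual branch; real ResNets use convolutions and batch normalization there, so treat this as a simplified assumption.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(x + F(x)), where F is a small two-layer branch."""
    fx = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(x + fx)      # skip connection adds the input back

x = np.array([[1.0, 2.0, 3.0]])

# With all-zero weights the branch contributes nothing, so the block
# passes its (non-negative) input through unchanged.
identity_out = residual_block(x, np.zeros((3, 4)), np.zeros((4, 3)))

# With random weights the output keeps the input's shape, so blocks can be stacked.
rng = np.random.default_rng(0)
random_out = residual_block(x, rng.standard_normal((3, 4)), rng.standard_normal((4, 3)))
```

Because the identity path bypasses the weights, gradients can reach early layers directly, which is what makes stacks of a hundred or more layers trainable.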
Beyond CNNs, Vision Transformers (ViT) split images into patches and process them with self-attention. They now match or exceed CNNs on large datasets, showing that the inductive bias of convolution is not strictly necessary — just helpful when data is limited.
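The patch-splitting step can be sketched with NumPy reshapes. The 224×224 image and 16×16 patch size are assumptions matching a common ViT configuration; the self-attention layers that process the resulting tokens are not shown.

```python
import numpy as np

def image_to_patches(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    # Break each spatial axis into (number of patches, pixels per patch).
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two patch-grid axes to the front, then flatten each patch.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)

# Dummy image whose pixel values are just running indices.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)

# 14 x 14 = 196 patches, each flattened to 16 * 16 * 3 = 768 values.
tokens = image_to_patches(image, patch=16)
```

Each row of `tokens` plays the role of a word in a sentence: the transformer sees a sequence of 196 vectors and must learn spatial relationships from data rather than having them built in.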
Training a deep network well requires more than just backprop. Key techniques:
Regularization:
Optimization:
So far, networks have been discriminative: input → label. Generative models learn to create new data. They model the data distribution and can sample from it.
| Model | Key Idea |
|---|---|
| Autoencoders | Compress data to a bottleneck, then reconstruct. The bottleneck learns a compact representation. |
| VAEs | Like autoencoders but with a probabilistic latent space. Sample from it to generate new data. |
| GANs | Two networks compete: a generator creates fakes, a discriminator detects them. Both improve. |
| Diffusion models | Gradually add noise, then learn to reverse the process. State of the art for image generation. |
Drag between two points in latent space. A good generative model produces smooth transitions.
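Moving between two points in latent space is usually done by linear interpolation. The sketch below produces the intermediate latent vectors; in a real system each one would be passed through a trained decoder or generator, which is not included here. The function names and example vectors are illustrative.

```python
def lerp(z0, z1, t):
    """Linear interpolation between latent vectors: t=0 gives z0, t=1 gives z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def interpolate_latents(z0, z1, steps):
    """Evenly spaced latent vectors from z0 to z1, endpoints included."""
    return [lerp(z0, z1, i / (steps - 1)) for i in range(steps)]

# Two hypothetical 3-dimensional latent codes.
z_start = [0.0, 1.0, -2.0]
z_end = [4.0, 1.0, 2.0]

path = interpolate_latents(z_start, z_end, steps=5)
midpoint = path[2]
```

If the model has learned a well-structured latent space, decoding the points along `path` yields a gradual change from one output to the other rather than an abrupt jump.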
Let's visualize a small neural network learning to classify 2D points. Watch how the decision boundary evolves during training.
A 2-layer network learns to separate two spiraling classes. Watch the decision boundary form.
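The sketch below ties the chapter together: a 2-layer network (one ReLU hidden layer, sigmoid output) trained with hand-written backpropagation and full-batch gradient descent on two interleaved spirals. The spiral generator, hidden size, learning rate, and step count are all assumptions picked for a quick demonstration, not tuned values.

```python
import numpy as np

def make_spirals(n_per_class, rng):
    """Two interleaved 2D spirals, one per class, with a little noise."""
    t = np.linspace(0.0, 1.0, n_per_class)
    points, labels = [], []
    for c in range(2):
        angle = t * 3.0 * np.pi + c * np.pi   # second spiral is rotated by 180 degrees
        xy = np.stack([t * np.cos(angle), t * np.sin(angle)], axis=1)
        xy += 0.02 * rng.standard_normal(xy.shape)
        points.append(xy)
        labels.append(np.full(n_per_class, c))
    return np.concatenate(points), np.concatenate(labels)

def forward(x, params):
    w1, b1, w2, b2 = params
    h_pre = x @ w1 + b1
    h = np.maximum(h_pre, 0.0)                 # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))   # sigmoid output = P(class 1)
    return h_pre, h, p

def train(x, y, hidden=32, lr=0.5, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    params = [rng.standard_normal((2, hidden)) * 0.5, np.zeros(hidden),
              rng.standard_normal((hidden, 1)) * 0.5, np.zeros(1)]
    target = y.reshape(-1, 1).astype(float)
    losses = []
    for _ in range(steps):
        h_pre, h, p = forward(x, params)
        # Binary cross-entropy, averaged over the batch.
        loss = -np.mean(target * np.log(p + 1e-12)
                        + (1.0 - target) * np.log(1.0 - p + 1e-12))
        losses.append(loss)
        # Backpropagation: apply the chain rule from the output back to the input.
        d_logits = (p - target) / len(x)
        d_w2 = h.T @ d_logits
        d_b2 = d_logits.sum(axis=0)
        d_h_pre = (d_logits @ params[2].T) * (h_pre > 0)
        d_w1 = x.T @ d_h_pre
        d_b1 = d_h_pre.sum(axis=0)
        # Gradient descent update for every parameter.
        for param, g in zip(params, [d_w1, d_b1, d_w2, d_b2]):
            param -= lr * g
    return params, losses

x, y = make_spirals(100, np.random.default_rng(0))
params, losses = train(x, y)
predictions = (forward(x, params)[2][:, 0] > 0.5).astype(int)
accuracy = float(np.mean(predictions == y))
```

Evaluating `forward` on a grid of 2D points and thresholding at 0.5 would trace out the decision boundary; plotting it every few hundred steps shows the boundary bending to follow the spirals.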
Deep learning has become the default approach across all of computer vision:
| Concept | Used In |
|---|---|
| CNNs / feature extraction | Ch 6 (Recognition), Ch 7 (Learned features), Ch 9 (FlowNet) |
| ResNet / skip connections | Ch 6 (Detection backbones), Ch 12 (Stereo networks) |
| U-Net encoder-decoder | Ch 6 (Segmentation), Ch 12 (Depth estimation), Ch 10 (Super-resolution) |
| GANs | Ch 10 (Image synthesis), Ch 14 (Neural rendering) |
| Backpropagation | Every chapter that uses learned models |
| Batch normalization | Most modern CNN architectures |
| Vision Transformers | Ch 6 (Classification), Ch 12 (Dense prediction) |