Computer vision, speech recognition, natural language processing, recommender systems, and the practical deployment of deep learning at scale.
The practical success of deep learning rests on three pillars: algorithms, data, and compute. This chapter focuses on the third — the engineering required to train and deploy models at the scale where they become useful.
GPU computing transformed deep learning. GPUs execute thousands of arithmetic operations in parallel, which makes them roughly 10-100x faster than CPUs for the dense linear algebra that dominates neural network training. Modern training spreads the work over many GPUs on many machines, using data parallelism and model parallelism (including its pipeline and tensor variants).
Data parallelism: Every GPU holds a full copy of the model. Split the mini-batch across N GPUs, let each GPU compute gradients on its portion, average the gradients across GPUs (an all-reduce), and apply the same update to every copy. Throughput scales almost linearly with the number of GPUs until communication overhead dominates.
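The averaging step can be sketched on a toy problem. The following is a minimal illustration, not a real distributed implementation: the "workers" run sequentially in one process, the model is a single weight `w` fitting `y = w * x`, and the function names are made up for this example. It assumes the batch divides evenly across workers, in which case the average of the per-worker gradients equals the full-batch gradient.

```python
# Toy data-parallel SGD step for a 1-D linear model y = w * x with squared loss.
# Workers are simulated sequentially; real systems run them on separate GPUs.

def shard_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_workers, lr):
    """Split the batch evenly, compute per-worker gradients, average, update.

    Assumes len(batch) is divisible by num_workers.
    """
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    grads = [shard_gradient(w, s) for s in shards]  # computed in parallel in practice
    avg_grad = sum(grads) / num_workers             # stands in for an all-reduce
    return w - lr * avg_grad                        # same update on every replica

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w_parallel = data_parallel_step(0.0, batch, num_workers=2, lr=0.1)
w_single = data_parallel_step(0.0, batch, num_workers=1, lr=0.1)
```

Here `w_parallel` and `w_single` agree, which is the point: splitting the batch changes where the work happens, not the update that results.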
Model parallelism: When the model is too large to fit on one GPU, split it across devices, either by placing different layers on different GPUs (pipeline parallelism) or by splitting individual layers across GPUs (tensor parallelism). This is essential for training models with hundreds of billions of parameters.
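Tensor parallelism can be illustrated with a single linear layer whose weight matrix is split column-wise. This is only a sketch under simplifying assumptions: the two "devices" are plain Python lists, the helper names are hypothetical, and the column count is assumed to divide evenly. Each device computes its slice of the output, and concatenating the slices reproduces the full result.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per simulated device."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, parts=2)                  # each device stores half the columns
partial_outputs = [matmul(x, s) for s in shards]    # computed independently per device
gathered = [v for part in partial_outputs for v in part]  # concatenate the slices
full = matmul(x, w)                                 # single-device reference
```

Each shard holds half of the layer's parameters, so the memory needed per device drops accordingly, at the cost of a gather step to reassemble the output.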
Computer vision was among the first domains where deep learning matched and then exceeded human-level accuracy on a standard benchmark. The path from LeNet to modern vision systems illustrates the power of architectural innovation and scaling.
Image classification was the proving ground. AlexNet (2012) cut the ImageNet top-5 error from about 26% to about 15%, and later architectures kept pushing it down: VGG (7.3%) and GoogLeNet (6.7%) in 2014, ResNet (3.6%) in 2015, and EfficientNet (2.9%) in 2019. Human top-5 error on this benchmark is estimated at roughly 5%, so ResNet had already surpassed it in 2015.
Object detection adds localization: not just "this image contains a cat" but "the cat is in this bounding box." Key architectures:
• R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN): Two-stage approach. First propose candidate regions, then classify each one. Accurate but slower.
• YOLO (You Only Look Once): Single-stage approach. One forward pass predicts all bounding boxes and classes simultaneously. Fast enough for real-time video.
• DETR (DEtection TRansformer): Treats detection as a set prediction problem using transformers. Elegant, competitive, no hand-designed components like anchors or NMS.
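The NMS (non-maximum suppression) step that DETR avoids is easy to show in miniature. The sketch below is a plain greedy version with made-up boxes and scores, assuming boxes are given as `(x1, y1, x2, y2)` corners; production detectors use vectorized and often class-aware variants.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of one object, plus one separate object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the boxes that survive
```

The first two boxes overlap heavily, so only the higher-scoring one is kept, while the distant third box is unaffected.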
Modern vision: Vision Transformers (ViT) apply self-attention to image patches, competing with CNNs. CLIP trains on (image, text) pairs and can classify images using natural language descriptions. SAM (Segment Anything) segments any object with a single click. Foundation models now dominate vision too.
The main computer vision tasks and how they relate. Each adds more spatial understanding.
Automatic speech recognition (ASR) converts audio waveforms to text. Before deep learning, this required elaborate pipelines: feature extraction (MFCCs), acoustic modeling (GMM-HMMs), language modeling (n-grams), and decoding (Viterbi). Each component was separately designed and optimized.
Deep learning simplified this dramatically:
CTC (Connectionist Temporal Classification): Graves et al. (2006) introduced CTC, which lets an RNN predict character sequences from variable-length audio without explicit alignment between audio frames and characters. The CTC loss marginalizes over all possible alignments, solving the sequence-to-sequence alignment problem.
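To make "marginalizes over all possible alignments" concrete, here is a brute-force sketch. It enumerates every frame-level alignment, collapses each one with the CTC rule (merge repeats, then drop blanks), and sums the probabilities of those that yield the target. The symbol set, the uniform frame probabilities, and the function names are assumptions for illustration; real implementations compute the same sum with dynamic programming rather than enumeration, since the number of alignments grows exponentially with the number of frames.

```python
from itertools import product

BLANK = "_"

def collapse(alignment):
    """CTC collapse: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for s in alignment:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)

def ctc_probability(frame_probs, target):
    """Sum the probability of every alignment that collapses to `target`.

    frame_probs: one dict per audio frame, mapping symbol -> probability.
    """
    symbols = list(frame_probs[0])
    total = 0.0
    for alignment in product(symbols, repeat=len(frame_probs)):
        if collapse(alignment) == target:
            p = 1.0
            for t, s in enumerate(alignment):
                p *= frame_probs[t][s]
            total += p
    return total

# Three frames, each uniform over {a, b, blank}.
uniform = {"a": 1 / 3, "b": 1 / 3, BLANK: 1 / 3}
frame_probs = [uniform, uniform, uniform]
p_ab = ctc_probability(frame_probs, "ab")
```

For this toy input, five alignments collapse to "ab" (`aab`, `abb`, `ab_`, `a_b`, `_ab`), so `p_ab` is 5/27. The negative log of this quantity is what the CTC loss minimizes.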
End-to-end models: DeepSpeech (Hannun et al., 2014) showed that a deep RNN trained with CTC on spectrograms could match traditional ASR systems. The entire pipeline collapsed into one neural network. Later, attention-based encoder-decoder models (Listen, Attend and Spell) further improved performance.
Text-to-speech (TTS) went through a similar revolution. WaveNet (2016) generated raw audio waveforms using dilated causal convolutions. Tacotron (2017) used an encoder-decoder to convert text to spectrograms. Modern systems like VALL-E can clone a voice from 3 seconds of audio.
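The dilated causal convolutions mentioned for WaveNet can be sketched in a few lines. This is an illustrative toy, not WaveNet itself: the kernel is a fixed `[1, 1]` rather than learned weights, there are no nonlinearities, and the function names are invented here. It shows the two properties that matter: each output depends only on current and past samples, and stacking layers with dilations 1, 2, 4 grows the receptive field exponentially.

```python
def causal_dilated_conv(signal, kernel, dilation):
    """1-D causal convolution: output[t] uses signal[t], signal[t - d], signal[t - 2d], ...

    Positions before the start of the signal are treated as zeros.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Push a unit impulse through three layers with dilations 1, 2, 4.
impulse = [1.0] + [0.0] * 7
h = impulse
for d in (1, 2, 4):
    h = causal_dilated_conv(h, [1.0, 1.0], d)

field = receptive_field(2, (1, 2, 4))
```

After three layers the impulse at position 0 has reached all eight positions at or after it and none before it, matching a receptive field of 8 samples from only three two-tap layers.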
Deep learning has transformed NLP more dramatically than perhaps any other field. Every component of the traditional NLP pipeline (tokenization, parsing, entity recognition, sentiment analysis, translation) has been replaced or vastly improved by neural networks.
Word embeddings (Word2Vec, GloVe) were the first breakthrough: mapping words to dense vectors where similar words are nearby. "King − Man + Woman ≈ Queen." These learned representations replaced hand-crafted features and became the input layer for all subsequent NLP models.
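The analogy arithmetic can be reproduced with toy vectors. The embeddings below are hand-picked 3-D values invented for this example (real Word2Vec or GloVe vectors are learned and have hundreds of dimensions), so the sketch only demonstrates the mechanics: subtract, add, then find the nearest remaining word by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; axes loosely mean (royalty, maleness, femaleness).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def analogy(a, b, c):
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, emb[w]))

result = analogy("king", "man", "woman")
```

With these vectors `result` is `"queen"`. Excluding the three input words from the candidates is the usual convention, because the nearest neighbor of the target is otherwise often one of the inputs.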
Sequence models: BiLSTMs became the dominant architecture for NLP tasks (2014-2018). The encoder read the text, the hidden states encoded contextual meaning, and a task-specific head made predictions. For translation, encoder-decoder LSTMs with attention (Bahdanau et al., 2014) were the standard.
Pretrain-then-finetune: BERT (2018) pretrained a transformer encoder on unlabeled text using masked language modeling, then fine-tuned on downstream tasks. It set new state-of-the-art results on eleven NLP tasks at once. GPT (2018) followed the same pretrain-then-finetune recipe with autoregressive pretraining (predict the next token). GPT-3 (2020) showed that very large models can perform tasks without fine-tuning (few-shot and zero-shot learning).
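The two pretraining objectives differ mainly in how training examples are carved out of raw text. The sketch below builds both kinds of example from one sentence. It is a simplification: tokens are whitespace-split words, the helper names are hypothetical, and the masking always substitutes `[MASK]`, whereas BERT's actual recipe also sometimes keeps the token or swaps in a random one. The 15% masking rate is the one used in the BERT paper.

```python
import random

MASK = "[MASK]"

def masked_lm_example(tokens, mask_prob, rng):
    """Masked LM: hide random tokens; the model must recover the hidden ones."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)    # loss is computed at this position
        else:
            inputs.append(tok)
            labels.append(None)   # position ignored by the loss
    return inputs, labels

def next_token_examples(tokens):
    """Autoregressive LM: one (prefix, next token) pair per position."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()
inputs, labels = masked_lm_example(tokens, mask_prob=0.15, rng=random.Random(0))
pairs = next_token_examples(tokens)
```

The masked objective lets the encoder see context on both sides of a hidden token, while the autoregressive objective only ever sees the prefix, which is what makes it directly usable for generation.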
Large Language Models (LLMs): GPT-4, Claude, Gemini, and LLaMA represent the current frontier. Trained on trillions of tokens, they exhibit emergent capabilities: reasoning, code generation, tool use, and following complex instructions. They have become the foundation for a new paradigm of AI applications.
The timeline of NLP architectures from word embeddings to large language models.
Recommender systems predict which items (movies, products, songs, ads) a user will prefer. They are the revenue engine behind Netflix, Amazon, YouTube, and Spotify. Deep learning has transformed them from simple matrix factorization to complex multi-modal systems.
Collaborative filtering predicts preferences based on similar users or items. Matrix factorization embeds users and items into a shared low-dimensional space; the dot product of user and item embeddings predicts the rating. Neural extensions (NCF, DeepFM) replace the dot product with a neural network that can learn more complex interactions.
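A minimal matrix factorization can be trained with plain stochastic gradient descent. The sketch below is one simple way to do it, assuming a tiny made-up rating table, a squared-error loss with L2 regularization, and hyperparameters chosen only so the toy example converges; none of these values come from a specific production system.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_mf(ratings, num_users, num_items, dim=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn user and item embeddings so that dot(U[u], V[i]) approximates the rating."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(U[u], V[i])
            for k in range(dim):
                uk, vk = U[u][k], V[i][k]
                U[u][k] += lr * (err * vk - reg * uk)  # gradient step on the user factor
                V[i][k] += lr * (err * uk - reg * vk)  # gradient step on the item factor
    return U, V

def mse(U, V, ratings):
    """Mean squared error over the observed ratings."""
    return sum((r - dot(U[u], V[i])) ** 2 for u, i, r in ratings) / len(ratings)

# (user, item, rating) triples; most user-item pairs are unobserved.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
U, V = train_mf(ratings, num_users=3, num_items=3)
guess = dot(U[0], V[2])  # predicted rating for a pair that was never observed
```

The value of the approach is the last line: once the embeddings are learned, any user-item pair gets a score, including pairs with no recorded interaction.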
Deep learning for recommendations:
• Two-tower models: One neural network encodes the user (history, demographics), another encodes the item (features, description). The dot product of the two embeddings scores relevance. Used at Google, YouTube, and most large-scale systems.
• Sequential models: Treat the user's interaction history as a sequence. Use transformers (SASRec, BERT4Rec) to predict the next item. Captures temporal patterns in user behavior.
• Large-scale challenges: With millions of items, scoring every item for every user is infeasible. The solution: a retrieval stage (fast approximate nearest neighbor search over embeddings) followed by a ranking stage (more expensive but precise neural scorer).
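The retrieve-then-rank pattern from the last bullet can be sketched as two functions. This is a simplified illustration with hypothetical item vectors and a made-up "freshness" feature. Note one deliberate substitution: the retrieval stage here scores every item exhaustively, where a real system would use an approximate nearest neighbor index.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(user_vec, item_vecs, k):
    """Stage 1: cheap dot-product scoring, keep the k best item ids.

    Exhaustive for clarity; an approximate nearest neighbor index would replace this.
    """
    return heapq.nlargest(k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id]))

def rank(user_vec, candidates, item_vecs, scorer):
    """Stage 2: apply a more expensive scorer to the few retrieved candidates."""
    return sorted(candidates,
                  key=lambda item_id: scorer(user_vec, item_vecs[item_id], item_id),
                  reverse=True)

# Hypothetical item embeddings and an extra feature only the ranker looks at.
item_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
freshness = {"a": 0.0, "b": 0.5, "c": 0.0, "d": 0.0}

def precise_scorer(user_vec, item_vec, item_id):
    """Stand-in for a neural ranker: embedding match plus a freshness bonus."""
    return dot(user_vec, item_vec) + freshness[item_id]

user = [1.0, 0.2]
candidates = retrieve(user, item_vecs, k=2)
ranked = rank(user, candidates, item_vecs, precise_scorer)
```

Retrieval narrows four items to `["a", "b"]`, and the ranker then reorders them to `["b", "a"]` using information the cheap stage never saw. The expensive scorer runs on two items instead of four, which is the whole point when the catalog has millions of entries.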
Many real-world tasks require outputs with complex structure: parse trees, molecular graphs, game moves, robot trajectories. Deep learning handles these through careful output representation and loss design.
Sequence-to-sequence: Translation, summarization, and code generation all produce structured text. Encoder-decoder transformers generate one token at a time, conditioned on the input and all previously generated tokens. Beam search (maintaining the top-k partial sequences) improves output quality over greedy decoding.
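Beam search is short enough to sketch in full. The version below is a minimal one, assuming a toy `toy_model` lookup table in place of a real decoder and omitting length normalization, which practical systems usually add so that longer hypotheses are not unfairly penalized. The table is built so that the greedy first choice leads to a worse overall sequence.

```python
import math

EOS = "</s>"

def beam_search(next_probs, beam_width, max_len):
    """Keep the `beam_width` most probable partial sequences at every step.

    next_probs(prefix) returns a dict token -> probability (all > 0).
    Returns (tokens, log_probability) of the best hypothesis found.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for tok, p in next_probs(tokens).items():
                candidates.append((tokens + (tok,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((tokens, logp))   # hypothesis is complete
            else:
                beams.append((tokens, logp))      # keep expanding
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

def toy_model(prefix):
    """Hypothetical next-token distribution, keyed by the tokens generated so far."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    return table.get(prefix, {EOS: 1.0})

greedy = beam_search(toy_model, beam_width=1, max_len=5)
best = beam_search(toy_model, beam_width=2, max_len=5)
```

Greedy decoding (beam width 1) commits to "a" and ends with probability 0.3, while a beam of width 2 keeps "b" alive long enough to find "b z" with probability 0.36.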
Structured prediction: Named entity recognition tags each word (Person, Organization, Location, None). A BiLSTM or transformer produces per-position features, and a conditional random field (CRF) layer ensures the output tags form valid sequences (e.g., I-PER may only follow B-PER or I-PER, never B-LOC).
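The effect of the CRF layer at decoding time can be shown with Viterbi search over tag sequences. The sketch below is a simplification: a trained CRF learns real-valued transition scores, whereas here transitions are simply allowed or forbidden by the BIO rule (an I- tag may only follow a B- or I- tag of the same type), and the emission scores are made-up numbers standing in for the encoder's outputs.

```python
NEG_INF = float("-inf")
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

def allowed(prev, cur):
    """BIO constraint: I-X may only follow B-X or I-X."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True

def viterbi(emissions):
    """Return the highest-scoring tag sequence that respects the BIO constraint.

    emissions: one dict per token, mapping tag -> score.
    """
    # A sequence may not start with an I- tag.
    best = {t: (NEG_INF if t.startswith("I-") else emissions[0][t]) for t in TAGS}
    back = []
    for em in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in TAGS:
            prev = max((p for p in TAGS if allowed(p, cur)), key=lambda p: best[p])
            new_best[cur] = best[prev] + em[cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    tag = max(TAGS, key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(back):   # follow back-pointers to recover the path
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

# Hypothetical scores for a two-token name: the first token looks slightly more
# like a location, the second strongly like the inside of a person name.
emissions = [
    {"O": 0.0, "B-PER": 1.5, "I-PER": 0.0, "B-LOC": 2.0, "I-LOC": 0.0},
    {"O": 0.0, "B-PER": 0.0, "I-PER": 3.0, "B-LOC": 0.0, "I-LOC": 0.0},
]
independent = [max(TAGS, key=lambda t: em[t]) for em in emissions]
decoded = viterbi(emissions)
```

Picking the best tag at each position independently yields the invalid pair `B-LOC, I-PER`. Decoding jointly revises the first tag to `B-PER`, because the strong evidence on the second token outweighs the small preference on the first.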
Graph neural networks (GNNs) operate on graph-structured data: molecules, social networks, knowledge graphs. They propagate information along edges using message passing: each node aggregates features from its neighbors, applies a transformation, and updates its representation. Used for drug discovery, chip design, and social network analysis.
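One round of message passing can be written directly from that description. The sketch below assumes an undirected graph, mean aggregation, and a fixed mixing weight in place of the learned transformation a real GNN layer would apply; the graph and feature values are invented for the example.

```python
def message_passing_round(features, edges, self_weight=0.5):
    """One round: each node averages its neighbors' features, then mixes in its own.

    features: dict node -> list of floats. edges: list of undirected (a, b) pairs.
    A real GNN layer would apply learned weights and a nonlinearity here.
    """
    neighbors = {n: [] for n in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = {}
    for node, feat in features.items():
        if neighbors[node]:
            agg = [sum(features[m][k] for m in neighbors[node]) / len(neighbors[node])
                   for k in range(len(feat))]
        else:
            agg = list(feat)  # isolated node: nothing to aggregate
        updated[node] = [self_weight * f + (1 - self_weight) * a
                         for f, a in zip(feat, agg)]
    return updated

# Path graph A - B - C with a signal that starts only on A.
features = {"A": [1.0], "B": [0.0], "C": [0.0]}
edges = [("A", "B"), ("B", "C")]
h1 = message_passing_round(features, edges)
h2 = message_passing_round(h1, edges)
```

After one round, node C still has a zero feature because A is two hops away. After the second round it does not, which illustrates why a k-layer GNN lets each node see its k-hop neighborhood.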
Reinforcement learning + deep learning: Deep RL produces structured action sequences. AlphaGo combined deep CNNs for policy and value estimation with Monte Carlo tree search to play Go. Structured outputs also arise outside RL: AlphaFold, which is not an RL system, used transformers and equivariant networks to predict 3D protein structures, one of the most impactful applications of deep learning.
Training a model is only half the job. Deploying it reliably, efficiently, and safely in production is the other half — and often the harder half.
Model compression: Large models are expensive to serve. Techniques to reduce cost:
• Quantization: Reduce weight precision from FP32 to INT8 or INT4. INT8 typically costs < 1% accuracy and gives a 2-4x speedup with 4x smaller weights; INT4 shrinks weights 8x but usually needs more care to preserve accuracy.
• Pruning: Remove weights close to zero. Structured pruning (removing entire filters or attention heads) gives real speedups; unstructured pruning gives theoretical compression but limited hardware acceleration.
• Knowledge distillation: Train a small "student" model to mimic a large "teacher" model. The student learns from the teacher's soft probability outputs, which contain more information than hard labels. DistilBERT retains about 97% of BERT's performance at about 60% of its size.
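Of the three techniques above, quantization is the easiest to show numerically. The sketch below uses symmetric per-tensor INT8 quantization, which is one common scheme among several; the weight values are made up, and the code assumes at least one weight is nonzero so the scale is well defined.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Assumes at least one weight is nonzero.
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the shared scale."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.55]   # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is now stored as one byte plus a single shared scale, and the rounding error per weight is at most half a quantization step (`scale / 2`). Note how the smallest weight, 0.003, rounds to zero: values far below the largest magnitude lose the most relative precision, which is one reason finer-grained (per-channel) scales are often used in practice.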
Monitoring and maintenance: Models degrade over time as the data distribution shifts: a spam classifier trained in 2020 will miss many 2024 spam tactics. Monitoring tracks prediction distributions, confidence, and performance metrics in production. Periodic retraining on fresh data is necessary. A/B testing compares new models against the production model before full rollout.
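One simple way to track a prediction distribution is to compare a histogram of recent model scores against a reference histogram from validation time. The sketch below uses the population stability index (PSI) for that comparison. PSI is a choice made for this example rather than something the text prescribes, and the scores, bin edges, and the 0.25 alert threshold (a commonly quoted rule of thumb) are all assumptions.

```python
import math

def histogram(values, edges):
    """Fraction of values in each bin [edges[i], edges[i+1]); the last bin is closed.

    Assumes every value lies within [edges[0], edges[-1]].
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]

def psi(reference, live, edges, eps=1e-6):
    """Population stability index between two samples of model scores.

    eps keeps the logarithm finite when a bin is empty.
    """
    p = histogram(reference, edges)
    q = histogram(live, edges)
    return sum((qi - pi) * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]   # scores at validation time
live_scores = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.95, 0.85]      # scores seen in production

drift = psi(reference_scores, live_scores, edges)
ALERT_THRESHOLD = 0.25             # hypothetical alerting level
needs_review = drift > ALERT_THRESHOLD
```

A check like this needs no labels, so it can run continuously; a sustained alert is a signal to investigate and possibly retrain, not proof that accuracy has dropped.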
Safety and fairness: Models can encode biases present in training data. Facial recognition systems perform worse on certain demographic groups. Language models can generate harmful content. Responsible deployment requires bias auditing, adversarial testing, content filtering, and clear documentation of model limitations.
The history of deep learning applications is a story of scaling and transfer.
This is the final chapter of Part II. Here is how everything connects:
| Domain | Architecture | Key Chapters |
|---|---|---|
| Image Classification | CNN (ResNet, EfficientNet, ViT) | Ch 6, 9 |
| Object Detection | CNN + detection head (YOLO, DETR) | Ch 9 |
| Speech Recognition | RNN/Transformer + CTC | Ch 10 |
| Machine Translation | Encoder-Decoder Transformer | Ch 10 |
| Language Modeling | Decoder-only Transformer (GPT) | Ch 10 |
| Recommendations | Two-tower embeddings + ranking | Ch 6, 8 |
| Image Generation | Diffusion models, GANs | Ch 6, 8 |
| Protein Structure | Equivariant transformers (AlphaFold) | Ch 6, 9, 10 |
Congratulations: you have completed Part II of Goodfellow et al.'s Deep Learning. You now have the conceptual foundation to understand any modern deep learning system. The architectures change, but the principles endure: learn representations from data, optimize with gradients, regularize for generalization, and scale.