Goodfellow et al., Chapter 11

Practical Methodology

Performance metrics, debugging, hyperparameter tuning, and the systematic approach to building deep learning systems that actually work.

Prerequisites: Chapters 6-8 (networks, regularization, optimization).
9 chapters, 3+ simulations, 9 quizzes

Chapter 0: The ML Pipeline

Building a deep learning system is not just about choosing the right architecture. It is a systematic engineering process with many decisions that interact in subtle ways. Most time is spent not on modeling but on data preparation, debugging, and evaluation.

The practical reality: Research papers present clean results. In practice, 80% of the work is data cleaning, pipeline engineering, debugging mysterious failures, and hyperparameter tuning. The other 20% is architecture choice. This chapter is about the 80%.

1. Define the Problem. What metric matters? What is the minimum acceptable performance? What are the constraints (latency, memory, cost)?
2. Establish Baselines. Start with the simplest model that could work. Random baseline, linear model, then a small neural network.
3. Iterate. Diagnose: is the problem underfitting or overfitting? Act: add capacity or regularization. Measure. Repeat.
4. Tune & Deploy. Hyperparameter search, final evaluation on held-out test set, deployment considerations.

The most important principle: never skip the baseline. If a linear regression achieves 90% of the performance of your 100-layer deep network, the extra complexity is hard to justify unless the remaining 10% matters for your application. Start simple. Increase complexity only when you have evidence it helps.

What is the first thing you should do before building a complex deep learning model?

Chapter 1: Performance Metrics

Choosing the right metric is the most consequential decision in a project. The wrong metric means the model optimizes for the wrong thing, and no amount of architecture tuning can fix that.

Accuracy is the simplest metric but dangerously misleading for imbalanced data. If 99% of emails are not spam, a model that always predicts "not spam" achieves 99% accuracy while being completely useless.

Precision = TP / (TP + FP) — of all positive predictions, how many are correct? Important when false positives are costly (e.g., spam filter blocking important emails).

Recall = TP / (TP + FN) — of all actual positives, how many did we catch? Important when false negatives are costly (e.g., cancer screening missing a tumor).

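The precision-recall tradeoff: You can always increase recall by predicting "positive" more aggressively (lowering the decision threshold), but this usually lowers precision. The F1 score = 2 · (P · R) / (P + R), the harmonic mean of precision (P) and recall (R), balances the two. To compare models across all thresholds rather than at a single operating point, plot the PR curve or ROC curve and measure the area under the curve (AUC).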
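The definitions above can be checked on a small example. The sketch below is illustrative only: the labels and scores are made-up toy data, the function names are hypothetical, and reporting 0.0 for an undefined ratio is an assumed convention.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN, TN when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(labels, scores, threshold):
    """Precision, recall, and F1 at one decision threshold."""
    tp, fp, fn, _ = confusion_counts(labels, scores, threshold)
    # Assumed convention: report 0.0 when a ratio is undefined (empty denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up toy data: 1 = positive class, scores are model confidences.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.7, 0.5, 0.25):
    p, r, f1 = precision_recall_f1(labels, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```

On this toy data, lowering the threshold from 0.7 to 0.25 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67.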

For regression, MSE (mean squared error) penalizes large errors quadratically, so a few outliers can dominate it, while MAE (mean absolute error) penalizes each error in proportion to its size. For ranking tasks, use NDCG or MAP. For generation tasks, metrics like BLEU (translation), ROUGE (summarization), or perplexity (language modeling) capture domain-specific quality.
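A small numeric comparison makes the MSE/MAE difference concrete. This is a sketch with made-up numbers: both prediction sets have the same total absolute error, but one concentrates it in a single large miss.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and two hypothetical sets of predictions.
y_true = [10.0, 10.0, 10.0, 10.0]
small_errors = [11.0, 9.0, 11.0, 9.0]   # four errors of size 1
one_outlier = [10.0, 10.0, 10.0, 14.0]  # a single error of size 4

print("MAE:", mae(y_true, small_errors), mae(y_true, one_outlier))  # 1.0 and 1.0
print("MSE:", mse(y_true, small_errors), mse(y_true, one_outlier))  # 1.0 and 4.0
```

MAE cannot tell the two cases apart, while MSE is four times larger for the single big miss. Which behavior is preferable depends on whether large errors are disproportionately costly in the application.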

Precision vs Recall Tradeoff

Adjust the classification threshold. Lower threshold increases recall but decreases precision.

Threshold: 0.50
Why is accuracy a poor metric for imbalanced datasets?

Chapter 2: Baselines

A baseline is the simplest model that gives a meaningful result. It establishes the performance floor and tells you whether your problem is easy or hard.

Build baselines in order of increasing complexity:

Baseline: Purpose
Random: Predicts at random. Establishes absolute floor. If your model barely beats random, something is broken.
Constant: Always predicts the most common class or the mean. Reveals class imbalance issues.
Linear model: Logistic regression / linear regression. Shows how much signal is in the features without nonlinearity.
Simple NN: 2-layer MLP. Shows the benefit of nonlinearity. If this barely beats linear, deep networks may not help.
Known architecture: Apply a standard architecture for the domain (ResNet for images, BERT for text). Establishes what "good" looks like.

Copy a working pipeline: If possible, start by reproducing a known result from a paper or benchmark on your data. This verifies your data pipeline, training loop, and evaluation code are correct. Most bugs hide in these "boring" parts, not in the model architecture.
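The two simplest baselines in the list, random and constant, need no training at all, so they can be computed before any model exists. A minimal sketch, assuming a classification task with labels stored in plain lists; the function names and the spam/ham data are hypothetical.

```python
import random
from collections import Counter


def constant_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


def random_baseline_accuracy(train_labels, test_labels, seed=0):
    """Accuracy of guessing uniformly among the classes seen in training."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    hits = sum(1 for y in test_labels if rng.choice(classes) == y)
    return hits / len(test_labels)


# Made-up imbalanced labels for illustration.
train_labels = ["ham"] * 95 + ["spam"] * 5
test_labels = ["ham"] * 90 + ["spam"] * 10

print("constant baseline:", constant_baseline_accuracy(train_labels, test_labels))
print("random baseline:  ", random_baseline_accuracy(train_labels, test_labels))
```

Here the constant baseline already reaches 0.9 accuracy, which is the number any trained model on this data has to beat before its accuracy means anything.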

Sanity checks before real training: (1) Can the model overfit a single batch? If not, there is a bug. (2) Does training loss decrease? If not, the learning rate is wrong or gradients are broken. (3) Does increasing model size lower the training loss? If not, capacity is not the bottleneck; look at the optimizer, the data, or the loss instead.
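Check (1) can be sketched in a few lines. The example below uses a tiny logistic regression trained by plain gradient descent as a stand-in for the real network; that choice, the four-example batch, and the step count and learning rate are all assumptions made for illustration. In a real project the same loop would run the actual model and framework on one fixed mini-batch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def batch_loss(weights, bias, batch):
    """Mean logistic loss over the batch."""
    total = 0.0
    for features, label in batch:
        z = sum(w * x for w, x in zip(weights, features)) + bias
        margin = z if label == 1 else -z
        total += math.log1p(math.exp(-margin))
    return total / len(batch)


def overfit_one_batch(batch, steps=2000, lr=1.0):
    """Run gradient descent on one fixed batch and return the loss after each step."""
    n_features = len(batch[0][0])
    weights, bias = [0.0] * n_features, 0.0
    losses = []
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in batch:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label  # derivative of the logistic loss w.r.t. z
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(batch) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(batch)
        losses.append(batch_loss(weights, bias, batch))
    return losses


# Made-up, linearly separable batch: (features, label).
batch = [
    ((1.0, -0.5), 1),
    ((0.5, 1.0), 1),
    ((-1.0, 0.5), 0),
    ((-0.5, -1.0), 0),
]

losses = overfit_one_batch(batch)
print(f"loss after first step: {losses[0]:.3f}, final loss: {losses[-1]:.4f}")
```

If the final loss does not approach zero on a batch this small, the model or the training loop is the suspect, not the data volume or the hyperparameters.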

Why should you verify that your model can overfit a single mini-batch before real training?

Chapter 3: Diagnosing Problems

When your model does not work well, the first question is: is it underfitting or overfitting? This determines your next action.

Underfitting (high bias): Training loss is high. The model is too simple or the learning rate is wrong.
Fix: Increase model capacity (more layers, wider layers), train longer, reduce regularization, lower learning rate if oscillating.

Overfitting (high variance): Training loss is low but validation loss is high.
Fix: Add regularization (dropout, weight decay, data augmentation), get more data, reduce model capacity, early stopping.

Plot learning curves: training and validation loss vs. epoch. The gap between them, together with their absolute levels, is the main diagnostic signal:

Diagnosing Under/Overfitting

Adjust model capacity and regularization. Watch how the train-val gap changes.

Model capacity: 5
Regularization: 0.5
Data amount: 5

Key diagnostic patterns:

• Both losses high and close together → underfitting → increase capacity.

• Train loss low, val loss high (large gap) → overfitting → more data or more regularization.

• Both losses plateau early → optimization issue → check learning rate, try different optimizer.

• Val loss increases while train loss decreases → overfitting → apply early stopping.
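The four patterns above can be turned into a rough rule-based check on recorded loss curves. This is only a sketch: the thresholds `high_loss` and `gap_tol` are hypothetical values that depend on the loss scale of the task, and the "plateaued early" test (little progress after the first quarter of training) is an assumed heuristic.

```python
def diagnose(train_losses, val_losses, high_loss=0.5, gap_tol=0.1):
    """Map per-epoch loss curves to one of the diagnostic patterns."""
    final_train, final_val = train_losses[-1], val_losses[-1]
    best_val_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    val_rebound = final_val - val_losses[best_val_epoch]
    train_still_falling = final_train < train_losses[best_val_epoch]

    # Val loss rises after its minimum while train loss keeps falling.
    if val_rebound > gap_tol and train_still_falling:
        return f"overfitting: apply early stopping (best epoch {best_val_epoch})"
    # Large train-val gap at the end of training.
    if final_val - final_train > gap_tol:
        return "overfitting: more data or more regularization"
    # Both losses high and close together.
    if final_train > high_loss:
        quarter = train_losses[len(train_losses) // 4]
        if quarter - final_train < gap_tol:
            return "plateaued early: check learning rate, try a different optimizer"
        return "underfitting: increase capacity"
    return "no obvious problem"


# Made-up loss curves, one value per epoch.
falling = [1.0, 0.6, 0.3, 0.15, 0.1, 0.08, 0.06, 0.05]
print(diagnose([1.0, 0.9, 0.8, 0.75, 0.7, 0.65, 0.6, 0.58],
               [1.05, 0.95, 0.85, 0.8, 0.75, 0.7, 0.65, 0.62]))
print(diagnose(falling, [1.0, 0.7, 0.5, 0.45, 0.42, 0.41, 0.40, 0.40]))
print(diagnose(falling, [1.0, 0.7, 0.5, 0.45, 0.5, 0.6, 0.7, 0.8]))
print(diagnose([1.0, 0.82, 0.8, 0.8, 0.8, 0.79, 0.79, 0.79],
               [1.0, 0.85, 0.83, 0.82, 0.82, 0.82, 0.82, 0.82]))
```

A check like this is a prompt to look at the plot, not a replacement for it; real curves are noisy and usually need smoothing before any threshold is applied.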

You observe low training loss but high validation loss. What is the most likely problem and solution?

Chapter 4: Hyperparameter Tuning

Deep learning models have many hyperparameters: learning rate, batch size, weight decay, dropout rate, hidden size, number of layers, and more. Tuning them efficiently is critical.

Priority order for tuning:

Priority. Hyperparameter: Typical range
1. Learning rate (critical): 1e-5 to 1e-1 (log scale)
2. Batch size: 16, 32, 64, 128, 256
3. Weight decay: 1e-5 to 1e-1 (log scale)
4. Dropout rate: 0.0 to 0.5
5. Architecture (layers, width): Task-dependent

Random search beats grid search. Bergstra & Bengio (2012) showed that random search is more efficient than grid search for hyperparameter tuning. The reason: not all hyperparameters are equally important. Grid search wastes trials on unimportant dimensions. Random search distributes trials across all dimensions, giving better coverage of the important ones.

Search on a log scale for learning rate and weight decay. The difference between 0.001 and 0.01 is far more important than between 0.091 and 0.1. Sample uniformly in log space: lr = 10^u, with u drawn from uniform(-5, -1).
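Log-space sampling takes only a few lines with the standard library. A minimal sketch; the function name `sample_log_uniform` is hypothetical, and the range matches the learning-rate range used in this chapter.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent


rng = random.Random(0)
learning_rates = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(1000)]

# In log space, 1e-3 is the midpoint of [1e-5, 1e-1], so about half the samples
# fall below it. Sampling uniformly in linear space would put only about 1% there.
fraction_below = sum(lr < 1e-3 for lr in learning_rates) / len(learning_rates)
print(f"fraction of samples below 1e-3: {fraction_below:.2f}")
```

The same function works for weight decay, since it is searched over the same kind of range.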

Bayesian optimization (e.g., Optuna, Hyperopt) builds a probabilistic model of the objective function and uses it to choose the most promising hyperparameter configurations. It is more sample-efficient than random search but adds complexity. Use it when each trial is expensive (hours of training).

Why does random search outperform grid search for hyperparameter tuning?

Chapter 5: Debugging Strategies

Deep learning systems fail silently. The model trains, the loss decreases, and the output looks plausible — but the results are subtly wrong. Debugging requires systematic techniques.

Common failure modes:

Data bugs: Mislabeled examples, preprocessing errors, data leakage (test data in training set), incorrect feature normalization. These are the most common and hardest to find.

Training bugs: Wrong loss function, gradients not flowing (dead neurons, disconnected layers), learning rate too high or too low.

Evaluation bugs: Different preprocessing at train vs. test time, batch normalization running-average issues, dropout left on during evaluation.
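Of the data bugs listed above, leakage of test examples into the training set is one of the easier ones to test for mechanically. The sketch below assumes text examples and only detects exact duplicates after lowercasing and whitespace normalization; near-duplicates would need a fuzzier comparison. The function names and sample texts are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def find_leaked_examples(train_texts, test_texts):
    """Return the test examples that also appear in the training set."""
    seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in seen]


# Made-up examples for illustration.
train_texts = ["Free money now", "Meeting at 10am", "Lunch tomorrow?"]
test_texts = ["free  money NOW", "Quarterly report attached"]

leaked = find_leaked_examples(train_texts, test_texts)
print(f"{len(leaked)} of {len(test_texts)} test examples also appear in training: {leaked}")
```

Running a check like this once, right after the train/test split is created, is cheap compared with discovering the leak after a suspiciously good test score.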

The debugging checklist:
1. Overfit one batch. If you cannot get near-zero training loss on 10 examples, the model or training loop is broken.
2. Check gradients. Are they NaN? Are they all zero? Are they all the same sign?
3. Visualize predictions. Look at the actual outputs, not just summary metrics. Plot a few examples.
4. Ablation study. Remove each component one at a time. Which one breaks the model?
5. Simplify. Reduce the model to the smallest version that should still work. Debug that.

Gradient checking verifies your backprop implementation. Compute the numerical gradient (f(x + ε) − f(x − ε)) / (2ε) and compare with the analytical gradient. They should agree to ~10^-5. Use this when implementing custom layers.
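The central-difference check can be sketched on a function whose gradient is easy to write by hand. The example function, the test point, and the deliberately wrong gradient are all made up for illustration; for a custom layer, `f` would be the layer's scalar loss and `analytical_grad` the output of its backward pass.

```python
import math


def f(x):
    """Example scalar function: sum of x_i^2 * sin(x_i)."""
    return sum(xi ** 2 * math.sin(xi) for xi in x)


def analytical_grad(x):
    """Hand-derived gradient of f (product rule on each term)."""
    return [2 * xi * math.sin(xi) + xi ** 2 * math.cos(xi) for xi in x]


def buggy_grad(x):
    """A deliberately wrong gradient that forgets the second product-rule term."""
    return [2 * xi * math.sin(xi) for xi in x]


def numerical_grad(func, x, eps=1e-5):
    """Central difference (f(x + eps) - f(x - eps)) / (2 * eps), one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += eps
        minus[i] -= eps
        grad.append((func(plus) - func(minus)) / (2 * eps))
    return grad


def max_relative_error(a, b):
    return max(abs(ai - bi) / max(abs(ai), abs(bi), 1e-12) for ai, bi in zip(a, b))


x = [0.5, -1.2, 2.0]
numeric = numerical_grad(f, x)
print("correct gradient, max relative error:", max_relative_error(numeric, analytical_grad(x)))
print("buggy gradient,   max relative error:", max_relative_error(numeric, buggy_grad(x)))
```

The correct gradient should agree with the numerical one far more tightly than the buggy one, which is the signal gradient checking relies on.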

Visualization is your best tool. Plot the loss curve, gradient distributions, weight distributions, activations, and predictions. Most bugs become obvious when visualized. Tensorboard, Weights & Biases, and similar tools make this easy.

Your model trains but achieves poor results. What is the single most effective first debugging step?

Chapter 6: When to Gather More Data

The answer to "how do I improve my model?" is often "get more data." But data collection is expensive. How do you know when it is worth it?

The learning curve test: Plot validation performance as a function of training set size. If the curve is still climbing steeply at your current data size, more data will help significantly. If it has plateaued, more data will not help much — you need a better model or features.
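One way to read "still climbing steeply" off a learning curve is to measure the validation gain per doubling of the training set, using the last two measured points. This is a rough sketch: the function names are hypothetical, the 0.01 cutoff is an arbitrary assumption that should be set relative to the noise in the metric, and the scores are made up.

```python
import math


def gain_per_doubling(sizes, val_scores):
    """Validation improvement per doubling of data, measured on the last two points."""
    doublings = math.log2(sizes[-1] / sizes[-2])
    return (val_scores[-1] - val_scores[-2]) / doublings


def more_data_likely_helps(sizes, val_scores, min_gain=0.01):
    """True if the curve is still climbing by at least min_gain per doubling."""
    return gain_per_doubling(sizes, val_scores) >= min_gain


# Made-up validation accuracies at four training set sizes.
sizes = [1000, 2000, 4000, 8000]
still_climbing = [0.70, 0.75, 0.79, 0.83]
plateaued = [0.80, 0.84, 0.85, 0.852]

print("still climbing:", more_data_likely_helps(sizes, still_climbing))
print("plateaued:     ", more_data_likely_helps(sizes, plateaued))
```

In practice each point on the curve should be averaged over several training runs, since a single run's validation score can move by more than the gain being measured.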

Learning Curve: Data vs Performance

The gap between training and validation shows whether more data would help. A large gap that narrows with more data = collect more data.

Model complexity: 5

When more data helps: If the train-val gap is large (overfitting), more data tightens it. The model has the capacity to learn but not enough examples to generalize. This is the most common scenario.

When more data does NOT help: If both train and val performance are low (underfitting), the model is too simple. More data will not fix an architecture that cannot learn the pattern. Increase model capacity first.

Alternatives to collecting more data:

Data augmentation (Ch 7) creates synthetic examples from existing ones.

Transfer learning uses a model pretrained on a large dataset and fine-tunes on your small dataset.

Semi-supervised learning uses large amounts of unlabeled data alongside small amounts of labeled data.

Synthetic data generation uses simulators, generative models, or templates to create training data.

How do you determine whether collecting more data will improve your model?

Chapter 7: Grid vs Random Search Simulator

See why random search finds better hyperparameters than grid search with the same budget. The true objective depends mostly on one parameter (learning rate) and barely on the other (weight decay).
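The same comparison can be reproduced offline. The sketch below uses a made-up objective with a sharp peak in the learning rate and an almost flat dependence on weight decay; the peak location, its width, and the function names are assumptions chosen for illustration, not the objective used by the simulator.

```python
import math
import random


def objective(log_lr, log_wd):
    """Made-up score: sharp peak at lr = 10^-2.7, nearly flat in weight decay."""
    return math.exp(-((log_lr + 2.7) ** 2) / 0.1) - 0.001 * (log_wd + 3.0) ** 2


def grid_search(points_per_axis, low=-5.0, high=-1.0):
    """All combinations of evenly spaced log10 values on both axes."""
    step = (high - low) / (points_per_axis - 1)
    axis = [low + i * step for i in range(points_per_axis)]
    return [(lr, wd) for lr in axis for wd in axis]


def random_search(n_trials, rng, low=-5.0, high=-1.0):
    """Independent uniform draws in log10 space for both hyperparameters."""
    return [(rng.uniform(low, high), rng.uniform(low, high)) for _ in range(n_trials)]


def summarize(trials):
    """Number of distinct learning rates tried and the best score found."""
    distinct_lr = len({lr for lr, _ in trials})
    best = max(objective(lr, wd) for lr, wd in trials)
    return distinct_lr, best


grid = grid_search(3)                      # 3 x 3 = 9 trials
rand = random_search(9, random.Random(0))  # 9 trials

print("grid:   distinct learning rates = %d, best score = %.3f" % summarize(grid))
print("random: distinct learning rates = %d, best score = %.3f" % summarize(rand))
```

With 9 trials the grid only ever tries 3 learning rates, while random search tries 9, so it has more chances to land near the peak. Changing the seed gives a different random sample, so a single comparison can go either way.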

Grid Search vs Random Search

Both methods try the same number of configurations. Random search explores more values of the important axis. Green = better performance.

Number of trials: 16

Experiments: (1) With 9 trials, grid search tests only 3 values per axis. Random search tests 9 unique values of the important axis. (2) Increase to 25 trials: grid covers 5 values, random covers 25. (3) Click "New Random Samples" repeatedly: random search usually finds a better configuration than grid.

In the simulation, why does random search find better hyperparameter values with the same number of trials?

Chapter 8: Connections

Practical methodology is the bridge between theory and working systems:

Practice: Where It Appears
Learning curves: Essential for all model development. Diagnose under/overfitting. Decide data collection vs. model changes.
Hyperparameter search: Automated via Optuna, Ray Tune, Weights & Biases sweeps. Critical for fair model comparisons.
Baselines: Required in every paper. Ablation studies are the standard for proving each component contributes.
Debugging via overfit: Standard first step in any training pipeline. Also used to verify data loading and preprocessing.
Precision vs Recall: Medical diagnosis, fraud detection, information retrieval. Every classification task needs the right metric.
Transfer learning: The default for vision (ImageNet pretraining, Ch 9), NLP (BERT/GPT pretraining), and increasingly all domains.
Data augmentation: Standard for vision (Ch 7, 9). Growing for NLP (back-translation, synonym replacement) and audio.

What you should take away: Always start with a simple baseline and a correct metric. Diagnose before prescribing: is the problem underfitting or overfitting? Use random search on a log scale for the learning rate. Verify your pipeline can overfit a single batch before training at scale. Plot learning curves to decide whether you need more data or a better model.

Up next: Chapter 12: Applications — how deep learning is applied to vision, language, speech, and more.

What is the single most important practical skill in deep learning?