Performance metrics, debugging, hyperparameter tuning, and the systematic approach to building deep learning systems that actually work.
Building a deep learning system is not just about choosing the right architecture. It is a systematic engineering process with many decisions that interact in subtle ways. Most time is spent not on modeling but on data preparation, debugging, and evaluation.
The most important principle: never skip the baseline. If a linear regression achieves 90% of the performance of your 100-layer deep network, the complexity is not justified. Start simple. Increase complexity only when you have evidence it helps.
Choosing the right metric is one of the most consequential decisions in a project. The wrong metric means the model optimizes for the wrong thing, and no amount of architecture tuning can fix that.
Accuracy is the simplest metric but dangerously misleading for imbalanced data. If 99% of emails are not spam, a model that always predicts "not spam" achieves 99% accuracy while being completely useless.
Precision = TP / (TP + FP) — of all positive predictions, how many are correct? Important when false positives are costly (e.g., spam filter blocking important emails).
Recall = TP / (TP + FN) — of all actual positives, how many did we catch? Important when false negatives are costly (e.g., cancer screening missing a tumor).
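The accuracy trap and the two formulas above can be checked with a few lines of Python. This is a minimal sketch: the 1000-email dataset and the `filter_pred` predictions are made-up illustrative values, and returning 0.0 when a denominator is zero is an assumed convention, not part of the definitions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    # Assumed convention: 0.0 when the model makes no positive predictions.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    # Assumed convention: 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical data: 1000 emails, 10 of them spam (label 1).
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts "not spam".
always_negative = [0] * 1000

# A hypothetical filter: catches 8 of 10 spam, wrongly flags 4 legitimate emails.
filter_pred = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 986

print(accuracy(y_true, always_negative), recall(y_true, always_negative))
print(accuracy(y_true, filter_pred), precision(y_true, filter_pred), recall(y_true, filter_pred))
```

The always-negative model scores 0.99 accuracy with zero recall, while the hypothetical filter has only slightly higher accuracy (0.994) but a recall of 0.8 and a precision of about 0.67. Accuracy alone barely separates the two.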
For regression, MSE (mean squared error) penalizes errors quadratically, so a few large errors dominate the score, while MAE (mean absolute error) penalizes each error in proportion to its size and is less sensitive to outliers. For ranking tasks, use NDCG or MAP. For generation tasks, metrics like BLEU (translation), ROUGE (summarization), or perplexity (language modeling) capture domain-specific quality.
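A small worked example shows how differently the two regression metrics react to one large error. The prediction lists below are made-up values chosen so that both cases have the same total absolute error.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0.0, 0.0, 0.0, 0.0]
small_errors = [2.0, 2.0, 2.0, 2.0]  # four errors of size 2
one_outlier = [0.0, 0.0, 0.0, 8.0]   # a single error of size 8

print(mae(y_true, small_errors), mae(y_true, one_outlier))  # 2.0 and 2.0
print(mse(y_true, small_errors), mse(y_true, one_outlier))  # 4.0 and 16.0
```

MAE rates the two sets of predictions as equally good, while MSE rates the one with the single large error as four times worse.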
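For classifiers that output a score, the decision threshold controls the precision-recall trade-off. Lowering the threshold marks more examples as positive, which raises recall and usually lowers precision.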
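The threshold effect can be seen by sweeping a few thresholds over a fixed set of model scores. The scores and labels here are hypothetical, and `precision_recall_at` is an illustrative helper rather than a library function.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when every score >= threshold is predicted positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    # Assumed convention: 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores and true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

results = {t: precision_recall_at(t, scores, labels) for t in (0.8, 0.5, 0.25)}
for threshold, (p, r) in results.items():
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this data, recall climbs from 0.5 to 0.75 to 1.0 as the threshold drops, while precision falls from 1.0 to 0.75 to about 0.67. Recall can never decrease when the threshold is lowered; precision usually falls, but that is not guaranteed on every dataset.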
A baseline is the simplest model that gives a meaningful result. It establishes the performance floor and tells you whether your problem is easy or hard.
Build baselines in order of increasing complexity:
| Baseline | Purpose |
|---|---|
| Random | Predicts at random. Establishes absolute floor. If your model barely beats random, something is broken. |
| Constant | Always predicts the most common class or the mean. Reveals class imbalance issues. |
| Linear model | Logistic regression / linear regression. Shows how much signal is in the features without nonlinearity. |
| Simple NN | 2-layer MLP. Shows the benefit of nonlinearity. If this barely beats linear, deep networks may not help. |
| Known architecture | Apply a standard architecture for the domain (ResNet for images, BERT for text). Establishes what "good" looks like. |
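The constant baseline from the table takes only a few lines. The sketch below assumes plain Python lists for the data. The function names and the cat/dog labels are illustrative, not from any library or dataset.

```python
from collections import Counter


def majority_class_predictor(train_labels):
    """Return a predictor that ignores its input and outputs the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda example: most_common_label


def mean_predictor(train_targets):
    """Return a regression predictor that always outputs the training mean."""
    mean_value = sum(train_targets) / len(train_targets)
    return lambda example: mean_value


# Hypothetical imbalanced labels: 70% "cat" in both splits.
train_labels = ["cat"] * 70 + ["dog"] * 30
test_labels = ["cat"] * 14 + ["dog"] * 6

predict = majority_class_predictor(train_labels)
baseline_accuracy = sum(predict(None) == y for y in test_labels) / len(test_labels)
print(baseline_accuracy)  # 0.7, the number any real model has to beat
```

Any trained classifier on this data should be compared against 0.7, not against 0.5.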
Sanity checks before real training: (1) Can the model overfit a single batch? If not, there is a bug. (2) Does training loss decrease? If not, the learning rate is wrong or gradients are broken. (3) Does increasing model size lower the training loss? If not, capacity is not the bottleneck, and the problem more likely lies in the data or the optimization.
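Check (1) can be illustrated without any framework. The sketch below stands in a tiny logistic-regression model for the network and a four-example batch (the AND function) for real data; both are assumptions made to keep the example self-contained. In a real project the same loop would run on the actual model and one real batch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def batch_loss(weights, bias, xs, ys):
    """Mean binary cross-entropy on one batch."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def overfit_single_batch(xs, ys, lr=1.0, steps=2000):
    """Run plain gradient descent on one batch and return the loss history."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    n = len(xs)
    losses = [batch_loss(weights, bias, xs, ys)]
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(weights, bias, x) - y  # dLoss/dlogit for cross-entropy
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        losses.append(batch_loss(weights, bias, xs, ys))
    return losses


# Hypothetical batch: the AND function, which a linear model can fit exactly.
batch_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
batch_y = [0, 0, 0, 1]

losses = overfit_single_batch(batch_x, batch_y)
print(f"initial loss {losses[0]:.3f}, final loss {losses[-1]:.3f}")
```

The loss starts at ln 2 (about 0.693) and should fall close to zero. If the loss on a single batch stalls well above zero, suspect the loss function, the gradient computation, or the data pipeline before touching the architecture.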
When your model does not work well, the first question is: is it underfitting or overfitting? This determines your next action.
Plot learning curves: training and validation loss vs. epoch. The size of the gap between them, and how it changes as you adjust model capacity and regularization, is the main diagnostic signal.
Key diagnostic patterns:
• Both losses high and close together → underfitting → increase capacity.
• Train loss low, val loss high (large gap) → overfitting → more data or more regularization.
• Both losses plateau early → optimization issue → check learning rate, try different optimizer.
• Val loss increases while train loss decreases → overfitting → apply early stopping.
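The last pattern, early stopping, is commonly implemented with a patience counter. The sketch below is one such variant. The `patience` and `min_delta` parameters and the validation-loss values are illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both zero-based.

    Training stops once validation loss has failed to improve by more than
    min_delta for `patience` consecutive epochs. The checkpoint to keep is
    the one saved at best_epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improving until epoch 3, then rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

Here training halts at epoch 6 and the weights from epoch 3 are restored. A larger patience tolerates noisy validation curves at the cost of extra training time.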
Deep learning models have many hyperparameters: learning rate, batch size, weight decay, dropout rate, hidden size, number of layers, and more. Tuning them efficiently is critical.
Priority order for tuning:
| Priority | Hyperparameter | Typical Range |
|---|---|---|
| 1 (critical) | Learning rate | 1e-5 to 1e-1 (log scale) |
| 2 | Batch size | 16, 32, 64, 128, 256 |
| 3 | Weight decay | 1e-5 to 1e-1 (log scale) |
| 4 | Dropout rate | 0.0 to 0.5 |
| 5 | Architecture (layers, width) | Task-dependent |
Search on a log scale for learning rate and weight decay. The difference between 0.001 and 0.01 is far more important than between 0.091 and 0.1. Sample uniformly in log space: lr = 10^u, where u is drawn from uniform(-5, -1).
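Log-uniform sampling is short enough to write by hand. In this sketch, `sample_log_uniform` is an illustrative helper name and the seed is arbitrary. Counting samples per decade confirms that each order of magnitude gets roughly the same share of trials.

```python
import math
import random


def sample_log_uniform(low_exponent, high_exponent, rng):
    """Draw a value whose base-10 logarithm is uniform on [low_exponent, high_exponent]."""
    return 10.0 ** rng.uniform(low_exponent, high_exponent)


rng = random.Random(0)  # arbitrary seed, for reproducibility only
learning_rates = [sample_log_uniform(-5, -1, rng) for _ in range(1000)]

# Count how many samples fall in each decade, e.g. -3 means [1e-3, 1e-2).
decade_counts = {}
for lr in learning_rates:
    decade = math.floor(math.log10(lr))
    decade_counts[decade] = decade_counts.get(decade, 0) + 1

for decade in sorted(decade_counts):
    print(f"1e{decade} to 1e{decade + 1}: {decade_counts[decade]} samples")
```

Sampling uniformly between 1e-5 and 1e-1 on a linear scale would instead place about 90% of the trials between 0.01 and 0.1 and almost none below 1e-3.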
Bayesian optimization (e.g., Optuna, Hyperopt) builds a probabilistic model of the objective function and uses it to choose the most promising hyperparameter configurations. It is more sample-efficient than random search but adds complexity. Use it when each trial is expensive (hours of training).
Deep learning systems fail silently. The model trains, the loss decreases, and the output looks plausible — but the results are subtly wrong. Debugging requires systematic techniques.
Common failure modes:
• Data bugs: Mislabeled examples, preprocessing errors, data leakage (test data in training set), incorrect feature normalization. These are the most common and hardest to find.
• Training bugs: Wrong loss function, gradients not flowing (dead neurons, disconnected layers), learning rate too high or too low.
• Evaluation bugs: Different preprocessing at train vs. test time, batch normalization running-average issues, dropout left on during evaluation.
Gradient checking verifies your backprop implementation. Compute the numerical gradient (f(x + ε) − f(x − ε)) / (2ε) and compare it with the analytical gradient. The relative difference between the two should be about 10^-5 or smaller. Use this when implementing custom layers.
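The central-difference check can be written in a few lines. The function `f`, its hand-derived gradient, and the deliberately wrong `buggy_grad` below are illustrative stand-ins for a custom layer. The relative-error formula (norm of the difference divided by the sum of the norms) is one common choice, assumed here.

```python
import math


def numerical_gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of f at the point x (a list of floats)."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad


def relative_error(a, b):
    """||a - b|| / (||a|| + ||b||), defined as 0.0 when both vectors are zero."""
    diff = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    scale = math.sqrt(sum(ai ** 2 for ai in a)) + math.sqrt(sum(bi ** 2 for bi in b))
    return diff / scale if scale else 0.0


def f(x):
    return x[0] ** 2 * x[1] + math.sin(x[1])


def analytic_grad(x):
    return [2 * x[0] * x[1], x[0] ** 2 + math.cos(x[1])]


def buggy_grad(x):
    # Sign error in the second component, as might slip into a hand-written backward pass.
    return [2 * x[0] * x[1], x[0] ** 2 - math.cos(x[1])]


point = [1.5, -0.7]
numeric = numerical_gradient(f, point)
good_error = relative_error(numeric, analytic_grad(point))
bad_error = relative_error(numeric, buggy_grad(point))
print(f"correct gradient: {good_error:.2e}, buggy gradient: {bad_error:.2e}")
```

The correct gradient agrees with the numerical estimate far below the 10^-5 level, while the sign error produces a relative error many orders of magnitude larger. Run the check in double precision; in single precision the achievable agreement is much looser.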
Visualization is your best tool. Plot the loss curve, gradient distributions, weight distributions, activations, and predictions. Most bugs become obvious when visualized. TensorBoard, Weights & Biases, and similar tools make this easy.
The answer to "how do I improve my model?" is often "get more data." But data collection is expensive. How do you know when it is worth it?
The learning curve test: Plot validation performance as a function of training set size. If the curve is still climbing steeply at your current data size, more data will help significantly. If it has plateaued, more data will not help much — you need a better model or features.
The gap between training and validation performance also indicates whether more data would help. A large gap that narrows as the training set grows is a sign that collecting more data is worthwhile.
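One rough way to quantify "still climbing" is the change in validation score per doubling of the training set, measured on the last two points of the curve. The helper name `gain_per_doubling` and both score curves below are made-up for illustration. What counts as a worthwhile gain depends on the project and is not fixed by this sketch.

```python
import math


def gain_per_doubling(train_sizes, val_scores):
    """Change in validation score per doubling of data, from the last two measurements."""
    n_prev, n_last = train_sizes[-2], train_sizes[-1]
    s_prev, s_last = val_scores[-2], val_scores[-1]
    doublings = math.log2(n_last / n_prev)
    return (s_last - s_prev) / doublings


train_sizes = [1000, 2000, 4000, 8000]

# Hypothetical validation accuracies for two different projects.
still_climbing = [0.62, 0.70, 0.77, 0.83]
plateaued = [0.80, 0.84, 0.85, 0.852]

print(f"{gain_per_doubling(train_sizes, still_climbing):.3f}")  # 0.060 per doubling
print(f"{gain_per_doubling(train_sizes, plateaued):.3f}")       # 0.002 per doubling
```

In the first case another doubling of the data plausibly buys several points of accuracy. In the second, the same effort is likely better spent on the model or the features.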
Alternatives to collecting more data:
• Data augmentation (Ch 7) creates synthetic examples from existing ones.
• Transfer learning uses a model pretrained on a large dataset and fine-tunes on your small dataset.
• Semi-supervised learning uses large amounts of unlabeled data alongside small amounts of labeled data.
• Synthetic data generation uses simulators, generative models, or templates to create training data.
Random search vs. grid search: with the same budget, random search usually finds better hyperparameters than grid search when the objective depends mostly on one parameter (learning rate) and barely on the other (weight decay). Both methods try the same number of configurations, but a grid reuses the same few values on each axis, while random search tries a different value of the important axis in every trial.
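The comparison can be simulated with a toy objective. Everything in this sketch is an assumption made for illustration: the quadratic `objective`, its optimum at a learning rate of 10^-2.5, the 3x3 grid, and the 500 repetitions. The outcome depends on where the optimum falls relative to the grid points, so treat the numbers as a demonstration of the mechanism, not a general result.

```python
import itertools
import random


def objective(log_lr, log_wd):
    """Hypothetical score (higher is better): sensitive to learning rate, nearly flat in weight decay."""
    return -(log_lr + 2.5) ** 2 - 0.01 * (log_wd + 3.0) ** 2


def grid_trials(points_per_axis=3, low=-5.0, high=-1.0):
    """Evenly spaced grid over both log-scale axes."""
    step = (high - low) / (points_per_axis - 1)
    axis = [low + i * step for i in range(points_per_axis)]
    return list(itertools.product(axis, axis))


def random_trials(n_trials, rng, low=-5.0, high=-1.0):
    """Independent uniform samples over both log-scale axes."""
    return [(rng.uniform(low, high), rng.uniform(low, high)) for _ in range(n_trials)]


def best_score(trials):
    return max(objective(log_lr, log_wd) for log_lr, log_wd in trials)


grid = grid_trials(3)  # 9 configurations, but only 3 distinct learning rates
grid_best = best_score(grid)

n_repeats = 500
random_wins = 0
for seed in range(n_repeats):
    trials = random_trials(len(grid), random.Random(seed))  # same budget of 9 trials
    if best_score(trials) > grid_best:
        random_wins += 1
win_rate = random_wins / n_repeats

distinct_grid_lrs = len({log_lr for log_lr, _ in grid})
distinct_random_lrs = len({log_lr for log_lr, _ in random_trials(len(grid), random.Random(0))})
print(distinct_grid_lrs, distinct_random_lrs, win_rate)
```

With nine trials each, the grid evaluates three distinct learning rates and random search evaluates nine. Under this toy objective, random search beats the grid in a clear majority of the repetitions.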
Practical methodology is the bridge between theory and working systems:
| Practice | Where It Appears |
|---|---|
| Learning curves | Essential for all model development. Diagnose under/overfitting. Decide data collection vs. model changes. |
| Hyperparameter search | Automated via Optuna, Ray Tune, Weights & Biases sweeps. Critical for fair model comparisons. |
| Baselines | Expected in published results. Ablation studies are the standard way to show that each component contributes. |
| Debugging via overfit | Standard first step in any training pipeline. Also used to verify data loading and preprocessing. |
| Precision vs Recall | Medical diagnosis, fraud detection, information retrieval. Every classification task needs the right metric. |
| Transfer learning | The default for vision (ImageNet pretraining, Ch 9), NLP (BERT/GPT pretraining), and increasingly all domains. |
| Data augmentation | Standard for vision (Ch 7, 9). Growing for NLP (back-translation, synonym replacement) and audio. |
Up next: Chapter 12: Applications — how deep learning is applied to vision, language, speech, and more.