CS224N Lecture 6

Practical Methodology

Debugging, tuning, and shipping ML models — the 80% of the job nobody teaches.

Prerequisites: L03 Backprop + L05 Transformers (recommended). That's it.
8 Chapters · 6+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Methodology?

You've built a Transformer, hit "train," and the loss is stuck at 2.3. Now what? You stare at the loss curve. You tweak the learning rate. You add more layers. Nothing helps. You start to wonder if the architecture is wrong, if the data is bad, or if you've got a bug somewhere in 2,000 lines of code.

This is where most ML practitioners spend most of their time. Not designing elegant architectures. Not writing papers. Debugging, tuning, and evaluating. The gap between understanding how a Transformer works and actually getting one to train well on your data is enormous — and almost nobody teaches it systematically.

Ask any ML engineer how they spend their weeks. The answer is remarkably consistent: about 10% designing the model, 10% writing code, a full 60% debugging and tuning, and 20% evaluating results and iterating. The architecture choice that feels like the main event? It's a small fraction of the work.

The simulation below shows this breakdown. Click each slice to see what that phase actually involves day-to-day. Notice how the "Debugging & Tuning" slice dwarfs everything else.

Where ML Time Actually Goes

Click a slice to see details. The pulsing slice is where you'll spend most of your career.

Compare that to how beginners think the time breaks down. Newcomers imagine it's 50% architecture design, 30% coding, 10% tuning, 10% evaluation. They underestimate debugging and tuning by a factor of six (10% imagined versus 60% in practice). The architecture is the easy part. Making it work is the job.

This lesson teaches the unglamorous but essential skills: how to build a data pipeline that doesn't lie to you, how to triage a broken model in five steps, how to search for hyperparameters efficiently, how to read a loss curve like a cockpit instrument panel, and how to evaluate a model beyond a single accuracy number.

The gap between "I understand Transformers" and "I can train one that works" is where careers are made. Theory gets you the interview. Methodology gets you the results. Every senior ML engineer has a mental toolkit for debugging — this lesson gives you yours.
What this lesson covers: Data quality and leakage detection. The five-step debugging triage. Random vs grid search for hyperparameters. The regularization toolkit (dropout, weight decay, early stopping, augmentation). Reading training dashboards — loss curves, gradient norms, LR schedules, parameter histograms. Evaluation beyond accuracy: confusion matrices, precision, recall, F1, and class imbalance.
What fraction of a typical ML project is spent debugging and tuning?

Chapter 1: Data Pipeline

Your model can only be as good as its data. A perfect architecture trained on noisy, mislabeled, or leaky data will produce noisy, mislabeled, or deceptively optimistic predictions. Data quality determines your ceiling — everything else determines how close you get to it.

Think of it this way. If 5% of your labels are wrong, including the labels you evaluate against, even a perfect model will score about 95%, no matter how many layers you add. If your validation set accidentally contains training examples, your metrics will look great and your deployment will fail. If your data distribution shifts between training and production, your carefully tuned model will silently degrade.

The data pipeline has three stages, each with its own failure modes. Raw data contains noise: outliers, missing values, duplicates, and encoding errors. Cleaned data has been filtered, normalized, and validated. Augmented data has been expanded through transformations — flips, crops, paraphrases — to improve generalization. Each stage can introduce or fix problems.

The Three Stages

The widget below shows how data quality affects model performance. On the left, you see raw data with outliers highlighted in red. Toggle through the stages to see cleaning and augmentation. The slider controls data quality — watch how the decision boundary sharpens as quality improves.

Data Quality Pipeline

Toggle between data stages. Drag the quality slider to see the effect on the model's decision boundary.

Data Quality 30%
Stage: Raw — red points are outliers/mislabeled examples.

Data Leakage: The Silent Killer

Data leakage occurs when information from outside the training set bleeds into the model's training process. The most common form is when validation or test examples end up in the training set — either as exact duplicates or through subtle contamination like time-series overlap or shared patients across splits.

Here's a concrete example. You're building a medical classifier. Your dataset has 10 images per patient — different angles of the same scan. You randomly split images into train/val/test. Result: images from the same patient appear in both training and validation. The model learns patient-specific features (positioning, scanner artifacts), not disease features. Validation accuracy: 98%. Deployment accuracy: 72%.

The fix is to split at the patient level, not the image level. All images from one patient go into the same split. This seems obvious in retrospect, but it's the #1 cause of published results that don't reproduce.

Data leakage is the silent killer. If your validation set contains training information, your metrics are lies. Always split at the entity level (patient, user, document), not the sample level. If your validation accuracy seems too good to be true, it probably is.
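The patient-level split described above can be sketched in a few lines. This is a minimal illustration, not a production splitter: the function name `split_by_entity`, the `(patient_id, image_id)` tuple layout, and the 10% validation and test fractions are all assumptions made for the example.

```python
import random
from collections import defaultdict

def split_by_entity(samples, entity_of, val_frac=0.1, test_frac=0.1, seed=0):
    """Split samples so that every entity lands in exactly one split."""
    by_entity = defaultdict(list)
    for sample in samples:
        by_entity[entity_of(sample)].append(sample)

    entities = sorted(by_entity)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(entities)

    n_test = round(len(entities) * test_frac)
    n_val = round(len(entities) * val_frac)
    groups = {
        "test": entities[:n_test],
        "val": entities[n_test:n_test + n_val],
        "train": entities[n_test + n_val:],
    }
    # Expand each group of entities back into its samples.
    return {name: [s for e in ents for s in by_entity[e]]
            for name, ents in groups.items()}

# Hypothetical dataset: 20 patients with 10 images each.
images = [(f"patient_{p:02d}", f"img_{i}") for p in range(20) for i in range(10)]
splits = split_by_entity(images, entity_of=lambda s: s[0])

# Leakage check: no patient may appear in more than one split.
patients = {name: {pid for pid, _ in rows} for name, rows in splits.items()}
assert not (patients["train"] & patients["val"])
assert not (patients["train"] & patients["test"])
assert not (patients["val"] & patients["test"])
```

The final three assertions are worth keeping in a real pipeline too, since they turn a silent leak into a loud failure.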

The Train/Val/Test Split

Why three splits, not two? Training data teaches the model. Validation data guides your hyperparameter choices — you tune learning rate, regularization, and architecture based on validation performance. Test data gives you the final, unbiased estimate of how well the model will perform on truly unseen data.

If you tune hyperparameters on the test set, you've effectively used it for training. The test score becomes optimistic. This is why Kaggle competitions use a hidden test set — competitors can't overfit to it.

Typical splits: 80/10/10 for large datasets (100K+ examples), 60/20/20 for medium datasets (10K-100K), and cross-validation for small datasets (under 10K). With very large datasets (millions), you can shrink val/test to 1% each — 10,000 examples is usually enough for a reliable estimate.

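| Dataset Size | Train | Val | Test | Notes |
|---|---|---|---|---|
| > 1M | 98% | 1% | 1% | 10K val/test is plenty |
| 100K – 1M | 80% | 10% | 10% | Standard split |
| 10K – 100K | 60% | 20% | 20% | Need more val for stability |
| < 10K | k-fold cross-validation | | | Every example validates once |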
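For the small-data row, here is a rough sketch of how k-fold index generation works. The helper `k_fold_indices` is a hypothetical name, and the sketch assumes the examples have already been shuffled (and grouped by entity where leakage is a concern).

```python
def k_fold_indices(n_examples, k=5):
    """Yield (train_idx, val_idx) pairs; each index is used for validation exactly once."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), "train /", len(val_idx), "val")
```

The reported score is then the average of the k validation scores, which is more stable than a single small validation split.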
What is data leakage?

Chapter 2: Debugging Models

Training loss won't go down. Before you change anything — before you swap architectures, add layers, or rewrite your data loader — answer three diagnostic questions: Can the model memorize a single batch? Does it overfit a tiny subset of the data? Does adding capacity help?

These three questions open the five-step debugging triage summarized at the end of this chapter, and they'll save you more hours than any other skill in ML. Each one isolates a different class of problem, from outright bugs to capacity issues to data problems.

Step 1: Overfit One Batch

Take 10 examples. Train on them for 1,000 iterations. Can the model drive the loss to zero (or near zero) on just these 10 examples? If yes, the model has enough capacity and your training loop works. If no, you have a bug — something is fundamentally broken in your code.

Common bugs this catches: wrong loss function (using MSE for classification), target labels in the wrong format (0/1 vs one-hot), data and labels misaligned (shuffled independently), learning rate exactly zero (optimizer not stepping), frozen layers (gradients disabled by mistake). None of these are model problems. They're implementation problems, and they look exactly like "the model can't learn."

Step 1: overfit a single batch. If you can't drive loss to zero on 10 examples, you have a bug. Not an underfitting problem. Not a capacity problem. A bug. Find it before doing anything else.
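To make the single-batch test concrete, here is a small sketch that uses a two-feature logistic regression in place of a real network. The 10 data points, the learning rate of 0.5, and the function names are all invented for illustration. The second call reproduces one of the bugs listed above, a learning rate of exactly zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b, batch):
    """Mean binary cross-entropy, written in a numerically stable form."""
    total = 0.0
    for (x1, x2), y in batch:
        z = w[0] * x1 + w[1] * x2 + b
        signed = z if y == 1 else -z      # large positive value = confident and correct
        total += math.log1p(math.exp(-signed))
    return total / len(batch)

def overfit_one_batch(batch, lr=0.5, iters=1000):
    """Run plain gradient descent on one fixed batch and return the final loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(batch)
    for _ in range(iters):
        gw, gb = [0.0, 0.0], 0.0
        for (x1, x2), y in batch:
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - y
            gw[0] += err * x1
            gw[1] += err * x2
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return mean_loss(w, b, batch)

# Ten hypothetical, linearly separable examples.
positives = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (1.5, 0.5)]
batch = [(x, 1) for x in positives] + [((-x1, -x2), 0) for x1, x2 in positives]

healthy_loss = overfit_one_batch(batch)          # should end up close to zero
broken_loss = overfit_one_batch(batch, lr=0.0)   # stays at ln(2): the optimizer never steps
print(f"healthy: {healthy_loss:.4f}  broken: {broken_loss:.4f}")
```

A loss that sits at the value of a uniform guess, here ln(2) for two classes, is a strong hint that no learning is happening at all.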

Step 2: Overfit a Small Subset

Passed step 1? Now try 100-500 examples. Can the model memorize them within a few hundred epochs? If yes, the architecture works. If no, the model might not have enough capacity for this task — try adding layers or width.

Step 3: Scale Up and Check the Gap

Now train on the full dataset. Plot both training loss and validation loss. Three outcomes are possible:

Both losses are high — the model can't fit even the training data. This is underfitting. You need more capacity (bigger model), a better architecture, or more/better features.

Training loss is low, validation loss is high — the model memorizes training data but doesn't generalize. This is overfitting. You need regularization, more training data, or less model capacity.

Both losses are low and close together — good fit. Ship it (after proper evaluation).
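The three outcomes above can be written as a small decision rule. The thresholds `high` and `gap` below are hypothetical placeholders: what counts as a "high" loss or a large gap depends on the task and the loss function, so treat this as a sketch of the logic rather than a calibrated detector.

```python
def diagnose_fit(train_loss, val_loss, high=1.0, gap=0.3):
    """Classify a run from its final training and validation losses."""
    if train_loss >= high:
        return "underfitting"      # cannot fit even the training data
    if val_loss - train_loss >= gap:
        return "overfitting"       # fits training data, fails to generalize
    return "good fit"

print(diagnose_fit(2.1, 2.3))    # both losses high
print(diagnose_fit(0.05, 1.4))   # train low, val high
print(diagnose_fit(0.20, 0.25))  # both low and close together
```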

The Loss Curve Viewer

The widget below lets you explore these three regimes. Drag the complexity slider to change the model's capacity. Watch how the gap between training loss and validation loss changes. Low complexity: both losses are high (underfitting). Medium complexity: both are low (good). High complexity: training drops to zero while validation rises (overfitting).

Underfitting vs Overfitting Viewer

Drag the complexity slider. Watch training loss (orange) and validation loss (teal) respond.

Model Complexity 50
Good fit — both losses are low and close together.

The Five-Step Triage

Step 1: Overfit 1 batch. 10 examples, 1000 iters. Loss → 0? If no: you have a bug.
Step 2: Overfit small subset. 100–500 examples. Memorizes? If no: model too small.
Step 3: Full data, plot curves. Train + val loss. Both high = underfit. Train low, val high = overfit.
Step 4: Check gradients. Gradient norms near zero = vanishing. Gradient norms exploding = NaNs incoming.
Step 5: Check data. Visualize inputs and labels. Are they correctly paired? Are labels noisy?

Follow this order. Step 1 catches bugs. Step 2 checks capacity. Step 3 diagnoses the fit. Step 4 checks the optimization. Step 5 catches data problems. Most issues are resolved by step 2.

Training loss is very low, but validation loss is high and rising. Is this underfitting or overfitting?

Chapter 3: Hyperparameter Tuning

You have 10 hyperparameters, each with 3 plausible values. Testing every combination means $3^{10} = 59{,}049$ experiments. At 30 minutes per run, that's 3.4 years of GPU time. You need a smarter approach.

The two main strategies are grid search and random search. Grid search evaluates a regular lattice of points: learning rate in {0.001, 0.01, 0.1} crossed with weight decay in {0, 0.001, 0.01} gives 9 experiments. Random search samples each hyperparameter independently from a range: learning rate from log-uniform(0.0001, 0.1), weight decay from, for example, log-uniform(0.00001, 0.01). A log-uniform range needs a strictly positive lower bound, so zero weight decay has to be tried as a separate setting.

At first glance, they seem equivalent. Both test N configurations. But in 2012, Bergstra and Bengio showed, with both experiments and analysis, that random search is usually more efficient. The key insight is that hyperparameters have different importances. If learning rate matters a lot and weight decay barely matters, grid search wastes most of its budget exploring weight decay values that make almost no difference. Random search, by contrast, tests a different learning rate for every single trial.

Why Random Beats Grid

Consider a concrete example. You have 2 hyperparameters and budget for 9 trials. Grid search places them in a 3×3 grid. If only the x-axis parameter matters (say, learning rate), grid search tests only 3 unique values of it — the 3 columns of the grid. The 3 rows within each column test different values of the unimportant parameter, giving you no new information.

Random search with 9 trials tests 9 different values of the important parameter. That's 3× more exploration along the axis that actually matters, with the same compute budget.
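The counting argument is easy to check directly. The sketch below builds the 3×3 grid from the earlier example and draws 9 random trials, then counts distinct learning rates in each. The lower bound of 1e-5 for weight decay and the seed are assumptions chosen for the illustration.

```python
import itertools
import random

# Grid search: 3 learning rates crossed with 3 weight decay values.
lr_grid = [0.001, 0.01, 0.1]
wd_grid = [0.0, 0.001, 0.01]
grid_trials = list(itertools.product(lr_grid, wd_grid))

# Random search: each trial draws both values independently on a log scale.
rng = random.Random(0)
random_trials = [
    (10 ** rng.uniform(-4, -1),   # learning rate in [1e-4, 1e-1]
     10 ** rng.uniform(-5, -2))   # weight decay in [1e-5, 1e-2] (assumed range)
    for _ in range(9)
]

unique_lr_grid = len({lr for lr, _ in grid_trials})
unique_lr_random = len({lr for lr, _ in random_trials})
print("trials:", len(grid_trials), "grid unique LRs:", unique_lr_grid,
      "random unique LRs:", unique_lr_random)
```

Both strategies spend 9 runs, but only the random one gets 9 distinct looks at the parameter that matters.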

Grid Search vs Random Search

A heatmap shows the true loss landscape (dark = low loss). Toggle between grid and random to see where each strategy places its trials. Adjust the number of trials with the slider.

Trials 9
Grid mode — 9 trials. Unique LR values tested: 3.

The counter at the bottom tells the story. With 9 grid trials, you test 3 unique LR values. With 9 random trials, you test 9. With 25 trials, grid gives you 5 unique LR values while random gives you 25. The advantage grows with the budget.

Random search works because most hyperparameters have different importance. Grid search wastes budget on unimportant dimensions. Random search explores every dimension independently, so it tests more unique values of the important parameters with the same budget.

Learning Rate Is King

If you can tune only one hyperparameter, tune the learning rate. It controls the size of each gradient step. Too high: the loss oscillates or diverges. Too low: the loss decreases agonizingly slowly and gets stuck in local minima. A factor of 10 in either direction can mean the difference between convergence and failure.

Practical advice: start with a log-uniform sweep from $10^{-5}$ to $10^{-1}$. Use 20-30 random trials. Find the rough range where training succeeds, then do a finer sweep within that range. For the Adam optimizer, the sweet spot is usually between $10^{-4}$ and $10^{-3}$. For SGD with momentum, it's typically between $10^{-2}$ and $10^{-1}$.
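A coarse-then-fine sweep like the one just described might look as follows. `sample_log_uniform` is a hypothetical helper, and the "promising" range used for the second sweep is simply assumed here. In a real run it would come from inspecting the results of the first sweep.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    if low <= 0:
        raise ValueError("log-uniform sampling needs a strictly positive lower bound")
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)

# Coarse sweep: 25 trials across four orders of magnitude.
coarse = sorted(sample_log_uniform(1e-5, 1e-1, rng) for _ in range(25))

# Fine sweep: suppose the coarse runs trained well only between 1e-4 and 1e-3.
fine = sorted(sample_log_uniform(1e-4, 1e-3, rng) for _ in range(10))

print("coarse:", [f"{lr:.1e}" for lr in coarse[:5]], "...")
print("fine:  ", [f"{lr:.1e}" for lr in fine[:5]], "...")
```

Sampling on a log scale gives each decade the same share of trials. A plain uniform draw over the same interval would put about 90% of trials between 0.01 and 0.1.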

| Hyperparameter | Typical Range | Importance | Search Type |
|---|---|---|---|
| Learning rate | $10^{-5}$ – $10^{-1}$ | Critical | Log-uniform |
| Batch size | 16 – 512 | High | Powers of 2 |
| Weight decay | 0 – 0.1 | Medium | Log-uniform |
| Dropout rate | 0 – 0.5 | Medium | Uniform |
| Hidden size | 64 – 2048 | Medium | Powers of 2 |
| Number of layers | 1 – 12 | Low-Medium | Integer |
| Warmup steps | 100 – 10K | Low | Linear |
Why does random search beat grid search with the same budget?

Chapter 4: Regularization Toolkit

Your model memorizes the training set perfectly but fails on new data. The training loss is near zero, the validation loss is climbing, and every epoch makes things worse. You need to constrain the model — force it to learn general patterns rather than specific examples. This is regularization.

There are four main tools in the regularization toolkit, each attacking overfitting from a different angle. Think of them as different ways to tell the model: "don't be so sure of yourself."

1. Dropout

Dropout randomly sets a fraction of neuron activations to zero during training. Each forward pass uses a different random subset of the network. This prevents co-adaptation — the tendency of neurons to become overly specialized and depend on specific other neurons. With dropout, each neuron must learn to be useful on its own, because any of its partners might be absent.

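At test time, no neurons are dropped. In the original formulation, all activations are instead scaled by (1 − p), the keep probability, so that their expected magnitude matches what the next layer saw during training. Most modern frameworks use the equivalent "inverted" form: surviving activations are scaled by 1/(1 − p) during training, and the test-time pass is left unchanged. Typical dropout rates: 0.1–0.3 for Transformers, 0.5 for fully connected layers in CNNs.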
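The scaling rule is easier to see in code. The sketch below shows two common conventions side by side: the classic form, which scales by (1 − p) at test time, and the "inverted" form, which scales survivors by 1/(1 − p) during training and leaves test time untouched. Function names are illustrative, and activations are plain Python lists rather than tensors.

```python
import random

def dropout_classic(activations, p, training, rng):
    """Classic dropout: drop with probability p in training, scale by (1 - p) at test time."""
    if training:
        return [0.0 if rng.random() < p else a for a in activations]
    return [a * (1.0 - p) for a in activations]

def dropout_inverted(activations, p, training, rng):
    """Inverted dropout: scale survivors by 1 / (1 - p) in training, identity at test time."""
    if training:
        return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]
    return list(activations)

rng = random.Random(0)
acts = [2.0, 4.0]

test_classic = dropout_classic(acts, 0.5, training=False, rng=rng)
test_inverted = dropout_inverted(acts, 0.5, training=False, rng=rng)

# Averaged over many training passes, inverted dropout preserves the activation's mean.
draws = [dropout_inverted([1.0], 0.5, training=True, rng=rng)[0] for _ in range(20000)]
train_mean = sum(draws) / len(draws)
print(test_classic, test_inverted, round(train_mean, 2))
```

Either convention keeps the expected input to the next layer consistent between training and test. They differ only in where the scaling is applied.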

2. Weight Decay (L2 Regularization)

Weight decay adds a penalty proportional to the squared magnitude of the weights to the loss function. This pushes weights toward zero, preventing any single weight from becoming too large. The effect is smooth and continuous — every weight feels a gentle pull toward zero at every step.

$$L_{\text{total}} = L_{\text{data}} + \lambda \sum_i w_i^2$$

The hyperparameter λ controls the strength. Too small: no effect. Too large: the model is so constrained it underfits. Typical values: 0.0001 to 0.01.
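The "gentle pull toward zero" follows directly from the penalty's gradient, which is 2λw for each weight. Here is a minimal sketch with plain SGD. The function names and the numbers are chosen only for illustration.

```python
def l2_penalty(weights, lam):
    """The regularization term: lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def sgd_step_with_weight_decay(weights, data_grads, lr, lam):
    """One SGD step on L_data + lambda * sum(w^2); the penalty contributes 2 * lambda * w."""
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, data_grads)]

weights = [1.0, -2.0]
penalty = l2_penalty(weights, lam=0.01)

# With zero data gradient, the step is pure shrinkage by a factor (1 - 2 * lr * lambda).
shrunk = sgd_step_with_weight_decay(weights, [0.0, 0.0], lr=0.1, lam=0.01)
print(penalty, shrunk)
```

Every weight is multiplied by the same factor slightly below one, which is why the effect is smooth and applies at every step.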

3. Early Stopping

Early stopping monitors validation loss during training and stops when it starts to increase. The model's parameters are reverted to the checkpoint with the best validation loss. This is the simplest regularizer — it costs nothing extra and requires no changes to the model or loss function.

In practice, you set a patience parameter: how many epochs of no improvement to tolerate before stopping. Patience of 5–10 is typical. Too little patience: you stop too early during a temporary plateau. Too much patience: you waste compute on a model that's clearly overfitting.
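The patience logic fits in a small helper class. This is a sketch under simple assumptions: the class name `EarlyStopping` and the list of validation losses are made up, and a real loop would also save a checkpoint whenever a new best loss is recorded.

```python
class EarlyStopping:
    """Track the best validation loss and signal a stop after `patience` bad epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch      # this is where a checkpoint would be saved
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation losses: improvement until epoch 2, then a slow rise.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at, "and restoring epoch", stopper.best_epoch)
```

Note that epoch 4 (0.71) improves on epoch 3 but not on the best value, so it still counts against the patience budget.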

4. Data Augmentation

Data augmentation creates new training examples by applying transformations to existing ones. For images: random flips, rotations, crops, color jitter. For text: synonym replacement, back-translation, random deletion. The model sees more variation, which forces it to learn invariant features rather than memorizing specific examples.

Data augmentation is the only regularizer that actually adds information — the others just constrain the model. This is why it's often the single most effective tool against overfitting.
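Two of the transformations named above are simple enough to sketch without any library. The function names are illustrative, images are represented as nested lists of pixel values, and text is a list of tokens. Real pipelines typically rely on a dedicated augmentation library instead.

```python
import random

def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

image = [[1, 2, 3],
         [4, 5, 6]]
flipped = horizontal_flip(image)

rng = random.Random(0)
sentence = "the scan shows a small lesion in the left lung".split()
augmented = random_deletion(sentence, p=0.2, rng=rng)
print(flipped)
print(" ".join(augmented))
```

One caution applies to both: a transformation only helps if it preserves the label. Flipping a digit image or deleting the word "not" can silently create mislabeled examples.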

See Them in Action

The widget below shows a small neural network. Toggle each regularization technique to see its effect: dropout grays out neurons, weight decay shrinks connection weights (arrow thickness), early stopping places a marker on the loss curve, and data augmentation adds new examples. The inset loss curve shows the overfitting gap shrinking with each technique.

Regularization Toolkit

Toggle each technique to see its effect on the network and the overfitting gap.

All regularizers off. Overfitting gap: large.
Dropout is noisy but powerful. Weight decay is smooth and always-on. Early stopping is free. Data augmentation adds real information. In practice, use all of them together — they're complementary, not competing.
How does dropout prevent overfitting?

Chapter 5: Training Diagnostics

A training dashboard is your cockpit. The instruments tell you everything — if you know how to read them. A single loss curve is like flying with only an altimeter. You need the full panel: loss curves, gradient norms, learning rate schedule, and parameter distributions.

Each instrument reveals a different class of problem. Loss curves show fit quality. Gradient norms show optimization health. The LR schedule shows where you are in the training recipe. Parameter histograms show whether weights are alive, dead, or exploding.

The Six Pathologies

Most training runs match one of six patterns: one healthy baseline and five pathologies. Each has a characteristic signature across multiple instruments. Learning to read these signatures is like learning to read an ECG: once you know the patterns, diagnosis is fast.

| Pathology | Loss Signature | Gradient Signature | LR Clue |
|---|---|---|---|
| Healthy | Train & val both decrease, converge | Stable, moderate norm | Normal schedule |
| LR too high | Loss oscillates wildly or diverges | Spiky, high variance | |
| LR too low | Loss barely decreases, plateaus early | Very small, stable | |
| Overfitting | Train drops, val rises after initial drop | Moderate, stable | Normal |
| Exploding gradients | Loss spikes to NaN | Norm shoots to infinity | |
| Vanishing gradients | Loss flat, barely moves | Norm near zero | |

The Dashboard

The simulation below is a full training dashboard with four panels. Select a pathology from the dropdown and click "Train" to see its characteristic signatures across all four instruments. Then use the sliders to try to fix it — adjust the learning rate and batch size and re-run. Can you turn a sick training run into a healthy one?

Training Dashboard (Showcase)

Select a pathology, click Train, and read the instruments. Adjust LR and batch size to diagnose and fix.

Learning Rate 1e-3
Batch Size 32
Select a pathology and click Train.

Here's what to look for in each panel:

Loss curves (top-left): Smooth, converging curves are healthy. Oscillation means the LR is too high. Flat lines mean LR is too low or gradients are vanishing. A widening gap between train and val means overfitting.

Gradient norm (top-right): Should be stable and moderate (roughly 0.1–10). Spikes to $10^3$ or more mean exploding gradients; add gradient clipping. Near-zero means vanishing gradients; check your initialization and consider residual connections.
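Gradient clipping, the fix suggested for exploding gradients, rescales the whole gradient vector when its norm exceeds a threshold. The sketch below works on a flat list of gradient values. The function names and the threshold of 1.0 are assumptions for the example, and deep learning frameworks ship their own versions of this utility.

```python
import math

def global_norm(grads):
    """L2 norm of all gradient values taken together."""
    return math.sqrt(sum(g * g for g in grads))

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their global norm is at most max_norm; also return the original norm."""
    norm = global_norm(grads)
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(norm_before, clipped, untouched)
```

Logging the norm before clipping is the useful part for diagnostics: it is the value plotted in the gradient norm panel, and clipping only hides the spike from the optimizer, not from you.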

LR schedule (bottom-left): Shows the current learning rate. A warmup ramp followed by decay is standard. If you're debugging, check that the schedule matches your intention — a bug here (e.g., LR stuck at zero) can look like a vanishing gradient problem.
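As one possible instance of "warmup ramp followed by decay", the sketch below uses linear warmup and cosine decay. The cosine shape, the function name `lr_at_step`, and the step counts are assumptions; the text does not say which decay the dashboard uses. The final check is the kind of sanity test that catches an LR stuck at zero.

```python
import math

def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

peak, warmup, total = 1e-3, 100, 1000
schedule = [lr_at_step(s, peak, warmup, total) for s in range(total + 1)]

print(f"step 0: {schedule[0]:.2e}")
print(f"step {warmup - 1}: {schedule[warmup - 1]:.2e}")
print(f"step {total}: {schedule[total]:.2e}")

# Sanity check: the learning rate must be positive for every step before the last.
assert all(lr > 0 for lr in schedule[:total])
```

Printing or plotting the schedule before launching a long run costs seconds and rules out a whole class of silent failures.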

Parameter histogram (bottom-right): Weight values should form a bell curve centered near zero. If the distribution collapses to a spike at zero, weights are dead. If it spreads to extreme values, weights are exploding.

One instrument is not enough. A flat loss curve could be vanishing gradients OR a too-low learning rate OR a data bug. Only by cross-referencing loss, gradient norms, LR schedule, and parameter distributions can you make a confident diagnosis.

Chapter 6: Evaluation & Error Analysis

95% accuracy sounds great — until you realize it's a medical test and the 5% it misses are all the cancer patients. Accuracy counts the fraction of correct predictions, but it tells you nothing about which predictions are wrong. When classes are imbalanced — 98% healthy, 2% sick — a model that always predicts "healthy" gets 98% accuracy and saves exactly zero lives.

This is why experienced ML engineers never report accuracy alone. They use a confusion matrix to see where the model succeeds and fails, and they compute per-class metrics that can't be gamed by ignoring rare classes.

The Confusion Matrix

A confusion matrix is an N×N grid where rows are true labels and columns are predicted labels. The diagonal shows correct predictions. Off-diagonal cells show mistakes — and specifically, which classes get confused with which others.

For binary classification (positive/negative), the four cells have special names:

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN), a "miss" |
| Actually Negative | False Positive (FP), a "false alarm" | True Negative (TN) |

Precision, Recall, and F1

Precision answers: "Of the examples I predicted positive, how many actually are?" It's TP / (TP + FP). High precision means few false alarms.

Recall answers: "Of the examples that actually are positive, how many did I catch?" It's TP / (TP + FN). High recall means few misses.

There's a fundamental tension between precision and recall. You can get 100% recall by predicting everything as positive — but your precision drops to the base rate. You can get 100% precision by only predicting positive when you're absolutely certain — but you'll miss most cases.

F1 score is the harmonic mean of precision and recall. It's high only when both are high:

F1 = 2 · (Precision · Recall) / (Precision + Recall)
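The three formulas above translate directly into code. The counts below are hypothetical: 1,000 patients of whom 20 are sick, matching the 98% / 2% imbalance used earlier. Returning 0.0 when a denominator is zero is a convention chosen for this sketch, since precision is strictly undefined when nothing is predicted positive.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Model A always predicts "healthy": it finds no sick patient and raises no alarm.
tp, fp, fn, tn = 0, 0, 20, 980
accuracy_a = (tp + tn) / (tp + fp + fn + tn)
metrics_a = precision_recall_f1(tp, fp, fn)

# Model B catches 15 of the 20 sick patients at the cost of 30 false alarms.
tp, fp, fn, tn = 15, 30, 5, 950
accuracy_b = (tp + tn) / (tp + fp + fn + tn)
metrics_b = precision_recall_f1(tp, fp, fn)

print("A:", accuracy_a, metrics_a)
print("B:", accuracy_b, metrics_b)
```

Model B has lower accuracy than Model A (96.5% versus 98%) yet is clearly the more useful model, which is exactly what recall and F1 reveal and accuracy hides.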

Interactive Confusion Matrix

The widget below shows a 4-class confusion matrix. Use the sliders to adjust each class's performance. Watch how precision, recall, and F1 change per class. Toggle between "accuracy view" (a single number) and "per-class view" to see how accuracy can hide catastrophic failure on rare classes.

Confusion Matrix Explorer

Adjust per-class accuracy. Toggle views to see how 95% accuracy can hide 0% recall on class D.

Class A accuracy 95%
Class B accuracy 90%
Class C accuracy 85%
Class D accuracy 10%
Overall accuracy: 93.2% — but Class D recall is only 10%.

Try setting Class D accuracy to 0% while keeping others at 95%. The overall accuracy barely drops — because Class D is rare (only 2% of examples). But the model completely fails on the class that might matter most. This is the fundamental problem with accuracy as a metric.
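The arithmetic behind "barely drops" is a frequency-weighted average of per-class accuracy. In the sketch below, only Class D's 2% share comes from the text. The shares for classes A, B, and C are invented, and they do not affect the result here because those three classes share the same accuracy.

```python
def overall_accuracy(class_freq, class_acc):
    """Overall accuracy is the frequency-weighted mean of per-class accuracy."""
    return sum(class_freq[c] * class_acc[c] for c in class_freq)

# Hypothetical class shares; only D's 2% is taken from the lesson.
freq = {"A": 0.50, "B": 0.30, "C": 0.18, "D": 0.02}

all_good = overall_accuracy(freq, {"A": 0.95, "B": 0.95, "C": 0.95, "D": 0.95})
d_broken = overall_accuracy(freq, {"A": 0.95, "B": 0.95, "C": 0.95, "D": 0.00})

print(f"all classes at 95%: {all_good:.3f}")
print(f"class D at 0%:      {d_broken:.3f}")
```

Total failure on Class D costs under two points of overall accuracy, a difference small enough to be mistaken for run-to-run noise.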

Accuracy is the most dangerous metric in ML. It hides class imbalance and rewards trivial predictions. Always look at the confusion matrix. Always compute per-class precision and recall. If one class has low recall and that class matters, your model is broken — no matter what the accuracy says.
A medical model predicts "healthy" for every patient and achieves 98% accuracy. What's wrong?

Chapter 7: Connections

Every architecture lesson assumes you can train it. This lesson bridges theory and practice — the skills here apply whether you're training a logistic regression, an LSTM, or a billion-parameter Transformer. The architectures change; the debugging methodology doesn't.

How This Fits Together

L03: Backpropagation. How gradients flow → why they vanish/explode → Ch 5 gradient diagnostics
L05: Transformers. The architecture → but training it well requires all of Ch 2–4
L06: This Lesson. Data, debugging, tuning, regularization, evaluation → the practitioner's toolkit
L07: Pretraining. Scaling these ideas to massive models → new challenges at scale

Pathology Quick Reference

| Symptom | Diagnosis | First Thing to Try |
|---|---|---|
| Can't overfit 1 batch | Bug in code | Check loss function, data alignment, frozen layers |
| Both losses high | Underfitting | Increase model size, check features |
| Train low, val high | Overfitting | Add dropout, weight decay, more data |
| Loss oscillates wildly | LR too high | Reduce learning rate by 10× |
| Loss barely moves | LR too low / vanishing grads | Increase LR, check gradient norms |
| Loss spikes to NaN | Exploding gradients | Add gradient clipping, reduce LR |
| Val loss rises after epoch 5 | Overfitting onset | Early stopping with patience 5 |
| 95% accuracy, rare class missed | Class imbalance | Check per-class recall, reweight loss |

Where to Go Next

L03: Backpropagation — Understand the gradient flow that underpins debugging (Ch 5 diagnostics assume this).

L05: Transformers — The architecture you'll most often apply these methods to.

L07: Pretraining — How these principles scale to models trained on terabytes of data.

"The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'" — Isaac Asimov. Debugging is noticing when something is funny, and having the methodology to figure out why.
Loss is decreasing smoothly but validation loss starts rising after epoch 5. What do you try first?