You have a loss function. You have weights. Now how do you actually find the best weights — and prevent them from memorizing your training data?
You've trained a classifier. On your training set, it nails 99% of the images. You proudly test it on new images — and it drops to 65%. What happened?
Your model didn't learn the underlying pattern. It memorized the training data — every quirk, every noise artifact, every coincidence. This is called overfitting: the model fits the training data too well, and that tight fit doesn't generalize to data it hasn't seen.
Imagine fitting a curve to a handful of data points. A straight line (simple model) might miss some points, but it captures the general trend. A high-degree polynomial (complex model) passes through every point perfectly — but oscillates wildly between them, giving terrible predictions for new inputs.
Training is a tug-of-war between two goals: (1) fit the training data well, and (2) don't fit it so well that you're fitting noise instead of signal. Every technique in this lecture — regularization, careful optimization, learning rate schedules — is about managing this tension.
This tension has a formal name: the bias-variance tradeoff.
Bias is systematic error — the model is too simple to capture the true relationship. A linear classifier trying to learn a circular decision boundary has high bias. It underfits.
Variance is sensitivity to the specific training data. A degree-100 polynomial changes dramatically if you add or remove one data point. It has high variance. It overfits.
A model generalizes when it performs well on data it has never seen during training. The gap between training performance and test performance is the generalization gap. Our goal is to minimize this gap while keeping training performance high.
| Regime | Training Loss | Test Loss | Problem |
|---|---|---|---|
| Underfitting | High | High | Model too simple (high bias) |
| Good fit | Low | Low | None — the sweet spot |
| Overfitting | Very low | High | Model memorizes (high variance) |
You train a linear classifier on CIFAR-10: training accuracy 40%, test accuracy 38%. That's underfitting — both are bad. You switch to a deep neural network: training accuracy 99%, test accuracy 72%. The 27-point gap screams overfitting. You need regularization.
A linear classifier on 32×32×3 images has 3,072 input dimensions, so with 10 classes it has only 30,720 weights. A neural network with millions of parameters has far more capacity to memorize. Without explicit constraints, deep networks can overfit most datasets given enough training time.
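Regularization is the antidote to overfitting. The idea is simple: add a penalty to the loss function that punishes complex models. The total loss becomes $L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(x_i, y_i, W) + \lambda R(W)$, where the first term is the data loss averaged over the N training examples and $R(W)$ is the regularization penalty.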
The hyperparameter λ (lambda) controls the tradeoff. Too small: no effect, we still overfit. Too large: we over-penalize and underfit. Just right: the model learns the true pattern and ignores noise.
The most common form. L2 penalizes the sum of squared weights: $R(W) = \sum_{k}\sum_{l} W_{k,l}^2$.
What does this do? It pushes all weights toward zero — but not to zero. More precisely, it prefers weight vectors where the values are spread out rather than concentrated in a few large entries.
Consider input x = [1, 1, 1, 1]. Two weight vectors produce the same score:
w1 = [1, 0, 0, 0] → dot product = 1
w2 = [0.25, 0.25, 0.25, 0.25] → dot product = 1
L2 penalty: $\|w_1\|^2 = 1.0$, $\|w_2\|^2 = 0.25$. L2 prefers w2 by a factor of 4. Why? Because w2 uses all features equally rather than relying on a single feature. This makes the model more robust: if feature 1 is noisy, w1 breaks, but w2 barely notices.
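The comparison above can be checked numerically. This is a minimal sketch in plain Python; the helper names `dot` and `l2_penalty` and the perturbed input `x_noisy` are illustrative assumptions, not part of the lecture.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def l2_penalty(w):
    """Sum of squared weights (the L2 penalty without the lambda factor)."""
    return sum(wi * wi for wi in w)


x = [1.0, 1.0, 1.0, 1.0]
w1 = [1.0, 0.0, 0.0, 0.0]
w2 = [0.25, 0.25, 0.25, 0.25]

score1, score2 = dot(w1, x), dot(w2, x)        # both scores are 1.0
pen1, pen2 = l2_penalty(w1), l2_penalty(w2)    # 1.0 versus 0.25

# Hypothetical noise: only feature 1 is perturbed, from 1.0 to 1.5.
x_noisy = [1.5, 1.0, 1.0, 1.0]
shift1 = dot(w1, x_noisy) - score1   # w1 passes the full perturbation through
shift2 = dot(w2, x_noisy) - score2   # w2 passes only a quarter of it through

print(score1, score2, pen1, pen2, shift1, shift2)
```

Both vectors give the same score on the clean input, but the spread-out vector has a quarter of the penalty and moves a quarter as much when one feature is corrupted.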
The principle attributed to William of Ockham (c. 1287–1347) is often paraphrased as "Among competing hypotheses, the simplest is best." L2 regularization is Occam's Razor in equation form. By penalizing large weights, you're saying: "I'd rather have a simple model that's slightly wrong on the training data than a complex model that's perfect on training but fails on test."
L1 penalizes the sum of absolute values: $R(W) = \sum_{k}\sum_{l} |W_{k,l}|$.
L1 has a dramatically different effect: it drives many weights exactly to zero. The resulting weight matrix is sparse — most entries are zero, and only a few features are used.
A weight vector is sparse when most of its entries are exactly zero. L1 regularization encourages sparsity because its gradient has constant magnitude regardless of the weight's size — it pushes small weights all the way to zero rather than just shrinking them.
| Property | L2 (Weight Decay) | L1 (Sparsity) |
|---|---|---|
| Penalty | $\sum W^2$ | $\sum \lvert W \rvert$ |
| Gradient contribution | 2W (proportional to weight) | sign(W) (constant magnitude) |
| Effect on weights | Shrinks toward zero, never reaches it | Drives many to exactly zero |
| Preference | Spread-out, small weights | Sparse, few nonzero weights |
| Use case | Default for neural networks | Feature selection |
Why choose? Elastic net combines both: $R(W) = \beta_1 \sum \lvert W \rvert + \beta_2 \sum W^2$. You get sparsity from L1 plus the smoothness of L2.
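To make the "proportional versus constant" gradient distinction concrete, here is a small sketch that applies only the penalty to a single weight. It is not the lecture's code: the L1 step uses soft-thresholding (a proximal step that clips at zero), which is one common way to implement L1 shrinkage, and the values of `lr` and `lam` are arbitrary assumptions.

```python
def l2_step(w, lr, lam):
    """Gradient step on lam * w**2; the gradient 2 * lam * w shrinks with w."""
    return w - lr * 2.0 * lam * w


def l1_step(w, lr, lam):
    """Soft-thresholding step for lam * |w|: move toward zero by a fixed
    amount lr * lam and stop at zero instead of crossing it."""
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


w_l2 = w_l1 = 0.05
for _ in range(20):
    w_l2 = l2_step(w_l2, lr=0.1, lam=0.1)
    w_l1 = l1_step(w_l1, lr=0.1, lam=0.1)

print(w_l2, w_l1)
```

After 20 steps the L2 weight has only been scaled down multiplicatively, while the L1 weight has reached exactly zero and stays there.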
L1 and L2 are just the simplest forms of regularization. Modern deep learning uses many other tricks: dropout (randomly zero out neurons during training), batch normalization (normalize layer activations), data augmentation (artificially expand the training set), and early stopping (stop training before overfitting kicks in). All serve the same purpose: prevent memorization, encourage generalization.
Last lecture introduced the SVM loss (hinge loss). Now we'll meet the other major loss for classification: cross-entropy loss, built on the softmax function. It's the standard loss for modern neural networks.
Your linear classifier outputs a vector of raw scores — one per class. For a 3-class problem, you might get scores [3.2, 1.3, 2.2]. But these are just numbers. How confident is the model? Is 3.2 way better than 2.2, or barely?
The softmax function converts raw scores into a probability distribution: $P(y = k \mid x) = \dfrac{e^{s_k}}{\sum_j e^{s_j}}$, where $s_k$ is the raw score for class k.
Scores: [3.2, 1.3, 2.2]
Exponentials: [$e^{3.2}$, $e^{1.3}$, $e^{2.2}$] ≈ [24.53, 3.67, 9.03]
Sum: 24.53 + 3.67 + 9.03 = 37.23
Probabilities: [24.53/37.23, 3.67/37.23, 9.03/37.23] ≈ [0.659, 0.099, 0.242]
The model is 65.9% confident in class 0. These probabilities are meaningful — you can compare them, threshold them, and use them downstream.
Now that we have probabilities, how do we penalize wrong predictions? Take the negative log of the probability assigned to the correct class: $L_i = -\log P(y = y_i \mid x_i)$, where $y_i$ is the correct label for example i.
Why negative log? When P(correct) = 1.0, the loss is −log(1) = 0 (perfect). When P(correct) = 0.01, the loss is −log(0.01) = 4.6 (terrible). When P(correct) → 0, the loss → ∞. The negative log penalizes confidently wrong predictions very harshly.
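The worked example can be reproduced in a few lines. This sketch uses the naive softmax (no overflow guard) and the natural log; the function names and the choice of which class counts as "correct" are assumptions for illustration.

```python
import math


def softmax(scores):
    """Naive softmax: exponentiate each score and normalize."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, correct_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(softmax(scores)[correct_class])


scores = [3.2, 1.3, 2.2]
probs = softmax(scores)

# The loss is small when the correct class already has high probability
# and large when the model puts little probability on it.
loss_if_class0 = cross_entropy(scores, 0)
loss_if_class1 = cross_entropy(scores, 1)

print([round(p, 3) for p in probs], loss_if_class0, loss_if_class1)
```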
SVM (hinge) loss says: "I only care that the correct class score beats the others by a margin of 1. Beyond that, I'm happy." Cross-entropy says: "I always want more probability on the correct class. Even if you're at 99%, I'll keep pushing toward 100%." Cross-entropy never stops caring — it always produces gradients, which makes it better for training deep networks.
There's a practical trap. If scores are large (say s = [1000, 999, 998]), then $e^{1000}$ overflows to infinity in floating point. The fix is simple: subtract a constant C from all scores before exponentiating, which leaves the result unchanged.
$\dfrac{e^{s_k - C}}{\sum_j e^{s_j - C}} = \dfrac{e^{s_k} e^{-C}}{\sum_j e^{s_j} e^{-C}} = \dfrac{e^{s_k}}{\sum_j e^{s_j}}$. The $e^{-C}$ factors cancel. Setting C = max(s) ensures all exponents are ≤ 0, so every exponential lies in (0, 1] and the largest is exactly 1. Overflow is impossible, and the denominator is at least 1.
Never implement softmax without the max-subtraction trick. In deep networks, score magnitudes can be unpredictable. One NaN from an overflow will corrupt your entire training run. PyTorch's F.cross_entropy and TensorFlow's tf.nn.softmax_cross_entropy_with_logits handle this automatically — they take raw scores (logits) and apply the stable version internally.
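A minimal sketch of the max-subtraction trick in plain Python follows. The function names are assumptions, and the log-sum-exp form of the loss is a standard companion technique rather than something stated in the lecture. Note that Python's `math.exp` raises `OverflowError` on very large inputs instead of returning infinity.

```python
import math


def stable_softmax(scores):
    """Softmax with the maximum score subtracted before exponentiating."""
    c = max(scores)
    exps = [math.exp(s - c) for s in scores]   # every exponent is <= 0
    total = sum(exps)                          # at least 1.0
    return [e / total for e in exps]


def stable_cross_entropy(scores, correct_class):
    """-log p_k computed as log(sum_j exp(s_j - c)) - (s_k - c)."""
    c = max(scores)
    log_sum = math.log(sum(math.exp(s - c) for s in scores))
    return log_sum - (scores[correct_class] - c)


big = [1000.0, 999.0, 998.0]
small = [2.0, 1.0, 0.0]       # same score differences as `big`

# The naive exponential fails on the large scores.
try:
    math.exp(big[0])
    naive_overflows = False
except OverflowError:
    naive_overflows = True

p_big = stable_softmax(big)
p_small = stable_softmax(small)
print(naive_overflows, p_big, p_small)
```

Because softmax depends only on score differences, the two inputs give the same probabilities.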
You have a loss function L(W). You want to find the weights W that minimize it. How do you navigate the loss landscape — a surface in million-dimensional space — toward the lowest point?
Imagine standing on a foggy hillside. You can't see the valley floor. But you can feel the slope beneath your feet. The steepest downhill direction is the negative gradient. Take a step in that direction. Repeat.
That's it. The entire optimization algorithm fits in one line: $W \leftarrow W - \alpha \nabla L(W)$, where α is the learning rate. Everything else in this lecture is about making this one step work better.
The learning rate controls how big each step is. Too large: you overshoot the minimum and the loss explodes. Too small: training takes forever and you get stuck in bad local optima. Choosing the right learning rate is one of the most important decisions in training neural networks.
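The effect of the step size is easy to see on a toy problem. This sketch runs gradient descent on the one-dimensional loss L(w) = w², chosen here purely for illustration; the three learning rates are arbitrary assumptions.

```python
def gradient_descent(grad_fn, w0, lr, steps):
    """Repeatedly step against the gradient: w <- w - lr * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w


def grad(w):
    """Gradient of the toy loss L(w) = w**2."""
    return 2.0 * w


# Each step multiplies w by (1 - 2 * lr), so the behavior depends on lr.
w_good = gradient_descent(grad, w0=5.0, lr=0.1, steps=50)    # factor 0.8: converges
w_slow = gradient_descent(grad, w0=5.0, lr=0.001, steps=50)  # factor 0.998: barely moves
w_big = gradient_descent(grad, w0=5.0, lr=1.1, steps=50)     # factor -1.2: diverges

print(w_good, w_slow, w_big)
```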
[Interactive figure: a 2D contour plot of a loss function. Gradient descent traces a path from the starting point toward the minimum, and the learning rate can be adjusted to see how it affects convergence.]
For a linear classifier with, say, 3072 input features and 10 classes, the weight matrix has 30,720 entries. The loss is a function of all 30,720 numbers — a surface in 30,720-dimensional space. We can't visualize that, but we can build intuition from 2D slices.
In 2D, the loss landscape looks like a topographic map. Contour lines connect points of equal loss. The gradient at any point is perpendicular to the contour line, pointing uphill. We walk in the opposite direction — downhill.
Local minima: A valley that isn't the deepest valley. Gradient descent can get trapped — the gradient is zero, so it stops. In practice, local minima are less of a problem than you'd think in high dimensions, because most "flat" spots are saddle points, not true minima.
Saddle points: Points where the surface curves up in some directions and down in others — like a mountain pass. The gradient is zero, but it's not a minimum. In high-dimensional spaces, saddle points vastly outnumber local minima (Dauphin et al., 2014). This is actually a bigger concern than local minima.
Ill-conditioning: When the loss changes quickly in one direction and slowly in another, gradient descent oscillates along the steep direction while barely making progress along the flat one. Think of a narrow, elongated valley: the path zigzags side to side instead of heading straight for the bottom.
You could compute gradients by bumping each weight by a tiny amount h and measuring how the loss changes: $\partial L / \partial W_i \approx (L(W + h e_i) - L(W)) / h$, where $e_i$ perturbs only the i-th weight. This numerical gradient is slow (one extra forward pass per weight) but easy to implement. The analytic gradient uses calculus (backpropagation) to compute all gradients in one backward pass. In practice: always use analytic gradients, but verify them against numerical gradients as a debugging check.
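A gradient check can be sketched as follows. The toy loss and its hand-derived gradient are hypothetical, and the code uses the centered difference (L(w + h) − L(w − h)) / 2h, a common and more accurate variant of the one-sided formula in the text.

```python
def loss(w):
    """Hypothetical toy loss: w0^2 + 3*w0*w1 + 2*w1^2."""
    return w[0] ** 2 + 3.0 * w[0] * w[1] + 2.0 * w[1] ** 2


def analytic_grad(w):
    """Gradient of the toy loss, derived by hand."""
    return [2.0 * w[0] + 3.0 * w[1], 3.0 * w[0] + 4.0 * w[1]]


def numerical_grad(f, w, h=1e-5):
    """Centered finite difference, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2.0 * h))
    return grad


def max_relative_error(a, b):
    """Largest relative difference between two gradient vectors."""
    return max(abs(x - y) / max(abs(x) + abs(y), 1e-12) for x, y in zip(a, b))


w = [1.5, -0.5]
err = max_relative_error(analytic_grad(w), numerical_grad(loss, w))
print(err)
```

A relative error far above roughly 1e-6 on a smooth loss usually points to a bug in the analytic gradient.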
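There's a problem with vanilla gradient descent. The loss is a sum over all N training examples: $L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(x_i, y_i, W) + \lambda R(W)$, so its gradient is a sum over all N examples too.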
With N = 1,000,000 images, computing the gradient requires a forward and backward pass through every single example before you can take one step. That's absurdly expensive.
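Stochastic Gradient Descent (SGD) approximates the full gradient using a small random subset, a mini-batch, of the training data: $\nabla_W L(W) \approx \frac{1}{B}\sum_{i \in \text{batch}} \nabla_W L_i(x_i, y_i, W)$, where B is the batch size.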
Instead of one expensive step using all 1,000,000 examples, you take many cheap steps using 64 examples each. Each individual step is noisy — the mini-batch gradient only approximates the true gradient — but the noise averages out over many steps, and you make progress much faster in wall-clock time.
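Here is a minimal mini-batch SGD loop on a toy problem. Everything about the setup is an assumption for illustration: synthetic one-dimensional data with true slope 3, a squared-error loss, and arbitrary values for the learning rate, batch size, and number of steps.

```python
import random

random.seed(0)

# Hypothetical dataset: y = 3x plus a little Gaussian noise.
data = []
for _ in range(1000):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + random.gauss(0.0, 0.1)))


def minibatch_grad(w, batch):
    """Gradient of the mean squared error (1/B) * sum((w*x - y)^2) w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


w, lr, batch_size = 0.0, 0.1, 64
for step in range(500):
    batch = random.sample(data, batch_size)   # draw a random mini-batch
    w -= lr * minibatch_grad(w, batch)        # noisy but cheap update

print(w)
```

Each update touches 64 of the 1,000 examples, yet the estimate settles close to the true slope.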
The stochasticity in SGD isn't just a compromise for speed — it actually helps. The random fluctuations can kick the optimizer out of sharp, narrow minima (which generalize poorly) and into broad, flat minima (which generalize well). Adding noise to optimization is one of the few cases where imprecision makes things better.
| Batch Size | Gradient Quality | Speed | Generalization |
|---|---|---|---|
| B = 1 (pure SGD) | Very noisy | Slow (poor GPU utilization) | Good (lots of noise) |
| B = 32–128 | Good approximation | Fast (GPU parallelism) | Good sweet spot |
| B = N (full batch) | Exact gradient | Slow per step | Can overfit to sharp minima |
Dataset: 50,000 images (CIFAR-10). Full-batch GD: one gradient step requires all 50,000 forward+backward passes. SGD with B=64: one step requires 64 passes. In the time full-batch GD takes one step, SGD takes ~781 steps (50,000 / 64 ≈ 781). Each step is noisier, but in practice 781 noisy steps make far more progress than one precise step.
Plain SGD still has three weaknesses. (1) Ill-conditioning: When the loss surface is an elongated valley, SGD oscillates across the narrow dimension and crawls along the long one. (2) Saddle points: The gradient is near-zero, so SGD stalls. (3) Noisy gradients: Mini-batch randomness causes the path to jitter. All three problems motivate the next chapter: momentum.
Think of optimization as rolling a ball down a hilly landscape. Plain SGD is like placing a ball on the surface and teleporting it in the steepest downhill direction at each step. There's no inertia — the ball has no memory of which direction it was going.
Momentum gives the ball mass. Instead of teleporting, the ball accelerates. If the gradient keeps pointing in the same direction, the ball speeds up. If the gradient oscillates, the ball's inertia smooths out the zigzag.
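The update is $v \leftarrow \rho v - \alpha \nabla L(W)$ followed by $W \leftarrow W + v$. The velocity v is an exponentially decaying sum of recent gradient steps. The parameter ρ (rho) acts like friction: it determines how much of the previous velocity is retained. With ρ = 0.9, the ball "remembers" roughly the last 1/(1 − ρ) = 10 gradients.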
[Interactive figure: side-by-side optimization on a narrow, elongated valley. SGD oscillates wildly; momentum smooths the path.]
Momentum helps with three weaknesses of plain SGD. (1) Ill-conditioning: Oscillations across the narrow dimension cancel out in the velocity; consistent motion along the long dimension accumulates. (2) Saddle points: Even if the gradient is near zero, the accumulated velocity carries the ball through. (3) Noise: Random fluctuations average out in the running mean, producing smoother updates.
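The accumulation effect can be sketched on an ill-conditioned quadratic. The loss L(w) = ½(w0² + 50·w1²), the starting point, and the hyperparameters are assumptions chosen so that plain gradient descent is stable but slow along the shallow direction; full (noise-free) gradients are used to isolate the effect of momentum.

```python
CURVATURES = (1.0, 50.0)   # shallow direction and steep direction


def grad(w):
    """Gradient of L(w) = 0.5 * (1*w0^2 + 50*w1^2)."""
    return [c * wi for c, wi in zip(CURVATURES, w)]


def loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURVATURES, w))


def run_sgd(w, lr, steps):
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def run_momentum(w, lr, rho, steps):
    v = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        v = [rho * vi - lr * gi for vi, gi in zip(v, g)]   # accumulate velocity
        w = [wi + vi for wi, vi in zip(w, v)]              # move by the velocity
    return w


start = [1.0, 1.0]
loss_sgd = loss(run_sgd(start, lr=0.01, steps=200))
loss_momentum = loss(run_momentum(start, lr=0.01, rho=0.9, steps=200))
print(loss_sgd, loss_momentum)
```

With the same learning rate and step budget, the momentum run ends at a loss several orders of magnitude lower, because velocity builds up along the shallow direction.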
Standard momentum computes the gradient at the current position, then combines it with velocity. Nesterov momentum does something smarter: first "look ahead" to where the velocity would take you, then compute the gradient there: $v \leftarrow \rho v - \alpha \nabla L(W + \rho v)$, then $W \leftarrow W + v$.
The intuition: if the velocity is about to carry you past the minimum, the lookahead gradient "sees" the uphill slope and applies a corrective force before you get there. It's like a driver who looks at the road ahead, not just under the hood.
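In practice, it is awkward to evaluate the gradient at W + ρv when the framework computes gradients at the stored parameters W. The standard trick: store the lookahead point $\tilde{W} = W + \rho v$ instead. The update in terms of $\tilde{W}$ becomes $v_{\text{new}} = \rho v_{\text{old}} - \alpha \nabla L(\tilde{W})$, then $\tilde{W} \leftarrow \tilde{W} - \rho v_{\text{old}} + (1 + \rho) v_{\text{new}}$. This way we always evaluate gradients at the current "effective" position $\tilde{W}$.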
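The change of variables can be verified numerically. This sketch runs both forms on a hypothetical one-dimensional loss L(w) = 5w² and checks that the reformulated iterate, shifted back by ρv, matches the lookahead form; function names and hyperparameters are assumptions.

```python
def grad(w):
    """Gradient of the toy loss L(w) = 5 * w**2 (an assumption for illustration)."""
    return 10.0 * w


def nesterov_lookahead(w, lr, rho, steps):
    """Textbook form: evaluate the gradient at the lookahead point w + rho * v."""
    v = 0.0
    for _ in range(steps):
        v = rho * v - lr * grad(w + rho * v)
        w = w + v
    return w, v


def nesterov_reformulated(w_tilde, lr, rho, steps):
    """Reformulated form: store w_tilde = w + rho * v and evaluate the gradient there."""
    v = 0.0
    for _ in range(steps):
        v_old = v
        v = rho * v - lr * grad(w_tilde)
        w_tilde = w_tilde - rho * v_old + (1.0 + rho) * v
    return w_tilde, v


lr, rho, steps = 0.01, 0.9, 25
w, v = nesterov_lookahead(1.0, lr, rho, steps)
w_tilde, v2 = nesterov_reformulated(1.0, lr, rho, steps)

# With v starting at zero, w_tilde starts equal to w, and w_tilde - rho * v
# should track w at every step.
gap = abs((w_tilde - rho * v2) - w)
print(gap)
```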
| Method | Gradient Evaluated At | Behavior |
|---|---|---|
| SGD | Current position W | Teleports downhill; no memory |
| SGD + Momentum | Current position W | Rolls downhill with inertia |
| Nesterov | Lookahead position W + ρv | Rolls with anticipation; better braking |
Momentum smooths the update direction by accumulating velocity, but it still uses one global learning rate for every parameter. What if different parameters need different learning rates? A weight that rarely gets large gradients needs a bigger step size. A weight that gets huge gradients needs a smaller one.
Adaptive methods give each parameter its own effective learning rate, automatically tuned based on the history of gradients for that parameter.
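The simplest adaptive method. Keep a running sum of squared gradients for each parameter, then divide the learning rate by the square root of that sum: $h \leftarrow h + g^2$, then $W \leftarrow W - \alpha\, g / (\sqrt{h} + \epsilon)$. All operations are elementwise, $g = \nabla L(W)$, and ε is a small constant (for example 1e-8) that avoids division by zero.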
Parameters with large accumulated gradients (steep directions) get smaller effective learning rates. Parameters with small accumulated gradients (flat directions) get larger ones. This equalizes progress across all directions.
The denominator h only grows — it never shrinks. Over time, the effective learning rate for every parameter decays toward zero. Training eventually stalls. AdaGrad works well for sparse problems (NLP, recommender systems) but poorly for deep learning's long training runs.
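RMSProp fixes AdaGrad's decay problem with a simple change: use a leaky (exponentially weighted) running average of squared gradients instead of the cumulative sum: $h \leftarrow \beta h + (1 - \beta) g^2$, with the same update $W \leftarrow W - \alpha\, g / (\sqrt{h} + \epsilon)$ and a decay rate β slightly below 1.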
The "leaky" average means old gradient information gradually fades. If a parameter stops getting large gradients, its learning rate recovers. No more death spiral.
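The difference between the two accumulators shows up clearly when the gradient is held constant. This sketch tracks the effective learning rate α / (√h + ε) for a single parameter; the constant gradient of 1.0 and the decay rate of 0.9 are assumptions for illustration.

```python
import math


def adagrad_effective_lr(grads, lr=0.1, eps=1e-8):
    """Effective step size under AdaGrad's cumulative sum of squared gradients."""
    h, history = 0.0, []
    for g in grads:
        h += g * g
        history.append(lr / (math.sqrt(h) + eps))
    return history


def rmsprop_effective_lr(grads, lr=0.1, decay=0.9, eps=1e-8):
    """Effective step size under RMSProp's leaky average of squared gradients."""
    h, history = 0.0, []
    for g in grads:
        h = decay * h + (1.0 - decay) * g * g
        history.append(lr / (math.sqrt(h) + eps))
    return history


grads = [1.0] * 1000
ada = adagrad_effective_lr(grads)
rms = rmsprop_effective_lr(grads)

print(ada[0], ada[-1])   # keeps shrinking like lr / sqrt(t)
print(rms[0], rms[-1])   # settles near lr once the average warms up
```

AdaGrad's step size keeps falling as long as gradients arrive, while RMSProp's levels off at the base learning rate.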
Adam (Adaptive Moment Estimation) combines the best of both worlds: momentum's velocity accumulation and RMSProp's per-parameter scaling. It maintains two running averages:
First moment (m): an exponentially weighted average of gradients, $m \leftarrow \beta_1 m + (1 - \beta_1) g$. This is momentum.
Second moment (v): an exponentially weighted average of squared gradients, $v \leftarrow \beta_2 v + (1 - \beta_2) g^2$. This is RMSProp.
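Both m and v are initialized to zero. At the first step (t=1), $m_1 = (1 - \beta_1) g_1$, which is much smaller than $g_1$: with $\beta_1 = 0.9$ you keep 90% of nothing and add only 10% of the gradient. The estimates are biased toward zero in early training.
After t steps: $m_t = (1-\beta_1)\sum_{i=1}^{t} \beta_1^{t-i} g_i$. Taking expectations (assuming the gradient distribution is roughly stationary): $E[m_t] = E[g]\,(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} = E[g]\,(1 - \beta_1^t)$. So $m_t$ underestimates $E[g]$ by a factor of $(1 - \beta_1^t)$. Dividing by this factor corrects the bias: $\hat{m}_t = m_t / (1 - \beta_1^t)$. At t=1 with $\beta_1 = 0.9$: correction = 1/(1−0.9) = 10×. At t=100: correction ≈ 1.0 (bias is negligible). The same logic applies to $v_t$ with $\beta_2$, giving $\hat{v}_t = v_t / (1 - \beta_2^t)$, and the parameter update is $W \leftarrow W - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.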
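Putting the two moments and the bias correction together, a single Adam step for one scalar parameter might look like the sketch below. The function name and the example gradient are assumptions; the default hyperparameters follow the values quoted in this lecture.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * g * g      # second moment (RMSProp)
    m_hat = m / (1.0 - beta1 ** t)             # undo the bias toward zero
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w0, g = 1.0, 4.0
w1, m1, v1 = adam_step(w0, g, m=0.0, v=0.0, t=1)

# Raw first moment is only 10% of the gradient, but after correction the
# first step has magnitude close to lr, whatever the scale of g.
print(m1, w0 - w1)
```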
Adam combines three critical ingredients. Momentum (β1) smooths noisy gradients and accelerates through flat regions. Adaptive scaling (β2) gives each parameter its own learning rate, handling ill-conditioning. Bias correction ensures sensible updates from the very first step. The default hyperparameters (β1=0.9, β2=0.999, α=1e-3 or 5e-4) work well across a huge range of architectures. The β values rarely need tuning, though the learning rate is still worth a short search.
There's a subtle interaction between L2 regularization and Adam. Standard Adam adds the L2 gradient (proportional to λW) to g, which then gets divided by the adaptive denominator. This means the decay is also rescaled per parameter: weights with a history of large gradients are decayed less than weights with small gradients, which is not what we want.
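AdamW (Loshchilov & Hutter, 2019) fixes this by keeping weight decay out of the gradient and applying it directly to the weights alongside the Adam update: $W \leftarrow W - \alpha\,(\hat{m} / (\sqrt{\hat{v}} + \epsilon) + \lambda W)$, where $\hat{m}$ and $\hat{v}$ are the bias-corrected moments.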
AdamW typically outperforms standard Adam + L2 regularization, especially for training large models like BERT and GPT. It's the default in modern deep learning.
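The difference is easiest to see on the very first step, where Adam's normalized update has magnitude close to the learning rate. The sketch below is a single-step illustration under that assumption, not a full optimizer; the function names and the values of `lr` and `wd` are hypothetical.

```python
import math


def adam_direction(g, t=1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction for one scalar gradient, starting from m = v = 0."""
    m = (1.0 - beta1) * g
    v = (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)


def adam_l2_step(w, g, lr, wd):
    """Adam + L2: the decay term wd * w is folded into the gradient,
    so it passes through the adaptive normalization."""
    return w - lr * adam_direction(g + wd * w)


def adamw_step(w, g, lr, wd):
    """AdamW: Adam sees only the loss gradient; decay acts on the weight directly."""
    return w - lr * (adam_direction(g) + wd * w)


w, lr, wd = 1.0, 1e-3, 0.1
extra_shrink = {}
for g in (10.0, 0.01):
    # How much further does the weight shrink because of the decay term?
    l2_effect = adam_l2_step(w, g, lr, 0.0) - adam_l2_step(w, g, lr, wd)
    adamw_effect = adamw_step(w, g, lr, 0.0) - adamw_step(w, g, lr, wd)
    extra_shrink[g] = (l2_effect, adamw_effect)

print(extra_shrink)
```

In this setting the L2 term is almost entirely absorbed by the normalization, while the decoupled decay removes lr · λ · W from the weight regardless of the gradient's scale.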
| Optimizer | First Moment | Second Moment | Bias Correction | Key Feature |
|---|---|---|---|---|
| SGD | — | — | — | Simple, needs tuning |
| Momentum | ✓ | — | — | Smooths oscillations |
| AdaGrad | — | ✓ (cumulative) | — | Adaptive LR, dies over time |
| RMSProp | — | ✓ (leaky) | — | Adaptive LR, doesn't die |
| Adam | ✓ | ✓ (leaky) | ✓ | Best of momentum + RMSProp |
| AdamW | ✓ | ✓ (leaky) | ✓ | Adam + decoupled weight decay |
The learning rate α doesn't have to stay constant. In fact, the best strategy is almost always to change it during training. Start with a relatively large learning rate to make fast initial progress, then decay it so the optimizer can settle into a precise minimum.
The simplest schedule: multiply the learning rate by a factor (e.g., 0.1) at specific epochs. For ResNets on ImageNet, the classic recipe is: start at α = 0.1, multiply by 0.1 at epochs 30, 60, and 90.
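A smooth, gradual decay following a cosine curve from the initial learning rate down to near zero: $\alpha_t = \frac{1}{2}\alpha_0 \left(1 + \cos(\pi t / T)\right)$, where T is the total number of training steps.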
Cosine annealing has become the default for training transformers and vision models. It avoids the discontinuities of step decay while still reaching very small learning rates by the end of training.
Starting with a large learning rate can destabilize early training. Linear warmup gradually increases the learning rate from 0 to α0 over the first few thousand steps: $\alpha_t = \alpha_0 \, t / T_{\text{warmup}}$ for $t \le T_{\text{warmup}}$.
Why does warmup help? In early training, the gradients can be very large and erratic because the model hasn't learned anything yet. A large learning rate amplifies this chaos. Warmup gives the optimizer time to "find its footing" before taking big steps.
A common schedule for transformers as of 2024-2025: linear warmup for ~5,000 steps, then cosine decay to zero. This combines the stability of warmup with the smooth convergence of cosine annealing. An empirical rule of thumb (the linear scaling rule): if you multiply the batch size by k, also multiply the initial learning rate by k (Goyal et al., 2017).
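The warmup-plus-cosine schedule can be written as a single function of the step count. This is a sketch under stated assumptions: the function name, the 100,000-step budget, and the base learning rate of 5e-4 are illustrative choices, not values fixed by the lecture.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup from 0 to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total, warmup, base = 100_000, 5_000, 5e-4
for step in (0, 2_500, 5_000, 52_500, 100_000):
    print(step, warmup_cosine_lr(step, total, warmup, base))
```

The rate rises linearly to its peak at the end of warmup, passes through half the peak at the midpoint of the decay phase, and reaches the minimum at the final step.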
[Interactive figure: learning rate over training for four schedules: constant, step decay, cosine annealing, and warmup+cosine.]
| Schedule | Formula | Used By |
|---|---|---|
| Linear decay | $\alpha_t = \alpha_0 (1 - t/T)$ | BERT, GPT-2 |
| Inverse sqrt | $\alpha_t = \alpha_0 / \sqrt{t}$ | Original Transformer (Vaswani et al.) |
| Cosine + restarts | Cosine with periodic resets | SGDR (Loshchilov & Hutter) |
| 1-cycle | Warmup to high LR, then cosine down | Super-convergence (Smith & Topin) |
Time to see everything in action. Four optimizers — SGD, Momentum, RMSProp, and Adam — all start from the same point and race to minimize the same loss function. Watch how their paths differ. Who reaches the minimum first? Who oscillates? Who takes the smoothest path?
[Interactive figure: four optimizers race across a 2D loss landscape with adjustable learning rates. The contour lines show constant-loss curves; darker = lower loss.]
SGD (red): oscillates in narrow valleys, gets stuck at saddle points. Momentum (blue): smoother path, can overshoot but recovers. RMSProp (green): adapts step sizes per dimension, good at narrow valleys but no momentum to escape saddle points. Adam (gold): combines both advantages — typically fastest convergence with the smoothest path.
Try increasing the learning rate and watch SGD diverge while Adam stays stable. Try the saddle point landscape and watch SGD stall while momentum carries through. Try the narrow valley (Beale) and watch RMSProp navigate it smoothly while SGD zigzags.
Pay attention to the number of steps each optimizer needs. Adam typically converges in far fewer steps, but each step is slightly more expensive (maintaining two running averages). In practice, the per-step overhead is negligible compared to the forward and backward passes, so Adam's faster convergence translates directly to faster training.
Also notice how the paths diverge more dramatically as you increase the learning rate. SGD is the first to blow up; Momentum can overshoot but usually recovers; RMSProp's adaptive scaling keeps it stable; Adam is the last to lose stability. This robustness to learning rate choice is a major practical advantage.
Beale landscape: A narrow valley that curves. SGD oscillates across the valley walls. Momentum helps but can overshoot the curve. Adam handles both the narrowness (via RMSProp component) and the curve (via momentum component). Rosenbrock: A long, banana-shaped valley. Progress is easy in the steep direction but agonizingly slow along the flat bottom. Saddle point: Flat in one direction, curved in another. Pure SGD stalls; momentum pushes through.
Before optimization even begins, you have to initialize the weights. This choice matters far more than you might expect. Bad initialization can make training impossible.
If every weight starts at zero, every neuron in each layer computes the exact same output. Gradients are identical too, so all weights update identically. They stay equal forever. The network has effectively collapsed to a single neuron per layer — it can never learn diverse features. This is called the symmetry problem.
All-zero weight initialization keeps a multi-layer network stuck in this symmetric state (biases are the exception and can safely start at zero). You must break symmetry by using random initialization. But how much randomness?
If weights are initialized too small, signals shrink as they pass through layers. By the time you reach layer 50, activations are essentially zero. Gradients vanish. Training dies.
If weights are initialized too large, signals explode. Activations saturate, gradients explode, and the loss becomes NaN after a few steps.
The sweet spot: initialize so that the variance of activations stays roughly constant across layers. This is the principle behind both Xavier and He initialization.
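For layers with tanh or sigmoid activations, Glorot & Bengio (2010) derived $\mathrm{Var}(W) = 2 / (n_{\text{in}} + n_{\text{out}})$, often simplified to $\mathrm{Var}(W) = 1 / n_{\text{in}}$.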
ReLU activations zero out half the signal (all negative values become zero). He et al. (2015) compensated by doubling the variance: $\mathrm{Var}(W) = 2 / n_{\text{in}}$.
Using ReLU (or GELU, Swish, etc.)? Use Kaiming/He initialization. Using tanh or sigmoid? Use Xavier/Glorot. In PyTorch: nn.init.kaiming_normal_(layer.weight, nonlinearity='relu') or nn.init.xavier_normal_(layer.weight). PyTorch's nn.Linear defaults to a Kaiming-uniform variant whose scale differs from the ReLU recipe above, so call the initializer explicitly if you want He initialization.
For a layer y = Wx, assume the entries of x and W are independent with zero mean and variances Var(x) and Var(W). Then each output $y_j = \sum_i W_{ji} x_i$ has variance $\mathrm{Var}(y) = n_{\text{in}} \cdot \mathrm{Var}(W) \cdot \mathrm{Var}(x)$. For Var(y) = Var(x), we need $\mathrm{Var}(W) = 1/n_{\text{in}}$. After ReLU, half the outputs are zeroed, which halves the mean squared activation passed to the next layer. So we need $\mathrm{Var}(W) = 2/n_{\text{in}}$ to compensate.
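The factor of 2 can be checked with a small simulation. This sketch pushes one random input through a stack of ReLU layers with Gaussian weights of variance `scale / width` and reports how the mean squared activation changes from input to output. The width, depth, and seed are arbitrary assumptions, and the same random draws are reused for both scales so the comparison is exact.

```python
import math
import random


def final_to_input_ratio(scale, width=128, depth=10, seed=0):
    """Mean squared activation after `depth` ReLU layers divided by that of the
    input, with weights drawn from N(0, scale / width)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    input_ms = sum(x * x for x in a) / width
    std = math.sqrt(scale / width)
    for _ in range(depth):
        # Each output unit: ReLU of a fresh random weighted sum of the inputs.
        a = [max(0.0, sum(rng.gauss(0.0, std) * x for x in a))
             for _ in range(width)]
    return (sum(x * x for x in a) / width) / input_ms


he_ratio = final_to_input_ratio(scale=2.0)       # Var(W) = 2 / n_in
xavier_ratio = final_to_input_ratio(scale=1.0)   # Var(W) = 1 / n_in

print(he_ratio, xavier_ratio, he_ratio / xavier_ratio)
```

With the 2/n scaling the signal stays on the order of its input magnitude, while the 1/n scaling loses a factor of 2 per ReLU layer, which is 2^10 = 1024 over ten layers.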
| Situation | Recommendation |
|---|---|
| Starting a new project | AdamW with default hyperparameters. It often works without tuning. |
| Squeezing out last 1% accuracy | SGD + Momentum with careful LR schedule. Can outperform Adam with tuning. |
| Full-batch optimization (small dataset) | Consider L-BFGS, a quasi-Newton (approximate second-order) method that converges very fast when you can afford full gradients. |
| Training is unstable (loss spikes) | Reduce learning rate. Add warmup. Check initialization. |
| Large generalization gap | Increase weight decay. Add dropout. Use data augmentation. |
Everything we've discussed is first-order optimization: we only use the gradient (first derivative). Second-order methods also use the Hessian (matrix of second derivatives) to model the curvature of the loss surface. Newton's method jumps directly to the estimated minimum of a quadratic approximation. The problem? The Hessian has $O(N^2)$ entries and inverting it takes $O(N^3)$ operations. With N = 100 million parameters, that's utterly infeasible.
Approximations like L-BFGS avoid storing the full Hessian, but they still require deterministic full-batch gradients. In the stochastic mini-batch world of deep learning, first-order methods (especially Adam) reign supreme.
Backpropagation (Lecture 4): How do you actually compute ∇L? The chain rule applied recursively through the network. Without backprop, gradient descent is impractical: numerical differentiation needs a forward pass per parameter, which is hopeless for networks with millions of parameters.
Neural Networks: We've been optimizing linear classifiers (f = Wx + b). Next: stack multiple layers with nonlinearities to learn arbitrary functions. The optimization tools from this lecture — Adam, momentum, learning rate schedules — carry over directly.
Batch Normalization, Dropout, Data Augmentation: More regularization techniques, designed specifically for deep networks. These complement weight decay and are covered in later lectures.
| Hyperparameter | Adam/AdamW | SGD+Momentum |
|---|---|---|
| Learning rate α | 1e-3 or 5e-4 | 0.1 |
| Momentum ρ | 0.9 (β1) | 0.9 |
| RMSProp decay β2 | 0.999 | N/A |
| ε | 1e-8 | N/A |
| Weight decay λ | 0.01–0.1 | 1e-4 |
| Warmup steps | 1,000–5,000 | Optional |
| LR schedule | Cosine | Step decay or cosine |
Here is what training setup looks like in practice for a typical vision transformer:
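import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)  # model: an existing nn.Module
scheduler = CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-6)  # T_max counts scheduler.step() calls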
The optimizer handles the per-step weight updates (momentum, adaptive scaling, bias correction, decoupled weight decay). The scheduler handles the epoch-level learning rate decay (cosine annealing from 5e-4 down to 1e-6 over 300 epochs), provided scheduler.step() is called once per epoch. Together they implement the full modern training recipe.
| Year | Milestone | Key Idea |
|---|---|---|
| 1847 | Cauchy proposes gradient descent | Follow the negative gradient |
| 1951 | Robbins & Monro: stochastic approximation | Use noisy gradients |
| 1983 | Nesterov: accelerated gradient | Lookahead momentum |
| 2010 | Glorot & Bengio: Xavier initialization | Preserve signal variance |
| 2011 | Duchi et al.: AdaGrad | Per-parameter learning rates |
| 2012 | Tieleman & Hinton: RMSProp | Leaky AdaGrad |
| 2015 | Kingma & Ba: Adam | Momentum + RMSProp + bias correction |
| 2015 | He et al.: Kaiming initialization | Compensate for ReLU |
| 2017 | Loshchilov & Hutter: cosine annealing, AdamW | Smooth LR decay, decoupled weight decay |
Regularization prevents overfitting by penalizing complexity; optimization finds good weights by following gradients intelligently. AdamW (Adam with decoupled weight decay) plus a cosine schedule is the modern default.