Training Foundations

Optimizers

The algorithms that turn gradients into weight updates — from vanilla SGD to the adaptive methods behind every modern neural network.

Prerequisites: Basic calculus (derivatives) + What a neural network does. That's it.
10 Chapters · 14+ Simulations · 0 Assumed Knowledge

Chapter 0: The Landscape Problem

You're standing on a foggy mountain. You can't see the valley floor — can't see anything more than a meter ahead. But you can feel the slope under your feet. Steep tilt to the left? Step left. Gentle slope forward? Step forward. That's all the information you have: the local tilt of the ground beneath you.

Training a neural network is exactly this problem. The "mountain" is the loss landscape — a surface where every point represents a particular set of model weights, and the height at that point is the loss (how wrong the model is). The "valley floor" is a minimum where the loss is low. Your job is to walk downhill using only what you can feel locally.

The tool that tells you the local slope is called the gradient.

What a Gradient Actually Is

Take a function with one input, say f(x) = (x − 3)². The graph is a parabola with its minimum at x = 3. At any point x, the derivative f′(x) = 2(x − 3) tells you two things: its sign tells you which direction is uphill, and its magnitude tells you how steep the slope is.

The derivative points in the direction of steepest ascent — uphill. We want to go downhill, so we step in the opposite direction. That's it. That's gradient descent.

When the function has millions of inputs (the weights of a neural network), the derivative generalises to the gradient — a vector of partial derivatives, one per weight. Each entry says "if you nudge this particular weight up a tiny bit, here's how much the loss increases." The gradient vector points uphill in weight-space; we step opposite to it.
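To make "one partial derivative per weight" concrete, here is a minimal sketch that estimates a gradient by nudging each weight up and down a tiny bit. The two-weight loss `toy_loss` and the helper `numerical_gradient` are hypothetical names chosen for illustration; real frameworks compute gradients with backpropagation rather than finite differences.

```python
def numerical_gradient(loss, weights, eps=1e-6):
    """Estimate each partial derivative with a central finite difference."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps        # nudge weight i up a tiny bit
        down[i] -= eps      # and down a tiny bit
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad

# Hypothetical two-weight loss with its minimum at (3, -1).
def toy_loss(w):
    return (w[0] - 3) ** 2 + 2 * (w[1] + 1) ** 2

grad = numerical_gradient(toy_loss, [0.0, 0.0])
print(grad)  # close to the analytic gradient [2*(0-3), 4*(0+1)] = [-6, 4]
```

Each entry answers the question in the text: how much does the loss change if only that weight moves. Stepping opposite to this vector (toward larger w[0] and smaller w[1]) heads toward the minimum.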

The Simplest Update Rule

One step of gradient descent is:

θ ← θ − η · ∇L(θ)

Let's unpack every symbol:

| Symbol | Name | Meaning |
| --- | --- | --- |
| θ | Parameters | All the model's weights, packed into one vector |
| η | Learning rate | Step size: how far we walk per update (a scalar, e.g. 0.01) |
| ∇L(θ) | Gradient | The slope of the loss with respect to every weight, evaluated at the current θ |
| ← | Assignment | Replace the old θ with the new value |

The minus sign is critical: the gradient points uphill, so subtracting it moves us downhill. The learning rate η scales how big each step is. That single scalar is going to cause us enormous trouble.

Hand Calculation: Walking Downhill

Let's trace gradient descent on f(x) = (x − 3)², starting at x = 0 with learning rate η = 0.1. The gradient (derivative) is f′(x) = 2(x − 3).

We want to reach x = 3, where f(x) = 0. Let's see how gradient descent gets there — one step at a time, showing every intermediate number.

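Step 0, the starting point:

x₀ = 0, f(x₀) = (0 − 3)² = 9, f′(x₀) = 2(0 − 3) = −6
x₁ = x₀ − η · f′(x₀) = 0 − 0.1 × (−6) = 0.6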

The gradient is −6 (negative, pointing left — meaning the function is decreasing as x increases). We subtract it, which pushes us to the right toward the minimum. Makes sense.

Step 1:

x₁ = 0.6, f(x₁) = (0.6 − 3)² = 5.76, f′(x₁) = 2(0.6 − 3) = −4.8
x₂ = 0.6 − 0.1 × (−4.8) = 1.08

Loss dropped from 9 to 5.76. The gradient is smaller now (−4.8 vs −6) because we're closer to the minimum and the slope is gentler. Each step gets smaller — this is a nice property of gradient descent on smooth functions.

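Step 2:

x₂ = 1.08, f(x₂) ≈ 3.686, f′(x₂) = 2(1.08 − 3) = −3.84
x₃ = 1.08 − 0.1 × (−3.84) = 1.464

Step 3:

x₃ = 1.464, f(x₃) ≈ 2.359, f′(x₃) = 2(1.464 − 3) = −3.072
x₄ = 1.464 − 0.1 × (−3.072) = 1.7712 ≈ 1.771

Step 4:

x₄ = 1.7712, f(x₄) ≈ 1.510, f′(x₄) = 2(1.7712 − 3) = −2.4576
x₅ = 1.7712 − 0.1 × (−2.4576) = 2.01696 ≈ 2.017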

After 5 steps: x went from 0 → 0.6 → 1.08 → 1.464 → 1.771 → 2.017. Loss dropped from 9.0 → 0.966. We're converging, but slowly: each step is 80% the size of the previous one (because the gradient shrinks by a factor of 0.8 each time; can you see why?).

The 0.8 factor. Each step multiplies x's distance from the minimum by (1 − 2η) = (1 − 0.2) = 0.8. So convergence is geometric: after n steps, the remaining distance is 0.8ⁿ times the original. At η = 0.1, you need about 3 steps to halve the distance (0.8³ ≈ 0.51) and about 10 steps to reduce it by 10× (0.8¹⁰ ≈ 0.11). Increase η and you converge faster, until you don't.
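The hand calculation can be replayed in a few lines. This is a minimal sketch of the update rule on the same parabola; the names `f_prime`, `xs`, and `ratio` are just illustrative choices.

```python
def f_prime(x):
    return 2 * (x - 3)          # derivative of f(x) = (x - 3)^2

eta = 0.1
x = 0.0
xs = [x]
for _ in range(5):
    x = x - eta * f_prime(x)    # θ ← θ − η · ∇L(θ)
    xs.append(x)

print([round(v, 3) for v in xs])   # [0.0, 0.6, 1.08, 1.464, 1.771, 2.017]

# The distance to the minimum shrinks by the same factor (1 - 2η) on every step.
ratio = (3 - xs[2]) / (3 - xs[1])
print(round(ratio, 3))             # 0.8
```

Changing `eta` changes the contraction factor 1 − 2η, which is the whole story of the next section.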

The Learning Rate Dilemma

What happens if we crank η up? Let's replay the same problem with η = 0.5:

x₁ = 0 − 0.5 × (−6) = 3

One step! We jumped straight to the minimum. η = 0.5 is perfect for this particular parabola. But what about η = 1.1?

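x goes 0 → 6.6 → −1.32 → 8.18 → ... We're oscillating and diverging. Each step overshoots the minimum by more than the last, because the distance to the minimum is multiplied by (1 − 2η) = −1.2 every step. The loss is 9 → 12.96 → 18.66 → ... going up. The learning rate is too high. For this simple parabola, anything above η = 1.0 diverges.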

And if η is too small — say η = 0.001?

After 100 steps we'd be at roughly x ≈ 0.54. After 1000 steps, x ≈ 2.59, still not there (the distance shrinks by only a factor of 0.998 per step). It works, but it's agonisingly slow. In a real network with millions of parameters and expensive forward/backward passes, this waste is measured in GPU-hours and dollars.
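The three regimes can be compared side by side with a small helper. This is a sketch on the same toy parabola; `run_gd` and `loss` are hypothetical helper names, not part of any library.

```python
def run_gd(eta, steps, x0=0.0):
    """Gradient descent on f(x) = (x - 3)^2; returns every visited x."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * 2 * (xs[-1] - 3))
    return xs

def loss(x):
    return (x - 3) ** 2

print(run_gd(0.5, 1)[-1])                            # 3.0: lands on the minimum in one step
print([round(loss(x), 2) for x in run_gd(1.1, 2)])   # [9.0, 12.96, 18.66]: the loss grows
print(round(run_gd(0.001, 100)[-1], 2))              # after 100 steps, still far from x = 3
```

The only thing that changes between the three calls is η, yet the outcomes range from instant convergence to divergence to a crawl.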

Explore the Landscape

The widget below lets you drop a ball onto three different loss landscapes and watch gradient descent in action. Drag the learning rate slider and see how the behaviour changes.

Loss Landscape Explorer

Choose a landscape shape, set the learning rate, click "Drop Ball", and watch gradient descent try to reach the minimum. Click anywhere on the landscape to reposition the ball.


Three Ways to Fail

Play with the widget above and you'll discover the three failure modes that plague vanilla gradient descent:

1. Divergence (η too high). On the Bowl landscape, crank the learning rate above ~1.0. The ball overshoots the minimum, lands on the opposite slope, overshoots again, and each bounce is larger than the last. The loss explodes to infinity. In real training, you see "loss: NaN" in your terminal and your run is dead.

2. Crawling (η too low). Set η to 0.005. The ball inches forward. On the Bowl this is merely slow, but on the Ravine it's catastrophic: the ball needs thousands of steps to traverse the long flat floor of the valley, and you'll run out of patience (or compute budget) long before it arrives.

3. Saddle points. Switch to the Saddle landscape. A saddle point is a place where the gradient is zero even though it's not a minimum — like the centre of a horse saddle, curving up in one direction and down in another. The gradient is zero at the saddle, so gradient descent stops dead. In high-dimensional spaces (neural networks have millions of dimensions), saddle points vastly outnumber local minima. They are the dominant obstacle.

Why saddle points dominate in high dimensions. At a critical point (zero gradient), each dimension can curve either up or down. A true local minimum needs ALL dimensions to curve up. As a rough heuristic, if each of d dimensions were independently equally likely to curve up or down, the probability of that would be (1/2)ᵈ. With d = 1,000,000, the chance of a true local minimum is astronomically small. Almost every critical point is a saddle: some dimensions curve up, others curve down. This is why "getting stuck in local minima" is mostly a myth for large neural networks. The real enemy is saddle points and flat regions.
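A tiny experiment shows why a saddle traps exact gradient descent but not a slightly perturbed one. The surface f(x, y) = x² − y² below is a hypothetical textbook saddle chosen for illustration (it has no minimum at all, so "escaping" here just means sliding away along y).

```python
def saddle_grad(x, y):
    """Gradient of the hypothetical saddle f(x, y) = x^2 - y^2."""
    return 2 * x, -2 * y

def descend(x, y, eta=0.1, steps=50):
    for _ in range(steps):
        gx, gy = saddle_grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y

print(descend(0.0, 0.0))     # stays at the saddle: the gradient there is exactly zero
print(descend(0.0, 1e-6))    # a tiny nudge along y grows by 1.2x per step and escapes
```

Starting exactly on the saddle, nothing ever moves. Starting a millionth of a unit off it, the downhill direction amplifies the offset every step, which is the mechanism that gradient noise exploits later in the lesson.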

One Number Rules Everything

Here is the uncomfortable truth: with vanilla gradient descent, your entire training outcome depends on picking the right value for a single scalar η. Too high and you diverge. Too low and you waste days of compute. And the "right" value changes during training — early on, when gradients are large and you're far from any minimum, you can afford a big learning rate. Later, when you're near a minimum and need to settle in, you need a small one.

Different parameters may want different learning rates. A parameter in an early layer of a deep network gets tiny gradients (the vanishing gradient problem); it needs a bigger step. A parameter in the final layer gets huge gradients; it needs a smaller step. One global η is a sledgehammer.

The learning rate IS the optimizer. Every technique in this lesson — momentum, RMSProp, Adam, learning rate schedules — exists because a single fixed learning rate breaks on real loss landscapes. Each technique is a different answer to the same question: how do we adapt the step size automatically?

In the chapters ahead, we'll build up from the simplest fix (using random subsets of data) through momentum (accumulating velocity) to fully adaptive methods (per-parameter learning rates). Each one solves a specific failure mode. By the end, you'll understand exactly what torch.optim.Adam is doing under the hood, and when to reach for something else.

If gradient descent oscillates wildly back and forth across a valley, which adjustment would help?

Chapter 1: Vanilla SGD — One Step at a Time

Computing the exact gradient over millions of examples takes forever. What if we estimated it from a handful?

In the last chapter, the gradient ∇L(θ) was the true gradient — computed over the entire dataset. For our toy parabola that was a single number, no big deal. But in practice, the loss is an average over every training example:

L(θ) = (1/N) · Σᵢ₌₁ᴺ ℓ(f(xᵢ; θ), yᵢ)

where ℓ is the per-example loss (e.g., cross-entropy or MSE), f is the model's prediction, and N is the number of training examples.

The Full-Batch Problem

To compute ∇L(θ) exactly, you need to:

  1. Run every example through the model (forward pass).
  2. Compute the loss for every example.
  3. Backpropagate through the model once per example (or in one batched pass).
  4. Average all the per-example gradients.

ImageNet has 1.2 million images. A single forward+backward pass for one image on a ResNet-50 takes roughly 10ms on a GPU. One full-batch gradient step = 1,200,000 × 10ms = 12,000 seconds = 3.3 hours per single step. And you might need 100,000 steps to converge. That's 38 years. Obviously nobody does this.

The Mini-Batch Insight

Here's the key observation: the full gradient is an average over N examples. An average can be estimated by a random sample. If you pick B random examples (a mini-batch), the average of their gradients is a noisy but unbiased estimate of the true gradient:

∇L̃(θ) = (1/B) · Σⱼ₌₁ᴮ ∇ℓ(f(x_πⱼ; θ), y_πⱼ)

where π is a random permutation and we grab the first B indices. "Unbiased" means the expected value of this estimate equals the true gradient: 𝔼[∇L̃] = ∇L. Any single estimate will be off, but on average (over many random draws), it's correct.

This is Stochastic Gradient Descent — "stochastic" because the gradient is random (depends on which mini-batch we happened to grab). The update rule looks almost identical:

θ ← θ − η · ∇L̃(θ; x_batch)

The only difference is that ∇L̃ is computed on a mini-batch of B examples instead of all N. With B = 32, one gradient step takes 32 × 10ms = 0.32 seconds. In the time that full-batch does 1 step, SGD does 37,500 steps. Each step is noisier, but 37,500 noisy steps beat 1 clean step by a landslide.
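The back-of-envelope numbers above are easy to check. This sketch simply repeats the text's arithmetic and inherits its simplifying assumption that cost scales linearly with the number of examples (10 ms each); the variable names are illustrative.

```python
N = 1_200_000            # training images (figure from the text)
t_example = 0.010        # seconds per forward+backward pass for one image (the text's rough estimate)
B = 32                   # mini-batch size

full_batch_step = N * t_example              # seconds per full-batch update
mini_batch_step = B * t_example              # seconds per mini-batch update
years_for_100k = full_batch_step * 100_000 / (3600 * 24 * 365)

print(full_batch_step / 3600)                     # about 3.3 hours per full-batch step
print(round(years_for_100k))                      # about 38 years for 100,000 such steps
print(round(full_batch_step / mini_batch_step))   # 37500 mini-batch steps in the same time
```

The ratio in the last line is just N / B, so halving the batch size doubles the number of updates you get for the same compute.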

  1. Full dataset (N examples): all training data in memory.
  2. Shuffle: random permutation of indices.
  3. Split into mini-batches: chunks of B examples each.
  4. For each batch, compute ∇L̃: forward + backward on B examples.
  5. Update θ ← θ − η · ∇L̃: one parameter update per batch.
  6. Repeat (one pass = one epoch): N/B updates per epoch.

Hand Calculation: Full-Batch vs Mini-Batch

Let's make the noise concrete with a tiny example. We have 4 data points and a dead-simple model: y = θ · x (a line through the origin). We'll use MSE loss: ℓ = (y − θx)².

| i | xᵢ | yᵢ |
| --- | --- | --- |
| 1 | 1 | 2 |
| 2 | 2 | 4 |
| 3 | 3 | 5 |
| 4 | 4 | 8 |

The true relationship is roughly y ≈ 2x (not exactly: point 3 is a bit low, while the other three points lie exactly on y = 2x). Let's start at θ = 1.0.

Full-batch gradient:

The per-example loss is ℓᵢ = (yᵢ − θxᵢ)². Its gradient with respect to θ is:

∂ℓᵢ/∂θ = −2xᵢ(yᵢ − θxᵢ)

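At θ = 1.0, let's compute each:

i = 1: −2 · 1 · (2 − 1.0 · 1) = −2
i = 2: −2 · 2 · (4 − 1.0 · 2) = −8
i = 3: −2 · 3 · (5 − 1.0 · 3) = −12
i = 4: −2 · 4 · (8 − 1.0 · 4) = −32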

Full-batch gradient = average = (−2 + (−8) + (−12) + (−32)) / 4 = −54 / 4 = −13.5

With η = 0.01: θ_new = 1.0 − 0.01 × (−13.5) = 1.0 + 0.135 = 1.135

Mini-batch 1: examples {1, 3} — points (1,2) and (3,5):

Batch gradient = (−2 + (−12)) / 2 = −14 / 2 = −7.0

With η = 0.01: θ_new = 1.0 − 0.01 × (−7.0) = 1.07

Mini-batch 2: examples {2, 4}, i.e. points (2,4) and (4,8):

The weight has already moved to θ = 1.07, so the gradients are re-evaluated there: example 2 gives −2 · 2 · (4 − 1.07 · 2) = −7.44 and example 4 gives −2 · 4 · (8 − 1.07 · 4) = −29.76.

Batch gradient = (−7.44 + (−29.76)) / 2 = −37.2 / 2 = −18.6

With η = 0.01: θ_new = 1.07 − 0.01 × (−18.6) = 1.256

Compare the two paths.
Full-batch (1 step): θ = 1.0 → 1.135
SGD (2 steps, same data): θ = 1.0 → 1.07 → 1.256

SGD made more progress! Two small noisy steps carried θ further than the single full-batch update, using the same four examples. The estimates are also unbiased: evaluated at the same point θ = 1.0, the two mini-batch gradients are −7.0 and −20.0, and their average is −13.5, exactly the full-batch gradient. Each individual estimate is noisy, but the average is correct. This is the power of unbiased estimation.

Notice the variance: at θ = 1.0, one batch said "step by 7" and the other said "step by 20." That's because batch 2 happened to contain the two large-x points, where the error is bigger and the gradient is steeper. This variance is the price we pay, and the gift we receive.
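Unbiasedness can be checked exhaustively on this tiny dataset: average the gradient over every possible mini-batch of size 2 (all evaluated at the same θ = 1.0) and compare with the full-batch gradient. A minimal sketch, with illustrative names `example_grad` and `batch_grads`:

```python
from itertools import combinations

X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 4.0, 5.0, 8.0]
theta = 1.0

def example_grad(x, y, theta):
    return -2 * x * (y - theta * x)       # ∂ℓᵢ/∂θ for ℓᵢ = (yᵢ − θxᵢ)²

per_example = [example_grad(x, y, theta) for x, y in zip(X, Y)]
full_batch = sum(per_example) / len(per_example)

# Every possible mini-batch of size 2, all evaluated at the same θ.
batch_grads = {
    pair: (per_example[pair[0]] + per_example[pair[1]]) / 2
    for pair in combinations(range(4), 2)
}
mean_of_batches = sum(batch_grads.values()) / len(batch_grads)

print(per_example)        # [-2.0, -8.0, -12.0, -32.0]
print(full_batch)         # -13.5
print(mean_of_batches)    # -13.5, matching the full-batch gradient
```

The six individual batch gradients range from −5 to −22, which is the variance, but their mean lands exactly on −13.5, which is the unbiasedness.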

Why the Noise Helps

At first, gradient noise seems like a pure cost — we're getting an inaccurate gradient, which means we're not walking straight downhill. But three important phenomena make this noise valuable:

1. Escaping sharp minima. A sharp minimum is a narrow pit in the loss landscape — the loss is low, but only in a tiny region. Any small change to the weights makes the loss spike. These minima overfit: they memorize training data but don't generalise. Mini-batch noise effectively adds random jitter to each step. This jitter can bounce you out of a sharp pit, but not out of a wide, flat basin. So SGD naturally avoids sharp minima and settles into flat ones.

2. Flat minima generalise better. Flat minima are regions where the loss stays low even when the weights change slightly. Since test data is slightly different from training data, a flat minimum gives similar loss on both. Sharp minima give low training loss but high test loss. By favouring flat minima, SGD acts as an implicit regulariser — it reduces overfitting without you adding any explicit penalty.

3. Breaking symmetry at saddle points. We saw in Chapter 0 that the gradient is zero at saddle points. With full-batch gradient descent, you'd get stuck. But mini-batch noise means the estimated gradient is almost never exactly zero — there's always some random direction to push you off the saddle. More noise (smaller batches) means faster escape.

Mini-batch noise is a feature, not a bug. Sharp minima that fit training data perfectly but fail on test data get escaped naturally. Flat minima that generalise well are stable. This is implicit regularisation — SGD prefers flat minima because that's where the noise can't push you out. Cranking up the batch size reduces noise and can actually hurt generalisation.

See the Noise

The widget below shows gradient descent on a 2D loss landscape. You can toggle between full-batch (one clean arrow per step) and mini-batch (noisy arrows that scatter around the true direction). Change the batch size and watch the noise change.

Full-Batch vs Mini-Batch Gradients

Each arrow shows the gradient direction for one step. Full-batch always points the same way from a given location. Mini-batch arrows scatter — smaller batch = more scatter. Hit "Resample" to draw new random batches.


The Batch Size Spectrum

Batch size B controls the noise level. Here's the spectrum:

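| Batch Size | Name | Noise | Steps/Epoch | GPU Utilisation | Generalisation |
| --- | --- | --- | --- | --- | --- |
| 1 | Pure SGD | Maximum | N | Very low | Good (often too noisy) |
| 16–64 | Small mini-batch | High | N/B | Moderate | Good |
| 128–512 | Standard mini-batch | Moderate | N/B | High | Good |
| 1024–8192 | Large mini-batch | Low | N/B | Very high | Needs tuning |
| N | Full-batch (GD) | None | 1 | Maximum | Can overfit |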

In practice, batch size 32–256 is the sweet spot for most tasks. Below 16, the noise is so extreme that convergence is erratic. Above 4096, you lose the implicit regularisation benefit and need to carefully tune the learning rate (typically by scaling it linearly with batch size — the "linear scaling rule").

Shuffling matters enormously. Between epochs, you should randomly re-shuffle the data. Without shuffling, the same mini-batches repeat in the same order every epoch, so the gradient noise is no longer fresh from epoch to epoch and the updates can pick up artefacts of the data ordering (for example, a dataset sorted by class). This is a subtle problem that's easy to miss. Most frameworks make shuffling a one-flag option, but it is not always on by default: PyTorch's DataLoader, for instance, only shuffles when you pass shuffle=True. If you're implementing your own data loader, don't forget it.
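If you do write your own loader, per-epoch shuffling takes only a few lines. The generator below is a minimal sketch; `minibatches` is a hypothetical helper name, and it yields index lists rather than data so it can sit on top of any storage format.

```python
import random

def minibatches(n_examples, batch_size, rng):
    """Yield lists of indices covering every example once, in a fresh random order."""
    indices = list(range(n_examples))
    rng.shuffle(indices)                       # new permutation on every call
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(0)                         # seeded only to make the sketch repeatable
for epoch in range(2):
    batches = list(minibatches(10, 4, rng))
    print(epoch, batches)                      # the last batch is smaller when B does not divide N
```

Calling the generator once per epoch is what guarantees a new ordering each time; building the batches once outside the epoch loop would silently reuse the same ones.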

SGD From Scratch

Let's implement SGD from scratch with NumPy, then see the one-liner version. No magic, just the update rule we derived.

python
import numpy as np

def sgd_step(params, grads, lr):
    """One step of vanilla SGD, updating each parameter array in place."""
    for p, g in zip(params, grads):
        p -= lr * g   # θ ← θ − η · ∇L̃

# Example: linear regression y = θ·x
theta = np.array([1.0])   # starting weight
lr = 0.01
X = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 5, 8], dtype=float)

for epoch in range(50):
    indices = np.random.permutation(len(X))   # reshuffle every epoch
    for start in range(0, len(X), 2):         # batch_size = 2
        batch = indices[start:start + 2]
        xb, yb = X[batch], y[batch]
        pred = theta[0] * xb                  # forward pass
        error = pred - yb                     # residuals
        grad = np.mean(2 * xb * error)        # ∂L̃/∂θ on this mini-batch
        sgd_step([theta], [grad], lr)         # update via the helper above

print(f"Learned θ = {theta[0]:.4f}")    # close to the least-squares value 57/30 = 1.9

And the PyTorch equivalent — same algorithm, one line:

python
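import torch
from torch.utils.data import DataLoader, TensorDataset

# Same four points as the NumPy example, shaped (N, 1) for nn.Linear
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [5.0], [8.0]])
dataloader = DataLoader(TensorDataset(X, y), batch_size=2, shuffle=True)

model = torch.nn.Linear(1, 1, bias=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(50):
    for xb, yb in dataloader:         # dataloader handles batching + shuffling
        loss = (model(xb) - yb).pow(2).mean()
        loss.backward()               # computes ∇L̃ and stores in .grad
        optimizer.step()              # θ ← θ − η · ∇L̃
        optimizer.zero_grad()         # reset .grad to zero for next batch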

The optimizer.step() call is doing exactly what our 1-line p -= lr * g does, just for every parameter in the model. The zero_grad() call is essential — PyTorch accumulates gradients by default (adds new gradients to existing .grad), so you must zero them before each batch. Forgetting this is one of the most common PyTorch bugs.

SGD's Remaining Problems

SGD is fast and simple, but it still has one global learning rate η for all parameters. And it has no memory — each step is based entirely on the current mini-batch gradient, with no awareness of previous steps. This leads to two problems:

Problem 1: Oscillation in ravines. When the loss landscape is a narrow valley (steep walls, shallow floor), SGD bounces back and forth across the walls while making slow progress along the floor. Each mini-batch gradient has a large component pointing across the valley (high curvature direction) and a small component pointing along the valley (low curvature direction). The large component causes oscillation; the small component causes crawling.

Problem 2: No adaptation. Some parameters need large steps (small gradients, flat curvature) and others need small steps (large gradients, sharp curvature). A single η can't serve both. This is especially bad in deep networks where gradient magnitudes vary by orders of magnitude across layers.

The next chapter tackles Problem 1 with a beautifully simple idea: give the ball mass.

Why does SGD use a random subset of data instead of all data?

Chapter 2: Momentum — The Rolling Ball

SGD in a narrow valley is like a pinball — bouncing wall to wall, wasting energy oscillating sideways while barely making progress forward. What if the ball had mass?

Picture rolling a ping-pong ball down a half-pipe. It rattles side to side, never building much speed in any direction. Now picture rolling a bowling ball down the same half-pipe. Its inertia carries it past the small oscillations — the side-to-side bumps barely register because the ball's velocity is dominated by the cumulative downhill direction. That inertia is what we're going to give our optimizer.

The Ravine Problem, Precisely

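Why does SGD oscillate in narrow valleys? Consider a 2D loss surface shaped like a long, thin trough: steep in the x-direction, shallow in the y-direction. The gradient at any point inside the trough has two components: a large x-component pointing across the valley (where the walls are steep) and a small y-component pointing along the valley floor (where the slope is gentle).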

In vanilla SGD, the x-component dominates every step, causing wild zig-zagging, while the y-component (the direction we actually want to go) contributes only a tiny nudge per step. After 100 steps, the x-oscillations cancel out (equal time going left and right) and the net displacement is almost entirely in the y-direction — but agonisingly slowly.
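The zig-zag is easy to reproduce numerically. The sketch below runs plain gradient descent on a hypothetical trough f(x, y) = 25x² + 0.5y², which is 50 times more curved across (x) than along (y); the surface, the starting point, and η = 0.038 are all illustrative choices, not values from the lesson's widget.

```python
def ravine_grad(x, y):
    """Gradient of a hypothetical trough f(x, y) = 25*x**2 + 0.5*y**2."""
    return 50 * x, y            # steep across (x), shallow along (y)

eta = 0.038
x, y = 1.0, 10.0
path = [(x, y)]
for _ in range(10):
    gx, gy = ravine_grad(x, y)
    x, y = x - eta * gx, y - eta * gy
    path.append((x, y))

for px, py in path[:5]:
    print(f"x = {px:+.3f}   y = {py:.3f}")
# x flips sign on every step (a zig-zag across the walls),
# while y shrinks by only about 4% per step.
```

A larger η would make the y-direction faster but push the x-direction into divergence, so the steep direction caps the learning rate for the whole problem.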

The fix is elegant: accumulate a running average of past gradients. The x-components flip sign every step, so they cancel in the average. The y-components are consistent, so they add up. The running average amplifies the consistent signal and damps the oscillating noise. This running average is called velocity, and the technique is called momentum.

The Momentum Update Rule

Momentum adds one new variable: a velocity vector v that has the same shape as θ (one velocity per parameter). At each step:

v_t = β · v_(t−1) + g_t
θ_t = θ_(t−1) − η · v_t

where g_t = ∇L̃(θ_(t−1)) is the current mini-batch gradient. Let's unpack the new pieces:

| Symbol | Name | Meaning |
| --- | --- | --- |
| v_t | Velocity | Accumulated gradient history: the "inertia" of our ball |
| β | Momentum coefficient | How much of the old velocity we keep. Typically 0.9 |
| g_t | Current gradient | The fresh mini-batch gradient at this step |
| v₀ | Initial velocity | Zero vector: the ball starts at rest |

The first line says: "Take 90% of my previous velocity (momentum) and add the current gradient." The second line says: "Step in the direction of this combined velocity."

When β = 0, the velocity is just the current gradient — momentum disappears and we recover vanilla SGD. When β = 0.9, the velocity is a weighted blend of all past gradients, with recent ones weighted more heavily.

Momentum as Exponential Moving Average

Let's expand the recursion to see what v_t really contains. Writing it out step by step:

v₁ = β · 0 + g₁ = g₁
v₂ = β · g₁ + g₂
v₃ = β²·g₁ + β·g₂ + g₃
v_t = β^(t−1)·g₁ + β^(t−2)·g₂ + ... + β·g_(t−1) + g_t

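This is an exponentially weighted sum of past gradients, which is an exponential moving average (EMA) up to a constant factor: the weights here add up to roughly 1/(1 − β) rather than 1. The current gradient g_t has weight 1. The gradient from 1 step ago has weight β. From 2 steps ago: β². From k steps ago: β^k. Since β < 1, older gradients fade exponentially.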

How far back does the memory reach? A gradient from k steps ago contributes β^k of its original magnitude. At β = 0.9, a gradient from 10 steps ago has weight 0.9¹⁰ ≈ 0.35 — still significant. From 20 steps ago: 0.9²⁰ ≈ 0.12 — fading. From 50 steps ago: 0.9⁵⁰ ≈ 0.005 — negligible.

The effective window is approximately 1/(1 − β). At β = 0.9, the window is ~10 steps. At β = 0.99, it's ~100 steps. At β = 0.5, it's ~2 steps. This gives you a knob: higher β means longer memory, more smoothing, more inertia.
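The 1/(1 − β) rule of thumb comes from summing the geometric weights. A minimal sketch (the helper name `gradient_weights` and the 5000-term cutoff are illustrative):

```python
def gradient_weights(beta, horizon):
    """Weight that v_t places on the gradient from k steps ago, for k = 0..horizon-1."""
    return [beta ** k for k in range(horizon)]

for beta in (0.5, 0.9, 0.99):
    weights = gradient_weights(beta, 5000)
    total = sum(weights)                       # approaches 1 / (1 - beta)
    print(beta, round(total, 2), round(1 / (1 - beta), 2))

print(round(0.9 ** 10, 2), round(0.9 ** 20, 2), round(0.9 ** 50, 3))   # 0.35 0.12 0.005
```

The total weight tells you how many "full-strength" gradients the velocity is worth, which is why it doubles as an estimate of the memory length.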

The EMA intuition. Think of momentum as a weighted poll of your recent gradient history. With β = 0.9, you're asking: "What has the gradient been doing over roughly the last 10 steps?" If it's been consistently pointing south, the velocity builds up southward. If it's been alternating east/west (oscillation), those cancel out in the average and the velocity stays small in the east-west direction.

Hand Calculation: 5 Steps of Momentum

Let's trace momentum SGD on our familiar f(x) = (x − 3)². We'll start at x = 0, use η = 0.01, β = 0.9, and compare against vanilla SGD at each step. The gradient is f′(x) = 2(x − 3).

Why η = 0.01 instead of 0.1? Momentum amplifies the effective step size. With β = 0.9, the velocity builds up to roughly 1/(1 − β) = 10× a single gradient. If we kept η = 0.1, the effective step would be ~10 × 0.1 × gradient. On this parabola that still converges, but only after overshooting the minimum by a wide margin and ringing back and forth around it. So we reduce η to compensate. This is a standard practice: when you turn on momentum, reduce the learning rate.

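Step 1:

g₁ = 2(0 − 3) = −6
Momentum: v₁ = 0.9 × 0 + (−6) = −6, θ₁ = 0 − 0.01 × (−6) = 0.060
Vanilla: θ₁ = 0 − 0.01 × (−6) = 0.060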

Identical so far — with v₀ = 0, the first step is the same.

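Step 2:

Momentum: g₂ = 2(0.060 − 3) = −5.88, v₂ = 0.9 × (−6) + (−5.88) = −11.28, θ₂ = 0.060 − 0.01 × (−11.28) = 0.1728
Vanilla: g = 2(0.060 − 3) = −5.88, θ₂ = 0.060 − 0.01 × (−5.88) = 0.1188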

Now momentum is ahead. The velocity v₂ = −11.28 is almost double the current gradient (−5.88) because it's accumulated the previous gradient too. The momentum step (0.1128) is almost twice the vanilla step (0.0588).

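Step 3:

Momentum: g₃ = 2(0.1728 − 3) ≈ −5.654, v₃ = 0.9 × (−11.28) + (−5.654) ≈ −15.806, θ₃ ≈ 0.1728 + 0.1581 ≈ 0.331
Vanilla: g = 2(0.1188 − 3) ≈ −5.762, θ₃ ≈ 0.1188 + 0.0576 ≈ 0.176

Step 4:

Momentum: g₄ = 2(0.3309 − 3) ≈ −5.338, v₄ = 0.9 × (−15.806) + (−5.338) ≈ −19.564, θ₄ ≈ 0.3309 + 0.1956 ≈ 0.527
Vanilla: g = 2(0.1764 − 3) ≈ −5.647, θ₄ ≈ 0.1764 + 0.0565 ≈ 0.233

Step 5:

Momentum: g₅ = 2(0.5265 − 3) ≈ −4.947, v₅ = 0.9 × (−19.564) + (−4.947) ≈ −22.555, θ₅ ≈ 0.5265 + 0.2256 ≈ 0.752
Vanilla: g = 2(0.2329 − 3) ≈ −5.534, θ₅ ≈ 0.2329 + 0.0553 ≈ 0.288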

Summary after 5 steps:

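| Step | Momentum θ | Vanilla θ | Momentum Velocity |
| --- | --- | --- | --- |
| 0 | 0.000 | 0.000 | 0.000 |
| 1 | 0.060 | 0.060 | −6.000 |
| 2 | 0.173 | 0.119 | −11.280 |
| 3 | 0.331 | 0.176 | −15.806 |
| 4 | 0.527 | 0.233 | −19.564 |
| 5 | 0.752 | 0.288 | −22.555 |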

After 5 steps, momentum has reached θ = 0.752 while vanilla SGD is only at θ = 0.288. Momentum is 2.6× further along. And the velocity is still building — each step is larger than the last because the gradient has been consistently negative (pointing toward the minimum at x = 3). In a consistent direction, momentum accelerates.
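The summary table can be regenerated with a short loop that runs both optimizers side by side. This is a sketch of the two update rules on the toy parabola; the variable names are illustrative.

```python
def grad(x):
    return 2 * (x - 3)                 # f'(x) for f(x) = (x - 3)^2

eta, beta = 0.01, 0.9
x_mom, v = 0.0, 0.0                    # momentum run starts at rest
x_sgd = 0.0                            # vanilla run for comparison

print("step  momentum  vanilla  velocity")
for step in range(1, 6):
    v = beta * v + grad(x_mom)         # v_t = β · v_(t−1) + g_t
    x_mom = x_mom - eta * v            # θ_t = θ_(t−1) − η · v_t
    x_sgd = x_sgd - eta * grad(x_sgd)  # θ ← θ − η · g
    print(f"{step:>4}  {x_mom:8.3f}  {x_sgd:7.3f}  {v:8.3f}")
```

The final row should read 0.752, 0.288 and −22.555, matching step 5 of the table. Extending the range shows the velocity continuing to grow for a while before the shrinking gradient catches up.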

See It: SGD vs Momentum

The widget below shows side-by-side trajectories on an elongated ravine. Vanilla SGD (left) zig-zags across the narrow dimension. Momentum (right) smooths out the oscillations and accelerates along the floor. Adjust β and watch the behaviour change.

SGD vs Momentum — Side by Side

Both optimizers start at the same point on a ravine-shaped loss surface. Watch how momentum damps the zig-zag and accelerates toward the minimum.


What β Controls

The momentum coefficient β is the second critical hyperparameter (after the learning rate). Here's what happens at different values:

β = 0: No momentum. v_t = g_t, and we recover vanilla SGD. The ball has no mass — each step depends only on the current gradient.

β = 0.5: Light momentum. Effective window ≈ 2 steps. Some smoothing, but the optimizer is still quite reactive to individual gradients. Rarely used in practice.

β = 0.9: Standard momentum. Effective window ≈ 10 steps. This is the conventional choice in most deep learning setups (note that it usually has to be set explicitly; PyTorch's SGD, for example, defaults to momentum=0). It provides significant smoothing while still responding reasonably quickly to changes in the gradient direction. When someone says "SGD with momentum," they almost always mean β = 0.9.

β = 0.99: Heavy momentum. Effective window ≈ 100 steps. The optimizer is very smooth but very slow to change direction. If the gradient reverses (you've passed the minimum), it takes ~100 steps for the velocity to reverse. This can cause severe overshoot. Used occasionally with very noisy gradients where extreme smoothing is needed.

β = 0.999: Almost never used for vanilla momentum (but this value appears in Adam's second moment — we'll see why later). The velocity barely responds to new information. 1000-step memory.

Nesterov Accelerated Gradient

Momentum has a subtle flaw: it evaluates the gradient at the current position θ, then adds it to the velocity. But if the velocity is large, the next position will be far from θ — so the gradient at θ might be a poor estimate of the gradient at the place we're actually going to land.

Nesterov momentum (NAG) fixes this with a clever trick: first take a step in the direction of the current velocity (the "lookahead"), then compute the gradient at that lookahead point. The analogy: instead of standing still and looking downhill to decide your next step, take a running start in the direction you've been going, and then look downhill from where you've arrived.

The equations:

θ_lookahead = θ_(t−1) − η · β · v_(t−1)
g = ∇L(θ_lookahead)
v_t = β · v_(t−1) + g
θ_t = θ_(t−1) − η · v_t

The only difference from classical momentum is in where we evaluate the gradient. Classical evaluates at θ_(t−1) (where we are). Nesterov evaluates at θ_(t−1) − η·β·v_(t−1) (where we're about to go). This "lookahead" makes Nesterov more responsive to curvature changes.

Why does this help? Consider what happens near a minimum. Classical momentum has built up a large velocity pointing toward (and past) the minimum. It evaluates the gradient at the current position, which might still say "keep going." So it overshoots. Nesterov first jumps ahead to where it would land, computes the gradient there (which says "you've gone too far, come back"), and uses that to correct the velocity before committing to the step. It's a form of error correction built into the update.

Classical vs Nesterov, step by step.
Classical: (1) compute gradient at current position → (2) update velocity → (3) step.
Nesterov: (1) jump ahead by the old velocity → (2) compute gradient at the lookahead position → (3) update velocity → (4) step.

The cost is the same (one gradient computation per step). The benefit is better gradient information — we're asking "what's the slope where I'm about to land?" instead of "what's the slope where I am now?"
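The effect of the lookahead can be measured on the toy parabola by recording how far each method travels past the minimum at x = 3. This is a sketch of the two sets of equations above; η = 0.05 and 300 steps are hypothetical settings picked so that both runs visibly overshoot and then settle.

```python
def grad(x):
    return 2 * (x - 3)                         # f'(x) for f(x) = (x - 3)^2

def run(nesterov, eta=0.05, beta=0.9, steps=300):
    x, v, xs = 0.0, 0.0, [0.0]
    for _ in range(steps):
        # Nesterov evaluates the gradient at the lookahead point,
        # classical momentum at the current position.
        point = x - eta * beta * v if nesterov else x
        v = beta * v + grad(point)
        x = x - eta * v
        xs.append(x)
    return xs

classical = run(nesterov=False)
nag = run(nesterov=True)
print(round(max(classical), 2), round(max(nag), 2))   # furthest point reached past x = 3
print(round(classical[-1], 4), round(nag[-1], 4))     # both settle at the minimum
```

With these settings the classical run swings out to roughly x ≈ 4.8 while the Nesterov run peaks near x ≈ 4.1. The only code difference is where `grad` is evaluated, exactly as in the equations.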

Nesterov Visualised

The widget below shows how the lookahead step changes the trajectory near a minimum. Classical momentum overshoots and oscillates. Nesterov anticipates the overshoot and corrects earlier.

Nesterov Lookahead

Watch the ghost position (lookahead point) peek ahead of the ball. Near the minimum, the lookahead's gradient says "you've gone too far" and corrects the velocity before the actual step.

Momentum's Dark Side: Overshoot

Momentum is not free. The same inertia that helps you accelerate through consistent slopes also makes it hard to stop.

When the gradient direction reverses — you've passed the minimum and the slope now points the other way — the velocity still carries you in the old direction. How long until the velocity reverses? The velocity decays by a factor of β each step (in the absence of reinforcing gradients), so it takes roughly 1/(1 − β) steps to die off. At β = 0.9, that's ~10 steps of overshoot. At β = 0.99, it's ~100 steps. That's a lot of wasted motion.

Momentum's dark side: it overshoots. When the gradient direction reverses (you've passed the minimum), momentum keeps pushing in the old direction. At β = 0.99, it takes ~100 steps to turn around. This is why Nesterov helps — it peeks ahead before committing, catching the reversal earlier. And it's why β = 0.9 is the standard default: enough memory to smooth noise, not so much that course-correction is slow.

In practice, momentum overshoot manifests as the loss increasing temporarily after a decrease. You'll see the training loss curve dip, then bump up, then dip lower — a characteristic sawtooth pattern. This is normal with momentum and doesn't mean anything is wrong. If the bumps grow rather than shrink, your learning rate is too high.

The Smoothing Interpretation

Momentum β = 0.9 means the optimizer keeps 90% of its previous velocity at every step, so once the velocity has built up, a single fresh gradient accounts for only about a tenth of it. The optimizer is saying: "I've been going this way for a while, and one noisy gradient won't change my mind." This is exactly the smoothing that SGD needs. In a ravine, the oscillating component (across the valley) flips sign every step, so it largely cancels in the running sum. The consistent component (along the valley) doesn't flip, so it accumulates. Momentum automatically amplifies the signal and damps the noise, without knowing anything about the landscape shape.

This is why momentum works so well in practice: it doesn't need to know which directions oscillate and which are consistent. The exponential average figures it out automatically from the gradient history. Consistent directions accumulate velocity; oscillating directions cancel. It's a beautifully simple algorithm that solves the ravine problem with just one extra hyperparameter and one extra vector of memory.

Momentum From Scratch

python
import numpy as np

def sgd_momentum(params, grads, velocities, lr, beta=0.9):
    """One step of SGD with momentum."""
    for i in range(len(params)):
        velocities[i] = beta * velocities[i] + grads[i]   # accumulate
        params[i] -= lr * velocities[i]                    # step

def sgd_nesterov(params, grads_fn, velocities, lr, beta=0.9):
    """One step of Nesterov momentum.
    grads_fn(params) returns gradients evaluated at given params."""
    # Step 1: lookahead
    lookahead = [p - lr * beta * v for p, v in zip(params, velocities)]
    # Step 2: gradient at lookahead
    grads = grads_fn(lookahead)
    # Step 3: update velocity and params
    for i in range(len(params)):
        velocities[i] = beta * velocities[i] + grads[i]
        params[i] -= lr * velocities[i]

And the PyTorch one-liner:

python
import torch

model = torch.nn.Linear(1, 1, bias=False)   # any nn.Module works here

# Classical momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Nesterov momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

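Notice that PyTorch uses the exact same SGD class for all three variants: vanilla (momentum=0), classical momentum, and Nesterov. The only difference is the flags you pass. Under the hood, classical momentum matches the equations we derived above: one velocity buffer per parameter, one multiply-accumulate per step. With nesterov=True, PyTorch uses a common reformulation that takes the gradient at the current parameters and then steps along g_t + β·v_t, rather than evaluating the gradient at a separate lookahead point. The idea is the same, but the bookkeeping differs, so do not expect step-for-step identical numbers.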

When to Use What

| Variant | When to Use | Typical Settings |
| --- | --- | --- |
| Vanilla SGD | Simple problems, debugging, when you want the simplest baseline | η = 0.001–0.1 |
| SGD + Momentum | Most vision tasks (ResNet, ConvNets), when you want strong generalisation | η = 0.1, β = 0.9, with LR schedule |
| SGD + Nesterov | Same as momentum but slightly better convergence; default recommendation | η = 0.1, β = 0.9, with LR schedule |

SGD with Nesterov momentum remains a strong default for supervised image classification with convolutional networks (for example, ResNets on ImageNet). It often generalises better than Adam in those settings, a point we return to in the Adam chapter. The catch: it requires careful learning rate scheduling (warmup plus cosine decay is a common choice). Adam is more forgiving of the learning rate choice, which is one reason it dominates in NLP and generative models, where hyperparameter tuning budgets are limited.

What Momentum Doesn't Solve

Momentum fixes the oscillation problem: it damps zig-zagging in ravines and accelerates along consistent slopes. But it still uses a single global learning rate for all parameters. A parameter that consistently receives tiny gradients (for example, in an early layer far from the loss) and a parameter that receives huge gradients (for example, in the final layer) both get the same η.

What if we could give each parameter its own learning rate, automatically scaled by how large its gradients typically are? That's exactly what adaptive learning rate methods do — and we'll build the first one, AdaGrad, in the next chapter.

What happens to a momentum optimizer when the gradient direction suddenly reverses?

Chapter 3: AdaGrad & RMSProp — Per-Parameter Learning Rates

Some weights update frequently. Others barely update at all. A single learning rate can't serve both.

Picture a word embedding matrix with 50,000 rows — one row per word. During training, the word "the" appears in almost every batch. Its embedding row gets a gradient update on nearly every step. But the word "serendipitous" appears once every thousand batches. Its row sits frozen for hundreds of steps, then gets a single gradient.

If you tune the learning rate for "the" — small enough that it doesn't oscillate — it's far too small for "serendipitous." That rare word needs a big step to learn anything at all in its few chances. But if you crank up the learning rate for the rare words, the frequent words overshoot wildly.

This is the parameter heterogeneity problem. It's not just an NLP thing. In any neural network, some parameters sit in busy pathways (getting updated constantly) while others sit in quiet corners (updated rarely). A single global learning rate is a compromise that serves nobody well.

The insight: What if we tracked how much each parameter has been updated, and gave less-updated parameters bigger steps? Parameters that have already received lots of gradient signal get a smaller learning rate. Parameters that have barely been touched keep a larger one. The learning rate adapts per parameter.

AdaGrad: Track the Cumulative History

AdaGrad (Adaptive Gradient, Duchi et al., 2011) implements this idea with a beautifully simple mechanism. For each parameter, it keeps a running sum of all squared gradients it has ever received.

Here's the rule. At each step t, for each parameter:

G_t = G_{t-1} + g_t²

Where g_t is the gradient at step t. G is a running sum of squared gradients, so it only ever grows. Then update the parameter:

θ_t = θ_{t-1} − (η / √(G_t + ε)) · g_t

Where η is the base learning rate and ε (typically 1e-8) prevents division by zero. The key quantity is the effective learning rate:

effective LR = η / √(G_t + ε)

A parameter that has received many large gradients has a large G, so its effective LR is small. A parameter that has received few or small gradients has a small G, so its effective LR stays large. The learning rate adapts automatically, per parameter, based on history.

Hand Calculation: Two Parameters, Four Steps

Let's see this in action. Two parameters, η=0.1, ε=1e-8 (we'll ignore ε in the display since it's negligible).

Parameter A — frequent, consistent gradients: [3.0, 2.5, 2.8, 3.1]

Parameter B — rare, then a spike: [0.0, 0.0, 0.0, 5.0]

Step 1:

A: G = 0 + 3.0² = 9.0 → effective LR = 0.1/√9.0 = 0.1/3.0 = 0.0333 → update = 0.0333 × 3.0 = 0.100

B: G = 0 + 0.0² = 0.0 → update = 0 (gradient is zero)

Step 2:

A: G = 9.0 + 2.5² = 9.0 + 6.25 = 15.25 → effective LR = 0.1/√15.25 = 0.1/3.906 = 0.0256

B: G = 0 → update = 0

Step 3:

A: G = 15.25 + 2.8² = 15.25 + 7.84 = 23.09 → effective LR = 0.1/√23.09 = 0.1/4.805 = 0.0208

B: G = 0 → update = 0

Step 4:

A: G = 23.09 + 3.1² = 23.09 + 9.61 = 32.70 → effective LR = 0.1/√32.70 = 0.1/5.718 = 0.0175

B: G = 0 + 5.0² = 25.0 → effective LR = 0.1/√25.0 = 0.1/5.0 = 0.0200

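| Step | A: G | A: eff. LR | B: G | B: eff. LR |
| --- | --- | --- | --- | --- |
| 1 | 9.00 | 0.0333 | 0.00 | n/a |
| 2 | 15.25 | 0.0256 | 0.00 | n/a |
| 3 | 23.09 | 0.0208 | 0.00 | n/a |
| 4 | 32.70 | 0.0175 | 25.00 | 0.0200 |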

Look at step 4. After four rounds of updates, parameter A's effective LR has decayed from 0.0333 to 0.0175 — nearly halved. But parameter B, receiving its very first real gradient, gets an effective LR of 0.0200 — larger than A's despite appearing later. When B finally sees a gradient, AdaGrad gives it a big step. That's the feature.
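
The table can be reproduced with a short script. This is a sketch for checking the arithmetic, with a hypothetical helper name (`adagrad_effective_lrs`); it places ε inside the square root, as the formula above does.

```python
import math

def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return (G, effective LR) after each step for a single parameter."""
    G, history = 0.0, []
    for g in grads:
        G += g * g                                   # running sum of squared gradients
        history.append((G, lr / math.sqrt(G + eps)))
    return history

hist_a = adagrad_effective_lrs([3.0, 2.5, 2.8, 3.1])   # parameter A
hist_b = adagrad_effective_lrs([0.0, 0.0, 0.0, 5.0])   # parameter B

for step, ((ga, lra), (gb, lrb)) in enumerate(zip(hist_a, hist_b), start=1):
    print(f"step {step}: A G={ga:6.2f} lr={lra:.4f} | B G={gb:5.2f} lr={lrb:.4f}")
```

For parameter B the script prints a very large effective LR while G is still zero (η/√ε). That number never matters, because it multiplies a zero gradient and the update is zero.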

AdaGrad's Fatal Flaw

Stare at G in that table. It goes 9 → 15 → 23 → 33. It only grows. It can never shrink. After 100 steps, G might be 900. After 10,000 steps, G might be 90,000. The effective LR becomes:

0.1 / √90,000 = 0.1 / 300 = 0.000333

After a million steps? The effective LR is essentially zero and training stalls. The optimizer has accumulated so much history that it can no longer take meaningful steps in any direction.

This is fine for convex problems (where you're approaching a fixed minimum and want to slow down). But deep learning is non-convex, runs for millions of steps, and the loss landscape changes as other parameters update. A learning rate that decays to zero is catastrophic.
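
To see how fast the decay bites, assume (purely for illustration) that a parameter keeps receiving gradients of constant magnitude 3.0. Then G after n steps is 9n and the effective LR falls as 1/√n. The function name below is hypothetical.

```python
import math

def adagrad_lr_after(n_steps, grad=3.0, lr=0.1):
    """Effective AdaGrad LR after n_steps of a constant-magnitude gradient."""
    G = n_steps * grad ** 2          # running sum of identical squared gradients
    return lr / math.sqrt(G)

for n in (100, 10_000, 1_000_000):
    print(f"{n:>9,} steps: effective LR = {adagrad_lr_after(n):.6f}")
```

Every 100× increase in step count costs another factor of 10 in step size, and nothing in the algorithm can ever undo it.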

RMSProp: Forget the Past

RMSProp (Root Mean Square Propagation) was proposed by Geoffrey Hinton in a Coursera lecture — not even a published paper! It fixes AdaGrad with one simple change: instead of summing all squared gradients forever, use an exponential moving average.

G_t = ρ · G_{t-1} + (1 − ρ) · g_t²

Where ρ (rho, typically 0.9 or 0.99) controls how fast old gradients fade. With ρ=0.9, the contribution of a gradient from 10 steps ago is 0.9^10 ≈ 0.35 of its original weight. From 50 steps ago: 0.9^50 ≈ 0.005. Practically gone.

The update rule is identical to AdaGrad:

θ_t = θ_{t-1} − (η / √(G_t + ε)) · g_t

But now G is a running average instead of a running sum. It stays bounded. Old history fades. The effective learning rate remains healthy indefinitely.

Hand Calculation: RMSProp with ρ=0.9

Same parameter A, same gradients [3.0, 2.5, 2.8, 3.1], η=0.1, ρ=0.9:

Step 1: G = 0.9 × 0 + 0.1 × 9.0 = 0.9 → eff. LR = 0.1/√0.9 = 0.1/0.949 = 0.1054

Step 2: G = 0.9 × 0.9 + 0.1 × 6.25 = 0.81 + 0.625 = 1.435 → eff. LR = 0.1/1.198 = 0.0835

Step 3: G = 0.9 × 1.435 + 0.1 × 7.84 = 1.292 + 0.784 = 2.076 → eff. LR = 0.1/1.441 = 0.0694

Step 4: G = 0.9 × 2.076 + 0.1 × 9.61 = 1.868 + 0.961 = 2.829 → eff. LR = 0.1/1.682 = 0.0595

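| Step | G (AdaGrad) | G (RMSProp) | eff. LR (AdaGrad) | eff. LR (RMSProp) |
| --- | --- | --- | --- | --- |
| 1 | 9.00 | 0.90 | 0.0333 | 0.1054 |
| 2 | 15.25 | 1.44 | 0.0256 | 0.0835 |
| 3 | 23.09 | 2.08 | 0.0208 | 0.0694 |
| 4 | 32.70 | 2.83 | 0.0175 | 0.0595 |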

AdaGrad's G: 9 → 15 → 23 → 33. Monotonically increasing, forever. RMSProp's G: 0.9 → 1.4 → 2.1 → 2.8. Bounded. After 10,000 steps of similar gradients, RMSProp's G would settle at roughly 8 to 9 (the EMA of g²), not the roughly 90,000 that AdaGrad accumulates.
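
A quick way to see the long-run behaviour is to keep feeding both accumulators the same kind of gradients. The sketch below simply cycles parameter A's four gradients for 10,000 steps. That repetition is an assumption made for the demo, not something a real training run would do.

```python
grads = [3.0, 2.5, 2.8, 3.1]
rho, steps = 0.9, 10_000

g_ada = 0.0   # AdaGrad: running sum of squared gradients
g_rms = 0.0   # RMSProp: exponential moving average of squared gradients
for t in range(steps):
    g2 = grads[t % len(grads)] ** 2
    g_ada += g2
    g_rms = rho * g_rms + (1 - rho) * g2

print(f"AdaGrad G after {steps} steps: {g_ada:,.0f}")
print(f"RMSProp G after {steps} steps: {g_rms:.2f}")
```

AdaGrad's accumulator ends in the tens of thousands, while RMSProp's stays a little above 8, inside the range of the individual g² values it is averaging.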

AdaGrad vs RMSProp on an Elliptical Landscape

A 2D loss surface where one axis has steep gradients and the other has gentle ones. The bars on the right show the effective learning rate for each axis.

Historical context: AdaGrad was designed for sparse data — like NLP, where some features (words) appear rarely. It's perfect for online learning and convex problems. But deep learning is non-convex and runs for millions of steps. RMSProp, proposed by Hinton in Lecture 6e of his Coursera course (not even a paper!), fixed this with one simple change. It became the de facto optimizer before Adam arrived.
Common confusion: AdaGrad and RMSProp don't add momentum. They only scale the learning rate per parameter. The direction is still the raw gradient — no smoothing, no velocity. This is why Adam (next chapter) combines them with momentum. Adaptive LR ≠ momentum.

From-Scratch Implementation

python
# AdaGrad — from scratch
def adagrad_step(params, grads, cache, lr=0.01, eps=1e-8):
    for i in range(len(params)):
        cache[i] += grads[i] ** 2                # accumulate squared grads
        params[i] -= lr * grads[i] / (cache[i] ** 0.5 + eps)

# RMSProp — from scratch
def rmsprop_step(params, grads, cache, lr=0.01, rho=0.9, eps=1e-8):
    for i in range(len(params)):
        cache[i] = rho * cache[i] + (1 - rho) * grads[i] ** 2  # EMA
        params[i] -= lr * grads[i] / (cache[i] ** 0.5 + eps)

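The only difference: += (sum) vs = rho * old + (1-rho) * new (EMA). A one-line change fixes AdaGrad's fatal flaw. Note that both functions add eps outside the square root, as PyTorch does, whereas the formulas above place ε inside it; the two conventions differ noticeably only when G is tiny.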

python
# PyTorch equivalents
optimizer_ag = torch.optim.Adagrad(model.parameters(), lr=0.01)
optimizer_rms = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9)
Effective Learning Rate Over Time

Both start with η=0.1. AdaGrad's effective LR decays toward zero. RMSProp's stabilizes.

1. Compute gradient g_t: standard backprop, same for both.
2. Update G_t: AdaGrad uses G += g²; RMSProp uses G = ρG + (1-ρ)g².
3. Compute the effective LR: η / √(G + ε).
4. Update θ: θ ← θ − effective_LR × g_t.
Why does AdaGrad's learning rate eventually become too small?

Chapter 4: Adam — The Best of Both Worlds

Momentum smooths the gradient direction. RMSProp scales the learning rate per parameter. What if we did both at once?

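That's exactly what Adam (Adaptive Moment Estimation, Kingma & Ba, 2015) does. It maintains two running averages per parameter: m, an EMA of the gradients (the momentum direction), and v, an EMA of the squared gradients (the adaptive scaling).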

The Full Update Rule

At each step t:

m_t = β1 · m_{t-1} + (1 − β1) · g_t
v_t = β2 · v_{t-1} + (1 − β2) · g_t²
m̂_t = m_t / (1 − β1^t)
v̂_t = v_t / (1 − β2^t)
θ_t = θ_{t-1} − η · m̂_t / (√v̂_t + ε)

That's five lines. Let's decode every piece:

| Symbol | Name | Default | Role |
| --- | --- | --- | --- |
| m_t | First moment estimate | n/a | EMA of gradients (momentum direction) |
| v_t | Second moment estimate | n/a | EMA of squared gradients (adaptive scaling) |
| β1 | First moment decay | 0.9 | How much gradient history to keep |
| β2 | Second moment decay | 0.999 | How much squared-gradient history to keep |
| η | Learning rate | 0.001 | Base step size |
| ε | Epsilon | 1e-8 | Prevents division by zero |
| m̂, v̂ | Bias-corrected estimates | n/a | Fix the initialization bias (explained below) |

Why Bias Correction?

Both m and v are initialised to zero. At step 1:

m1 = 0.9 × 0 + 0.1 × g1 = 0.1 · g1

The true mean of the gradient is roughly g1, but m1 = 0.1 · g1. It's biased toward zero because we initialised m0 = 0 and the EMA hasn't had enough terms to wash out the zero start.

How bad is the bias? At step t, the expected value of mt is:

𝔼[m_t] = (1 − β1^t) · 𝔼[g]

At t=1 with β1=0.9: factor = 1 − 0.9^1 = 0.1. The estimate is 10% of the true value! At t=10: factor = 1 − 0.9^10 = 1 − 0.349 = 0.651. Still 35% too low. The fix is to divide by the factor:

m̂_t = m_t / (1 − β1^t)

At t=1: m̂_1 = 0.1·g_1 / 0.1 = g_1. The bias is gone. At t=10: m̂_10 = m_10 / 0.651. Still a meaningful correction. At t=1000: 1 − 0.9^1000 ≈ 1.0. The correction disappears.

The same logic applies to v_t, but β2 = 0.999 so the bias lasts much longer: it takes about 1000 steps before (1 − 0.999^t) ≈ 0.63. Without bias correction, the first ~1000 steps of Adam would have wildly inflated step sizes (because v is too small → η/√v is too large).
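
The correction factors quoted above are quick to tabulate. The helper name `bias_factor` is hypothetical; the sketch only evaluates 1 − β^t for the two default decay rates.

```python
def bias_factor(beta, t):
    """Fraction of the true mean a zero-initialised EMA has recovered by step t."""
    return 1 - beta ** t

for t in (1, 10, 100, 1000, 5000):
    print(f"t={t:>4}: m factor = {bias_factor(0.9, t):.3f}, "
          f"v factor = {bias_factor(0.999, t):.3f}")
```

The first-moment factor is essentially 1 by step 100, while the second-moment factor is still only about 0.63 at step 1000, which is why the v correction matters for so much longer.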

Hand Calculation: 3 Steps of Adam

One parameter, gradients g = [4.0, -1.0, 3.0], η=0.001, β1=0.9, β2=0.999, ε=1e-8.

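Step 1 (g1 = 4.0):

m_1 = 0.9 × 0 + 0.1 × 4.0 = 0.4
v_1 = 0.999 × 0 + 0.001 × 4.0² = 0.016
m̂_1 = 0.4 / (1 − 0.9) = 4.0
v̂_1 = 0.016 / (1 − 0.999) = 16.0
update = 0.001 × 4.0 / (√16.0 + ε) = 0.001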

Notice something remarkable? The update is exactly 0.001 = η. With bias correction, Adam's first step is approximately η regardless of the gradient magnitude. Large gradient? m̂ is large but so is √v̂, and they cancel. This is why Adam's default η=0.001 "just works" across very different problems — the effective step size is automatically normalised.

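Step 2 (g2 = -1.0):

m_2 = 0.9 × 0.4 + 0.1 × (−1.0) = 0.26
v_2 = 0.999 × 0.016 + 0.001 × 1.0 = 0.016984
m̂_2 = 0.26 / (1 − 0.9²) = 0.26 / 0.19 = 1.368
v̂_2 = 0.016984 / (1 − 0.999²) = 0.016984 / 0.001999 = 8.496
update = 0.001 × 1.368 / √8.496 = 0.001 × 1.368 / 2.915 = 0.000469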

The gradient flipped sign (from +4 to -1). The momentum m is still positive (0.26) — it remembers the +4 from step 1. But it's moving toward zero because the new gradient is negative. The step size dropped from 0.001 to 0.000469 because the momentum is conflicted (positive from history, but latest gradient says negative).

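Step 3 (g3 = 3.0):

m_3 = 0.9 × 0.26 + 0.1 × 3.0 = 0.534
v_3 = 0.999 × 0.016984 + 0.001 × 9.0 = 0.025967
m̂_3 = 0.534 / (1 − 0.9³) = 0.534 / 0.271 = 1.970
v̂_3 = 0.025967 / (1 − 0.999³) = 0.025967 / 0.002997 = 8.664
update = 0.001 × 1.970 / √8.664 = 0.001 × 1.970 / 2.943 = 0.000669

The gradient is positive again, so the momentum recovers and the step grows back toward η, though it has not yet reached the full 0.001.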
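
This hand calculation can be replayed in code as a check. The sketch below is a single-parameter trace with a hypothetical helper name (`adam_trace`); it follows the five-line update rule with the defaults stated above.

```python
import math

def adam_trace(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Run Adam on one scalar parameter and record every intermediate value."""
    m = v = 0.0
    rows = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g            # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g        # second moment (squared gradients)
        m_hat = m / (1 - b1 ** t)            # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)            # bias-corrected second moment
        update = lr * m_hat / (math.sqrt(v_hat) + eps)
        rows.append((t, m, v, m_hat, v_hat, update))
    return rows

trace = adam_trace([4.0, -1.0, 3.0])
for t, m, v, m_hat, v_hat, update in trace:
    print(f"t={t}: m={m:.3f} v={v:.6f} m_hat={m_hat:.3f} "
          f"v_hat={v_hat:.3f} update={update:.6f}")
```

The printed updates are the amounts subtracted from θ at each step, so a positive value means the parameter moves in the negative direction.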

Adam Internals Visualiser

Watch how mt, vt, their bias-corrected versions, and the final update evolve over steps. Drag the step slider or change β values.

Adam From Scratch

python
import numpy as np

def adam_step(params, grads, m, v, t, lr=1e-3,
              b1=0.9, b2=0.999, eps=1e-8):
    """One step of Adam. t is the 1-based step counter."""
    for i in range(len(params)):
        m[i] = b1 * m[i] + (1 - b1) * grads[i]          # momentum
        v[i] = b2 * v[i] + (1 - b2) * grads[i] ** 2     # RMSProp
        m_hat = m[i] / (1 - b1 ** t)                     # bias correction
        v_hat = v[i] / (1 - b2 ** t)
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)  # update

PyTorch:

python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

What Bias Correction Looks Like

The widget below shows two trajectories: one with bias correction (the real Adam) and one without (what happens if you skip the m̂/v̂ correction). Without correction, the first ~50 steps take wildly wrong step sizes.

Bias Correction Comparison

Teal = with correction (Adam). Red = without correction (naive). Watch how the naive version takes oversized early steps.

Why Adam Dominates

Adam became the default optimizer for most of deep learning because of three properties:

1. Robust to learning rate choice. The m/√v normalisation means the effective step size is approximately η regardless of gradient scale. You can use η=0.001 on problems with tiny gradients or huge gradients and it works reasonably well. SGD with momentum requires tuning η over orders of magnitude.

2. Fast initial progress. The momentum term m smooths noisy gradients, while the adaptive scaling v ensures no parameter gets stuck with a too-large or too-small learning rate. The combination converges quickly in the early phase of training.

3. Handles sparse gradients. Like RMSProp, Adam gives larger effective learning rates to parameters that receive rare or small gradients. This is critical in NLP (rare tokens), recommendation systems (rare items), and any domain with sparse features.

When Adam struggles. Adam's fast early progress can be deceptive. On some tasks (notably ImageNet with ResNets), Adam converges quickly to a decent solution but then plateaus at a worse final accuracy than SGD+momentum with careful scheduling. The working hypothesis: Adam's adaptive per-parameter rates interfere with the implicit regularisation that SGD's uniform updates provide. This is an active area of research and the motivation behind AdamW, which we'll see next.
Why does Adam apply bias correction to m and v?

Chapter 5: AdamW — Decoupled Weight Decay

Adam has a subtle bug when combined with weight decay. To understand it, we first need to understand what weight decay is and why it matters.

What Is Weight Decay?

Large weights are suspicious. A model with weights of magnitude 1000 is fitting the training data with extreme precision — memorising noise, not learning patterns. Weight decay is a gentle force that pulls all weights toward zero, penalising complexity.

The simplest form is L2 regularisation: add a penalty proportional to the sum of squared weights:

Ltotal = Ldata + (λ/2) · ||θ||²

where λ is the regularisation strength. The gradient of the penalty term is:

∇ (λ/2 · ||θ||²) = λ · θ

So the gradient becomes gt + λθ, and the SGD update becomes:

θt = θt-1 - η · (gt + λθt-1) = (1 - ηλ) · θt-1 - η · gt

That factor (1 - ηλ) shrinks the weights slightly each step. With η=0.01 and λ=0.01, each weight is multiplied by 0.9999 per step — a gentle 0.01% decay. Over thousands of steps, this keeps weights from exploding.
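
A 0.01% shrink per step sounds negligible, but it compounds. As a rough illustration, assume the data gradient is zero so that only the decay acts (an assumption made to isolate the effect; the function name is hypothetical).

```python
def remaining_fraction(n_steps, lr=0.01, wd=0.01):
    """Fraction of a weight left after n_steps of decay alone."""
    return (1 - lr * wd) ** n_steps

for n in (1, 1_000, 10_000, 100_000):
    print(f"{n:>7,} steps: weight scaled by {remaining_fraction(n):.4f}")
```

After 10,000 steps the weight is down to roughly 37% of its starting value, so in practice the decay and the data gradient settle into a balance rather than the weights collapsing to zero.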

The L2 vs Decoupled Problem

For SGD, L2 regularisation and weight decay are mathematically identical: adding λθ to the gradient has the same effect as shrinking θ by (1 - ηλ). But for Adam, they are NOT the same.

When you add the L2 term λθ to the gradient in Adam, it gets fed into both the first moment m and the second moment v. The adaptive scaling then modifies the effective weight decay differently for each parameter. Parameters with large accumulated gradients (large v) get less weight decay. Parameters with small accumulated gradients get more weight decay. The regularisation becomes non-uniform and unpredictable.

Let's see this concretely. With L2 in Adam, the effective weight decay for parameter i is:

effective decay ∝ λθi / √(vi)

A busy parameter (large vi) gets its decay divided by a large number. A quiet parameter (small vi) gets its decay amplified. This is backwards! Busy parameters — often in the final layers — tend to have larger weights and need more regularisation, not less.

The AdamW Fix

Loshchilov & Hutter (2019) proposed a simple fix: decouple the weight decay from the adaptive gradient scaling. Instead of adding λθ to the gradient (which then goes through Adam's machinery), apply weight decay directly to the parameter update:

m_t = β1 · m_{t-1} + (1 − β1) · g_t
v_t = β2 · v_{t-1} + (1 − β2) · g_t²
θ_t = (1 − ηλ) · θ_{t-1} − η · m̂_t / (√v̂_t + ε)

The key change: gt is the pure data gradient (no L2 term added). The weight decay λθ is applied directly by multiplying θ by (1 - ηλ) outside of Adam's m/v machinery. Now every parameter gets exactly the same proportional decay, regardless of its gradient history.

Hand Calculation: Adam+L2 vs AdamW

One parameter: θ=5.0, g=2.0, λ=0.1, η=0.001. Assume we're at a step where bias correction is negligible (late training). Assume v stabilised at v = 4.0.

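Adam + L2 regularisation (taking m̂ equal to the current gradient and holding v at 4.0 for simplicity):

gradient with L2 term = g + λθ = 2.0 + 0.1 × 5.0 = 2.5
update = η × 2.5 / √v = 0.001 × 2.5 / 2 = 0.00125
of which decay = η × λθ / √v = 0.001 × 0.5 / 2 = 0.00025, i.e. 0.005% of θ

AdamW (decoupled):

decay = ηλ × θ = 0.001 × 0.1 × 5.0 = 0.0005, i.e. 0.01% of θ
Adam step = η × g / √v = 0.001 × 2.0 / 2 = 0.001
total update = 0.0005 + 0.001 = 0.0015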

The effective weight decay in AdamW (0.01%) is 2× larger than in Adam+L2 (0.005%). The L2 version had its decay reduced by the adaptive scaling (dividing by √v = 2). AdamW applies the decay at its intended strength.
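
The same comparison can be repeated for other values of v to see how unevenly L2 behaves inside Adam. This is a simplified sketch: it ignores ε and bias correction and treats v as fixed, and both function names are hypothetical.

```python
import math

def l2_decay_in_adam(theta, lam, lr, v):
    """Decay part of an Adam+L2 step: lr * (lam * theta) / sqrt(v)."""
    return lr * lam * theta / math.sqrt(v)

def decoupled_decay(theta, lam, lr):
    """AdamW decay term: lr * lam * theta, independent of v."""
    return lr * lam * theta

theta, lam, lr = 5.0, 0.1, 0.001
for v in (0.01, 1.0, 4.0, 100.0):
    print(f"v={v:>6}: Adam+L2 decay = {l2_decay_in_adam(theta, lam, lr, v):.6f}, "
          f"AdamW decay = {decoupled_decay(theta, lam, lr):.6f}")
```

The AdamW column is constant, while the Adam+L2 column swings by a factor of 100 across these values of v, which is the non-uniformity described above.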

The practical impact can be large. The original paper reported noticeably better generalisation on image classification benchmarks after switching from Adam+L2 to AdamW, with no other changes: the regularisation was simply working as intended. Today, AdamW is the default optimizer in virtually all transformer training (GPT, BERT, ViT, LLaMA, etc.).
Adam+L2 vs AdamW Weight Trajectories

Two weights start at the same value. Watch how L2 regularisation interacts differently with Adam's adaptive scaling vs AdamW's decoupled decay.

AdamW From Scratch

python
def adamw_step(params, grads, m, v, t,
               lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    for i in range(len(params)):
        # 1. Weight decay (DECOUPLED — applied to params, not grad)
        params[i] *= (1 - lr * wd)
        # 2. Standard Adam on pure data gradient
        m[i] = b1 * m[i] + (1 - b1) * grads[i]
        v[i] = b2 * v[i] + (1 - b2) * grads[i] ** 2
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)

PyTorch:

python
# AdamW (decoupled weight decay) — now the default
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# WARNING: this is L2, NOT decoupled weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)
The PyTorch trap. torch.optim.Adam(weight_decay=0.01) applies L2 regularisation (adds λθ to gradient). torch.optim.AdamW(weight_decay=0.01) applies decoupled weight decay. They are NOT the same. If you're using Adam with weight decay, you almost certainly want AdamW.
Why does AdamW decouple weight decay from the gradient computation?

Chapter 6: Advanced Optimizers — Beyond Adam

Adam is excellent but not the end of the story. Researchers keep pushing for faster convergence, better generalisation, or lower memory. Here we survey the most important post-Adam optimizers — each one a variation on the themes we've already built.

LAMB & LARS: Layer-wise Scaling

Training with very large batch sizes (8192+) breaks Adam. The gradient noise drops so low that the optimizer converges to sharp minima. LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments for Batch) fix this by scaling the learning rate per layer, not just per parameter.

The idea: compute a "trust ratio" for each layer — the ratio of the weight norm to the gradient norm. If the gradient is huge relative to the weights, the trust ratio is small (take a cautious step). If the gradient is tiny, the trust ratio is large (take a bold step).

φl = ||θl|| / ||∇Ll||
θl ← θl - η · φl · updatel

LARS wraps SGD+momentum. LAMB wraps Adam. Google used LAMB to train BERT in 76 minutes on 1024 TPUs — a 76× speedup from the original 4 days. The key: LAMB made batch size 64K stable, so massive data parallelism worked.
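
The trust-ratio idea can be sketched in a few lines. This is a stripped-down illustration of the formula above applied to plain SGD, not the full LARS or LAMB algorithm: it omits momentum, weight decay, and the trust coefficient, and in LAMB the denominator would be the norm of the Adam update rather than the raw gradient. The function names and the fallback for zero norms are assumptions.

```python
import math

def l2_norm(xs):
    return math.sqrt(sum(x * x for x in xs))

def trust_ratio(weights, grads):
    """Ratio of weight norm to gradient norm for one layer."""
    w_norm, g_norm = l2_norm(weights), l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0   # assumption: fall back to an unscaled step
    return w_norm / g_norm

def layerwise_sgd_step(layers, layer_grads, lr=0.1):
    """Scale each layer's step by its own trust ratio (in place)."""
    for w, g in zip(layers, layer_grads):
        phi = trust_ratio(w, g)
        for i in range(len(w)):
            w[i] -= lr * phi * g[i]

layers = [[3.0, 4.0], [0.3, 0.4]]
layer_grads = [[30.0, 40.0], [0.003, 0.004]]   # one huge gradient, one tiny
layerwise_sgd_step(layers, layer_grads)
print(layers)
```

With this scaling each layer moves by lr times its own weight norm, regardless of how large or small its raw gradient is.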

RAdam: Rectified Adam

Adam's bias correction helps, but the variance of the adaptive learning rate is still high in early training. RAdam (Liu et al., 2020) tracks the variance of the second moment estimate and automatically switches from SGD (high variance regime) to Adam (low variance regime).

In practice: the first ~5 steps use SGD with momentum (no adaptive scaling), then smoothly transition to Adam as vt becomes reliable. This eliminates the need for learning rate warmup — RAdam warms itself up.

AdaFactor: Memory-Efficient

Adam stores two buffers (m and v) per parameter. For a 175B-parameter model like GPT-3, that's 350B extra floats = 1.4 TB of optimizer state in fp32. AdaFactor (Shazeer & Stern, 2018) reduces this dramatically by factorizing the second moment matrix.

For a weight matrix W of shape (d1 × d2), instead of storing a full d1×d2 second moment matrix, AdaFactor stores a row factor (d1) and a column factor (d2). Memory drops from O(d1·d2) to O(d1 + d2). For a 4096×4096 matrix, that's 8192 values instead of about 16.8M, a roughly 2000× reduction.
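
The factorisation can be sketched on a tiny matrix. The idea is to keep only row sums and column sums and rebuild each entry as (row sum × column sum) / total. The sketch below is a static illustration of that reconstruction, not AdaFactor itself, which maintains moving averages of these statistics over squared gradients; the function names are hypothetical.

```python
def factor(V):
    """Compress a non-negative matrix into its row sums and column sums."""
    row_sums = [sum(row) for row in V]
    col_sums = [sum(col) for col in zip(*V)]
    return row_sums, col_sums

def reconstruct(row_sums, col_sums):
    """Rank-1 estimate of the original matrix from the two factors."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

# A rank-1 example matrix, for which the reconstruction is exact.
a, b = [1.0, 2.0, 4.0], [0.5, 1.0]
V = [[x * y for y in b] for x in a]
rows, cols = factor(V)
V_hat = reconstruct(rows, cols)
print(V_hat)

full, factored = 4096 * 4096, 4096 + 4096
print(f"{full:,} values vs {factored:,} values: {full // factored}x fewer")
```

For matrices that are not rank 1 the reconstruction is only an approximation, which is the price paid for the memory saving.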

Lion: Sign-Based Updates

Lion (EvoLved Sign Momentum, Chen et al., 2023) was discovered by an automated program search over the space of optimizer algorithms. The result is surprisingly simple: it uses only the sign of the momentum update, not its magnitude:

update = sign(β1 · m_{t-1} + (1 − β1) · g_t)
m_t = β2 · m_{t-1} + (1 − β2) · g_t
θ_t = θ_{t-1} − η · update

Every parameter moves by exactly ±η per step. No second moment, no adaptive scaling, no square roots. Memory: only one buffer (m) vs Adam's two (m + v). 50% less optimizer memory.

Lion's defaults: β1=0.9, β2=0.99, η=3×10⁻⁴ (3-10× smaller than Adam's because every update has the same magnitude). Weight decay is critical with Lion, typically 3-10× larger than with Adam.
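
In the same from-scratch style as the earlier chapters, the three Lion equations translate almost line for line. This is a minimal sketch: the `wd` argument applies decoupled weight decay in the AdamW style and defaults to 0 so that the function matches the equations above exactly; treat that argument as an assumption rather than a reference implementation.

```python
def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)

def lion_step(params, grads, m, lr=3e-4, b1=0.9, b2=0.99, wd=0.0):
    """One Lion step on lists of scalars (updates params and m in place)."""
    for i in range(len(params)):
        update = sign(b1 * m[i] + (1 - b1) * grads[i])   # direction only
        m[i] = b2 * m[i] + (1 - b2) * grads[i]           # momentum buffer
        params[i] -= lr * (update + wd * params[i])

params = [1.0, -2.0]
grads = [0.5, -1000.0]     # wildly different magnitudes
m = [0.0, 0.0]
lion_step(params, grads, m)
print(params, m)
```

Both parameters move by the same distance (lr) despite a 2000× difference in gradient magnitude. Only the momentum buffer remembers how large each gradient was.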

Sophia: Second-Order Information

All methods so far use only first-order information (the gradient). The Hessian — the matrix of second derivatives — tells you about curvature. In a flat region, take large steps. In a sharply curved region, take small steps. Adam approximates this crudely (v captures gradient magnitude, which correlates with curvature). Sophia (Liu et al., 2023) uses a lightweight Hessian diagonal estimate for more precise scaling.

Sophia pre-conditions the gradient by dividing by a Hessian diagonal estimate ht (computed cheaply via a Hutchinson estimator every ~10 steps):

θ_t = θ_{t-1} − η · clip(m_t / max(h_t, ε), ρ)

The clip operation caps the update magnitude, preventing the optimizer from taking catastrophically large steps where the Hessian estimate is inaccurate. The Sophia authors report roughly a 2× speedup over Adam on LLM pre-training (the same loss in about half the steps) with similar memory use.

The Optimizer Family Tree

Every optimizer we've discussed is a variation on a few core ideas. The tree below shows how they connect. Click any node for details.

Optimizer Family Tree

Click a node to see its key innovation. The tree shows evolutionary relationships.

| Optimizer | Year | Key Innovation | Memory | Best For |
| --- | --- | --- | --- | --- |
| SGD | 1951 | Stochastic gradients | 0 extra | Baselines, convex |
| SGD+Momentum | 1964 | Velocity accumulation | 1 buffer | Vision (ResNet, ConvNets) |
| AdaGrad | 2011 | Per-param LR (cumulative) | 1 buffer | Sparse data, online |
| RMSProp | 2012 | Per-param LR (EMA) | 1 buffer | RNNs |
| Adam | 2015 | Momentum + adaptive LR | 2 buffers | General default |
| AdamW | 2019 | Decoupled weight decay | 2 buffers | Transformers, LLMs |
| LAMB | 2020 | Layer-wise trust ratio | 2 buffers | Large-batch training |
| Lion | 2023 | Sign-based, AI-discovered | 1 buffer | Memory-constrained LLMs |
| Sophia | 2023 | Hessian pre-conditioning | 2 buffers | LLM pre-training |
The trend is clear: optimizers are getting smarter about using curvature information (first-order → quasi-second-order), more memory-efficient (factorisation, sign-only updates), and more robust to hyperparameters (warmup-free, schedule-free). But the core equation is always the same: θ ← θ - step_size × direction.
What is Lion's key innovation compared to Adam?

Chapter 7: Learning Rate Schedules

Even with Adam, a fixed learning rate is suboptimal. Early in training, you want large steps to make rapid progress. Late in training, you want small steps to settle precisely into a minimum. A learning rate schedule adjusts η over the course of training.

This is separate from adaptive per-parameter scaling (Adam's v). The schedule changes the base learning rate that multiplies everything. Think of it as the global volume knob, while Adam's per-parameter scaling is the equaliser.

The Warmup Phase

Modern training almost always starts with a warmup phase: the learning rate starts near zero and linearly increases to its target value over the first few hundred or thousand steps.

Why? At initialisation, the model's weights are random. The gradients in the first few steps are large, noisy, and unrepresentative. Taking full-sized steps based on these wild early gradients can push the model into a bad region of the loss landscape from which it never recovers. Warmup says: "Take tiny steps while the gradients are chaotic. Ramp up to full speed once they stabilise."

ηt = ηmax · (t / Twarmup)     for t ≤ Twarmup

For BERT: T_warmup = 10,000 steps out of 1,000,000 total (1%). For GPT-3, warmup is specified in tokens rather than steps: the first 375 million tokens. For ViT: 10,000 steps. Typical range: 1-5% of total training.

Common Schedules

After warmup, several decay strategies are in common use:

1. Step Decay. Drop the LR by a fixed factor (e.g., ÷10) at predetermined epochs. Simple and effective for convnets: "Train at 0.1 for 30 epochs, 0.01 for 30 epochs, 0.001 for 30 epochs."

η_t = η_0 · γ^⌊t/s⌋

Where γ is the decay factor (typically 0.1) and s is the step interval.

2. Cosine Decay. The LR follows a half-cosine curve from ηmax down to ηmin (often 0 or ηmax/100). Smooth and widely used — the default for transformer training.

ηt = ηmin + (ηmax - ηmin) · (1 + cos(π · t / T)) / 2

Cosine decay is popular because it's smooth (no abrupt drops that cause loss spikes), has no hyperparameters beyond the min/max LR and total steps, and empirically works as well as or better than hand-tuned step schedules.

3. Linear Decay. The simplest: a straight line from ηmax to 0 (or some ηmin). Used by GPT-2 and many BERT fine-tuning recipes.

ηt = ηmax · (1 - t/T)

4. Inverse Square Root. Popular for training transformers (the original "Attention Is All You Need" paper):

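η_t = η_0 · min(t^(−0.5), t · T_warmup^(−1.5))

Here η_0 is a scale constant (d_model^(−0.5) in the original paper) rather than the peak itself: the two branches meet at t = T_warmup, where the LR peaks at η_0 / √T_warmup. This gives linear warmup followed by a gradual 1/√t decay. It is slower than cosine decay, so the LR stays higher for longer, which can help on very long training runs.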

5. Warmup-Stable-Decay (WSD). A three-phase schedule used by recent large models (e.g., MiniCPM, some LLaMA variants): warmup to peak LR, hold constant for most of training, then rapidly decay at the end.
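
The first, third, and fourth schedules above are each a one-line function. The sketch below writes them out; the function and argument names are illustrative assumptions, and `scale` in `inverse_sqrt` is the constant prefactor of the formula, not the peak learning rate.

```python
def step_decay(lr0, t, interval, gamma=0.1):
    """Multiply the LR by gamma once every `interval` steps."""
    return lr0 * gamma ** (t // interval)

def linear_decay(lr_max, t, total):
    """Straight line from lr_max at t=0 down to 0 at t=total."""
    return lr_max * (1 - t / total)

def inverse_sqrt(scale, t, warmup):
    """Linear warmup for `warmup` steps, then 1/sqrt(t) decay."""
    t = max(t, 1)   # avoid raising 0 to a negative power
    return scale * min(t ** -0.5, t * warmup ** -1.5)

print(step_decay(0.1, 65, 30))           # two drops have happened by step 65
print(linear_decay(0.001, 500, 1000))    # halfway through training
print(inverse_sqrt(1.0, 4000, 4000))     # the peak, reached at t = warmup
```

Writing the schedules as pure functions of the step number makes them easy to plot or unit-test before wiring them into a training loop.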

Hand Calculation: Cosine Schedule

Total steps T = 1000, ηmax = 0.001, ηmin = 0, warmup for first 100 steps.

StepPhaseη
0Warmup0
50Warmup0.001 × 50/100 = 0.0005
100Warmup complete0.001
325Cosine decay0.001 × (1+cos(π×225/900))/2 = 0.001 × 0.854 = 0.000854
550Cosine decay0.001 × (1+cos(π×450/900))/2 = 0.001 × 0.500 = 0.000500
775Cosine decay0.001 × (1+cos(π×675/900))/2 = 0.001 × 0.146 = 0.000146
1000End0.001 × (1+cos(π))/2 = 0

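The cosine curve is flat at both ends and steepest in the middle. In the table, only about 15% of the decay has happened a quarter of the way through the post-warmup phase (η = 0.000854), half has happened at the midpoint, and the final quarter eases gently from 0.000146 down to 0. The LR therefore stays near its peak early, when large steps make rapid progress, and the slow tail leaves time for fine-tuning at small step sizes.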

Learning Rate Schedule Visualiser

Choose a schedule type, adjust total steps and warmup percentage, and see the learning rate curve. The Y-axis shows η normalised to the peak value.

Schedule From Scratch

python
import math

def cosine_lr(step, total, warmup, lr_max, lr_min=0):
    """Warmup + cosine decay schedule."""
    if step < warmup:
        return lr_max * step / warmup           # linear warmup
    progress = (step - warmup) / (total - warmup)  # 0 to 1
    return lr_min + (lr_max - lr_min) * 0.5 * (
        1 + math.cos(math.pi * progress))      # cosine decay

import torch

# PyTorch equivalent of the cosine phase alone (assumes `optimizer` already exists)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=1000, eta_min=0)

# Alternative: linear warmup followed by cosine decay (PyTorch 2.0+).
# Use this instead of the scheduler above, not in addition to it.
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [
    torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-5, total_iters=100),
    torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=900)
], milestones=[100])
The most common beginner mistake: forgetting to call scheduler.step() after optimizer.step(). The learning rate stays constant and you wonder why your training curve looks weird. In PyTorch, the optimizer and scheduler are separate objects — you must step both.
Schedule-free optimizers. The latest research direction eliminates schedules entirely. Schedule-Free Adam (Defazio et al., 2024) achieves competitive results with zero LR schedule — the optimizer automatically adjusts its effective step size by maintaining a separate "evaluation" point. This removes the need to know total training steps in advance, enabling truly open-ended training.
Why do most modern training runs start with a warmup phase?

Chapter 8: The Optimizer Arena

Everything we've learned, in one simulation. Drop multiple optimizers onto the same loss landscape and watch them race. Each one runs its actual algorithm — the real update equations, not an approximation.

This is the payoff. You've learned what each optimizer does; now you'll see the differences in real time. Try different landscapes and watch how each optimizer handles ravines, saddle points, and noisy surfaces.

How to use the arena. Select a landscape, check the optimizers you want to race, set the learning rate (shared across all), and hit Play. Click anywhere on the landscape to set the starting point. The loss curves below track each optimizer's progress. Try the Ravine landscape first — it shows the biggest differences between SGD, Momentum, and Adam.
Optimizer Racing Arena

Race optimizers head-to-head on the same landscape. Each runs its real update equations.

Loss Curves

Loss vs step for each active optimizer. Lower = better.

What to Notice

On the Ravine: SGD zig-zags badly. Momentum cuts through the valley much faster. RMSProp compensates for the uneven curvature by scaling per-axis. Adam combines both advantages and reaches the minimum first.

On Rosenbrock: The famous banana-shaped valley. SGD and Momentum get stuck navigating the curve. Adam tracks the curved valley floor because its per-parameter scaling handles the varying curvature.

On the Saddle Point: All optimizers struggle at the exact saddle (zero gradient). In practice, mini-batch noise would kick them off. In our deterministic simulation, only momentum-based methods escape (their accumulated velocity carries them past the zero-gradient point). Try starting slightly off-center to see normal behavior.

Learning rate sensitivity: Drag the LR slider and watch. SGD is the most sensitive — too high and it diverges, too low and it crawls. Adam is the most forgiving — it works across a much wider range of learning rates because the m/√v normalisation compensates. This robustness is why Adam is the default for most practitioners.

Don't over-interpret the arena. This is a 2D simulation with deterministic gradients. Real training happens in millions of dimensions with stochastic mini-batches. The qualitative behaviours translate (ravine oscillation, saddle stalling, adaptive scaling benefits), but the quantitative "which optimizer wins" depends heavily on the specific problem, model size, and hyperparameter tuning budget.

Chapter 9: Cheat Sheet & What to Use When

You now understand the full optimizer toolkit — from raw gradient descent to modern adaptive methods with schedules. This chapter is your practical reference. No new concepts. Just the decision guide you'll actually use.

The Decision Flowchart

Click the nodes below to follow the decision path for your situation.

[Interactive figure: Optimizer Decision Tree. Click nodes to navigate the tree; the recommended optimizer is highlighted at the bottom.]

Quick Reference Table

| Situation | Optimizer | LR | Schedule | Weight Decay |
|---|---|---|---|---|
| LLM pre-training (GPT, LLaMA) | AdamW | 3e-4 to 1e-3 | Warmup + cosine to 0.1×peak | 0.01–0.1 |
| Transformer fine-tuning (BERT) | AdamW | 1e-5 to 5e-5 | Linear warmup + linear decay | 0.01 |
| Vision (ResNet, ConvNet) | SGD+Nesterov | 0.1 | Cosine or step decay | 1e-4 |
| Vision Transformer (ViT) | AdamW | 1e-3 | Warmup + cosine | 0.05–0.3 |
| GAN training | Adam | 1e-4 to 2e-4 | None or very slow decay | 0 |
| Diffusion models | AdamW | 1e-4 | Constant or slow cosine | 0.01 |
| RL (PPO, SAC) | Adam | 3e-4 | Linear decay or none | 0 |
| Memory-constrained LLM | Lion | 3e-5 to 3e-4 | Same as AdamW | 0.1–1.0 (3-10× higher) |
| Large-batch distributed | LAMB | Scaled linearly with batch | Warmup + polynomial | 0.01 |
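
To make the "Warmup + cosine to 0.1×peak" entry concrete, here is a minimal sketch of that schedule as a plain function of the step number. The function name, the linear shape of the warmup, and the default values (peak 3e-4, 2,000 warmup steps) are assumptions for illustration, not settings taken from any particular training run.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr=3e-4,
                     warmup_steps=2000, floor_ratio=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay to floor_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    floor = floor_ratio * peak_lr
    return floor + (peak_lr - floor) * cosine

if __name__ == "__main__":
    for step in (0, 1000, 2000, 51000, 100000):
        print(step, warmup_cosine_lr(step, total_steps=100000))
```

In a framework you would normally hand a function like this to the scheduler utility your library provides instead of setting the learning rate by hand.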

Hyperparameter Cheat Sheet

| Hyperparameter | Default | Typical Range | What it Controls |
|---|---|---|---|
| η (learning rate) | 1e-3 (Adam), 0.1 (SGD) | 1e-5 to 1.0 | Step size; the single most important hyperparameter |
| β1 | 0.9 | 0.8–0.95 | Momentum decay: how many steps of gradient history are averaged |
| β2 | 0.999 | 0.99–0.9999 | Second-moment decay: how many steps of squared-gradient history are averaged |
| ε | 1e-8 | 1e-8 to 1e-6 | Numerical stability; rarely needs tuning |
| weight decay | 0.01 | 0–0.3 | How strongly weights are shrunk toward zero each step; higher = stronger regularisation |
| warmup steps | 1–5% of total | 100–10,000 | Steps to ramp LR from 0 to peak |
| batch size | 32–256 | 16–65536 | Gradient noise level; larger = less noise |
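
The "how many steps of history" reading of β1 and β2 comes from a standard rule of thumb: an exponential moving average with decay β averages over roughly 1/(1 − β) recent steps. A small sketch, with hypothetical helper names:

```python
def ema_horizon(beta):
    # Rule-of-thumb number of recent steps an EMA with decay beta averages over
    return 1.0 / (1.0 - beta)

def ema_weight(beta, k):
    # Weight an EMA gives to the gradient from k steps ago
    return (1.0 - beta) * beta ** k

if __name__ == "__main__":
    for name, beta in (("beta1", 0.9), ("beta2", 0.999)):
        print(f"{name}={beta}: about {ema_horizon(beta):.0f} steps of history")
```

So the default β1 = 0.9 remembers on the order of 10 steps of gradients, while β2 = 0.999 remembers on the order of 1,000 steps of squared gradients. Lowering β2 makes the per-parameter scaling react faster, at the cost of a noisier estimate.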

Tuning Priority

When you have limited time for hyperparameter tuning, this is the order of importance:

  1. Learning rate. Sweep over 5-10 values on a log scale (1e-5 to 1e-1). As a rough rule of thumb, getting this one setting right accounts for most of the benefit of tuning (call it 80%).
  2. Weight decay. Try 0, 0.01, 0.1. More important for transformers than convnets.
  3. Batch size. Limited by GPU memory. Larger batches usually train faster in wall-clock time but may hurt generalisation. If you increase the batch size, scale the LR up with it; linear scaling is the common heuristic.
  4. Schedule. Cosine decay is a safe default. Try warmup ratios of 1%, 5%, 10%.
  5. β1, β2. Almost never need tuning. Defaults work.
  6. ε. Never needs tuning unless you're debugging NaN losses.
The 80-20 rule of optimizer tuning: Pick AdamW with cosine schedule. Sweep learning rate. You're 80% of the way to optimal. The remaining 20% requires weeks of tuning that's only worthwhile for large-scale production training (and even then, most teams just use the published hyperparameters from similar papers).
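
The first and third items on the list translate directly into code. This is a minimal sketch with hypothetical helper names: one function builds a log-spaced learning-rate grid for the sweep, the other applies the linear batch-size scaling heuristic. The train-and-evaluate step is left out because it depends on your model.

```python
import math

def log_sweep(low, high, num):
    """Return num learning rates evenly spaced on a log scale from low to high."""
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** (lo + i * (hi - lo) / (num - 1)) for i in range(num)]

def linear_scaled_lr(base_lr, base_batch, new_batch):
    # Linear scaling heuristic: LR grows in proportion to batch size
    return base_lr * new_batch / base_batch

if __name__ == "__main__":
    for lr in log_sweep(1e-5, 1e-1, 5):
        print(f"try lr = {lr:.0e}")
    print("scaled lr:", linear_scaled_lr(0.1, base_batch=256, new_batch=1024))
```

Spacing the candidates on a log scale matters because the useful range spans several orders of magnitude; a linear grid would put almost all of its points near the top of the range.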

Summary of Everything

Ch 0: Loss Landscape. Gradient = local slope. Descent = walk downhill.
Ch 1: SGD. Mini-batch = fast + noisy. Noise is actually good.
Ch 2: Momentum. EMA of gradients = smooth direction + accelerate.
Ch 3: AdaGrad/RMSProp. Per-parameter LR from gradient history.
Ch 4: Adam. Momentum + adaptive LR + bias correction.
Ch 5: AdamW. Decouple weight decay from adaptive scaling.
Ch 6: Advanced. Lion (sign-only), Sophia (Hessian), LAMB (large batch).
Ch 7: Schedules. Warmup + cosine decay = standard recipe.
Ch 8: Arena. See them race; Adam wins on robustness.

Connections

Optimizers don't exist in isolation. Here's where to go next:

"The art of doing mathematics consists in finding that special case which contains all the germs of generality." — David Hilbert. Every optimizer is a special case of the same idea: move parameters in a direction that reduces loss. The rest is engineering.
You're fine-tuning a pre-trained BERT model. Which optimizer and learning rate would you choose?