The algorithms that turn gradients into weight updates — from vanilla SGD to the adaptive methods behind every modern neural network.
You're standing on a foggy mountain. You can't see the valley floor — can't see anything more than a meter ahead. But you can feel the slope under your feet. Steep tilt to the left? Step left. Gentle slope forward? Step forward. That's all the information you have: the local tilt of the ground beneath you.
Training a neural network is exactly this problem. The "mountain" is the loss landscape — a surface where every point represents a particular set of model weights, and the height at that point is the loss (how wrong the model is). The "valley floor" is a minimum where the loss is low. Your job is to walk downhill using only what you can feel locally.
The tool that tells you the local slope is called the gradient.
Take a function with one input, say f(x) = (x − 3)². The graph is a parabola with its minimum at x = 3. At any point x, the derivative f′(x) = 2(x − 3) tells you two things: its sign says which way is uphill, and its magnitude says how steep the slope is.
The derivative points in the direction of steepest ascent — uphill. We want to go downhill, so we step in the opposite direction. That's it. That's gradient descent.
When the function has millions of inputs (the weights of a neural network), the derivative generalises to the gradient — a vector of partial derivatives, one per weight. Each entry says "if you nudge this particular weight up a tiny bit, here's how much the loss increases." The gradient vector points uphill in weight-space; we step opposite to it.
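To make the "nudge one weight" reading concrete, here is a minimal sketch that estimates a gradient by finite differences and compares it with the analytic one. The two-weight toy loss and the helper names (`loss`, `analytic_gradient`, `numerical_gradient`) are illustrative assumptions, not something defined elsewhere in this text.

```python
import numpy as np

def loss(w):
    # Hypothetical toy loss with two weights; its minimum is at w = (3, -1)
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

def analytic_gradient(w):
    # Vector of partial derivatives, one entry per weight
    return np.array([2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)])

def numerical_gradient(f, w, h=1e-6):
    # Nudge each weight by +h and -h and measure how the loss changes
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

w = np.array([0.0, 0.0])
num_grad = numerical_gradient(loss, w)
ana_grad = analytic_gradient(w)
print(num_grad, ana_grad)  # both close to [-6, 4]
```

Finite differences need two loss evaluations per weight, which is why real frameworks compute gradients with backpropagation instead; the sketch is only useful as a sanity check.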
One step of gradient descent is:
θ ← θ − η · ∇L(θ)
Let's unpack every symbol:
| Symbol | Name | Meaning |
|---|---|---|
| θ | Parameters | All the model's weights, packed into one vector |
| η | Learning rate | Step size — how far we walk per update (a scalar, e.g. 0.01) |
| ∇L(θ) | Gradient | The slope of the loss with respect to every weight, evaluated at the current θ |
| ← | Assignment | Replace the old θ with the new value |
The minus sign is critical: the gradient points uphill, so subtracting it moves us downhill. The learning rate η scales how big each step is. That single scalar is going to cause us enormous trouble.
Let's trace gradient descent on f(x) = (x − 3)², starting at x = 0 with learning rate η = 0.1. The gradient (derivative) is f′(x) = 2(x − 3).
Step 0 (starting point): x = 0, f(x) = (0 − 3)² = 9, f′(x) = 2(0 − 3) = −6, so x_new = 0 − 0.1 × (−6) = 0.6.
The gradient is −6 (negative, pointing left — meaning the function is decreasing as x increases). We subtract it, which pushes us to the right toward the minimum. Makes sense.
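Step 1: x = 0.6, f(x) = (0.6 − 3)² = 5.76, f′(x) = 2(0.6 − 3) = −4.8, so x_new = 0.6 − 0.1 × (−4.8) = 1.08.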
Loss dropped from 9 to 5.76. The gradient is smaller now (−4.8 vs −6) because we're closer to the minimum and the slope is gentler. Each step gets smaller — this is a nice property of gradient descent on smooth functions.
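Step 2: x = 1.08, f(x) = 3.686, f′(x) = −3.84, so x_new = 1.08 + 0.384 = 1.464.
Step 3: x = 1.464, f(x) = 2.359, f′(x) = −3.072, so x_new = 1.464 + 0.307 = 1.771.
Step 4: x = 1.771, f(x) = 1.510, f′(x) = −2.458, so x_new = 1.771 + 0.246 = 2.017.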
After 5 steps: x went from 0 → 0.6 → 1.08 → 1.464 → 1.771 → 2.017. Loss dropped from 9.0 → 0.966. We're converging, but slowly: each step is 80% the size of the previous one (because the gradient shrinks by a factor of 0.8 each time; can you see why?).
What happens if we crank η up? Let's replay the same problem with η = 0.5:
x_new = 0 − 0.5 × (−6) = 3, and f(3) = 0
One step! We jumped straight to the minimum. η = 0.5 is perfect for this particular parabola. But what about η = 1.1?
x goes 0 → 6.6 → −1.32 → 8.184
We're oscillating and diverging. Each step overshoots the minimum by more than the last. The loss is 9 → 12.96 → 18.66 → ... going up. The learning rate is too high. For this simple parabola, anything above η = 1.0 diverges.
And if η is too small — say η = 0.001?
After 100 steps we'd be at roughly x ≈ 0.54. After 1000 steps, x ≈ 2.59, still well short of the minimum at 3. It works, but it's agonisingly slow. In a real network with millions of parameters and expensive forward/backward passes, this waste is measured in GPU-hours and dollars.
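The learning-rate experiments above can be replayed in a few lines. This is a minimal sketch; the helper name `gradient_descent` is an illustrative choice, and the function being minimised is the same f(x) = (x − 3)² used in the text.

```python
def gradient_descent(x0, lr, steps):
    """Run gradient descent on f(x) = (x - 3)^2 and return every iterate."""
    xs = [x0]
    for _ in range(steps):
        grad = 2 * (xs[-1] - 3)        # f'(x) = 2(x - 3)
        xs.append(xs[-1] - lr * grad)  # x ← x − η · f'(x)
    return xs

trace_good = gradient_descent(0.0, lr=0.1, steps=5)
trace_exact = gradient_descent(0.0, lr=0.5, steps=1)
trace_diverge = gradient_descent(0.0, lr=1.1, steps=3)
trace_slow = gradient_descent(0.0, lr=0.001, steps=1000)

print([round(x, 3) for x in trace_good])     # [0.0, 0.6, 1.08, 1.464, 1.771, 2.017]
print(trace_exact[-1])                        # 3.0
print([round(x, 3) for x in trace_diverge])  # [0.0, 6.6, -1.32, 8.184]
print(round(trace_slow[100], 2), round(trace_slow[1000], 2))
```

Each update multiplies the distance to the minimum by (1 − 2η), which explains all four behaviours: 0.8 for steady convergence, 0 for the one-step jump, −1.2 for growing oscillation, and 0.998 for the crawl.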
The widget below lets you drop a ball onto three different loss landscapes and watch gradient descent in action. Drag the learning rate slider and see how the behaviour changes.
Choose a landscape shape, set the learning rate, click "Drop Ball", and watch gradient descent try to reach the minimum. Click anywhere on the landscape to reposition the ball.
Play with the widget above and you'll discover the three failure modes that plague vanilla gradient descent:
1. Divergence (η too high). On the Bowl landscape, crank the learning rate above ~1.0. The ball overshoots the minimum, lands on the opposite slope, overshoots again, and each bounce is larger than the last. The loss explodes to infinity. In real training, you see "loss: NaN" in your terminal and your run is dead.
2. Crawling (η too low). Set η to 0.005. The ball inches forward. On the Bowl this is merely slow, but on the Ravine it's catastrophic: the ball needs thousands of steps to traverse the long flat floor of the valley, and you'll run out of patience (or compute budget) long before it arrives.
3. Saddle points. Switch to the Saddle landscape. A saddle point is a place where the gradient is zero even though it's not a minimum — like the centre of a horse saddle, curving up in one direction and down in another. The gradient is zero at the saddle, so gradient descent stops dead. In high-dimensional spaces (neural networks have millions of dimensions), saddle points vastly outnumber local minima. They are the dominant obstacle.
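The saddle-point failure is easy to reproduce numerically. The sketch below uses the textbook saddle f(x, y) = x² − y² as a stand-in surface (an assumption for illustration; it is not necessarily the widget's landscape) and shows that plain gradient descent never leaves the saddle when started exactly on it, while a tiny offset is enough to escape.

```python
import numpy as np

def saddle_grad(p):
    # f(x, y) = x^2 - y^2 has a saddle at the origin; its gradient is (2x, -2y)
    return np.array([2.0 * p[0], -2.0 * p[1]])

def run_gd(start, lr=0.1, steps=100):
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - lr * saddle_grad(p)
    return p

stuck = run_gd([0.0, 0.0])     # exactly on the saddle: the gradient is zero, so nothing moves
escaped = run_gd([0.5, 1e-6])  # a tiny offset in y is multiplied by 1.2 on every step
print(stuck, escaped)
```

The x-coordinate shrinks toward zero while the y-coordinate grows geometrically, so even a microscopic perturbation eventually carries the iterate off the saddle.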
Here is the uncomfortable truth: with vanilla gradient descent, your entire training outcome depends on picking the right value for a single scalar η. Too high and you diverge. Too low and you waste days of compute. And the "right" value changes during training — early on, when gradients are large and you're far from any minimum, you can afford a big learning rate. Later, when you're near a minimum and need to settle in, you need a small one.
Different parameters may want different learning rates. A parameter in an early layer of a deep network gets tiny gradients (the vanishing gradient problem); it needs a bigger step. A parameter in the final layer gets huge gradients; it needs a smaller step. One global η is a sledgehammer.
In the chapters ahead, we'll build up from the simplest fix (using random
subsets of data) through momentum (accumulating velocity) to fully adaptive
methods (per-parameter learning rates). Each one solves a specific failure
mode. By the end, you'll understand exactly what torch.optim.Adam
is doing under the hood, and when to reach for something else.
Computing the exact gradient over millions of examples takes forever. What if we estimated it from a handful?
In the last chapter, the gradient ∇L(θ) was the true gradient, computed over the entire dataset. For our toy parabola that was a single number, no big deal. But in practice, the loss is an average over every training example:
L(θ) = (1/N) · Σᵢ ℓ(f(xᵢ; θ), yᵢ), summing over i = 1, …, N
where ℓ is the per-example loss (e.g., cross-entropy or MSE), f is the model's prediction, and N is the number of training examples.
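To compute ∇L(θ) exactly, you need to run a forward and a backward pass for every one of the N examples and then average the N per-example gradients.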
ImageNet has 1.2 million images. A single forward+backward pass for one image on a ResNet-50 takes roughly 10ms on a GPU. One full-batch gradient step = 1,200,000 × 10ms = 12,000 seconds = 3.3 hours per single step. And you might need 100,000 steps to converge. That's 38 years. Obviously nobody does this.
Here's the key observation: the full gradient is an average over N examples. An average can be estimated by a random sample. If you pick B random examples (a mini-batch), the average of their gradients is a noisy but unbiased estimate of the true gradient:
∇L̃(θ) = (1/B) · Σⱼ ∇ℓ(f(x_(π(j)); θ), y_(π(j))), summing over j = 1, …, B
where π is a random permutation and we grab the first B indices. "Unbiased" means the expected value of this estimate equals the true gradient: 𝔼[∇L̃] = ∇L. Any single estimate will be off, but on average (over many random draws), it's correct.
This is Stochastic Gradient Descent: "stochastic" because the gradient is random (it depends on which mini-batch we happened to grab). The update rule looks almost identical:
θ ← θ − η · ∇L̃(θ)
The only difference is that ∇L̃ is computed on a mini-batch of B examples instead of all N. With B = 32, one gradient step takes 32 × 10ms = 0.32 seconds. In the time that full-batch does 1 step, SGD does 37,500 steps. Each step is noisier, but 37,500 noisy steps beat 1 clean step by a landslide.
Let's make the noise concrete with a tiny example. We have 4 data points and a dead-simple model: y = θ · x (a line through the origin). We'll use MSE loss: ℓ = (y − θx)².
| i | xᵢ | yᵢ |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 2 | 4 |
| 3 | 3 | 5 |
| 4 | 4 | 8 |
The true relationship is roughly y ≈ 2x (not exactly: point 3 sits a bit low at y = 5 instead of 6, while the other three points lie exactly on the line). Let's start at θ = 1.0.
Full-batch gradient:
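The per-example loss is ℓᵢ = (yᵢ − θxᵢ)². Its gradient with respect to θ is:
∂ℓᵢ/∂θ = −2xᵢ(yᵢ − θxᵢ)
At θ = 1.0, let's compute each:
i = 1: −2 · 1 · (2 − 1) = −2
i = 2: −2 · 2 · (4 − 2) = −8
i = 3: −2 · 3 · (5 − 3) = −12
i = 4: −2 · 4 · (8 − 4) = −32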
Full-batch gradient = average = (−2 + (−8) + (−12) + (−32)) / 4 = −54 / 4 = −13.5
With η = 0.01: θ_new = 1.0 − 0.01 × (−13.5) = 1.0 + 0.135 = 1.135
Mini-batch 1: examples {1, 3} — points (1,2) and (3,5):
Batch gradient = (−2 + (−12)) / 2 = −14 / 2 = −7.0
With η = 0.01: θ_new = 1.0 − 0.01 × (−7.0) = 1.07
Mini-batch 2: examples {2, 4} — points (2,4) and (4,8):
Batch gradient = (−8 + (−32)) / 2 = −40 / 2 = −20.0
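With η = 0.01, starting again from θ = 1.0 so the two batches are compared at the same point: θ_new = 1.0 − 0.01 × (−20.0) = 1.20
Notice the variance: one batch said "step by 7" and the other said "step by 20." That's because batch 2 happened to contain the two large-x points, where the error is bigger and the gradient is steeper. The two estimates still average to −13.5, the full-batch gradient, which is what "unbiased" means in practice. This variance is the price we pay, and the gift we receive.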
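The numbers in this worked example can be checked with a short script. It is a minimal sketch using the four data points from the table; the helper name `batch_gradient` is an illustrative choice, and all gradients are evaluated at θ = 1.0.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 8.0])

def batch_gradient(theta, idx):
    """Mean gradient of (y - theta*x)^2 over the examples in idx (0-based)."""
    xb, yb = X[idx], y[idx]
    return np.mean(-2.0 * xb * (yb - theta * xb))

theta = 1.0
full = batch_gradient(theta, [0, 1, 2, 3])  # all four examples
batch_a = batch_gradient(theta, [0, 2])     # examples {1, 3}
batch_b = batch_gradient(theta, [1, 3])     # examples {2, 4}
print(full, batch_a, batch_b)               # -13.5 -7.0 -20.0
print((batch_a + batch_b) / 2)              # -13.5
```

Because the two mini-batches partition the dataset into equal halves, their gradients average exactly to the full-batch gradient; for arbitrary random batches the match holds only in expectation.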
At first, gradient noise seems like a pure cost — we're getting an inaccurate gradient, which means we're not walking straight downhill. But three important phenomena make this noise valuable:
1. Escaping sharp minima. A sharp minimum is a narrow pit in the loss landscape: the loss is low, but only in a tiny region. Any small change to the weights makes the loss spike. These minima tend to overfit: they memorize training data but don't generalise. Mini-batch noise effectively adds random jitter to each step. This jitter can bounce you out of a sharp pit, but it is much less likely to kick you out of a wide, flat basin. So SGD tends to avoid sharp minima and settle into flat ones.
2. Flat minima generalise better. Flat minima are regions where the loss stays low even when the weights change slightly. Since test data is slightly different from training data, a flat minimum gives similar loss on both. Sharp minima give low training loss but high test loss. By favouring flat minima, SGD acts as an implicit regulariser — it reduces overfitting without you adding any explicit penalty.
3. Breaking symmetry at saddle points. We saw in Chapter 0 that the gradient is zero at saddle points. With full-batch gradient descent, you'd get stuck. But mini-batch noise means the estimated gradient is almost never exactly zero — there's always some random direction to push you off the saddle. More noise (smaller batches) means faster escape.
The widget below shows gradient descent on a 2D loss landscape. You can toggle between full-batch (one clean arrow per step) and mini-batch (noisy arrows that scatter around the true direction). Change the batch size and watch the noise change.
Each arrow shows the gradient direction for one step. Full-batch always points the same way from a given location. Mini-batch arrows scatter — smaller batch = more scatter. Hit "Resample" to draw new random batches.
Batch size B controls the noise level. Here's the spectrum:
| Batch Size | Name | Noise | Steps/Epoch | GPU Utilisation | Generalisation |
|---|---|---|---|---|---|
| 1 | Pure SGD | Maximum | N | Very low | Good (often too noisy) |
| 16–64 | Small mini-batch | High | N/B | Moderate | Good |
| 128–512 | Standard mini-batch | Moderate | N/B | High | Good |
| 1024–8192 | Large mini-batch | Low | N/B | Very high | Needs tuning |
| N | Full-batch (GD) | None | 1 | Maximum | Can overfit |
In practice, batch size 32–256 is the sweet spot for most tasks. Below 16, the noise is so extreme that convergence is erratic. Above 4096, you lose the implicit regularisation benefit and need to carefully tune the learning rate (typically by scaling it linearly with batch size — the "linear scaling rule").
Let's implement SGD from scratch with NumPy, then see the PyTorch version. No magic, just the update rule we derived.
```python
import numpy as np

def sgd_step(params, grads, lr):
    """One step of vanilla SGD, updating each parameter array in place."""
    for p, g in zip(params, grads):
        p -= lr * g  # θ ← θ − η · ∇L̃

# Example: linear regression y = θ·x
theta = np.array([1.0])  # starting weight
lr = 0.01
X = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 5, 8], dtype=float)

for epoch in range(50):
    indices = np.random.permutation(len(X))  # shuffle!
    for start in range(0, len(X), 2):        # batch_size=2
        batch = indices[start:start+2]
        xb, yb = X[batch], y[batch]
        pred = theta[0] * xb                 # forward pass
        error = pred - yb                    # residuals
        grad = np.mean(2 * xb * error)       # ∂L/∂θ
        sgd_step([theta], [grad], lr)        # update

print(f"Learned θ = {theta[0]:.4f}")  # roughly 1.90; the last digits vary with the shuffle
```
And the PyTorch equivalent: same algorithm, with the whole update reduced to a single optimizer.step() call:
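```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Same four points as above, shaped (N, 1) for nn.Linear
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [5.0], [8.0]])
dataloader = DataLoader(TensorDataset(X, y), batch_size=2, shuffle=True)

model = torch.nn.Linear(1, 1, bias=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for xb, yb in dataloader:  # dataloader handles batching + shuffling
    loss = (model(xb) - yb).pow(2).mean()
    loss.backward()        # computes ∇L̃ and stores in .grad
    optimizer.step()       # θ ← θ − η · ∇L̃
    optimizer.zero_grad()  # reset .grad to zero for next batch
```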
The optimizer.step() call is doing exactly what our 1-line
p -= lr * g does, just for every parameter in the model. The
zero_grad() call is essential — PyTorch accumulates
gradients by default (adds new gradients to existing .grad), so you must
zero them before each batch. Forgetting this is one of the most common
PyTorch bugs.
SGD is fast and simple, but it still has one global learning rate η for all parameters. And it has no memory — each step is based entirely on the current mini-batch gradient, with no awareness of previous steps. This leads to two problems:
Problem 1: Oscillation in ravines. When the loss landscape is a narrow valley (steep walls, shallow floor), SGD bounces back and forth across the walls while making slow progress along the floor. Each mini-batch gradient has a large component pointing across the valley (high curvature direction) and a small component pointing along the valley (low curvature direction). The large component causes oscillation; the small component causes crawling.
Problem 2: No adaptation. Some parameters need large steps (small gradients, flat curvature) and others need small steps (large gradients, sharp curvature). A single η can't serve both. This is especially bad in deep networks where gradient magnitudes vary by orders of magnitude across layers.
The next chapter tackles Problem 1 with a beautifully simple idea: give the ball mass.
SGD in a narrow valley is like a pinball — bouncing wall to wall, wasting energy oscillating sideways while barely making progress forward. What if the ball had mass?
Picture rolling a ping-pong ball down a half-pipe. It rattles side to side, never building much speed in any direction. Now picture rolling a bowling ball down the same half-pipe. Its inertia carries it past the small oscillations — the side-to-side bumps barely register because the ball's velocity is dominated by the cumulative downhill direction. That inertia is what we're going to give our optimizer.
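Why does SGD oscillate in narrow valleys? Consider a 2D loss surface shaped like a long, thin trough: steep in the x-direction, shallow in the y-direction. The gradient at any point inside the trough has two components: a large x-component pointing across the valley (up the steep walls) and a small y-component pointing along the valley floor.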
In vanilla SGD, the x-component dominates every step, causing wild zig-zagging, while the y-component (the direction we actually want to go) contributes only a tiny nudge per step. After 100 steps, the x-oscillations cancel out (equal time going left and right) and the net displacement is almost entirely in the y-direction — but agonisingly slowly.
The fix is elegant: accumulate a running average of past gradients. The x-components flip sign every step, so they cancel in the average. The y-components are consistent, so they add up. The running average amplifies the consistent signal and damps the oscillating noise. This running average is called velocity, and the technique is called momentum.
Momentum adds one new variable: a velocity vector v that has the same shape as θ (one velocity per parameter). At each step:
v_t = β · v_(t−1) + g_t
θ_t = θ_(t−1) − η · v_t
where g_t = ∇L̃(θ_(t−1)) is the current mini-batch gradient. Let's unpack the new pieces:
| Symbol | Name | Meaning |
|---|---|---|
| v_t | Velocity | Accumulated gradient history — the "inertia" of our ball |
| β | Momentum coefficient | How much of the old velocity we keep. Typically 0.9 |
| g_t | Current gradient | The fresh mini-batch gradient at this step |
| v₀ | Initial velocity | Zero vector — the ball starts at rest |
The first line says: "Take 90% of my previous velocity (momentum) and add the current gradient." The second line says: "Step in the direction of this combined velocity."
When β = 0, the velocity is just the current gradient — momentum disappears and we recover vanilla SGD. When β = 0.9, the velocity is a weighted blend of all past gradients, with recent ones weighted more heavily.
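As a rough illustration of the ravine behaviour, the sketch below runs the two-line update on a hypothetical elongated bowl, f(x, y) = 0.5 · (10x² + 0.1y²), once with β = 0 and once with β = 0.9. The surface, learning rate, and step count are assumptions chosen for the demonstration, not values from the text.

```python
import numpy as np

# Hypothetical ravine: steep in x (curvature 10), shallow in y (curvature 0.1)
CURVATURE = np.array([10.0, 0.1])

def grad(p):
    return CURVATURE * p

def run(beta, lr=0.18, steps=50):
    p = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v + grad(p)  # beta = 0 reduces to plain gradient descent
        p = p - lr * v
    return p

p_plain = run(beta=0.0)
p_momentum = run(beta=0.9)
print("plain   :", p_plain)
print("momentum:", p_momentum)
```

With these settings the plain run is still far from the minimum along the shallow y-direction after 50 steps, while the momentum run has covered most of that distance. The exact numbers depend heavily on the chosen curvatures and learning rate.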
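Let's expand the recursion to see what v_t really contains. Writing it out step by step:
v₁ = g₁
v₂ = β·g₁ + g₂
v₃ = β²·g₁ + β·g₂ + g₃
v_t = g_t + β·g_(t−1) + β²·g_(t−2) + … + β^(t−1)·g₁
This is an exponentially weighted sum of gradients: an exponential moving average (EMA) without the usual (1 − β) normalisation. The current gradient g_t has weight 1. The gradient from 1 step ago has weight β. From 2 steps ago: β². From k steps ago: β^k. Since β < 1, older gradients fade exponentially.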
How far back does the memory reach? A gradient from k steps ago contributes β^k of its original magnitude. At β = 0.9, a gradient from 10 steps ago has weight 0.9¹⁰ ≈ 0.35 — still significant. From 20 steps ago: 0.9²⁰ ≈ 0.12 — fading. From 50 steps ago: 0.9⁵⁰ ≈ 0.005 — negligible.
The effective window is approximately 1/(1 − β). At β = 0.9, the window is ~10 steps. At β = 0.99, it's ~100 steps. At β = 0.5, it's ~2 steps. This gives you a knob: higher β means longer memory, more smoothing, more inertia.
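The decay weights and the 1/(1 − β) rule of thumb can be checked directly. This is a small sketch with β = 0.9; the variable names are illustrative.

```python
beta = 0.9

# Weight of a gradient from k steps ago inside the velocity
weights = {k: beta ** k for k in (0, 1, 10, 20, 50)}
for k, w in weights.items():
    print(f"k = {k:2d}  weight = {w:.3f}")

# The total weight over a long history approaches 1 / (1 - beta)
total_weight = sum(beta ** k for k in range(1000))
print(round(total_weight, 3), 1 / (1 - beta))
```

The total weight of 10 is one way to read the "effective window": the velocity carries about as much gradient as ten equally weighted recent steps would.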
Let's trace momentum SGD on our familiar f(x) = (x − 3)². We'll start at x = 0, use η = 0.01, β = 0.9, and compare against vanilla SGD at each step. The gradient is f′(x) = 2(x − 3).
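Step 1: g₁ = 2(0 − 3) = −6. Momentum: v₁ = 0.9 × 0 + (−6) = −6, θ₁ = 0 − 0.01 × (−6) = 0.06. Vanilla: θ₁ = 0 − 0.01 × (−6) = 0.06.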
Identical so far — with v₀ = 0, the first step is the same.
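Step 2: Momentum: g₂ = 2(0.06 − 3) = −5.88, v₂ = 0.9 × (−6) + (−5.88) = −11.28, θ₂ = 0.06 + 0.1128 = 0.1728. Vanilla: θ₂ = 0.06 + 0.0588 = 0.1188.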
Now momentum is ahead. The velocity v₂ = −11.28 is almost double the current gradient (−5.88) because it's accumulated the previous gradient too. The momentum step (0.1128) is almost twice the vanilla step (0.0588).
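Step 3: Momentum: g₃ = 2(0.1728 − 3) = −5.654, v₃ = 0.9 × (−11.28) + (−5.654) = −15.806, θ₃ = 0.1728 + 0.1581 = 0.331. Vanilla: θ₃ = 0.176.
Step 4: Momentum: g₄ = 2(0.331 − 3) = −5.338, v₄ = 0.9 × (−15.806) + (−5.338) = −19.564, θ₄ = 0.331 + 0.1956 = 0.527. Vanilla: θ₄ = 0.233.
Step 5: Momentum: g₅ = 2(0.527 − 3) = −4.947, v₅ = 0.9 × (−19.564) + (−4.947) = −22.555, θ₅ = 0.527 + 0.2255 = 0.752. Vanilla: θ₅ = 0.288.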
Summary after 5 steps:
| Step | Momentum θ | Vanilla θ | Momentum Velocity |
|---|---|---|---|
| 0 | 0.000 | 0.000 | 0.000 |
| 1 | 0.060 | 0.060 | −6.000 |
| 2 | 0.173 | 0.119 | −11.280 |
| 3 | 0.331 | 0.176 | −15.806 |
| 4 | 0.527 | 0.233 | −19.564 |
| 5 | 0.752 | 0.288 | −22.555 |
After 5 steps, momentum has reached θ = 0.752 while vanilla SGD is only at θ = 0.288. Momentum is 2.6× further along. And the velocity is still building — each step is larger than the last because the gradient has been consistently negative (pointing toward the minimum at x = 3). In a consistent direction, momentum accelerates.
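The summary table can be regenerated with a short loop. This is a minimal sketch of the trace under the same settings (η = 0.01, β = 0.9, start at 0); the helper name `trace` is an illustrative choice.

```python
def trace(beta, lr=0.01, steps=5):
    """Minimise f(x) = (x - 3)^2 from x = 0; beta = 0 is vanilla gradient descent."""
    theta, v = 0.0, 0.0
    history = [(theta, v)]
    for _ in range(steps):
        g = 2 * (theta - 3)     # gradient at the current position
        v = beta * v + g        # accumulate velocity
        theta = theta - lr * v  # step along the velocity
        history.append((theta, v))
    return history

momentum = trace(beta=0.9)
vanilla = trace(beta=0.0)
for step, ((th_m, v_m), (th_v, _)) in enumerate(zip(momentum, vanilla)):
    print(f"{step}  {th_m:.3f}  {th_v:.3f}  {v_m:.3f}")
```

Increasing `steps` shows what the table only hints at: the momentum run keeps accelerating until it passes x = 3, then has to spend steps bleeding off velocity.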
The widget below shows side-by-side trajectories on an elongated ravine. Vanilla SGD (left) zig-zags across the narrow dimension. Momentum (right) smooths out the oscillations and accelerates along the floor. Adjust β and watch the behaviour change.
Both optimizers start at the same point on a ravine-shaped loss surface. Watch how momentum damps the zig-zag and accelerates toward the minimum.
The momentum coefficient β is the second critical hyperparameter (after the learning rate). Here's what happens at different values:
β = 0: No momentum. v_t = g_t, and we recover vanilla SGD. The ball has no mass — each step depends only on the current gradient.
β = 0.5: Light momentum. Effective window ≈ 2 steps. Some smoothing, but the optimizer is still quite reactive to individual gradients. Rarely used in practice.
β = 0.9: Standard momentum. Effective window ≈ 10 steps. This is the conventional choice in most deep learning setups (note that torch.optim.SGD itself defaults to momentum=0, so you have to pass it explicitly). It provides significant smoothing while still responding reasonably quickly to changes in the gradient direction. When someone says "SGD with momentum," they almost always mean β = 0.9.
β = 0.99: Heavy momentum. Effective window ≈ 100 steps. The optimizer is very smooth but very slow to change direction. If the gradient reverses (you've passed the minimum), it takes ~100 steps for the velocity to reverse. This can cause severe overshoot. Used occasionally with very noisy gradients where extreme smoothing is needed.
β = 0.999: Almost never used for vanilla momentum (but this value appears in Adam's second moment — we'll see why later). The velocity barely responds to new information. 1000-step memory.
Momentum has a subtle flaw: it evaluates the gradient at the current position θ, then adds it to the velocity. But if the velocity is large, the next position will be far from θ — so the gradient at θ might be a poor estimate of the gradient at the place we're actually going to land.
Nesterov momentum (NAG) fixes this with a clever trick: first take a step in the direction of the current velocity (the "lookahead"), then compute the gradient at that lookahead point. The analogy: instead of standing still and looking downhill to decide your next step, take a running start in the direction you've been going, and then look downhill from where you've arrived.
The equations:
v_t = β · v_(t−1) + ∇L̃(θ_(t−1) − η · β · v_(t−1))
θ_t = θ_(t−1) − η · v_t
The only difference from classical momentum is in where we evaluate the gradient. Classical evaluates at θ_(t−1) (where we are). Nesterov evaluates at θ_(t−1) − η·β·v_(t−1) (where we're about to go). This "lookahead" makes Nesterov more responsive to curvature changes.
Why does this help? Consider what happens near a minimum. Classical momentum has built up a large velocity pointing toward (and past) the minimum. It evaluates the gradient at the current position, which might still say "keep going." So it overshoots. Nesterov first jumps ahead to where it would land, computes the gradient there (which says "you've gone too far, come back"), and uses that to correct the velocity before committing to the step. It's a form of error correction built into the update.
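The overshoot argument can be tested on the familiar parabola f(x) = (x − 3)². The sketch below runs classical and Nesterov momentum side by side; it uses η = 0.1 rather than the 0.01 of the earlier trace (an assumption made here so that the overshoot shows up within a few dozen steps), and the helper name `run` is illustrative.

```python
def run(nesterov, lr=0.1, beta=0.9, steps=40):
    """Minimise f(x) = (x - 3)^2 from x = 0 and return every position visited."""
    x, v = 0.0, 0.0
    xs = [x]
    for _ in range(steps):
        # Classical: gradient where we are. Nesterov: gradient at the lookahead point.
        point = x - lr * beta * v if nesterov else x
        g = 2 * (point - 3)
        v = beta * v + g
        x = x - lr * v
        xs.append(x)
    return xs

classical = run(nesterov=False)
nesterov = run(nesterov=True)
print("max overshoot, classical:", round(max(classical) - 3, 3))
print("max overshoot, Nesterov :", round(max(nesterov) - 3, 3))
```

Both runs swing past the minimum, but the Nesterov run's largest excursion is noticeably smaller because the lookahead gradient starts braking the velocity earlier.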
The widget below shows how the lookahead step changes the trajectory near a minimum. Classical momentum overshoots and oscillates. Nesterov anticipates the overshoot and corrects earlier.
Watch the ghost position (lookahead point) peek ahead of the ball. Near the minimum, the lookahead's gradient says "you've gone too far" and corrects the velocity before the actual step.
Momentum is not free. The same inertia that helps you accelerate through consistent slopes also makes it hard to stop.
When the gradient direction reverses — you've passed the minimum and the slope now points the other way — the velocity still carries you in the old direction. How long until the velocity reverses? The velocity decays by a factor of β each step (in the absence of reinforcing gradients), so it takes roughly 1/(1 − β) steps to die off. At β = 0.9, that's ~10 steps of overshoot. At β = 0.99, it's ~100 steps. That's a lot of wasted motion.
In practice, momentum overshoot manifests as the loss increasing temporarily after a decrease. You'll see the training loss curve dip, then bump up, then dip lower — a characteristic sawtooth pattern. This is normal with momentum and doesn't mean anything is wrong. If the bumps grow rather than shrink, your learning rate is too high.
This is why momentum works so well in practice: it doesn't need to know which directions oscillate and which are consistent. The exponential average figures it out automatically from the gradient history. Consistent directions accumulate velocity; oscillating directions cancel. It's a beautifully simple algorithm that solves the ravine problem with just one extra hyperparameter and one extra vector of memory.
```python
import numpy as np

def sgd_momentum(params, grads, velocities, lr, beta=0.9):
    """One step of SGD with momentum."""
    for i in range(len(params)):
        velocities[i] = beta * velocities[i] + grads[i]  # accumulate
        params[i] -= lr * velocities[i]                  # step

def sgd_nesterov(params, grads_fn, velocities, lr, beta=0.9):
    """One step of Nesterov momentum.

    grads_fn(params) returns gradients evaluated at given params.
    """
    # Step 1: lookahead
    lookahead = [p - lr * beta * v for p, v in zip(params, velocities)]
    # Step 2: gradient at lookahead
    grads = grads_fn(lookahead)
    # Step 3: update velocity and params
    for i in range(len(params)):
        velocities[i] = beta * velocities[i] + grads[i]
        params[i] -= lr * velocities[i]
```
And the PyTorch one-liner:
```python
import torch

model = torch.nn.Linear(1, 1, bias=False)  # same toy model as before

# Classical momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Nesterov momentum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
```
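Notice that PyTorch uses the exact same SGD class for all three
variants: vanilla (momentum=0), classical momentum, and Nesterov. The only
difference is the flags you pass. Under the hood, classical momentum matches
the equations we derived above: one velocity buffer per parameter, one
multiply-accumulate per step. With nesterov=True, PyTorch uses a reformulation
that folds the lookahead correction into the gradient at the current parameters
instead of evaluating a separate gradient at the lookahead point, so its
iterates follow the same idea without being step-for-step identical to
sgd_nesterov above.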
| Variant | When to Use | Typical Settings |
|---|---|---|
| Vanilla SGD | Simple problems, debugging, when you want the simplest baseline | η = 0.1–0.001 |
| SGD + Momentum | Most vision tasks (ResNet, ConvNets), when you want strong generalisation | η = 0.1, β = 0.9, with LR schedule |
| SGD + Nesterov | Same as momentum but slightly better convergence; default recommendation | η = 0.1, β = 0.9, with LR schedule |
SGD with Nesterov momentum remains a strong choice for supervised image classification (ImageNet). It often generalises better than Adam in such settings (we'll explore why in a later chapter). The catch: it requires careful learning rate scheduling (warmup + cosine decay is a common recipe). Adam is more forgiving of the learning rate choice, which is one reason it dominates in NLP and generative models where hyperparameter tuning budgets are limited.
Momentum fixes the oscillation problem: it damps zig-zagging in ravines and accelerates along consistent slopes. But it still uses a single global learning rate for all parameters. A parameter that consistently receives tiny gradients (an early layer in a deep network) and a parameter that receives huge gradients (the final layer) both get the same η.
What if we could give each parameter its own learning rate, automatically scaled by how large its gradients typically are? That's exactly what adaptive learning rate methods do — and we'll build the first one, AdaGrad, in the next chapter.
Some weights update frequently. Others barely update at all. A single learning rate can't serve both.
Picture a word embedding matrix with 50,000 rows — one row per word. During training, the word "the" appears in almost every batch. Its embedding row gets a gradient update on nearly every step. But the word "serendipitous" appears once every thousand batches. Its row sits frozen for hundreds of steps, then gets a single gradient.
If you tune the learning rate for "the" — small enough that it doesn't oscillate — it's far too small for "serendipitous." That rare word needs a big step to learn anything at all in its few chances. But if you crank up the learning rate for the rare words, the frequent words overshoot wildly.
This is the parameter heterogeneity problem. It's not just an NLP thing. In any neural network, some parameters sit in busy pathways (getting updated constantly) while others sit in quiet corners (updated rarely). A single global learning rate is a compromise that serves nobody well.
AdaGrad (Adaptive Gradient, Duchi et al., 2011) addresses this by giving every parameter its own learning rate, using a beautifully simple mechanism. For each parameter, it keeps a running sum of all squared gradients it has ever received.
Here's the rule. At each step t, for each parameter:
G_t = G_(t−1) + g_t²
Where g_t is the gradient at step t. G is a running sum of squared gradients (starting from G₀ = 0), so it only ever grows. Then update the parameter:
θ_t = θ_(t−1) − η / (√G_t + ε) · g_t
Where η is the base learning rate and ε (typically 1e-8) prevents division by zero. The key quantity is the effective learning rate:
η_eff = η / (√G_t + ε)
A parameter that has received many large gradients has a large G, so its effective LR is small. A parameter that has received few or small gradients has a small G, so its effective LR stays large. The learning rate adapts automatically, per parameter, based on history.
Let's see this in action. Two parameters, η=0.1, ε=1e-8 (we'll ignore ε in the display since it's negligible).
Parameter A — frequent, consistent gradients: [3.0, 2.5, 2.8, 3.1]
Parameter B — rare, then a spike: [0.0, 0.0, 0.0, 5.0]
Step 1:
A: G = 0 + 3.0² = 9.0 → effective LR = 0.1/√9.0 = 0.1/3.0 = 0.0333 → update = 0.0333 × 3.0 = 0.100
B: G = 0 + 0.0² = 0.0 → update = 0 (gradient is zero)
Step 2:
A: G = 9.0 + 2.5² = 9.0 + 6.25 = 15.25 → effective LR = 0.1/√15.25 = 0.1/3.906 = 0.0256
B: G = 0 → update = 0
Step 3:
A: G = 15.25 + 2.8² = 15.25 + 7.84 = 23.09 → effective LR = 0.1/√23.09 = 0.1/4.805 = 0.0208
B: G = 0 → update = 0
Step 4:
A: G = 23.09 + 3.1² = 23.09 + 9.61 = 32.70 → effective LR = 0.1/√32.70 = 0.1/5.718 = 0.0175
B: G = 0 + 5.0² = 25.0 → effective LR = 0.1/√25.0 = 0.1/5.0 = 0.0200
| Step | A: G | A: eff. LR | B: G | B: eff. LR |
|---|---|---|---|---|
| 1 | 9.00 | 0.0333 | 0.00 | — |
| 2 | 15.25 | 0.0256 | 0.00 | — |
| 3 | 23.09 | 0.0208 | 0.00 | — |
| 4 | 32.70 | 0.0175 | 25.00 | 0.0200 |
Look at step 4. After four rounds of updates, parameter A's effective LR has decayed from 0.0333 to 0.0175 — nearly halved. But parameter B, receiving its very first real gradient, gets an effective LR of 0.0200 — larger than A's despite appearing later. When B finally sees a gradient, AdaGrad gives it a big step. That's the feature.
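The table can be reproduced with a few lines of Python. This is a minimal sketch for a single scalar parameter; the helper name `adagrad_effective_lrs` is an illustrative choice.

```python
import math

def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return (G, effective LR) after each step for one parameter."""
    G, rows = 0.0, []
    for g in grads:
        G += g ** 2  # running sum of squared gradients
        rows.append((G, lr / (math.sqrt(G) + eps)))
    return rows

rows_a = adagrad_effective_lrs([3.0, 2.5, 2.8, 3.1])  # parameter A
rows_b = adagrad_effective_lrs([0.0, 0.0, 0.0, 5.0])  # parameter B
for step, ((Ga, lra), (Gb, lrb)) in enumerate(zip(rows_a, rows_b), start=1):
    print(f"{step}  A: G={Ga:.2f} lr={lra:.4f}   B: G={Gb:.2f} lr={lrb:.4g}")
```

While B's G is still zero the formula evaluates to η/ε, a huge number, but it is multiplied by a zero gradient, so the update is zero. That is why the table shows a dash for those steps.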
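Stare at G in that table. It goes 9 → 15 → 23 → 33. It only grows. It can never shrink. With gradients of this size, G is roughly 820 after 100 steps and roughly 82,000 after 10,000 steps. The effective LR becomes:
after 100 steps: 0.1/√820 ≈ 0.0035
after 10,000 steps: 0.1/√82,000 ≈ 0.00035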
After a million steps? The effective LR is essentially zero. Training stops. The optimizer has committed suicide — it accumulated so much history that it can no longer take meaningful steps in any direction.
This is fine for convex problems (where you're approaching a fixed minimum and want to slow down). But deep learning is non-convex, runs for millions of steps, and the loss landscape changes as other parameters update. A learning rate that decays to zero is catastrophic.
RMSProp (Root Mean Square Propagation) was proposed by Geoffrey Hinton in a Coursera lecture, not even a published paper! It fixes AdaGrad with one simple change: instead of summing all squared gradients forever, use an exponential moving average.
G_t = ρ · G_(t−1) + (1 − ρ) · g_t²
Where ρ (rho, typically 0.9 or 0.99) controls how fast old gradients fade. With ρ=0.9, the contribution of a gradient from 10 steps ago is 0.9¹⁰ ≈ 0.35 of its original weight. From 50 steps ago: 0.9⁵⁰ ≈ 0.005. Practically gone.
The update rule is identical to AdaGrad:
θ_t = θ_(t−1) − η / (√G_t + ε) · g_t
But now G is a running average instead of a running sum. It stays bounded. Old history fades. The effective learning rate remains healthy indefinitely.
Same parameter A, same gradients [3.0, 2.5, 2.8, 3.1], η=0.1, ρ=0.9:
Step 1: G = 0.9 × 0 + 0.1 × 9.0 = 0.9 → eff. LR = 0.1/√0.9 = 0.1/0.949 = 0.1054
Step 2: G = 0.9 × 0.9 + 0.1 × 6.25 = 0.81 + 0.625 = 1.435 → eff. LR = 0.1/1.198 = 0.0835
Step 3: G = 0.9 × 1.435 + 0.1 × 7.84 = 1.292 + 0.784 = 2.076 → eff. LR = 0.1/1.441 = 0.0694
Step 4: G = 0.9 × 2.076 + 0.1 × 9.61 = 1.868 + 0.961 = 2.829 → eff. LR = 0.1/1.682 = 0.0595
| Step | G (AdaGrad) | G (RMSProp) | eff. LR (AdaGrad) | eff. LR (RMSProp) |
|---|---|---|---|---|
| 1 | 9.00 | 0.90 | 0.0333 | 0.1054 |
| 2 | 15.25 | 1.44 | 0.0256 | 0.0835 |
| 3 | 23.09 | 2.08 | 0.0208 | 0.0694 |
| 4 | 32.70 | 2.83 | 0.0175 | 0.0595 |
AdaGrad's G: 9 → 15 → 23 → 33. Monotonically increasing, forever. RMSProp's G: 0.9 → 1.4 → 2.1 → 2.8. Bounded. With gradients like these, after 10,000 steps RMSProp's G would hover around 8.2 (the average of g²), not roughly 82,000 like AdaGrad's.
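To see the long-run difference rather than just four steps, the sketch below feeds both accumulators the same gradients for 10,000 steps. It assumes, purely for illustration, that parameter A's four gradients repeat in a cycle.

```python
import itertools

grads = [3.0, 2.5, 2.8, 3.1]  # parameter A's gradients, repeated in a cycle (assumption)
lr, rho, steps = 0.1, 0.9, 10_000

G_ada, G_rms = 0.0, 0.0
for g in itertools.islice(itertools.cycle(grads), steps):
    G_ada += g ** 2                           # AdaGrad: running sum
    G_rms = rho * G_rms + (1 - rho) * g ** 2  # RMSProp: running average

lr_ada = lr / G_ada ** 0.5
lr_rms = lr / G_rms ** 0.5
print(f"AdaGrad: G = {G_ada:.0f}, effective LR = {lr_ada:.5f}")
print(f"RMSProp: G = {G_rms:.2f}, effective LR = {lr_rms:.5f}")
```

AdaGrad's accumulator keeps growing with the step count, while RMSProp's settles near the typical squared gradient, so its effective learning rate stays roughly two orders of magnitude larger in this setup.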
A 2D loss surface where one axis has steep gradients and the other has gentle ones. The bars on the right show the effective learning rate for each axis.
```python
# AdaGrad, from scratch
def adagrad_step(params, grads, cache, lr=0.01, eps=1e-8):
    for i in range(len(params)):
        cache[i] += grads[i] ** 2  # accumulate squared grads
        params[i] -= lr * grads[i] / (cache[i] ** 0.5 + eps)


# RMSProp, from scratch
def rmsprop_step(params, grads, cache, lr=0.01, rho=0.9, eps=1e-8):
    for i in range(len(params)):
        cache[i] = rho * cache[i] + (1 - rho) * grads[i] ** 2  # EMA
        params[i] -= lr * grads[i] / (cache[i] ** 0.5 + eps)
```
The only difference: += (sum) vs = rho * old + (1-rho) * new (EMA). Changing that one line fixes AdaGrad's fatal flaw.
```python
import torch

# PyTorch equivalents (`model` is any torch.nn.Module you have already built)
optimizer_ag = torch.optim.Adagrad(model.parameters(), lr=0.01)
optimizer_rms = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9)
```
Both start with η=0.1. AdaGrad's effective LR decays toward zero. RMSProp's stabilizes.
Momentum smooths the gradient direction. RMSProp scales the learning rate per parameter. What if we did both at once?
That's exactly what Adam (Adaptive Moment Estimation, Kingma & Ba, 2015) does. It maintains two running averages per parameter:
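At each step t:
$$
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$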
That's five lines. Let's decode every piece:
| Symbol | Name | Default | Role |
|---|---|---|---|
| mt | First moment estimate | — | EMA of gradients (momentum direction) |
| vt | Second moment estimate | — | EMA of squared gradients (adaptive scaling) |
| β1 | First moment decay | 0.9 | How much gradient history to keep |
| β2 | Second moment decay | 0.999 | How much squared-gradient history to keep |
| η | Learning rate | 0.001 | Base step size |
| ε | Epsilon | 1e-8 | Prevents division by zero |
| m̂, v̂ | Bias-corrected estimates | — | Fix the initialization bias (explained below) |
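Both m and v are initialised to zero. At step 1:
$$m_1 = \beta_1 \cdot 0 + (1 - \beta_1)\, g_1 = 0.1\, g_1, \qquad v_1 = \beta_2 \cdot 0 + (1 - \beta_2)\, g_1^2 = 0.001\, g_1^2$$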
The true mean of the gradient is roughly g1, but m1 = 0.1 · g1. It's biased toward zero because we initialised m0 = 0 and the EMA hasn't had enough terms to wash out the zero start.
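How bad is the bias? At step t, the expected value of mt is (assuming the gradient distribution is roughly stationary):
$$\mathbb{E}[m_t] = (1 - \beta_1^t)\, \mathbb{E}[g_t]$$
At t=1 with β1=0.9: factor = 1 − 0.9¹ = 0.1. The estimate is 10% of the true value! At t=10: factor = 1 − 0.9¹⁰ = 1 − 0.349 = 0.651. Still 35% too low. The fix is to divide by the factor:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$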
At t=1: m̂1 = 0.1g1 / 0.1 = g1. The bias is gone. At t=10: m̂10 = m10 / 0.651. Still a meaningful correction. At t=1000: 1 − 0.9¹⁰⁰⁰ ≈ 1.0. The correction disappears.
The same logic applies to vt, but β2 = 0.999 so the bias lasts much longer: it takes about 1000 steps before (1 − 0.999ᵗ) ≈ 0.63. Without bias correction, the first ~1000 steps of Adam would have wildly inflated step sizes (because v is too small → η/√v is too large).
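The correction factors quoted above are easy to tabulate. The sketch below is only a numerical check of the 1 − βᵗ expression; the function name `bias_factor` is made up for this example.

```python
def bias_factor(beta, t):
    """Fraction of the true moment that a zero-initialised EMA reports at step t."""
    return 1.0 - beta ** t

for t in (1, 10, 100, 1000):
    print(f"t={t:>4}  beta1=0.9 -> {bias_factor(0.9, t):.3f}   "
          f"beta2=0.999 -> {bias_factor(0.999, t):.3f}")
```

The β1 column is already close to 1 by t=100, while the β2 column has only reached about 0.63 at t=1000, which is why the second-moment correction matters for so much longer.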
One parameter, gradients g = [4.0, -1.0, 3.0], η=0.001, β1=0.9, β2=0.999, ε=1e-8.
Step 1 (g1 = 4.0):
Step 2 (g2 = -1.0):
The gradient flipped sign (from +4 to -1). The momentum m is still positive (0.26) — it remembers the +4 from step 1. But it's moving toward zero because the new gradient is negative. The step size dropped from 0.001 to 0.000469 because the momentum is conflicted (positive from history, but latest gradient says negative).
Step 3 (g3 = 3.0):
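The three steps of this worked example can be traced numerically. The following is a minimal sketch for a single scalar parameter with the stated hyperparameters (η=0.001, β1=0.9, β2=0.999, ε=1e-8); the function name `adam_trace` and the dictionary layout are illustrative choices, not part of any library.

```python
import math

def adam_trace(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Run Adam on one scalar parameter and record every intermediate value."""
    m = v = 0.0
    rows = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g                     # first moment (EMA of g)
        v = b2 * v + (1 - b2) * g ** 2                # second moment (EMA of g^2)
        m_hat = m / (1 - b1 ** t)                     # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)                     # bias-corrected second moment
        step = lr * m_hat / (math.sqrt(v_hat) + eps)  # amount subtracted from theta
        rows.append({"t": t, "g": g, "m": m, "v": v,
                     "m_hat": m_hat, "v_hat": v_hat, "step": step})
    return rows

trace = adam_trace([4.0, -1.0, 3.0])
for r in trace:
    print(f"t={r['t']}  g={r['g']:+.1f}  m={r['m']:.4f}  v={r['v']:.6f}  "
          f"m_hat={r['m_hat']:.4f}  v_hat={r['v_hat']:.4f}  step={r['step']:.6f}")
```

At step 2 the trace gives m = 0.26 and a step of about 0.000469, matching the figures discussed above.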
Watch how mt, vt, their bias-corrected versions, and the final update evolve over steps. Drag the step slider or change β values.
```python
def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One step of Adam. t is the 1-based step counter."""
    for i in range(len(params)):
        m[i] = b1 * m[i] + (1 - b1) * grads[i]          # momentum
        v[i] = b2 * v[i] + (1 - b2) * grads[i] ** 2     # RMSProp
        m_hat = m[i] / (1 - b1 ** t)                    # bias correction
        v_hat = v[i] / (1 - b2 ** t)
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)  # update
```
PyTorch:
```python
import torch

# `model` is any torch.nn.Module you have already built
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```
The widget below shows two trajectories: one with bias correction (the real Adam) and one without (what happens if you skip the m̂/v̂ correction). Without correction, the first ~50 steps take wildly wrong step sizes.
Teal = with correction (Adam). Red = without correction (naive). Watch how the naive version takes oversized early steps.
Adam became the default optimizer for most of deep learning because of three properties:
1. Robust to learning rate choice. The m/√v normalisation means the effective step size is approximately η regardless of gradient scale. You can use η=0.001 on problems with tiny gradients or huge gradients and it works reasonably well. SGD with momentum requires tuning η over orders of magnitude.
2. Fast initial progress. The momentum term m smooths noisy gradients, while the adaptive scaling v ensures no parameter gets stuck with a too-large or too-small learning rate. The combination converges quickly in the early phase of training.
3. Handles sparse gradients. Like RMSProp, Adam gives larger effective learning rates to parameters that receive rare or small gradients. This is critical in NLP (rare tokens), recommendation systems (rare items), and any domain with sparse features.
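Property 1 can be checked directly: on the very first step, bias-corrected Adam moves a parameter by almost exactly η whatever the size of the gradient. The sketch below assumes the default hyperparameters from the table above; `adam_first_step` is a made-up helper name.

```python
import math

def adam_first_step(g, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Size of Adam's first update for a single scalar gradient g."""
    m_hat = (1 - b1) * g / (1 - b1)        # bias correction undoes the zero start
    v_hat = (1 - b2) * g ** 2 / (1 - b2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

for g in (1e-3, 1.0, 1e3):
    print(f"gradient {g:>8}: step {adam_first_step(g):.6e}")
```

The gradient spans six orders of magnitude, yet every step is about 1e-3. The invariance breaks down only when the gradient is so small that ε dominates the denominator.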
Adam has a subtle bug when combined with weight decay. To understand it, we first need to understand what weight decay is and why it matters.
Large weights are suspicious. A model with weights of magnitude 1000 is fitting the training data with extreme precision — memorising noise, not learning patterns. Weight decay is a gentle force that pulls all weights toward zero, penalising complexity.
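The simplest form is L2 regularisation: add a penalty proportional to the sum of squared weights:
$$L_{\text{total}}(\theta) = L(\theta) + \frac{\lambda}{2}\, \lVert \theta \rVert^2$$
where λ is the regularisation strength. The gradient of the penalty term is:
$$\nabla_\theta \left( \frac{\lambda}{2}\, \lVert \theta \rVert^2 \right) = \lambda\, \theta$$
So the gradient becomes gt + λθ, and the SGD update becomes:
$$\theta \leftarrow \theta - \eta\, (g_t + \lambda\theta) = (1 - \eta\lambda)\, \theta - \eta\, g_t$$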
That factor (1 - ηλ) shrinks the weights slightly each step. With η=0.01 and λ=0.01, each weight is multiplied by 0.9999 per step — a gentle 0.01% decay. Over thousands of steps, this keeps weights from exploding.
For SGD, L2 regularisation and weight decay are mathematically identical: adding λθ to the gradient has the same effect as shrinking θ by (1 - ηλ). But for Adam, they are NOT the same.
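The SGD equivalence is a one-line algebraic identity, and it can be confirmed numerically. The sketch below uses plain SGD without momentum; the two function names and the example values θ=5.0, g=2.0 are chosen only for illustration.

```python
def sgd_with_l2(theta, grad, lr, lam):
    """Add the L2 gradient (lam * theta) to the data gradient, then step."""
    return theta - lr * (grad + lam * theta)

def sgd_with_decay(theta, grad, lr, lam):
    """Shrink theta by (1 - lr * lam), then step on the data gradient alone."""
    return (1 - lr * lam) * theta - lr * grad

theta, grad = 5.0, 2.0
a = sgd_with_l2(theta, grad, lr=0.01, lam=0.01)
b = sgd_with_decay(theta, grad, lr=0.01, lam=0.01)
print(a, b)                                # same value up to float rounding
print("shrink factor:", 1 - 0.01 * 0.01)
```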
When you add the L2 term λθ to the gradient in Adam, it gets fed into both the first moment m and the second moment v. The adaptive scaling then modifies the effective weight decay differently for each parameter. Parameters with large accumulated gradients (large v) get less weight decay. Parameters with small accumulated gradients get more weight decay. The regularisation becomes non-uniform and unpredictable.
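Let's see this concretely. With L2 in Adam, the effective weight decay for parameter i (the fraction of θi removed per step, treating the momentum average as roughly equal to the current gradient) is:
$$\lambda_{\text{eff},i} \approx \frac{\eta\, \lambda}{\sqrt{\hat{v}_i} + \epsilon}$$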
A busy parameter (large vi) gets its decay divided by a large number. A quiet parameter (small vi) gets its decay amplified. This is backwards! Busy parameters — often in the final layers — tend to have larger weights and need more regularisation, not less.
Loshchilov & Hutter (2019) proposed a simple fix: decouple the weight decay from the adaptive gradient scaling. Instead of adding λθ to the gradient (which then goes through Adam's machinery), apply weight decay directly to the parameter update:
$$\theta_t = (1 - \eta\lambda)\, \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
The key change: gt is the pure data gradient (no L2 term added). The weight decay λθ is applied directly by multiplying θ by (1 - ηλ) outside of Adam's m/v machinery. Now every parameter gets exactly the same proportional decay, regardless of its gradient history.
One parameter: θ=5.0, g=2.0, λ=0.1, η=0.001. Assume we're at a step where bias correction is negligible (late training). Assume v stabilised at v = 4.0.
Adam + L2 regularisation:
AdamW (decoupled):
The effective weight decay in AdamW (0.01%) is 2× larger than in Adam+L2 (0.005%). The L2 version had its decay reduced by the adaptive scaling (dividing by √v = 2). AdamW applies the decay at its intended strength.
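The comparison can be worked through in a few lines. The sketch below uses the same simplifying assumptions as the example (late training, so bias correction is ignored; m is taken equal to the current gradient; v is held fixed at 4.0); all variable names are illustrative.

```python
import math

theta, g, lam, lr, v = 5.0, 2.0, 0.1, 1e-3, 4.0
scale = math.sqrt(v)   # Adam divides every update by sqrt(v) = 2 (eps ignored)

# Adam + L2: the decay term lam * theta rides along with the gradient,
# so it is divided by sqrt(v) as well.
adam_l2_theta = theta - lr * (g + lam * theta) / scale
l2_decay_rel = (lr * lam * theta / scale) / theta     # fraction of theta removed

# AdamW: the decay bypasses the adaptive scaling entirely.
adamw_theta = (1 - lr * lam) * theta - lr * g / scale
adamw_decay_rel = lr * lam                            # fraction of theta removed

print(f"Adam+L2: theta -> {adam_l2_theta:.5f}, decay = {l2_decay_rel:.4%} of theta")
print(f"AdamW:   theta -> {adamw_theta:.5f}, decay = {adamw_decay_rel:.4%} of theta")
print(f"ratio:   {adamw_decay_rel / l2_decay_rel:.1f}x")
```

With a different value of v the ratio changes, which is exactly the parameter-dependent behaviour that decoupling removes.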
Two weights start at the same value. Watch how L2 regularisation interacts differently with Adam's adaptive scaling vs AdamW's decoupled decay.
```python
def adamw_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    for i in range(len(params)):
        # 1. Weight decay (DECOUPLED: applied to params, not grad)
        params[i] *= (1 - lr * wd)
        # 2. Standard Adam on pure data gradient
        m[i] = b1 * m[i] + (1 - b1) * grads[i]
        v[i] = b2 * v[i] + (1 - b2) * grads[i] ** 2
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)
```
PyTorch:
```python
import torch

# AdamW (decoupled weight decay), now the usual default choice
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# WARNING: this is L2, NOT decoupled weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)
```
torch.optim.Adam(weight_decay=0.01) applies L2 regularisation (adds λθ to the gradient). torch.optim.AdamW(weight_decay=0.01) applies decoupled weight decay. They are NOT the same. If you're using Adam with weight decay, you almost certainly want AdamW.
Adam is excellent but not the end of the story. Researchers keep pushing for faster convergence, better generalisation, or lower memory. Here we survey the most important post-Adam optimizers — each one a variation on the themes we've already built.
Training with very large batch sizes (8192+) breaks Adam. The gradient noise drops so low that the optimizer converges to sharp minima. LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments optimizer for Batch training) fix this by scaling the learning rate per layer, not just per parameter.
The idea: compute a "trust ratio" for each layer — the ratio of the weight norm to the gradient norm. If the gradient is huge relative to the weights, the trust ratio is small (take a cautious step). If the gradient is tiny, the trust ratio is large (take a bold step).
LARS wraps SGD+momentum. LAMB wraps Adam. Google used LAMB to train BERT in 76 minutes on 1024 TPUs — a 76× speedup from the original 4 days. The key: LAMB made batch size 64K stable, so massive data parallelism worked.
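The trust-ratio idea can be sketched for a single layer as follows. This is a simplified illustration, not the full LARS or LAMB algorithm: published versions also fold weight decay into the update and may clip the ratio. Here `update` stands for whatever the wrapped optimizer proposes (the gradient for LARS, Adam's direction for LAMB), and all function names are made up for this example.

```python
import math

def l2_norm(xs):
    return math.sqrt(sum(x * x for x in xs))

def trust_ratio(weights, update):
    """||w|| / ||update||, falling back to 1.0 if either norm is zero."""
    w_norm, u_norm = l2_norm(weights), l2_norm(update)
    if w_norm == 0.0 or u_norm == 0.0:
        return 1.0
    return w_norm / u_norm

def layerwise_step(weights, update, lr):
    """Scale the whole layer's step by its trust ratio."""
    ratio = trust_ratio(weights, update)
    return [w - lr * ratio * u for w, u in zip(weights, update)]

layer = [0.3, -0.4]        # weight norm 0.5
huge = [30.0, 40.0]        # update norm 50    -> ratio 0.01 (cautious)
tiny = [0.003, 0.004]      # update norm 0.005 -> ratio 100  (bold)

for name, upd in (("huge", huge), ("tiny", tiny)):
    new = layerwise_step(layer, upd, lr=0.1)
    moved = l2_norm([n - w for n, w in zip(new, layer)])
    print(f"{name}: ratio={trust_ratio(layer, upd):.2f}, distance moved={moved:.3f}")
```

In both cases the layer moves by lr × ‖w‖ = 0.05, so the step is always a fixed fraction of the layer's own weight norm, regardless of how large or small the raw update is.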
Adam's bias correction helps, but the variance of the adaptive learning rate is still high in early training. RAdam (Liu et al., 2020) tracks the variance of the second moment estimate and automatically switches from SGD (high variance regime) to Adam (low variance regime).
In practice: the first ~5 steps use SGD with momentum (no adaptive scaling), then smoothly transition to Adam as vt becomes reliable. This eliminates the need for learning rate warmup — RAdam warms itself up.
Adam stores two buffers (m and v) per parameter. For a 175B-parameter model like GPT-3, that's 350B extra floats = 1.4 TB of optimizer state in fp32. AdaFactor (Shazeer & Stern, 2018) reduces this dramatically by factorizing the second moment matrix.
For a weight matrix W of shape (d1 × d2), instead of storing a full d1×d2 second moment matrix, AdaFactor stores a row factor (d1) and a column factor (d2). Memory drops from O(d1·d2) to O(d1 + d2). For a 4096×4096 matrix, that's 8192 instead of 16.7M — a 2000× reduction.
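The memory arithmetic and the factorisation idea can both be sketched in plain Python. The reconstruction below (outer product of row sums and column sums, divided by the grand total) follows the rank-1 scheme described for AdaFactor, but it leaves out the moving averages and update clipping of the real optimizer; the function names are made up for this example.

```python
def second_moment_floats(d1, d2, factored):
    """Number of floats needed for the second-moment state of a d1 x d2 matrix."""
    return d1 + d2 if factored else d1 * d2

def factor(V):
    """Keep only the row sums and column sums of V."""
    rows = [sum(r) for r in V]
    cols = [sum(c) for c in zip(*V)]
    return rows, cols

def reconstruct(rows, cols):
    """Rank-1 estimate of V from its row and column sums."""
    total = sum(rows)
    return [[r * c / total for c in cols] for r in rows]

full = second_moment_floats(4096, 4096, factored=False)
small = second_moment_floats(4096, 4096, factored=True)
print(full, small, full // small)               # 16777216 8192 2048

# Adam state for a 175B-parameter model: two fp32 buffers of 4 bytes each
print(175e9 * 2 * 4 / 1e12, "TB")

V = [[1.0, 2.0], [3.0, 6.0]]                    # a rank-1 matrix
print(reconstruct(*factor(V)))                  # recovered exactly
```

The reconstruction is exact only when the matrix is rank 1; for real squared-gradient matrices it is an approximation, which is the accuracy traded away for the memory saving.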
Lion (EvoLved Sign Momentum, Chen et al., 2023) was discovered by having an AI search over the space of optimizer algorithms. The result is surprisingly simple — it uses only the sign of the momentum update, not its magnitude:
Every parameter moves by exactly ±η per step. No second moment, no adaptive scaling, no square roots. Memory: only one buffer (m) vs Adam's two (m + v). 50% less optimizer memory.
Lion's defaults: β1=0.9, β2=0.99, η=3×10⁻⁴ (3-10× smaller than Adam's because every update has the same magnitude). Weight decay is critical with Lion, typically 3-10× larger than with Adam.
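A from-scratch sketch of one Lion step, in the same style as the Adam code earlier, might look like this. It follows the update as described in the Lion paper (β1 blends momentum and gradient for the sign, β2 updates the stored momentum, weight decay is decoupled); treat it as an illustration under those assumptions rather than a reference implementation.

```python
def sign(x):
    """Return -1, 0 or +1."""
    return (x > 0) - (x < 0)

def lion_step(params, grads, m, lr=3e-4, b1=0.9, b2=0.99, wd=0.0):
    for i in range(len(params)):
        # Blend momentum and current gradient; only the SIGN of the blend is used
        c = b1 * m[i] + (1 - b1) * grads[i]
        # Every coordinate moves by exactly lr (plus decoupled weight decay)
        params[i] -= lr * (sign(c) + wd * params[i])
        # The stored momentum is updated with a different decay rate (b2)
        m[i] = b2 * m[i] + (1 - b2) * grads[i]

params, grads, m = [1.0, -2.0], [0.5, -3.0], [0.0, 0.0]
lion_step(params, grads, m)
print(params)   # both entries moved by 3e-4, despite a 6x difference in gradient
print(m)
```

Only the single buffer `m` is carried between steps, which is where the memory saving over Adam comes from.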
All methods so far use only first-order information (the gradient). The Hessian — the matrix of second derivatives — tells you about curvature. In a flat region, take large steps. In a sharply curved region, take small steps. Adam approximates this crudely (v captures gradient magnitude, which correlates with curvature). Sophia (Liu et al., 2023) uses a lightweight Hessian diagonal estimate for more precise scaling.
Sophia pre-conditions the gradient by dividing by a Hessian diagonal estimate ht (computed cheaply via a Hutchinson estimator every ~10 steps):
The clip operation caps the update magnitude, preventing the optimizer from taking catastrophically large steps where the Hessian estimate is inaccurate. The Sophia authors report roughly a 2× speed-up over Adam on LLM pre-training (the same loss in about half the steps) while using similar memory.
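The pre-condition-then-clip step can be sketched as below. This is a rough illustration of the idea only: the Hessian diagonal `h` is assumed to be supplied by some external estimator (the Hutchinson estimate is not implemented here), and the values of `gamma`, `eps` and the clip limit are placeholders chosen for the example, not tuned settings.

```python
def clip(x, limit=1.0):
    """Clamp x to the interval [-limit, limit]."""
    return max(-limit, min(limit, x))

def sophia_like_step(params, grads, m, h, lr=1e-4, b1=0.9, gamma=0.05, eps=1e-12):
    """One clipped, curvature-scaled update. h[i] is an estimate of the Hessian diagonal."""
    for i in range(len(params)):
        m[i] = b1 * m[i] + (1 - b1) * grads[i]        # EMA of gradients
        precond = m[i] / max(gamma * h[i], eps)       # divide by curvature estimate
        params[i] -= lr * clip(precond)               # never move more than lr

params, grads, m = [1.0, 1.0], [1.0, 1.0], [0.0, 0.0]
h = [100.0, 0.0]   # first coordinate sharply curved, second flat or unreliable
sophia_like_step(params, grads, m, h)
print(params)      # small step on the curved coordinate, capped step on the flat one
```

On the flat coordinate the raw ratio is enormous, and the clip turns it into a plain step of size lr, which is the safety net the text describes.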
Every optimizer we've discussed is a variation on a few core ideas. The tree below shows how they connect. Click any node for details.
| Optimizer | Year | Key Innovation | Memory | Best For |
|---|---|---|---|---|
| SGD | 1951 | Stochastic gradients | 0 extra | Baselines, convex |
| SGD+Momentum | 1964 | Velocity accumulation | 1 buffer | Vision (ResNet, ConvNets) |
| AdaGrad | 2011 | Per-param LR (cumulative) | 1 buffer | Sparse data, online |
| RMSProp | 2012 | Per-param LR (EMA) | 1 buffer | RNNs |
| Adam | 2015 | Momentum + adaptive LR | 2 buffers | General default |
| AdamW | 2019 | Decoupled weight decay | 2 buffers | Transformers, LLMs |
| LAMB | 2020 | Layer-wise trust ratio | 2 buffers | Large-batch training |
| Lion | 2023 | Sign-based, AI-discovered | 1 buffer | Memory-constrained LLMs |
| Sophia | 2023 | Hessian pre-conditioning | 2 buffers | LLM pre-training |
Even with Adam, a fixed learning rate is suboptimal. Early in training, you want large steps to make rapid progress. Late in training, you want small steps to settle precisely into a minimum. A learning rate schedule adjusts η over the course of training.
This is separate from adaptive per-parameter scaling (Adam's v). The schedule changes the base learning rate that multiplies everything. Think of it as the global volume knob, while Adam's per-parameter scaling is the equaliser.
Modern training almost always starts with a warmup phase: the learning rate starts near zero and linearly increases to its target value over the first few hundred or thousand steps.
Why? At initialisation, the model's weights are random. The gradients in the first few steps are large, noisy, and unrepresentative. Taking full-sized steps based on these wild early gradients can push the model into a bad region of the loss landscape from which it never recovers. Warmup says: "Take tiny steps while the gradients are chaotic. Ramp up to full speed once they stabilise."
For BERT: warmup lasts 10,000 steps out of 1,000,000 total (1%). For GPT-3: warmup is specified in tokens rather than steps (linear warmup over the first 375 million tokens). For ViT: 10,000 steps. Typical range: 1-5% of total training.
After warmup, several decay strategies are in common use:
1. Step Decay. Drop the LR by a fixed factor (e.g., ÷10) at predetermined epochs. Simple and effective for convnets: "Train at 0.1 for 30 epochs, 0.01 for 30 epochs, 0.001 for 30 epochs."
Where γ is the decay factor (typically 0.1) and s is the step interval.
2. Cosine Decay. The LR follows a half-cosine curve from ηmax down to ηmin (often 0 or ηmax/100). Smooth and widely used — the default for transformer training.
Cosine decay is popular because it's smooth (no abrupt drops that cause loss spikes), has no hyperparameters beyond the min/max LR and total steps, and empirically works as well as or better than hand-tuned step schedules.
3. Linear Decay. The simplest: a straight line from ηmax to 0 (or some ηmin). Used by GPT-2 and many BERT fine-tuning recipes.
4. Inverse Square Root. Popular for training transformers (the original "Attention Is All You Need" paper):
This gives warmup followed by a gradual 1/√t decay. Slower than cosine decay — the LR stays higher for longer, which can help on very long training runs.
5. Warmup-Stable-Decay (WSD). A three-phase schedule used by recent large models (e.g., MiniCPM, some LLaMA variants): warmup to peak LR, hold constant for most of training, then rapidly decay at the end.
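Three of the schedules above (step decay, inverse square root, and WSD) can be sketched as small functions. The step-decay form η₀·γ^⌊epoch/s⌋ and the inverse-square-root form d_model^-0.5 · min(step^-0.5, step · warmup^-1.5) follow the usual definitions; the WSD phase fractions and its linear final decay are assumptions made for this example, since recipes differ between models.

```python
def step_decay(epoch, lr0=0.1, gamma=0.1, every=30):
    """Multiply the LR by gamma once every `every` epochs."""
    return lr0 * gamma ** (epoch // every)

def inverse_sqrt(step, d_model=512, warmup=4000):
    """Linear warmup, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)                     # avoid 0 ** -0.5
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def wsd(step, total, lr_max, warmup_frac=0.01, decay_frac=0.1, lr_min=0.0):
    """Warmup-Stable-Decay with a linear final decay (fractions are illustrative)."""
    warmup = int(total * warmup_frac)
    decay_start = total - int(total * decay_frac)
    if step < warmup:
        return lr_max * step / warmup       # phase 1: ramp up
    if step < decay_start:
        return lr_max                       # phase 2: hold at peak
    progress = (step - decay_start) / (total - decay_start)
    return lr_max + (lr_min - lr_max) * progress   # phase 3: rapid decay

print([step_decay(e) for e in (0, 29, 30, 60)])
print(inverse_sqrt(2000), inverse_sqrt(4000), inverse_sqrt(16000))
print([wsd(s, 1000, 1e-3) for s in (0, 10, 500, 950, 1000)])
```

Note that the inverse-square-root schedule halves only when the step count quadruples, which is why it keeps the LR higher for longer than cosine decay.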
Total steps T = 1000, ηmax = 0.001, ηmin = 0, warmup for first 100 steps.
| Step | Phase | η |
|---|---|---|
| 0 | Warmup | 0 |
| 50 | Warmup | 0.001 × 50/100 = 0.0005 |
| 100 | Warmup complete | 0.001 |
| 325 | Cosine decay | 0.001 × (1+cos(π×225/900))/2 = 0.001 × 0.854 = 0.000854 |
| 550 | Cosine decay | 0.001 × (1+cos(π×450/900))/2 = 0.001 × 0.500 = 0.000500 |
| 775 | Cosine decay | 0.001 × (1+cos(π×675/900))/2 = 0.001 × 0.146 = 0.000146 |
| 1000 | End | 0.001 × (1+cos(π))/2 = 0 |
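The cosine curve is S-shaped, not front-loaded: only about 15% of the decay happens in the first quarter of the post-warmup phase (η is still 0.000854 at step 325), half has happened by the midpoint, and the curve flattens out again near the end. The LR therefore stays close to its peak while most of the learning happens, drops fastest in the middle, and spends the final stretch at small values, where training behaves like fine-tuning.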
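As a quick check on the shape of the curve, the fraction of the decay already completed at each quarter of the post-warmup phase can be computed directly from the cosine formula used in the table. The helper name `decay_completed` is made up for this sketch.

```python
import math

def decay_completed(progress):
    """Fraction of the max-to-min drop already applied at `progress` in [0, 1]."""
    return 1 - 0.5 * (1 + math.cos(math.pi * progress))

for q in (0.25, 0.5, 0.75, 1.0):
    print(f"{q:.2f} of the way through: {decay_completed(q):.1%} of the decay done")
```

The quarter points correspond to steps 325, 550, 775 and 1000 in the table: the multipliers 0.854, 0.500, 0.146 and 0 are one minus these fractions.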
Choose a schedule type, adjust total steps and warmup percentage, and see the learning rate curve. The Y-axis shows η normalised to the peak value.
```python
import math

import torch


def cosine_lr(step, total, warmup, lr_max, lr_min=0):
    """Warmup + cosine decay schedule."""
    if step < warmup:
        return lr_max * step / warmup  # linear warmup
    progress = (step - warmup) / (total - warmup)  # 0 to 1
    return lr_min + (lr_max - lr_min) * 0.5 * (
        1 + math.cos(math.pi * progress))  # cosine decay


# PyTorch equivalent (`optimizer` is an existing torch.optim optimizer)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=1000, eta_min=0)

# Combined with warmup (PyTorch 2.0+)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [
    torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-5, total_iters=100),
    torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=900)
], milestones=[100])
```
A common mistake is forgetting to call scheduler.step() after optimizer.step(). The learning rate then stays constant and you wonder why your training curve looks weird. In PyTorch, the optimizer and scheduler are separate objects: you must step both.
Everything we've learned, in one simulation. Drop multiple optimizers onto the same loss landscape and watch them race. Each one runs its actual algorithm — the real update equations, not an approximation.
This is the payoff. You've learned what each optimizer does; now you'll see the differences in real time. Try different landscapes and watch how each optimizer handles ravines, saddle points, and noisy surfaces.
Race optimizers head-to-head on the same landscape. Each runs its real update equations.
Loss vs step for each active optimizer. Lower = better.
On the Ravine: SGD zig-zags badly. Momentum cuts through the valley much faster. RMSProp compensates for the uneven curvature by scaling per-axis. Adam combines both advantages and reaches the minimum first.
On Rosenbrock: The famous banana-shaped valley. SGD and Momentum get stuck navigating the curve. Adam tracks the curved valley floor because its per-parameter scaling handles the varying curvature.
On the Saddle Point: All optimizers struggle at the exact saddle (zero gradient). In practice, mini-batch noise would kick them off. In our deterministic simulation, only momentum-based methods escape (their accumulated velocity carries them past the zero-gradient point). Try starting slightly off-center to see normal behavior.
Learning rate sensitivity: Drag the LR slider and watch. SGD is the most sensitive — too high and it diverges, too low and it crawls. Adam is the most forgiving — it works across a much wider range of learning rates because the m/√v normalisation compensates. This robustness is why Adam is the default for most practitioners.
You now understand the full optimizer toolkit — from raw gradient descent to modern adaptive methods with schedules. This chapter is your practical reference. No new concepts. Just the decision guide you'll actually use.
Click the nodes below to follow the decision path for your situation.
| Situation | Optimizer | LR | Schedule | Weight Decay |
|---|---|---|---|---|
| LLM pre-training (GPT, LLaMA) | AdamW | 3e-4 to 1e-3 | Warmup + cosine to 0.1×peak | 0.01–0.1 |
| Transformer fine-tuning (BERT) | AdamW | 1e-5 to 5e-5 | Linear warmup + linear decay | 0.01 |
| Vision (ResNet, ConvNet) | SGD+Nesterov | 0.1 | Cosine or step decay | 1e-4 |
| Vision Transformer (ViT) | AdamW | 1e-3 | Warmup + cosine | 0.05–0.3 |
| GAN training | Adam | 1e-4 to 2e-4 | None or very slow decay | 0 |
| Diffusion models | AdamW | 1e-4 | Constant or slow cosine | 0.01 |
| RL (PPO, SAC) | Adam | 3e-4 | Linear decay or none | 0 |
| Memory-constrained LLM | Lion | 3e-5 to 3e-4 | Same as AdamW | 0.1–1.0 (3-10× higher) |
| Large-batch distributed | LAMB | Scaled linearly with batch | Warmup + polynomial | 0.01 |
| Hyperparameter | Default | Typical Range | What it Controls |
|---|---|---|---|
| η (learning rate) | 1e-3 (Adam), 0.1 (SGD) | 1e-5 to 1.0 | Step size — the #1 most important hyperparameter |
| β1 | 0.9 | 0.8–0.95 | Momentum decay — how many steps of gradient history |
| β2 | 0.999 | 0.99–0.9999 | Second moment decay — how many steps of variance history |
| ε | 1e-8 | 1e-8 to 1e-6 | Numerical stability — rarely needs tuning |
| weight decay | 0.01 | 0–0.3 | Pull toward zero (decoupled in AdamW, L2 penalty in Adam); higher = stronger regularisation |
| warmup steps | 1-5% of total | 100–10,000 | Steps to ramp LR from 0 to peak |
| batch size | 32–256 | 16–65536 | Gradient noise level — larger = less noise |
When you have limited time for hyperparameter tuning, this is the order of importance:
Optimizers don't exist in isolation. Here's where to go next: