Convexity, gradient methods, constraints, duality, proximal operators, convergence rates, and second-order methods — all derived by hand with instant feedback.
You're training a neural network and the loss keeps jumping around. Sometimes gradient descent converges beautifully; other times it spirals into a terrible local minimum. The difference often comes down to one property of the loss surface: convexity.
A function f is convex if the line segment between any two points on its graph lies above or on the graph. Formally:
f(x) = x², so f'(x) = 2x and f''(x) = 2. Since f''(x) = 2 > 0 for all x, the function is strictly convex. The unique global minimum is at x = 0. This is the simplest possible convex function — the "bowl" that gradient descent loves.
The triangle inequality gives us |λx + (1−λ)y| ≤ |λx| + |(1−λ)y| = λ|x| + (1−λ)|y|. This is exactly the convexity definition. So yes, |x| is convex despite not being differentiable at zero. Convexity does NOT require smoothness — it's a geometric property about the shape of the graph. The L1 norm penalty in LASSO is convex, which is what makes LASSO tractable.
For convexity we need f''(x) ≥ 0 everywhere. Since f''(x) = 6x is negative for x < 0 and positive for x > 0, the function is neither convex nor concave over all of R. It's convex on [0, ∞) and concave on (−∞, 0]. Neural network loss surfaces behave like this — convex in some directions, non-convex in others.
Compute the Hessian matrix H for f(x,y) = x² + xy + y². Then find the smallest eigenvalue of H to determine if the function is convex.
H = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]]. Eigenvalues of [[a,b],[b,c]]: λ = ((a+c) ± √((a−c)² + 4b²)) / 2.
Both eigenvalues are positive, so H is positive definite everywhere (it's constant). Therefore f is strictly convex. The smallest eigenvalue λmin = 1 tells us the function curves least steeply in the direction (1, −1)/√2 — that's the "hardest" direction for gradient descent.
Write a function that checks whether a 2×2 matrix (given as [a, b, c, d] for [[a,b],[c,d]]) is positive semi-definite using the eigenvalue test.
javascript function isPSD(mat) { const [a, b, c, d] = mat; const trace = a + d; const det = a * d - b * c; const disc = trace * trace - 4 * det; if (disc < 0) return false; // complex eigenvalues const sq = Math.sqrt(disc); const l1 = (trace + sq) / 2; const l2 = (trace - sq) / 2; return l1 >= -1e-10 && l2 >= -1e-10; }
This convexity checker always says "convex." Click the buggy line.
function isConvex(f, x1, x2, lam) { const mid = lam * x1 + (1 - lam) * x2; const lhs = f(mid); const rhs = lam * f(x1) + (1 - lam) * f(x2); return lhs <= lhs; // should be lhs <= rhs }
Line 5 is the bug. It compares lhs ≤ lhs instead of lhs ≤ rhs. Any number is always ≤ itself, so this returns true no matter what. The fix: return lhs ≤ rhs;. This is a classic copy-paste typo — the kind that passes unit tests if you only test convex functions.
You know a function is convex and has a minimum. How do you find it? You roll downhill. That's gradient descent: compute the slope, take a step in the opposite direction, repeat. The only decisions are how big your steps are (the learning rate α) and where you start.
f(x) = 3x² + 2x + 1. Starting at x0 = 5 with α = 0.1. Compute x1.
∇f(x0) = 6(5) + 2 = 32. Then x1 = x0 − α ∇f(x0).
We jumped from x=5 to x=1.8 — a big move because the gradient was steep (32). The true minimum is at x*=−1/3 ≈ −0.333, so we're getting closer but still far.
Continue the trace: compute x2 through x5 for f(x) = 3x² + 2x + 1 with α = 0.1 from x0 = 5. What is x5?
Pattern: xk+1 = xk − 0.1(6xk + 2) = xk(1 − 0.6) − 0.2 = 0.4xk − 0.2.
After 5 steps we're at x ≈ −0.279, close to the true minimum x* = −0.333. The linear recurrence xk+1 = 0.4xk − 0.2 converges because |0.4| < 1 (the contraction factor is 1 − αL = 1 − 0.6 = 0.4).
f(x,y) = 50x² + y². The Hessian is [[100, 0], [0, 2]]. Compute the condition number κ = λmax/λmin. How many GD iterations are needed to reduce the error by a factor of 10?
For quadratics, GD convergence rate = (κ−1)/(κ+1) per step. Steps for factor-10 reduction: k = ln(10) / ln(κ/(κ−1)) ≈ κ × ln(10).
A condition number of 50 means the function is 50 times steeper in one direction than another. GD zigzags along the narrow valley, making ~59 steps for each order of magnitude of improvement. This is why preconditioning (rescaling coordinates) and Adam (per-parameter learning rates) help — they effectively reduce κ.
The contraction factor is |1 − αL| = |1 − 0.5 × 6| = |1 − 3| = 2 > 1. Since the contraction factor exceeds 1, each step amplifies the error. The iterates 1, −3, 5, −11, ... grow exponentially. The maximum stable learning rate is α < 2/L = 1/3 ≈ 0.333. This is why α = 0.5 diverges.
Write gradient descent for 1D. Given f', starting x, learning rate, and number of steps, return the final x.
javascript function gradDescent(gradf, x0, alpha, steps) { let x = x0; for (let i = 0; i < steps; i++) { x = x - alpha * gradf(x); } return x; }
For f(x) = 10x², what is the optimal fixed learning rate α*? (Hint: the optimal rate for a quadratic ax² is 1/(2a), which gives convergence in a single step.)
With the optimal rate, GD converges in exactly one step for a pure quadratic. This is because the Newton step for a quadratic is exact. In practice, loss functions aren't pure quadratics, so the optimal rate changes at every point — motivating adaptive methods like Adam.
Vanilla gradient descent moves in the steepest direction at every step, with no memory of past gradients. That causes two problems: zigzagging on ill-conditioned problems, and getting stuck in flat regions. Momentum fixes both by adding inertia — like a heavy ball rolling downhill that carries speed through flat patches and smooths out oscillations.
f(x) = 3x² + 2x + 1. Starting at x0 = 5, v0 = 0, α = 0.1, β = 0.9. Compute v1 and x1.
The first step with momentum is identical to vanilla GD because v0 = 0. The momentum effect kicks in starting from step 2, where v carries history from step 1.
Continue from Exercise 2.1. Compute v2, x2, v3, and x3. What is x3?
Remember: vk+1 = βvk + ∇f(xk), then xk+1 = xk − αvk+1.
Notice momentum overshoots — x went past the minimum at −0.333 to −2.36, then even further to −4.888. The accumulated velocity is too high. This is typical early behavior: momentum takes a few steps to stabilize. With vanilla GD, x3 would be 0.008 — closer but slower to eventually converge. Momentum trades early overshoot for faster asymptotic convergence.
Standard momentum computes the gradient at where you are. Nesterov computes the gradient at where you're about to be. If momentum is carrying you past the minimum, the lookahead gradient points backward — so Nesterov applies the brakes before overshooting, not after. This gives O(1/k²) convergence on convex problems vs O(1/k) for standard momentum. That's provably optimal for first-order methods (Nesterov, 1983).
With momentum β = 0.99, the velocity is an exponential average of past gradients. Approximately how many past gradients contribute significantly? (Use the rule: effective window ≈ 1/(1−β).)
With β = 0.99, the velocity remembers roughly 100 past gradients. This is great for smoothing out minibatch noise (SGD on large datasets) but terrible for rapidly changing landscapes. Most practitioners use β = 0.9 (window ~10) as a balance. Adam uses β1 = 0.9 for momentum and β2 = 0.999 (window ~1000) for the second moment — the second moment changes more slowly than the mean.
Put the momentum SGD steps in the correct order for one iteration.
Correct order: Evaluate f → Compute gradient → Update velocity → Update parameters. You must evaluate the function before computing its gradient (backprop requires the forward pass). Then velocity accumulates the gradient with history, and finally the parameter update uses the velocity.
This momentum implementation diverges instead of converging. Click the buggy line.
function momentumSGD(gradf, x, alpha, beta, steps) { let v = 0; for (let i = 0; i < steps; i++) { const g = gradf(x); v = beta * v + g; x = x + alpha * v; // should be x - alpha * v } return x; }
Line 6 is the bug. It uses x + alpha * v instead of x - alpha * v. Gradient descent moves opposite to the gradient (downhill). Adding the gradient moves uphill — maximizing the function instead of minimizing it. The sign error is the most common optimization bug in practice.
Most real problems have constraints. You want to minimize a loss, but your weights must satisfy a norm budget. You want to allocate resources, but they must sum to a fixed amount. You can't just set the gradient to zero — you need to incorporate the constraint into the optimization. The key tool is the Lagrangian.
Set up the Lagrangian L = x² + y² + λ(x + y − 1). Solve ∂L/∂x = 0, ∂L/∂y = 0, and x + y = 1 for x*, y*, λ*. What is x*?
The closest point on the line x+y=1 to the origin is (1/2, 1/2). By symmetry, x=y was expected. The Lagrange multiplier λ = −1 means: if we changed the constraint to x+y=1.01, the optimal value would decrease by about 0.01 (from 0.5 to 0.4999...).
From Exercise 3.1, what is λ*?
From the derivation above: λ* = −1. This tells us that relaxing the constraint from x+y=1 to x+y=1+ε would decrease the objective by approximately ε. Tightening it to x+y=1−ε would increase the objective by ε.
Maximize H(p) = −p1ln(p1) − p2ln(p2) subject to p1 + p2 = 1. What is the maximum entropy value?
L = −p1ln(p1) − p2ln(p2) + λ(p1 + p2 − 1). Set ∂L/∂pi = 0.
The maximum entropy distribution is the uniform distribution — maximum uncertainty. This result generalizes: for n outcomes, max entropy is ln(n) achieved by the uniform distribution. This is the principle behind cross-entropy loss — it's the gap between your model's distribution and the data distribution.
Solve: minimize ax² + by² subject to x + y = c. Return [x*, y*] using the Lagrangian method analytically.
javascript function solveConstrained(a, b, c) { // -lambda/(2a) - lambda/(2b) = c // -lambda * (1/(2a) + 1/(2b)) = c // -lambda * (a+b)/(2ab) = c const lam = -c * 2 * a * b / (a + b); const x = -lam / (2 * a); const y = -lam / (2 * b); return [x, y]; }
With constraint x+y=1.1, the solution is x*=y*=0.55 and f*=2(0.55)² = 0.605. The linear approximation using λ gives f* ≈ 0.5 + |λ|×0.1 = 0.6, which is close. The extra 0.005 is the second-order correction. The sensitivity interpretation: λ gives the first-order rate of change of f* with respect to the constraint's right-hand side.
Lagrange multipliers handle equality constraints (g(x) = 0). But what about inequality constraints (g(x) ≤ 0)? The Karush-Kuhn-Tucker (KKT) conditions generalize the Lagrangian method to handle both. They add one crucial insight: an inequality constraint is either active (binding, equality holds) or inactive (slack, strict inequality).
Rewrite as: min f(x)=x² s.t. g(x)=2−x ≤ 0. Apply KKT. What is x*?
Stationarity: 2x − μ = 0. Complementary slackness: μ(2−x) = 0.
The unconstrained minimizer of x² is x=0, but 0 < 2 violates g(x) ≤ 0. So the constraint must be active.
All KKT conditions satisfied. The constraint pushes x away from the unconstrained minimum (x=0) to x=2. The multiplier μ=4 measures this push. In ML: this is like weight clipping pushing your weights to the constraint boundary.
From Exercise 4.1, what is μ*?
μ* = 4 > 0, confirming the constraint is active. If we changed the constraint to x ≥ 1.9, the optimal value would improve (decrease) by approximately μ × 0.1 = 0.4. Actual change: 2² − 1.9² = 4 − 3.61 = 0.39 ≈ 0.4.
The unconstrained minimum x=5 satisfies x ≥ 2 (since 5 > 2), so the constraint is inactive (slack). By complementary slackness, μ=0. The constraint doesn't affect the solution at all — removing it wouldn't change x*. This is the "you don't feel the wall" case.
For convex problems (convex objective, convex feasible set), KKT conditions are both necessary AND sufficient. This is because convex problems have no spurious local minima — any point satisfying the first-order conditions must be globally optimal. This is the foundation of convex optimization: if you can prove your problem is convex and find a KKT point, you're done. SVMs, linear regression with L2 regularization, and logistic regression are all convex.
Minimize f(x) = x² subject to x ≥ −3 AND x ≤ 1. Rewrite as g1(x) = −3−x ≤ 0 and g2(x) = x−1 ≤ 0. What is x*? Which constraint(s) are active?
Both constraints are inactive. The unconstrained minimum lies safely inside the feasible region [−3, 1]. Neither constraint affects the solution. This is common in practice — many constraints in an optimization problem may be inactive at the solution.
Every constrained optimization problem has a twin — the dual problem. The original is called the primal. The dual maximizes a lower bound on the primal's optimal value. When they agree (strong duality), solving the easier one solves both.
Primal: minimize cTx subject to Ax ≥ b, x ≥ 0. For a 1D case: minimize 3x s.t. 2x ≥ 4, x ≥ 0. Find the primal optimum p*.
The constraint 2x ≥ 4 gives x ≥ 2. Combined with x ≥ 0, the feasible set is [2, ∞). The objective 3x is minimized at the boundary.
For the LP in 5.1 (min 3x, s.t. 2x ≥ 4, x ≥ 0), write the dual: max 4μ s.t. 2μ ≤ 3, μ ≥ 0. Find d*.
Dual constraint: 2μ ≤ 3 ⇒ μ ≤ 1.5. Maximize 4μ over μ ∈ [0, 1.5].
Strong duality holds — the gap is zero. This always happens for LPs (linear programs). The dual variable μ* = 1.5 is the shadow price: increasing the right-hand side of 2x ≥ 4 to 2x ≥ 5 would increase the optimal cost by μ × 1 = 1.5 (from 6 to 7.5).
Duality gap = p* − d* = 5 − 3 = 2 > 0. The dual provides a certified lower bound on the primal optimum. We know the true optimal value is between 3 and 5. For non-convex problems, the gap can be large. For convex problems satisfying Slater's condition (a strictly feasible point exists), the gap is zero. This is why convexity is so valuable — it means the dual bound is tight.
Option 1: x²+y² is convex, x+y ≤ 5 is a linear (hence convex) constraint, and Slater's condition holds (e.g., x=0, y=0 is strictly feasible since 0 < 5). Strong duality guaranteed. Options 2 and 3 have non-convex objectives (sin is non-convex, x³ is non-convex over all R though it IS convex on [1,∞)) — no guarantee of strong duality.
Compute the Lagrange dual function for: min x² s.t. x ≥ 2 (rewritten: g(x) = 2−x ≤ 0).
L(x,μ) = x² + μ(2−x). Minimize over x: ∂L/∂x = 2x − μ = 0 ⇒ x = μ/2. Substitute back. What is g(μ)?
Strong duality holds because the problem is convex. The dual computation independently confirms p* = 4 and μ* = 4, matching our KKT solution from Chapter 4.
Some functions are non-smooth — they have kinks where the derivative doesn't exist. The absolute value |x| at zero, the L1 norm, the ReLU at zero. Gradient descent can't handle these directly because there's no unique gradient at the kink. Proximal operators solve this by wrapping the non-smooth function in a smooth envelope.
Compute prox(x) = sign(x) · max(|x| − λ, 0) for x = 3, λ = 1.
The value 3 gets shrunk by λ=1 toward zero, becoming 2. It doesn't become zero because |3| > λ. Only values with |x| ≤ λ get zeroed out.
Compute prox(x) for x = 0.5, λ = 1.
Since |0.5| < λ = 1, the value gets thresholded to exactly zero. This is the sparsity-inducing property of L1. Any weight smaller than the threshold gets killed. This is exactly what happens in LASSO regression.
Compute prox(x) for x = −2.5, λ = 0.5.
The sign is preserved, and the magnitude is shrunk by λ = 0.5. The soft thresholding operator is symmetric: it shrinks positive and negative values equally toward zero.
Implement the soft thresholding operator for a vector (array of numbers).
javascript function softThreshold(x, lambda) { return x.map(xi => { const s = Math.sign(xi); const m = Math.max(Math.abs(xi) - lambda, 0); return s * m; }); }
prox(6) = 6/(1 + 1×2) = 6/3 = 2. L2 regularization shrinks values toward zero by a multiplicative factor but never reaches zero. L1 regularization thresholds values to zero if they're small enough. This is the fundamental difference: L1 produces sparsity, L2 produces small weights. In neural networks, weight decay (L2) keeps all weights small; L1 pruning kills unimportant weights entirely.
This soft thresholding function doesn't produce sparsity — values near zero never become zero. Click the buggy line.
function softThresh(x, lam) { const result = []; for (let i = 0; i < x.length; i++) { const s = Math.sign(x[i]); const m = Math.abs(x[i]) - lam; // missing max(..., 0) result.push(s * m); } return result; }
Line 5 is the bug. It computes |x[i]| − lam without the max(..., 0) clamp. When |x[i]| < lam, this produces a negative value, and multiplied by sign gives a flipped-sign result instead of zero. For example, x=0.3, lam=1: should give 0, but gives sign(0.3)×(0.3−1) = −0.7. The fix: const m = Math.max(Math.abs(x[i]) - lam, 0);
Not all optimizers converge at the same speed. The convergence rate tells you how many iterations you need to get within ε of the optimum. This is the theoretical yardstick for comparing gradient descent, momentum, Nesterov, and Newton's method.
GD on a convex function converges as f(xk) − f* ≤ C/k where C is a problem constant. How many iterations k to reach ε = 10−6 if C = 1?
One million iterations — painfully slow. This is the O(1/k) sublinear rate. Doubling accuracy requires doubling iterations. This motivates using Nesterov acceleration or exploiting strong convexity for faster convergence.
Nesterov accelerated gradient on a convex function: f(xk) − f* ≤ C/k². How many iterations for ε = 10−6 with C = 1?
1,000 vs 1,000,000 — that's a 1000x speedup just from the Nesterov trick. The O(1/k²) rate is provably optimal for first-order methods on convex functions (Nesterov, 1983). You cannot do better with only gradient information — this is the information-theoretic lower bound.
For strongly convex f with κ = 100 (condition number), GD converges as (1 − 1/κ)k ≤ ε. How many iterations for ε = 10−6?
Take log: k × ln(1 − 1/κ) ≤ ln(ε). Use ln(1 − x) ≈ −x for small x.
Strong convexity gives linear convergence — each step removes a constant fraction (1/κ) of the remaining error. This is exponentially faster than the O(1/k) rate. The cost is proportional to κ: ill-conditioned problems (large κ) converge slowly. Most deep learning loss surfaces are extremely ill-conditioned (κ > 106), which is why raw GD is impractical.
GD takes ~1382 steps, Nesterov takes ~138 steps (10x fewer), Newton takes ~5 steps (276x fewer). But Newton's method requires computing and inverting the Hessian at each step — O(n³) work per step where n is the problem dimension. For a neural network with n = 108 parameters, one Newton step would require 1024 operations. That's why deep learning uses first-order methods despite their slower convergence rate — the per-step cost dominates.
You're training a model with 10M parameters. Each GD step takes 0.1 seconds. Each Newton step takes n² × 10−12 seconds (simplified). How many seconds for 1000 GD steps vs 5 Newton steps?
GD wins! 100 vs 500 seconds. Newton takes fewer steps but each step is enormously expensive. At 10M parameters, Newton is already 5x slower despite needing 200x fewer iterations. At 1B parameters, Newton would take 5 × 106 seconds per step — completely infeasible. This is the fundamental reason deep learning uses first-order methods.
Gradient descent uses the first derivative (slope) to decide where to go. Newton's method uses the second derivative (curvature) to decide how far to go. If the surface curves sharply, take a small step. If it curves gently, take a big step. This curvature information is encoded in the Hessian matrix.
Compute one Newton step from (x0, y0) = (10, 10). The Hessian is [[2, 0], [0, 20]], the gradient is [2x, 20y]. What is the new point (x1, y1)?
Newton: [x,y]new = [x,y]old − H−1∇f. H−1 = [[1/2, 0], [0, 1/20]].
One step to the exact minimum. Since f is quadratic, Newton solves it perfectly. Compare to GD with optimal α = 1/L = 1/20: after one step, x1 = 10 − 20/20 = 9 (y would reach 0, but x barely moved). GD would need many iterations because of the 10:1 condition number. Newton doesn't care about conditioning — it normalizes each direction by its curvature.
For f(x,y) = x² + 10y² starting at (10, 10), one GD step with optimal α = 1/20 (1/Lmax). What is x1 after GD?
After one GD step: (9, 0). The y-direction converges fast (curvature 20, step size matches) but x-direction crawls (curvature 2, but step size is tuned for curvature 20). This is the zigzag problem caused by κ = 20/2 = 10. Newton would need 1 step; GD needs ~10 × ln(10) ≈ 23 steps for the same accuracy.
A transformer with 7B parameters. How many entries in its full Hessian matrix? Express in scientific notation.
200 exabytes — that's ~200 million terabytes. The world's total data storage is ~120 exabytes. You literally cannot store the Hessian of a 7B model. This is why exact Newton's method is impossible for deep learning, and why approximations (Adam's per-parameter second moments, K-FAC's block-diagonal approximation, L-BFGS's low-rank updates) exist.
All options are real concerns, but the computational cost is the dealbreaker. Non-convexity and saddle points can be handled with modifications (regularized Newton, trust regions). Stochastic Hessians exist. But there's no getting around O(n²) storage and O(n³) per-step cost. Even approximate methods like L-BFGS (which stores a low-rank Hessian approximation) struggle beyond ~1M parameters because the inner products still scale with n. Adam is the practical compromise: it maintains per-parameter second moment estimates (diagonal Hessian approximation) at O(n) cost.
Implement one step of Newton's method for 1D: x_new = x - f'(x)/f''(x).
javascript function newtonStep(df, ddf, x) { const h = ddf(x); if (Math.abs(h) < 1e-12) return x; // avoid div by zero return x - df(x) / h; }
Put the steps of one Newton iteration in the correct order for the multi-dimensional case.
Correct order: Compute gradient → Compute Hessian → Solve linear system → Update x → Check convergence. The key insight: you solve HΔx = −∇f instead of computing H−1 explicitly. Solving a linear system is O(n³) but inverting a matrix is also O(n³) with a worse constant. In practice, Cholesky decomposition (for PD Hessians) or conjugate gradient (for approximate solves) is used.
Everything comes together here. A neural network's loss surface is a high-dimensional function of its parameters — non-convex, with saddle points, flat regions, and sharp valleys. Understanding this landscape through the lens of optimization theory explains why certain training recipes work and others fail.
The eigenvalue −2 means the surface curves downward in one direction — you can decrease f by moving in that direction. This makes it a saddle point, not a minimum. In high-dimensional spaces, saddle points vastly outnumber local minima. For a random critical point in n dimensions, the probability that ALL eigenvalues are positive (true minimum) is roughly (1/2)n. For n = 106 parameters, saddle points are overwhelmingly more common than minima.
The "sharpness" of a minimum is often measured by the largest Hessian eigenvalue λmax. If minimum A has λmax = 100 and minimum B has λmax = 0.5, which generalizes better and why? Express the ratio of max learning rates that work (using α < 2/λmax).
Minimum B is 200x flatter and can tolerate 200x larger learning rates. Flat minima generalize better because small perturbations to the weights (which happen between train and test distributions) cause small changes in loss. In sharp minima, tiny weight perturbations can dramatically increase the loss — the model has memorized the training data at a fragile parameter configuration.
At random initialization, the gradient magnitudes are large and unpredictable. The effective Lipschitz constant L is very high early in training. A large learning rate violates α < 2/L and causes the loss to spike or diverge. Warmup keeps α small until the gradients stabilize, then ramps up to the target rate. Empirically, transformers are especially sensitive to this — without warmup, training often diverges in the first few hundred steps.
Arrange these training phases in the correct order for a transformer training run.
Correct order: Random init → LR warmup → Steady at peak LR → Cosine decay → Final low LR. Initialize randomly, warm up to avoid early divergence, train at peak rate for most of the budget (this is where most learning happens), then decay the LR to settle into a flat minimum. The final LR is typically 10% of peak. This is the standard "warmup + cosine decay" schedule used by GPT-3, LLaMA, and most modern LLMs.
AdamW is the standard choice for transformer training. Newton is infeasible (O(n²) memory). L-BFGS requires full-batch gradients and scales poorly beyond ~1M parameters. SGD works but requires extensive learning rate tuning and converges slower on transformers. AdamW provides per-parameter learning rate adaptation (diagonal Hessian approximation) at O(n) memory cost, with decoupled weight decay for regularization. The cost: 3x the memory of the model parameters (weights + m + v buffers).
Adam stores first moment (m) and second moment (v) for each parameter, both in FP32. For a 7B parameter model with FP16 weights, how much total GPU memory for weights + optimizer states?
Weights: 7B × 2 bytes (FP16). Gradients: 7B × 2 bytes (FP16). Adam m: 7B × 4 bytes (FP32). Adam v: 7B × 4 bytes (FP32).
84 GB — more than one A100 (80 GB). The optimizer states alone (56 GB) are 4x the model weights (14 GB). This is why model parallelism and ZeRO-style optimizer state sharding are essential for training large models. The optimizer is the memory bottleneck, not the model.
This training loop uses AdamW but the loss doesn't decrease. Click the buggy line.
function trainStep(params, grads, m, v, t, lr, beta1, beta2, wd) { t += 1; for (let i = 0; i < params.length; i++) { m[i] = beta1 * m[i] + (1 - beta1) * grads[i]; v[i] = beta2 * v[i] + (1 - beta2) * grads[i] * grads[i]; const mHat = m[i] / (1 - Math.pow(beta1, t)); const vHat = v[i] / (1 - Math.pow(beta2, t)); params[i] -= lr * (mHat / (Math.sqrt(vHat) + 1e-8) - wd * params[i]); } }
Line 8 has the sign wrong for weight decay. In AdamW (decoupled weight decay), the weight decay term should be added to the update, not subtracted: params[i] -= lr * (mHat / (Math.sqrt(vHat) + 1e-8)) + lr * wd * params[i]. The current code subtracts wd*params[i], which means it's growing the weights instead of shrinking them. Equivalently, write it as: params[i] = params[i] * (1 - lr * wd) - lr * mHat / (Math.sqrt(vHat) + 1e-8);