Workbook — Optimization Theory

Optimization Theory

Convexity, gradient methods, constraints, duality, proximal operators, convergence rates, and second-order methods — all derived by hand with instant feedback.

Prerequisites: Calculus (derivatives, partial derivatives) + Linear algebra (matrices, eigenvalues). That's it.
10
Chapters
57
Exercises
5
Exercise Types
Mastery
0 / 57 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Convexity

You're training a neural network and the loss keeps jumping around. Sometimes gradient descent converges beautifully; other times it spirals into a terrible local minimum. The difference often comes down to one property of the loss surface: convexity.

A function f is convex if the line segment between any two points on its graph lies above or on the graph. Formally:

Convexity definition:
f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y)    for all λ ∈ [0, 1]

Why it matters:
• Convex function ⇒ every local minimum is a global minimum
• Gradient descent on convex functions has guaranteed convergence
• Non-convex ⇒ local minima, saddle points, no guarantees

Second-order test (Hessian test):
f is convex ⇔ H(x) ⪰ 0 everywhere // H is positive semi-definite
For single-variable: f is convex ⇔ f''(x) ≥ 0 everywhere
The convexity guarantee. If your optimization problem is convex, gradient descent will find the global optimum. Full stop. This is why so much of optimization theory focuses on convexity — it's the dividing line between "we can solve this" and "we're hoping for the best."
Exercise 0.1: Is f(x) = x² Convex? Trace
Apply the second-derivative test to f(x) = x². We have f''(x) = 2. Since 2 > 0 for all x, the function is:
Show reasoning

f(x) = x², so f'(x) = 2x and f''(x) = 2. Since f''(x) = 2 > 0 for all x, the function is strictly convex. The unique global minimum is at x = 0. This is the simplest possible convex function — the "bowl" that gradient descent loves.

Exercise 0.2: Is f(x) = |x| Convex? Trace
f(x) = |x| is not differentiable at x = 0. Can we still determine convexity? Check the definition directly: does f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) hold?
Show reasoning

The triangle inequality gives us |λx + (1−λ)y| ≤ |λx| + |(1−λ)y| = λ|x| + (1−λ)|y|. This is exactly the convexity definition. So yes, |x| is convex despite not being differentiable at zero. Convexity does NOT require smoothness — it's a geometric property about the shape of the graph. The L1 norm penalty in LASSO is convex, which is what makes LASSO tractable.

Exercise 0.3: Is f(x) = x³ Convex? Trace
f(x) = x³. Compute f''(x) = 6x. At x = −1, f''(−1) = −6 < 0. At x = 1, f''(1) = 6 > 0. What does this tell us?
Show reasoning

For convexity we need f''(x) ≥ 0 everywhere. Since f''(x) = 6x is negative for x < 0 and positive for x > 0, the function is neither convex nor concave over all of R. It's convex on [0, ∞) and concave on (−∞, 0]. Neural network loss surfaces behave like this — convex in some directions, non-convex in others.

Exercise 0.4: Hessian of f(x,y) = x² + xy + y² Derive

Compute the Hessian matrix H for f(x,y) = x² + xy + y². Then find the smallest eigenvalue of H to determine if the function is convex.

H = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]]. Eigenvalues of [[a,b],[b,c]]: λ = ((a+c) ± √((a−c)² + 4b²)) / 2.

smallest eigenvalue of H
Show derivation
∂f/∂x = 2x + y,   ∂f/∂y = x + 2y
∂²f/∂x² = 2,   ∂²f/∂x∂y = 1,   ∂²f/∂y² = 2
H = [[2, 1], [1, 2]]
λ = (2+2 ± √((2−2)² + 4·1²)) / 2 = (4 ± 2) / 2
λ1 = 1,   λ2 = 3

Both eigenvalues are positive, so H is positive definite everywhere (it's constant). Therefore f is strictly convex. The smallest eigenvalue λmin = 1 tells us the function curves least steeply in the direction (1, −1)/√2 — that's the "hardest" direction for gradient descent.

Exercise 0.5: Convexity Checker Build

Write a function that checks whether a 2×2 matrix (given as [a, b, c, d] for [[a,b],[c,d]]) is positive semi-definite using the eigenvalue test.

Return true if the matrix is PSD (both eigenvalues ≥ 0), false otherwise.
Show solution
javascript
function isPSD(mat) {
  const [a, b, c, d] = mat;
  const trace = a + d;
  const det = a * d - b * c;
  const disc = trace * trace - 4 * det;
  if (disc < 0) return false; // complex eigenvalues
  const sq = Math.sqrt(disc);
  const l1 = (trace + sq) / 2;
  const l2 = (trace - sq) / 2;
  return l1 >= -1e-10 && l2 >= -1e-10;
}
Exercise 0.6: Find the Bug Debug

This convexity checker always says "convex." Click the buggy line.

function isConvex(f, x1, x2, lam) {
  const mid = lam * x1 + (1 - lam) * x2;
  const lhs = f(mid);
  const rhs = lam * f(x1) + (1 - lam) * f(x2);
  return lhs <= lhs;  // should be lhs <= rhs
}
Show explanation

Line 5 is the bug. It compares lhs ≤ lhs instead of lhs ≤ rhs. Any number is always ≤ itself, so this returns true no matter what. The fix: return lhs ≤ rhs;. This is a classic copy-paste typo — the kind that passes unit tests if you only test convex functions.

Chapter 1: Gradient Descent

You know a function is convex and has a minimum. How do you find it? You roll downhill. That's gradient descent: compute the slope, take a step in the opposite direction, repeat. The only decisions are how big your steps are (the learning rate α) and where you start.

Gradient descent update rule:
xk+1 = xk − α ∇f(xk)

Example: f(x) = 3x² + 2x + 1
∇f(x) = 6x + 2
Minimum at ∇f(x) = 0 ⇒ x* = −1/3
f(x*) = 3(1/9) − 2/3 + 1 = 1/3 − 2/3 + 1 = 2/3

Convergence condition:
For convex f with L-Lipschitz gradient: use α < 2/L
For f(x) = 3x² + 2x + 1: ∇²f = 6, so L = 6, need α < 1/3
Learning rate is everything. Too large and you overshoot the minimum, bouncing back and forth forever. Too small and you crawl toward it, wasting compute. The optimal α for a quadratic is 1/L where L is the largest eigenvalue of the Hessian. In deep learning, L is unknown and changes at every step — that's why learning rate schedules matter.
Exercise 1.1: First GD Step Derive

f(x) = 3x² + 2x + 1. Starting at x0 = 5 with α = 0.1. Compute x1.

∇f(x0) = 6(5) + 2 = 32. Then x1 = x0 − α ∇f(x0).

x1
Show derivation
∇f(5) = 6(5) + 2 = 32
x1 = 5 − 0.1 × 32 = 5 − 3.2 = 1.8

We jumped from x=5 to x=1.8 — a big move because the gradient was steep (32). The true minimum is at x*=−1/3 ≈ −0.333, so we're getting closer but still far.

Exercise 1.2: Five Iterations Derive

Continue the trace: compute x2 through x5 for f(x) = 3x² + 2x + 1 with α = 0.1 from x0 = 5. What is x5?

Pattern: xk+1 = xk − 0.1(6xk + 2) = xk(1 − 0.6) − 0.2 = 0.4xk − 0.2.

x5
Show derivation
x0 = 5
x1 = 0.4(5) − 0.2 = 1.8
x2 = 0.4(1.8) − 0.2 = 0.52
x3 = 0.4(0.52) − 0.2 = 0.008
x4 = 0.4(0.008) − 0.2 = −0.1968
x5 = 0.4(−0.1968) − 0.2 = −0.27872 − ... wait, let me redo.
x4 = 0.4 × 0.008 − 0.2 = 0.0032 − 0.2 = −0.1968
x5 = 0.4 × (−0.1968) − 0.2 = −0.07872 − 0.2 = −0.27872

After 5 steps we're at x ≈ −0.279, close to the true minimum x* = −0.333. The linear recurrence xk+1 = 0.4xk − 0.2 converges because |0.4| < 1 (the contraction factor is 1 − αL = 1 − 0.6 = 0.4).

Exercise 1.3: Condition Number Derive

f(x,y) = 50x² + y². The Hessian is [[100, 0], [0, 2]]. Compute the condition number κ = λmaxmin. How many GD iterations are needed to reduce the error by a factor of 10?

For quadratics, GD convergence rate = (κ−1)/(κ+1) per step. Steps for factor-10 reduction: k = ln(10) / ln(κ/(κ−1)) ≈ κ × ln(10).

κ (condition number)
Show derivation
λmax = 100,   λmin = 2
κ = 100 / 2 = 50
Convergence rate per step = (50 − 1)/(50 + 1) = 49/51 ≈ 0.961
Steps for 10× reduction ≈ ln(10) / ln(51/49) ≈ 2.303 / 0.0392 ≈ 59 steps

A condition number of 50 means the function is 50 times steeper in one direction than another. GD zigzags along the narrow valley, making ~59 steps for each order of magnitude of improvement. This is why preconditioning (rescaling coordinates) and Adam (per-parameter learning rates) help — they effectively reduce κ.

Exercise 1.4: Divergence Check Trace
For f(x) = 3x² + 2x + 1 (L = 6), what happens if you set α = 0.5? Compute x1 from x0 = 1: x1 = 1 − 0.5(8) = −3. Then x2 = −3 − 0.5(−16) = 5. Then x3 = 5 − 0.5(32) = −11...
Show reasoning

The contraction factor is |1 − αL| = |1 − 0.5 × 6| = |1 − 3| = 2 > 1. Since the contraction factor exceeds 1, each step amplifies the error. The iterates 1, −3, 5, −11, ... grow exponentially. The maximum stable learning rate is α < 2/L = 1/3 ≈ 0.333. This is why α = 0.5 diverges.

Exercise 1.5: Implement gradDescent() Build

Write gradient descent for 1D. Given f', starting x, learning rate, and number of steps, return the final x.

Apply x = x - alpha * gradf(x) for 'steps' iterations.
Show solution
javascript
function gradDescent(gradf, x0, alpha, steps) {
  let x = x0;
  for (let i = 0; i < steps; i++) {
    x = x - alpha * gradf(x);
  }
  return x;
}
Exercise 1.6: Optimal Learning Rate Derive

For f(x) = 10x², what is the optimal fixed learning rate α*? (Hint: the optimal rate for a quadratic ax² is 1/(2a), which gives convergence in a single step.)

α*
Show derivation
f(x) = 10x²,   f'(x) = 20x,   f''(x) = 20 = L
α* = 1/L = 1/20 = 0.05
Check: x1 = x0 − 0.05 × 20x0 = x0(1 − 1) = 0  

With the optimal rate, GD converges in exactly one step for a pure quadratic. This is because the Newton step for a quadratic is exact. In practice, loss functions aren't pure quadratics, so the optimal rate changes at every point — motivating adaptive methods like Adam.

Chapter 2: SGD Variants

Vanilla gradient descent moves in the steepest direction at every step, with no memory of past gradients. That causes two problems: zigzagging on ill-conditioned problems, and getting stuck in flat regions. Momentum fixes both by adding inertia — like a heavy ball rolling downhill that carries speed through flat patches and smooths out oscillations.

SGD with Momentum:
vk+1 = β vk + ∇f(xk)
xk+1 = xk − α vk+1

Nesterov Accelerated Gradient (NAG):
vk+1 = β vk + ∇f(xk − αβ vk) // gradient at lookahead position
xk+1 = xk − α vk+1

Typical hyperparameters: β = 0.9, α = 0.01
Momentum as exponential averaging. The velocity vk = ∇f(xk) + β∇f(xk−1) + β²∇f(xk−2) + ... It's an exponentially weighted average of past gradients. With β = 0.9, the effective window is ~10 past gradients (1/(1−β) = 10). Components that consistently point the same way get amplified; noisy oscillations cancel out.
Exercise 2.1: Momentum Step 1 Derive

f(x) = 3x² + 2x + 1. Starting at x0 = 5, v0 = 0, α = 0.1, β = 0.9. Compute v1 and x1.

x1
Show derivation
∇f(5) = 6(5) + 2 = 32
v1 = 0.9 × 0 + 32 = 32
x1 = 5 − 0.1 × 32 = 1.8

The first step with momentum is identical to vanilla GD because v0 = 0. The momentum effect kicks in starting from step 2, where v carries history from step 1.

Exercise 2.2: Momentum Step 2 and 3 Derive

Continue from Exercise 2.1. Compute v2, x2, v3, and x3. What is x3?

Remember: vk+1 = βvk + ∇f(xk), then xk+1 = xk − αvk+1.

x3
Show derivation
∇f(1.8) = 6(1.8) + 2 = 12.8
v2 = 0.9 × 32 + 12.8 = 28.8 + 12.8 = 41.6
x2 = 1.8 − 0.1 × 41.6 = 1.8 − 4.16 = −2.36
∇f(−2.36) = 6(−2.36) + 2 = −12.16
v3 = 0.9 × 41.6 + (−12.16) = 37.44 − 12.16 = 25.28
x3 = −2.36 − 0.1 × 25.28 = −2.36 − 2.528 = −4.888

Notice momentum overshoots — x went past the minimum at −0.333 to −2.36, then even further to −4.888. The accumulated velocity is too high. This is typical early behavior: momentum takes a few steps to stabilize. With vanilla GD, x3 would be 0.008 — closer but slower to eventually converge. Momentum trades early overshoot for faster asymptotic convergence.

Exercise 2.3: Nesterov vs Standard Momentum Trace
Nesterov momentum evaluates the gradient at xk − αβvk (the "lookahead" point) instead of xk. Why does this help convergence?
Show reasoning

Standard momentum computes the gradient at where you are. Nesterov computes the gradient at where you're about to be. If momentum is carrying you past the minimum, the lookahead gradient points backward — so Nesterov applies the brakes before overshooting, not after. This gives O(1/k²) convergence on convex problems vs O(1/k) for standard momentum. That's provably optimal for first-order methods (Nesterov, 1983).

Exercise 2.4: Effective Window Derive

With momentum β = 0.99, the velocity is an exponential average of past gradients. Approximately how many past gradients contribute significantly? (Use the rule: effective window ≈ 1/(1−β).)

past gradients
Show derivation
Effective window = 1 / (1 − 0.99) = 1 / 0.01 = 100

With β = 0.99, the velocity remembers roughly 100 past gradients. This is great for smoothing out minibatch noise (SGD on large datasets) but terrible for rapidly changing landscapes. Most practitioners use β = 0.9 (window ~10) as a balance. Adam uses β1 = 0.9 for momentum and β2 = 0.999 (window ~1000) for the second moment — the second moment changes more slowly than the mean.

Exercise 2.5: Arrange Momentum Update Design

Put the momentum SGD steps in the correct order for one iteration.

?
?
?
?
Compute ∇f(xk) v = βv + ∇f x = x − αv Evaluate f(xk)
Show explanation

Correct order: Evaluate fCompute gradientUpdate velocityUpdate parameters. You must evaluate the function before computing its gradient (backprop requires the forward pass). Then velocity accumulates the gradient with history, and finally the parameter update uses the velocity.

Exercise 2.6: Find the Bug Debug

This momentum implementation diverges instead of converging. Click the buggy line.

function momentumSGD(gradf, x, alpha, beta, steps) {
  let v = 0;
  for (let i = 0; i < steps; i++) {
    const g = gradf(x);
    v = beta * v + g;
    x = x + alpha * v;  // should be x - alpha * v
  }
  return x;
}
Show explanation

Line 6 is the bug. It uses x + alpha * v instead of x - alpha * v. Gradient descent moves opposite to the gradient (downhill). Adding the gradient moves uphill — maximizing the function instead of minimizing it. The sign error is the most common optimization bug in practice.

Chapter 3: Constrained Optimization

Most real problems have constraints. You want to minimize a loss, but your weights must satisfy a norm budget. You want to allocate resources, but they must sum to a fixed amount. You can't just set the gradient to zero — you need to incorporate the constraint into the optimization. The key tool is the Lagrangian.

Constrained problem:
minimize f(x)    subject to g(x) = 0

Lagrangian:
L(x, λ) = f(x) + λ g(x)

Optimality conditions:
xL = 0    // gradient w.r.t. x vanishes
λL = 0    // constraint is satisfied: g(x) = 0

Interpretation: λ = sensitivity of optimum to constraint
If the constraint is "budget = B", then df*/dB = −λ
λ as a shadow price. The Lagrange multiplier λ tells you how much the optimal value would improve if you relaxed the constraint by one unit. If λ = 3 for a budget constraint, then one extra dollar of budget improves your objective by 3. In ML, λ in weight decay tells you the trade-off between fitting the data and keeping weights small.
Exercise 3.1: Minimize x²+y² s.t. x+y=1 Derive

Set up the Lagrangian L = x² + y² + λ(x + y − 1). Solve ∂L/∂x = 0, ∂L/∂y = 0, and x + y = 1 for x*, y*, λ*. What is x*?

x*
Show derivation
L = x² + y² + λ(x + y − 1)
∂L/∂x = 2x + λ = 0 ⇒ x = −λ/2
∂L/∂y = 2y + λ = 0 ⇒ y = −λ/2
Constraint: x + y = 1 ⇒ −λ/2 − λ/2 = 1 ⇒ −λ = 1 ⇒ λ = −1
x* = −(−1)/2 = 1/2,   y* = 1/2
f(x*, y*) = 1/4 + 1/4 = 1/2

The closest point on the line x+y=1 to the origin is (1/2, 1/2). By symmetry, x=y was expected. The Lagrange multiplier λ = −1 means: if we changed the constraint to x+y=1.01, the optimal value would decrease by about 0.01 (from 0.5 to 0.4999...).

Exercise 3.2: Lagrange Multiplier Value Derive

From Exercise 3.1, what is λ*?

λ*
Show derivation

From the derivation above: λ* = −1. This tells us that relaxing the constraint from x+y=1 to x+y=1+ε would decrease the objective by approximately ε. Tightening it to x+y=1−ε would increase the objective by ε.

Exercise 3.3: Maximize Entropy Derive

Maximize H(p) = −p1ln(p1) − p2ln(p2) subject to p1 + p2 = 1. What is the maximum entropy value?

L = −p1ln(p1) − p2ln(p2) + λ(p1 + p2 − 1). Set ∂L/∂pi = 0.

H* (nats)
Show derivation
∂L/∂p1 = −ln(p1) − 1 + λ = 0 ⇒ p1 = eλ−1
∂L/∂p2 = −ln(p2) − 1 + λ = 0 ⇒ p2 = eλ−1
p1 = p2 (both equal) ⇒ p1 + p2 = 1 ⇒ p1 = p2 = 1/2
H* = −(1/2)ln(1/2) − (1/2)ln(1/2) = ln(2) ≈ 0.693 nats

The maximum entropy distribution is the uniform distribution — maximum uncertainty. This result generalizes: for n outcomes, max entropy is ln(n) achieved by the uniform distribution. This is the principle behind cross-entropy loss — it's the gap between your model's distribution and the data distribution.

Exercise 3.4: Constrained Optimization Solver Build

Solve: minimize ax² + by² subject to x + y = c. Return [x*, y*] using the Lagrangian method analytically.

Solve for lambda first, then compute x and y.
Show solution
javascript
function solveConstrained(a, b, c) {
  // -lambda/(2a) - lambda/(2b) = c
  // -lambda * (1/(2a) + 1/(2b)) = c
  // -lambda * (a+b)/(2ab) = c
  const lam = -c * 2 * a * b / (a + b);
  const x = -lam / (2 * a);
  const y = -lam / (2 * b);
  return [x, y];
}
Exercise 3.5: Sensitivity Interpretation Trace
You solve min x²+y² s.t. x+y=1 and get λ*=−1 with optimal value f*=0.5. If the constraint changes to x+y=1.1, approximately what is the new optimal value?
Show reasoning

With constraint x+y=1.1, the solution is x*=y*=0.55 and f*=2(0.55)² = 0.605. The linear approximation using λ gives f* ≈ 0.5 + |λ|×0.1 = 0.6, which is close. The extra 0.005 is the second-order correction. The sensitivity interpretation: λ gives the first-order rate of change of f* with respect to the constraint's right-hand side.

Chapter 4: KKT Conditions

Lagrange multipliers handle equality constraints (g(x) = 0). But what about inequality constraints (g(x) ≤ 0)? The Karush-Kuhn-Tucker (KKT) conditions generalize the Lagrangian method to handle both. They add one crucial insight: an inequality constraint is either active (binding, equality holds) or inactive (slack, strict inequality).

Problem: minimize f(x)    s.t. gi(x) ≤ 0,   hj(x) = 0

KKT conditions (necessary for optimality):
1. Stationarity: ∇f + ∑μi∇gi + ∑λj∇hj = 0
2. Primal feasibility: gi(x) ≤ 0, hj(x) = 0
3. Dual feasibility: μi ≥ 0
4. Complementary slackness: μi gi(x) = 0 for all i

// Complementary slackness: either the constraint is active (gi=0)
// or its multiplier is zero (μi=0). Never both non-zero.
Complementary slackness is the key insight. If a constraint is inactive (the solution is in the interior, gi < 0), then the constraint doesn't affect the solution, so μi = 0. If μi > 0, the constraint is pushing back — it must be active (gi = 0). Think of it as: you only feel the wall if you're pressed against it.
Exercise 4.1: Minimize x² s.t. x ≥ 2 Derive

Rewrite as: min f(x)=x² s.t. g(x)=2−x ≤ 0. Apply KKT. What is x*?

Stationarity: 2x − μ = 0. Complementary slackness: μ(2−x) = 0.

x*
Show derivation

The unconstrained minimizer of x² is x=0, but 0 < 2 violates g(x) ≤ 0. So the constraint must be active.

Active constraint: g(x*) = 0 ⇒ 2 − x* = 0 ⇒ x* = 2
Stationarity: 2x* − μ = 0 ⇒ μ = 2(2) = 4
Dual feasibility: μ = 4 ≥ 0 ✓
Complementary slackness: μ × g(x*) = 4 × 0 = 0 ✓

All KKT conditions satisfied. The constraint pushes x away from the unconstrained minimum (x=0) to x=2. The multiplier μ=4 measures this push. In ML: this is like weight clipping pushing your weights to the constraint boundary.

Exercise 4.2: KKT Multiplier Derive

From Exercise 4.1, what is μ*?

μ*
Show derivation
From stationarity: 2x* − μ = 0 ⇒ μ = 2(2) = 4

μ* = 4 > 0, confirming the constraint is active. If we changed the constraint to x ≥ 1.9, the optimal value would improve (decrease) by approximately μ × 0.1 = 0.4. Actual change: 2² − 1.9² = 4 − 3.61 = 0.39 ≈ 0.4.

Exercise 4.3: Inactive Constraint Trace
Minimize f(x) = (x−5)² subject to x ≥ 2. The unconstrained minimum is x=5, which satisfies x ≥ 2. What are x* and μ*?
Show reasoning

The unconstrained minimum x=5 satisfies x ≥ 2 (since 5 > 2), so the constraint is inactive (slack). By complementary slackness, μ=0. The constraint doesn't affect the solution at all — removing it wouldn't change x*. This is the "you don't feel the wall" case.

Exercise 4.4: When Are KKT Sufficient? Trace
KKT conditions are necessary for optimality. When are they also sufficient (guaranteeing the point IS optimal)?
Show reasoning

For convex problems (convex objective, convex feasible set), KKT conditions are both necessary AND sufficient. This is because convex problems have no spurious local minima — any point satisfying the first-order conditions must be globally optimal. This is the foundation of convex optimization: if you can prove your problem is convex and find a KKT point, you're done. SVMs, linear regression with L2 regularization, and logistic regression are all convex.

Exercise 4.5: Two Constraints Derive

Minimize f(x) = x² subject to x ≥ −3 AND x ≤ 1. Rewrite as g1(x) = −3−x ≤ 0 and g2(x) = x−1 ≤ 0. What is x*? Which constraint(s) are active?

x*
Show derivation
Unconstrained minimum: x* = 0
Check g1(0) = −3 − 0 = −3 ≤ 0 ✓
Check g2(0) = 0 − 1 = −1 ≤ 0 ✓
Both constraints satisfied, so x* = 0, μ1 = μ2 = 0

Both constraints are inactive. The unconstrained minimum lies safely inside the feasible region [−3, 1]. Neither constraint affects the solution. This is common in practice — many constraints in an optimization problem may be inactive at the solution.

Chapter 5: Duality

Every constrained optimization problem has a twin — the dual problem. The original is called the primal. The dual maximizes a lower bound on the primal's optimal value. When they agree (strong duality), solving the easier one solves both.

Primal: p* = minx f(x)   s.t. gi(x) ≤ 0

Lagrange dual function:
g(μ) = minx L(x, μ) = minx [f(x) + ∑μigi(x)]

Dual problem: d* = maxμ≥0 g(μ)

Weak duality: d* ≤ p*   (always holds)
Strong duality: d* = p*   (holds for convex + Slater's condition)
Duality gap: p* − d* ≥ 0
Why duality matters in ML. SVMs are solved via the dual (which reveals the kernel trick). Reinforcement learning's LP formulation has a dual that yields the Bellman equation. Variational inference minimizes the KL divergence, which is the dual of the evidence lower bound (ELBO). Every time you see "lower bound" in ML, duality is lurking nearby.
Exercise 5.1: Dual of a Simple LP Derive

Primal: minimize cTx subject to Ax ≥ b, x ≥ 0. For a 1D case: minimize 3x s.t. 2x ≥ 4, x ≥ 0. Find the primal optimum p*.

The constraint 2x ≥ 4 gives x ≥ 2. Combined with x ≥ 0, the feasible set is [2, ∞). The objective 3x is minimized at the boundary.

p* (optimal primal value)
Show derivation
Feasible: x ≥ 2 (from 2x ≥ 4) and x ≥ 0
Minimize 3x over x ≥ 2: x* = 2, p* = 3(2) = 6
Exercise 5.2: Dual Value Derive

For the LP in 5.1 (min 3x, s.t. 2x ≥ 4, x ≥ 0), write the dual: max 4μ s.t. 2μ ≤ 3, μ ≥ 0. Find d*.

Dual constraint: 2μ ≤ 3 ⇒ μ ≤ 1.5. Maximize 4μ over μ ∈ [0, 1.5].

d* (optimal dual value)
Show derivation
μ* = 1.5,   d* = 4(1.5) = 6
Duality gap = p* − d* = 6 − 6 = 0

Strong duality holds — the gap is zero. This always happens for LPs (linear programs). The dual variable μ* = 1.5 is the shadow price: increasing the right-hand side of 2x ≥ 4 to 2x ≥ 5 would increase the optimal cost by μ × 1 = 1.5 (from 6 to 7.5).

Exercise 5.3: Duality Gap Trace
For a non-convex problem, d* = 3 and p* = 5. What is the duality gap, and what does it mean?
Show reasoning

Duality gap = p* − d* = 5 − 3 = 2 > 0. The dual provides a certified lower bound on the primal optimum. We know the true optimal value is between 3 and 5. For non-convex problems, the gap can be large. For convex problems satisfying Slater's condition (a strictly feasible point exists), the gap is zero. This is why convexity is so valuable — it means the dual bound is tight.

Exercise 5.4: Strong Duality Conditions Trace
Which of these problems is guaranteed to have strong duality (zero gap)?
Show reasoning

Option 1: x²+y² is convex, x+y ≤ 5 is a linear (hence convex) constraint, and Slater's condition holds (e.g., x=0, y=0 is strictly feasible since 0 < 5). Strong duality guaranteed. Options 2 and 3 have non-convex objectives (sin is non-convex, x³ is non-convex over all R though it IS convex on [1,∞)) — no guarantee of strong duality.

Exercise 5.5: Dual Computation Derive

Compute the Lagrange dual function for: min x² s.t. x ≥ 2 (rewritten: g(x) = 2−x ≤ 0).

L(x,μ) = x² + μ(2−x). Minimize over x: ∂L/∂x = 2x − μ = 0 ⇒ x = μ/2. Substitute back. What is g(μ)?

d* = max g(μ) for μ ≥ 0
Show derivation
L(x,μ) = x² + μ(2 − x)
∂L/∂x = 2x − μ = 0 ⇒ x = μ/2
g(μ) = (μ/2)² + μ(2 − μ/2) = μ²/4 + 2μ − μ²/2 = −μ²/4 + 2μ
Maximize: g'(μ) = −μ/2 + 2 = 0 ⇒ μ* = 4
d* = g(4) = −16/4 + 8 = −4 + 8 = 4
Check: p* = (2)² = 4.   Duality gap = 4 − 4 = 0 ✓

Strong duality holds because the problem is convex. The dual computation independently confirms p* = 4 and μ* = 4, matching our KKT solution from Chapter 4.

Chapter 6: Proximal Operators

Some functions are non-smooth — they have kinks where the derivative doesn't exist. The absolute value |x| at zero, the L1 norm, the ReLU at zero. Gradient descent can't handle these directly because there's no unique gradient at the kink. Proximal operators solve this by wrapping the non-smooth function in a smooth envelope.

Proximal operator:
proxηf(x) = argminz { f(z) + ||z − x||² / (2η) }

// "Find z that balances minimizing f and staying close to x"

Soft thresholding (prox of L1 norm):
proxηλ|·|(x) = sign(x) · max(|x| − ηλ, 0)

Proximal gradient descent:
xk+1 = proxηg(xk − η∇f(xk)) // for minimizing f(x) + g(x)
// f is smooth (GD on it), g is non-smooth (prox it)
Soft thresholding = L1 regularization in action. When you train with L1 penalty, the proximal step pushes small weights to exactly zero. If |w| − λ < 0, the weight becomes zero. This is why L1 produces sparse models while L2 only shrinks weights toward zero. The prox operator makes this precise: it's a thresholding operation, not a gradient step.
Exercise 6.1: Soft Threshold at x=3, λ=1 Derive

Compute prox(x) = sign(x) · max(|x| − λ, 0) for x = 3, λ = 1.

prox(3)
Show derivation
sign(3) = +1
max(|3| − 1, 0) = max(2, 0) = 2
prox(3) = (+1) × 2 = 2

The value 3 gets shrunk by λ=1 toward zero, becoming 2. It doesn't become zero because |3| > λ. Only values with |x| ≤ λ get zeroed out.

Exercise 6.2: Soft Threshold at x=0.5, λ=1 Derive

Compute prox(x) for x = 0.5, λ = 1.

prox(0.5)
Show derivation
sign(0.5) = +1
max(|0.5| − 1, 0) = max(−0.5, 0) = 0
prox(0.5) = (+1) × 0 = 0

Since |0.5| < λ = 1, the value gets thresholded to exactly zero. This is the sparsity-inducing property of L1. Any weight smaller than the threshold gets killed. This is exactly what happens in LASSO regression.

Exercise 6.3: Negative Input Derive

Compute prox(x) for x = −2.5, λ = 0.5.

prox(−2.5)
Show derivation
sign(−2.5) = −1
max(|−2.5| − 0.5, 0) = max(2, 0) = 2
prox(−2.5) = (−1) × 2 = −2

The sign is preserved, and the magnitude is shrunk by λ = 0.5. The soft thresholding operator is symmetric: it shrinks positive and negative values equally toward zero.

Exercise 6.4: Implement softThreshold() Build

Implement the soft thresholding operator for a vector (array of numbers).

Apply the soft thresholding formula to each element independently.
Show solution
javascript
function softThreshold(x, lambda) {
  return x.map(xi => {
    const s = Math.sign(xi);
    const m = Math.max(Math.abs(xi) - lambda, 0);
    return s * m;
  });
}
Exercise 6.5: Prox of L2 Squared Trace
The proximal operator of f(z) = (λ/2)||z||² is prox(x) = x/(1+ηλ). For x=6, η=1, λ=2, what is prox(x)?
Show reasoning

prox(6) = 6/(1 + 1×2) = 6/3 = 2. L2 regularization shrinks values toward zero by a multiplicative factor but never reaches zero. L1 regularization thresholds values to zero if they're small enough. This is the fundamental difference: L1 produces sparsity, L2 produces small weights. In neural networks, weight decay (L2) keeps all weights small; L1 pruning kills unimportant weights entirely.

Exercise 6.6: Find the Bug Debug

This soft thresholding function doesn't produce sparsity — values near zero never become zero. Click the buggy line.

function softThresh(x, lam) {
  const result = [];
  for (let i = 0; i < x.length; i++) {
    const s = Math.sign(x[i]);
    const m = Math.abs(x[i]) - lam;  // missing max(..., 0)
    result.push(s * m);
  }
  return result;
}
Show explanation

Line 5 is the bug. It computes |x[i]| − lam without the max(..., 0) clamp. When |x[i]| < lam, this produces a negative value, and multiplied by sign gives a flipped-sign result instead of zero. For example, x=0.3, lam=1: should give 0, but gives sign(0.3)×(0.3−1) = −0.7. The fix: const m = Math.max(Math.abs(x[i]) - lam, 0);

Chapter 7: Convergence Rates

Not all optimizers converge at the same speed. The convergence rate tells you how many iterations you need to get within ε of the optimum. This is the theoretical yardstick for comparing gradient descent, momentum, Nesterov, and Newton's method.

Convergence rates for f(xk) − f(x*) ≤ ε:

GD on convex: O(1/k)   // sublinear — k ∼ 1/ε
GD on strongly convex: O((1−1/κ)k)   // linear — k ∼ κ ln(1/ε)
Nesterov on convex: O(1/k²)   // accelerated — k ∼ 1/√ε
Nesterov on strongly convex: O((1−1/√κ)k)   // k ∼ √κ ln(1/ε)
Newton's method: O(c2^k)   // quadratic — k ∼ log log(1/ε)

// κ = condition number = L/μ (ratio of curvature extremes)
// "Linear" convergence = each step removes a constant fraction of error
Why these rates matter in practice. To reach ε = 10−6 on a convex problem: GD needs ~106 steps, Nesterov needs ~103 steps, and Newton needs ~20 steps. But each Newton step costs O(n³) for the Hessian inverse vs O(n) for a gradient step. The winner depends on problem dimension n: Newton wins for n < 1000, gradient methods win for n > 10,000 (like neural networks with millions of parameters).
Exercise 7.1: GD Iterations for ε=10−6 Derive

GD on a convex function converges as f(xk) − f* ≤ C/k where C is a problem constant. How many iterations k to reach ε = 10−6 if C = 1?

iterations
Show derivation
C/k ≤ ε ⇒ k ≥ C/ε = 1/10−6 = 106
k = 1,000,000 iterations

One million iterations — painfully slow. This is the O(1/k) sublinear rate. Doubling accuracy requires doubling iterations. This motivates using Nesterov acceleration or exploiting strong convexity for faster convergence.

Exercise 7.2: Nesterov Iterations for ε=10−6 Derive

Nesterov accelerated gradient on a convex function: f(xk) − f* ≤ C/k². How many iterations for ε = 10−6 with C = 1?

iterations
Show derivation
C/k² ≤ ε ⇒ k² ≥ C/ε = 106 ⇒ k ≥ 103 = 1000

1,000 vs 1,000,000 — that's a 1000x speedup just from the Nesterov trick. The O(1/k²) rate is provably optimal for first-order methods on convex functions (Nesterov, 1983). You cannot do better with only gradient information — this is the information-theoretic lower bound.

Exercise 7.3: Strongly Convex GD Derive

For strongly convex f with κ = 100 (condition number), GD converges as (1 − 1/κ)k ≤ ε. How many iterations for ε = 10−6?

Take log: k × ln(1 − 1/κ) ≤ ln(ε). Use ln(1 − x) ≈ −x for small x.

iterations
Show derivation
k ≥ ln(1/ε) / ln(1/(1 − 1/κ)) ≈ κ × ln(1/ε)
k ≈ 100 × ln(106) = 100 × 6 × ln(10) = 100 × 13.816 ≈ 1382

Strong convexity gives linear convergence — each step removes a constant fraction (1/κ) of the remaining error. This is exponentially faster than the O(1/k) rate. The cost is proportional to κ: ill-conditioned problems (large κ) converge slowly. Most deep learning loss surfaces are extremely ill-conditioned (κ > 106), which is why raw GD is impractical.

Exercise 7.4: Compare All Rates Trace
Rank these from slowest to fastest convergence for a strongly convex problem with κ=100, targeting ε=10−6:
A) GD: ~1382 iterations
B) Nesterov (strongly convex): ~√κ × ln(1/ε) = 10 × 13.8 ≈ 138
C) Newton: ~log2(log2(1/ε)) ≈ log2(20) ≈ 4-5 steps
Show reasoning

GD takes ~1382 steps, Nesterov takes ~138 steps (10x fewer), Newton takes ~5 steps (276x fewer). But Newton's method requires computing and inverting the Hessian at each step — O(n³) work per step where n is the problem dimension. For a neural network with n = 108 parameters, one Newton step would require 1024 operations. That's why deep learning uses first-order methods despite their slower convergence rate — the per-step cost dominates.

Exercise 7.5: Practical Implications Derive

You're training a model with 10M parameters. Each GD step takes 0.1 seconds. Each Newton step takes n² × 10−12 seconds (simplified). How many seconds for 1000 GD steps vs 5 Newton steps?

GD total time (seconds)
Show derivation
GD: 1000 steps × 0.1 sec = 100 seconds
Newton: 5 steps × (107)² × 10−12 = 5 × 1014 × 10−12 = 5 × 100 = 500 seconds

GD wins! 100 vs 500 seconds. Newton takes fewer steps but each step is enormously expensive. At 10M parameters, Newton is already 5x slower despite needing 200x fewer iterations. At 1B parameters, Newton would take 5 × 106 seconds per step — completely infeasible. This is the fundamental reason deep learning uses first-order methods.

Chapter 8: Second-Order Methods

Gradient descent uses the first derivative (slope) to decide where to go. Newton's method uses the second derivative (curvature) to decide how far to go. If the surface curves sharply, take a small step. If it curves gently, take a big step. This curvature information is encoded in the Hessian matrix.

Newton's method:
xk+1 = xk − [H(xk)]−1 ∇f(xk)

1D case: xk+1 = xk − f'(xk) / f''(xk)

Multi-dimensional:
H = [∂²f/∂xi∂xj]   // n×n matrix of second derivatives
Cost per step: O(n²) to form H, O(n³) to invert it

For quadratic f(x) = xTAx + bTx + c:
Newton converges in exactly 1 step (the update is exact)
Newton's method is perfect for quadratics. The second-order Taylor approximation of a quadratic IS the quadratic. So Newton solves quadratics in one step. Near any minimum of a smooth function, the landscape is approximately quadratic — this is why Newton has quadratic convergence rate: each step squares the error. The catch: forming the n×n Hessian for a neural network with 109 parameters would require 1018 entries — a billion GB.
Exercise 8.1: Newton Step for f(x,y) = x² + 10y² Derive

Compute one Newton step from (x0, y0) = (10, 10). The Hessian is [[2, 0], [0, 20]], the gradient is [2x, 20y]. What is the new point (x1, y1)?

Newton: [x,y]new = [x,y]old − H−1∇f. H−1 = [[1/2, 0], [0, 1/20]].

x1 (x-coordinate after Newton step)
Show derivation
∇f(10,10) = [2(10), 20(10)] = [20, 200]
H−1 = [[1/2, 0], [0, 1/20]]
H−1∇f = [1/2 × 20, 1/20 × 200] = [10, 10]
(x1, y1) = (10, 10) − (10, 10) = (0, 0)

One step to the exact minimum. Since f is quadratic, Newton solves it perfectly. Compare to GD with optimal α = 1/L = 1/20: after one step, x1 = 10 − 20/20 = 9 (y would reach 0, but x barely moved). GD would need many iterations because of the 10:1 condition number. Newton doesn't care about conditioning — it normalizes each direction by its curvature.

Exercise 8.2: GD vs Newton on Same Problem Derive

For f(x,y) = x² + 10y² starting at (10, 10), one GD step with optimal α = 1/20 (1/Lmax). What is x1 after GD?

x1 (GD)
Show derivation
∇f(10,10) = [20, 200]
x1 = 10 − (1/20) × 20 = 10 − 1 = 9
y1 = 10 − (1/20) × 200 = 10 − 10 = 0

After one GD step: (9, 0). The y-direction converges fast (curvature 20, step size matches) but x-direction crawls (curvature 2, but step size is tuned for curvature 20). This is the zigzag problem caused by κ = 20/2 = 10. Newton would need 1 step; GD needs ~10 × ln(10) ≈ 23 steps for the same accuracy.

Exercise 8.3: Hessian Size Derive

A transformer with 7B parameters. How many entries in its full Hessian matrix? Express in scientific notation.

×1019 entries
Show derivation
H is n × n where n = 7 × 109
Entries = n² = (7 × 109)² = 49 × 1018 = 4.9 × 1019
At 4 bytes each (float32): 4.9 × 1019 × 4 = ~200 exabytes

200 exabytes — that's ~200 million terabytes. The world's total data storage is ~120 exabytes. You literally cannot store the Hessian of a 7B model. This is why exact Newton's method is impossible for deep learning, and why approximations (Adam's per-parameter second moments, K-FAC's block-diagonal approximation, L-BFGS's low-rank updates) exist.

Exercise 8.4: Why Not Newton for Deep Learning? Trace
Which is the primary reason Newton's method isn't used for training neural networks?
Show reasoning

All options are real concerns, but the computational cost is the dealbreaker. Non-convexity and saddle points can be handled with modifications (regularized Newton, trust regions). Stochastic Hessians exist. But there's no getting around O(n²) storage and O(n³) per-step cost. Even approximate methods like L-BFGS (which stores a low-rank Hessian approximation) struggle beyond ~1M parameters because the inner products still scale with n. Adam is the practical compromise: it maintains per-parameter second moment estimates (diagonal Hessian approximation) at O(n) cost.

Exercise 8.5: 1D Newton Step Build

Implement one step of Newton's method for 1D: x_new = x - f'(x)/f''(x).

One line: x - df(x)/ddf(x). But what if ddf(x) = 0?
Show solution
javascript
function newtonStep(df, ddf, x) {
  const h = ddf(x);
  if (Math.abs(h) < 1e-12) return x; // avoid div by zero
  return x - df(x) / h;
}
Exercise 8.6: Arrange Newton's Method Pipeline Design

Put the steps of one Newton iteration in the correct order for the multi-dimensional case.

?
?
?
?
?
Compute ∇f(xk) Compute Hessian H(xk) Solve H · Δx = −∇f xk+1 = xk + Δx Check convergence ||∇f|| < ε
Show explanation

Correct order: Compute gradientCompute HessianSolve linear systemUpdate xCheck convergence. The key insight: you solve HΔx = −∇f instead of computing H−1 explicitly. Solving a linear system is O(n³) but inverting a matrix is also O(n³) with a worse constant. In practice, Cholesky decomposition (for PD Hessians) or conjugate gradient (for approximate solves) is used.

Chapter 9: Capstone — Loss Landscape Analysis

Everything comes together here. A neural network's loss surface is a high-dimensional function of its parameters — non-convex, with saddle points, flat regions, and sharp valleys. Understanding this landscape through the lens of optimization theory explains why certain training recipes work and others fail.

Loss landscape properties:
Hessian eigenvalues determine local curvature
• Positive eigenvalues = convex direction (bowl)
• Negative eigenvalues = concave direction (saddle)
• Near-zero eigenvalues = flat direction (plateau)

Sharp vs flat minima:
Sharp minimum: large Hessian eigenvalues, narrow basin
Flat minimum: small Hessian eigenvalues, wide basin
Flat minima generalize better (Hochreiter & Schmidhuber, 1997)

Learning rate warmup:
Start small (explore, avoid sharp minima) → increase (fast progress) → decay (converge)
Why SGD generalizes better than Adam (sometimes). SGD with large learning rates can't fit into sharp minima — the noise from minibatches kicks it out. It naturally settles into wide, flat minima that generalize. Adam adapts the learning rate per-parameter, which can fit into sharp minima that overfit. This is the sharpness-aware minimization (SAM) insight: deliberately seeking flat minima improves generalization.
Exercise 9.1: Classify the Critical Point Trace
At a critical point (∇f = 0), the Hessian eigenvalues are {5, −2, 0.1}. What kind of critical point is this?
Show reasoning

The eigenvalue −2 means the surface curves downward in one direction — you can decrease f by moving in that direction. This makes it a saddle point, not a minimum. In high-dimensional spaces, saddle points vastly outnumber local minima. For a random critical point in n dimensions, the probability that ALL eigenvalues are positive (true minimum) is roughly (1/2)n. For n = 106 parameters, saddle points are overwhelmingly more common than minima.

Exercise 9.2: Sharpness Measure Derive

The "sharpness" of a minimum is often measured by the largest Hessian eigenvalue λmax. If minimum A has λmax = 100 and minimum B has λmax = 0.5, which generalizes better and why? Express the ratio of max learning rates that work (using α < 2/λmax).

αmax(B) / αmax(A)
Show derivation
αmax(A) = 2/100 = 0.02
αmax(B) = 2/0.5 = 4
Ratio = 4 / 0.02 = 200

Minimum B is 200x flatter and can tolerate 200x larger learning rates. Flat minima generalize better because small perturbations to the weights (which happen between train and test distributions) cause small changes in loss. In sharp minima, tiny weight perturbations can dramatically increase the loss — the model has memorized the training data at a fragile parameter configuration.

Exercise 9.3: Learning Rate Warmup Trace
Learning rate warmup starts with a very small α, linearly increases to the target α over some number of steps, then decays. Why is the initial small α important?
Show reasoning

At random initialization, the gradient magnitudes are large and unpredictable. The effective Lipschitz constant L is very high early in training. A large learning rate violates α < 2/L and causes the loss to spike or diverge. Warmup keeps α small until the gradients stabilize, then ramps up to the target rate. Empirically, transformers are especially sensitive to this — without warmup, training often diverges in the first few hundred steps.

Exercise 9.4: Design a Training Schedule Design

Arrange these training phases in the correct order for a transformer training run.

?
?
?
?
?
LR warmup (small → peak) Cosine LR decay Random weight init Steady training at peak LR Final LR = ~0.1×peak
Show explanation

Correct order: Random initLR warmupSteady at peak LRCosine decayFinal low LR. Initialize randomly, warm up to avoid early divergence, train at peak rate for most of the budget (this is where most learning happens), then decay the LR to settle into a flat minimum. The final LR is typically 10% of peak. This is the standard "warmup + cosine decay" schedule used by GPT-3, LLaMA, and most modern LLMs.

Exercise 9.5: Optimizer Selection Trace
You're training a 1B parameter transformer. Which optimizer and why?
Show reasoning

AdamW is the standard choice for transformer training. Newton is infeasible (O(n²) memory). L-BFGS requires full-batch gradients and scales poorly beyond ~1M parameters. SGD works but requires extensive learning rate tuning and converges slower on transformers. AdamW provides per-parameter learning rate adaptation (diagonal Hessian approximation) at O(n) memory cost, with decoupled weight decay for regularization. The cost: 3x the memory of the model parameters (weights + m + v buffers).

Exercise 9.6: Adam Memory Overhead Derive

Adam stores first moment (m) and second moment (v) for each parameter, both in FP32. For a 7B parameter model with FP16 weights, how much total GPU memory for weights + optimizer states?

Weights: 7B × 2 bytes (FP16). Gradients: 7B × 2 bytes (FP16). Adam m: 7B × 4 bytes (FP32). Adam v: 7B × 4 bytes (FP32).

GB total
Show derivation
Weights = 7 × 109 × 2 = 14 GB
Gradients = 7 × 109 × 2 = 14 GB
Adam m = 7 × 109 × 4 = 28 GB
Adam v = 7 × 109 × 4 = 28 GB
Total = 14 + 14 + 28 + 28 = 84 GB

84 GB — more than one A100 (80 GB). The optimizer states alone (56 GB) are 4x the model weights (14 GB). This is why model parallelism and ZeRO-style optimizer state sharding are essential for training large models. The optimizer is the memory bottleneck, not the model.

Exercise 9.7: Full Pipeline — Find the Bug Debug

This training loop uses AdamW but the loss doesn't decrease. Click the buggy line.

function trainStep(params, grads, m, v, t, lr, beta1, beta2, wd) {
  t += 1;
  for (let i = 0; i < params.length; i++) {
    m[i] = beta1 * m[i] + (1 - beta1) * grads[i];
    v[i] = beta2 * v[i] + (1 - beta2) * grads[i] * grads[i];
    const mHat = m[i] / (1 - Math.pow(beta1, t));
    const vHat = v[i] / (1 - Math.pow(beta2, t));
    params[i] -= lr * (mHat / (Math.sqrt(vHat) + 1e-8) - wd * params[i]);
  }
}
Show explanation

Line 8 has the sign wrong for weight decay. In AdamW (decoupled weight decay), the weight decay term should be added to the update, not subtracted: params[i] -= lr * (mHat / (Math.sqrt(vHat) + 1e-8)) + lr * wd * params[i]. The current code subtracts wd*params[i], which means it's growing the weights instead of shrinking them. Equivalently, write it as: params[i] = params[i] * (1 - lr * wd) - lr * mHat / (Math.sqrt(vHat) + 1e-8);