Optimization Theory Workbook

Chapter 0: Convexity

You're training a neural network and the loss keeps jumping around. Sometimes gradient descent converges beautifully; other times it spirals into a terrible local minimum. The difference often comes down to one property of the loss surface: convexity.

A function f is convex if the line segment between any two points on its graph lies above or on the graph. Formally:

Convexity definition:
f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) for all λ ∈ [0, 1]

Why it matters:
• Convex function ⇒ every local minimum is a global minimum
• Gradient descent on convex functions has guaranteed convergence
• Non-convex ⇒ local minima, saddle points, no guarantees

Second-order test (Hessian test):
f is convex ⇔ H(x) ⪰ 0 everywhere // H is positive semi-definite
For single-variable: f is convex ⇔ f''(x) ≥ 0 everywhere

The convexity guarantee. If your optimization problem is convex, gradient descent will find the global optimum. Full stop. This is why so much of optimization theory focuses on convexity — it's the dividing line between "we can solve this" and "we're hoping for the best."

Exercise 0.1: Is f(x) = x² Convex? Trace

Apply the second-derivative test to f(x) = x². We have f''(x) = 2. Since 2 > 0 for all x, the function is:

Convex — f''(x) = 2 > 0 everywhere, so every local minimum is global Concave — the parabola opens upward Neither — x² has an inflection point at x = 0

Show reasoning

f(x) = x², so f'(x) = 2x and f''(x) = 2. Since f''(x) = 2 > 0 for all x, the function is strictly convex. The unique global minimum is at x = 0. This is the simplest possible convex function — the "bowl" that gradient descent loves.

Exercise 0.2: Is f(x) = |x| Convex? Trace

f(x) = |x| is not differentiable at x = 0. Can we still determine convexity? Check the definition directly: does f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) hold?

Not convex — the kink at x = 0 breaks convexity Convex — by triangle inequality |λx + (1−λ)y| ≤ λ|x| + (1−λ)|y| Convex only for x > 0

Show reasoning

The triangle inequality gives us |λx + (1−λ)y| ≤ |λx| + |(1−λ)y| = λ|x| + (1−λ)|y|. This is exactly the convexity definition. So yes, |x| is convex despite not being differentiable at zero. Convexity does NOT require smoothness — it's a geometric property about the shape of the graph. The L1 norm penalty in LASSO is convex, which is what makes LASSO tractable.

Exercise 0.3: Is f(x) = x³ Convex? Trace

f(x) = x³. Compute f''(x) = 6x. At x = −1, f''(−1) = −6 < 0. At x = 1, f''(1) = 6 > 0. What does this tell us?

Convex — the second derivative is positive somewhere Concave — the second derivative is negative somewhere Neither convex nor concave — f'' changes sign, so convexity fails globally

Show reasoning

For convexity we need f''(x) ≥ 0 everywhere. Since f''(x) = 6x is negative for x < 0 and positive for x > 0, the function is neither convex nor concave over all of R. It's convex on [0, ∞) and concave on (−∞, 0]. Neural network loss surfaces behave like this — convex in some directions, non-convex in others.

Exercise 0.4: Hessian of f(x,y) = x² + xy + y² Derive

Compute the Hessian matrix H for f(x,y) = x² + xy + y². Then find the smallest eigenvalue of H to determine if the function is convex.

H = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]]. Eigenvalues of [[a,b],[b,c]]: λ = ((a+c) ± √((a−c)² + 4b²)) / 2.

smallest eigenvalue of H

Show derivation

∂f/∂x = 2x + y, ∂f/∂y = x + 2y

∂²f/∂x² = 2, ∂²f/∂x∂y = 1, ∂²f/∂y² = 2

H = [[2, 1], [1, 2]]

λ = (2+2 ± √((2−2)² + 4·1²)) / 2 = (4 ± 2) / 2

λ₁ = 1, λ₂ = 3

Both eigenvalues are positive, so H is positive definite everywhere (it's constant). Therefore f is strictly convex. The smallest eigenvalue λ_min = 1 tells us the function curves least steeply in the direction (1, −1)/√2 — that's the "hardest" direction for gradient descent.

Exercise 0.5: Convexity Checker Build

Write a function that checks whether a 2×2 matrix (given as [a, b, c, d] for [[a,b],[c,d]]) is positive semi-definite using the eigenvalue test.

Return true if the matrix is PSD (both eigenvalues ≥ 0), false otherwise.

Show solution

javascript
function isPSD(mat) {
  const [a, b, c, d] = mat;
  const trace = a + d;
  const det = a * d - b * c;
  const disc = trace * trace - 4 * det;
  if (disc < 0) return false; // complex eigenvalues
  const sq = Math.sqrt(disc);
  const l1 = (trace + sq) / 2;
  const l2 = (trace - sq) / 2;
  return l1 >= -1e-10 && l2 >= -1e-10;
}

Exercise 0.6: Find the Bug Debug

This convexity checker always says "convex." Click the buggy line.

function isConvex(f, x1, x2, lam) {
  const mid = lam * x1 + (1 - lam) * x2;
  const lhs = f(mid);
  const rhs = lam * f(x1) + (1 - lam) * f(x2);
  return lhs <= lhs;  // should be lhs <= rhs
}

Show explanation

Line 5 is the bug. It compares lhs ≤ lhs instead of lhs ≤ rhs. Any number is always ≤ itself, so this returns true no matter what. The fix: return lhs ≤ rhs;. This is a classic copy-paste typo — the kind that passes unit tests if you only test convex functions.

Chapter 1: Gradient Descent

You know a function is convex and has a minimum. How do you find it? You roll downhill. That's gradient descent: compute the slope, take a step in the opposite direction, repeat. The only decisions are how big your steps are (the learning rate α) and where you start.

Gradient descent update rule:
x_k+1 = x_k − α ∇f(x_k)

Example: f(x) = 3x² + 2x + 1
∇f(x) = 6x + 2
Minimum at ∇f(x) = 0 ⇒ x* = −1/3
f(x*) = 3(1/9) − 2/3 + 1 = 1/3 − 2/3 + 1 = 2/3

Convergence condition:
For convex f with L-Lipschitz gradient: use α < 2/L
For f(x) = 3x² + 2x + 1: ∇²f = 6, so L = 6, need α < 1/3

Learning rate is everything. Too large and you overshoot the minimum, bouncing back and forth forever. Too small and you crawl toward it, wasting compute. The optimal α for a quadratic is 1/L where L is the largest eigenvalue of the Hessian. In deep learning, L is unknown and changes at every step — that's why learning rate schedules matter.

Exercise 1.1: First GD Step Derive

f(x) = 3x² + 2x + 1. Starting at x₀ = 5 with α = 0.1. Compute x₁.

∇f(x₀) = 6(5) + 2 = 32. Then x₁ = x₀ − α ∇f(x₀).

x₁

Show derivation

∇f(5) = 6(5) + 2 = 32

x₁ = 5 − 0.1 × 32 = 5 − 3.2 = 1.8

We jumped from x=5 to x=1.8 — a big move because the gradient was steep (32). The true minimum is at x*=−1/3 ≈ −0.333, so we're getting closer but still far.

Exercise 1.2: Five Iterations Derive

Continue the trace: compute x₂ through x₅ for f(x) = 3x² + 2x + 1 with α = 0.1 from x₀ = 5. What is x₅?

Pattern: x_k+1 = x_k − 0.1(6x_k + 2) = x_k(1 − 0.6) − 0.2 = 0.4x_k − 0.2.

x₅

Show derivation

x₀ = 5

x₁ = 0.4(5) − 0.2 = 1.8

x₂ = 0.4(1.8) − 0.2 = 0.52

x₃ = 0.4(0.52) − 0.2 = 0.008

x₄ = 0.4(0.008) − 0.2 = −0.1968

x₅ = 0.4(−0.1968) − 0.2 = −0.27872 − ... wait, let me redo.

x₄ = 0.4 × 0.008 − 0.2 = 0.0032 − 0.2 = −0.1968

x₅ = 0.4 × (−0.1968) − 0.2 = −0.07872 − 0.2 = −0.27872

After 5 steps we're at x ≈ −0.279, close to the true minimum x* = −0.333. The linear recurrence x_k+1 = 0.4x_k − 0.2 converges because |0.4| < 1 (the contraction factor is 1 − αL = 1 − 0.6 = 0.4).

Exercise 1.3: Condition Number Derive

f(x,y) = 50x² + y². The Hessian is [[100, 0], [0, 2]]. Compute the condition number κ = λ_max/λ_min. How many GD iterations are needed to reduce the error by a factor of 10?

For quadratics, GD convergence rate = (κ−1)/(κ+1) per step. Steps for factor-10 reduction: k = ln(10) / ln(κ/(κ−1)) ≈ κ × ln(10).

κ (condition number)

Show derivation

λ_max = 100, λ_min = 2

κ = 100 / 2 = 50

Convergence rate per step = (50 − 1)/(50 + 1) = 49/51 ≈ 0.961

Steps for 10× reduction ≈ ln(10) / ln(51/49) ≈ 2.303 / 0.0392 ≈ 59 steps

A condition number of 50 means the function is 50 times steeper in one direction than another. GD zigzags along the narrow valley, making ~59 steps for each order of magnitude of improvement. This is why preconditioning (rescaling coordinates) and Adam (per-parameter learning rates) help — they effectively reduce κ.

Exercise 1.4: Divergence Check Trace

For f(x) = 3x² + 2x + 1 (L = 6), what happens if you set α = 0.5? Compute x₁ from x₀ = 1: x₁ = 1 − 0.5(8) = −3. Then x₂ = −3 − 0.5(−16) = 5. Then x₃ = 5 − 0.5(32) = −11...

Converges slowly but eventually reaches x* Diverges — |x_k| grows without bound because α > 1/3 = 2/L Oscillates around x* with constant amplitude

Show reasoning

The contraction factor is |1 − αL| = |1 − 0.5 × 6| = |1 − 3| = 2 > 1. Since the contraction factor exceeds 1, each step amplifies the error. The iterates 1, −3, 5, −11, ... grow exponentially. The maximum stable learning rate is α < 2/L = 1/3 ≈ 0.333. This is why α = 0.5 diverges.

Exercise 1.5: Implement gradDescent() Build

Write gradient descent for 1D. Given f', starting x, learning rate, and number of steps, return the final x.

Apply x = x - alpha * gradf(x) for 'steps' iterations.

Show solution

javascript
function gradDescent(gradf, x0, alpha, steps) {
  let x = x0;
  for (let i = 0; i < steps; i++) {
    x = x - alpha * gradf(x);
  }
  return x;
}

Exercise 1.6: Optimal Learning Rate Derive

For f(x) = 10x², what is the optimal fixed learning rate α*? (Hint: the optimal rate for a quadratic ax² is 1/(2a), which gives convergence in a single step.)

α*

Show derivation

f(x) = 10x², f'(x) = 20x, f''(x) = 20 = L

α* = 1/L = 1/20 = 0.05

Check: x₁ = x₀ − 0.05 × 20x₀ = x₀(1 − 1) = 0

With the optimal rate, GD converges in exactly one step for a pure quadratic. This is because the Newton step for a quadratic is exact. In practice, loss functions aren't pure quadratics, so the optimal rate changes at every point — motivating adaptive methods like Adam.

Chapter 2: SGD Variants

Vanilla gradient descent moves in the steepest direction at every step, with no memory of past gradients. That causes two problems: zigzagging on ill-conditioned problems, and getting stuck in flat regions. Momentum fixes both by adding inertia — like a heavy ball rolling downhill that carries speed through flat patches and smooths out oscillations.

SGD with Momentum:
v_k+1 = β v_k + ∇f(x_k)
x_k+1 = x_k − α v_k+1

Nesterov Accelerated Gradient (NAG):
v_k+1 = β v_k + ∇f(x_k − αβ v_k) // gradient at lookahead position
x_k+1 = x_k − α v_k+1

Typical hyperparameters: β = 0.9, α = 0.01

Momentum as exponential averaging. The velocity v_k = ∇f(x_k) + β∇f(x_k−1) + β²∇f(x_k−2) + ... It's an exponentially weighted average of past gradients. With β = 0.9, the effective window is ~10 past gradients (1/(1−β) = 10). Components that consistently point the same way get amplified; noisy oscillations cancel out.

Exercise 2.1: Momentum Step 1 Derive

f(x) = 3x² + 2x + 1. Starting at x₀ = 5, v₀ = 0, α = 0.1, β = 0.9. Compute v₁ and x₁.

x₁

Show derivation

∇f(5) = 6(5) + 2 = 32

v₁ = 0.9 × 0 + 32 = 32

x₁ = 5 − 0.1 × 32 = 1.8

The first step with momentum is identical to vanilla GD because v₀ = 0. The momentum effect kicks in starting from step 2, where v carries history from step 1.

Exercise 2.2: Momentum Step 2 and 3 Derive

Continue from Exercise 2.1. Compute v₂, x₂, v₃, and x₃. What is x₃?

Remember: v_k+1 = βv_k + ∇f(x_k), then x_k+1 = x_k − αv_k+1.

x₃

Show derivation

∇f(1.8) = 6(1.8) + 2 = 12.8

v₂ = 0.9 × 32 + 12.8 = 28.8 + 12.8 = 41.6

x₂ = 1.8 − 0.1 × 41.6 = 1.8 − 4.16 = −2.36

∇f(−2.36) = 6(−2.36) + 2 = −12.16

v₃ = 0.9 × 41.6 + (−12.16) = 37.44 − 12.16 = 25.28

x₃ = −2.36 − 0.1 × 25.28 = −2.36 − 2.528 = −4.888

Notice momentum overshoots — x went past the minimum at −0.333 to −2.36, then even further to −4.888. The accumulated velocity is too high. This is typical early behavior: momentum takes a few steps to stabilize. With vanilla GD, x₃ would be 0.008 — closer but slower to eventually converge. Momentum trades early overshoot for faster asymptotic convergence.

Exercise 2.3: Nesterov vs Standard Momentum Trace

Nesterov momentum evaluates the gradient at x_k − αβv_k (the "lookahead" point) instead of x_k. Why does this help convergence?

It uses a larger effective learning rate It adds random noise for better exploration It evaluates the gradient where momentum is about to take us, so it can correct before overshooting It reduces the memory requirement by half

Show reasoning

Standard momentum computes the gradient at where you are. Nesterov computes the gradient at where you're about to be. If momentum is carrying you past the minimum, the lookahead gradient points backward — so Nesterov applies the brakes before overshooting, not after. This gives O(1/k²) convergence on convex problems vs O(1/k) for standard momentum. That's provably optimal for first-order methods (Nesterov, 1983).

Exercise 2.4: Effective Window Derive

With momentum β = 0.99, the velocity is an exponential average of past gradients. Approximately how many past gradients contribute significantly? (Use the rule: effective window ≈ 1/(1−β).)

past gradients

Show derivation

Effective window = 1 / (1 − 0.99) = 1 / 0.01 = 100

With β = 0.99, the velocity remembers roughly 100 past gradients. This is great for smoothing out minibatch noise (SGD on large datasets) but terrible for rapidly changing landscapes. Most practitioners use β = 0.9 (window ~10) as a balance. Adam uses β₁ = 0.9 for momentum and β₂ = 0.999 (window ~1000) for the second moment — the second moment changes more slowly than the mean.

Exercise 2.5: Arrange Momentum Update Design

Put the momentum SGD steps in the correct order for one iteration.

→

Compute ∇f(x_k) v = βv + ∇f x = x − αv Evaluate f(x_k)

Show explanation

Correct order: Evaluate f → Compute gradient → Update velocity → Update parameters. You must evaluate the function before computing its gradient (backprop requires the forward pass). Then velocity accumulates the gradient with history, and finally the parameter update uses the velocity.

Exercise 2.6: Find the Bug Debug

This momentum implementation diverges instead of converging. Click the buggy line.

function momentumSGD(gradf, x, alpha, beta, steps) {
  let v = 0;
  for (let i = 0; i < steps; i++) {
    const g = gradf(x);
    v = beta * v + g;
    x = x + alpha * v;  // should be x - alpha * v
  }
  return x;
}

Show explanation

Line 6 is the bug. It uses x + alpha * v instead of x - alpha * v. Gradient descent moves opposite to the gradient (downhill). Adding the gradient moves uphill — maximizing the function instead of minimizing it. The sign error is the most common optimization bug in practice.

Chapter 3: Constrained Optimization

Most real problems have constraints. You want to minimize a loss, but your weights must satisfy a norm budget. You want to allocate resources, but they must sum to a fixed amount. You can't just set the gradient to zero — you need to incorporate the constraint into the optimization. The key tool is the Lagrangian.

Constrained problem:
minimize f(x)    subject to g(x) = 0

Lagrangian:
L(x, λ) = f(x) + λ g(x)

Optimality conditions:
∇_xL = 0    // gradient w.r.t. x vanishes
∇_λL = 0    // constraint is satisfied: g(x) = 0

Interpretation: λ = sensitivity of optimum to constraint
If the constraint is "budget = B", then df*/dB = −λ

λ as a shadow price. The Lagrange multiplier λ tells you how much the optimal value would improve if you relaxed the constraint by one unit. If λ = 3 for a budget constraint, then one extra dollar of budget improves your objective by 3. In ML, λ in weight decay tells you the trade-off between fitting the data and keeping weights small.

Exercise 3.1: Minimize x²+y² s.t. x+y=1 Derive

Set up the Lagrangian L = x² + y² + λ(x + y − 1). Solve ∂L/∂x = 0, ∂L/∂y = 0, and x + y = 1 for x*, y*, λ*. What is x*?

Show derivation

L = x² + y² + λ(x + y − 1)

∂L/∂x = 2x + λ = 0 ⇒ x = −λ/2

∂L/∂y = 2y + λ = 0 ⇒ y = −λ/2

Constraint: x + y = 1 ⇒ −λ/2 − λ/2 = 1 ⇒ −λ = 1 ⇒ λ = −1

x* = −(−1)/2 = 1/2, y* = 1/2

f(x*, y*) = 1/4 + 1/4 = 1/2

The closest point on the line x+y=1 to the origin is (1/2, 1/2). By symmetry, x=y was expected. The Lagrange multiplier λ = −1 means: if we changed the constraint to x+y=1.01, the optimal value would decrease by about 0.01 (from 0.5 to 0.4999...).

Exercise 3.2: Lagrange Multiplier Value Derive

From Exercise 3.1, what is λ*?

λ*

Show derivation

From the derivation above: λ* = −1. This tells us that relaxing the constraint from x+y=1 to x+y=1+ε would decrease the objective by approximately ε. Tightening it to x+y=1−ε would increase the objective by ε.

Exercise 3.3: Maximize Entropy Derive

Maximize H(p) = −p₁ln(p₁) − p₂ln(p₂) subject to p₁ + p₂ = 1. What is the maximum entropy value?

L = −p₁ln(p₁) − p₂ln(p₂) + λ(p₁ + p₂ − 1). Set ∂L/∂p_i = 0.

H* (nats)

Show derivation

∂L/∂p₁ = −ln(p₁) − 1 + λ = 0 ⇒ p₁ = e^λ−1

∂L/∂p₂ = −ln(p₂) − 1 + λ = 0 ⇒ p₂ = e^λ−1

p₁ = p₂ (both equal) ⇒ p₁ + p₂ = 1 ⇒ p₁ = p₂ = 1/2

H* = −(1/2)ln(1/2) − (1/2)ln(1/2) = ln(2) ≈ 0.693 nats

The maximum entropy distribution is the uniform distribution — maximum uncertainty. This result generalizes: for n outcomes, max entropy is ln(n) achieved by the uniform distribution. This is the principle behind cross-entropy loss — it's the gap between your model's distribution and the data distribution.

Exercise 3.4: Constrained Optimization Solver Build

Solve: minimize ax² + by² subject to x + y = c. Return [x*, y*] using the Lagrangian method analytically.

Solve for lambda first, then compute x and y.

Show solution

javascript
function solveConstrained(a, b, c) {
  // -lambda/(2a) - lambda/(2b) = c
  // -lambda * (1/(2a) + 1/(2b)) = c
  // -lambda * (a+b)/(2ab) = c
  const lam = -c * 2 * a * b / (a + b);
  const x = -lam / (2 * a);
  const y = -lam / (2 * b);
  return [x, y];
}

Exercise 3.5: Sensitivity Interpretation Trace

You solve min x²+y² s.t. x+y=1 and get λ*=−1 with optimal value f*=0.5. If the constraint changes to x+y=1.1, approximately what is the new optimal value?

f* ≈ 0.6 (increases by 0.1) f* ≈ 0.4 (decreases by 0.1) f* ≈ 0.605 (the exact value is (1.1)²/2 = 0.605) f* ≈ 0.55 (increases by the magnitude of λ times 0.1)

Show reasoning

With constraint x+y=1.1, the solution is x*=y*=0.55 and f*=2(0.55)² = 0.605. The linear approximation using λ gives f* ≈ 0.5 + |λ|×0.1 = 0.6, which is close. The extra 0.005 is the second-order correction. The sensitivity interpretation: λ gives the first-order rate of change of f* with respect to the constraint's right-hand side.

Chapter 4: KKT Conditions

Lagrange multipliers handle equality constraints (g(x) = 0). But what about inequality constraints (g(x) ≤ 0)? The Karush-Kuhn-Tucker (KKT) conditions generalize the Lagrangian method to handle both. They add one crucial insight: an inequality constraint is either active (binding, equality holds) or inactive (slack, strict inequality).

Problem: minimize f(x) s.t. g_i(x) ≤ 0, h_j(x) = 0

KKT conditions (necessary for optimality):
1. Stationarity: ∇f + ∑μ_i∇g_i + ∑λ_j∇h_j = 0
2. Primal feasibility: g_i(x) ≤ 0, h_j(x) = 0
3. Dual feasibility: μ_i ≥ 0
4. Complementary slackness: μ_i g_i(x) = 0 for all i

// Complementary slackness: either the constraint is active (g_i=0)
// or its multiplier is zero (μ_i=0). Never both non-zero.

Complementary slackness is the key insight. If a constraint is inactive (the solution is in the interior, g_i < 0), then the constraint doesn't affect the solution, so μ_i = 0. If μ_i > 0, the constraint is pushing back — it must be active (g_i = 0). Think of it as: you only feel the wall if you're pressed against it.

Exercise 4.1: Minimize x² s.t. x ≥ 2 Derive

Rewrite as: min f(x)=x² s.t. g(x)=2−x ≤ 0. Apply KKT. What is x*?

Stationarity: 2x − μ = 0. Complementary slackness: μ(2−x) = 0.

Show derivation

The unconstrained minimizer of x² is x=0, but 0 < 2 violates g(x) ≤ 0. So the constraint must be active.

Active constraint: g(x*) = 0 ⇒ 2 − x* = 0 ⇒ x* = 2

Stationarity: 2x* − μ = 0 ⇒ μ = 2(2) = 4

Dual feasibility: μ = 4 ≥ 0 ✓

Complementary slackness: μ × g(x*) = 4 × 0 = 0 ✓

All KKT conditions satisfied. The constraint pushes x away from the unconstrained minimum (x=0) to x=2. The multiplier μ=4 measures this push. In ML: this is like weight clipping pushing your weights to the constraint boundary.

Exercise 4.2: KKT Multiplier Derive

From Exercise 4.1, what is μ*?

μ*

Show derivation

From stationarity: 2x* − μ = 0 ⇒ μ = 2(2) = 4

μ* = 4 > 0, confirming the constraint is active. If we changed the constraint to x ≥ 1.9, the optimal value would improve (decrease) by approximately μ × 0.1 = 0.4. Actual change: 2² − 1.9² = 4 − 3.61 = 0.39 ≈ 0.4.

Exercise 4.3: Inactive Constraint Trace

Minimize f(x) = (x−5)² subject to x ≥ 2. The unconstrained minimum is x=5, which satisfies x ≥ 2. What are x* and μ*?

x*=2, μ*=6 — the constraint is active x*=5, μ*=0 — the constraint is inactive, so its multiplier is zero x*=5, μ*=6 — the constraint is satisfied but still has a multiplier

Show reasoning

The unconstrained minimum x=5 satisfies x ≥ 2 (since 5 > 2), so the constraint is inactive (slack). By complementary slackness, μ=0. The constraint doesn't affect the solution at all — removing it wouldn't change x*. This is the "you don't feel the wall" case.

Exercise 4.4: When Are KKT Sufficient? Trace

KKT conditions are necessary for optimality. When are they also sufficient (guaranteeing the point IS optimal)?

When f is convex and all g_i are convex (constraint region is convex) Always — KKT is both necessary and sufficient Only for linear programs When there is exactly one constraint

Show reasoning

For convex problems (convex objective, convex feasible set), KKT conditions are both necessary AND sufficient. This is because convex problems have no spurious local minima — any point satisfying the first-order conditions must be globally optimal. This is the foundation of convex optimization: if you can prove your problem is convex and find a KKT point, you're done. SVMs, linear regression with L2 regularization, and logistic regression are all convex.

Exercise 4.5: Two Constraints Derive

Minimize f(x) = x² subject to x ≥ −3 AND x ≤ 1. Rewrite as g₁(x) = −3−x ≤ 0 and g₂(x) = x−1 ≤ 0. What is x*? Which constraint(s) are active?

Show derivation

Unconstrained minimum: x* = 0

Check g₁(0) = −3 − 0 = −3 ≤ 0 ✓

Check g₂(0) = 0 − 1 = −1 ≤ 0 ✓

Both constraints satisfied, so x* = 0, μ₁ = μ₂ = 0

Both constraints are inactive. The unconstrained minimum lies safely inside the feasible region [−3, 1]. Neither constraint affects the solution. This is common in practice — many constraints in an optimization problem may be inactive at the solution.

Chapter 5: Duality

Every constrained optimization problem has a twin — the dual problem. The original is called the primal. The dual maximizes a lower bound on the primal's optimal value. When they agree (strong duality), solving the easier one solves both.

Primal: p* = min_x f(x) s.t. g_i(x) ≤ 0

Lagrange dual function:
g(μ) = min_x L(x, μ) = min_x [f(x) + ∑μ_ig_i(x)]

Dual problem: d* = max_μ≥0 g(μ)

Weak duality: d* ≤ p* (always holds)
Strong duality: d* = p* (holds for convex + Slater's condition)
Duality gap: p* − d* ≥ 0

Why duality matters in ML. SVMs are solved via the dual (which reveals the kernel trick). Reinforcement learning's LP formulation has a dual that yields the Bellman equation. Variational inference minimizes the KL divergence, which is the dual of the evidence lower bound (ELBO). Every time you see "lower bound" in ML, duality is lurking nearby.

Exercise 5.1: Dual of a Simple LP Derive

Primal: minimize c^Tx subject to Ax ≥ b, x ≥ 0. For a 1D case: minimize 3x s.t. 2x ≥ 4, x ≥ 0. Find the primal optimum p*.

The constraint 2x ≥ 4 gives x ≥ 2. Combined with x ≥ 0, the feasible set is [2, ∞). The objective 3x is minimized at the boundary.

p* (optimal primal value)

Show derivation

Feasible: x ≥ 2 (from 2x ≥ 4) and x ≥ 0

Minimize 3x over x ≥ 2: x* = 2, p* = 3(2) = 6

Exercise 5.2: Dual Value Derive

For the LP in 5.1 (min 3x, s.t. 2x ≥ 4, x ≥ 0), write the dual: max 4μ s.t. 2μ ≤ 3, μ ≥ 0. Find d*.

Dual constraint: 2μ ≤ 3 ⇒ μ ≤ 1.5. Maximize 4μ over μ ∈ [0, 1.5].

d* (optimal dual value)

Show derivation

μ* = 1.5, d* = 4(1.5) = 6

Duality gap = p* − d* = 6 − 6 = 0

Strong duality holds — the gap is zero. This always happens for LPs (linear programs). The dual variable μ* = 1.5 is the shadow price: increasing the right-hand side of 2x ≥ 4 to 2x ≥ 5 would increase the optimal cost by μ × 1 = 1.5 (from 6 to 7.5).

Exercise 5.3: Duality Gap Trace

For a non-convex problem, d* = 3 and p* = 5. What is the duality gap, and what does it mean?

Gap = −2. The dual overestimates the primal. Gap = 2. The dual provides a lower bound that is 2 below the true optimum. Gap = 2. The primal solution is infeasible.

Show reasoning

Duality gap = p* − d* = 5 − 3 = 2 > 0. The dual provides a certified lower bound on the primal optimum. We know the true optimal value is between 3 and 5. For non-convex problems, the gap can be large. For convex problems satisfying Slater's condition (a strictly feasible point exists), the gap is zero. This is why convexity is so valuable — it means the dual bound is tight.

Exercise 5.4: Strong Duality Conditions Trace

Which of these problems is guaranteed to have strong duality (zero gap)?

Minimize x² + y² subject to x + y ≤ 5 (convex objective, linear constraint) Minimize sin(x) subject to x ∈ [0, 2π] Minimize x³ subject to x ≥ 1

Show reasoning

Option 1: x²+y² is convex, x+y ≤ 5 is a linear (hence convex) constraint, and Slater's condition holds (e.g., x=0, y=0 is strictly feasible since 0 < 5). Strong duality guaranteed. Options 2 and 3 have non-convex objectives (sin is non-convex, x³ is non-convex over all R though it IS convex on [1,∞)) — no guarantee of strong duality.

Exercise 5.5: Dual Computation Derive

Compute the Lagrange dual function for: min x² s.t. x ≥ 2 (rewritten: g(x) = 2−x ≤ 0).

L(x,μ) = x² + μ(2−x). Minimize over x: ∂L/∂x = 2x − μ = 0 ⇒ x = μ/2. Substitute back. What is g(μ)?

d* = max g(μ) for μ ≥ 0

Show derivation

L(x,μ) = x² + μ(2 − x)

∂L/∂x = 2x − μ = 0 ⇒ x = μ/2

g(μ) = (μ/2)² + μ(2 − μ/2) = μ²/4 + 2μ − μ²/2 = −μ²/4 + 2μ

Maximize: g'(μ) = −μ/2 + 2 = 0 ⇒ μ* = 4

d* = g(4) = −16/4 + 8 = −4 + 8 = 4

Check: p* = (2)² = 4. Duality gap = 4 − 4 = 0 ✓

Strong duality holds because the problem is convex. The dual computation independently confirms p* = 4 and μ* = 4, matching our KKT solution from Chapter 4.

Chapter 6: Proximal Operators

Some functions are non-smooth — they have kinks where the derivative doesn't exist. The absolute value |x| at zero, the L1 norm, the ReLU at zero. Gradient descent can't handle these directly because there's no unique gradient at the kink. Proximal operators solve this by wrapping the non-smooth function in a smooth envelope.

Proximal operator:
prox_ηf(x) = argmin_z { f(z) + ||z − x||² / (2η) }

// "Find z that balances minimizing f and staying close to x"

Soft thresholding (prox of L1 norm):
prox_ηλ|·|(x) = sign(x) · max(|x| − ηλ, 0)

Proximal gradient descent:
x_k+1 = prox_ηg(x_k − η∇f(x_k)) // for minimizing f(x) + g(x)
// f is smooth (GD on it), g is non-smooth (prox it)

Soft thresholding = L1 regularization in action. When you train with L1 penalty, the proximal step pushes small weights to exactly zero. If |w| − λ < 0, the weight becomes zero. This is why L1 produces sparse models while L2 only shrinks weights toward zero. The prox operator makes this precise: it's a thresholding operation, not a gradient step.

Exercise 6.1: Soft Threshold at x=3, λ=1 Derive

Compute prox(x) = sign(x) · max(|x| − λ, 0) for x = 3, λ = 1.

prox(3)

Show derivation

sign(3) = +1

max(|3| − 1, 0) = max(2, 0) = 2

prox(3) = (+1) × 2 = 2

The value 3 gets shrunk by λ=1 toward zero, becoming 2. It doesn't become zero because |3| > λ. Only values with |x| ≤ λ get zeroed out.

Exercise 6.2: Soft Threshold at x=0.5, λ=1 Derive

Compute prox(x) for x = 0.5, λ = 1.

prox(0.5)

Show derivation

sign(0.5) = +1

max(|0.5| − 1, 0) = max(−0.5, 0) = 0

prox(0.5) = (+1) × 0 = 0

Since |0.5| < λ = 1, the value gets thresholded to exactly zero. This is the sparsity-inducing property of L1. Any weight smaller than the threshold gets killed. This is exactly what happens in LASSO regression.

Exercise 6.3: Negative Input Derive

Compute prox(x) for x = −2.5, λ = 0.5.

prox(−2.5)

Show derivation

sign(−2.5) = −1

max(|−2.5| − 0.5, 0) = max(2, 0) = 2

prox(−2.5) = (−1) × 2 = −2

The sign is preserved, and the magnitude is shrunk by λ = 0.5. The soft thresholding operator is symmetric: it shrinks positive and negative values equally toward zero.

Exercise 6.4: Implement softThreshold() Build

Implement the soft thresholding operator for a vector (array of numbers).

Apply the soft thresholding formula to each element independently.

Show solution

javascript
function softThreshold(x, lambda) {
  return x.map(xi => {
    const s = Math.sign(xi);
    const m = Math.max(Math.abs(xi) - lambda, 0);
    return s * m;
  });
}

Exercise 6.5: Prox of L2 Squared Trace

The proximal operator of f(z) = (λ/2)||z||² is prox(x) = x/(1+ηλ). For x=6, η=1, λ=2, what is prox(x)?

prox(6) = 6/3 = 2 — L2 prox shrinks uniformly, never to zero prox(6) = 0 — L2 prox zeros out like L1 prox(6) = 4 — subtract λ from x

Show reasoning

prox(6) = 6/(1 + 1×2) = 6/3 = 2. L2 regularization shrinks values toward zero by a multiplicative factor but never reaches zero. L1 regularization thresholds values to zero if they're small enough. This is the fundamental difference: L1 produces sparsity, L2 produces small weights. In neural networks, weight decay (L2) keeps all weights small; L1 pruning kills unimportant weights entirely.

Exercise 6.6: Find the Bug Debug

This soft thresholding function doesn't produce sparsity — values near zero never become zero. Click the buggy line.

function softThresh(x, lam) {
  const result = [];
  for (let i = 0; i < x.length; i++) {
    const s = Math.sign(x[i]);
    const m = Math.abs(x[i]) - lam;  // missing max(..., 0)
    result.push(s * m);
  }
  return result;
}

Show explanation

Line 5 is the bug. It computes |x[i]| − lam without the max(..., 0) clamp. When |x[i]| < lam, this produces a negative value, and multiplied by sign gives a flipped-sign result instead of zero. For example, x=0.3, lam=1: should give 0, but gives sign(0.3)×(0.3−1) = −0.7. The fix: const m = Math.max(Math.abs(x[i]) - lam, 0);

Chapter 7: Convergence Rates

Not all optimizers converge at the same speed. The convergence rate tells you how many iterations you need to get within ε of the optimum. This is the theoretical yardstick for comparing gradient descent, momentum, Nesterov, and Newton's method.

Convergence rates for f(x_k) − f(x*) ≤ ε:

GD on convex: O(1/k) // sublinear — k ∼ 1/ε
GD on strongly convex: O((1−1/κ)^k) // linear — k ∼ κ ln(1/ε)
Nesterov on convex: O(1/k²) // accelerated — k ∼ 1/√ε
Nesterov on strongly convex: O((1−1/√κ)^k) // k ∼ √κ ln(1/ε)
Newton's method: O(c^{2^k}) // quadratic — k ∼ log log(1/ε)

// κ = condition number = L/μ (ratio of curvature extremes)
// "Linear" convergence = each step removes a constant fraction of error

Why these rates matter in practice. To reach ε = 10⁻⁶ on a convex problem: GD needs ~10⁶ steps, Nesterov needs ~10³ steps, and Newton needs ~20 steps. But each Newton step costs O(n³) for the Hessian inverse vs O(n) for a gradient step. The winner depends on problem dimension n: Newton wins for n < 1000, gradient methods win for n > 10,000 (like neural networks with millions of parameters).

Exercise 7.1: GD Iterations for ε=10⁻⁶ Derive

GD on a convex function converges as f(x_k) − f* ≤ C/k where C is a problem constant. How many iterations k to reach ε = 10⁻⁶ if C = 1?

iterations

Show derivation

C/k ≤ ε ⇒ k ≥ C/ε = 1/10⁻⁶ = 10⁶

k = 1,000,000 iterations

One million iterations — painfully slow. This is the O(1/k) sublinear rate. Doubling accuracy requires doubling iterations. This motivates using Nesterov acceleration or exploiting strong convexity for faster convergence.

Exercise 7.2: Nesterov Iterations for ε=10⁻⁶ Derive

Nesterov accelerated gradient on a convex function: f(x_k) − f* ≤ C/k². How many iterations for ε = 10⁻⁶ with C = 1?

iterations

Show derivation

C/k² ≤ ε ⇒ k² ≥ C/ε = 10⁶ ⇒ k ≥ 10³ = 1000

1,000 vs 1,000,000 — that's a 1000x speedup just from the Nesterov trick. The O(1/k²) rate is provably optimal for first-order methods on convex functions (Nesterov, 1983). You cannot do better with only gradient information — this is the information-theoretic lower bound.

Exercise 7.3: Strongly Convex GD Derive

For strongly convex f with κ = 100 (condition number), GD converges as (1 − 1/κ)^k ≤ ε. How many iterations for ε = 10⁻⁶?

Take log: k × ln(1 − 1/κ) ≤ ln(ε). Use ln(1 − x) ≈ −x for small x.

iterations

Show derivation

k ≥ ln(1/ε) / ln(1/(1 − 1/κ)) ≈ κ × ln(1/ε)

k ≈ 100 × ln(10⁶) = 100 × 6 × ln(10) = 100 × 13.816 ≈ 1382

Strong convexity gives linear convergence — each step removes a constant fraction (1/κ) of the remaining error. This is exponentially faster than the O(1/k) rate. The cost is proportional to κ: ill-conditioned problems (large κ) converge slowly. Most deep learning loss surfaces are extremely ill-conditioned (κ > 10⁶), which is why raw GD is impractical.

Exercise 7.4: Compare All Rates Trace

Rank these from slowest to fastest convergence for a strongly convex problem with κ=100, targeting ε=10⁻⁶:
A) GD: ~1382 iterations
B) Nesterov (strongly convex): ~√κ × ln(1/ε) = 10 × 13.8 ≈ 138
C) Newton: ~log₂(log₂(1/ε)) ≈ log₂(20) ≈ 4-5 steps

Slowest to fastest: GD (1382) → Nesterov (138) → Newton (5) Slowest to fastest: Newton → GD → Nesterov They all take roughly the same number of steps

Show reasoning

GD takes ~1382 steps, Nesterov takes ~138 steps (10x fewer), Newton takes ~5 steps (276x fewer). But Newton's method requires computing and inverting the Hessian at each step — O(n³) work per step where n is the problem dimension. For a neural network with n = 10⁸ parameters, one Newton step would require 10²⁴ operations. That's why deep learning uses first-order methods despite their slower convergence rate — the per-step cost dominates.

Exercise 7.5: Practical Implications Derive

You're training a model with 10M parameters. Each GD step takes 0.1 seconds. Each Newton step takes n² × 10⁻¹² seconds (simplified). How many seconds for 1000 GD steps vs 5 Newton steps?

GD total time (seconds)

Show derivation

GD: 1000 steps × 0.1 sec = 100 seconds

Newton: 5 steps × (10⁷)² × 10⁻¹² = 5 × 10¹⁴ × 10⁻¹² = 5 × 100 = 500 seconds

GD wins! 100 vs 500 seconds. Newton takes fewer steps but each step is enormously expensive. At 10M parameters, Newton is already 5x slower despite needing 200x fewer iterations. At 1B parameters, Newton would take 5 × 10⁶ seconds per step — completely infeasible. This is the fundamental reason deep learning uses first-order methods.

Chapter 8: Second-Order Methods

Gradient descent uses the first derivative (slope) to decide where to go. Newton's method uses the second derivative (curvature) to decide how far to go. If the surface curves sharply, take a small step. If it curves gently, take a big step. This curvature information is encoded in the Hessian matrix.

Newton's method:
x_k+1 = x_k − [H(x_k)]⁻¹ ∇f(x_k)

1D case: x_k+1 = x_k − f'(x_k) / f''(x_k)

Multi-dimensional:
H = [∂²f/∂x_i∂x_j] // n×n matrix of second derivatives
Cost per step: O(n²) to form H, O(n³) to invert it

For quadratic f(x) = x^TAx + b^Tx + c:
Newton converges in exactly 1 step (the update is exact)

Newton's method is perfect for quadratics. The second-order Taylor approximation of a quadratic IS the quadratic. So Newton solves quadratics in one step. Near any minimum of a smooth function, the landscape is approximately quadratic — this is why Newton has quadratic convergence rate: each step squares the error. The catch: forming the n×n Hessian for a neural network with 10⁹ parameters would require 10¹⁸ entries — a billion GB.

Exercise 8.1: Newton Step for f(x,y) = x² + 10y² Derive

Compute one Newton step from (x₀, y₀) = (10, 10). The Hessian is [[2, 0], [0, 20]], the gradient is [2x, 20y]. What is the new point (x₁, y₁)?

Newton: [x,y]_new = [x,y]_old − H⁻¹∇f. H⁻¹ = [[1/2, 0], [0, 1/20]].

x₁ (x-coordinate after Newton step)

Show derivation

∇f(10,10) = [2(10), 20(10)] = [20, 200]

H⁻¹ = [[1/2, 0], [0, 1/20]]

H⁻¹∇f = [1/2 × 20, 1/20 × 200] = [10, 10]

(x₁, y₁) = (10, 10) − (10, 10) = (0, 0)

One step to the exact minimum. Since f is quadratic, Newton solves it perfectly. Compare to GD with optimal α = 1/L = 1/20: after one step, x₁ = 10 − 20/20 = 9 (y would reach 0, but x barely moved). GD would need many iterations because of the 10:1 condition number. Newton doesn't care about conditioning — it normalizes each direction by its curvature.

Exercise 8.2: GD vs Newton on Same Problem Derive

For f(x,y) = x² + 10y² starting at (10, 10), one GD step with optimal α = 1/20 (1/L_max). What is x₁ after GD?

x₁ (GD)

Show derivation

∇f(10,10) = [20, 200]

x₁ = 10 − (1/20) × 20 = 10 − 1 = 9

y₁ = 10 − (1/20) × 200 = 10 − 10 = 0

After one GD step: (9, 0). The y-direction converges fast (curvature 20, step size matches) but x-direction crawls (curvature 2, but step size is tuned for curvature 20). This is the zigzag problem caused by κ = 20/2 = 10. Newton would need 1 step; GD needs ~10 × ln(10) ≈ 23 steps for the same accuracy.

Exercise 8.3: Hessian Size Derive

A transformer with 7B parameters. How many entries in its full Hessian matrix? Express in scientific notation.

×10¹⁹ entries

Show derivation

H is n × n where n = 7 × 10⁹

Entries = n² = (7 × 10⁹)² = 49 × 10¹⁸ = 4.9 × 10¹⁹

At 4 bytes each (float32): 4.9 × 10¹⁹ × 4 = ~200 exabytes

200 exabytes — that's ~200 million terabytes. The world's total data storage is ~120 exabytes. You literally cannot store the Hessian of a 7B model. This is why exact Newton's method is impossible for deep learning, and why approximations (Adam's per-parameter second moments, K-FAC's block-diagonal approximation, L-BFGS's low-rank updates) exist.

Exercise 8.4: Why Not Newton for Deep Learning? Trace

Which is the primary reason Newton's method isn't used for training neural networks?

The loss function is non-convex Newton converges to saddle points The n×n Hessian requires O(n²) storage and O(n³) inversion, infeasible for n > 10⁶ Newton requires exact gradients (no minibatching)

Show reasoning

All options are real concerns, but the computational cost is the dealbreaker. Non-convexity and saddle points can be handled with modifications (regularized Newton, trust regions). Stochastic Hessians exist. But there's no getting around O(n²) storage and O(n³) per-step cost. Even approximate methods like L-BFGS (which stores a low-rank Hessian approximation) struggle beyond ~1M parameters because the inner products still scale with n. Adam is the practical compromise: it maintains per-parameter second moment estimates (diagonal Hessian approximation) at O(n) cost.

Exercise 8.5: 1D Newton Step Build

Implement one step of Newton's method for 1D: x_new = x - f'(x)/f''(x).

One line: x - df(x)/ddf(x). But what if ddf(x) = 0?

Show solution

javascript
function newtonStep(df, ddf, x) {
  const h = ddf(x);
  if (Math.abs(h) < 1e-12) return x; // avoid div by zero
  return x - df(x) / h;
}

Exercise 8.6: Arrange Newton's Method Pipeline Design

Put the steps of one Newton iteration in the correct order for the multi-dimensional case.

→

Compute ∇f(x_k) Compute Hessian H(x_k) Solve H · Δx = −∇f x_k+1 = x_k + Δx Check convergence ||∇f|| < ε

Show explanation

Correct order: Compute gradient → Compute Hessian → Solve linear system → Update x → Check convergence. The key insight: you solve HΔx = −∇f instead of computing H⁻¹ explicitly. Solving a linear system is O(n³) but inverting a matrix is also O(n³) with a worse constant. In practice, Cholesky decomposition (for PD Hessians) or conjugate gradient (for approximate solves) is used.

Optimization Theory

Chapter 0: Convexity

Chapter 1: Gradient Descent

Chapter 2: SGD Variants

Chapter 3: Constrained Optimization

Chapter 4: KKT Conditions

Chapter 5: Duality

Chapter 6: Proximal Operators

Chapter 7: Convergence Rates

Chapter 8: Second-Order Methods

Chapter 9: Capstone — Loss Landscape Analysis