Ch 7: Continuous Optimization

Chapter 0: Why Optimize?

Training a machine learning model boils down to one thing: finding the parameters θ that minimize a loss function L(θ). The loss measures how badly your model fits the data. Optimization is the engine that drives that loss down.

Every ML algorithm you have seen — linear regression, logistic regression, SVMs, neural networks — reduces to solving an optimization problem. The specifics vary (least squares, cross-entropy, hinge loss), but the structure is always the same:

θ^* = argmin_θ L(θ)

The core idea: You are standing on a hilly landscape. The height at each point is the loss. You want to reach the lowest valley. Optimization is the process of walking downhill, step by step, guided by the slope under your feet. The slope is the gradient. The step is gradient descent.

This chapter builds the optimization toolkit: gradient descent and its variants (momentum, SGD) for unconstrained problems, Lagrange multipliers for constrained problems, and convexity as the property that guarantees a unique global minimum.

Unconstrained

Gradient descent, momentum, SGD

↓

Constrained

Lagrange multipliers, KKT conditions

↓

Convex Optimization

Convex sets/functions, LP, QP, duality

Let's start with the most important optimization algorithm in all of ML. Below is a 2D loss surface. Click anywhere to place a starting point. Watch gradient descent navigate the landscape.

Gradient Descent on a 2D Surface

Click to set the starting point. Adjust the learning rate and toggle momentum. The contour lines show the loss surface f(x₁, x₂) = x₁² + κ x₂². The condition number κ controls how elongated the contours are.

Momentum Click to set start

Learning rate η0.050

Condition κ10

Momentum β0.90

What to notice: With high condition number κ, vanilla GD oscillates along the steep direction while making slow progress along the flat direction. Momentum damps the oscillations and accelerates along the consistent gradient direction. A learning rate too large causes divergence; too small causes painfully slow convergence.

Check: What does the gradient of a loss function tell you?

The global minimum location The direction of steepest ascent at the current point The second derivative of the loss

Chapter 1: Gradient Descent

The idea is beautifully simple. The gradient ∇L(θ) points in the direction of steepest increase. So we step in the opposite direction:

θ_t+1 = θ_t − η ∇L(θ_t)

where η > 0 is the learning rate (or step size). Each step moves θ a little bit downhill. Repeat until the gradient is approximately zero — meaning we are at a (local) minimum, saddle point, or very flat region.

Why the negative gradient? The gradient ∇L points uphill. Among all unit-length directions d, the one that decreases L the fastest is d = −∇L / ||∇L||. This is a consequence of the Cauchy-Schwarz inequality: the inner product ∇L^Td is minimized when d is anti-parallel to ∇L.

Convergence conditions: For gradient descent to converge to a local minimum, we need L to be smooth (differentiable with Lipschitz-continuous gradients) and the learning rate to be small enough. Specifically, if the gradient is L-Lipschitz, then η < 2/L guarantees the loss decreases at every step.

pseudocode
Input: initial θ₀, learning rate η, tolerance ε

for t = 0, 1, 2, ...:
    g ← ∇L(θ_t)             # compute gradient
    θ_t+1 ← θ_t − η · g    # step downhill
    if ||g|| < ε: break     # convergence check

return θ_t+1

In the context of ML, θ might be millions of neural network weights, and ∇L is computed via backpropagation (Chapter 5). The loss is typically the average over all training examples. We will see in Chapter 4 that computing the full gradient is expensive, motivating stochastic approximations.

Local vs. global minima: Gradient descent finds local minima, not necessarily the global minimum. For convex functions (Chapter 7), every local minimum is global, so GD finds the best possible solution. For non-convex functions (like neural network losses), GD can get stuck in local minima or saddle points. In practice, overparameterized neural networks have surprisingly benign loss landscapes where most local minima are nearly as good as the global one.

Check: What is the gradient descent update rule?

θ_t+1 = θ_t − η ∇L(θ_t) θ_t+1 = θ_t + η ∇L(θ_t) θ_t+1 = η ∇L(θ_t)

Chapter 2: Step-Size Matters

The learning rate η is the single most important hyperparameter in gradient descent. Too large, and you overshoot the minimum — the loss increases. Too small, and convergence is painfully slow. Getting it right makes the difference between a model that trains in hours and one that never converges.

Consider the simplest case: minimizing f(x) = ½ x². The gradient is f'(x) = x, so the update is x_t+1 = x_t − η x_t = (1 − η) x_t.

η	Behavior
0 < η < 1	Converges monotonically: x shrinks by factor (1 − η) each step
η = 1	Converges in one step: x₁ = 0
1 < η < 2	Oscillates but converges: overshoots then corrects
η = 2	Oscillates forever: x_t = (−1)^t x₀
η > 2	Diverges: \|x\| grows exponentially

The stability boundary: For a quadratic function with curvature (second derivative) equal to L, the maximum stable learning rate is η = 2/L. This generalizes to multiple dimensions: if the Hessian has eigenvalues λ₁ ≤ … ≤ λ_D, you need η < 2/λ_max. But the convergence rate depends on η λ_min. The condition number κ = λ_max/λ_min determines how hard the problem is.

Learning Rate Explorer

Minimizing f(x) = ½x² from x₀ = 3. Watch how different learning rates affect convergence. The optimal rate is η = 1 (one-step convergence).

Step 0

η0.30

Adaptive learning rates: Modern optimizers like Adam, RMSProp, and AdaGrad maintain per-parameter learning rates that adapt based on the history of gradients. Parameters with large gradients get smaller steps; parameters with small gradients get larger steps. This effectively normalizes for the curvature, reducing the condition number problem.

Key insight: The condition number κ = λ_max/λ_min determines convergence speed. A well-conditioned problem (κ ≈ 1) converges fast — the contours are circular. An ill-conditioned problem (κ >> 1) converges slowly — the contours are elongated ellipses, and GD zigzags.

Check: For f(x) = ½λx², what is the maximum stable learning rate?

η = 2/λ η = λ η = 1/λ²

Chapter 3: Momentum

Vanilla gradient descent has a fundamental problem: it only uses the local gradient at each step. If the loss surface is a narrow valley (high condition number), GD bounces back and forth across the valley walls while making slow progress along the valley floor.

Momentum fixes this by adding a "velocity" term that accumulates past gradients. Think of a ball rolling down a hill — it picks up speed in consistent directions and smooths out oscillations.

v_t+1 = β v_t + ∇L(θ_t)

θ_t+1 = θ_t − η v_t+1

Here β ∈ [0, 1) is the momentum coefficient. When β = 0, we recover vanilla GD. When β = 0.9 (a common default), the velocity accumulates a moving average of the last ~10 gradients.

Why momentum works: In a narrow valley, the gradients oscillate — positive on one side, negative on the other. Momentum averages these oscillations out, leaving only the consistent component along the valley floor. The result: faster convergence along the dominant direction and damped oscillations along the others.

Vanilla GD:

• Uses only current gradient
• Zigzags in narrow valleys
• Convergence rate ~ O(1/κ)

GD + Momentum:

• Accumulates past gradients
• Smooth trajectory in narrow valleys
• Convergence rate ~ O(1/√κ)

The improvement from O(1/κ) to O(1/√κ) is significant. For κ = 100, momentum is ~10x faster. For κ = 10000, it is ~100x faster. This is why momentum is essentially universal in modern deep learning.

Nesterov momentum is a subtle refinement. Instead of computing the gradient at θ_t, you compute it at θ_t − ηβv_t — you "look ahead" to where momentum is about to take you, then correct. This improves the constant factor in the convergence rate and is provably optimal for first-order methods on convex functions.

Physical analogy: Imagine a ball rolling on the loss surface. The gradient is gravity (pulling downhill). Momentum is inertia (the ball keeps moving in its current direction). Friction (β < 1) prevents the ball from rolling forever. The ball naturally finds the bottom of valleys, overshoots a little, and settles.

Check: What problem does momentum solve in gradient descent?

It damps oscillations in narrow valleys and accelerates convergence along consistent gradient directions It computes the second derivative for faster convergence It eliminates the need for a learning rate

Chapter 4: Stochastic Gradient Descent

In ML, the loss function is typically an average over N training examples:

L(θ) = (1/N) ∑_i=1^N ℓ(θ, x_i, y_i)

Computing the full gradient requires evaluating ∇ℓ at every training example. With millions of examples, each gradient step becomes expensive. Stochastic gradient descent (SGD) approximates the full gradient using a random mini-batch of B examples:

g_t = (1/B) ∑_{i ∈ batch} ∇ℓ(θ_t, x_i, y_i)

θ_t+1 = θ_t − η g_t

The key tradeoff: The mini-batch gradient g_t is an unbiased estimate of the true gradient: E[g_t] = ∇L(θ_t). But it has variance — each step is noisy. Smaller batches = more noise, more steps per epoch. Larger batches = less noise, fewer steps but each is more expensive. The sweet spot depends on the problem.

SGD vs Full GD

Both optimize the same function. Full GD takes clean steps. SGD takes noisy but frequent steps. Watch how SGD's trajectory is erratic but reaches the minimum in fewer gradient evaluations.

Click Run to compare

Noise σ1.0

Property	Full Gradient	Mini-batch SGD
Gradient cost	O(N) per step	O(B) per step
Gradient variance	Zero	~ σ²/B
Convergence	Smooth, monotonic	Noisy, can escape local minima
GPU utilization	Often saturates	Better hardware parallelism

Noise as a feature: The stochasticity of SGD is not just a nuisance — it can help escape shallow local minima and saddle points. The noise acts as implicit regularization, pushing the optimizer toward flatter minima that generalize better. This is one reason SGD often outperforms full-batch GD in deep learning.

Check: Why is the mini-batch gradient an unbiased estimate of the true gradient?

Because the mini-batch is randomly sampled, so its expected gradient equals the full gradient Because mini-batches always contain the most important examples Because the learning rate compensates for the smaller batch

Chapter 5: Constrained Optimization

So far we have been free to set θ to any value. But many real problems have constraints. Probabilities must sum to 1. SVM margins must be non-negative. Resource allocations cannot be negative. How do we optimize while respecting constraints?

The general constrained optimization problem is:

minimize f(x) subject to g_i(x) ≤ 0, h_j(x) = 0

where g_i are inequality constraints and h_j are equality constraints. The set of x satisfying all constraints is called the feasible set.

Key insight: At a constrained minimum, one of two things is true at each constraint: either the constraint is inactive (slack), meaning the unconstrained optimum already satisfies it and it plays no role; or the constraint is active (tight), meaning it restricts the solution. Active constraints are what make constrained optimization hard.

For equality constraints, the idea is geometric. At the optimum, the gradient of f must be perpendicular to the constraint surface h(x) = 0. If it weren't, you could slide along the constraint surface and decrease f. Perpendicularity means ∇f is parallel to ∇h:

∇f(x^*) = λ ∇h(x^*)

for some scalar λ. This is the essence of the Lagrange multiplier method (next chapter).

Geometric picture: Imagine minimizing f(x, y) subject to h(x, y) = 0. The constraint defines a curve. You walk along this curve until you find the point where the contours of f are tangent to the constraint curve. At that point, the gradient of f points straight away from the constraint — any movement along the constraint would neither increase nor decrease f. That is the constrained optimum.

Check: At a constrained minimum with equality constraint h(x) = 0, what geometric condition holds?

The gradient of f is zero The gradient of f is parallel to the gradient of h (the contours are tangent) The gradient of h is zero

Chapter 6: Lagrange Multipliers

The Lagrange multiplier method converts a constrained problem into an unconstrained one by introducing auxiliary variables λ (the multipliers). For equality constraints:

L(x, λ) = f(x) + ∑_j λ_j h_j(x)

This is the Lagrangian. The constrained optimum satisfies ∇_xL = 0 and ∇_λL = 0 simultaneously. The first condition gives the balance between the objective gradient and constraint gradients. The second condition enforces the constraints themselves (h_j(x) = 0).

For inequality constraints g_i(x) ≤ 0, we extend to:

L(x, λ, μ) = f(x) + ∑_j λ_j h_j(x) + ∑_i μ_i g_i(x)

with the additional KKT conditions:

KKT conditions (necessary for optimality):
1. Stationarity: ∇_xL = 0
2. Primal feasibility: g_i(x) ≤ 0, h_j(x) = 0
3. Dual feasibility: μ_i ≥ 0
4. Complementary slackness: μ_i g_i(x) = 0

Complementary slackness is the elegant part: for each inequality, either the constraint is tight (g_i = 0) or its multiplier is zero (μ_i = 0). Inactive constraints have zero multipliers — they do not influence the solution.

SVM connection: The support vector machine is a beautiful application of KKT conditions. The optimization has inequality constraints (margin ≥ 1 for each training point). At the optimum, complementary slackness says μ_i = 0 for points with margin > 1 (they are irrelevant) and μ_i > 0 only for points exactly on the margin boundary. These are the support vectors — the few points that actually define the decision boundary. Chapter 12 explores this in detail.

Lagrange Multiplier Visualization

Minimize f(x, y) = x² + y² subject to the constraint x + y = 1 (red line). The constrained minimum (star) is where the level curve of f is tangent to the constraint. Drag the constraint offset to see how the solution changes.

Constraint: x + y =1.0

Check: What does "complementary slackness" mean for inequality constraints?

Either the constraint is tight (active) or its multiplier is zero — never both nonzero All constraints must be active at the optimum The Lagrange multipliers are always positive

Chapter 7: Convexity

Convexity is the single most important structural property in optimization. If your problem is convex, every local minimum is global. There are no bad local minima, no saddle points to worry about. The landscape has a single bowl, and any downhill path leads to the bottom.

A set C is convex if for any two points x, y ∈ C, the line segment between them is also in C:

∀ x, y ∈ C, ∀ α ∈ [0, 1]: αx + (1 − α)y ∈ C

A function f is convex if it lies below any chord. Formally, for all x, y in the domain and α ∈ [0, 1]:

f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y)

The chord condition: Draw a line between any two points on the graph of f. If the function always lies below or on the line, f is convex. Equivalently: the epigraph (the set of points above the graph) is a convex set. Equivalently: the Hessian ∇²f is positive semi-definite everywhere.

Equivalent conditions for twice-differentiable f:

• f is convex ⇔ ∇²f(x) ⪰ 0 for all x (Hessian is PSD)

• f is strictly convex ⇔ ∇²f(x) ≻ 0 for all x (Hessian is PD)

• f is convex ⇔ f(y) ≥ f(x) + ∇f(x)^T(y − x) for all x, y (the first-order condition — the function lies above all tangent hyperplanes)

Key insight: The first-order condition says something profound: for a convex function, the tangent line at any point is a global lower bound. This means that if ∇f(x) = 0 anywhere, then f(x) ≤ f(y) for ALL y. A zero-gradient point is guaranteed to be a global minimum. No other structure gives you this luxury.

Operations that preserve convexity:

Operation	Result
Non-negative sum: αf + βg (α, β ≥ 0)	Convex
Pointwise maximum: max(f, g)	Convex
Composition with affine: f(Ax + b)	Convex
Infimum over some variables	Convex

Check: Why is convexity so desirable in optimization?

Every local minimum is a global minimum — no bad local optima Convex functions are always differentiable Gradient descent converges in exactly one step

Chapter 8: Linear & Quadratic Programs

Two important special cases of convex optimization have dedicated solution methods and appear throughout ML.

A linear program (LP) minimizes a linear objective subject to linear constraints:

minimize c^Tx subject to Ax ≤ b

Since the objective is linear (flat — no curvature), the minimum must be at a corner of the feasible polytope. The simplex algorithm walks along edges from corner to corner, improving the objective at each step. Interior point methods take a different approach, walking through the interior.

A quadratic program (QP) adds a quadratic term to the objective:

minimize ½ x^TQx + c^Tx subject to Ax ≤ b

If Q is positive semi-definite, the QP is convex and has a unique global minimum (or a set of global minima). QPs appear directly in SVMs (Chapter 12) and ridge regression.

SVMs are QPs: The SVM dual optimization problem is min ½α^TQα − 1^Tα subject to 0 ≤ α_i ≤ C and y^Tα = 0. Here Q_ij = y_iy_jx_i^Tx_j. This is a standard QP that can be solved with off-the-shelf solvers in polynomial time.

Linear Program:

• Linear objective + linear constraints
• Solution at a vertex of the polytope
• Simplex or interior point method
• Applications: scheduling, resource allocation

Quadratic Program:

• Quadratic objective + linear constraints
• Solution may be interior or at boundary
• Active-set or interior point method
• Applications: SVM, portfolio optimization

Hierarchy of difficulty: LP ⊂ QP ⊂ Convex ⊂ General nonlinear. Each level up is strictly harder. LPs are solvable in polynomial time. QPs with PSD Q are solvable in polynomial time. General convex problems are solvable in polynomial time (via interior point methods). Non-convex problems are NP-hard in general, though practical heuristics (like SGD on neural networks) work remarkably well.

Check: Why does the solution to a linear program always occur at a vertex of the feasible polytope?

Because the objective is linear (flat), so it cannot have an interior minimum — it is always maximized/minimized at a corner Because vertices are the only feasible points Because the simplex algorithm only checks vertices

Chapter 9: Convex Conjugates

The convex conjugate (or Legendre-Fenchel transform) provides a dual representation of convex functions that is central to duality theory. For a function f, its conjugate f^* is:

f^*(s) = sup_x { s^Tx − f(x) }

Geometrically, f^*(s) is the maximum gap between the linear function s^Tx and f(x). It encodes f through its tangent lines rather than its point values.

The Fenchel-Young inequality: For any x and s: f(x) + f^*(s) ≥ s^Tx. Equality holds when s = ∇f(x), i.e., when s is the gradient at x. This inequality is the backbone of duality — it connects the primal variable x with the dual variable s.

Key properties:

• f^* is always convex, even if f is not

• If f is convex and closed, then f^** = f (double conjugate recovers the original)

• The conjugate swaps between "primal" and "dual" representations

Function f(x)	Conjugate f^*(s)
½ x²	½ s²
\|x\|	0 if \|s\| ≤ 1, ∞ otherwise
½ x^TQx (Q ≻ 0)	½ s^TQ⁻¹s
e^x	s log s − s (s ≥ 0)

Why this matters: The conjugate appears in the dual of every convex optimization problem. Weak duality says the dual optimal value is always a lower bound on the primal. Strong duality (guaranteed under Slater's condition for convex problems) says the two are equal. This means you can solve the dual instead of the primal — whichever is easier. For SVMs, the dual is a QP in fewer variables. For maximum entropy problems, the dual is unconstrained.

The minimax inequality encapsulates duality:

max_λ min_x L(x, λ) ≤ min_x max_λ L(x, λ)

The left side is the dual problem; the right is the primal. Weak duality says the inequality always holds. Strong duality says equality holds for convex problems satisfying constraint qualification. This "max-min ≤ min-max" principle appears everywhere: game theory, GANs, robust optimization.

Connection to information theory: The conjugate of the log-partition function A(η) of an exponential family (Chapter 6) is the negative entropy. This connects maximum likelihood estimation (which maximizes the likelihood, related to A) with maximum entropy (which maximizes entropy, the conjugate). Two sides of the same coin, linked by the convex conjugate.

"The great watershed in optimization is not between linearity and nonlinearity,
but between convexity and non-convexity."
— R. Tyrrell Rockafellar

Check: What does the Fenchel-Young inequality state?

f(x) ≥ f^*(s) for all x and s f(x) = f^*(s) always f(x) + f^*(s) ≥ s^Tx, with equality when s = ∇f(x)