Deisenroth et al., Chapter 7

Continuous Optimization

Finding the best parameters — gradient descent, momentum, constraints, and convexity.

Prerequisites: Chapter 5 (vector calculus). Gradients, Jacobians, and the chain rule.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Why Optimize?

Training a machine learning model boils down to one thing: finding the parameters θ that minimize a loss function L(θ). The loss measures how badly your model fits the data. Optimization is the engine that drives that loss down.

Every ML algorithm you have seen — linear regression, logistic regression, SVMs, neural networks — reduces to solving an optimization problem. The specifics vary (least squares, cross-entropy, hinge loss), but the structure is always the same:

θ* = argminθ L(θ)
The core idea: You are standing on a hilly landscape. The height at each point is the loss. You want to reach the lowest valley. Optimization is the process of walking downhill, step by step, guided by the slope under your feet. The slope is the gradient. The step is gradient descent.

This chapter builds the optimization toolkit: gradient descent and its variants (momentum, SGD) for unconstrained problems, Lagrange multipliers for constrained problems, and convexity as the property that guarantees a unique global minimum.

Unconstrained
Gradient descent, momentum, SGD
Constrained
Lagrange multipliers, KKT conditions
Convex Optimization
Convex sets/functions, LP, QP, duality

Let's start with the most important optimization algorithm in all of ML. Below is a 2D loss surface. Click anywhere to place a starting point. Watch gradient descent navigate the landscape.

Gradient Descent on a 2D Surface

Click to set the starting point. Adjust the learning rate and toggle momentum. The contour lines show the loss surface f(x1, x2) = x12 + κ x22. The condition number κ controls how elongated the contours are.

Click to set start
Learning rate η0.050
Condition κ10
Momentum β0.90
What to notice: With high condition number κ, vanilla GD oscillates along the steep direction while making slow progress along the flat direction. Momentum damps the oscillations and accelerates along the consistent gradient direction. A learning rate too large causes divergence; too small causes painfully slow convergence.
Check: What does the gradient of a loss function tell you?

Chapter 1: Gradient Descent

The idea is beautifully simple. The gradient ∇L(θ) points in the direction of steepest increase. So we step in the opposite direction:

θt+1 = θt − η ∇L(θt)

where η > 0 is the learning rate (or step size). Each step moves θ a little bit downhill. Repeat until the gradient is approximately zero — meaning we are at a (local) minimum, saddle point, or very flat region.

Why the negative gradient? The gradient ∇L points uphill. Among all unit-length directions d, the one that decreases L the fastest is d = −∇L / ||∇L||. This is a consequence of the Cauchy-Schwarz inequality: the inner product ∇LTd is minimized when d is anti-parallel to ∇L.

Convergence conditions: For gradient descent to converge to a local minimum, we need L to be smooth (differentiable with Lipschitz-continuous gradients) and the learning rate to be small enough. Specifically, if the gradient is L-Lipschitz, then η < 2/L guarantees the loss decreases at every step.

pseudocode
Input: initial θ0, learning rate η, tolerance ε

for t = 0, 1, 2, ...:
    g ← ∇L(θt)             # compute gradient
    θt+1 ← θt − η · g    # step downhill
    if ||g|| < ε: break     # convergence check

return θt+1

In the context of ML, θ might be millions of neural network weights, and ∇L is computed via backpropagation (Chapter 5). The loss is typically the average over all training examples. We will see in Chapter 4 that computing the full gradient is expensive, motivating stochastic approximations.

Local vs. global minima: Gradient descent finds local minima, not necessarily the global minimum. For convex functions (Chapter 7), every local minimum is global, so GD finds the best possible solution. For non-convex functions (like neural network losses), GD can get stuck in local minima or saddle points. In practice, overparameterized neural networks have surprisingly benign loss landscapes where most local minima are nearly as good as the global one.
Check: What is the gradient descent update rule?

Chapter 2: Step-Size Matters

The learning rate η is the single most important hyperparameter in gradient descent. Too large, and you overshoot the minimum — the loss increases. Too small, and convergence is painfully slow. Getting it right makes the difference between a model that trains in hours and one that never converges.

Consider the simplest case: minimizing f(x) = ½ x2. The gradient is f'(x) = x, so the update is xt+1 = xt − η xt = (1 − η) xt.

ηBehavior
0 < η < 1Converges monotonically: x shrinks by factor (1 − η) each step
η = 1Converges in one step: x1 = 0
1 < η < 2Oscillates but converges: overshoots then corrects
η = 2Oscillates forever: xt = (−1)t x0
η > 2Diverges: |x| grows exponentially
The stability boundary: For a quadratic function with curvature (second derivative) equal to L, the maximum stable learning rate is η = 2/L. This generalizes to multiple dimensions: if the Hessian has eigenvalues λ1 ≤ … ≤ λD, you need η < 2/λmax. But the convergence rate depends on η λmin. The condition number κ = λmaxmin determines how hard the problem is.
Learning Rate Explorer

Minimizing f(x) = ½x2 from x0 = 3. Watch how different learning rates affect convergence. The optimal rate is η = 1 (one-step convergence).

Step 0
η0.30

Adaptive learning rates: Modern optimizers like Adam, RMSProp, and AdaGrad maintain per-parameter learning rates that adapt based on the history of gradients. Parameters with large gradients get smaller steps; parameters with small gradients get larger steps. This effectively normalizes for the curvature, reducing the condition number problem.

Key insight: The condition number κ = λmaxmin determines convergence speed. A well-conditioned problem (κ ≈ 1) converges fast — the contours are circular. An ill-conditioned problem (κ >> 1) converges slowly — the contours are elongated ellipses, and GD zigzags.
Check: For f(x) = ½λx2, what is the maximum stable learning rate?

Chapter 3: Momentum

Vanilla gradient descent has a fundamental problem: it only uses the local gradient at each step. If the loss surface is a narrow valley (high condition number), GD bounces back and forth across the valley walls while making slow progress along the valley floor.

Momentum fixes this by adding a "velocity" term that accumulates past gradients. Think of a ball rolling down a hill — it picks up speed in consistent directions and smooths out oscillations.

vt+1 = β vt + ∇L(θt)
θt+1 = θt − η vt+1

Here β ∈ [0, 1) is the momentum coefficient. When β = 0, we recover vanilla GD. When β = 0.9 (a common default), the velocity accumulates a moving average of the last ~10 gradients.

Why momentum works: In a narrow valley, the gradients oscillate — positive on one side, negative on the other. Momentum averages these oscillations out, leaving only the consistent component along the valley floor. The result: faster convergence along the dominant direction and damped oscillations along the others.

Vanilla GD:

• Uses only current gradient
• Zigzags in narrow valleys
• Convergence rate ~ O(1/κ)

GD + Momentum:

• Accumulates past gradients
• Smooth trajectory in narrow valleys
• Convergence rate ~ O(1/√κ)

The improvement from O(1/κ) to O(1/√κ) is significant. For κ = 100, momentum is ~10x faster. For κ = 10000, it is ~100x faster. This is why momentum is essentially universal in modern deep learning.

Nesterov momentum is a subtle refinement. Instead of computing the gradient at θt, you compute it at θt − ηβvt — you "look ahead" to where momentum is about to take you, then correct. This improves the constant factor in the convergence rate and is provably optimal for first-order methods on convex functions.

Physical analogy: Imagine a ball rolling on the loss surface. The gradient is gravity (pulling downhill). Momentum is inertia (the ball keeps moving in its current direction). Friction (β < 1) prevents the ball from rolling forever. The ball naturally finds the bottom of valleys, overshoots a little, and settles.
Check: What problem does momentum solve in gradient descent?

Chapter 4: Stochastic Gradient Descent

In ML, the loss function is typically an average over N training examples:

L(θ) = (1/N) ∑i=1N ℓ(θ, xi, yi)

Computing the full gradient requires evaluating ∇ℓ at every training example. With millions of examples, each gradient step becomes expensive. Stochastic gradient descent (SGD) approximates the full gradient using a random mini-batch of B examples:

gt = (1/B) ∑i ∈ batch ∇ℓ(θt, xi, yi)
θt+1 = θt − η gt
The key tradeoff: The mini-batch gradient gt is an unbiased estimate of the true gradient: E[gt] = ∇L(θt). But it has variance — each step is noisy. Smaller batches = more noise, more steps per epoch. Larger batches = less noise, fewer steps but each is more expensive. The sweet spot depends on the problem.
SGD vs Full GD

Both optimize the same function. Full GD takes clean steps. SGD takes noisy but frequent steps. Watch how SGD's trajectory is erratic but reaches the minimum in fewer gradient evaluations.

Click Run to compare
Noise σ1.0
PropertyFull GradientMini-batch SGD
Gradient costO(N) per stepO(B) per step
Gradient varianceZero~ σ2/B
ConvergenceSmooth, monotonicNoisy, can escape local minima
GPU utilizationOften saturatesBetter hardware parallelism
Noise as a feature: The stochasticity of SGD is not just a nuisance — it can help escape shallow local minima and saddle points. The noise acts as implicit regularization, pushing the optimizer toward flatter minima that generalize better. This is one reason SGD often outperforms full-batch GD in deep learning.
Check: Why is the mini-batch gradient an unbiased estimate of the true gradient?

Chapter 5: Constrained Optimization

So far we have been free to set θ to any value. But many real problems have constraints. Probabilities must sum to 1. SVM margins must be non-negative. Resource allocations cannot be negative. How do we optimize while respecting constraints?

The general constrained optimization problem is:

minimize f(x)  subject to  gi(x) ≤ 0,  hj(x) = 0

where gi are inequality constraints and hj are equality constraints. The set of x satisfying all constraints is called the feasible set.

Key insight: At a constrained minimum, one of two things is true at each constraint: either the constraint is inactive (slack), meaning the unconstrained optimum already satisfies it and it plays no role; or the constraint is active (tight), meaning it restricts the solution. Active constraints are what make constrained optimization hard.

For equality constraints, the idea is geometric. At the optimum, the gradient of f must be perpendicular to the constraint surface h(x) = 0. If it weren't, you could slide along the constraint surface and decrease f. Perpendicularity means ∇f is parallel to ∇h:

∇f(x*) = λ ∇h(x*)

for some scalar λ. This is the essence of the Lagrange multiplier method (next chapter).

Geometric picture: Imagine minimizing f(x, y) subject to h(x, y) = 0. The constraint defines a curve. You walk along this curve until you find the point where the contours of f are tangent to the constraint curve. At that point, the gradient of f points straight away from the constraint — any movement along the constraint would neither increase nor decrease f. That is the constrained optimum.
Check: At a constrained minimum with equality constraint h(x) = 0, what geometric condition holds?

Chapter 6: Lagrange Multipliers

The Lagrange multiplier method converts a constrained problem into an unconstrained one by introducing auxiliary variables λ (the multipliers). For equality constraints:

L(x, λ) = f(x) + ∑j λj hj(x)

This is the Lagrangian. The constrained optimum satisfies ∇xL = 0 and ∇λL = 0 simultaneously. The first condition gives the balance between the objective gradient and constraint gradients. The second condition enforces the constraints themselves (hj(x) = 0).

For inequality constraints gi(x) ≤ 0, we extend to:

L(x, λ, μ) = f(x) + ∑j λj hj(x) + ∑i μi gi(x)

with the additional KKT conditions:

KKT conditions (necessary for optimality):
1. Stationarity:xL = 0
2. Primal feasibility: gi(x) ≤ 0, hj(x) = 0
3. Dual feasibility: μi ≥ 0
4. Complementary slackness: μi gi(x) = 0

Complementary slackness is the elegant part: for each inequality, either the constraint is tight (gi = 0) or its multiplier is zero (μi = 0). Inactive constraints have zero multipliers — they do not influence the solution.

SVM connection: The support vector machine is a beautiful application of KKT conditions. The optimization has inequality constraints (margin ≥ 1 for each training point). At the optimum, complementary slackness says μi = 0 for points with margin > 1 (they are irrelevant) and μi > 0 only for points exactly on the margin boundary. These are the support vectors — the few points that actually define the decision boundary. Chapter 12 explores this in detail.
Lagrange Multiplier Visualization

Minimize f(x, y) = x2 + y2 subject to the constraint x + y = 1 (red line). The constrained minimum (star) is where the level curve of f is tangent to the constraint. Drag the constraint offset to see how the solution changes.

Constraint: x + y =1.0
Check: What does "complementary slackness" mean for inequality constraints?

Chapter 7: Convexity

Convexity is the single most important structural property in optimization. If your problem is convex, every local minimum is global. There are no bad local minima, no saddle points to worry about. The landscape has a single bowl, and any downhill path leads to the bottom.

A set C is convex if for any two points x, y ∈ C, the line segment between them is also in C:

∀ x, y ∈ C, ∀ α ∈ [0, 1]: αx + (1 − α)y ∈ C

A function f is convex if it lies below any chord. Formally, for all x, y in the domain and α ∈ [0, 1]:

f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y)
The chord condition: Draw a line between any two points on the graph of f. If the function always lies below or on the line, f is convex. Equivalently: the epigraph (the set of points above the graph) is a convex set. Equivalently: the Hessian ∇2f is positive semi-definite everywhere.

Equivalent conditions for twice-differentiable f:

• f is convex ⇔ ∇2f(x) ⪰ 0 for all x (Hessian is PSD)

• f is strictly convex ⇔ ∇2f(x) ≻ 0 for all x (Hessian is PD)

• f is convex ⇔ f(y) ≥ f(x) + ∇f(x)T(y − x) for all x, y (the first-order condition — the function lies above all tangent hyperplanes)

Key insight: The first-order condition says something profound: for a convex function, the tangent line at any point is a global lower bound. This means that if ∇f(x) = 0 anywhere, then f(x) ≤ f(y) for ALL y. A zero-gradient point is guaranteed to be a global minimum. No other structure gives you this luxury.

Operations that preserve convexity:

OperationResult
Non-negative sum: αf + βg (α, β ≥ 0)Convex
Pointwise maximum: max(f, g)Convex
Composition with affine: f(Ax + b)Convex
Infimum over some variablesConvex
Check: Why is convexity so desirable in optimization?

Chapter 8: Linear & Quadratic Programs

Two important special cases of convex optimization have dedicated solution methods and appear throughout ML.

A linear program (LP) minimizes a linear objective subject to linear constraints:

minimize cTx  subject to  Ax ≤ b

Since the objective is linear (flat — no curvature), the minimum must be at a corner of the feasible polytope. The simplex algorithm walks along edges from corner to corner, improving the objective at each step. Interior point methods take a different approach, walking through the interior.

A quadratic program (QP) adds a quadratic term to the objective:

minimize ½ xTQx + cTx  subject to  Ax ≤ b

If Q is positive semi-definite, the QP is convex and has a unique global minimum (or a set of global minima). QPs appear directly in SVMs (Chapter 12) and ridge regression.

SVMs are QPs: The SVM dual optimization problem is min ½αTQα − 1Tα subject to 0 ≤ αi ≤ C and yTα = 0. Here Qij = yiyjxiTxj. This is a standard QP that can be solved with off-the-shelf solvers in polynomial time.

Linear Program:

• Linear objective + linear constraints
• Solution at a vertex of the polytope
• Simplex or interior point method
• Applications: scheduling, resource allocation

Quadratic Program:

• Quadratic objective + linear constraints
• Solution may be interior or at boundary
• Active-set or interior point method
• Applications: SVM, portfolio optimization

Hierarchy of difficulty: LP ⊂ QP ⊂ Convex ⊂ General nonlinear. Each level up is strictly harder. LPs are solvable in polynomial time. QPs with PSD Q are solvable in polynomial time. General convex problems are solvable in polynomial time (via interior point methods). Non-convex problems are NP-hard in general, though practical heuristics (like SGD on neural networks) work remarkably well.
Check: Why does the solution to a linear program always occur at a vertex of the feasible polytope?

Chapter 9: Convex Conjugates

The convex conjugate (or Legendre-Fenchel transform) provides a dual representation of convex functions that is central to duality theory. For a function f, its conjugate f* is:

f*(s) = supx { sTx − f(x) }

Geometrically, f*(s) is the maximum gap between the linear function sTx and f(x). It encodes f through its tangent lines rather than its point values.

The Fenchel-Young inequality: For any x and s: f(x) + f*(s) ≥ sTx. Equality holds when s = ∇f(x), i.e., when s is the gradient at x. This inequality is the backbone of duality — it connects the primal variable x with the dual variable s.

Key properties:

• f* is always convex, even if f is not

• If f is convex and closed, then f** = f (double conjugate recovers the original)

• The conjugate swaps between "primal" and "dual" representations

Function f(x)Conjugate f*(s)
½ x2½ s2
|x|0 if |s| ≤ 1, ∞ otherwise
½ xTQx (Q ≻ 0)½ sTQ−1s
exs log s − s (s ≥ 0)
Why this matters: The conjugate appears in the dual of every convex optimization problem. Weak duality says the dual optimal value is always a lower bound on the primal. Strong duality (guaranteed under Slater's condition for convex problems) says the two are equal. This means you can solve the dual instead of the primal — whichever is easier. For SVMs, the dual is a QP in fewer variables. For maximum entropy problems, the dual is unconstrained.

The minimax inequality encapsulates duality:

maxλ minx L(x, λ) ≤ minx maxλ L(x, λ)

The left side is the dual problem; the right is the primal. Weak duality says the inequality always holds. Strong duality says equality holds for convex problems satisfying constraint qualification. This "max-min ≤ min-max" principle appears everywhere: game theory, GANs, robust optimization.

Connection to information theory: The conjugate of the log-partition function A(η) of an exponential family (Chapter 6) is the negative entropy. This connects maximum likelihood estimation (which maximizes the likelihood, related to A) with maximum entropy (which maximizes entropy, the conjugate). Two sides of the same coin, linked by the convex conjugate.
"The great watershed in optimization is not between linearity and nonlinearity,
but between convexity and non-convexity."
— R. Tyrrell Rockafellar
Check: What does the Fenchel-Young inequality state?