Finding the best parameters — gradient descent, momentum, constraints, and convexity.
Training a machine learning model boils down to one thing: finding the parameters θ that minimize a loss function L(θ). The loss measures how badly your model fits the data. Optimization is the engine that drives that loss down.
Every ML algorithm you have seen — linear regression, logistic regression, SVMs, neural networks — reduces to solving an optimization problem. The specifics vary (least squares, cross-entropy, hinge loss), but the structure is always the same:
This chapter builds the optimization toolkit: gradient descent and its variants (momentum, SGD) for unconstrained problems, Lagrange multipliers for constrained problems, and convexity as the property that guarantees a unique global minimum.
Let's start with the most important optimization algorithm in all of ML. Below is a 2D loss surface. Click anywhere to place a starting point. Watch gradient descent navigate the landscape.
Click to set the starting point. Adjust the learning rate and toggle momentum. The contour lines show the loss surface f(x1, x2) = x12 + κ x22. The condition number κ controls how elongated the contours are.
The idea is beautifully simple. The gradient ∇L(θ) points in the direction of steepest increase. So we step in the opposite direction:
where η > 0 is the learning rate (or step size). Each step moves θ a little bit downhill. Repeat until the gradient is approximately zero — meaning we are at a (local) minimum, saddle point, or very flat region.
Convergence conditions: For gradient descent to converge to a local minimum, we need L to be smooth (differentiable with Lipschitz-continuous gradients) and the learning rate to be small enough. Specifically, if the gradient is L-Lipschitz, then η < 2/L guarantees the loss decreases at every step.
pseudocode Input: initial θ0, learning rate η, tolerance ε for t = 0, 1, 2, ...: g ← ∇L(θt) # compute gradient θt+1 ← θt − η · g # step downhill if ||g|| < ε: break # convergence check return θt+1
In the context of ML, θ might be millions of neural network weights, and ∇L is computed via backpropagation (Chapter 5). The loss is typically the average over all training examples. We will see in Chapter 4 that computing the full gradient is expensive, motivating stochastic approximations.
The learning rate η is the single most important hyperparameter in gradient descent. Too large, and you overshoot the minimum — the loss increases. Too small, and convergence is painfully slow. Getting it right makes the difference between a model that trains in hours and one that never converges.
Consider the simplest case: minimizing f(x) = ½ x2. The gradient is f'(x) = x, so the update is xt+1 = xt − η xt = (1 − η) xt.
| η | Behavior |
|---|---|
| 0 < η < 1 | Converges monotonically: x shrinks by factor (1 − η) each step |
| η = 1 | Converges in one step: x1 = 0 |
| 1 < η < 2 | Oscillates but converges: overshoots then corrects |
| η = 2 | Oscillates forever: xt = (−1)t x0 |
| η > 2 | Diverges: |x| grows exponentially |
Minimizing f(x) = ½x2 from x0 = 3. Watch how different learning rates affect convergence. The optimal rate is η = 1 (one-step convergence).
Adaptive learning rates: Modern optimizers like Adam, RMSProp, and AdaGrad maintain per-parameter learning rates that adapt based on the history of gradients. Parameters with large gradients get smaller steps; parameters with small gradients get larger steps. This effectively normalizes for the curvature, reducing the condition number problem.
Vanilla gradient descent has a fundamental problem: it only uses the local gradient at each step. If the loss surface is a narrow valley (high condition number), GD bounces back and forth across the valley walls while making slow progress along the valley floor.
Momentum fixes this by adding a "velocity" term that accumulates past gradients. Think of a ball rolling down a hill — it picks up speed in consistent directions and smooths out oscillations.
Here β ∈ [0, 1) is the momentum coefficient. When β = 0, we recover vanilla GD. When β = 0.9 (a common default), the velocity accumulates a moving average of the last ~10 gradients.
Vanilla GD:
• Uses only current gradient
• Zigzags in narrow valleys
• Convergence rate ~ O(1/κ)
GD + Momentum:
• Accumulates past gradients
• Smooth trajectory in narrow valleys
• Convergence rate ~ O(1/√κ)
The improvement from O(1/κ) to O(1/√κ) is significant. For κ = 100, momentum is ~10x faster. For κ = 10000, it is ~100x faster. This is why momentum is essentially universal in modern deep learning.
Nesterov momentum is a subtle refinement. Instead of computing the gradient at θt, you compute it at θt − ηβvt — you "look ahead" to where momentum is about to take you, then correct. This improves the constant factor in the convergence rate and is provably optimal for first-order methods on convex functions.
In ML, the loss function is typically an average over N training examples:
Computing the full gradient requires evaluating ∇ℓ at every training example. With millions of examples, each gradient step becomes expensive. Stochastic gradient descent (SGD) approximates the full gradient using a random mini-batch of B examples:
Both optimize the same function. Full GD takes clean steps. SGD takes noisy but frequent steps. Watch how SGD's trajectory is erratic but reaches the minimum in fewer gradient evaluations.
| Property | Full Gradient | Mini-batch SGD |
|---|---|---|
| Gradient cost | O(N) per step | O(B) per step |
| Gradient variance | Zero | ~ σ2/B |
| Convergence | Smooth, monotonic | Noisy, can escape local minima |
| GPU utilization | Often saturates | Better hardware parallelism |
So far we have been free to set θ to any value. But many real problems have constraints. Probabilities must sum to 1. SVM margins must be non-negative. Resource allocations cannot be negative. How do we optimize while respecting constraints?
The general constrained optimization problem is:
where gi are inequality constraints and hj are equality constraints. The set of x satisfying all constraints is called the feasible set.
For equality constraints, the idea is geometric. At the optimum, the gradient of f must be perpendicular to the constraint surface h(x) = 0. If it weren't, you could slide along the constraint surface and decrease f. Perpendicularity means ∇f is parallel to ∇h:
for some scalar λ. This is the essence of the Lagrange multiplier method (next chapter).
The Lagrange multiplier method converts a constrained problem into an unconstrained one by introducing auxiliary variables λ (the multipliers). For equality constraints:
This is the Lagrangian. The constrained optimum satisfies ∇xL = 0 and ∇λL = 0 simultaneously. The first condition gives the balance between the objective gradient and constraint gradients. The second condition enforces the constraints themselves (hj(x) = 0).
For inequality constraints gi(x) ≤ 0, we extend to:
with the additional KKT conditions:
Complementary slackness is the elegant part: for each inequality, either the constraint is tight (gi = 0) or its multiplier is zero (μi = 0). Inactive constraints have zero multipliers — they do not influence the solution.
Minimize f(x, y) = x2 + y2 subject to the constraint x + y = 1 (red line). The constrained minimum (star) is where the level curve of f is tangent to the constraint. Drag the constraint offset to see how the solution changes.
Convexity is the single most important structural property in optimization. If your problem is convex, every local minimum is global. There are no bad local minima, no saddle points to worry about. The landscape has a single bowl, and any downhill path leads to the bottom.
A set C is convex if for any two points x, y ∈ C, the line segment between them is also in C:
A function f is convex if it lies below any chord. Formally, for all x, y in the domain and α ∈ [0, 1]:
Equivalent conditions for twice-differentiable f:
• f is convex ⇔ ∇2f(x) ⪰ 0 for all x (Hessian is PSD)
• f is strictly convex ⇔ ∇2f(x) ≻ 0 for all x (Hessian is PD)
• f is convex ⇔ f(y) ≥ f(x) + ∇f(x)T(y − x) for all x, y (the first-order condition — the function lies above all tangent hyperplanes)
Operations that preserve convexity:
| Operation | Result |
|---|---|
| Non-negative sum: αf + βg (α, β ≥ 0) | Convex |
| Pointwise maximum: max(f, g) | Convex |
| Composition with affine: f(Ax + b) | Convex |
| Infimum over some variables | Convex |
Two important special cases of convex optimization have dedicated solution methods and appear throughout ML.
A linear program (LP) minimizes a linear objective subject to linear constraints:
Since the objective is linear (flat — no curvature), the minimum must be at a corner of the feasible polytope. The simplex algorithm walks along edges from corner to corner, improving the objective at each step. Interior point methods take a different approach, walking through the interior.
A quadratic program (QP) adds a quadratic term to the objective:
If Q is positive semi-definite, the QP is convex and has a unique global minimum (or a set of global minima). QPs appear directly in SVMs (Chapter 12) and ridge regression.
Linear Program:
• Linear objective + linear constraints
• Solution at a vertex of the polytope
• Simplex or interior point method
• Applications: scheduling, resource allocation
Quadratic Program:
• Quadratic objective + linear constraints
• Solution may be interior or at boundary
• Active-set or interior point method
• Applications: SVM, portfolio optimization
The convex conjugate (or Legendre-Fenchel transform) provides a dual representation of convex functions that is central to duality theory. For a function f, its conjugate f* is:
Geometrically, f*(s) is the maximum gap between the linear function sTx and f(x). It encodes f through its tangent lines rather than its point values.
Key properties:
• f* is always convex, even if f is not
• If f is convex and closed, then f** = f (double conjugate recovers the original)
• The conjugate swaps between "primal" and "dual" representations
| Function f(x) | Conjugate f*(s) |
|---|---|
| ½ x2 | ½ s2 |
| |x| | 0 if |s| ≤ 1, ∞ otherwise |
| ½ xTQx (Q ≻ 0) | ½ sTQ−1s |
| ex | s log s − s (s ≥ 0) |
The minimax inequality encapsulates duality:
The left side is the dual problem; the right is the primal. Weak duality says the inequality always holds. Strong duality says equality holds for convex problems satisfying constraint qualification. This "max-min ≤ min-max" principle appears everywhere: game theory, GANs, robust optimization.