Kochenderfer & Wheeler, Chapter 6

Second-Order Methods

Use curvature to take smarter steps. Newton lands on the minimum of a quadratic in one shot.

Prerequisites: Chapter 5 (First-Order Methods).
10
Chapters
1
Simulations
10
Quizzes

Chapter 0: Why Second-Order?

First-order methods use the gradient to determine which direction is downhill. But the gradient says nothing about how far the hill extends. A function could be gently sloping or plunging steeply — the gradient alone can't tell the difference.

The core idea: The Hessian H = ∇²f (the matrix of second derivatives) captures the curvature of f. A steep curve means the minimum is close; a gentle curve means the minimum is far. Second-order methods use this curvature information to take better-calibrated steps.

The tradeoff is cost. For n parameters, the Hessian is an n × n matrix. Computing it costs O(n²) and inverting it costs O(n³). This is fine for n = 100, but impossible for n = 106. Quasi-Newton methods approximate the Hessian cheaply.

The Hessian matrix captures what information about the function?

Chapter 1: Newton's Method

Approximate f locally by a quadratic (Taylor expansion to second order) and jump to the minimum of that quadratic:

x(k+1) = x(k) − H−1 g

where g = ∇f(x(k)) is the gradient and H = ∇²f(x(k)) is the Hessian, both evaluated at the current point.

Key insight: For a quadratic function f(x) = xTAx/2 + bTx + c, Newton's method converges in exactly one step from any starting point. For non-quadratic functions near the minimum (where the quadratic approximation is good), Newton's method converges quadratically — the number of correct digits doubles every iteration.

The catch: Newton's method can fail badly far from the minimum. The Hessian might not be positive definite (the quadratic approximation has a saddle point, not a minimum), causing Newton to step toward a maximum instead. A trust region or line search is essential for robustness.

PropertyGradient DescentNewton's Method
Per-step costO(n)O(n³) (Hessian inversion)
Convergence rateLinearQuadratic (near minimum)
RobustnessAlways descendsCan ascend if H not PD
Newton's method converges in one step for:

Chapter 2: The Secant Method

In 1D, Newton's method requires f'' at each step. The secant method avoids computing the second derivative by approximating it from two consecutive gradient evaluations:

f''(x) ≈ (f'(x(k)) − f'(x(k−1))) / (x(k) − x(k−1))

The secant method converges superlinearly with rate φ ≈ 1.618 (the golden ratio). This is slower than Newton's quadratic convergence but requires no second derivatives.

Tradeoff: Newton requires one function evaluation (of f'') per step with quadratic convergence. Secant requires zero second-derivative evaluations with superlinear convergence. In many practical problems, avoiding f'' is well worth the slightly slower rate.

In multiple dimensions, the secant idea generalizes to quasi-Newton methods — we approximate the Hessian from gradient differences rather than computing it directly.

The secant method approximates the second derivative using:

Chapter 3: Levenberg-Marquardt

Newton's method fails when the Hessian is not positive definite. The Levenberg-Marquardt algorithm fixes this by adding a damping term to the Hessian:

x(k+1) = x(k) − (H + λI)−1 g

The damping parameter λ interpolates between Newton's method (λ = 0) and gradient descent (λ → ∞):

Key insight: When λ is large, (H + λI)−1 ≈ (1/λ)I, so the step becomes (1/λ)g — a small gradient descent step. When λ is small, the step is approximately Newton's step. By adjusting λ, Levenberg-Marquardt smoothly transitions between cautious gradient descent (far from the minimum) and aggressive Newton steps (close to it).

This is effectively a trust region method. Large λ = small trust region. Small λ = large trust region. It is the standard algorithm for nonlinear least-squares problems (curve fitting, neural network training in the pre-deep-learning era).

When λ is very large, Levenberg-Marquardt behaves like:

Chapter 4: Quasi-Newton Ideas

Computing the full Hessian is O(n²) per step. Inverting it is O(n³). For large problems, this is prohibitive. Quasi-Newton methods build an approximation Q ≈ H−1 incrementally from gradient information alone.

The key constraint is the secant condition: the approximate Hessian must be consistent with the most recent gradient change.

δ = x(k+1) − x(k)    γ = g(k+1) − g(k)
Q(k+1) δ = γ    (secant condition)
Key insight: There are infinitely many matrices satisfying the secant condition. Different choices lead to different methods: DFP, BFGS, SR1. The best choice (BFGS) picks the matrix closest to the current approximation while satisfying the secant condition.

The update to Q is a rank-2 correction — it adds two outer products. This means each update is O(n²) instead of O(n³), and the approximation improves with every step.

The secant condition requires the approximate Hessian to:

Chapter 5: BFGS

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is the gold standard of quasi-Newton methods. It maintains an approximation Q of the inverse Hessian H−1 and updates it after each step using a rank-2 formula:

Q(k+1) = (I − ρ δ γT) Q(k) (I − ρ γ δT) + ρ δ δT

where ρ = 1 / (δT γ). The step direction is d = −Q g, and a line search determines the step size.

Why BFGS is special: It automatically preserves positive definiteness of Q (so d is always a descent direction), it satisfies the secant condition (so the approximation is consistent with recent curvature), and it converges superlinearly even though it never computes the true Hessian. For medium-sized problems (n up to a few thousand), BFGS is often the best choice.

The cost is O(n²) per step (for the matrix-vector product) and O(n²) storage (for Q). This is far better than Newton's O(n³) per step, but still too much for n = 106.

BFGS maintains an approximation of:

Chapter 6: L-BFGS

Limited-memory BFGS solves the storage problem. Instead of storing the full n × n matrix Q, it stores only the last m step differences (δ1, ..., δm) and gradient differences (γ1, ..., γm), where m is typically 5–20.

The inverse-Hessian-vector product Q g is computed implicitly using a two-loop recursion that only touches the stored vectors. Cost: O(mn) per step. Storage: O(mn).

Key insight: L-BFGS trades curvature accuracy for memory. With m = 10 stored pairs, it captures the curvature of the last 10 steps. This is enough to dramatically outperform gradient descent, while using negligible memory compared to full BFGS. It is the workhorse optimizer for large-scale machine learning problems where second-order information is helpful but full Hessians are impossible.
MethodStorageCost / stepConvergence
Gradient descentO(n)O(n)Linear
BFGSO(n²)O(n²)Superlinear
L-BFGS (m pairs)O(mn)O(mn)Superlinear
NewtonO(n²)O(n³)Quadratic
L-BFGS reduces storage from O(n²) to O(mn) by:

Chapter 7: Convergence Rates

Let's formalize what we mean by convergence speed. Near a minimum x*, the error e(k) = ||x(k) − x*|| shrinks at a rate that depends on the method:

RateMeaningMethod
Lineare(k+1) ≤ c · e(k), c < 1Gradient descent
Superlineare(k+1) / e(k) → 0BFGS, L-BFGS, secant
Quadratice(k+1) ≤ c · (e(k)Newton
What quadratic convergence means: If Newton's method has 3 correct digits at step k, it has 6 at step k+1, 12 at step k+2, and 24 at step k+3. The number of correct digits doubles every step. This is astonishingly fast — but only kicks in once you're close enough for the quadratic Taylor approximation to be accurate.

In practice, the convergence rate only matters near the minimum. Far away, all methods are approximately equally slow (or fast). The initial phase is about getting into the "basin of convergence" where the local rate kicks in.

Quadratic convergence means the number of correct digits:

Chapter 8: Newton Simulator

Watch Newton's method converge on a 2D quadratic. Compare its path to gradient descent — Newton heads straight to the minimum regardless of the curvature.

Newton vs. Gradient Descent

Click Step to advance both methods. Newton (purple) vs. GD (red) on an ill-conditioned quadratic.

Step 0
● Newton ● Gradient Descent
On a quadratic function, Newton's method converges in:

Chapter 9: Summary

MethodUses Hessian?StorageConvergence
NewtonExactO(n²)Quadratic
Secant (1D)ApproximatedO(1)Superlinear (φ ≈ 1.618)
Levenberg-MarquardtExact + dampingO(n²)Quadratic near min
BFGSRank-2 approxO(n²)Superlinear
L-BFGSm-pair approxO(mn)Superlinear
Looking ahead: What if you have no derivatives at all — no gradient, no Hessian? Chapter 7 covers direct methods that optimize using only function evaluations. Nelder-Mead, coordinate descent, and the DIRECT algorithm work in derivative-free settings where gradient-based methods cannot.
The best quasi-Newton method for large-scale problems with limited memory is: