Use curvature to take smarter steps. Newton lands on the minimum of a quadratic in one shot.
First-order methods use the gradient to determine which direction is downhill. But the gradient says nothing about how far the hill extends. A function could be gently sloping or plunging steeply — the gradient alone can't tell the difference.
The tradeoff is cost. For n parameters, the Hessian is an n × n matrix. Computing it costs O(n²) and inverting it costs O(n³). This is fine for n = 100, but impossible for n = 106. Quasi-Newton methods approximate the Hessian cheaply.
Approximate f locally by a quadratic (Taylor expansion to second order) and jump to the minimum of that quadratic:
where g = ∇f(x(k)) is the gradient and H = ∇²f(x(k)) is the Hessian, both evaluated at the current point.
The catch: Newton's method can fail badly far from the minimum. The Hessian might not be positive definite (the quadratic approximation has a saddle point, not a minimum), causing Newton to step toward a maximum instead. A trust region or line search is essential for robustness.
| Property | Gradient Descent | Newton's Method |
|---|---|---|
| Per-step cost | O(n) | O(n³) (Hessian inversion) |
| Convergence rate | Linear | Quadratic (near minimum) |
| Robustness | Always descends | Can ascend if H not PD |
In 1D, Newton's method requires f'' at each step. The secant method avoids computing the second derivative by approximating it from two consecutive gradient evaluations:
The secant method converges superlinearly with rate φ ≈ 1.618 (the golden ratio). This is slower than Newton's quadratic convergence but requires no second derivatives.
In multiple dimensions, the secant idea generalizes to quasi-Newton methods — we approximate the Hessian from gradient differences rather than computing it directly.
Newton's method fails when the Hessian is not positive definite. The Levenberg-Marquardt algorithm fixes this by adding a damping term to the Hessian:
The damping parameter λ interpolates between Newton's method (λ = 0) and gradient descent (λ → ∞):
This is effectively a trust region method. Large λ = small trust region. Small λ = large trust region. It is the standard algorithm for nonlinear least-squares problems (curve fitting, neural network training in the pre-deep-learning era).
Computing the full Hessian is O(n²) per step. Inverting it is O(n³). For large problems, this is prohibitive. Quasi-Newton methods build an approximation Q ≈ H−1 incrementally from gradient information alone.
The key constraint is the secant condition: the approximate Hessian must be consistent with the most recent gradient change.
The update to Q is a rank-2 correction — it adds two outer products. This means each update is O(n²) instead of O(n³), and the approximation improves with every step.
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is the gold standard of quasi-Newton methods. It maintains an approximation Q of the inverse Hessian H−1 and updates it after each step using a rank-2 formula:
where ρ = 1 / (δT γ). The step direction is d = −Q g, and a line search determines the step size.
The cost is O(n²) per step (for the matrix-vector product) and O(n²) storage (for Q). This is far better than Newton's O(n³) per step, but still too much for n = 106.
Limited-memory BFGS solves the storage problem. Instead of storing the full n × n matrix Q, it stores only the last m step differences (δ1, ..., δm) and gradient differences (γ1, ..., γm), where m is typically 5–20.
The inverse-Hessian-vector product Q g is computed implicitly using a two-loop recursion that only touches the stored vectors. Cost: O(mn) per step. Storage: O(mn).
| Method | Storage | Cost / step | Convergence |
|---|---|---|---|
| Gradient descent | O(n) | O(n) | Linear |
| BFGS | O(n²) | O(n²) | Superlinear |
| L-BFGS (m pairs) | O(mn) | O(mn) | Superlinear |
| Newton | O(n²) | O(n³) | Quadratic |
Let's formalize what we mean by convergence speed. Near a minimum x*, the error e(k) = ||x(k) − x*|| shrinks at a rate that depends on the method:
| Rate | Meaning | Method |
|---|---|---|
| Linear | e(k+1) ≤ c · e(k), c < 1 | Gradient descent |
| Superlinear | e(k+1) / e(k) → 0 | BFGS, L-BFGS, secant |
| Quadratic | e(k+1) ≤ c · (e(k))² | Newton |
In practice, the convergence rate only matters near the minimum. Far away, all methods are approximately equally slow (or fast). The initial phase is about getting into the "basin of convergence" where the local rate kicks in.
Watch Newton's method converge on a 2D quadratic. Compare its path to gradient descent — Newton heads straight to the minimum regardless of the curvature.
Click Step to advance both methods. Newton (purple) vs. GD (red) on an ill-conditioned quadratic.
| Method | Uses Hessian? | Storage | Convergence |
|---|---|---|---|
| Newton | Exact | O(n²) | Quadratic |
| Secant (1D) | Approximated | O(1) | Superlinear (φ ≈ 1.618) |
| Levenberg-Marquardt | Exact + damping | O(n²) | Quadratic near min |
| BFGS | Rank-2 approx | O(n²) | Superlinear |
| L-BFGS | m-pair approx | O(mn) | Superlinear |