Goodfellow et al., Chapter 4

Numerical Computation

Computers use finite precision. Gradients can explode or vanish. Here is how to make deep learning work anyway.

Prerequisites: Chapters 2-3 (linear algebra, probability).
9 chapters · 4+ simulations · 9 quizzes

Chapter 0: Why Numerics?

You write a formula on a whiteboard: $e^{1000}$. Mathematically it is a perfectly well-defined number. But ask your computer to evaluate it in 32-bit floating point and it returns inf: overflow. Ask it for $e^{-1000}$ and it returns 0.0: underflow. The math is exact; the machine is not.

Deep learning lives on machines. Every weight update, every forward pass, every loss computation happens in finite-precision arithmetic. Numbers that are too large overflow to infinity. Numbers that are too small collapse to zero. Divide a nonzero number by an underflowed zero and you get inf; divide zero by it and you get NaN, the silent killer of training runs.

The core tension: Mathematics gives us beautiful exact solutions. Computers give us 32 or 64 bits per number. This chapter teaches the tricks that bridge the gap — numerical stability, gradient-based optimization, and the calculus tools (Jacobians, Hessians) that make training possible.

Overflow & Underflow: Finite precision traps and how to avoid them
Gradient Descent: Follow the slope downhill to minimize loss
Jacobians & Hessians: First- and second-order derivative matrices
Constrained Optimization: Optimizing with constraints via Lagrange multipliers

Why can't we simply implement mathematical formulas directly in code and expect correct results?

Chapter 1: Overflow & Underflow

Overflow occurs when a number is too large for the floating-point format. In 32-bit float, anything above roughly $3.4 \times 10^{38}$ becomes inf. Underflow is the opposite: numbers too close to zero collapse to exactly 0.0, losing all information about their relative sizes.
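
A quick way to see these failure modes is to trigger them directly. The sketch below uses only Python's standard library. Note that Python's built-in float is 64-bit, so its overflow threshold is about 1.8 × 10^308 rather than the 32-bit figure quoted above, but the behavior is the same. The variable names are illustrative.

```python
import math
import sys

# Python's float is 64-bit (double precision); 32-bit floats overflow much earlier.
largest = sys.float_info.max           # largest finite double, about 1.8e308

overflowed = largest * 10.0            # too large to represent: becomes inf
underflowed = math.exp(-1000.0)        # too close to zero: collapses to 0.0
undefined = overflowed - overflowed    # inf - inf has no meaningful value: nan

# Underflow destroys relative size: both of these are exactly 0.0,
# even though mathematically the first is e^1000 times larger than the second.
a = math.exp(-1000.0)
b = math.exp(-2000.0)

print(overflowed, underflowed, undefined, a == b)
```

Once two different quantities have both become 0.0, no later computation can tell them apart, which is why the fix has to happen before the exponential is evaluated.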

A classic example is the softmax function. Given a vector $z$, $\mathrm{softmax}(z)_i = \exp(z_i) / \sum_j \exp(z_j)$. If $z_i = 1000$, $\exp(1000)$ overflows. If all $z_i = -1000$, every exp underflows to zero and we divide 0/0.

$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$

The fix: subtract max(z) from every element before exponentiating. Since softmax(z) = softmax(z − c) for any constant c, choose c = max(z). Now the largest exponent is exp(0) = 1 — no overflow. And at least one term in the denominator equals 1 — no division by zero.
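
The subtract-max trick can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation, and `softmax_stable` is a name chosen here for the example.

```python
import math

def softmax_stable(z):
    """Softmax with the subtract-max trick applied before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]   # every exponent is <= 0, so exp(...) <= 1
    total = sum(exps)                      # at least 1.0, because exp(m - m) = 1
    return [e / total for e in exps]

# Logits that would overflow or underflow without the shift.
print(softmax_stable([1000.0, 1000.0]))
print(softmax_stable([-1000.0, -1000.0, -1000.0]))
```

The first call returns [0.5, 0.5]. For comparison, calling `math.exp(1000.0)` directly in Python raises OverflowError instead of returning inf, so the naive formula fails immediately on the same input.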

Key insight: The log-sum-exp trick stabilizes log(softmax) similarly: $\log \sum_i \exp(z_i) = \max(z) + \log \sum_i \exp(z_i - \max(z))$. PyTorch's F.log_softmax and F.cross_entropy use this internally. If you compute softmax and then take the log yourself, any probability that underflowed to 0 becomes log(0) = -inf, and the loss turns into inf or NaN.
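
The identity can be written out as a small pure-Python sketch. The functions `logsumexp` and `log_softmax` below are illustrative stand-ins written for this example; they are not PyTorch's implementation.

```python
import math

def logsumexp(z):
    """log(sum(exp(z_i))) computed with the max factored out."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def log_softmax(z):
    """log(softmax(z)_i) = z_i - logsumexp(z), with no intermediate probabilities."""
    lse = logsumexp(z)
    return [v - lse for v in z]

# The second probability underflows to 0.0, so log(softmax(z)) computed naively
# would need log(0.0), which math.log rejects with a ValueError.
z = [0.0, -1000.0]
print(log_softmax(z))
```

Working in log space throughout keeps the very negative entry as a finite number (here -1000.0) instead of losing it to underflow.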

Simulation: Softmax Stability. Increase the logit scale to see overflow happen. Toggle the stability trick to fix it.

How does the "subtract max" trick stabilize softmax?

Chapter 2: Derivatives & Gradients

The derivative f'(x) tells you how fast f changes when you nudge x. If f'(x) > 0, increasing x increases f. If f'(x) < 0, increasing x decreases f. If f'(x) = 0, you are at a critical point — a local minimum, local maximum, or saddle point.

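When f takes a vector input, the derivative becomes a vector called the gradient, written $\nabla_x f$. Each component $\partial f / \partial x_i$ tells you how f changes when you nudge the i-th element of x. The gradient points in the direction of steepest ascent.

$$\nabla_x f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)^\top$$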

To minimize f, we move in the direction opposite the gradient: $x \leftarrow x - \varepsilon \nabla_x f(x)$, where $\varepsilon$ is the learning rate. This is gradient descent, the engine behind nearly all deep learning training.

Partial derivatives: When f depends on many variables, the partial derivative $\partial f / \partial x_i$ holds all other variables fixed and differentiates only with respect to $x_i$. You can think of it as slicing through a mountain and looking at the slope along one axis.
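
The "nudge one variable, hold the rest fixed" idea translates directly into a finite-difference check. The sketch below compares a hand-derived gradient against a numerical one; the example function and the step size h are assumptions made for illustration.

```python
def f(x):
    # Example function of two variables, chosen for illustration.
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def analytic_gradient(x):
    # Partial derivatives worked out by hand: df/dx0 = 2*x0 + 3*x1, df/dx1 = 3*x0.
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]

def numerical_gradient(func, x, h=1e-5):
    # Central differences: nudge one coordinate at a time, hold the others fixed.
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((func(up) - func(down)) / (2.0 * h))
    return grad

point = [1.0, 2.0]
print(analytic_gradient(point))
print(numerical_gradient(f, point))
```

The two results should agree to several decimal places. This kind of comparison is a common way to sanity-check a hand-written derivative.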
What does the gradient of a function point toward?

Chapter 3: Gradient Descent

Imagine you are blindfolded on a hilly landscape and want to reach the lowest valley. You feel the slope under your feet (the gradient) and take a step downhill. Repeat. That is gradient descent.

Formally: start at some initial point $x_0$. At each step, update $x \leftarrow x - \varepsilon \nabla f(x)$. The learning rate $\varepsilon$ controls step size. Too large and you overshoot. Too small and you crawl. The right $\varepsilon$ is problem-dependent.

$$x_{t+1} = x_t - \varepsilon \nabla f(x_t)$$
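
The update rule fits in a short loop. The sketch below runs it on $f(x) = x^2$ starting from x = 3 with two learning rates; the function, starting point, and rates are illustrative choices.

```python
def gradient_descent(grad, x0, lr, steps):
    """Repeat x <- x - lr * grad(x) and return every iterate."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - lr * grad(x)
        path.append(x)
    return path

def grad_f(x):
    # Derivative of f(x) = x**2.
    return 2.0 * x

# With lr = 0.1 each step multiplies x by 0.8, so the iterates shrink toward 0.
converging = gradient_descent(grad_f, x0=3.0, lr=0.1, steps=50)

# With lr = 1.1 each step multiplies x by -1.2, so the iterates flip sign and grow.
diverging = gradient_descent(grad_f, x0=3.0, lr=1.1, steps=50)

print(converging[-1], diverging[-1])
```

For this particular function the boundary is easy to work out: any learning rate above 1.0 makes the multiplier larger than 1 in magnitude, so the iterates move away from the minimum.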

Simulation: Gradient Descent on a 1D Function. Click "Step" to take one gradient descent step. Adjust the learning rate to see the effect. Initial state: learning rate $\varepsilon$ = 0.100, x = 3.00, f(x) = 9.00, f'(x) = 6.00.


Convergence: Gradient descent stops moving when the gradient is zero, that is, at a critical point. For convex functions (bowl-shaped), every critical point is a global minimum. Neural network losses are non-convex, so gradient descent may end up in different local minima, or stall near saddle points, depending on initialization.
What happens when the learning rate is too large?

Chapter 4: Jacobians & Hessians

When a function maps a vector to a vector, $f: \mathbb{R}^m \to \mathbb{R}^n$, its derivative is not a single number or a vector. It is an $n \times m$ matrix called the Jacobian, $J$, where $J_{ij} = \partial f_i / \partial x_j$.

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

The Jacobian tells you how every output changes with respect to every input. In backpropagation, you chain Jacobians together — that is the chain rule applied to vectors. If y = g(x) and z = f(y), then ∂z/∂x = (∂z/∂y)(∂y/∂x) — a product of Jacobians.
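
The chain rule for Jacobians can be checked numerically. The sketch below estimates Jacobians by central differences and compares the product of the two Jacobians with the Jacobian of the composed function. The functions g and f, the helper names, and the step size are assumptions made for this example.

```python
def jacobian(func, x, h=1e-6):
    """Estimate J[i][j] = d func_i / d x_j with central differences."""
    n_out = len(func(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += h
        down[j] -= h
        f_up, f_down = func(up), func(down)
        for i in range(n_out):
            J[i][j] = (f_up[i] - f_down[i]) / (2.0 * h)
    return J

def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def g(x):
    # y = g(x), mapping R^2 to R^2.
    return [x[0] * x[1], x[0] + x[1]]

def f(y):
    # z = f(y), mapping R^2 to R^3.
    return [y[0] ** 2, y[0] * y[1], y[1]]

x = [1.0, 2.0]
chained = matmul(jacobian(f, g(x)), jacobian(g, x))   # (dz/dy)(dy/dx)
direct = jacobian(lambda v: f(g(v)), x)               # dz/dx in one go

print(chained)
print(direct)
```

Note the shapes: a 3 × 2 Jacobian times a 2 × 2 Jacobian gives the 3 × 2 Jacobian of the composition, and the outer function's Jacobian is evaluated at y = g(x), not at x.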

The Hessian H is the matrix of second derivatives of a scalar function: $H_{ij} = \partial^2 f / \partial x_i \partial x_j$. It captures curvature: how the gradient itself changes as you move. Wherever the second partial derivatives are continuous, the order of differentiation does not matter, so the Hessian is symmetric: $H_{ij} = H_{ji}$.

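Why curvature matters: Along a direction d, the function curves like $f(x + \varepsilon d) \approx f(x) + \varepsilon g^\top d + \tfrac{1}{2} \varepsilon^2 d^\top H d$, where $g = \nabla f(x)$ is the gradient and H is the Hessian at x. The term $d^\top H d$ tells you whether the surface curves up along d (positive, as at a minimum) or down (negative, as at a maximum). When $d^\top H d = 0$, the surface has no second-order curvature in that direction.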
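
To make the second-order expansion concrete, the sketch below evaluates it for a quadratic function, where the approximation is exact up to rounding. The function, its hand-derived gradient and Hessian, and the helper names are assumptions chosen for illustration.

```python
def f(x):
    # Example quadratic: f(x) = x0^2 + 3*x0*x1 + 2*x1^2.
    return x[0] ** 2 + 3.0 * x[0] * x[1] + 2.0 * x[1] ** 2

def gradient(x):
    # Hand-derived gradient of f.
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + 4.0 * x[1]]

HESSIAN = [[2.0, 3.0], [3.0, 4.0]]  # constant because f is quadratic

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def curvature_along(d):
    # d^T H d: second-order curvature of f in direction d.
    return dot(d, [dot(row, d) for row in HESSIAN])

def taylor2(x, d, eps):
    # f(x) + eps * g^T d + 0.5 * eps^2 * d^T H d
    return f(x) + eps * dot(gradient(x), d) + 0.5 * eps ** 2 * curvature_along(d)

x, d, eps = [1.0, -1.0], [1.0, 0.0], 0.1
exact = f([x[0] + eps * d[0], x[1] + eps * d[1]])
print(exact, taylor2(x, d, eps))

# The same Hessian curves up, is flat, or curves down depending on the direction.
print(curvature_along([1.0, 0.0]), curvature_along([1.0, -1.0]), curvature_along([1.0, -0.75]))
```

The three directions give curvatures of 2.0, 0.0, and -0.25, so this particular Hessian bends upward along one direction and downward along another.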
What does the Hessian matrix capture that the gradient does not?

Chapter 5: Second-Order Information

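The Hessian's eigenvalues reveal the shape of the loss surface. At a critical point (where the gradient is zero), if all eigenvalues are positive, the curvature is upward in every direction and you are at a local minimum. If all are negative, you are at a local maximum. If some are positive and some negative, you are at a saddle point, which looks like a mountain pass. If all nonzero eigenvalues share a sign but at least one eigenvalue is zero, the test is inconclusive.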
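
For a 2 × 2 symmetric Hessian the eigenvalues have a closed form, which makes the classification easy to sketch without a linear algebra library. The function names below are illustrative, and the test only applies at a point where the gradient is already zero.

```python
import math

def eigenvalues_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius

def classify_critical_point(a, b, c):
    """Second-derivative test for a critical point with Hessian [[a, b], [b, c]]."""
    low, high = eigenvalues_sym_2x2(a, b, c)
    if low > 0:
        return "local minimum"
    if high < 0:
        return "local maximum"
    if low < 0 < high:
        return "saddle point"
    return "inconclusive"  # some eigenvalue is zero and the rest share a sign

print(classify_critical_point(2.0, 0.0, 1.0))    # eigenvalues 1 and 2
print(classify_critical_point(1.0, 0.0, -1.0))   # eigenvalues -1 and 1
print(classify_critical_point(1.0, 0.0, 0.0))    # eigenvalues 0 and 1
```

In higher dimensions the same logic applies, but the eigenvalues would come from a numerical routine rather than a formula.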

Simulation: Curvature, Minimum vs Saddle Point. Toggle between a minimum (all positive eigenvalues) and a saddle point (mixed signs). The contours show the loss surface. Initial state: type minimum, eigenvalues $\lambda_1 = 2.0$, $\lambda_2 = 1.0$.


The condition number of the Hessian is the ratio of its largest to its smallest eigenvalue, measured in magnitude. A large condition number means the surface is much steeper in some directions than others: an ill-conditioned problem. Gradient descent struggles here because a single learning rate must be small enough for the steepest direction, which makes progress along the flattest direction very slow.
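
The effect of conditioning on gradient descent shows up even on a diagonal quadratic. The sketch below counts how many steps are needed to shrink every coordinate below a tolerance, with the learning rate set to one over the largest eigenvalue. That learning-rate choice, the tolerance, and the function name are assumptions for illustration.

```python
def steps_to_converge(eigenvalues, lr, tol=1e-6, max_steps=100_000):
    """Gradient descent on f(x) = 0.5 * sum(lam_i * x_i**2), starting from all ones.

    The gradient component along axis i is lam_i * x_i. Returns the number of
    steps taken before every coordinate is smaller than tol in magnitude.
    """
    x = [1.0] * len(eigenvalues)
    for step in range(max_steps):
        if max(abs(v) for v in x) < tol:
            return step
        x = [v - lr * lam * v for v, lam in zip(x, eigenvalues)]
    return max_steps

well_conditioned = steps_to_converge([1.0, 1.0], lr=1.0)      # condition number 1
ill_conditioned = steps_to_converge([100.0, 1.0], lr=0.01)    # condition number 100

print(well_conditioned, ill_conditioned)
```

In the ill-conditioned case the steep direction is solved almost immediately, while the flat direction shrinks by only 1% per step, so the step count is dominated by the small eigenvalue.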

Key insight: Newton's method uses the Hessian to rescale the gradient: $x \leftarrow x - H^{-1} \nabla f$. This accounts for curvature and converges much faster on ill-conditioned problems. But for n parameters, storing the Hessian takes $O(n^2)$ memory and inverting it costs $O(n^3)$ time, which is far too expensive for neural networks with millions of parameters.
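
On a quadratic, a single Newton step lands exactly on the minimum, regardless of conditioning. The sketch below shows this for a 2 × 2 case using a closed-form linear solve. The matrix, starting point, and helper names are assumptions for illustration.

```python
def solve_2x2(H, g):
    """Solve H * step = g for a 2x2 matrix H using the closed-form inverse."""
    (a, b), (c, d) = H
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]

# Ill-conditioned quadratic f(x) = 0.5 * x^T H x with its minimum at the origin.
H = [[100.0, 0.0], [0.0, 1.0]]

def gradient(x):
    # Gradient of the quadratic is H x.
    return [H[0][0] * x[0] + H[0][1] * x[1], H[1][0] * x[0] + H[1][1] * x[1]]

x = [1.0, 1.0]
step = solve_2x2(H, gradient(x))                 # H^{-1} times the gradient
x_newton = [x[0] - step[0], x[1] - step[1]]      # x <- x - H^{-1} grad f

print(x_newton)
```

The sketch solves a linear system rather than forming the inverse explicitly, which is the usual practice, but the cubic cost in the number of parameters remains.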
What does a saddle point look like in terms of Hessian eigenvalues?

Chapter 6: Constrained Optimization

Sometimes we want to minimize f(x) subject to constraints. For example, minimize loss while keeping weight norms below a threshold. This is constrained optimization.

The Lagrangian converts a constrained problem into an unconstrained one. For equality constraints $g(x) = 0$, define $L(x, \lambda) = f(x) + \lambda g(x)$. The optimal solution satisfies $\nabla_x L = 0$ and $\nabla_\lambda L = 0$ simultaneously; the second condition simply recovers the constraint $g(x) = 0$. The variable $\lambda$ is called a Lagrange multiplier.

$$L(x, \lambda) = f(x) + \lambda \, g(x)$$
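
As a small worked example (chosen here for illustration, not taken from the chapter), consider minimizing $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y - 1 = 0$. Setting every partial derivative of the Lagrangian to zero gives:

```latex
\begin{aligned}
L(x, y, \lambda) &= x^2 + y^2 + \lambda (x + y - 1) \\
\frac{\partial L}{\partial x} &= 2x + \lambda = 0 \\
\frac{\partial L}{\partial y} &= 2y + \lambda = 0 \\
\frac{\partial L}{\partial \lambda} &= x + y - 1 = 0 \\
\Rightarrow \quad x = y &= \tfrac{1}{2}, \qquad \lambda = -1
\end{aligned}
```

The first two equations force x = y, and the third then pins both to 1/2. This is the point on the line x + y = 1 closest to the origin, which matches the geometric picture.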

For inequality constraints $g(x) \le 0$, we use the KKT conditions (Karush-Kuhn-Tucker). The key addition: $\lambda \ge 0$ and $\lambda \cdot g(x) = 0$. This last condition means the constraint is either inactive ($g(x) < 0$, which forces $\lambda = 0$) or active ($g(x) = 0$, which allows $\lambda > 0$).

Deep learning connection: Weight decay (L2 regularization) can be read as constrained optimization: minimize the loss subject to $\|w\|_2^2 \le c$. The Lagrange multiplier $\lambda$ plays the role of the weight decay coefficient. Larger $\lambda$ corresponds to a smaller c, that is, a tighter constraint on weight magnitude.
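
One way to see why the penalty is called weight decay: under plain gradient descent, adding an L2 penalty to the loss gives the same update as shrinking the weights by a constant factor each step. The sketch below assumes the penalty is written as (λ/2)·||w||², whose gradient is λ·w, and uses made-up numbers. Both function names are illustrative.

```python
def step_with_l2_penalty(w, grad_loss, lr, lam):
    # Gradient of loss + (lam / 2) * ||w||^2 is grad_loss + lam * w.
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad_loss)]

def step_with_weight_decay(w, grad_loss, lr, lam):
    # Shrink every weight by (1 - lr * lam), then take the usual loss-gradient step.
    return [(1.0 - lr * lam) * wi - lr * gi for wi, gi in zip(w, grad_loss)]

w = [0.5, -1.5, 2.0]            # made-up weights
grad_loss = [0.1, -0.2, 0.3]    # made-up loss gradient at w

a = step_with_l2_penalty(w, grad_loss, lr=0.1, lam=0.01)
b = step_with_weight_decay(w, grad_loss, lr=0.1, lam=0.01)
print(a)
print(b)
```

The two updates are algebraically identical, so any difference in the printed values is floating-point rounding.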
The KKT conditions: (1) The gradient of the Lagrangian is zero. (2) The constraints are satisfied: g(x) ≤ 0. (3) The multipliers are non-negative: λ ≥ 0. (4) Complementary slackness: λ · g(x) = 0. Any point satisfying all four is a candidate optimum.
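
The four conditions can be checked mechanically for a one-dimensional problem. The sketch below tests candidate points for minimizing $(x - 2)^2$ under a bound of the form $x - b \le 0$. The problem, the tolerance, and the function name `kkt_satisfied` are assumptions made for this example.

```python
def kkt_satisfied(x, lam, grad_f, g, grad_g, tol=1e-9):
    """Check the four KKT conditions for one inequality constraint g(x) <= 0."""
    stationarity = abs(grad_f(x) + lam * grad_g(x)) <= tol   # (1) gradient of Lagrangian is zero
    primal_feasible = g(x) <= tol                             # (2) constraint holds
    dual_feasible = lam >= -tol                               # (3) multiplier is non-negative
    slackness = abs(lam * g(x)) <= tol                        # (4) complementary slackness
    return stationarity and primal_feasible and dual_feasible and slackness

def grad_f(x):
    # Derivative of f(x) = (x - 2)**2.
    return 2.0 * (x - 2.0)

# Constraint x <= 1 is active at the optimum x = 1, with multiplier 2.
active = kkt_satisfied(1.0, 2.0, grad_f, lambda x: x - 1.0, lambda x: 1.0)

# Constraint x <= 3 is inactive: the unconstrained optimum x = 2 already satisfies it.
inactive = kkt_satisfied(2.0, 0.0, grad_f, lambda x: x - 3.0, lambda x: 1.0)

# The unconstrained optimum x = 2 violates x <= 1, so it is not a KKT point there.
violating = kkt_satisfied(2.0, 0.0, grad_f, lambda x: x - 1.0, lambda x: 1.0)

print(active, inactive, violating)
```

The active case shows the multiplier doing real work: at x = 1 the loss gradient is -2, and the multiplier of 2 exactly cancels it.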
What does the complementary slackness condition λ · g(x) = 0 mean in words?

Chapter 7: Optimization Playground

This playground lets you run gradient descent on a 2D loss surface. Watch the optimizer navigate valleys, saddle points, and ridges. Try different learning rates and see which paths converge, diverge, or get stuck.

Simulation: 2D Gradient Descent Explorer. Click on the surface to place a starting point. Watch gradient descent find (or miss) the minimum. Choose different surfaces to explore. Initial state: learning rate 0.050, surface: bowl.


Experiments to try: (1) On the bowl, crank up learning rate until it overshoots. (2) On the saddle, start at (0, 0) and see what happens. (3) On the Rosenbrock valley, notice how gradient descent struggles with the narrow curved valley. (4) On multi-modal, start in different places to find different minima.
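
Experiment (3) can also be reproduced offline. The sketch below runs plain gradient descent on the standard Rosenbrock function $(1 - x)^2 + 100 (y - x^2)^2$, whose minimum is at (1, 1). The starting point, learning rate, and step count are illustrative choices and may not match the playground's settings.

```python
def rosenbrock(x, y):
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

def rosenbrock_grad(x, y):
    # Partial derivatives of the Rosenbrock function with respect to x and y.
    dx = -2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2)
    dy = 200.0 * (y - x ** 2)
    return dx, dy

def descend(x, y, lr, steps):
    """Plain gradient descent; returns the final point."""
    for _ in range(steps):
        dx, dy = rosenbrock_grad(x, y)
        x, y = x - lr * dx, y - lr * dy
    return x, y

start = (-1.0, 1.0)                       # on the valley floor, far from the minimum
end = descend(*start, lr=0.001, steps=1000)

print(rosenbrock(*start), rosenbrock(*end), end)
```

The loss drops, but after a thousand steps the point is still well short of (1, 1). The learning rate has to stay small to remain stable across the steep valley walls, which leaves only tiny steps along the nearly flat valley floor.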
Why does gradient descent struggle with the Rosenbrock valley (a narrow, curved minimum)?

Chapter 8: Connections

Numerical computation is the bridge between the beautiful math of Chapters 2-3 and the practical algorithms of Chapters 6-12. Every concept here reappears throughout the book:

Concept | Where It Appears
Overflow / underflow | Softmax implementation, log-likelihood computation, mixed-precision training
Gradient descent | THE training algorithm for all neural networks (Ch 6, 8)
Jacobians | Backpropagation is Jacobian chain multiplication (Ch 6)
Hessians & curvature | Adaptive optimizers (Adam), second-order methods, loss landscape analysis (Ch 8)
Condition number | Why batch normalization helps (Ch 8), preconditioning
Constrained optimization | Weight decay = L2 constraint, max-norm regularization (Ch 7)

What you should take away: Deep learning is optimization. The loss function defines what we want; gradient descent (and its variants) is how we get there. Numerical stability tricks like the log-sum-exp trick are not optional — they are the difference between a training run that works and one that produces NaN.

Up next: Chapter 6: Deep Feedforward Networks — the first real neural network architecture, where all of this math comes together.

Why is backpropagation described as "chaining Jacobians together"?