Computers use finite precision. Gradients can explode or vanish. Here is how to make deep learning work anyway.
You write a formula on a whiteboard: e^1000. Mathematically it is a perfectly well-defined number. But ask your computer to evaluate it in 32-bit floating point and it returns inf: overflow. Ask it for e^−1000 and it returns 0.0: underflow. The math is exact; the machine is not.
Deep learning lives on machines. Every weight update, every forward pass, every loss computation happens in finite-precision arithmetic. Numbers that are too large overflow to infinity. Numbers that are too small collapse to zero. Divide a nonzero number by an underflowed zero and you get inf; divide zero by zero and you get NaN, the silent killer of training runs.
Overflow occurs when a number is too large for the floating-point format. In 32-bit float, anything above ~3.4 × 10^38 becomes inf. Underflow is the opposite: numbers too close to zero collapse to exactly 0.0, losing all information about their relative sizes.
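Both failure modes can be reproduced in a few lines. This is a minimal sketch that assumes Python's built-in float, which is 64-bit rather than 32-bit, so the overflow threshold is about 1.8 × 10^308 instead of 3.4 × 10^38. The helper `safe_exp` is hypothetical: it maps Python's OverflowError to inf, which is what IEEE 754 arithmetic would return.

```python
import math
import sys

def safe_exp(x: float) -> float:
    """exp(x), returning inf on overflow instead of raising OverflowError."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf

largest = sys.float_info.max   # largest finite 64-bit float, about 1.8e308
big = safe_exp(1000.0)         # too large to represent: overflows to inf
tiny = safe_exp(-1000.0)       # too small to represent: underflows to 0.0
bad = big / big                # inf / inf has no meaningful value: NaN

print(largest, big, tiny, bad)
```

Once a NaN appears it propagates through every later arithmetic operation, which is why a single overflow can ruin a whole training run.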
A classic example is the softmax function. Given a vector z, softmax(z)_i = exp(z_i) / ∑_j exp(z_j). If z_i = 1000, exp(1000) overflows. If all z_i = −1000, every exp underflows to zero and we divide 0/0.
The fix: subtract max(z) from every element before exponentiating. Since softmax(z) = softmax(z − c) for any constant c, choose c = max(z). Now the largest exponent is exp(0) = 1 — no overflow. And at least one term in the denominator equals 1 — no division by zero.
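The max-subtraction trick can be sketched in plain Python as follows. The function names `naive_softmax` and `stable_softmax` are illustrative, not taken from any library, and the code uses 64-bit floats, where exp(1000) also overflows.

```python
import math

def naive_softmax(z):
    """Direct translation of the formula; breaks on large-magnitude inputs."""
    exps = [math.exp(v) for v in z]      # math.exp raises OverflowError for large v
    total = sum(exps)
    return [e / total for e in exps]     # ZeroDivisionError if every exp underflowed

def stable_softmax(z):
    """Softmax with max(z) subtracted before exponentiating."""
    c = max(z)
    exps = [math.exp(v - c) for v in z]  # largest exponent is exp(0) = 1
    total = sum(exps)                    # at least 1, so never zero
    return [e / total for e in exps]

print(stable_softmax([1000.0, 1000.0]))    # [0.5, 0.5]
print(stable_softmax([-1000.0, -1000.0]))  # [0.5, 0.5]
```

Calling `naive_softmax` on either of those inputs fails: the first raises OverflowError and the second divides by an underflowed zero.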
PyTorch's F.log_softmax and F.cross_entropy (from torch.nn.functional) use this internally. If you compute softmax and then take the log yourself, you risk numerical death: a probability that underflows to 0 turns into log(0).
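To see why fusing the log with the softmax helps, here is a minimal pure-Python sketch of a stable log-softmax. It illustrates the idea only and is not a claim about how any library implements it; the name `log_softmax` here is a local function.

```python
import math

def log_softmax(z):
    """log(softmax(z)) computed as z_i - c - log(sum_j exp(z_j - c)), with c = max(z)."""
    c = max(z)
    log_sum = math.log(sum(math.exp(v - c) for v in z))  # the sum is >= 1, so log is safe
    return [v - c - log_sum for v in z]

print(log_softmax([1000.0, 0.0]))   # [0.0, -1000.0]
```

The second probability here is far too small to represent, so taking softmax first would give 0.0 and then fail on log(0). Working in log space keeps the answer, −1000, exactly.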
The derivative f'(x) tells you how fast f changes when you nudge x. If f'(x) > 0, increasing x increases f. If f'(x) < 0, increasing x decreases f. If f'(x) = 0, you are at a critical point — a local minimum, local maximum, or saddle point.
When f takes a vector input, the derivative becomes a vector called the gradient, written ∇_x f. Each component ∂f/∂x_i tells you how f changes when you nudge the i-th element of x. The gradient points in the direction of steepest ascent.
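The "nudge each element" reading of the gradient translates directly into a finite-difference approximation. The sketch below is illustrative: `numerical_gradient`, the step size `h`, and the example function `f` are all assumptions chosen for the demonstration.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative df/dx_i with a central difference."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h      # nudge the i-th element up
        down[i] -= h    # and down
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def f(x):
    return x[0] ** 2 + 3 * x[1]   # analytic gradient: [2 * x_0, 3]

print(numerical_gradient(f, [1.0, 2.0]))   # close to [2.0, 3.0]
```

Finite differences are too slow and too noisy for training, but they are a common way to check a hand-derived gradient.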
To minimize f, we move in the direction opposite the gradient: x ← x − ε ∇_x f(x), where ε is the learning rate. This is gradient descent, the engine that powers all of deep learning.
Imagine you are blindfolded on a hilly landscape and want to reach the lowest valley. You feel the slope under your feet (the gradient) and take a step downhill. Repeat. That is gradient descent.
Formally: start at some initial point x_0. At each step, update x ← x − ε ∇f(x). The learning rate ε controls step size. Too large and you overshoot. Too small and you crawl. The right ε is problem-dependent.
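The update rule fits in a few lines. As a sketch, take f(x) = x^2 starting from x_0 = 3 and try three learning rates; the function names and the specific rates are illustrative choices.

```python
def gradient_descent(grad, x0, lr, steps):
    """Repeat the update x <- x - lr * grad(x) for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def grad_f(x):
    return 2 * x   # derivative of f(x) = x**2

good = gradient_descent(grad_f, x0=3.0, lr=0.1, steps=50)    # x shrinks by a factor 0.8 per step
slow = gradient_descent(grad_f, x0=3.0, lr=0.001, steps=50)  # factor 0.998 per step: crawls
bad = gradient_descent(grad_f, x0=3.0, lr=1.1, steps=50)     # factor -1.2 per step: diverges

print(good, slow, bad)
```

For this function each step multiplies x by (1 − 2ε), so any ε above 1 makes the iterates grow instead of shrink.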
When a function maps a vector to a vector, f: R^m → R^n, its derivative is neither a single number nor a vector. It is an n × m matrix called the Jacobian, J, where J_ij = ∂f_i/∂x_j.
The Jacobian tells you how every output changes with respect to every input. In backpropagation, you chain Jacobians together — that is the chain rule applied to vectors. If y = g(x) and z = f(y), then ∂z/∂x = (∂z/∂y)(∂y/∂x) — a product of Jacobians.
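The product-of-Jacobians rule can be checked numerically. In this sketch the functions `g` and `f`, their hand-derived Jacobians, and the finite-difference helper are all made-up examples, with matrices stored as lists of rows.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def g(x):
    return [x[0] * x[1], x[0] + x[1]]

def f(y):
    return [y[0] ** 2, y[0] * y[1]]

def jacobian_g(x):
    return [[x[1], x[0]], [1.0, 1.0]]          # rows: outputs of g, columns: inputs

def jacobian_f(y):
    return [[2 * y[0], 0.0], [y[1], y[0]]]

def numerical_jacobian(func, x, h=1e-6):
    """Central-difference Jacobian: entry [i][j] approximates d func_i / d x_j."""
    cols = []
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += h
        down[j] -= h
        cols.append([(u - d) / (2 * h) for u, d in zip(func(up), func(down))])
    # cols[j][i] holds d func_i / d x_j, so transpose to index rows by output
    return [list(row) for row in zip(*cols)]

x = [1.0, 2.0]
chain = matmul(jacobian_f(g(x)), jacobian_g(x))     # (dz/dy)(dy/dx)
check = numerical_jacobian(lambda v: f(g(v)), x)    # Jacobian of the composition

print(chain)   # [[8.0, 4.0], [8.0, 5.0]]
print(check)
```

Note that the Jacobian of f is evaluated at y = g(x), not at x; that ordering is what backpropagation keeps track of.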
The Hessian H is the matrix of second derivatives of a scalar function: H_ij = ∂²f / ∂x_i ∂x_j. It captures curvature: how the gradient itself changes as you move. Wherever the second partial derivatives are continuous, the Hessian is symmetric, meaning H_ij = H_ji.
At a critical point, the Hessian's eigenvalues reveal the local shape of the loss surface. If all eigenvalues are positive, the curvature is upward in every direction and you are at a local minimum. If all are negative, you are at a local maximum. If some are positive and some negative, you are at a saddle point, which looks like a mountain pass. If any eigenvalue is zero, this test alone is inconclusive.
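For a 2 × 2 symmetric Hessian the eigenvalues have a closed form, which makes the classification easy to sketch. The function names below are illustrative, and the code assumes it is applied at a point where the gradient is already zero.

```python
import math

def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], largest first."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius

def classify_critical_point(a, b, c):
    """Classify a critical point from the signs of the Hessian eigenvalues."""
    hi, lo = eigenvalues_2x2(a, b, c)
    if lo > 0:
        return "minimum"        # curvature is upward in every direction
    if hi < 0:
        return "maximum"        # curvature is downward in every direction
    if hi > 0 and lo < 0:
        return "saddle"         # up in one direction, down in another
    return "inconclusive"       # a zero eigenvalue: second-order test fails

print(eigenvalues_2x2(2.0, 0.0, 1.0), classify_critical_point(2.0, 0.0, 1.0))
print(eigenvalues_2x2(2.0, 0.0, -1.0), classify_critical_point(2.0, 0.0, -1.0))
```

The first Hessian belongs to f(x, y) = x^2 + y^2/2, a bowl; the second to f(x, y) = x^2 − y^2/2, a saddle.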
The condition number of the Hessian is the ratio of the largest to the smallest eigenvalue magnitude. A large condition number means the surface is much steeper in some directions than others: an ill-conditioned problem. Gradient descent struggles here because the optimal learning rate differs wildly across directions: a step small enough to be stable in the steepest direction makes only slow progress in the flattest one.
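The cost of ill-conditioning shows up even on a diagonal quadratic. The sketch below assumes f(x) = 0.5 · ∑ λ_i x_i^2 with the Hessian eigenvalues λ_i given directly, and uses the step size 1/λ_max as an illustrative choice that is safe for the steepest direction.

```python
def condition_number(eigs):
    """Ratio of the largest to the smallest eigenvalue magnitude."""
    return max(abs(lam) for lam in eigs) / min(abs(lam) for lam in eigs)

def steps_to_converge(eigs, tol=1e-6, max_steps=100_000):
    """Gradient descent on f(x) = 0.5 * sum(eig_i * x_i**2), starting at (1, ..., 1).

    Returns the number of steps until every coordinate is within tol of the
    minimum at the origin (or max_steps if that never happens).
    """
    lr = 1.0 / max(eigs)
    x = [1.0] * len(eigs)
    for step in range(1, max_steps + 1):
        x = [xi - lr * lam * xi for xi, lam in zip(x, eigs)]   # gradient is eig_i * x_i
        if max(abs(xi) for xi in x) < tol:
            return step
    return max_steps

well = steps_to_converge([1.0, 1.0])     # condition number 1
ill = steps_to_converge([100.0, 1.0])    # condition number 100

print(condition_number([1.0, 1.0]), well)
print(condition_number([100.0, 1.0]), ill)
```

With a condition number of 100, the steep coordinate is solved immediately while the flat one shrinks by only 1% per step, so the run needs well over a thousand steps.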
Sometimes we want to minimize f(x) subject to constraints. For example, minimize loss while keeping weight norms below a threshold. This is constrained optimization.
The Lagrangian converts a constrained problem into an unconstrained one. For equality constraints g(x) = 0, define L(x, λ) = f(x) + λ g(x). The optimal solution satisfies ∇_x L = 0 and ∇_λ L = 0 simultaneously. The variable λ is called a Lagrange multiplier.
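As a worked example, consider minimizing f(x, y) = x^2 + y^2 subject to x + y − 1 = 0. This toy problem is an assumption chosen for illustration; the code simply checks that the hand-derived solution zeroes every component of the Lagrangian's gradient.

```python
def lagrangian_gradient(x, y, lam):
    """Gradient of L(x, y, lam) = x**2 + y**2 + lam * (x + y - 1)."""
    dL_dx = 2 * x + lam
    dL_dy = 2 * y + lam
    dL_dlam = x + y - 1      # zero exactly when the constraint holds
    return dL_dx, dL_dy, dL_dlam

# Solving 2x + lam = 0, 2y + lam = 0, x + y = 1 by hand gives x = y = 1/2, lam = -1.
solution = (0.5, 0.5, -1.0)
print(lagrangian_gradient(*solution))   # (0.0, 0.0, 0.0)
```

Setting the derivative with respect to λ to zero is what enforces the constraint: it recovers g(x) = 0 as one of the equations to solve.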
For inequality constraints g(x) ≤ 0, we use the KKT (Karush-Kuhn-Tucker) conditions. The key additions are λ ≥ 0 and λ · g(x) = 0. This last condition, complementary slackness, means at least one of λ and g(x) must be zero: if the constraint is inactive (g(x) < 0), then λ = 0, and λ can be positive only when the constraint is active (g(x) = 0).
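A one-dimensional problem shows both cases. The sketch below assumes the toy objective f(x) = (x − target)^2 with the single constraint g(x) = x − bound ≤ 0; the function names and the tolerance are illustrative.

```python
def solve_1d(target, bound):
    """Minimize (x - target)**2 subject to g(x) = x - bound <= 0.

    Returns (x, lam), where lam is the multiplier in L = f(x) + lam * g(x).
    """
    if target <= bound:
        return target, 0.0                    # inactive: the free minimum is feasible
    return bound, 2.0 * (target - bound)      # active: lam solves f'(x) + lam = 0 at x = bound

def satisfies_kkt(x, lam, target, bound, tol=1e-12):
    """Check stationarity, feasibility, lam >= 0, and complementary slackness."""
    g = x - bound
    stationarity = abs(2 * (x - target) + lam) <= tol
    primal_feasible = g <= tol
    dual_feasible = lam >= 0
    slackness = abs(lam * g) <= tol
    return stationarity and primal_feasible and dual_feasible and slackness

print(solve_1d(2.0, 1.0))   # (1.0, 2.0): constraint active, positive multiplier
print(solve_1d(2.0, 3.0))   # (2.0, 0.0): constraint inactive, zero multiplier
```

In the active case the multiplier measures how hard the objective pushes against the constraint; in the inactive case the constraint could be removed without changing the answer.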
Numerical computation is the bridge between the beautiful math of Chapters 2-3 and the practical algorithms of Chapters 6-12. Every concept here reappears throughout the book:
| Concept | Where It Appears |
|---|---|
| Overflow / underflow | Softmax implementation, log-likelihood computation, mixed-precision training |
| Gradient descent | The core training algorithm for neural networks (Ch 6, 8) |
| Jacobians | Backpropagation is Jacobian chain multiplication (Ch 6) |
| Hessians & curvature | Adaptive optimizers (Adam), second-order methods, loss landscape analysis (Ch 8) |
| Condition number | Why batch normalization helps (Ch 8), preconditioning |
| Constrained optimization | Weight decay = L2 constraint, max-norm regularization (Ch 7) |
Up next: Chapter 6: Deep Feedforward Networks — the first real neural network architecture, where all of this math comes together.