Deep Learning Internals

PyTorch Autograd

How automatic differentiation builds computation graphs and computes all your gradients — without you lifting a pen.

Prerequisites: Basic calculus (derivatives) + Python familiarity. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Autograd?

You have a simple function: y = (3x + 2)². Compute dy/dx. Easy — apply the chain rule, get 6(3x + 2). Takes ten seconds with a pencil.

Now imagine you have a neural network. 10 million parameters. 50 layers of matrix multiplications, nonlinearities, normalizations, attention mechanisms. You need the gradient of the loss with respect to every single one of those 10 million parameters. By hand? You'd need a lifetime.

This is the problem automatic differentiation solves. Not symbolic differentiation (which produces enormous expressions). Not numerical differentiation (which is slow and imprecise). Automatic differentiation computes exact gradients by recording every operation you perform and then replaying them backwards, applying the chain rule at each step.

The core insight: Every computation in a neural network is just a sequence of simple operations (add, multiply, exp, etc.). Each simple operation has a known derivative. Autograd chains these together automatically. You write the forward pass — autograd gives you the backward pass for free.

PyTorch's autograd engine is what makes deep learning practical. Without it, training would require manually deriving and implementing gradient formulas for every architecture. With it, you can write any forward computation built from differentiable operations, and PyTorch works out all the gradients.

Consider what this means in practice. A researcher invents a new attention mechanism with 15 novel mathematical operations. In the old days (pre-autograd), they'd spend weeks deriving gradients by hand, implementing them in C, debugging off-by-one errors in indices. With autograd, they write the forward pass in 20 lines of Python and get perfect gradients instantly. This is why deep learning research accelerated so dramatically after frameworks like PyTorch appeared.

In this lesson, we'll build autograd from scratch. By the end, you'll understand exactly what happens when you call .backward() — no magic, just the chain rule applied systematically by a clever piece of software.

python
# The simplest possible autograd demo
import torch

x = torch.tensor(1.0, requires_grad=True)  # "Track this!"
y = (3 * x + 2) ** 2                       # Forward: compute y
y.backward()                                 # Backward: compute dy/dx
print(x.grad)                                # tensor(30.) — the gradient!

# Without autograd, you'd need to derive this by hand:
# y = (3x+2)², dy/dx = 2(3x+2)·3 = 6(3x+2)
# At x=1: 6(5) = 30 ✓

Forward & Backward Pass (interactive simulation): operations flow forward (left to right), building the computation graph, then gradients flow backward (right to left).

The animation above shows the computation y = (3x + 2)² with x = 1. Write v = 3x and u = v + 2, so that y = u². In the forward pass, values flow left to right: x = 1 → v = 3 → u = 5 → y = 25. In the backward pass, gradients flow right to left: dy/dy = 1 → dy/du = 2u = 10 → dy/dv = 10 · 1 = 10 → dy/dx = 10 · 3 = 30.

Let's compare three approaches to computing gradients:

| Method | How it works | Cost for N params | Exact? |
|---|---|---|---|
| Symbolic | Algebraically derive and simplify a derivative expression | Expression size can grow exponentially | Yes |
| Numerical | Compute (f(x+h) - f(x))/h for each param | N+1 forward passes | No (truncation and rounding error) |
| Automatic (reverse) | Record ops, replay chain rule backward | 1 forward + 1 backward | Yes (to floating-point precision) |

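Automatic differentiation is the clear winner for neural networks: exact gradients (up to floating-point precision) at a cost that is a small constant multiple of one forward pass, no matter how many parameters there are. A network with 175 billion parameters (GPT-3) gets all 175 billion gradients in a single backward pass, just as a network with 100 parameters gets all 100 in one pass. Each pass is more expensive for the bigger network, but you never need one pass per parameter.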

Key insight: The cost of one backward pass is approximately 2-3x the cost of one forward pass, regardless of how many parameters you have. This is because backward must compute one local gradient per operation, and there are roughly the same number of operations as in the forward pass (with some extra multiplications).
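
To make the cost comparison concrete, here is a minimal sketch that computes the same gradient two ways. The function `f` and the helper `finite_difference_grad` are illustrative names invented for this example, not PyTorch APIs, and float64 is used only to keep the finite-difference error small.

```python
import torch

def f(w):
    # Hypothetical scalar "loss" of an N-dimensional parameter vector.
    return (torch.sin(w) * w).sum()

def finite_difference_grad(func, w, h=1e-6):
    # Forward differences: one baseline evaluation plus one per parameter,
    # so N + 1 forward passes in total.
    base = func(w)
    grad = torch.zeros_like(w)
    for i in range(w.numel()):
        bumped = w.clone()
        bumped[i] += h
        grad[i] = (func(bumped) - base) / h
    return grad

w = torch.linspace(-1.0, 1.0, 5, dtype=torch.float64)

# Numerical: 6 forward passes for 5 parameters, approximate result.
numeric = finite_difference_grad(f, w)

# Reverse-mode autograd: 1 forward pass + 1 backward pass.
w_tracked = w.clone().requires_grad_(True)
f(w_tracked).backward()
exact = w_tracked.grad

# The gap between the two is the finite-difference error.
max_abs_error = (numeric - exact).abs().max().item()
print(max_abs_error)
```

With 5 parameters the loop is harmless; with millions of parameters the N + 1 forward passes are what make the numerical approach impractical.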

That's autograd in a nutshell. Let's learn how it actually works.

Why is numerical differentiation (computing (f(x+h) - f(x))/h) inadequate for neural networks?

Chapter 1: Computation Graphs

When you write PyTorch code like y = x * 3 + 2, something invisible happens behind the scenes. PyTorch doesn't just compute the result — it builds a computation graph. This graph records what operations were performed and in what order, creating a complete audit trail from inputs to output.

A computation graph is a directed acyclic graph (DAG). Each node represents either a tensor (data) or an operation (function). Edges connect inputs to operations to outputs. "Directed" means edges have a direction (input → output). "Acyclic" means no loops — you can't feed an output back to its own input (that would create a time paradox for gradients).

Why "acyclic"? Because autograd needs to traverse the graph in a definite order. If there were a cycle (A depends on B, B depends on A), there's no valid starting point for the backward pass. Recurrent neural networks appear to have cycles, but they're actually unrolled into a DAG — each timestep is a separate node.

Think of it this way: The computation graph is a receipt. It records every mathematical step you took. Later, when you need gradients, PyTorch reads this receipt backwards to figure out how each input contributed to the final output.

The key flag is requires_grad=True. When you create a tensor with this flag, PyTorch knows to track every operation involving that tensor. Without it, operations proceed normally but no graph is built — saving memory and compute when you don't need gradients (like during inference).

The memory overhead of tracking is significant. Every tracked operation allocates a grad_fn node, stores pointers to its inputs, and may save intermediate tensors for backward. The node count is modest (a typical transformer layer with ~20 operations adds ~20 nodes per forward pass), but the saved tensors scale with batch size and sequence length. That is why inference code usually runs under torch.no_grad().

python
import torch

# This tensor is tracked — autograd builds a graph
x = torch.tensor([3.0], requires_grad=True)

# Every operation on x creates graph nodes
a = x * 2        # MulBackward0 node created
b = a + 5        # AddBackward0 node created
c = b ** 2       # PowBackward0 node created

# The graph remembers: c came from b, b from a, a from x
print(c.grad_fn)                 # <PowBackward0 object at 0x...>
print(c.grad_fn.next_functions)  # ((<AddBackward0 object at 0x...>, 0),)

Every tensor produced by an operation has a grad_fn attribute — a pointer back to the operation that created it. This forms a linked list (really a DAG) from the output all the way back to the leaf tensors (the ones you created directly with requires_grad=True).

Leaf tensors (like your model parameters) have grad_fn = None because they weren't produced by any operation — they're the starting points. But they have requires_grad = True, which tells autograd: "I want gradients with respect to this tensor."

Key insight: The graph is built dynamically at runtime, not statically at definition time. This means you can use Python if-statements, loops, and any control flow — the graph records whatever actually executes. This is why PyTorch is called a "define-by-run" framework. TensorFlow 1.x built graphs statically before execution; PyTorch builds them as you go.

This dynamic nature means the graph can be different on every forward pass. A model that uses an if-statement based on the input value will produce different graphs for different inputs — and autograd handles this perfectly, because it just records what actually happened:

python
import torch

def weird_function(x):
    # Different computation paths based on x's value!
    if x.item() > 0:
        return x ** 2          # Positive: square it
    else:
        return -x * 3 + 1     # Zero or negative: linear

x = torch.tensor(2.0, requires_grad=True)
y = weird_function(x)
y.backward()
print(x.grad)  # tensor(4.) (d(x²)/dx = 2x = 4, took the positive branch)

x = torch.tensor(-1.0, requires_grad=True)
y = weird_function(x)
y.backward()
print(x.grad)  # tensor(-3.) (d(-3x+1)/dx = -3, took the else branch)
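
Loops work the same way as branches. The sketch below uses a hypothetical helper, `repeated_product`, whose loop count is an ordinary Python integer; the recorded graph simply contains one multiplication node per iteration that actually ran.

```python
import torch

def repeated_product(w, n):
    # Multiply by w exactly n times; each iteration adds one node to the graph.
    out = torch.ones(())
    for _ in range(n):
        out = out * w
    return out

w3 = torch.tensor(2.0, requires_grad=True)
repeated_product(w3, 3).backward()   # computes w^3
print(w3.grad)                       # tensor(12.): d(w^3)/dw = 3w^2 = 12

w5 = torch.tensor(2.0, requires_grad=True)
repeated_product(w5, 5).backward()   # computes w^5
print(w5.grad)                       # tensor(80.): d(w^5)/dw = 5w^4 = 80
```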

Interactive Graph Builder (interactive simulation): operations are added one at a time, starting from x = 2.0 (requires_grad=True), and the computation graph grows with each one. Each node shows the operation and its current value.

Notice how each new operation creates a new node that points back to its inputs. This backward-pointing structure is what makes the backward pass possible — we can start at the output and follow the pointers back to compute gradients for every intermediate node.

One subtle but important point: the graph is ephemeral by default. After you call .backward(), PyTorch destroys the graph to free memory. If you need to backward through the same graph twice (rare, but it happens), you must pass retain_graph=True. We'll cover this in Chapter 5.
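
A quick way to see the graph being freed is to call .backward() twice on the same output. This is a small sketch; the variable names are illustrative only.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

y.backward()                  # first pass works, then the saved tensors are freed
first_grad = x.grad.clone()   # tensor(4.)

try:
    y.backward()              # second pass through the same, already-freed graph
    second_backward_failed = False
except RuntimeError as err:
    second_backward_failed = True
    print(f"RuntimeError: {err}")

print(second_backward_failed)  # True
```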

Let's inspect the graph structure programmatically. Every non-leaf tensor has a grad_fn that points to the operation that created it, and grad_fn.next_functions points to the nodes that produced that operation's inputs:

python
import torch

x = torch.tensor(2.0, requires_grad=True)
a = x ** 2
b = a * 3
c = b + 1

# Walk the graph backward from c
print(c.grad_fn)                    # AddBackward0
print(c.grad_fn.next_functions)      # ((MulBackward0, 0), ...)
print(c.grad_fn.next_functions[0][0]) # MulBackward0 (= b's grad_fn)
print(b.grad_fn.next_functions[0][0]) # PowBackward0 (= a's grad_fn)
print(a.grad_fn.next_functions[0][0]) # AccumulateGrad (= leaf node, x)

# Leaf tensors have no grad_fn (they're starting points)
print(x.grad_fn)  # None
print(x.is_leaf)  # True

Leaf vs. non-leaf tensors: A leaf tensor is one you created directly (not from an operation). Your model's nn.Parameter objects are leaves. Intermediate tensors (produced by operations) are non-leaf. By default, PyTorch only stores .grad for leaf tensors; that's where your parameter updates come from. If you need gradients for intermediate tensors (for debugging), call .retain_grad() on them before backward.

What does `requires_grad=True` tell PyTorch to do?

Chapter 2: The Chain Rule

Autograd isn't magic. It's the chain rule from calculus, applied systematically. If you understand the chain rule, you understand autograd. Let's make sure that's rock solid.

The chain rule says: if y = f(g(x)), then dy/dx = f'(g(x)) · g'(x). In words: the derivative of a composition is the product of the derivatives at each stage. You multiply the "local gradients" along the path from output to input.

dy/dx = dy/du · du/dx

Let's work a concrete example by hand. Take y = (3x + 2)². Let's decompose this into simple steps:

Step 1: u = 3x + 2 (linear)
Step 2: y = u² (square)

Now compute each local derivative: du/dx = 3 (from Step 1) and dy/du = 2u (from Step 2).

Chain rule gives us: dy/dx = dy/du · du/dx = 2u · 3 = 6u = 6(3x + 2).

At x = 1: u = 3(1) + 2 = 5, so dy/dx = 6 · 5 = 30.

This is exactly what autograd does. It decomposes your computation into elementary operations, computes the local derivative at each one, then multiplies them together along the path. The only difference is that autograd does it for millions of operations in milliseconds.

Let's verify with PyTorch:

python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = (3 * x + 2) ** 2
y.backward()

print(x.grad)  # tensor(30.) ✓ matches our hand calculation!

Now let's extend to a longer chain. Consider y = sin(exp(x²)). Three operations:

a = x², so da/dx = 2x
b = exp(a), so db/da = exp(a)
y = sin(b), so dy/db = cos(b)

Chain rule: dy/dx = dy/db · db/da · da/dx = cos(exp(x²)) · exp(x²) · 2x.

At x = 0.5: a = 0.25, b = exp(0.25) ≈ 1.284, dy/dx = cos(1.284) · 1.284 · 1.0 ≈ 0.283 · 1.284 · 1.0 ≈ 0.363.

Let's verify this longer chain with PyTorch too:

python
import torch

x = torch.tensor(0.5, requires_grad=True)
a = x ** 2
b = torch.exp(a)
y = torch.sin(b)
y.backward()

print(f"dy/dx = {x.grad:.4f}")  # 0.3632 ✓ matches hand calculation

# Manual verification
import math
manual = math.cos(math.exp(0.25)) * math.exp(0.25) * 2 * 0.5
print(f"manual  = {manual:.4f}")  # 0.3632 ✓

Notice the pattern: each node only needs to know its own local derivative. Node "sin" knows that d(sin(b))/db = cos(b). Node "exp" knows that d(exp(a))/da = exp(a). They don't need to know what came before them. Autograd just multiplies these local derivatives along the chain.

Forward mode vs. reverse mode: There are actually two ways to apply the chain rule. Forward mode starts from the input and propagates derivatives forward: compute da/dx, then db/dx = db/da · da/dx, etc. Reverse mode starts from the output and propagates backward: compute dy/db, then dy/da = dy/db · db/da, etc. For a function with N inputs and 1 output (like a loss function), reverse mode needs 1 pass to get all N gradients. Forward mode would need N passes. That's why neural network training uses reverse mode (backpropagation).

Chain Rule Visualizer (interactive simulation): a slider sets x (initially 1.0), and the local gradients multiply along the chain for y = (3x + 2)².

The key realization: no matter how deep the chain goes (50 layers? 100?), the math is the same — multiply local derivatives along the path. Autograd just automates this multiplication across potentially millions of paths in a computation graph.

Why "backward" and not "forward"? You could apply the chain rule starting from the input (forward mode). But for neural networks with millions of inputs (parameters) and one output (loss), backward mode is vastly more efficient: one backward pass gives you ALL gradients at once. Forward mode would require one pass per parameter.

Let's make this concrete with numbers. The smallest GPT-2 model has roughly 117 million parameters and one scalar loss output. Forward-mode autodiff would need on the order of 117,000,000 passes to get all gradients. Reverse mode (backpropagation) needs a single backward pass. That is a difference of roughly eight orders of magnitude. This is why reverse mode won and why we call it "backpropagation": it's literally just reverse-mode automatic differentiation applied to neural networks.

When IS forward mode useful? When you have few inputs and many outputs. Computing the Jacobian of a function f: R³ → R¹⁰⁰⁰ is cheaper with forward mode (3 passes) than reverse mode (1000 passes). PyTorch supports forward mode via torch.autograd.forward_ad for these cases.
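
As a rough sketch of what forward mode looks like in practice, the example below pushes a tangent through the earlier function y = sin(exp(x²)) using torch.autograd.forward_ad. It assumes a PyTorch version recent enough to ship forward-mode support for these operations; the function name `f` is just a label for this example.

```python
import math
import torch
import torch.autograd.forward_ad as fwAD

def f(x):
    return torch.sin(torch.exp(x ** 2))

primal = torch.tensor(0.5)
tangent = torch.tensor(1.0)   # direction dx = 1, so the output tangent is dy/dx

with fwAD.dual_level():
    dual_x = fwAD.make_dual(primal, tangent)
    dual_y = f(dual_x)                         # value and derivative travel together
    y_value, dy_dx = fwAD.unpack_dual(dual_y)

# Closed form from the chain rule: cos(exp(x²)) · exp(x²) · 2x at x = 0.5
expected = math.cos(math.exp(0.25)) * math.exp(0.25) * 2 * 0.5
print(f"forward-mode dy/dx = {dy_dx.item():.4f}")
print(f"closed form        = {expected:.4f}")
```

Note that no .backward() call appears: the derivative is computed alongside the forward pass, one input direction at a time.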

For y = (3x + 2)², what is dy/dx at x = 2?

Chapter 3: The Backward Pass

You've called loss.backward() a thousand times. But what actually happens inside? Let's trace through it step by step.

When you call .backward() on a tensor, PyTorch does three things:

  1. Topological sort — Order all nodes so that every node comes after the nodes it depends on. Reverse this order for the backward pass.
  2. Walk backwards — Starting from the output, visit each node in reverse topological order.
  3. At each node — Compute local_gradient × incoming_gradient, then pass the result to parent nodes.

For a scalar output, the seed gradient defaults to 1.0 (because dL/dL = 1). For a non-scalar output you must supply the seed yourself with .backward(gradient=v); the result is the vector-Jacobian product vᵀJ for that vector v. The functional API torch.autograd.grad accepts the same kind of seed through grad_outputs, which shows up in advanced scenarios like GAN training with gradient penalties.
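
Here is a small sketch of passing an explicit seed. The vector `v` is an arbitrary choice for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                       # non-scalar output, so a seed is required

v = torch.tensor([1.0, 0.1, 0.01])
y.backward(gradient=v)           # seeds the backward pass with v instead of 1

# The Jacobian of y = x² is diag(2x), so the result is 2 * x * v.
print(x.grad)                    # tensor([2.0000, 0.4000, 0.0600])
```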

Topological sort is just a fancy way of saying: "process things in the right order." If c = a + b, you must compute gradients for c before you can compute gradients for a and b. The sort ensures this ordering.
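
The three steps above can be sketched in a few dozen lines of plain Python. This toy `Value` class is an illustration for scalars only, not how PyTorch's engine is actually implemented; all names in it are made up for this example.

```python
class Value:
    """A scalar that remembers how it was computed (illustrative only)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # leaves have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # local gradient of a + b w.r.t. a is 1
            other.grad += out.grad      # and w.r.t. b is 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(ab)/da = b
            other.grad += self.data * out.grad   # d(ab)/db = a
        out._backward = _backward
        return out

    def __pow__(self, exponent):
        out = Value(self.data ** exponent, (self,))
        def _backward():
            self.grad += exponent * self.data ** (exponent - 1) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Step 1: topological sort (every node appears after its parents).
        order, visited = [], set()
        def visit(node):
            if node not in visited:
                visited.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        # Step 2: seed the output, then walk in reverse topological order.
        self.grad = 1.0
        for node in reversed(order):
            node._backward()   # Step 3: local gradient × incoming gradient


x = Value(2.0)
c = (x * 3 + 1) ** 2
c.backward()
print(c.data)   # 49.0
print(x.grad)   # 42.0
```

The `+=` inside each `_backward` is what lets a value that is used in several places collect gradient from every path.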

Let's trace through a concrete graph. Consider:

python
x = torch.tensor(2.0, requires_grad=True)
a = x * 3       # a = 6
b = a + 1       # b = 7
c = b ** 2      # c = 49
c.backward()

Forward pass computed: x=2 → a=6 → b=7 → c=49.

Now the backward pass. We start at c and work backwards:

dc/dc = 1: Seed gradient (always starts at 1).
dc/db = 2b = 14: Local grad of b² is 2b. Incoming = 1. Pass 14 to b.
dc/da = 14 · 1 = 14: Local grad of a+1 is 1. Incoming = 14. Pass 14 to a.
dc/dx = 14 · 3 = 42: Local grad of x*3 is 3. Incoming = 14. Pass 42 to x.

So x.grad = 42. Let's verify: c = (3x+1)², dc/dx = 2(3x+1)·3 = 6(3x+1). At x=2: 6(7) = 42. Correct.

Let's write out the full trace in table form to make it crystal clear:

| Node | Forward Value | Local Gradient | Incoming Grad | Output Grad (local × incoming) |
|---|---|---|---|---|
| c = b² | 49 | dc/db = 2b = 14 | 1 (seed) | 14 → send to b |
| b = a+1 | 7 | db/da = 1 | 14 | 14 → send to a |
| a = x*3 | 6 | da/dx = 3 | 14 | 42 → send to x |
| x (leaf) | 2 | (none) | 42 | x.grad = 42 (stored) |

This table is the backward pass. Every row follows the same formula: output_grad = local_grad × incoming_grad. The "incoming grad" for each node is the "output grad" of the node above it. The process is completely mechanical — no creativity required, just table-filling.

The "incoming gradient" pattern: Each node receives a gradient from above (the "incoming gradient" or "upstream gradient"). It multiplies this by its own local derivative and passes the result down. This is ALL that backward does. There is no other magic.

In PyTorch's source code, this incoming gradient is called grad_output in custom Function classes, and sometimes called the "cotangent" in more mathematical contexts. But the concept is always the same: "what gradient arrived at this node from the nodes it feeds into."

Let's work a slightly more complex example with a branching computation — where one tensor is used in two different operations:

python
x = torch.tensor(3.0, requires_grad=True)
a = x ** 2       # a = 9
b = x * 5       # b = 15  (x is used TWICE!)
c = a + b       # c = 24
c.backward()

# x.grad = dc/dx = dc/da · da/dx + dc/db · db/dx
#        = 1·(2x) + 1·5 = 6 + 5 = 11
print(x.grad)   # tensor(11.) ✓

When x is used in multiple operations, gradients from ALL paths are summed at x. This is the multivariate chain rule: if the output depends on x through multiple paths, the total derivative is the sum of derivatives along each path. Autograd handles this automatically through gradient accumulation: each path's contribution is added into x's gradient rather than overwriting it.

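What about nodes with multiple inputs? Consider c = a * b. During backward, the node receives one incoming gradient dL/dc and produces one gradient per input: dL/da = dL/dc · b and dL/db = dL/dc · a.

Each input gets its own gradient, computed independently. If a node has multiple outputs (its result is used in two places), the gradients from both paths are summed, as in the branching example above; Chapter 4 covers this further.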
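
A two-input node is easy to check directly. In this small sketch the values 4 and 5 are arbitrary.

```python
import torch

a = torch.tensor(4.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)
c = a * b            # forward: 20
c.backward()         # incoming gradient at c is the seed, 1

print(a.grad)        # tensor(5.): dc/da = b
print(b.grad)        # tensor(4.): dc/db = a
```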

Backward Pass Animator (interactive simulation): gradients propagate backward through x → *3 → +1 → ² → c, one node at a time.

python
# Verify our manual trace
import torch

x = torch.tensor(2.0, requires_grad=True)
a = x * 3
b = a + 1
c = b ** 2
c.backward()

print(f"x.grad = {x.grad}")  # tensor(42.) ✓

# We can also get intermediate gradients with retain_grad()
x = torch.tensor(2.0, requires_grad=True)
a = x * 3
a.retain_grad()
b = a + 1
b.retain_grad()
c = b ** 2
c.backward()
print(f"b.grad = {b.grad}")  # tensor(14.) — matches our trace
print(f"a.grad = {a.grad}")  # tensor(14.) — matches our trace
print(f"x.grad = {x.grad}")  # tensor(42.) — matches our trace

Let's also see how this works for vector/matrix operations, since real neural networks operate on tensors, not scalars. Consider a simple linear layer: y = W @ x + b, where W is 2×3, x is 3×1, b is 2×1.

python
import torch

# Simple linear layer: y = Wx + b
W = torch.randn(2, 3, requires_grad=True)
x = torch.randn(3, 1, requires_grad=True)
b = torch.randn(2, 1, requires_grad=True)

y = W @ x + b    # Shape: (2,1)
loss = y.sum()   # Scalar loss for backward
loss.backward()

print(f"W.grad shape: {W.grad.shape}")  # (2, 3) — same as W
print(f"x.grad shape: {x.grad.shape}")  # (3, 1) — same as x
print(f"b.grad shape: {b.grad.shape}")  # (2, 1) — same as b

# The backward rules for matmul:
# dL/dW = (dL/dy) @ x.T  →  shape (2,1) @ (1,3) = (2,3) ✓
# dL/dx = W.T @ (dL/dy)  →  shape (3,2) @ (2,1) = (3,1) ✓
# dL/db = dL/dy           →  shape (2,1) ✓ (addition passes grad through)
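
The matmul rules in the comments above can be checked against autograd directly. This is a sketch with a fixed seed; the `manual_*` names are illustrative.

```python
import torch

torch.manual_seed(0)
W = torch.randn(2, 3, requires_grad=True)
x = torch.randn(3, 1, requires_grad=True)
b = torch.randn(2, 1, requires_grad=True)

loss = (W @ x + b).sum()
loss.backward()

# For loss = y.sum(), the incoming gradient dL/dy is all ones with y's shape.
grad_y = torch.ones(2, 1)

manual_W_grad = grad_y @ x.detach().T     # (2,1) @ (1,3) -> (2,3)
manual_x_grad = W.detach().T @ grad_y     # (3,2) @ (2,1) -> (3,1)
manual_b_grad = grad_y                    # addition passes the gradient through

print(torch.allclose(W.grad, manual_W_grad))  # True
print(torch.allclose(x.grad, manual_x_grad))  # True
print(torch.allclose(b.grad, manual_b_grad))  # True
```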

Shape rule: The gradient of a tensor always has the SAME shape as the tensor itself. If W is 2×3, then dL/dW is also 2×3. This must be true because gradient descent does W -= lr * W.grad, which requires matching shapes. The backward formulas for matmul are specifically constructed to produce the right shapes.

In the backward pass, what does each node multiply together?

Chapter 4: Gradient Accumulation

Here's a bug that bites every PyTorch beginner at least once. You run your training loop, and the loss doesn't converge. It oscillates wildly or diverges. You debug for hours. The culprit? You forgot optimizer.zero_grad().

The reason: gradients in PyTorch accumulate by addition. When you call .backward(), the computed gradients are added to whatever is already in the .grad attribute, not replaced. If you don't zero them out between batches, you get the sum of gradients from every batch you've ever processed.

Why does this happen? It's not a bug — it's a feature. Gradient accumulation is useful when you want to simulate a larger batch size than fits in GPU memory. Process 4 mini-batches, accumulate gradients, then step once = equivalent to 1 batch that's 4x larger. But you must explicitly choose when to accumulate vs when to reset.

Let's see the bug in action:

python
import torch

x = torch.tensor(3.0, requires_grad=True)

# First backward: gradient = 2x = 6
y = x ** 2
y.backward()
print(x.grad)  # tensor(6.) ✓

# Second backward WITHOUT zeroing: gradient ADDS
y = x ** 2
y.backward()
print(x.grad)  # tensor(12.) ← 6 + 6, NOT just 6!

# Third backward: keeps piling up
y = x ** 2
y.backward()
print(x.grad)  # tensor(18.) ← 6 + 6 + 6

The fix is simple: zero gradients before each backward pass (or before each optimizer step, depending on whether you want accumulation).

python
# The correct training loop pattern
# (model, dataloader, loss_fn, and optimizer are assumed to be defined)
for inputs, target in dataloader:
    optimizer.zero_grad()          # Reset gradients to 0
    output = model(inputs)         # Forward pass
    loss = loss_fn(output, target) # Compute loss
    loss.backward()                # Backward pass (accumulates into .grad)
    optimizer.step()               # Update parameters using .grad

# Gradient accumulation pattern (simulating 4x batch size)
optimizer.zero_grad()                    # Start from clean gradients
for i, (inputs, target) in enumerate(dataloader):
    output = model(inputs)
    loss = loss_fn(output, target) / 4  # Scale so the sum matches one 4x batch
    loss.backward()                      # Accumulate
    if (i + 1) % 4 == 0:
        optimizer.step()                 # Step every 4 batches
        optimizer.zero_grad()            # THEN zero
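
The claim that accumulation matches a larger batch can be checked numerically. The sketch below assumes a mean-reduced loss and equal-sized micro-batches; the model and data are random placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()            # averages over the batch by default
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

# (a) One backward pass over the full batch of 32.
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# (b) Four micro-batches of 8, each loss divided by 4, gradients accumulated.
model.zero_grad()
for micro_inputs, micro_targets in zip(inputs.chunk(4), targets.chunk(4)):
    loss = loss_fn(model(micro_inputs), micro_targets) / 4
    loss.backward()               # adds into model.weight.grad
accumulated_grad = model.weight.grad.clone()

# The two should agree up to floating-point rounding.
grads_match = torch.allclose(full_batch_grad, accumulated_grad, atol=1e-5)
print(grads_match)
```

The division by 4 matters: without it, the accumulated gradient would be four times the full-batch gradient.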

There's another subtle case: when a tensor is used in multiple operations. If x feeds into both a and b, and both contribute to the loss, then during backward the gradients from both paths are summed at x. This is mathematically correct — by the multivariate chain rule, if L = f(a(x), b(x)), then dL/dx = dL/da · da/dx + dL/db · db/dx.

python
x = torch.tensor(2.0, requires_grad=True)
a = x * 3    # path 1: uses x
b = x ** 2   # path 2: uses x
c = a + b    # both paths merge here
c.backward()
print(x.grad)  # tensor(7.) = 3 (from a) + 4 (from b, 2x=4)

Accumulation Demo (interactive simulation): calling backward repeatedly without zeroing makes x.grad pile up; zeroing the gradient resets it to 0.

Mental model: Think of .grad as a bucket. Each .backward() pours water in. .zero_grad() empties the bucket. If you never empty it, it overflows. The bucket metaphor also explains gradient accumulation: pour from 4 small cups before measuring = same as 1 big cup.

Here's a real training scenario showing the difference between correct and buggy code:

python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# === BUGGY VERSION (gradients pile up) ===
for epoch in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    output = model(x)
    loss = loss_fn(output, y)
    loss.backward()         # Gradients ACCUMULATE each iteration!
    optimizer.step()        # Each step uses the running sum, not the current gradient
    # After 100 iterations, .grad holds the sum of all 100 gradients

# === CORRECT VERSION ===
for epoch in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    optimizer.zero_grad()   # ← THIS LINE FIXES EVERYTHING
    output = model(x)
    loss = loss_fn(output, y)
    loss.backward()
    optimizer.step()

The symptom of forgetting zero_grad() is distinctive: your loss might decrease initially (the first few accumulated gradients point in roughly the right direction), but then starts oscillating or exploding. If you ever see wild loss values after a few iterations, check your zero_grad call first.

optimizer.zero_grad() also takes a set_to_none argument. With set_to_none=True (the default since PyTorch 2.0), gradients are set to None instead of being filled with zeros. This is slightly faster (it avoids a memset operation) and briefly lowers memory use:

python
# Slightly more efficient version (default since PyTorch 2.0)
optimizer.zero_grad(set_to_none=True)

# What this does internally:
# param.grad = None  (vs param.grad.zero_() with set_to_none=False)
# Next backward will allocate a new grad tensor
# Pro: faster, less memory peak
# Con: code that reads param.grad must handle None (check `if param.grad is not None`)
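
The difference between the two modes is easy to observe on a single parameter. This is a minimal sketch; the parameter `w` and the loss are placeholders.

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
optimizer = torch.optim.SGD([w], lr=0.1)

(w ** 2).sum().backward()
optimizer.zero_grad(set_to_none=False)
print(w.grad)                          # tensor([0., 0.]): buffer kept and zeroed
grad_after_zero_fill = w.grad.clone()

(w ** 2).sum().backward()
optimizer.zero_grad(set_to_none=True)
print(w.grad)                          # None: buffer released
grad_after_set_to_none = w.grad

# Code that reads .grad directly should guard against None.
grad_norm = 0.0 if w.grad is None else w.grad.norm().item()
print(grad_norm)                       # 0.0
```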

You call .backward() 3 times on y = 5x (with x=2) without zeroing gradients. What is x.grad?

Chapter 5: Detach, no_grad, retain_graph

Sometimes you need to stop autograd from tracking operations. Maybe you're doing inference and don't want the memory overhead of a computation graph. Maybe you need to freeze part of a network. Maybe you need to compute a value that shouldn't receive gradients. PyTorch gives you three tools for this.

1. torch.no_grad()

torch.no_grad() is a context manager that disables gradient tracking for everything inside it. Operations still execute normally, but no graph is built. This is your go-to for inference and evaluation.

python
# During inference — no gradients needed, saves memory
with torch.no_grad():
    predictions = model(test_data)
    # No computation graph built — much faster, less memory

# Also useful for manual parameter updates
with torch.no_grad():
    param -= learning_rate * param.grad  # No graph for this update!
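
To see both uses on a concrete tensor, here is a small sketch with a single scalar parameter `w`; the loss and learning rate are arbitrary.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
loss = (w - 3) ** 2
loss.backward()                       # dloss/dw = 2(w - 3) = -4

with torch.no_grad():
    untracked = w * 2                 # computed normally, but not recorded
    w -= 0.1 * w.grad                 # in-place update: 1.0 - 0.1 * (-4) = 1.4

print(untracked.requires_grad)        # False
print(untracked.grad_fn)              # None
print(w.is_leaf, w.requires_grad)     # True True: w is still a trainable leaf
print(round(w.item(), 4))             # 1.4
```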

2. tensor.detach()

detach() creates a new tensor that shares the same data but is disconnected from the computation graph. It's like cutting a wire — gradients can't flow through the detached tensor. The detached tensor shares the same underlying memory (it's a view, not a copy), so it's memory-efficient.

This is the workhorse of stop-gradient patterns in deep learning. Any time you want to use a value in a computation but don't want gradients to flow through it, you detach:

python
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2          # y is connected to x in the graph
z = y.detach()      # z has same value (9.0) but NO graph connection

w = z * 2           # operations on z don't connect back to x
w.backward()        # RuntimeError: w does not require grad and has no grad_fn

# Common use: freeze encoder, train decoder
features = encoder(image).detach()  # Stop gradients here
output = decoder(features)          # Only decoder gets gradients
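
Two properties of detach() are worth checking by hand: it shares memory with the original, and it turns a value into a constant from autograd's point of view. A minimal sketch:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Same storage, no graph connection.
y = x ** 2
z = y.detach()
print(z.requires_grad)                 # False
print(z.data_ptr() == y.data_ptr())    # True: no copy was made

# Stop-gradient: treat one factor as a constant.
out = x * x.detach()                   # value 9, but only the first x is tracked
out.backward()
print(x.grad)                          # tensor(3.), not 2x = 6
```

Because the second factor is detached, autograd differentiates x · 3 rather than x², which is exactly the "use the value but block the gradient" behavior described above.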

3. retain_graph=True

retain_graph=True prevents PyTorch from destroying the computation graph after .backward(). Normally, the graph is freed to reclaim memory. But if you need multiple backward passes through the same graph (e.g., for multiple losses), you need to keep it alive.

python
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# First backward — keep graph alive
y.backward(retain_graph=True)
print(x.grad)  # 12.0 (3x² = 12)

# Second backward through same graph
x.grad.zero_()
y.backward()    # Works because graph was retained
print(x.grad)  # 12.0 again

# Without retain_graph=True, the second .backward() would crash

When do you actually need retain_graph? It's rare. The main cases are: (1) Multiple losses that share a computation path. (2) Reinforcement learning algorithms that need the log-probability graph after computing the value loss. (3) Higher-order gradients (gradients of gradients). In normal supervised training, you never need it.
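
Case (3) is the easiest to try on a scalar. The sketch below uses torch.autograd.grad with create_graph=True, which records the backward computation so that it can be differentiated again.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# create_graph=True makes the first derivative itself part of a graph.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(dy_dx.item())     # 12.0  (3x² at x = 2)
print(d2y_dx2.item())   # 12.0  (6x at x = 2)
```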

A useful alternative to .backward() is torch.autograd.grad(), a functional API that returns gradients instead of accumulating them into .grad attributes. Like .backward(), it frees the graph by default, so pass retain_graph=True if you need to differentiate through the same graph again:

python
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x

# Functional gradient: doesn't modify x.grad; retain_graph=True keeps the graph alive
grad_y = torch.autograd.grad(y, x, retain_graph=True)[0]
print(grad_y)   # tensor(8.) — dy/dx = 2x + 2 = 8
print(x.grad)   # None — .grad not touched!

# Can call again because retain_graph=True
grad_y2 = torch.autograd.grad(y, x)[0]  # Now graph is consumed
print(grad_y2)  # tensor(8.)

Here's a concrete example where retain_graph is necessary: two losses that share an upstream computation, as happens in GAN-style setups where several losses depend on the same generator output:

python
# Two different losses that share computation
x = torch.tensor(2.0, requires_grad=True)
shared = x ** 2  # Shared computation

loss1 = shared * 3   # First loss branch
loss2 = shared + 5   # Second loss branch

# Need retain_graph for the first backward
loss1.backward(retain_graph=True)  # Graph kept alive
print(x.grad)  # 12.0 (d(3x²)/dx = 6x = 12)

# Second backward through the SAME shared computation
loss2.backward()  # Graph freed after this
print(x.grad)  # 16.0 (12 + 4, accumulated!)

| Tool | What it does | When to use |
|---|---|---|
| torch.no_grad() | Disables tracking for all ops in block | Inference, manual param updates |
| .detach() | Cuts one tensor from the graph | Freezing part of a network, stop-gradient |
| retain_graph=True | Keeps graph alive after backward | Multiple backward passes, higher-order grads |

Graph Control Visualizer (interactive simulation): the graph x → a → b → c shows how each tool affects gradient flow. In normal mode, gradients flow through all nodes.

Here's a summary of when each tool is the right choice in common deep learning patterns:

python
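import torch
import torch.nn.functional as F

# The models, batches, and tensors below are assumed to be defined elsewhere.

# Pattern 1: Inference / evaluation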
with torch.no_grad():
    preds = model(test_batch)

# Pattern 2: Freeze part of model (e.g., fine-tuning only the head)
features = backbone(images).detach()
logits = classification_head(features)  # Only head gets gradients

# Pattern 3: Target networks (DQN, MoCo, EMA)
with torch.no_grad():
    target_q = target_network(next_states)  # No grad for target
current_q = policy_network(states)         # Grad flows here
loss = F.mse_loss(current_q, target_q.detach())  # Extra safety

# Pattern 4: Gradient penalty (WGAN-GP) — needs retain + create_graph
interp = torch.lerp(real, fake, alpha)
interp.requires_grad_(True)
d_interp = discriminator(interp)
grads = torch.autograd.grad(d_interp.sum(), interp,
                            create_graph=True)[0]
gp = ((grads.norm(2, dim=1) - 1) ** 2).mean()  # Differentiable!

You want to use a pretrained encoder's output as input to a new decoder, but don't want to update the encoder. What do you use?

Chapter 6: Custom Autograd Functions

What if you invent a new operation that PyTorch doesn't have a built-in gradient for? Or what if you want to override the gradient of an existing operation (like the straight-through estimator for quantization)? You write a custom torch.autograd.Function.

A custom function has two static methods: forward() computes the output, and backward() computes the gradients. You're essentially telling PyTorch: "Here's how to compute this operation, and here's the derivative."

Think of it this way: Every built-in differentiable PyTorch operation (add, multiply, relu, etc.) follows the same contract as a custom autograd function, even though the built-ins are implemented in C++. When you write one yourself, you're doing the same thing the PyTorch developers did: providing both the forward computation and its derivative.

Let's implement ReLU from scratch:

python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # ctx = context object for saving things needed in backward
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output = the incoming gradient from above
        x, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x <= 0] = 0  # gradient is 0 where x <= 0 (at x = 0 we pick 0, as PyTorch's ReLU does)
        return grad_input

# Use it
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0], requires_grad=True)
y = MyReLU.apply(x)
y.sum().backward()
print(x.grad)  # tensor([0., 0., 0., 1., 1.]) — gradient is 1 where x>0, else 0

The ctx object is the bridge between forward and backward. In forward, you save any tensors you'll need for gradient computation using ctx.save_for_backward(). In backward, you retrieve them with ctx.saved_tensors. This is critical for memory: if you save too many tensors, you waste GPU memory. If you save too few, you can't compute the gradient. The art of custom functions is saving the minimum necessary.

Important: you can only save tensors with save_for_backward. For non-tensor values (like integers, booleans, or shapes), store them directly as attributes on ctx: ctx.my_value = 42.

Now let's build something more interesting: a straight-through estimator. This is used in quantization — the forward pass rounds values to integers (non-differentiable!), but the backward pass pretends rounding didn't happen and passes gradients straight through.

python
class StraightThroughRound(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)  # Round to nearest integer

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # Pass gradient through unchanged!

# Forward: 2.7 → 3.0 (rounded)
# Backward: gradient passes through as if rounding never happened
x = torch.tensor(2.7, requires_grad=True)
y = StraightThroughRound.apply(x)
y.backward()
print(x.grad)  # tensor(1.) — gradient flows through!
Why does this work? Mathematically, d(round(x))/dx = 0 almost everywhere (and undefined at 0.5, 1.5, etc.). But that makes training impossible — no gradient signal. The straight-through estimator is a lie we tell autograd: "pretend the derivative is 1." It works surprisingly well in practice and is the foundation of quantization-aware training.
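The same straight-through behaviour can also be sketched without a custom Function, using the detach idiom. This is a minimal illustration, not the chapter's method: `ste_round` is a hypothetical helper name, and the trick relies on the detached term being treated as a constant during backward.

```python
import torch

def ste_round(x: torch.Tensor) -> torch.Tensor:
    # Forward value: x + (round(x) - x) equals round(x).
    # Backward: the detached term is a constant, so d(out)/dx = 1.
    return x + (torch.round(x) - x).detach()

x = torch.tensor([2.7, -1.2, 0.4], requires_grad=True)
y = ste_round(x)
y.sum().backward()
print(y)       # values are the rounded inputs
print(x.grad)  # all ones: the gradient passed straight through
```

The custom Function version is easier to extend (for example, zeroing the gradient outside a clipping range), while the detach version is a one-liner.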

The general pattern for custom functions:

forward(ctx, *inputs): Compute the output. Save what backward needs in ctx.
backward(ctx, *grad_outputs): Return one gradient per input, where grad = local_jac × grad_output.

The backward method must return exactly as many gradients as there were inputs to forward. If an input doesn't need a gradient (like an integer parameter), return None for it.
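As a sketch of the "one gradient per input" rule with more than one tensor input, the function below takes two tensors and a plain float. `ScaledMul` is a made-up example name; `ctx.needs_input_grad` is the tuple of booleans PyTorch provides so backward can skip work for inputs that don't need a gradient.

```python
import torch

class ScaledMul(torch.autograd.Function):
    """out = scale * a * b, where scale is a plain Python float."""

    @staticmethod
    def forward(ctx, a, b, scale):
        ctx.save_for_backward(a, b)
        ctx.scale = scale            # non-tensor value lives on ctx directly
        return scale * a * b

    @staticmethod
    def backward(ctx, grad_output):
        a, b = ctx.saved_tensors
        grad_a = grad_b = None
        # needs_input_grad has one boolean per forward input
        if ctx.needs_input_grad[0]:
            grad_a = grad_output * ctx.scale * b
        if ctx.needs_input_grad[1]:
            grad_b = grad_output * ctx.scale * a
        return grad_a, grad_b, None  # three inputs, so three return values

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([3.0, 4.0], requires_grad=True)
ScaledMul.apply(a, b, 0.5).sum().backward()
print(a.grad)  # 0.5 * b
print(b.grad)  # 0.5 * a
```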

Let's implement one more custom function to really solidify the pattern: a clamp operation that clips values to a range [min, max], with correct gradients (gradient is 0 outside the range, 1 inside):

python
class MyClamp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, min_val, max_val):
        ctx.save_for_backward(x)
        ctx.min_val = min_val
        ctx.max_val = max_val
        return x.clamp(min=min_val, max=max_val)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        # Gradient passes through where x is in range, zero otherwise
        mask = (x >= ctx.min_val) & (x <= ctx.max_val)
        return grad_output * mask.float(), None, None
        # None, None for min_val and max_val (no grad needed)

# Test
x = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)
y = MyClamp.apply(x, 0.0, 1.0)
print(y)  # tensor([0.0, 0.5, 1.0])
y.sum().backward()
print(x.grad)  # tensor([0., 1., 0.]) — only middle value gets gradient
Common mistake with custom functions: Forgetting to return None for non-differentiable inputs. If your forward takes 3 inputs but only the first needs a gradient, backward must return (grad_for_first, None, None). Returning the wrong number of values fails during backward with an error along the lines of "returned an incorrect number of gradients (expected 3, got 1)."
Custom Function Behavior

Compare standard ReLU vs custom straight-through estimator. Drag x to see how gradients differ for negative inputs.

In a custom autograd Function, what does ctx.save_for_backward(x) do?

Chapter 7: Autograd Engine from Scratch

Now for the payoff. We're going to build a working autograd engine from scratch in about 60 lines of Python. It is small, but it is a real reverse-mode engine: it computes exact gradients for any expression built from the operations it defines (+, *, **, ReLU). This is the same approach Andrej Karpathy's micrograd uses, stripped to its essence.

Our engine has one class: Value. Each Value wraps a number and remembers (1) what operation produced it, (2) what its inputs were, and (3) a _backward function that computes local gradients.

python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad   # d(a+b)/da = 1
            other.grad += out.grad  # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def __pow__(self, n):
        out = Value(self.data ** n, (self,), f'**{n}')
        def _backward():
            self.grad += n * (self.data ** (n-1)) * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')
        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # Seed gradient and walk backwards
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)

That's the entire engine. Let's use it:

python
# Test: y = (3x + 2)² at x = 1
x = Value(1.0)
y = (3 * x + 2) ** 2
y.backward()
print(f"y = {y.data}")     # 25.0
print(f"dy/dx = {x.grad}") # 30.0 ✓ (matches Chapter 2!)

# A tiny neural network: 2 inputs, 1 hidden neuron, 1 output
x1 = Value(2.0)
x2 = Value(3.0)
w1 = Value(0.5)
w2 = Value(-0.3)
b = Value(0.1)

# Forward: neuron = relu(w1*x1 + w2*x2 + b)
neuron = (w1 * x1 + w2 * x2 + b).relu()
print(f"neuron = {neuron.data}")  # relu(1.0 - 0.9 + 0.1) = relu(0.2) = 0.2
neuron.backward()
print(f"dout/dw1 = {w1.grad}")  # 2.0 (= x1, because relu is active)
print(f"dout/dw2 = {w2.grad}")  # 3.0 (= x2)
Look how simple it is! The entire engine is just: (1) Each operation stores a closure that knows its local gradient. (2) backward() does a topological sort and calls each closure in reverse order. (3) Gradients accumulate with +=. That's it. That's PyTorch autograd, minus the GPU kernels and C++ performance optimizations.

Let's also use our engine for a slightly larger example — a 2-layer neural network with one hidden unit:

python
# 2-input, 1-hidden, 1-output network
# Forward: h = relu(w1*x1 + w2*x2 + b1), out = w3*h + b2
x1 = Value(1.5)
x2 = Value(-2.0)
w1 = Value(0.8)
w2 = Value(0.6)
b1 = Value(0.1)
w3 = Value(1.2)
b2 = Value(-0.5)

# Hidden layer
h_pre = w1 * x1 + w2 * x2 + b1  # 0.8*1.5 + 0.6*(-2) + 0.1 = 0.1
h = h_pre.relu()                  # relu(0.1) = 0.1

# Output layer
out = w3 * h + b2                 # 1.2*0.1 + (-0.5) = -0.38

# Suppose target is 1.0, loss = (out - target)²
target = Value(1.0)
loss = (out + (-1) * target) ** 2   # (−0.38 - 1)² = 1.9044

loss.backward()
print(f"dL/dw1 = {w1.grad:.4f}")  # -4.9680: how much does w1 affect the loss?
print(f"dL/dw2 = {w2.grad:.4f}")  # 6.6240: how much does w2 affect the loss?
print(f"dL/dw3 = {w3.grad:.4f}")  # -0.2760: how much does w3 affect the loss?

# These gradients tell the optimizer which direction to nudge each weight
# to reduce the loss. Gradient descent: w -= lr * w.grad

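This covers the first two steps of a training iteration: the forward pass (compute loss) and the backward pass (compute gradients). The third step, the update, is the one-line rule in the final comment: nudge each weight against its gradient. Every modern deep learning framework runs this loop, in some form, billions of times.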
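To see all three steps run repeatedly, here is a sketch that fits a single data point with plain gradient descent. It re-declares a trimmed copy of the chapter's Value class (only + and * are kept) so the block runs on its own; the data point, learning rate, and step count are arbitrary choices for illustration.

```python
class Value:
    """Trimmed copy of the chapter's engine: only + and * are kept."""
    def __init__(self, data, _children=()):
        self.data, self.grad = data, 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Fit y = w*x + b to one point (x=2, target=7) with plain SGD
w, b = Value(0.5), Value(0.0)
lr = 0.05
losses = []
for step in range(50):
    w.grad = b.grad = 0.0               # zero gradients (they accumulate with +=)
    err = w * Value(2.0) + b + (-7.0)   # forward
    loss = err * err
    loss.backward()                     # backward
    w.data -= lr * w.grad               # update
    b.data -= lr * b.grad
    losses.append(loss.data)

print(losses[0], losses[-1])  # the loss shrinks toward zero
```

Resetting the gradients at the top of each step matters: because every `_backward` uses +=, stale gradients from the previous iteration would otherwise be added to the new ones.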

The interactive simulation below implements this exact engine in JavaScript. Adjust the input value, watch the computation graph form, then click "Backward" to see gradients flow through every node.

Live Autograd Engine

Adjust x to change the input. The graph computes y = (3x+2)². Click Backward to watch gradient propagation with numerical values at each node.


Key observations from the engine:

Let's trace through the topological sort for our example y = (3x+2)² to make it concrete:

python
# Graph structure: x → (x*3) → (a+2) → (b**2) = y
# Topological order (leaves first): [x, 3, a=x*3, 2, b=a+2, y=b²]
# Reversed (for backward):          [y=b², b=a+2, a=x*3, 2, 3, x]
#
# Backward execution:
# 1. y.grad = 1.0  (seed)
# 2. y._backward(): b.grad += 2*b.data * y.grad = 2*5*1 = 10
# 3. b._backward(): a.grad += b.grad = 10, two.grad += b.grad = 10
# 4. a._backward(): x.grad += 3 * a.grad = 30, three.grad += 1 * a.grad = 10
# Result: x.grad = 30 ✓

The beauty of this design is its composability. Each operation only defines its local gradient rule. The engine handles the rest: ordering, accumulation, traversal. Want to add a new operation? Just define _backward for it. Everything else stays the same.
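As a sketch of what "just define _backward" looks like in practice, the block below adds a tanh operation. To stay runnable on its own it uses a minimal stand-in for the chapter's Value (only * is kept next to the new method); the `tanh` method is an addition for illustration, not part of the engine listed above.

```python
import math

class Value:
    """Minimal stand-in for the chapter's Value: only * plus the new tanh."""
    def __init__(self, data, _children=()):
        self.data, self.grad = data, 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad   # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x = Value(0.5)
y = (x * 2).tanh()                          # y = tanh(2x)
y.backward()
expected = 2 * (1 - math.tanh(1.0) ** 2)    # chain rule by hand
print(x.grad, expected)
```

Nothing in `backward()` had to change: the new operation only contributes its own local rule.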

You just built PyTorch autograd. The real PyTorch engine is the same algorithm, implemented in C++ with GPU support, memory optimizations, and thread safety. But the core logic — closure-based local gradients, topological sort, backward traversal — is exactly what you see above. There's no hidden magic.

Let's also try different expressions in our engine to build confidence. Each one you should be able to verify by hand:

python
# Expression 1: y = x³ at x=2
# dy/dx = 3x² = 12
x = Value(2.0)
y = x * x * x
y.backward()
print(f"x³ at x=2: grad = {x.grad}")  # 12.0 ✓

# Expression 2: y = (x + 1) * (x - 1) = x² - 1 at x=3
# dy/dx = 2x = 6
x = Value(3.0)
y = (x + 1) * (x + (-1))
y.backward()
print(f"(x+1)(x-1) at x=3: grad = {x.grad}")  # 6.0 ✓

# Expression 3: y = relu(x - 2) at x=1 (inactive)
# dy/dx = 0 (relu is off)
x = Value(1.0)
y = (x + (-2)).relu()
y.backward()
print(f"relu(x-2) at x=1: grad = {x.grad}")  # 0.0 ✓

# Expression 4: y = relu(x - 2) at x=5 (active)
# dy/dx = 1 (relu is on, pass-through)
x = Value(5.0)
y = (x + (-2)).relu()
y.backward()
print(f"relu(x-2) at x=5: grad = {x.grad}")  # 1.0 ✓

Each of these can be verified by hand in seconds. The engine handles them all with the same generic mechanism. This is the power of automatic differentiation: one algorithm for any computation.

In our micrograd engine, why does __mul__'s _backward use `self.grad += other.data * out.grad`?

Chapter 8: Gotchas & Performance

Autograd is elegant, but it has sharp edges. Ignore these gotchas and you'll spend hours debugging silent failures or mysterious memory leaks. Let's catalog the most common ones.

Gotcha 1: In-place operations break autograd

An in-place operation modifies a tensor directly (like x.add_(1) or x[0] = 5). These can corrupt the computation graph because backward relies on the original values being intact.

python
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.exp()   # exp saves its OUTPUT for use in backward
y.add_(1)     # In-place! Overwrites the value exp's backward needs
# y.sum().backward() now raises:
# RuntimeError: one of the variables needed for gradient computation
# has been modified by an inplace operation

# FIX: use out-of-place operations
y = x.exp()
y = y + 1     # Creates new tensor, exp's saved output stays intact
y.sum().backward()
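Rule of thumb: Avoid in-place operations (methods ending in _) on tensors that participate in gradient computation. PyTorch raises a runtime error when an in-place edit invalidates a value that backward needs, but that error often appears only at backward time, far from the line that caused it. When in doubt, use out-of-place.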
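A small sketch of how the failure is detected: each tensor carries a version counter that in-place operations bump, and backward compares it against the version recorded when the tensor was saved. `_version` is an internal attribute, read here purely for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.exp()                  # exp keeps its output for the backward pass
version_before = y._version
y.add_(1)                    # in-place edit bumps the version counter
version_after = y._version

try:
    y.sum().backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("backward failed:", str(err)[:60], "...")

print(version_before, version_after, raised)
```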

Gotcha 2: .item() and .numpy() detach

Calling .item() on a tensor extracts a plain Python number, completely disconnected from the graph. Similarly, .numpy() produces a NumPy array with no graph; on a tensor that requires grad it raises a RuntimeError until you call .detach() first. If you feed .item() results (or detached arrays) back into a computation that needs gradients, you'll silently break the gradient chain.

python
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

# BAD: .item() extracts a plain float — no gradient!
loss_value = y.item()  # Just the number 9.0, no graph

# GOOD: use the tensor directly for computations that need gradients
# Only use .item() for logging/printing
print(f"Loss: {y.item():.4f}")  # Fine for display
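The difference between the two calls can be checked directly. This sketch assumes NumPy is installed alongside PyTorch; variable names are illustrative.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

# Rebuilding a tensor from .item() starts a fresh, unconnected value
rebuilt = torch.tensor(y.item())
print(rebuilt.requires_grad)   # False: no path back to x

# .numpy() on a tensor that is part of a graph raises instead of detaching
try:
    y.numpy()
    numpy_raised = False
except RuntimeError:
    numpy_raised = True

arr = y.detach().numpy()       # explicit detach makes the intent visible
print(numpy_raised, arr)
```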

Gotcha 3: Double backward needs create_graph=True

If you want gradients of gradients (second-order derivatives, used in meta-learning or some regularization), the first backward pass must itself be differentiable. Pass create_graph=True to make the backward pass build its own computation graph.

python
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3  # dy/dx = 3x², d²y/dx² = 6x

# First derivative with create_graph so we can differentiate again
grad = torch.autograd.grad(y, x, create_graph=True)[0]
print(grad)  # tensor(12., grad_fn=MulBackward0) — it's a graph node!

# Second derivative
grad2 = torch.autograd.grad(grad, x)[0]
print(grad2)  # tensor(12.) — 6x = 6*2 = 12 ✓
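For contrast, here is a sketch of what happens when create_graph is left out: the first derivative comes back as a plain tensor with no history, so asking for its derivative fails.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# Without create_graph the first derivative has no grad_fn
g_plain = torch.autograd.grad(y, x, retain_graph=True)[0]
# With create_graph it is itself part of a graph
g_graph = torch.autograd.grad(y, x, create_graph=True)[0]
print(g_plain.requires_grad, g_graph.requires_grad)

try:
    torch.autograd.grad(g_plain, x)
    second_failed = False
except RuntimeError:
    second_failed = True

second = torch.autograd.grad(g_graph, x)[0]   # 6x
print(second_failed, second)
```

Note that the first and second derivatives both happen to equal 12 at x = 2 (3x² and 6x), so checking `requires_grad` is a clearer signal than comparing values.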

Gotcha 4: Memory — graphs hold ALL intermediate tensors

The computation graph keeps references to every intermediate tensor (needed for backward). For a 50-layer network processing a large batch, this means 50 layers worth of activations sitting in GPU memory. This is often the main memory bottleneck in training.

The solution: gradient checkpointing (also called activation checkpointing). Instead of storing all intermediates, only store some "checkpoints." During backward, recompute the missing intermediates from the nearest checkpoint. Trades compute for memory.

To understand the memory problem concretely: a GPT-3-sized model (175B params) with batch size 1 and sequence length 2048 stores on the order of 200GB of activations during a forward pass (a rough estimate; the exact figure depends on the implementation). That's just the intermediate tensors the graph keeps alive for backward, on top of roughly 350GB for the parameters themselves in fp16. Gradient checkpointing can cut the activation memory several-fold (to something like ~40GB in this example) at the cost of roughly one extra forward pass, about a third more compute per training step.

python
from torch.utils.checkpoint import checkpoint

# Without checkpointing: stores ALL layer outputs in memory
def forward_normal(x, layers):
    for layer in layers:
        x = layer(x)  # Each output stored for backward
    return x

# With checkpointing: keeps each layer's input, recomputes its internals during backward
def forward_checkpoint(x, layers):
    for layer in layers:
        x = checkpoint(layer, x, use_reentrant=False)
    return x
The tradeoff: Gradient checkpointing cuts activation memory (by how much depends on how many checkpoints you keep) at the cost of one extra forward pass through the checkpointed segments, typically around 30% more compute per step. For training very large models on limited GPUs, this tradeoff is worth it, and large transformer training runs commonly use it.
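Checkpointing should change memory use, not results. The sketch below checks that on a tiny CPU model; the layer sizes and the `run` helper are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(8, 8), nn.Tanh()) for _ in range(4)]
)
x = torch.randn(2, 8)

def run(use_checkpoint):
    # Clear old gradients so each run is measured on its own
    for p in layers.parameters():
        p.grad = None
    h = x
    for layer in layers:
        if use_checkpoint:
            h = checkpoint(layer, h, use_reentrant=False)
        else:
            h = layer(h)
    h.sum().backward()
    return [p.grad.clone() for p in layers.parameters()]

grads_plain = run(use_checkpoint=False)
grads_ckpt = run(use_checkpoint=True)
same = all(torch.allclose(a, b) for a, b in zip(grads_plain, grads_ckpt))
print(same)
```

The equality only holds this cleanly for deterministic layers; with dropout or other randomness, the recomputed forward must reproduce the same random state, which `checkpoint` handles by default through its RNG-state preservation.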

Gotcha 6: Non-contiguous tensors and views

PyTorch tensors can be views of other tensors (sharing the same memory). Operations like .T, .view(), slicing, and (when no copy is needed) .reshape() create views. The autograd graph tracks through views correctly, but an in-place modification to a view also modifies the tensor it was taken from, so the in-place pitfalls from Gotcha 1 apply to the base tensor as well:

python
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = x[0]       # y is a VIEW of x's first row
z = y * 2       # Operation on the view
z.sum().backward()

print(x.grad)   # tensor([[2., 2.], [0., 0.]]) — gradient only in first row

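# CAREFUL: in-place on a view also modifies the parent
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
z = y[0:2]     # View of y (shares memory)
z.mul_(3)      # In-place on view → y changes too, and autograd rewrites y's grad_fn
# Gradients stay correct in this simple case, but if an earlier op had saved
# y's old values for backward, backward would raise the in-place RuntimeError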
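A quick check of how two view cases behave in practice, as observed in recent PyTorch releases (older versions may differ). The first edits a view of an intermediate tensor; the second tries the same on a view of a leaf that requires grad.

```python
import torch

# In-place on a view of a NON-leaf tensor: autograd tracks the edit
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
z = y[0:2]
z.mul_(3)                  # y is now [6*x0, 6*x1, 2*x2]
y.sum().backward()
print(y.detach(), x.grad)

# In-place on a view of a LEAF that requires grad is rejected outright
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
try:
    w[0:2].mul_(3)
    leaf_raised = False
except RuntimeError:
    leaf_raised = True
print(leaf_raised)
```

The practical risk is less about silent corruption and more about surprise: editing `z` changed `y`, and any operation that had saved `y`'s old values would fail at backward time.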

Gotcha 7: Autograd and mixed precision

When training in fp16, small gradients can underflow to zero (bf16 shares fp32's exponent range, so it usually does not need this fix). PyTorch's GradScaler multiplies the loss by a large number before backward (so gradients stay in representable range), then divides the gradients back down before the optimizer step:

python
from torch.amp import GradScaler, autocast

scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    with autocast(device_type='cuda'):
        output = model(batch)
        loss = loss_fn(output)
    scaler.scale(loss).backward()  # Scaled loss → scaled gradients
    scaler.step(optimizer)          # Unscales then steps
    scaler.update()                 # Adjusts scale factor
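The underflow itself is easy to reproduce on CPU. This sketch uses a made-up gradient value of 1e-8 and a scale of 2^16, which I believe matches GradScaler's default initial scale; treat both numbers as illustrative.

```python
import torch

tiny_grad = torch.tensor(1e-8)     # a small fp32 gradient value
scale = 2.0 ** 16                  # loss scale applied before backward

as_fp16 = tiny_grad.to(torch.float16)                  # underflows to 0
scaled_fp16 = (tiny_grad * scale).to(torch.float16)    # survives in fp16
recovered = scaled_fp16.float() / scale                # unscale in fp32
as_bf16 = tiny_grad.to(torch.bfloat16)                 # wider exponent range

print(as_fp16.item(), recovered.item(), as_bf16.item())
```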
Memory vs Compute Tradeoff

Compare memory usage with and without gradient checkpointing for networks of varying depth.


Gotcha 5: Leaf tensor reassignment

If you reassign a leaf tensor with an operation, it stops being a leaf:

python
x = torch.tensor(3.0, requires_grad=True)
print(x.is_leaf)  # True

x = x * 2  # x is now a NEW tensor (result of multiplication)
print(x.is_leaf)  # False! .grad won't be stored here

# This is why optimizer.step() uses torch.no_grad():
# It needs to modify parameters WITHOUT creating graph nodes
with torch.no_grad():
    param -= lr * param.grad  # In-place update, stays as leaf
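A runnable version of that manual update, as a sketch with an arbitrary two-element parameter and learning rate:

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()                # w.grad = 2w

lr = 0.1
with torch.no_grad():
    w -= lr * w.grad           # in-place and untracked: w stays the same leaf
w.grad = None                  # clear before the next backward

print(w, w.is_leaf, w.requires_grad)
```

Writing `w = w - lr * w.grad` instead would rebind the name to a new non-leaf tensor, which is exactly the reassignment problem described above.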
Gotcha | Symptom | Fix
In-place ops | RuntimeError or wrong gradients | Use out-of-place: y = y + 1, not y.add_(1)
.item() / .numpy() | Gradient is None or missing | Only use for logging, not computation
No create_graph | Can't compute second derivatives | Pass create_graph=True to first backward
Memory blowup | OOM on large models | Gradient checkpointing
Leaf vs non-leaf | .grad is None for intermediates | Use .retain_grad() or torch.autograd.grad()
Leaf reassignment | Parameter no longer receives .grad | Update in-place under torch.no_grad()
Debugging tip: When gradients are unexpectedly None or zero, use this checklist: (1) Does the tensor have requires_grad=True? (2) Is it a leaf tensor? (3) Is there a connected path from the loss to this tensor? (4) Did you accidentally detach or use .item() somewhere? (5) Did an in-place op corrupt the graph? PyTorch's torch.autograd.set_detect_anomaly(True) can help catch issues at the point they occur.
Why do in-place operations break autograd?

Chapter 9: Mastery & Connections

You now understand autograd from the inside out. You know how computation graphs are built dynamically, how the backward pass walks them in reverse topological order, how gradients accumulate, and how to control the process. Let's consolidate everything.

Complete Autograd Cheat Sheet

Function / Concept | What it does | When to use
requires_grad=True | Enables gradient tracking for a tensor | Model parameters, inputs you want to differentiate
.backward() | Computes gradients for all leaf tensors | After computing loss, before optimizer.step()
.grad | Accumulated gradient (tensor attribute) | Access after backward(), zero before next backward
.zero_grad() | Resets .grad (to None by default in recent PyTorch) | Before each backward pass (unless accumulating)
.detach() | Returns a graph-disconnected tensor that shares the same data (not a copy) | Freezing encoders, stop-gradient, targets
torch.no_grad() | Context: disables all tracking | Inference, evaluation, manual param updates
retain_graph=True | Keeps graph after backward | Multiple backward passes, RL policy + value
create_graph=True | Makes backward differentiable | Higher-order gradients, meta-learning (MAML)
.retain_grad() | Saves grad for non-leaf tensor | Debugging intermediate gradients
torch.autograd.grad() | Functional gradient API | When you want grads without .backward()
torch.autograd.Function | Custom forward + backward | Novel ops, straight-through, custom gradients
checkpoint() | Recompute instead of store activations | Memory-limited training of deep models
.grad_fn | Pointer to the creating operation | Debugging, inspecting the graph structure

Derivation Challenge: Backward for Softmax

As a final test of your understanding, let's derive the backward pass for softmax. Given input z (vector of length n), softmax produces:

$$s_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$

The Jacobian is: $\partial s_i / \partial z_j = s_i(\delta_{ij} - s_j)$, where $\delta_{ij}$ is 1 if $i = j$, else 0.

In matrix form: $J = \operatorname{diag}(s) - s s^T$.

Given incoming gradient g (that is, ∂L/∂s arriving from the loss above), the gradient with respect to the input z is:

$$\frac{\partial L}{\partial z} = J^T g = s \odot g - s \, (s \cdot g)$$

Where ⊙ is elementwise multiplication and (s · g) is the dot product (a scalar). This is why softmax backward is O(n), not O(n²) — you never need to materialize the full Jacobian.

Let's derive this step by step. Start with the Jacobian entry ∂si/∂zj:
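The two cases follow from the quotient rule. Written out, with $S = \sum_k e^{z_k}$ as shorthand for the normalizing sum (a symbol introduced here for brevity, not used elsewhere in the chapter):

```latex
\begin{aligned}
i = j:&\quad \frac{\partial s_i}{\partial z_i}
  = \frac{e^{z_i} S - e^{z_i} e^{z_i}}{S^2} = s_i (1 - s_i) \\
i \neq j:&\quad \frac{\partial s_i}{\partial z_j}
  = \frac{0 \cdot S - e^{z_i} e^{z_j}}{S^2} = -\, s_i s_j
\end{aligned}
```

The Kronecker delta simply merges these two lines into a single expression.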

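Combined: $\partial s_i / \partial z_j = s_i(\delta_{ij} - s_j)$. Now multiply by the incoming gradient vector g:

$$\left(\frac{\partial L}{\partial z}\right)_j = \sum_i g_i \, s_i(\delta_{ij} - s_j) = g_j s_j - s_j \sum_i g_i s_i = s_j \left(g_j - \langle s, g \rangle\right)$$

Which gives us the vector form: $\partial L / \partial z = s \odot (g - \langle s, g \rangle)$. This is equivalent to our earlier formulation and requires only O(n) computation: one dot product and one elementwise multiply.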

python
class MySoftmax(torch.autograd.Function):
    @staticmethod
    def forward(ctx, z):
        s = torch.exp(z - z.max())  # subtract max for numerical stability
        s = s / s.sum()
        ctx.save_for_backward(s)
        return s

    @staticmethod
    def backward(ctx, grad_output):
        s, = ctx.saved_tensors
        # J^T @ g = s * g - s * (s @ g)
        dot = (s * grad_output).sum()
        return s * grad_output - s * dot

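# Verify against PyTorch's built-in
z1 = torch.randn(5, requires_grad=True)
z2 = z1.detach().clone().requires_grad_(True)
g = torch.randn(5)  # random upstream gradient (a plain sum would give all-zero grads)
torch.softmax(z1, dim=0).backward(g)
MySoftmax.apply(z2).backward(g)
print(torch.allclose(z1.grad, z2.grad, atol=1e-6))  # True: both give same gradients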

Implementation Challenge: Backward for LayerNorm

For an extra challenge, think about implementing backward for layer normalization. The forward is:

$y_i = \frac{x_i - \mu}{\sigma} \cdot \gamma + \beta$, where $\mu = \operatorname{mean}(x)$, $\sigma = \operatorname{std}(x)$

The tricky part: μ and σ both depend on ALL elements of x. So ∂yi/∂xj is non-zero even when i ≠ j (because changing xj changes the mean and std that affect yi). This cross-dependency makes the Jacobian dense, but like softmax, it can be applied in O(n) time without materializing the full matrix.

python
# Sketch of LayerNorm backward (the key insight)
# Given: x_hat = (x - mean) / std, and g_hat = grad * gamma (upstream grad scaled by gamma)
# dL/dx_i = (1/std) * (g_hat_i - mean(g_hat) - x_hat_i * mean(g_hat * x_hat))
# All O(n) operations: no n×n Jacobian needed!
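The formula can be checked against autograd on a single vector. This sketch assumes a per-element gamma, the biased variance, and a small eps inside the square root (as nn.LayerNorm does); the size and seed are arbitrary.

```python
import torch

torch.manual_seed(0)
n, eps = 6, 1e-5
x = torch.randn(n, dtype=torch.double, requires_grad=True)
gamma = torch.randn(n, dtype=torch.double)
beta = torch.randn(n, dtype=torch.double)
g = torch.randn(n, dtype=torch.double)   # upstream gradient dL/dy

# Reference: let autograd differentiate the forward pass
mean = x.mean()
std = torch.sqrt(x.var(unbiased=False) + eps)
y = (x - mean) / std * gamma + beta
(auto_grad,) = torch.autograd.grad(y, x, grad_outputs=g)

# Manual O(n) backward: three means and a few elementwise ops
with torch.no_grad():
    x_hat = (x - mean) / std
    g_hat = g * gamma                    # gradient with respect to x_hat
    manual_grad = (g_hat - g_hat.mean() - x_hat * (g_hat * x_hat).mean()) / std

print(torch.allclose(auto_grad, manual_grad))
```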

Where Autograd Connects

Autograd is the foundation that everything else in deep learning training is built on:

If you want to go deeper into PyTorch internals, here are the key source files:

That last file is remarkable — every backward formula PyTorch knows is declared in a single YAML file. Code generation turns these declarations into C++ implementations. Want to know how PyTorch computes the gradient for any operation? Look it up in derivatives.yaml.

Historical Note: From Backpropagation to Autograd

Backpropagation (reverse-mode autodiff for neural networks) was popularized by Rumelhart, Hinton, and Williams in 1986, though the mathematics was discovered independently multiple times before that. The key insight was applying the chain rule to networks with many layers — showing that you could efficiently compute gradients for ALL weights in a single backward sweep.

For decades, researchers hand-implemented backward passes for each new architecture. Theano (2010) was among the first widely used frameworks to automate this with a static graph compiler. PyTorch (2017), following earlier define-by-run libraries such as Chainer, popularized dynamic graphs (recording operations at runtime), which made debugging natural (use print, use pdb, use if-statements) and research faster. Today's PyTorch autograd engine handles thousands of operations with hand-tuned CUDA kernels, vmap support for batched gradients, and torch.compile for fusion.

The evolution continues: torch.func (functorch) brings functional transformations like vmap (vectorized map), grad (functional gradient), and jacrev/jacfwd (full Jacobian computation) as composable transforms. These build on autograd but add new capabilities like per-sample gradients without a for-loop.

python
import torch
from torch.func import grad, vmap, jacrev

# Functional gradient (no .backward needed)
def f(x): return (x ** 3).sum()
gradient_fn = grad(f)
print(gradient_fn(torch.tensor([1.0, 2.0, 3.0])))  # [3, 12, 27]

# Per-sample gradients (useful for differential privacy)
def loss_fn(params, x, y):
    return ((params @ x - y) ** 2).sum()
per_sample_grad = vmap(grad(loss_fn), in_dims=(None, 0, 0))

# Full Jacobian (for small functions)
def g(x): return torch.stack([x[0]**2 + x[1], x[0]*x[1]])
J = jacrev(g)(torch.tensor([2.0, 3.0]))
print(J)  # [[4, 1], [3, 2]] — the 2×2 Jacobian matrix
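The per-sample gradient transform is only defined above, never called. Here is a sketch of calling it on random data and comparing against the closed-form gradient for this particular loss; the shapes (4 samples, 3 features) are arbitrary.

```python
import torch
from torch.func import grad, vmap

def loss_fn(params, x, y):
    return ((params @ x - y) ** 2).sum()

torch.manual_seed(0)
params = torch.randn(3)
X = torch.randn(4, 3)      # 4 samples, 3 features each
Y = torch.randn(4)

# Map over the sample dimension of X and Y; params is shared (None)
per_sample_grad = vmap(grad(loss_fn), in_dims=(None, 0, 0))
grads = per_sample_grad(params, X, Y)    # one gradient row per sample

# Closed form for this loss: 2 * (params·x - y) * x
expected = 2 * (X @ params - Y).unsqueeze(1) * X
print(grads.shape, torch.allclose(grads, expected, atol=1e-6))
```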
You've built autograd from scratch. You understand computation graphs, the chain rule applied as backward traversal, gradient accumulation, graph control (detach/no_grad/retain), custom functions, and the gotchas. You can now read PyTorch's C++ autograd engine source code and understand what it does — it's the same algorithm, just faster.

Quick Reference: Common Backward Formulas

Operation | Forward: y = | Backward: ∂L/∂x =
Add | a + b | ∂L/∂a = g, ∂L/∂b = g
Multiply | a · b | ∂L/∂a = b·g, ∂L/∂b = a·g
Power | $x^n$ | $n \cdot x^{n-1} \cdot g$
Exp | $e^x$ | $e^x \cdot g = y \cdot g$
Log | ln(x) | g / x
ReLU | max(0, x) | g · (x > 0)
Sigmoid | σ(x) | g · σ(x)(1-σ(x)) = g·y(1-y)
Tanh | tanh(x) | g · (1 - tanh²(x)) = g·(1-y²)
MatMul | A @ B | $\partial L/\partial A = g B^T$, $\partial L/\partial B = A^T g$
Sum | ∑x | g · ones (broadcast)
Mean | mean(x) | g / n (broadcast)

In this table, g always means the incoming gradient (∂L/∂y). Notice that exp and sigmoid/tanh can reuse the forward output y in their backward — no need to save the input! This is a common optimization: save the output instead of the input when the backward formula allows it.
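As a sketch of the save-the-output optimization, here is an exp function whose backward never looks at its input. `MyExp` is an illustrative name; the check compares its gradient with the known derivative exp(x).

```python
import torch

class MyExp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = torch.exp(x)
        ctx.save_for_backward(y)    # save the OUTPUT; the input is not needed
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * y      # d exp(x)/dx = exp(x) = y

x = torch.tensor([0.0, 1.0, -2.0], requires_grad=True)
y = MyExp.apply(x)
y.sum().backward()
print(torch.allclose(x.grad, torch.exp(x.detach())))
```

Saving y rather than x avoids recomputing the exponential in backward and lets the input be freed earlier if nothing else needs it.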

The Complete Training Loop, Annotated

Here's every piece of a training loop annotated with what autograd is doing at each step:

python
import torch
import torch.nn as nn

# Model parameters are leaf tensors with requires_grad=True
model = nn.TransformerEncoderLayer(d_model=512, nhead=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for batch_idx, (tokens, labels) in enumerate(dataloader):

    # 1. ZERO GRADIENTS
    # Autograd: clear .grad on all parameters (set to None or zero)
    optimizer.zero_grad()

    # 2. FORWARD PASS
    # Autograd: builds computation graph, node by node
    # Each nn.Linear, LayerNorm, attention, etc. creates grad_fn nodes
    # All intermediate activations are stored (needed for backward)
    logits = model(tokens)  # Graph: tokens → ... → logits

    # 3. COMPUTE LOSS
    # Autograd: adds a few more nodes (softmax, nll, reduction)
    loss = loss_fn(logits, labels)  # Graph: ... → logits → loss (scalar)

    # 4. BACKWARD PASS
    # Autograd: topological sort, then walk backward
    # Each node computes local_grad × incoming_grad
    # Results accumulate into param.grad for each leaf parameter
    # After backward, graph is FREED (intermediates released)
    loss.backward()

    # 5. OPTIMIZER STEP
    # NOT autograd! Optimizer reads .grad, applies update rule
    # Adam: m = β1*m + (1-β1)*grad, v = β2*v + (1-β2)*grad²
    # param -= lr * m_hat / (√v_hat + ε)
    # Happens inside torch.no_grad() to avoid building a graph for updates
    optimizer.step()

    # 6. LOGGING (optional)
    # .item() extracts scalar for display — disconnected from graph
    if batch_idx % 100 == 0:
        print(f"Step {batch_idx}: loss = {loss.item():.4f}")

Every modern training framework (HuggingFace Trainer, PyTorch Lightning, DeepSpeed) is built on exactly this loop. They add distribution, mixed precision, logging, and checkpointing — but the core autograd flow (zero → forward → loss → backward → step) is always the same.

Debugging Autograd Issues

When things go wrong, PyTorch provides tools to diagnose autograd problems:

python
# 1. Anomaly detection: pinpoints where gradients become NaN/Inf
torch.autograd.set_detect_anomaly(True)
# Now if backward produces NaN, you get a stack trace to the FORWARD op

# 2. Gradient checking: numerically verify your custom backward
from torch.autograd import gradcheck
x = torch.randn(3, dtype=torch.double, requires_grad=True)
result = gradcheck(MyCustomFunction.apply, (x,), eps=1e-6)
# Returns True if analytical gradient matches numerical approximation

# 3. Profiling: see which backward ops take the most time
with torch.profiler.profile(with_stack=True) as prof:
    loss.backward()
print(prof.key_averages().table(sort_by="cpu_time_total"))

# 4. Graph visualization: export graph to graphviz DOT format
from torchviz import make_dot
dot = make_dot(loss, params={"W": W, "b": b})
dot.render("computation_graph")  # Saves PDF

The gradcheck function is essential when writing custom autograd functions. It computes numerical derivatives (using finite differences) and compares them against your analytical backward. If they disagree by more than a tolerance, your backward has a bug. Always test with double precision (float64) since float32's numerical errors can cause false failures.
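To make that concrete, the sketch below runs gradcheck on two made-up functions: one whose backward drops a factor of 2, and a corrected one. Passing `raise_exception=False` makes gradcheck return False instead of raising.

```python
import torch
from torch.autograd import gradcheck

class BuggySquare(torch.autograd.Function):
    """y = x * x, but backward forgets the factor of 2."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * x          # BUG: should be 2 * x

class FixedSquare(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x

x = torch.tensor([0.5, 1.0, 2.0], dtype=torch.double, requires_grad=True)
buggy_ok = gradcheck(BuggySquare.apply, (x,), raise_exception=False)
fixed_ok = gradcheck(FixedSquare.apply, (x,), raise_exception=False)
print(buggy_ok, fixed_ok)
```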

One more critical debugging technique: gradient hooks. You can register a function that gets called every time a tensor's gradient is computed, letting you inspect, modify, or log gradients as they flow:

python
# Register a hook to inspect gradients during backward
def print_grad(name):
    def hook(grad):
        print(f"{name}: grad shape={grad.shape}, norm={grad.norm():.4f}")
        return grad  # Return modified grad, or None to keep original
    return hook

for name, param in model.named_parameters():
    param.register_hook(print_grad(name))

# Now during backward, you'll see every parameter's gradient info
loss.backward()  # Prints gradient stats for every parameter
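Hooks can also modify gradients, and they stay registered until removed. A minimal sketch with a clipping hook on a single tensor (the values are arbitrary):

```python
import torch

w = torch.tensor([3.0, -4.0], requires_grad=True)

# A hook that returns a tensor REPLACES the gradient; here it clips to [-1, 1]
handle = w.register_hook(lambda grad: grad.clamp(-1.0, 1.0))
(w ** 2).sum().backward()
clipped = w.grad.clone()       # raw gradient 2w = [6, -8], clipped by the hook

handle.remove()                # detach the hook again
w.grad = None
(w ** 2).sum().backward()
unclipped = w.grad.clone()
print(clipped, unclipped)
```

Forgetting to call `handle.remove()` on a debugging hook is a common source of duplicated log lines and slow training.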

How PyTorch autograd differs from our micrograd

Our engine works on scalars. PyTorch operates on tensors (multi-dimensional arrays). This changes the backward functions: instead of simple scalar derivatives, each operation computes a vector-Jacobian product (VJP). For example, matrix multiply C = A @ B has backward:

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C} B^T, \qquad \frac{\partial L}{\partial B} = A^T \frac{\partial L}{\partial C}$$

This is why PyTorch backward functions receive grad_output (the "vector" in VJP) and return the result of multiplying it by the local Jacobian. The Jacobian itself is never materialized; only its product with the incoming gradient is computed.
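The matrix-multiply rule can be checked numerically. In this sketch G is a random stand-in for the gradient arriving from above; shapes and seed are arbitrary.

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 4, requires_grad=True)
B = torch.randn(4, 2, requires_grad=True)
C = A @ B
G = torch.randn_like(C)           # stands in for dL/dC

C.backward(G)                     # autograd applies the backward rule for @

manual_dA = G @ B.detach().T      # dL/dA = dL/dC @ B^T
manual_dB = A.detach().T @ G      # dL/dB = A^T @ dL/dC
print(torch.allclose(A.grad, manual_dA, atol=1e-6),
      torch.allclose(B.grad, manual_dB, atol=1e-6))
```

The full Jacobian of C with respect to A would have 6 × 12 entries; the two matrix products above never build it.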

Our micrograd | PyTorch autograd
Scalar values | Tensor values (any shape)
Python closures | C++ Function objects
Single-threaded | Multi-threaded (parallel backward)
No memory management | Frees intermediates after use
No GPU support | CUDA kernels for every operation
Scalar derivatives | Vector-Jacobian products

Suggested Next Steps

Now that you understand autograd from the inside, here are productive directions to deepen your knowledge:

The progression from here: understanding autograd unlocks understanding of training dynamics (why learning rates matter, why batch norm helps, why residual connections prevent vanishing gradients) and optimization (why Adam works better than SGD for transformers, why warmup helps, why gradient clipping prevents explosions). All of these build directly on the gradient computation you now understand.

"What I cannot create, I do not understand." — Richard Feynman

One Last Insight: Why .backward() Only Works on Scalars

You might have noticed that .backward() is called on a scalar (the loss). Why? Because the gradient of a scalar with respect to a vector is a vector (same shape as the parameters). If you tried to backward a vector output, you'd get a matrix (the Jacobian) for each parameter — that's N×M values instead of N, which explodes memory.

When you DO need to backward a non-scalar, you must provide a gradient argument that specifies which linear combination of the output dimensions you want the gradient for:

python
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # Vector output: [1, 4, 9]

# Can't just do y.backward() — need to specify WHICH gradient
# This computes d(y[0]*1 + y[1]*1 + y[2]*1)/dx = d(sum(y))/dx
y.backward(torch.ones_like(y))
print(x.grad)  # tensor([2., 4., 6.]) — same as (y.sum()).backward()

# Or weight the outputs differently:
x.grad.zero_()
y = x ** 2
y.backward(torch.tensor([1.0, 0.0, 0.0]))  # Only care about y[0]
print(x.grad)  # tensor([2., 0., 0.]) — gradient of just y[0]
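Taking this one step further, one-hot gradient arguments recover the full Jacobian one row at a time. This is a sketch for small outputs only, since it costs one backward pass per output element; it uses torch.autograd.grad so nothing accumulates into x.grad.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

rows = []
for i in range(y.numel()):
    one_hot = torch.zeros_like(y)
    one_hot[i] = 1.0
    # Each call is one vector-Jacobian product: it selects row i of J
    (row,) = torch.autograd.grad(y, x, grad_outputs=one_hot, retain_graph=True)
    rows.append(row)

J = torch.stack(rows)
print(J)   # diagonal matrix with entries 2 * x
```

This row-by-row cost is why training losses are reduced to a scalar first: one backward pass then yields the gradient for every parameter.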
In the softmax backward formula (s ⊙ g - s · dot(s,g)), why don't we need to form the full n×n Jacobian?