How automatic differentiation builds computation graphs and computes all your gradients — without you lifting a pen.
You have a simple function: y = (3x + 2)². Compute dy/dx. Easy — apply the chain rule, get 6(3x + 2). Takes ten seconds with a pencil.
Now imagine you have a neural network. 10 million parameters. 50 layers of matrix multiplications, nonlinearities, normalizations, attention mechanisms. You need the gradient of the loss with respect to every single one of those 10 million parameters. By hand? You'd need a lifetime.
This is the problem automatic differentiation solves. Not symbolic differentiation (which produces enormous expressions). Not numerical differentiation (which is slow and imprecise). Automatic differentiation computes exact gradients by recording every operation you perform and then replaying them backwards, applying the chain rule at each step.
PyTorch's autograd engine is what makes deep learning practical. Without it, training would require manually deriving and implementing gradient formulas for every architecture. With it, you invent any crazy forward computation and PyTorch figures out all the gradients.
Consider what this means in practice. A researcher invents a new attention mechanism with 15 novel mathematical operations. In the old days (pre-autograd), they'd spend weeks deriving gradients by hand, implementing them in C, debugging off-by-one errors in indices. With autograd, they write the forward pass in 20 lines of Python and get perfect gradients instantly. This is why deep learning research accelerated so dramatically after frameworks like PyTorch appeared.
In this lesson, we'll build autograd from scratch. By the end, you'll understand exactly what happens when you call .backward() — no magic, just the chain rule applied systematically by a clever piece of software.
```python
# The simplest possible autograd demo
import torch

x = torch.tensor(1.0, requires_grad=True)  # "Track this!"
y = (3 * x + 2) ** 2                       # Forward: compute y
y.backward()                               # Backward: compute dy/dx
print(x.grad)                              # tensor(30.), the gradient!

# Without autograd, you'd need to derive this by hand:
# y = (3x+2)², dy/dx = 2(3x+2)·3 = 6(3x+2)
# At x=1: 6(5) = 30 ✓
```
Watch operations flow forward (left to right) building the computation graph, then gradients flow backward (right to left). Click Step to advance.
The animation above shows the computation y = (3x + 2)² with x = 1. In the forward pass, values flow left to right: x=1 → 3*1=3 → 3+2=5 → 5²=25. In the backward pass, gradients flow right to left: dy/dy=1 → dy/du=2u=10 → dy/d(3x+2)=10 → dy/dx=10*3=30.
Let's compare three approaches to computing gradients:
| Method | How it works | Cost for N params | Exact? |
|---|---|---|---|
| Symbolic | Algebraically simplify derivative expression | Expression grows exponentially | Yes |
| Numerical | Compute (f(x+h)-f(x))/h for each param | N+1 forward passes | No (rounding errors) |
| Automatic (reverse) | Record ops, replay chain rule backward | 1 forward + 1 backward | Yes |
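Automatic differentiation is the clear winner for neural networks: gradients that are exact up to floating-point rounding, at a cost that does not multiply with the number of parameters. One backward pass costs a small constant multiple of one forward pass, whatever the parameter count. A network with 175 billion parameters (GPT-3) gets all 175 billion gradients in a single backward pass, where finite differences would need 175 billion extra forward passes. That pass is still far more expensive in absolute terms than the backward pass of a 100-parameter network, because the forward pass it mirrors is far more expensive.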
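To make the "Exact?" column concrete, here is a small sketch in plain Python that compares a forward-difference estimate against the hand-derived gradient 6(3x + 2). The helper names (`f`, `analytic_grad`, `forward_difference`) are made up for this illustration and are not part of any library.

```python
def f(x):
    return (3 * x + 2) ** 2

def analytic_grad(x):
    # Hand-derived with the chain rule: dy/dx = 6(3x + 2)
    return 6 * (3 * x + 2)

def forward_difference(func, x, h):
    # Numerical estimate: (f(x + h) - f(x)) / h
    return (func(x + h) - func(x)) / h

x0 = 1.0
exact = analytic_grad(x0)
for h in (1e-1, 1e-4, 1e-8, 1e-12):
    approx = forward_difference(f, x0, h)
    print(f"h={h:.0e}  approx={approx:.6f}  abs error={abs(approx - exact):.2e}")
```

A large h gives a visibly wrong answer (truncation error), while a very small h runs into floating-point cancellation, so there is no step size that makes the estimate exact. On top of that, every parameter needs its own extra forward pass.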
That's autograd in a nutshell. Let's learn how it actually works.
When you write PyTorch code like y = x * 3 + 2, something invisible happens behind the scenes. PyTorch doesn't just compute the result — it builds a computation graph. This graph records what operations were performed and in what order, creating a complete audit trail from inputs to output.
A computation graph is a directed acyclic graph (DAG). Each node represents either a tensor (data) or an operation (function). Edges connect inputs to operations to outputs. "Directed" means edges have a direction (input → output). "Acyclic" means no loops — you can't feed an output back to its own input (that would create a time paradox for gradients).
Why "acyclic"? Because autograd needs to traverse the graph in a definite order. If there were a cycle (A depends on B, B depends on A), there's no valid starting point for the backward pass. Recurrent neural networks appear to have cycles, but they're actually unrolled into a DAG — each timestep is a separate node.
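The unrolling idea can be seen in a toy example. The sketch below is not a real RNN: it reuses one scalar weight `w` for three "timesteps", with no inputs or nonlinearity, purely to show that a Python loop produces a chain of separate nodes rather than a cycle.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # shared "recurrent" weight
h = torch.tensor(1.0)                      # initial hidden state h0

for t in range(3):
    h = w * h   # each iteration adds a NEW multiply node to the graph

# After three steps h = w**3 * h0, so dh/dw = 3 * w**2
h.backward()
print(h)       # value 8.0
print(w.grad)  # tensor(12.)
```

The weight appears three times in the unrolled graph, and autograd sums the three contributions into `w.grad`.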
The key flag is requires_grad=True. When you create a tensor with this flag, PyTorch knows to track every operation involving that tensor. Without it, operations proceed normally but no graph is built — saving memory and compute when you don't need gradients (like during inference).
The memory overhead of tracking is significant. Every tracked operation allocates a grad_fn node, stores references to its inputs, and may save intermediate values for backward. For a typical transformer layer with ~20 operations, that's ~20 additional graph nodes per forward pass, and the tensors those nodes keep alive grow with batch size and sequence length. This is why inference code typically runs under torch.no_grad().
```python
import torch

# This tensor is tracked: autograd builds a graph
x = torch.tensor([3.0], requires_grad=True)

# Every operation on x creates graph nodes
a = x * 2    # MulBackward node created
b = a + 5    # AddBackward node created
c = b ** 2   # PowBackward node created

# The graph remembers: c came from b, b from a, a from x
print(c.grad_fn)                 # PowBackward0
print(c.grad_fn.next_functions)  # shows the chain
```
Every tensor produced by an operation has a grad_fn attribute — a pointer back to the operation that created it. This forms a linked list (really a DAG) from the output all the way back to the leaf tensors (the ones you created directly with requires_grad=True).
Leaf tensors (like your model parameters) have grad_fn = None because they weren't produced by any operation — they're the starting points. But they have requires_grad = True, which tells autograd: "I want gradients with respect to this tensor."
PyTorch builds this graph dynamically, while your Python code runs, so the graph can be different on every forward pass. A model that uses an if-statement based on the input value will produce different graphs for different inputs. Autograd handles this without any special support, because it just records what actually happened:
```python
import torch

def weird_function(x):
    # Different computation paths based on x's value!
    if x.item() > 0:
        return x ** 2        # Positive: square it
    else:
        return -x * 3 + 1    # Zero or negative: linear

x = torch.tensor(2.0, requires_grad=True)
y = weird_function(x)
y.backward()
print(x.grad)  # tensor(4.) (d(x²)/dx = 2x = 4, took the positive branch)

x = torch.tensor(-1.0, requires_grad=True)
y = weird_function(x)
y.backward()
print(x.grad)  # tensor(-3.) (d(-3x+1)/dx = -3, took the negative branch)
```
Click Add Op to add operations one at a time. Watch the computation graph grow. Each node shows the operation and current value.
Notice how each new operation creates a new node that points back to its inputs. This backward-pointing structure is what makes the backward pass possible — we can start at the output and follow the pointers back to compute gradients for every intermediate node.
One subtle but important point: the graph is ephemeral by default. After you call .backward(), PyTorch destroys the graph to free memory. If you need to backward through the same graph twice (rare, but it happens), you must pass retain_graph=True. We'll cover this in Chapter 5.
Let's inspect the graph structure programmatically. Every tensor has a grad_fn that points to its creating operation, and grad_fn.next_functions that points to the previous nodes:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
a = x ** 2
b = a * 3
c = b + 1

# Walk the graph backward from c
print(c.grad_fn)                       # AddBackward0
print(c.grad_fn.next_functions)        # ((MulBackward0, 0), ...)
print(c.grad_fn.next_functions[0][0])  # MulBackward0 (= b's grad_fn)
print(b.grad_fn.next_functions[0][0])  # PowBackward0 (= a's grad_fn)
print(a.grad_fn.next_functions[0][0])  # AccumulateGrad (= leaf node, x)

# Leaf tensors have no grad_fn (they're starting points)
print(x.grad_fn)  # None
print(x.is_leaf)  # True
```
nn.Parameter objects are leaves. Intermediate tensors (produced by operations) are non-leaf. By default, PyTorch only stores .grad for leaf tensors; that's where your parameter updates come from. If you need gradients for intermediate tensors (for debugging), call .retain_grad() on them before backward.

Autograd isn't magic. It's the chain rule from calculus, applied systematically. If you understand the chain rule, you understand autograd. Let's make sure that's rock solid.
The chain rule says: if y = f(g(x)), then dy/dx = f'(g(x)) · g'(x). In words: the derivative of a composition is the product of the derivatives at each stage. You multiply the "local gradients" along the path from output to input.
Let's work a concrete example by hand. Take y = (3x + 2)². Let's decompose this into simple steps:

- u = 3x + 2 (the inner function)
- y = u² (the outer function)

Now compute each local derivative:

- du/dx = 3
- dy/du = 2u
Chain rule gives us: dy/dx = dy/du · du/dx = 2u · 3 = 6u = 6(3x + 2).
At x = 1: u = 3(1) + 2 = 5, so dy/dx = 6 · 5 = 30.
Let's verify with PyTorch:
```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = (3 * x + 2) ** 2
y.backward()
print(x.grad)  # tensor(30.) ✓ matches our hand calculation!
```
Now let's extend to a longer chain. Consider y = sin(exp(x²)). Three operations:

- a = x²
- b = exp(a)
- y = sin(b)
Chain rule: dy/dx = dy/db · db/da · da/dx = cos(exp(x²)) · exp(x²) · 2x.
At x = 0.5: a = 0.25, b = exp(0.25) ≈ 1.284, dy/dx = cos(1.284) · 1.284 · 1.0 ≈ 0.283 · 1.284 · 1.0 ≈ 0.363.

Let's verify this longer chain with PyTorch too:

```python
import math
import torch

x = torch.tensor(0.5, requires_grad=True)
a = x ** 2
b = torch.exp(a)
y = torch.sin(b)
y.backward()
print(f"dy/dx = {x.grad:.4f}")  # 0.3632 ✓ matches hand calculation

# Manual verification
manual = math.cos(math.exp(0.25)) * math.exp(0.25) * 2 * 0.5
print(f"manual = {manual:.4f}")  # 0.3632 ✓
```
Notice the pattern: each node only needs to know its own local derivative. Node "sin" knows that d(sin(b))/db = cos(b). Node "exp" knows that d(exp(a))/da = exp(a). They don't need to know what came before them. Autograd just multiplies these local derivatives along the chain.
Adjust x with the slider. Watch the local gradients multiply along the chain for y = (3x + 2)².
The key realization: no matter how deep the chain goes (50 layers? 100?), the math is the same — multiply local derivatives along the path. Autograd just automates this multiplication across potentially millions of paths in a computation graph.
Let's make this concrete with numbers. A GPT-2 model has ~117 million parameters and one scalar loss output. Forward-mode autodiff would need 117,000,000 passes to get all gradients. Reverse-mode (backpropagation) needs exactly ONE backward pass. The cost difference is 117-million-fold. This is why reverse mode won and why we call it "backpropagation" — it's literally just reverse-mode automatic differentiation applied to neural networks.
When IS forward mode useful? When you have few inputs and many outputs. Computing the Jacobian of a function f: R³ → R¹⁰⁰⁰ is cheaper with forward mode (3 passes) than reverse mode (1000 passes). PyTorch supports forward mode via torch.autograd.forward_ad for these cases.
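For intuition about what forward mode does, here is a minimal sketch using dual numbers in plain Python. The `Dual` class is a hypothetical teaching aid and is not how torch.autograd.forward_ad is implemented: each value carries its derivative with respect to one chosen input, and both are updated together as the computation runs forward.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    value: float   # the ordinary value
    deriv: float   # derivative with respect to the chosen input

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __radd__ = __add__
    __rmul__ = __mul__

def f(x):
    u = 3 * x + 2
    return u * u   # (3x + 2)²

# Seed deriv=1.0 means "differentiate with respect to this input"
out = f(Dual(1.0, 1.0))
print(out.value, out.deriv)  # 25.0 30.0
```

One pass yields the derivative with respect to one input. With N inputs you would need N seeded passes, which is why forward mode suits few-input functions and reverse mode suits a scalar loss over many parameters.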
You've called loss.backward() a thousand times. But what actually happens inside? Let's trace through it step by step.
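When you call .backward() on a tensor, PyTorch does three things:

- It seeds the backward pass with the gradient of the output with respect to itself.
- It walks the graph from the output back toward the leaves, in reverse topological order.
- At each node it multiplies the incoming gradient by the local derivative and passes the result on, accumulating into .grad when it reaches a leaf.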
For a scalar output, the seed gradient defaults to 1.0 (because dL/dL = 1). You can pass a different seed through the gradient argument of .backward() when you want a vector-Jacobian product with a specific vector, and for a non-scalar output that argument is required. This is used in advanced scenarios like GAN training with gradient penalties.
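As a small illustration of a custom seed, the sketch below calls .backward() on a non-scalar output and supplies the seed vector explicitly. The values of `x` and `v` are arbitrary choices for the example.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                          # non-scalar output; Jacobian is diag(2x)
v = torch.tensor([1.0, 0.1, 0.01])  # the seed: one weight per output element

# Computes the vector-Jacobian product v^T J and stores it in x.grad
y.backward(gradient=v)
print(x.grad)  # v * 2x, i.e. approximately [2.0, 0.4, 0.06]
```

Calling `y.backward()` with no argument here would raise an error, because PyTorch only creates the implicit seed of 1.0 for scalar outputs.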
Let's trace through a concrete graph. Consider:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
a = x * 3    # a = 6
b = a + 1    # b = 7
c = b ** 2   # c = 49
c.backward()
```
Forward pass computed: x=2 → a=6 → b=7 → c=49.
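Now the backward pass. We start at c and work backwards:

- Seed: dc/dc = 1
- c = b²: dc/db = 2b · 1 = 14
- b = a + 1: dc/da = 1 · 14 = 14
- a = x * 3: dc/dx = 3 · 14 = 42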
So x.grad = 42. Let's verify: c = (3x+1)², dc/dx = 2(3x+1)·3 = 6(3x+1). At x=2: 6(7) = 42. Correct.
Let's write out the full trace in table form to make it crystal clear:
| Node | Forward Value | Local Gradient | Incoming Grad | Output Grad (local × incoming) |
|---|---|---|---|---|
| c = b² | 49 | dc/db = 2b = 14 | 1 (seed) | 14 → send to b |
| b = a+1 | 7 | db/da = 1 | 14 | 14 → send to a |
| a = x*3 | 6 | da/dx = 3 | 14 | 42 → send to x |
| x (leaf) | 2 | — | 42 | x.grad = 42 (stored) |
This table is the backward pass. Every row follows the same formula: output_grad = local_grad × incoming_grad. The "incoming grad" for each node is the "output grad" of the node above it. The process is completely mechanical — no creativity required, just table-filling.
In PyTorch's source code, this incoming gradient is called grad_output in custom Function classes, and sometimes called the "cotangent" in more mathematical contexts. But the concept is always the same: "what gradient arrived at this node from the nodes it feeds into."
Let's work a slightly more complex example with a branching computation — where one tensor is used in two different operations:
```python
import torch

x = torch.tensor(3.0, requires_grad=True)
a = x ** 2   # a = 9
b = x * 5    # b = 15 (x is used TWICE!)
c = a + b    # c = 24
c.backward()

# x.grad = dc/dx = dc/da · da/dx + dc/db · db/dx
#        = 1·(2x) + 1·5 = 6 + 5 = 11
print(x.grad)  # tensor(11.) ✓
```
When x is used in multiple operations, gradients from ALL paths are summed at x. This is the multivariate chain rule: if the output depends on x through multiple paths, the total derivative is the sum of derivatives along each path. Autograd handles this automatically through gradient accumulation (it shows up as the += in each _backward function of the engine we build later in this lesson).
What about nodes with multiple inputs? Consider c = a * b. During backward:

- a receives grad_output · b (because dc/da = b)
- b receives grad_output · a (because dc/db = a)
Each input gets its own gradient, computed independently. If a node has multiple outputs (its result is used in two places), the gradients from both paths are summed — we'll cover this more in Chapter 4.
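The multiplication rule is easy to check in PyTorch. In this short sketch (the values 2 and 5 are arbitrary), each input of the product ends up with the other input's value as its gradient.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)
c = a * b
c.backward()   # seed is 1.0, so each input gets its local derivative

print(a.grad)  # tensor(5.)  dc/da = b
print(b.grad)  # tensor(2.)  dc/db = a
```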
Watch gradients propagate backward through x → *3 → +1 → ² → c. Click Step Back to advance one node at a time.
```python
# Verify our manual trace
import torch

x = torch.tensor(2.0, requires_grad=True)
a = x * 3
b = a + 1
c = b ** 2
c.backward()
print(f"x.grad = {x.grad}")  # tensor(42.) ✓

# We can also get intermediate gradients with retain_grad()
x = torch.tensor(2.0, requires_grad=True)
a = x * 3
a.retain_grad()
b = a + 1
b.retain_grad()
c = b ** 2
c.backward()
print(f"b.grad = {b.grad}")  # tensor(14.), matches our trace
print(f"a.grad = {a.grad}")  # tensor(14.), matches our trace
print(f"x.grad = {x.grad}")  # tensor(42.), matches our trace
```
Let's also see how this works for vector/matrix operations, since real neural networks operate on tensors, not scalars. Consider a simple linear layer: y = W @ x + b, where W is 2×3, x is 3×1, b is 2×1.
```python
import torch

# Simple linear layer: y = Wx + b
W = torch.randn(2, 3, requires_grad=True)
x = torch.randn(3, 1, requires_grad=True)
b = torch.randn(2, 1, requires_grad=True)

y = W @ x + b    # Shape: (2,1)
loss = y.sum()   # Scalar loss for backward
loss.backward()

print(f"W.grad shape: {W.grad.shape}")  # (2, 3), same as W
print(f"x.grad shape: {x.grad.shape}")  # (3, 1), same as x
print(f"b.grad shape: {b.grad.shape}")  # (2, 1), same as b

# The backward rules for matmul:
# dL/dW = (dL/dy) @ x.T  → shape (2,1) @ (1,3) = (2,3) ✓
# dL/dx = W.T @ (dL/dy)  → shape (3,2) @ (2,1) = (3,1) ✓
# dL/db = dL/dy          → shape (2,1) ✓ (addition passes grad through)
```
Every gradient has the same shape as the tensor it belongs to. The update rule is W -= lr * W.grad, which requires matching shapes. The backward formulas for matmul are specifically constructed to produce the right shapes.

Here's a bug that bites every PyTorch beginner at least once. You run your training loop, and the loss doesn't converge. It oscillates wildly or diverges. You debug for hours. The culprit? You forgot optimizer.zero_grad().
The reason: gradients in PyTorch accumulate by addition. When you call .backward(), the computed gradients are added to whatever is already in the .grad attribute, not replaced. If you don't zero them out between batches, you get the sum of gradients from every batch you've ever processed.
Let's see the bug in action:
```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# First backward: gradient = 2x = 6
y = x ** 2
y.backward()
print(x.grad)  # tensor(6.) ✓

# Second backward WITHOUT zeroing: gradient ADDS
y = x ** 2
y.backward()
print(x.grad)  # tensor(12.) ← 6 + 6, NOT just 6!

# Third backward: keeps piling up
y = x ** 2
y.backward()
print(x.grad)  # tensor(18.) ← 6 + 6 + 6
```
The fix is simple: zero gradients before each backward pass (or before each optimizer step, depending on whether you want accumulation).
```python
# Assumes model, optimizer, loss_fn and dataloader are already defined,
# and that the dataloader yields (inputs, target) pairs.

# The correct training loop pattern
for inputs, target in dataloader:
    optimizer.zero_grad()            # Reset gradients
    output = model(inputs)           # Forward pass
    loss = loss_fn(output, target)   # Compute loss
    loss.backward()                  # Backward pass (accumulates into .grad)
    optimizer.step()                 # Update parameters using .grad

# Gradient accumulation pattern (simulating 4x batch size)
for i, (inputs, target) in enumerate(dataloader):
    output = model(inputs)
    loss = loss_fn(output, target) / 4   # Scale loss
    loss.backward()                      # Accumulate
    if (i + 1) % 4 == 0:
        optimizer.step()                 # Step every 4 batches
        optimizer.zero_grad()            # THEN zero
```
There's another subtle case: when a tensor is used in multiple operations. If x feeds into both a and b, and both contribute to the loss, then during backward the gradients from both paths are summed at x. This is mathematically correct — by the multivariate chain rule, if L = f(a(x), b(x)), then dL/dx = dL/da · da/dx + dL/db · db/dx.
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
a = x * 3    # path 1: uses x
b = x ** 2   # path 2: uses x
c = a + b    # both paths merge here
c.backward()
print(x.grad)  # tensor(7.) = 3 (from a) + 4 (from b, 2x=4)
```
Click Backward multiple times without zeroing. Watch gradients pile up. Then click Zero Grad to reset.
Think of .grad as a bucket. Each .backward() pours water in. .zero_grad() empties the bucket. If you never empty it, it overflows. The bucket metaphor also explains gradient accumulation: pour from 4 small cups before measuring = same as 1 big cup.

Here's a real training scenario showing the difference between correct and buggy code:
```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# === BUGGY VERSION (gradients pile up) ===
for epoch in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    output = model(x)
    loss = loss_fn(output, y)
    loss.backward()    # Gradients ACCUMULATE each iteration!
    optimizer.step()   # Each step uses the running sum, not the fresh gradient
# By iteration 100, .grad holds the sum of 100 batches of gradients

# === CORRECT VERSION ===
for epoch in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    optimizer.zero_grad()   # ← THIS LINE FIXES EVERYTHING
    output = model(x)
    loss = loss_fn(output, y)
    loss.backward()
    optimizer.step()
```
The symptom of forgetting zero_grad() is distinctive: your loss might decrease initially (the first few accumulated gradients point in roughly the right direction), but then starts oscillating or exploding. If you ever see wild loss values after a few iterations, check your zero_grad call first.
optimizer.zero_grad() takes a set_to_none argument, which has defaulted to True since PyTorch 2.0. Instead of filling the gradients with zeros, it sets them to None. This is slightly faster (avoids a memset operation) and briefly uses less memory:

```python
# Slightly more efficient version (default since PyTorch 2.0)
optimizer.zero_grad(set_to_none=True)

# What this does internally:
#   param.grad = None   (vs param.grad.zero_() with set_to_none=False)
#   Next backward will allocate a new grad tensor
# Pro: faster, less memory peak
# Con: code that checks `if param.grad is not None` needs to be careful
```
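The difference between the two modes is easy to observe. The sketch below uses a throwaway nn.Linear model and SGD optimizer (both arbitrary choices for the demonstration) and inspects .grad after each style of reset.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Populate .grad, then reset by zero-filling
model(torch.ones(2, 3)).sum().backward()
optimizer.zero_grad(set_to_none=False)
grad_after_zero_fill = model.weight.grad.clone()
print(grad_after_zero_fill)   # a tensor of zeros with the weight's shape

# Populate .grad again, then reset by dropping the tensor
model(torch.ones(2, 3)).sum().backward()
optimizer.zero_grad(set_to_none=True)
grad_after_set_to_none = model.weight.grad
print(grad_after_set_to_none)  # None
```

Code that reads `param.grad` between the reset and the next backward pass, such as gradient logging or clipping utilities you write yourself, needs to handle the None case.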
Sometimes you need to stop autograd from tracking operations. Maybe you're doing inference and don't want the memory overhead of a computation graph. Maybe you need to freeze part of a network. Maybe you need to compute a value that shouldn't receive gradients. And sometimes you need the opposite: to keep a graph alive longer than usual. PyTorch gives you three tools for controlling this.
torch.no_grad() is a context manager that disables gradient tracking for everything inside it. Operations still execute normally, but no graph is built. This is your go-to for inference and evaluation.
```python
# During inference: no gradients needed, saves memory
with torch.no_grad():
    predictions = model(test_data)
    # No computation graph built: much faster, less memory

# Also useful for manual parameter updates
with torch.no_grad():
    param -= learning_rate * param.grad  # No graph for this update!
```
detach() creates a new tensor that shares the same data but is disconnected from the computation graph. It's like cutting a wire — gradients can't flow through the detached tensor. The detached tensor shares the same underlying memory (it's a view, not a copy), so it's memory-efficient.
This is the workhorse of stop-gradient patterns in deep learning. Any time you want to use a value in a computation but don't want gradients to flow through it, you detach:
```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2       # y is connected to x in the graph
z = y.detach()   # z has same value (9.0) but NO graph connection
w = z * 2        # operations on z don't connect back to x
# w.backward()   # would raise a RuntimeError: w does not require grad

# Common use: freeze encoder, train decoder
# (encoder, decoder and image stand for your own modules and data)
features = encoder(image).detach()  # Stop gradients here
output = decoder(features)          # Only decoder gets gradients
```
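The claim that detach() shares memory rather than copying can be checked directly. This is a small sketch with arbitrary values: it compares storage pointers and then writes through the detached tensor.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2          # non-leaf tensor, part of the graph
z = y.detach()     # same storage, no graph connection

print(z.requires_grad)               # False
print(z.data_ptr() == y.data_ptr())  # True: both point at the same memory

z[0] = 100.0       # in-place write through the detached tensor
print(y[0].item()) # 100.0, y sees the change
```

Because of this sharing, in-place edits to a detached tensor can silently change values that a later backward pass relies on. Add `.clone()` after `.detach()` when you need an independent copy.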
retain_graph=True prevents PyTorch from destroying the computation graph after .backward(). Normally, the graph is freed to reclaim memory. But if you need multiple backward passes through the same graph (e.g., for multiple losses), you need to keep it alive.
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# First backward: keep graph alive
y.backward(retain_graph=True)
print(x.grad)  # tensor(12.) (3x² = 12)

# Second backward through same graph
x.grad.zero_()
y.backward()   # Works because graph was retained
print(x.grad)  # tensor(12.) again

# Without retain_graph=True, the second .backward() would crash
```
A useful alternative to .backward() is torch.autograd.grad(), a functional API that returns the gradients instead of storing them in .grad attributes. Like .backward(), it frees the graph after the call unless you pass retain_graph=True:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x

# Functional gradient: doesn't modify x.grad; graph kept because retain_graph=True
grad_y = torch.autograd.grad(y, x, retain_graph=True)[0]
print(grad_y)  # tensor(8.), dy/dx = 2x + 2 = 8
print(x.grad)  # None, .grad not touched!

# Can call again because the first call retained the graph
grad_y2 = torch.autograd.grad(y, x)[0]  # Now graph is consumed
print(grad_y2)  # tensor(8.)
```
Here's a concrete example where retain_graph is necessary — a GAN-style setup with two losses sharing a generator:
```python
import torch

# Two different losses that share computation
x = torch.tensor(2.0, requires_grad=True)
shared = x ** 2      # Shared computation
loss1 = shared * 3   # First loss branch
loss2 = shared + 5   # Second loss branch

# Need retain_graph for the first backward
loss1.backward(retain_graph=True)  # Graph kept alive
print(x.grad)  # tensor(12.) (d(3x²)/dx = 6x = 12)

# Second backward through the SAME shared computation
loss2.backward()  # Graph freed after this
print(x.grad)  # tensor(16.) (12 + 4, accumulated!)
```
| Tool | What it does | When to use |
|---|---|---|
| torch.no_grad() | Disables tracking for all ops in block | Inference, manual param updates |
| .detach() | Cuts one tensor from the graph | Freezing part of a network, stop-gradient |
| retain_graph=True | Keeps graph alive after backward | Multiple backward passes, higher-order grads |
The graph shows x → a → b → c. Click buttons to see how each tool affects gradient flow.
Here's a summary of when each tool is the right choice in common deep learning patterns:
```python
import torch
import torch.nn.functional as F

# The models and batches below stand for your own modules and data.

# Pattern 1: Inference / evaluation
with torch.no_grad():
    preds = model(test_batch)

# Pattern 2: Freeze part of model (e.g., fine-tuning only the head)
features = backbone(images).detach()
logits = classification_head(features)  # Only head gets gradients

# Pattern 3: Target networks (DQN, MoCo, EMA)
with torch.no_grad():
    target_q = target_network(next_states)       # No grad for target
current_q = policy_network(states)               # Grad flows here
loss = F.mse_loss(current_q, target_q.detach())  # Extra safety

# Pattern 4: Gradient penalty (WGAN-GP), needs retain + create_graph
interp = torch.lerp(real, fake, alpha)
interp.requires_grad_(True)
d_interp = discriminator(interp)
grads = torch.autograd.grad(d_interp.sum(), interp, create_graph=True)[0]
gp = ((grads.norm(2, dim=1) - 1) ** 2).mean()  # Differentiable!
```
What if you invent a new operation that PyTorch doesn't have a built-in gradient for? Or what if you want to override the gradient of an existing operation (like the straight-through estimator for quantization)? You write a custom torch.autograd.Function.
A custom function has two static methods: forward() computes the output, and backward() computes the gradients. You're essentially telling PyTorch: "Here's how to compute this operation, and here's the derivative."
Let's implement ReLU from scratch:
```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # ctx = context object for saving things needed in backward
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output = the incoming gradient from above
        x, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x <= 0] = 0  # gradient is 0 where x <= 0
        return grad_input

# Use it
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0], requires_grad=True)
y = MyReLU.apply(x)
y.sum().backward()
print(x.grad)  # tensor([0., 0., 0., 1., 1.]): gradient is 1 where x>0, else 0
```
The ctx object is the bridge between forward and backward. In forward, you save any tensors you'll need for gradient computation using ctx.save_for_backward(). In backward, you retrieve them with ctx.saved_tensors. This is critical for memory: if you save too many tensors, you waste GPU memory. If you save too few, you can't compute the gradient. The art of custom functions is saving the minimum necessary.
Important: you can only save tensors with save_for_backward. For non-tensor values (like integers, booleans, or shapes), store them directly as attributes on ctx: ctx.my_value = 42.
Now let's build something more interesting: a straight-through estimator. This is used in quantization — the forward pass rounds values to integers (non-differentiable!), but the backward pass pretends rounding didn't happen and passes gradients straight through.
```python
import torch

class StraightThroughRound(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)  # Round to nearest integer

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output     # Pass gradient through unchanged!

# Forward: 2.7 → 3.0 (rounded)
# Backward: gradient passes through as if rounding never happened
x = torch.tensor(2.7, requires_grad=True)
y = StraightThroughRound.apply(x)
y.backward()
print(x.grad)  # tensor(1.), gradient flows through!
```
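The general pattern for custom functions:

- In forward(ctx, ...), compute the output and save whatever backward will need: tensors via ctx.save_for_backward(), non-tensor values as attributes on ctx.
- In backward(ctx, grad_output), retrieve the saved values and return the gradient for each input: the local derivative times grad_output.
- Call the function through .apply(), as in MyReLU.apply(x).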
The backward method must return exactly as many gradients as there were inputs to forward. If an input doesn't need a gradient (like an integer parameter), return None for it.
Let's implement one more custom function to really solidify the pattern: a clamp operation that clips values to a range [min, max], with correct gradients (gradient is 0 outside the range, 1 inside):
```python
import torch

class MyClamp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, min_val, max_val):
        ctx.save_for_backward(x)
        ctx.min_val = min_val
        ctx.max_val = max_val
        return x.clamp(min=min_val, max=max_val)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        # Gradient passes through where x is in range, zero otherwise
        mask = (x >= ctx.min_val) & (x <= ctx.max_val)
        # None, None for min_val and max_val (no grad needed)
        return grad_output * mask.float(), None, None

# Test
x = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)
y = MyClamp.apply(x, 0.0, 1.0)
print(y)       # values [0.0, 0.5, 1.0]
y.sum().backward()
print(x.grad)  # tensor([0., 1., 0.]): only middle value gets gradient
```
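A hand-written backward is easy to get subtly wrong, so it helps to test it numerically. One option is torch.autograd.gradcheck, which compares your backward against finite differences. The sketch below uses a hypothetical `MyCube` function invented for this example. gradcheck expects double-precision inputs, and the test points should avoid kinks such as x = 0 for ReLU.

```python
import torch

class MyCube(torch.autograd.Function):
    """Hypothetical example op: y = x**3 with a hand-written backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 3 * x ** 2   # local derivative times incoming grad

# gradcheck needs float64 inputs with requires_grad=True
x = torch.tensor([0.5, -1.5, 2.0], dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(MyCube.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)  # True when analytic and numerical gradients agree
```

If the backward formula were wrong (say, `2 * x ** 2`), gradcheck would raise an error describing the mismatch instead of returning True.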
Return None for non-differentiable inputs. If your forward takes 3 inputs but only the first needs a gradient, backward must return (grad_for_first, None, None). Returning the wrong number of values raises an error saying the function returned an incorrect number of gradients.

Compare standard ReLU vs custom straight-through estimator. Drag x to see how gradients differ for negative inputs.
Now for the payoff. We're going to build a working autograd engine from scratch in ~50 lines of Python. Not a toy — a real engine that can compute gradients for arbitrary expressions. This is the same approach Andrej Karpathy's micrograd uses, stripped to its essence.
Our engine has one class: Value. Each Value wraps a number and remembers (1) what operation produced it, (2) what its inputs were, and (3) a _backward function that computes local gradients.
```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def __pow__(self, n):
        out = Value(self.data ** n, (self,), f'**{n}')
        def _backward():
            self.grad += n * (self.data ** (n-1)) * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')
        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # Seed gradient and walk backwards
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
```
That's the entire engine. Let's use it:
```python
# Test: y = (3x + 2)² at x = 1
x = Value(1.0)
y = (3 * x + 2) ** 2
y.backward()
print(f"y = {y.data}")      # 25.0
print(f"dy/dx = {x.grad}")  # 30.0 ✓ (matches Chapter 2!)

# A tiny neural network: 2 inputs, 1 hidden neuron, 1 output
x1 = Value(2.0)
x2 = Value(3.0)
w1 = Value(0.5)
w2 = Value(-0.3)
b = Value(0.1)

# Forward: neuron = relu(w1*x1 + w2*x2 + b)
neuron = (w1 * x1 + w2 * x2 + b).relu()
print(f"neuron = {neuron.data}")  # relu(1.0 - 0.9 + 0.1) = relu(0.2) ≈ 0.2

neuron.backward()
print(f"dout/dw1 = {w1.grad}")  # 2.0 (= x1, because relu is active)
print(f"dout/dw2 = {w2.grad}")  # 3.0 (= x2)
```
Let's also use our engine for a slightly larger example — a 2-layer neural network with one hidden unit:
```python
# 2-input, 1-hidden, 1-output network
# Forward: h = relu(w1*x1 + w2*x2 + b1), out = w3*h + b2
x1 = Value(1.5)
x2 = Value(-2.0)
w1 = Value(0.8)
w2 = Value(0.6)
b1 = Value(0.1)
w3 = Value(1.2)
b2 = Value(-0.5)

# Hidden layer
h_pre = w1 * x1 + w2 * x2 + b1   # 0.8*1.5 + 0.6*(-2) + 0.1 = 0.1
h = h_pre.relu()                 # relu(0.1) = 0.1

# Output layer
out = w3 * h + b2                # 1.2*0.1 + (-0.5) = -0.38

# Suppose target is 1.0, loss = (out - target)²
target = Value(1.0)
loss = (out + (-1) * target) ** 2   # (-0.38 - 1)² = 1.9044
loss.backward()

print(f"dL/dw1 = {w1.grad:.4f}")  # How much does w1 affect the loss?
print(f"dL/dw2 = {w2.grad:.4f}")  # How much does w2 affect the loss?
print(f"dL/dw3 = {w3.grad:.4f}")  # How much does w3 affect the loss?

# These gradients tell the optimizer which direction to nudge each weight
# to reduce the loss. Gradient descent: w -= lr * w.grad
```
Together with the update rule in the final comment, this is one complete training iteration: forward pass (compute the loss), backward pass (compute gradients), update (nudge weights). Every modern deep learning framework repeats this same loop for every step of training.
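The update step itself only appears as a comment above. As a rough sketch, here is the same forward/backward/update cycle repeated a few times on the same tiny network. It uses PyTorch tensors instead of our Value class so the block runs on its own; the learning rate `lr` and the step count are arbitrary values picked for illustration, not part of the lesson.

```python
import torch

# Same inputs, target, and initial weights as the example above.
x1, x2, target = torch.tensor(1.5), torch.tensor(-2.0), torch.tensor(1.0)
params = [torch.tensor(v, requires_grad=True) for v in (0.8, 0.6, 0.1, 1.2, -0.5)]
lr = 0.05  # hypothetical learning rate, chosen only for this demo

def compute_loss(params):
    w1, w2, b1, w3, b2 = params
    h = torch.relu(w1 * x1 + w2 * x2 + b1)  # hidden unit
    out = w3 * h + b2                       # output
    return (out - target) ** 2              # squared error

losses = []
for step in range(20):
    loss = compute_loss(params)   # forward pass builds the graph
    losses.append(loss.item())
    loss.backward()               # backward pass fills p.grad
    with torch.no_grad():         # update without recording graph nodes
        for p in params:
            p -= lr * p.grad
            p.grad.zero_()        # reset so the next backward starts fresh

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.6f}")
```

The first recorded loss matches the 1.9044 computed by hand above, and it shrinks as the loop repeats.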
[Interactive simulation: the same engine implemented in JavaScript. Adjust x to change the input, watch the computation graph for y = (3x+2)² form, then click "Backward" to see gradient values propagate through every node.]
Key observations from the engine:
Let's trace through the topological sort for our example y = (3x+2)² to make it concrete:
```python
# Graph structure: x → (x*3) → (a+2) → (b**2) = y
# Topological order (leaves first): [x, 3, a=x*3, 2, b=a+2, y=b²]
# Reversed (for backward):          [y=b², b=a+2, 2, a=x*3, 3, x]
# (One valid order; children live in a set, so sibling order may vary.)
#
# Backward execution:
# 1. y.grad = 1.0 (seed)
# 2. y._backward(): b.grad += 2*b.data * y.grad = 2*5*1 = 10
# 3. b._backward(): a.grad += b.grad = 10, two.grad += b.grad = 10
# 4. a._backward(): x.grad += 3 * a.grad = 30,
#                   three.grad += x.data * a.grad = 1 * 10 = 10
# Result: x.grad = 30 ✓
```
The beauty of this design is its composability. Each operation only defines its local gradient rule. The engine handles the rest: ordering, accumulation, traversal. Want to add a new operation? Just define _backward for it. Everything else stays the same.
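To make the "just define _backward" claim concrete, here is a minimal sketch that adds a tanh operation. So that the block runs on its own, it redefines a stripped-down Value with only `__mul__`, the new `tanh`, and `backward`; in practice you would simply add the `tanh` method to the full class above. The choice of tanh is ours, for illustration.

```python
import math

class Value:
    """Stripped-down copy of the engine: only what this example needs."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        # The new op: a forward value plus a local gradient rule, nothing else.
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad  # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x = Value(0.5)
y = (x * 2).tanh()          # y = tanh(2x)
y.backward()

expected = 2 * (1 - math.tanh(1.0) ** 2)  # chain rule by hand at x = 0.5
print(x.grad, expected)
```

Note that `backward`, the topological sort, and `__mul__` did not change at all; the engine picks up the new op purely through its `_backward` closure.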
Let's also try different expressions in our engine to build confidence. Each one you should be able to verify by hand:
```python
# Expression 1: y = x³ at x=2
# dy/dx = 3x² = 12
x = Value(2.0)
y = x * x * x
y.backward()
print(f"x³ at x=2: grad = {x.grad}")  # 12.0 ✓

# Expression 2: y = (x + 1) * (x - 1) = x² - 1 at x=3
# dy/dx = 2x = 6
x = Value(3.0)
y = (x + 1) * (x + (-1))
y.backward()
print(f"(x+1)(x-1) at x=3: grad = {x.grad}")  # 6.0 ✓

# Expression 3: y = relu(x - 2) at x=1 (inactive)
# dy/dx = 0 (relu is off)
x = Value(1.0)
y = (x + (-2)).relu()
y.backward()
print(f"relu(x-2) at x=1: grad = {x.grad}")  # 0.0 ✓

# Expression 4: y = relu(x - 2) at x=5 (active)
# dy/dx = 1 (relu is on, pass-through)
x = Value(5.0)
y = (x + (-2)).relu()
y.backward()
print(f"relu(x-2) at x=5: grad = {x.grad}")  # 1.0 ✓
```
Each of these can be verified by hand in seconds. The engine handles them all with the same generic mechanism. This is the power of automatic differentiation: one algorithm for any computation.
Autograd is elegant, but it has sharp edges. Ignore these gotchas and you'll spend hours debugging silent failures or mysterious memory leaks. Let's catalog the most common ones.
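An in-place operation modifies a tensor directly (like x.add_(1) or x[0] = 5). This breaks backward whenever the overwritten values were saved for the gradient computation. PyTorch tracks a version counter on each tensor and raises a RuntimeError during backward if a saved tensor was modified after the forward pass.
```python
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.exp()   # exp saves its OUTPUT y for backward (dy/dx = y)
y.add_(1)     # In-place! Overwrites the values backward needs

# y.sum().backward() now raises:
# RuntimeError: one of the variables needed for gradient computation
# has been modified by an inplace operation

# FIX: use an out-of-place operation instead of y.add_(1)
y = x.exp()
y = y + 1     # Creates a new tensor, graph stays valid
y.sum().backward()
print(x.grad)  # tensor([2.7183, 7.3891]), i.e. exp(x)
```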
Calling .item() on a tensor extracts a plain Python number, completely disconnected from the graph. NumPy arrays carry no graph either: calling .numpy() on a tensor that requires grad raises an error, and the usual workaround, .detach().numpy(), cuts the graph explicitly. If you feed these values back into a computation that needs gradients, the gradient chain is broken, and in the .item() case it happens silently.
```python
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

# BAD: .item() extracts a plain float, no gradient!
loss_value = y.item()  # Just the number 9.0, no graph

# GOOD: use the tensor directly for computations that need gradients
# Only use .item() for logging/printing
print(f"Loss: {y.item():.4f}")  # Fine for display
```
If you want gradients of gradients (second-order derivatives, used in meta-learning or some regularization), the first backward pass must itself be differentiable. Pass create_graph=True to make the backward pass build its own computation graph.
```python
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3  # dy/dx = 3x², d²y/dx² = 6x

# First derivative with create_graph so we can differentiate again
grad = torch.autograd.grad(y, x, create_graph=True)[0]
print(grad)   # tensor(12., grad_fn=<MulBackward0>): it's a graph node!

# Second derivative
grad2 = torch.autograd.grad(grad, x)[0]
print(grad2)  # tensor(12.): 6x = 6*2 = 12 ✓
```
The computation graph keeps references to every intermediate tensor (needed for backward). For a 50-layer network processing a large batch, this means 50 layers worth of activations sitting in GPU memory. This is often the main memory bottleneck in training.
The solution: gradient checkpointing (also called activation checkpointing). Instead of storing all intermediates, only store some "checkpoints." During backward, recompute the missing intermediates from the nearest checkpoint. Trades compute for memory.
To make the memory problem concrete: a GPT-3-sized model (175B parameters) with batch size 1 and sequence length 2048 stores on the order of 200GB of activations during a forward pass. That is just the intermediate tensors the graph keeps alive for backward, and it comes on top of the roughly 350GB the parameters themselves occupy in fp16 (175B × 2 bytes). Gradient checkpointing can cut the activation footprint substantially (to roughly 40GB in this example) at the cost of about one extra forward pass, or roughly 33% more compute per training step.
```python
from torch.utils.checkpoint import checkpoint

# Without checkpointing: stores ALL layer outputs in memory
def forward_normal(x, layers):
    for layer in layers:
        x = layer(x)  # Each output stored for backward
    return x

# With checkpointing: recomputes during backward, saves ~50% memory
def forward_checkpoint(x, layers):
    for layer in layers:
        x = checkpoint(layer, x, use_reentrant=False)
    return x
```
PyTorch tensors can be views of other tensors (sharing the same memory). Operations like .T, slicing, and (when the memory layout allows it) .reshape() create views. The autograd graph tracks through views correctly, but an in-place modification of a view also modifies the base tensor, which can overwrite values that backward needs:
```python
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = x[0]       # y is a VIEW of x's first row
z = y * 2      # Operation on the view
z.sum().backward()
print(x.grad)  # tensor([[2., 2.], [0., 0.]]): gradient only in first row

# DANGER: in-place on a view affects the parent
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x.exp()    # exp saves its output y for backward
z = y[0:2]     # View of y
z.mul_(3)      # In-place on view → also overwrites y[0:2]
# y.sum().backward() now raises the
# "modified by an inplace operation" RuntimeError
```
When training in fp16, small gradients can underflow to zero (bf16 shares fp32's exponent range, so it rarely needs this treatment). PyTorch's GradScaler multiplies the loss by a large number before backward (so gradients stay in representable range), then divides the gradients back down before the optimizer step:
```python
from torch.amp import GradScaler, autocast

scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    with autocast(device_type='cuda'):
        output = model(batch)
        loss = loss_fn(output)
    scaler.scale(loss).backward()  # Scaled loss → scaled gradients
    scaler.step(optimizer)         # Unscales then steps
    scaler.update()                # Adjusts scale factor
```
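The underflow that loss scaling works around can be seen on a single number. The sketch below is only an illustration of the idea, not what GradScaler does internally: it casts a tiny "gradient" to fp16 with and without scaling. The value 1e-8 is an arbitrary example, and the scale 65536 (2**16) is chosen to match GradScaler's default `init_scale`.

```python
import torch

tiny_grad = torch.tensor(1e-8)             # a small gradient, stored in float32
scale = 65536.0                            # 2**16

# Cast straight to fp16: the value is below fp16's smallest subnormal.
unscaled_fp16 = tiny_grad.to(torch.float16)

# Scale first, cast to fp16, then unscale in float32.
scaled_fp16 = (tiny_grad * scale).to(torch.float16)
recovered = scaled_fp16.to(torch.float32) / scale

print(unscaled_fp16.item())  # 0.0: the gradient vanished
print(recovered.item())      # close to 1e-8: the gradient survived
```

Because the scale is a power of two, multiplying and dividing by it only shifts the exponent; the small remaining error comes from fp16's limited mantissa.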
[Interactive simulation: compare memory usage with and without gradient checkpointing for networks of varying depth.]
If you reassign a leaf tensor with an operation, it stops being a leaf:
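```python
x = torch.tensor(3.0, requires_grad=True)
print(x.is_leaf)  # True
x = x * 2         # x is now a NEW tensor (result of multiplication)
print(x.is_leaf)  # False! .grad won't be stored here

# This is why optimizer.step() runs under torch.no_grad():
# it needs to modify parameters WITHOUT creating graph nodes
param = torch.tensor(3.0, requires_grad=True)  # stand-in for a model parameter
lr = 0.1                                       # example learning rate
(param ** 2).backward()                        # param.grad = 2 * 3 = 6
with torch.no_grad():
    param -= lr * param.grad  # In-place update, stays as leaf
print(param.is_leaf)          # True
```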
| Gotcha | Symptom | Fix |
|---|---|---|
| In-place ops | RuntimeError or wrong gradients | Use out-of-place: y = y + 1, not y.add_(1) |
| .item() / .numpy() | Gradient is None or missing | Only use for logging, not computation |
| No create_graph | Can't compute second derivatives | Pass create_graph=True to first backward |
| Memory blowup | OOM on large models | Gradient checkpointing |
| Leaf vs non-leaf | .grad is None for intermediates | Use .retain_grad() or torch.autograd.grad() |
| Leaf reassignment | Parameter no longer receives .grad | Update in-place under torch.no_grad() |
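The "Leaf vs non-leaf" row is easiest to see on a two-step computation. A minimal sketch with illustrative values of our own: the intermediate `h` gets a `.grad` only because `retain_grad()` is called before backward, and `torch.autograd.grad()` returns the same quantity without touching any `.grad` attribute.

```python
import torch

# Option 1: retain_grad() on the intermediate
x = torch.tensor(3.0, requires_grad=True)
h = x * 2            # non-leaf: by default its .grad is not kept
h.retain_grad()      # ask autograd to keep it
y = h ** 2
y.backward()
print(x.grad)        # tensor(24.)  dy/dx = 2*h*2 = 4*6
print(h.grad)        # tensor(12.)  dy/dh = 2*h

# Option 2: functional API, no .grad attributes involved
x2 = torch.tensor(3.0, requires_grad=True)
h2 = x2 * 2
y2 = h2 ** 2
(dy_dh,) = torch.autograd.grad(y2, h2)
print(dy_dh)         # tensor(12.)
```

`retain_grad()` is convenient for quick debugging, while the functional form avoids leaving stale gradients on tensors you did not mean to modify.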
torch.autograd.set_detect_anomaly(True) can help catch issues at the point they occur.
You now understand autograd from the inside out. You know how computation graphs are built dynamically, how the backward pass walks them in reverse topological order, how gradients accumulate, and how to control the process. Let's consolidate everything.
| Function / Concept | What it does | When to use |
|---|---|---|
| requires_grad=True | Enables gradient tracking for a tensor | Model parameters, inputs you want to differentiate |
| .backward() | Computes gradients for all leaf tensors | After computing loss, before optimizer.step() |
| .grad | Accumulated gradient (tensor attribute) | Access after backward(), zero before next backward |
| .zero_grad() | Resets .grad to zero | Before each backward pass (unless accumulating) |
| .detach() | Creates graph-disconnected copy | Freezing encoders, stop-gradient, targets |
| torch.no_grad() | Context: disables all tracking | Inference, evaluation, manual param updates |
| retain_graph=True | Keeps graph after backward | Multiple backward passes, RL policy + value |
| create_graph=True | Makes backward differentiable | Higher-order gradients, meta-learning (MAML) |
| .retain_grad() | Saves grad for non-leaf tensor | Debugging intermediate gradients |
| torch.autograd.grad() | Functional gradient API | When you want grads without .backward() |
| torch.autograd.Function | Custom forward + backward | Novel ops, straight-through, custom gradients |
| checkpoint() | Recompute instead of store activations | Memory-limited training of deep models |
| .grad_fn | Pointer to the creating operation | Debugging, inspecting the graph structure |
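As a final test of your understanding, let's derive the backward pass for softmax. Given input z (vector of length n), softmax produces:
$$s_i = \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}}$$
The Jacobian is: $\partial s_i / \partial z_j = s_i(\delta_{ij} - s_j)$, where $\delta_{ij}$ is 1 if i=j, else 0.
In matrix form: $J = \mathrm{diag}(s) - s s^\top$.
Given incoming gradient g (from the loss above), the output gradient is:
$$\frac{\partial L}{\partial z} = J^\top g = s \odot g - (s \cdot g)\, s$$
Where ⊙ is elementwise multiplication and (s · g) is the dot product (a scalar). This is why softmax backward is O(n), not O(n²): you never need to materialize the full Jacobian.
Let's derive this step by step. Start with the Jacobian entry $\partial s_i / \partial z_j$, using the quotient rule:
$$i = j:\quad \frac{\partial s_i}{\partial z_i} = \frac{e^{z_i}\sum_k e^{z_k} - e^{z_i} e^{z_i}}{\left(\sum_k e^{z_k}\right)^2} = s_i(1 - s_i)$$
$$i \neq j:\quad \frac{\partial s_i}{\partial z_j} = \frac{-\,e^{z_i} e^{z_j}}{\left(\sum_k e^{z_k}\right)^2} = -s_i s_j$$
Combined: $\partial s_i / \partial z_j = s_i(\delta_{ij} - s_j)$. Now multiply by the incoming gradient vector g:
$$\frac{\partial L}{\partial z_j} = \sum_i g_i \frac{\partial s_i}{\partial z_j} = \sum_i g_i s_i (\delta_{ij} - s_j) = s_j g_j - s_j \sum_i s_i g_i = s_j\,(g_j - s \cdot g)$$
Which gives us the vector form: $\partial L / \partial z = s \odot (g - s \cdot g)$, where the scalar s · g is subtracted from every element of g. This is equivalent to our earlier formulation and requires only O(n) computation: one dot product and one elementwise multiply.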
```python
import torch

class MySoftmax(torch.autograd.Function):
    @staticmethod
    def forward(ctx, z):
        s = torch.exp(z - z.max())  # subtract max for numerical stability
        s = s / s.sum()
        ctx.save_for_backward(s)
        return s

    @staticmethod
    def backward(ctx, grad_output):
        s, = ctx.saved_tensors
        # J^T @ g = s * g - s * (s @ g)
        dot = (s * grad_output).sum()
        return s * grad_output - s * dot

# Verify against PyTorch's built-in
z1 = torch.randn(5, requires_grad=True)
z2 = z1.detach().clone().requires_grad_(True)
g = torch.randn(5)  # arbitrary incoming gradient
torch.softmax(z1, dim=0).backward(g)
MySoftmax.apply(z2).backward(g)
print(torch.allclose(z1.grad, z2.grad, atol=1e-6))  # True: same gradients ✓
```
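To double-check the O(n) shortcut itself, independent of PyTorch, the following pure-Python sketch builds the full n×n Jacobian explicitly and compares J^T g against the shortcut s ⊙ (g − s·g). The input values and helper names (`vjp_full_jacobian`, `vjp_fast`) are made up for this check.

```python
import math

def softmax(z):
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def vjp_full_jacobian(s, g):
    """O(n^2): materialize J[i][j] = s_i * (delta_ij - s_j), then compute J^T g."""
    n = len(s)
    J = [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
         for i in range(n)]
    return [sum(g[i] * J[i][j] for i in range(n)) for j in range(n)]

def vjp_fast(s, g):
    """O(n): one dot product and one elementwise multiply."""
    dot = sum(si * gi for si, gi in zip(s, g))
    return [si * (gi - dot) for si, gi in zip(s, g)]

z = [1.0, 2.0, 0.5, -1.0]   # arbitrary logits
g = [0.3, -0.7, 1.1, 0.2]   # arbitrary incoming gradient
s = softmax(z)

slow = vjp_full_jacobian(s, g)
fast = vjp_fast(s, g)
max_diff = max(abs(a - b) for a, b in zip(slow, fast))
print(max_diff)             # tiny: only floating-point rounding
```

A useful side check: the entries of the resulting gradient sum to (numerically) zero, because adding a constant to every logit leaves softmax unchanged.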
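For an extra challenge, think about implementing backward for layer normalization. The forward is:
$$y_i = \gamma\,\hat{x}_i + \beta, \qquad \hat{x}_i = \frac{x_i - \mu}{\sigma}, \qquad \mu = \frac{1}{n}\sum_j x_j, \qquad \sigma = \sqrt{\frac{1}{n}\sum_j (x_j - \mu)^2 + \epsilon}$$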
The tricky part: μ and σ both depend on ALL elements of x. So ∂yi/∂xj is non-zero even when i ≠ j (because changing xj changes the mean and std that affect yi). This cross-dependency makes the Jacobian dense, but like softmax, it can be applied in O(n) time without materializing the full matrix.
```python
# Sketch of LayerNorm backward (the key insight)
# Given: x_hat = (x - mean) / std, y = gamma * x_hat + beta, grad = dL/dy
# g_hat = gamma * grad                    # gradient w.r.t. x_hat
# dL/dx_i = (g_hat_i - mean(g_hat) - x_hat_i * mean(g_hat * x_hat)) / std
# All O(n) operations: no n×n Jacobian needed!
```
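The sketch above can be turned into a few runnable lines and compared against PyTorch's own layer norm. This is a check under stated assumptions rather than a production kernel: it handles a single 1-D vector, uses the biased variance plus `eps` (as `torch.nn.functional.layer_norm` does), and applies a per-element `gamma` to the incoming gradient before the mean terms. The function name `layernorm_backward` is ours.

```python
import torch
import torch.nn.functional as F

def layernorm_backward(x, gamma, grad_out, eps=1e-5):
    """dL/dx for y = gamma * x_hat + beta over a 1-D vector x, in O(n)."""
    mean = x.mean()
    std = torch.sqrt(x.var(unbiased=False) + eps)
    x_hat = (x - mean) / std
    g_hat = gamma * grad_out  # gradient w.r.t. x_hat
    return (g_hat - g_hat.mean() - x_hat * (g_hat * x_hat).mean()) / std

torch.manual_seed(0)
n = 8
x = torch.randn(n, dtype=torch.double, requires_grad=True)
gamma = torch.randn(n, dtype=torch.double)
beta = torch.randn(n, dtype=torch.double)
grad_out = torch.randn(n, dtype=torch.double)  # arbitrary incoming gradient

# Reference: let autograd differentiate the built-in layer norm.
y = F.layer_norm(x, (n,), weight=gamma, bias=beta, eps=1e-5)
y.backward(grad_out)

manual = layernorm_backward(x.detach(), gamma, grad_out)
matches = torch.allclose(x.grad, manual)
print(matches)
```

Double precision keeps the comparison tight; in float32 you would loosen the tolerance passed to `allclose`.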
Autograd is the foundation that everything else in deep learning training is built on:
If you want to go deeper into PyTorch internals, here are the key source files:
- torch/autograd/__init__.py: Python API (backward, grad)
- torch/csrc/autograd/engine.cpp: C++ backward engine
- torch/csrc/autograd/function.h: Base class for all grad functions
- tools/autograd/derivatives.yaml: Derivative formulas in YAML!
That last file is remarkable: the backward formulas for most built-in operations are declared in a single YAML file. Code generation turns these declarations into C++ implementations. Want to know how PyTorch computes the gradient for a built-in operation? Look it up in derivatives.yaml.
Backpropagation (reverse-mode autodiff for neural networks) was popularized by Rumelhart, Hinton, and Williams in 1986, though the mathematics was discovered independently multiple times before that. The key insight was applying the chain rule to networks with many layers — showing that you could efficiently compute gradients for ALL weights in a single backward sweep.
For decades, researchers hand-implemented backward passes for each new architecture. Theano (2010) was among the first frameworks to automate this with a static graph compiler. PyTorch (2017) popularized dynamic graphs, which record operations at runtime; this made debugging natural (use print, use pdb, use if-statements) and research faster. Today's PyTorch autograd engine handles thousands of operations with hand-tuned CUDA kernels, vmap support for batched gradients, and torch.compile for fusion.
The evolution continues: torch.func (functorch) brings functional transformations like vmap (vectorized map), grad (functional gradient), and jacrev/jacfwd (full Jacobian computation) as composable transforms. These build on autograd but add new capabilities like per-sample gradients without a for-loop.
```python
import torch
from torch.func import grad, vmap, jacrev

# Functional gradient (no .backward needed)
def f(x):
    return (x ** 3).sum()

gradient_fn = grad(f)
print(gradient_fn(torch.tensor([1.0, 2.0, 3.0])))  # [3, 12, 27]

# Per-sample gradients (useful for differential privacy)
def loss_fn(params, x, y):
    return ((params @ x - y) ** 2).sum()

per_sample_grad = vmap(grad(loss_fn), in_dims=(None, 0, 0))

# Full Jacobian (for small functions)
def g(x):
    return torch.stack([x[0]**2 + x[1], x[0]*x[1]])

J = jacrev(g)(torch.tensor([2.0, 3.0]))
print(J)  # [[4, 1], [3, 2]]: the 2×2 Jacobian matrix
```
| Operation | Forward: y = | Backward: ∂L/∂x = |
|---|---|---|
| Add | a + b | ∂L/∂a = g, ∂L/∂b = g |
| Multiply | a · b | ∂L/∂a = b·g, ∂L/∂b = a·g |
| Power | x^n | n·x^(n-1)·g |
| Exp | e^x | e^x·g = y·g |
| Log | ln(x) | g / x |
| ReLU | max(0,x) | g · (x > 0) |
| Sigmoid | σ(x) | g · σ(x)(1-σ(x)) = g·y(1-y) |
| Tanh | tanh(x) | g · (1 - tanh²(x)) = g·(1-y²) |
| MatMul | A @ B | ∂L/∂A = g @ Bᵀ, ∂L/∂B = Aᵀ @ g |
| Sum | ∑x | g · ones (broadcast) |
| Mean | mean(x) | g / n (broadcast) |
In this table, g always means the incoming gradient (∂L/∂y). Notice that exp and sigmoid/tanh can reuse the forward output y in their backward — no need to save the input! This is a common optimization: save the output instead of the input when the backward formula allows it.
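As a small illustration of the "save the output" trick, here is a custom sigmoid written as a `torch.autograd.Function`. It stores only y in forward and computes g·y(1−y) in backward. The class name `MySigmoid` is ours, and `gradcheck` is used to confirm the backward numerically.

```python
import torch

class MySigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = torch.sigmoid(x)
        ctx.save_for_backward(y)  # keep the OUTPUT; the input is not needed
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * y * (1 - y)  # g · σ(x)(1 − σ(x)) = g · y(1 − y)

torch.manual_seed(0)
x = torch.randn(4, dtype=torch.double, requires_grad=True)

# gradcheck compares backward against finite differences (needs float64).
passed = torch.autograd.gradcheck(MySigmoid.apply, (x,))
print(passed)
```

The same pattern applies to exp (save y, return g·y) and tanh (save y, return g·(1−y²)).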
Here's every piece of a training loop annotated with what autograd is doing at each step:
```python
import torch
import torch.nn as nn

# Model parameters are leaf tensors with requires_grad=True
model = nn.TransformerEncoderLayer(d_model=512, nhead=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for batch_idx, (tokens, labels) in enumerate(dataloader):
    # 1. ZERO GRADIENTS
    # Autograd: clear .grad on all parameters (set to None or zero)
    optimizer.zero_grad()

    # 2. FORWARD PASS
    # Autograd: builds computation graph, node by node
    # Each nn.Linear, LayerNorm, attention, etc. creates grad_fn nodes
    # All intermediate activations are stored (needed for backward)
    logits = model(tokens)  # Graph: tokens → ... → logits

    # 3. COMPUTE LOSS
    # Autograd: adds a few more nodes (softmax, nll, reduction)
    loss = loss_fn(logits, labels)  # Graph: ... → logits → loss (scalar)

    # 4. BACKWARD PASS
    # Autograd: topological sort, then walk backward
    # Each node computes local_grad × incoming_grad
    # Results accumulate into param.grad for each leaf parameter
    # After backward, graph is FREED (intermediates released)
    loss.backward()

    # 5. OPTIMIZER STEP
    # NOT autograd! Optimizer reads .grad, applies update rule
    # Adam: m = β1*m + (1-β1)*grad, v = β2*v + (1-β2)*grad²
    #       param -= lr * m_hat / (√v_hat + ε)
    # Happens inside torch.no_grad() to avoid building a graph for updates
    optimizer.step()

    # 6. LOGGING (optional)
    # .item() extracts scalar for display, disconnected from graph
    if batch_idx % 100 == 0:
        print(f"Step {batch_idx}: loss = {loss.item():.4f}")
```
Every modern training framework (HuggingFace Trainer, PyTorch Lightning, DeepSpeed) is built on exactly this loop. They add distribution, mixed precision, logging, and checkpointing — but the core autograd flow (zero → forward → loss → backward → step) is always the same.
When things go wrong, PyTorch provides tools to diagnose autograd problems:
```python
# 1. Anomaly detection: pinpoints where gradients become NaN/Inf
torch.autograd.set_detect_anomaly(True)
# Now if backward produces NaN, you get a stack trace to the FORWARD op

# 2. Gradient checking: numerically verify your custom backward
from torch.autograd import gradcheck
x = torch.randn(3, dtype=torch.double, requires_grad=True)
result = gradcheck(MyCustomFunction.apply, (x,), eps=1e-6)
# Returns True if analytical gradient matches numerical approximation

# 3. Profiling: see which backward ops take the most time
with torch.profiler.profile(with_stack=True) as prof:
    loss.backward()
print(prof.key_averages().table(sort_by="cpu_time_total"))

# 4. Graph visualization: export graph to graphviz DOT format
from torchviz import make_dot
dot = make_dot(loss, params={"W": W, "b": b})
dot.render("computation_graph")  # Saves PDF
```
The gradcheck function is essential when writing custom autograd functions. It computes numerical derivatives (using finite differences) and compares them against your analytical backward. If they disagree by more than a tolerance, your backward has a bug. Always test with double precision (float64) since float32's numerical errors can cause false failures.
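The idea behind gradcheck fits in a few lines of plain Python. This is a simplified sketch of the finite-difference comparison, not PyTorch's actual implementation: it checks a correct derivative and a deliberately buggy one for the lesson's function y = (3x + 2)². The helper names and the tolerance are illustrative.

```python
def numerical_grad(f, x, eps=1e-6):
    """Central finite difference: (f(x + eps) - f(x - eps)) / (2 * eps)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def f(x):
    return (3 * x + 2) ** 2

def correct_backward(x):
    return 6 * (3 * x + 2)   # chain rule: 2(3x + 2) · 3

def buggy_backward(x):
    return 2 * (3 * x + 2)   # bug: forgot the inner derivative (the factor 3)

def check(backward_fn, x, tol=1e-4):
    """True if the analytical gradient agrees with the numerical one."""
    return abs(backward_fn(x) - numerical_grad(f, x)) < tol

x0 = 1.0
print(numerical_grad(f, x0))        # approximately 30
print(check(correct_backward, x0))  # True
print(check(buggy_backward, x0))    # False: the bug is caught
```

Python floats are already double precision, which is why the comparison is this tight; the same check in float32 would need a much larger tolerance.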
One more critical debugging technique: gradient hooks. You can register a function that gets called every time a tensor's gradient is computed, letting you inspect, modify, or log gradients as they flow:
```python
# Register a hook to inspect gradients during backward
def print_grad(name):
    def hook(grad):
        print(f"{name}: grad shape={grad.shape}, norm={grad.norm():.4f}")
        return grad  # Return modified grad, or None to keep original
    return hook

for name, param in model.named_parameters():
    param.register_hook(print_grad(name))

# Now during backward, you'll see every parameter's gradient info
loss.backward()  # Prints gradient stats for every parameter
```
Our engine works on scalars. PyTorch operates on tensors (multi-dimensional arrays). This changes the backward functions: instead of simple scalar derivatives, each operation computes a vector-Jacobian product (VJP), that is, the incoming gradient multiplied by the operation's local Jacobian. For example, matrix multiply C = A @ B has backward:
$$\frac{\partial L}{\partial A} = g\,B^\top, \qquad \frac{\partial L}{\partial B} = A^\top g, \qquad \text{where } g = \frac{\partial L}{\partial C}$$
This is why PyTorch backward functions receive grad_output (the "vector" in VJP) and return the result of multiplying it by the local Jacobian. The Jacobian itself is never materialized; only its product with the incoming gradient is computed.
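The matrix-multiply rule from the quick-reference table (∂L/∂A = g @ Bᵀ, ∂L/∂B = Aᵀ @ g) can be checked directly against autograd. The shapes and random values below are arbitrary; `g` plays the role of the incoming gradient ∂L/∂C.

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
B = torch.randn(4, 2, dtype=torch.double, requires_grad=True)
g = torch.randn(3, 2, dtype=torch.double)  # incoming gradient dL/dC

C = A @ B
C.backward(g)  # autograd multiplies g by the local Jacobians

# Hand-written backward formulas: two matmuls, no Jacobian matrix built.
grad_A = g @ B.detach().T   # shape (3, 4), same as A
grad_B = A.detach().T @ g   # shape (4, 2), same as B

print(torch.allclose(A.grad, grad_A), torch.allclose(B.grad, grad_B))
```

For comparison, the full Jacobian of C with respect to A would have (3·2)×(3·4) = 72 entries even at this toy size, and it grows with the product of the output and input sizes; the two matmuls above never build it.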
| Our micrograd | PyTorch autograd |
|---|---|
| Scalar values | Tensor values (any shape) |
| Python closures | C++ Function objects |
| Single-threaded | Multi-threaded (parallel backward) |
| No memory management | Frees intermediates after use |
| No GPU support | CUDA kernels for every operation |
| Scalar derivatives | Vector-Jacobian products |
Now that you understand autograd from the inside, here are productive directions to deepen your knowledge:
The progression from here: understanding autograd unlocks understanding of training dynamics (why learning rates matter, why batch norm helps, why residual connections prevent vanishing gradients) and optimization (why Adam works better than SGD for transformers, why warmup helps, why gradient clipping prevents explosions). All of these build directly on the gradient computation you now understand.
"What I cannot create, I do not understand." — Richard Feynman
You might have noticed that .backward() is called on a scalar (the loss). Why? Because the gradient of a scalar with respect to a vector is a vector (same shape as the parameters). If you tried to backward a vector output, you'd get a matrix (the Jacobian) for each parameter — that's N×M values instead of N, which explodes memory.
When you DO need to backward a non-scalar, you must provide a gradient argument that specifies which linear combination of the output dimensions you want the gradient for:
```python
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # Vector output: [1, 4, 9]

# Can't just do y.backward(): need to specify WHICH gradient
# This computes d(y[0]*1 + y[1]*1 + y[2]*1)/dx = d(sum(y))/dx
y.backward(torch.ones_like(y))
print(x.grad)  # tensor([2., 4., 6.]), same as (y.sum()).backward()

# Or weight the outputs differently:
x.grad.zero_()
y = x ** 2
y.backward(torch.tensor([1.0, 0.0, 0.0]))  # Only care about y[0]
print(x.grad)  # tensor([2., 0., 0.]), gradient of just y[0]
```