What happens between model = MyNet() and model(x) — the machinery that makes deep learning code actually work.
You write model = MyNet() and call model(x). Training works. You ship it. Then one day you add a custom layer and your loss stays flat. The optimizer reports zero parameters. Your gradients are all None. You stare at the code for hours.
Here is the bug. Can you spot it?
```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [
            nn.Linear(64, 32),
            nn.Linear(32, 10),
        ]  # Bug is HERE

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = MyModel()
print(list(model.parameters()))  # [] (empty!)
```
The model has layers. They have weights. But model.parameters() returns nothing. The optimizer sees nothing to train. Your loss is a flat line.
The fix is one word: change the plain Python list to nn.ModuleList. But why does that matter? Why does PyTorch care whether you use a list or a ModuleList?
nn.Module is not just a Python class; it's a registration system. It tracks parameters, buffers, hooks, and child modules through a custom __setattr__. A plain Python list bypasses that tracking entirely. Understanding this machinery lets you debug gradient issues quickly, write custom layers that work correctly, use hooks for interpretability, and export models for deployment.
This lesson takes you inside the machine. By the end, you'll know exactly what happens at every step from module construction to forward pass to JIT export.
Every PyTorch model inherits from nn.Module. But what is a Module, really? Strip away the convenience methods and you find a Python object built around a handful of internal dictionaries. Four of them matter most here:
| Dictionary | Contents | Purpose |
|---|---|---|
| _parameters | nn.Parameter objects | Trainable weights |
| _buffers | Tensors (non-trainable) | Running stats, masks |
| _modules | Child nn.Module objects | Sub-layers |
| _forward_hooks | Callable functions | Intercept forward pass |
That's it. A Module is a container of named tensors and sub-containers, plus machinery to traverse them recursively. When you call model.parameters(), it walks _parameters on itself, then recursively on every child in _modules.
Let's build a Module from scratch — no inheritance, just dicts — to see why the class is needed:
```python
# A "module" without nn.Module: just raw dicts
import torch

my_module = {
    "_parameters": {
        "weight": torch.randn(10, 5, requires_grad=True),
        "bias": torch.randn(10, requires_grad=True),
    },
    "_buffers": {},
    "_modules": {},
}

def forward(module, x):
    W = module["_parameters"]["weight"]
    b = module["_parameters"]["bias"]
    return x @ W.T + b

# This works! But...
# - No recursive parameter collection
# - No .to(device) that moves everything
# - No hooks for debugging
# - No state_dict for saving/loading
# - No __setattr__ magic for registration
```
nn.Module exists because neural networks are trees of parameterized operations. You need to recursively collect parameters from all sub-modules, move them to devices together, save/load state, and intercept forward passes. A plain dict can't do any of that automatically.
The magic happens in __setattr__. When you write self.linear = nn.Linear(10, 5) inside __init__, Python calls Module.__setattr__(self, "linear", ...). This method checks: is the value an nn.Parameter? Put it in _parameters. Is it an nn.Module? Put it in _modules. Is the name already registered as a buffer (via register_buffer)? Update the entry in _buffers. Anything else, including a plain tensor? Store it as a regular Python attribute.
```python
# Simplified __setattr__ (actual PyTorch source is ~80 lines)
def __setattr__(self, name, value):
    if isinstance(value, Parameter):
        self._parameters[name] = value
    elif isinstance(value, Module):
        self._modules[name] = value
    elif name in self._buffers:
        self._buffers[name] = value  # name was registered via register_buffer
    else:
        object.__setattr__(self, name, value)  # normal Python
```
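To make the routing concrete, here is a minimal sketch that assigns one value of each kind and then inspects the internal dictionaries directly. The class name `Routed` and its attribute names are made up for illustration; reading the underscore-prefixed dicts is for inspection only and is not something production code should rely on.

```python
import torch
import torch.nn as nn

class Routed(nn.Module):  # hypothetical demo class
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)                     # Module    -> _modules
        self.gain = nn.Parameter(torch.ones(5))        # Parameter -> _parameters
        self.register_buffer("steps", torch.zeros(1))  # buffer    -> _buffers
        self.note = "plain attribute"                  # other     -> instance __dict__

m = Routed()
modules_keys = list(m._modules.keys())
parameters_keys = list(m._parameters.keys())
buffers_keys = list(m._buffers.keys())
print(modules_keys, parameters_keys, buffers_keys)

# Registered entries do not live in the instance __dict__; plain attributes do.
print("note" in m.__dict__, "fc" in m.__dict__)
```

Attribute reads such as `m.fc` still work because nn.Module also defines `__getattr__`, which looks the name up in these dictionaries.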
A Parameter is just a Tensor subclass with two special properties: it has requires_grad=True by default, and assigning it as an attribute of a Module registers it in that Module's _parameters. That's the entire difference. But that difference is everything.
```python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # This IS tracked: shows up in parameters()
        self.weight = nn.Parameter(torch.randn(10, 5))
        # This is NOT tracked: invisible to optimizer
        self.scale = torch.randn(10)

model = Demo()
print(list(model.named_parameters()))
# [('weight', Parameter containing: tensor(...))]
# Notice: 'scale' is MISSING
```
The scale tensor exists on the object, but the optimizer never sees it. It won't get gradients. It won't be saved in state_dict(). It won't move when you call model.to('cuda'). It's a ghost.
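A quick way to check these claims is sketched below. It reuses the `Demo` class from above and, as an assumption made so the example runs without a GPU, uses a dtype conversion in place of a device move; `.to()` walks the same registered tensors in both cases.

```python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(10, 5))  # registered
        self.scale = torch.randn(10)                    # plain attribute

model = Demo()
state_keys = list(model.state_dict().keys())

# .to() only converts registered parameters and buffers.
model.to(torch.float64)
weight_dtype = model.weight.dtype
scale_dtype = model.scale.dtype
print(state_keys, weight_dtype, scale_dtype)
```

The parameter follows the conversion while the plain tensor keeps its original dtype, which is the same mechanism that would leave it behind on the CPU after `model.to('cuda')`.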
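Rule of thumb: if a tensor should be trained, wrap it in nn.Parameter(). If it shouldn't be trained but should move with the model and appear in state_dict, use register_buffer(). If it's neither, it's just a local variable that shouldn't be on the Module at all.
Let's trace through what named_parameters() actually does: it yields every entry in the module's own _parameters, then recurses into each child in _modules, prefixing each name with the child's attribute name and a dot.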
This is why state_dict() has dot-separated keys like "encoder.layer.0.self_attn.q_proj.weight" — they encode the full path through the module tree.
```python
# state_dict keys reflect the module tree structure
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)
print(model.state_dict().keys())
# odict_keys(['0.weight', '0.bias', '2.weight', '2.bias'])
# Notice: ReLU (index 1) has NO parameters, it's stateless
```
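The recursive walk behind those dotted names can be sketched in a few lines. This is an illustrative reimplementation, not PyTorch's actual code: the helper name `walk_parameters` is made up, and the real `named_parameters()` additionally de-duplicates shared parameters.

```python
import torch.nn as nn

def walk_parameters(module, prefix=""):
    """Yield (dotted_name, parameter) pairs, depth-first."""
    # First the parameters registered directly on this module...
    for name, param in module._parameters.items():
        if param is not None:
            yield prefix + name, param
    # ...then recurse into each child, extending the prefix with its name.
    for child_name, child in module._modules.items():
        if child is not None:
            yield from walk_parameters(child, prefix + child_name + ".")

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
sketch_names = [name for name, _ in walk_parameters(model)]
builtin_names = [name for name, _ in model.named_parameters()]
print(sketch_names)
```

For this model the sketch and the built-in method produce the same names in the same order.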
Parameters are tensors the optimizer trains. But what about tensors that need to live on the model, move to GPU, get saved and loaded, but should not receive gradients? That's what buffers are for.
The classic example is BatchNorm. It has four main tensors (plus a num_batches_tracked counter, which is also a buffer):
| Name | Type | Trained? | Purpose |
|---|---|---|---|
| weight (γ) | Parameter | Yes | Learned scale |
| bias (β) | Parameter | Yes | Learned shift |
| running_mean | Buffer | No | EMA of batch means (for eval mode) |
| running_var | Buffer | No | EMA of batch variances (for eval mode) |
The running statistics are updated during training (via exponential moving average) but NOT by the optimizer — they have no gradients. They must be saved with the model (they're needed at inference) and must move to GPU with everything else.
```python
class MyBatchNorm(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # Parameters (trained by optimizer)
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        # Buffers (NOT trained, but saved + moved with model)
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))
        self.register_buffer('num_batches_tracked', torch.tensor(0, dtype=torch.long))

    def forward(self, x):
        # x is (batch, num_features)
        if self.training:
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)
            # Update running stats in place, outside autograd
            with torch.no_grad():
                self.running_mean.mul_(0.9).add_(0.1 * mean)
                self.running_var.mul_(0.9).add_(0.1 * var)
                self.num_batches_tracked += 1
        else:
            mean = self.running_mean
            var = self.running_var
        x_norm = (x - mean) / torch.sqrt(var + 1e-5)
        return self.weight * x_norm + self.bias
```
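The same exponential-moving-average update can be observed on the built-in layer. The sketch below runs one training-mode forward pass through nn.BatchNorm1d with no optimizer at all and compares the buffers against the formula by hand. One difference from the hand-written version above: the built-in layer feeds the unbiased batch variance into running_var.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)        # default momentum is 0.1
x = torch.randn(8, 3) + 5.0   # batch whose mean is far from zero

bn.train()
with torch.no_grad():
    bn(x)                     # forward only; no optimizer, no backward

# new = (1 - momentum) * old + momentum * batch_statistic
expected_mean = 0.9 * torch.zeros(3) + 0.1 * x.mean(dim=0)
expected_var = 0.9 * torch.ones(3) + 0.1 * x.var(dim=0, unbiased=True)

mean_matches = torch.allclose(bn.running_mean, expected_mean, atol=1e-5)
var_matches = torch.allclose(bn.running_var, expected_var, atol=1e-5)
batches_seen = bn.num_batches_tracked.item()
print(mean_matches, var_matches, batches_seen)
```

The buffers changed even though nothing called `backward()` or `optimizer.step()`, which is exactly why they cannot be ordinary parameters.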
register_buffer('name', tensor) does THREE things: (1) stores the tensor in self._buffers, (2) makes it appear in state_dict(), and (3) makes .to(device) move it. It does NOT add it to parameters(). The optimizer never touches buffers.
There's one more option: register_buffer('name', tensor, persistent=False). Non-persistent buffers move with .to() but do NOT appear in state_dict(). Use these for scratch computation buffers that can be recomputed.
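A minimal sketch of the persistent versus non-persistent distinction follows. The class name `WithScratch` and the buffer name `scratch` are illustrative, and a dtype conversion stands in for a device move so the example runs on CPU.

```python
import torch
import torch.nn as nn

class WithScratch(nn.Module):  # hypothetical demo class
    def __init__(self):
        super().__init__()
        self.register_buffer("running_mean", torch.zeros(4))               # persistent
        self.register_buffer("scratch", torch.zeros(4), persistent=False)  # not saved

m = WithScratch()
buffer_names = [name for name, _ in m.named_buffers()]
state_keys = list(m.state_dict().keys())

m.to(torch.float64)            # both buffers are converted
scratch_dtype = m.scratch.dtype
print(buffer_names, state_keys, scratch_dtype)
```

Both buffers are tracked and converted, but only the persistent one is written to the checkpoint.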
```python
# Checking what lives where
bn = nn.BatchNorm1d(64)
print("Parameters:", [n for n, _ in bn.named_parameters()])
# ['weight', 'bias']
print("Buffers:", [n for n, _ in bn.named_buffers()])
# ['running_mean', 'running_var', 'num_batches_tracked']
print("State dict keys:", list(bn.state_dict().keys()))
# ['weight', 'bias', 'running_mean', 'running_var', 'num_batches_tracked']
```
Hooks let you intercept a module's execution without modifying its source code. Think of them as wiretaps on the data flow. You can see exactly what goes into a layer, what comes out, and what gradients flow back.
There are three types of hooks:
| Hook Type | Fires When | Sees | Can Modify |
|---|---|---|---|
| register_forward_pre_hook | Before forward() | input | input |
| register_forward_hook | After forward() | input + output | output |
| register_full_backward_hook | After backward() | grad_input + grad_output | grad_input |
```python
# Forward hook: capture every layer's output
activations = {}

def save_activation(name):
    def hook(module, input, output):
        activations[name] = output.detach()
    return hook

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# Attach hooks to every module (named_modules() also yields the root, named '')
for name, layer in model.named_modules():
    layer.register_forward_hook(save_activation(name))

# After forward pass, activations dict has every layer's output
x = torch.randn(1, 784)
out = model(x)
print(activations.keys())
# dict_keys(['0', '1', '2', '']): children fire first, the root ('') fires last
```
```python
# Backward hook: detect NaN gradients instantly
def nan_detector(name):
    def hook(module, grad_input, grad_output):
        for i, g in enumerate(grad_output):
            if g is not None and torch.isnan(g).any():
                print(f"NaN gradient at {name}, grad_output[{i}]")
    return hook

for name, layer in model.named_modules():
    layer.register_full_backward_hook(nan_detector(name))
```
Other practical hook uses: per-layer gradient clipping, feature extraction for transfer learning (grab intermediate representations without rewriting the model), and activation statistics logging.
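The table above lists forward pre-hooks, which the examples so far have not exercised. Here is a small sketch of one that replaces the input, together with the handle that every `register_*_hook` call returns for later removal. The hook name `zero_input` is made up, and zeroing the input is chosen only because it makes the effect easy to verify.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)

def zero_input(module, args):
    # args is the tuple of positional inputs; returning a tuple replaces it.
    (x,) = args
    return (x * 0.0,)

x = torch.randn(3, 4)
handle = layer.register_forward_pre_hook(zero_input)
hooked = layer(x)     # input was zeroed, so every output row equals the bias
handle.remove()       # detach the hook
unhooked = layer(x)   # normal behaviour again

bias_rows = layer.bias.expand(3, 2)
hooked_is_bias = torch.allclose(hooked, bias_rows)
unhooked_is_bias = torch.allclose(unhooked, bias_rows)
print(hooked_is_bias, unhooked_is_bias)
```

Keeping the handle and calling `remove()` matters in practice: hooks left attached after a debugging session keep firing on every forward pass.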
When you write output = model(x), Python calls model.__call__(x). This is NOT the same as calling model.forward(x). The __call__ method does much more than just run forward — it orchestrates the entire execution pipeline.
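Here's what __call__ actually does, in order:
1. Runs any forward pre-hooks, which may replace the inputs.
2. Calls forward() with the (possibly modified) inputs.
3. Runs any forward hooks, which may replace the output.
4. If backward hooks are registered, wires them to the gradients flowing through the module.
Never call model.forward(x) directly. Always call model(x). If you bypass __call__, none of the registered hooks fire. Autograd still tracks the operations, so training appears to work, but anything built on hooks (activation capture, gradient monitoring, wrappers that synchronize or shard parameters) is silently skipped.
Here's a simplified version of what __call__ looks like internally: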
```python
# Simplified nn.Module.__call__ (actual source is ~100 lines)
def __call__(self, *args, **kwargs):
    # Step 1: forward pre-hooks
    for hook in self._forward_pre_hooks.values():
        result = hook(self, args)
        if result is not None:
            args = result if isinstance(result, tuple) else (result,)

    # Step 2: actual forward pass
    output = self.forward(*args, **kwargs)

    # Step 3: forward hooks
    for hook in self._forward_hooks.values():
        hook_result = hook(self, args, output)
        if hook_result is not None:
            output = hook_result

    # Step 4: backward hook registration (if any)
    if self._backward_hooks:
        # ... register gradient hooks on output tensor ...
        pass

    return output
```
This explains a common confusion: "Why does my hook fire when I use model(x) but not when I call model.forward(x)?" Now you know. forward() is just step 2. __call__ is the full pipeline.
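The difference is easy to observe by counting hook invocations. A minimal sketch, using a throwaway list named `calls` as the counter:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
calls = []
# The lambda returns None (list.append returns None), so the output is unchanged.
layer.register_forward_hook(lambda module, args, output: calls.append("hook"))

x = torch.randn(1, 4)

layer(x)                          # goes through __call__: hook fires
count_after_call = len(calls)

layer.forward(x)                  # bypasses __call__: hook does not fire
count_after_forward = len(calls)

print(count_after_call, count_after_forward)
```

The count stays the same after the direct `forward()` call, confirming that the hook machinery lives in `__call__` and not in `forward()`.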
Many libraries depend on this pipeline: they attach hooks to your modules or wrap them (FSDP and DDP, for example). If you call .forward() directly, these libraries will silently malfunction. The model might appear to work but produce subtly wrong outputs.
We saw in Chapter 0 that a plain Python list breaks parameter registration. PyTorch provides three container classes that properly register their contents:
| Container | Use When | Access Pattern |
|---|---|---|
| nn.Sequential | Layers run one after another | model(x) runs all layers in order |
| nn.ModuleList | You need index access or iteration | self.layers[i] or for l in self.layers |
| nn.ModuleDict | You need string-key access | self.heads['classifier'] |
The key difference from plain Python containers: nn.ModuleList is itself an nn.Module. When you assign it to self.layers, it is registered in self._modules, and every layer inside it is registered in the ModuleList's own _modules. The recursive parameters() walk finds them all.
```python
# BROKEN: plain Python list
class BadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(10, 10) for _ in range(3)]

bad = BadModel()
print(sum(p.numel() for p in bad.parameters()))  # 0 !!!

# CORRECT: nn.ModuleList
class GoodModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

good = GoodModel()
print(sum(p.numel() for p in good.parameters()))  # 330
```
Use Sequential when your forward pass is literally "pass x through each layer in order"; you don't even need to write a forward() method. Use ModuleList when you need custom logic (skip connections, branching). Use ModuleDict when you select sub-modules by name (multi-task heads).

```python
# Sequential: forward() is automatic
mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
x = torch.randn(1, 784)
out = mlp(x)  # Runs all 3 in order, no forward() needed

# ModuleList: custom forward logic
class ResBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(3)])

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))  # residual connection
        return x

# ModuleDict: select by name
class MultiTask(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(128, 64)
        self.heads = nn.ModuleDict({
            'classify': nn.Linear(64, 10),
            'regress': nn.Linear(64, 1),
        })

    def forward(self, x, task):
        features = torch.relu(self.backbone(x))
        return self.heads[task](features)
```
PyTorch models are Python objects. This is great for debugging (set breakpoints, print shapes, use Python control flow) but terrible for deployment. You can't run a Python object on a phone, in a C++ server, or in a browser. You need to export the model to a format that doesn't need Python.
TorchScript is PyTorch's solution: a statically-typed subset of Python that can be compiled and run without the Python interpreter. There are two ways to convert:
| Method | How it works | Strengths | Weaknesses |
|---|---|---|---|
| torch.jit.trace | Runs your model ONCE, records all operations | Works with any code | Misses control flow (if/for) |
| torch.jit.script | Parses your Python source, compiles it | Handles control flow | Only supports a subset of Python |
```python
# Tracing: works for simple models
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
example_input = torch.randn(1, 10)
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")  # Can load in C++, no Python needed
```
But tracing has a fatal flaw. It only records ONE execution path. If your model has an if statement, tracing will only capture the branch that ran during tracing:
```python
# This model FAILS to trace correctly
class ConditionalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)
        self.fc_big = nn.Linear(10, 20)

    def forward(self, x):
        if x.shape[0] > 1:  # batch size check
            return self.fc_big(x)
        else:
            return self.fc(x)

model = ConditionalModel()
# Trace with batch=1 → only captures the 'else' branch!
traced = torch.jit.trace(model, torch.randn(1, 10))
# Now traced(torch.randn(4, 10)) STILL uses fc, not fc_big!
# The if-statement was baked out during tracing.
```
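The baked-in branch can be confirmed by comparing output shapes from the eager model and the traced one on a larger batch. This sketch repeats the `ConditionalModel` definition so it runs on its own, and suppresses the TracerWarning that PyTorch emits for the data-dependent `if`.

```python
import warnings
import torch
import torch.nn as nn

class ConditionalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)
        self.fc_big = nn.Linear(10, 20)

    def forward(self, x):
        if x.shape[0] > 1:
            return self.fc_big(x)
        return self.fc(x)

model = ConditionalModel()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the TracerWarning about the branch
    traced = torch.jit.trace(model, torch.randn(1, 10))

batch = torch.randn(4, 10)
eager_shape = tuple(model(batch).shape)    # eager mode takes the fc_big branch
traced_shape = tuple(traced(batch).shape)  # trace replays the recorded fc branch
print(eager_shape, traced_shape)
```

In real code, treat that TracerWarning as a signal that the trace will not generalize, and do not silence it the way this demo does.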
Use torch.jit.script for models with control flow. Scripting parses the actual Python source and compiles the if-statement into the TorchScript IR. Both branches are preserved.

```python
# Scripting handles control flow correctly
scripted = torch.jit.script(model)
# Both branches are compiled into the IR
print(scripted.graph)  # Shows prim::If node

# Works correctly for both batch sizes:
scripted(torch.randn(1, 10)).shape  # torch.Size([1, 5])
scripted(torch.randn(4, 10)).shape  # torch.Size([4, 20])
```
The tradeoff: scripting is stricter. It only supports a subset of Python (no arbitrary objects, limited list comprehensions, all variables must have inferrable types). Complex Python code often needs refactoring to be scriptable.
```python
# Newer alternative: torch.export (PyTorch 2.1+)
from torch.export import export

exported = export(model, (torch.randn(1, 10),))
# Creates an ExportedProgram holding a single full graph of the model.
# Shapes are static unless declared dynamic via dynamic_shapes, so a Python
# branch on a static shape (like the batch-size check above) is specialized
# to the example input. Data-dependent branches need torch.cond.
```
You now understand the full lifecycle of a PyTorch model: construction (parameter registration via __setattr__), execution (the __call__ protocol with hooks), state management (parameters + buffers + state_dict), and export (JIT trace vs script). Let's consolidate with patterns you'll use daily.
```python
class MyCustomLayer(nn.Module):
    def __init__(self, in_dim, out_dim, use_bias=True):
        super().__init__()
        # Parameters: trained by optimizer
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        if use_bias:
            self.bias = nn.Parameter(torch.zeros(out_dim))
        else:
            self.register_parameter('bias', None)  # explicit None
        # Buffers: saved + moved, NOT trained
        self.register_buffer('running_count', torch.tensor(0))
        # Sub-modules: use containers, not lists
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        # shape: x is (batch, in_dim)
        out = x @ self.weight.T  # (batch, out_dim)
        if self.bias is not None:
            out = out + self.bias
        out = self.norm(out)
        self.running_count += 1  # buffer update (no grad)
        return out
```
| Situation | Container | Example |
|---|---|---|
| Layers in fixed order, no skip connections | nn.Sequential | MLP, encoder stack |
| Layers with custom iteration logic | nn.ModuleList | ResNet blocks, U-Net skip |
| Layers selected by name at runtime | nn.ModuleDict | Multi-task heads, routing |
| Single sub-module | Direct attribute | self.norm = LayerNorm(d) |
```python
# Recipe 1: Per-layer gradient clipping
# Attach with: layer.register_full_backward_hook(clip_grad_hook)
def clip_grad_hook(module, grad_input, grad_output):
    return tuple(
        g.clamp(-1.0, 1.0) if g is not None else g
        for g in grad_input
    )

# Recipe 2: Feature extraction (transfer learning)
features = {}

def extract(name):
    def hook(m, inp, out):
        features[name] = out.detach()
    return hook

# Assumes the model has a sub-module named layer3
model.layer3.register_forward_hook(extract('layer3'))

# Recipe 3: Structured pruning (zero out channels)
# Attach to a Conv2d with: conv.register_forward_pre_hook(prune_hook)
def prune_hook(module, input):
    # Zero out bottom 20% of filters by norm (weight is 4-D: out, in, kH, kW)
    w = module.weight.data
    norms = w.norm(dim=(1, 2, 3))  # per-filter norm
    threshold = norms.quantile(0.2)
    mask = (norms > threshold).to(w.dtype)
    module.weight.data *= mask.view(-1, 1, 1, 1)
```
| Topic | Relationship |
|---|---|
| Autograd | Builds the computational graph from the tensor operations that run inside forward() |
| Training loops | The optimizer reads parameters() to know what to update |
| Distributed training | FSDP/DDP wraps modules and uses hooks for gradient sync |
| torch.compile | Modern alternative to JIT: captures graphs without TorchScript limitations |
| ONNX export | The TorchScript-based torch.onnx.export traces the model by default, so the same control-flow limitations apply |
Think of a model as a tree of modules with tensors at the leaves. __call__ is the signal flowing root-to-leaves. state_dict() is a snapshot of all leaf values. JIT/export freezes the tree structure into a portable format. Master the tree, master PyTorch.