Deep Framework Mechanics

PyTorch Internals

What happens between model = MyNet() and model(x) — the machinery that makes deep learning code actually work.

Prerequisites: Basic Python classes + Tensor intuition. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Internals?

You write model = MyNet() and call model(x). Training works. You ship it. Then one day you add a custom layer and your loss stays flat. The optimizer reports zero parameters. Your gradients are all None. You stare at the code for hours.

Here is the bug. Can you spot it?

python
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [
            nn.Linear(64, 32),
            nn.Linear(32, 10)
        ]  # Bug is HERE

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = MyModel()
print(list(model.parameters()))  # [] — EMPTY!

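The model has layers, and those layers have weights, but model.parameters() returns nothing. If this were the whole model, torch.optim would refuse to start: optimizers raise a ValueError on an empty parameter list. The dangerous case is a larger model where only some layers are hidden in a list: the optimizer silently trains the rest, the hidden layers never update, and your loss barely moves.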
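
A minimal sketch of both failure modes, assuming a recent PyTorch release. The class names ListModel and MixedModel are hypothetical and exist only for this illustration. The first model hides every layer in a list, so building an optimizer fails outright; the second hides only one layer, so the optimizer is built without complaint and simply never sees it.

```python
import torch
import torch.nn as nn

class ListModel(nn.Module):
    """Hypothetical: every layer is held in a plain Python list."""
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(64, 32), nn.Linear(32, 10)]

class MixedModel(nn.Module):
    """Hypothetical: one registered layer, one hidden in a list."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(10, 2)       # registered via __setattr__
        self.hidden = [nn.Linear(64, 10)]  # invisible to parameters()

# Case 1: nothing is registered, so the optimizer refuses to start.
try:
    torch.optim.SGD(ListModel().parameters(), lr=0.1)
    optimizer_error = None
except ValueError as err:
    optimizer_error = str(err)
print("optimizer error:", optimizer_error)

# Case 2: only the registered layer is visible; no error is raised.
mixed = MixedModel()
visible = [name for name, _ in mixed.named_parameters()]
optimizer = torch.optim.SGD(mixed.parameters(), lr=0.1)
print("visible parameters:", visible)
```

The second case is the one that costs hours, because nothing in the training loop signals that part of the model is frozen.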

The fix is one word: change the plain Python list to nn.ModuleList. But why does that matter? Why does PyTorch care whether you use a list or a ModuleList?

The answer lives in the internals. PyTorch's nn.Module is not just a Python class — it's a registration system. It tracks parameters, buffers, hooks, and child modules using special __setattr__ magic. A plain Python list bypasses that tracking entirely. Understanding this machinery lets you: debug gradient issues instantly, write custom layers that work correctly, use hooks for interpretability, and export models for deployment.

This lesson takes you inside the machine. By the end, you'll know exactly what happens at every step from module construction to forward pass to JIT export.

The Registration Problem

Watch parameters get registered (or lost). Teal boxes = registered parameters the optimizer can see. Red boxes = orphaned tensors invisible to the optimizer.

Why does model.parameters() return an empty list when layers are stored in a plain Python list?

Chapter 1: nn.Module Anatomy

Every PyTorch model inherits from nn.Module. But what IS a Module, really? Strip away the convenience methods and you find a Python object built around a handful of internal dictionaries. Four of them matter most:

Dictionary | Contents | Purpose
_parameters | nn.Parameter objects | Trainable weights
_buffers | Tensors (non-trainable) | Running stats, masks
_modules | Child nn.Module objects | Sub-layers
_forward_hooks | Callable functions | Intercept forward pass

That's it. A Module is a container of named tensors and sub-containers, plus machinery to traverse them recursively. When you call model.parameters(), it walks _parameters on itself, then recursively on every child in _modules.

Let's build a Module from scratch — no inheritance, just dicts — to see why the class is needed:

python
# A "module" without nn.Module — just raw dicts
import torch

my_module = {
    "_parameters": {
        "weight": torch.randn(10, 5, requires_grad=True),
        "bias": torch.randn(10, requires_grad=True),
    },
    "_buffers": {},
    "_modules": {},
}

def forward(module, x):
    W = module["_parameters"]["weight"]
    b = module["_parameters"]["bias"]
    return x @ W.T + b

# This works! But...
# - No recursive parameter collection
# - No .to(device) that moves everything
# - No hooks for debugging
# - No state_dict for saving/loading
# - No __setattr__ magic for registration

Key insight: nn.Module exists because neural networks are TREES of parameterized operations. You need to recursively collect parameters from all sub-modules, move them to devices together, save/load state, and intercept forward passes. A plain dict can't do any of that automatically.

The magic happens in __setattr__. When you write self.linear = nn.Linear(10, 5) inside __init__, Python calls Module.__setattr__("linear", ...). This method checks: is the value an nn.Parameter? Put it in _parameters. Is it an nn.Module? Put it in _modules. Does the name already belong to a buffer created with register_buffer()? Update the entry in _buffers. Anything else, including a plain tensor, is stored as a regular Python attribute.

python
# Simplified __setattr__ (actual PyTorch source is ~80 lines)
def __setattr__(self, name, value):
    if isinstance(value, Parameter):
        self._parameters[name] = value   # trainable tensor
    elif isinstance(value, Module):
        self._modules[name] = value      # child module
    elif name in self._buffers:
        self._buffers[name] = value      # re-assigning an existing buffer
    else:
        object.__setattr__(self, name, value)  # normal Python attribute
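
To see the routing on a real module, you can read the internal dicts directly. The sketch below uses a hypothetical Probe class with one attribute of each kind. The underscore-prefixed dicts are private implementation details, so treat this as a debugging aid rather than an API to build on.

```python
import torch
import torch.nn as nn

class Probe(nn.Module):
    """Hypothetical module with one attribute of each kind."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)                     # nn.Module
        self.gain = nn.Parameter(torch.ones(5))        # nn.Parameter
        self.register_buffer("steps", torch.zeros(1))  # buffer
        self.note = "plain attribute"                  # anything else
        self.raw = torch.zeros(5)                      # plain tensor

probe = Probe()
probe.steps = probe.steps + 1  # re-assignment stays inside _buffers

modules = list(probe._modules)
parameters = list(probe._parameters)
buffers = list(probe._buffers)
plain = [k for k in ("note", "raw", "steps") if k in probe.__dict__]

print("modules:   ", modules)
print("parameters:", parameters)
print("buffers:   ", buffers)
print("plain:     ", plain)
```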

Module Internal Structure

Click attributes to see which internal dict they land in. The orange path shows how __setattr__ routes each assignment.

When you write self.fc = nn.Linear(10, 5) in __init__, where does PyTorch store the Linear layer?

Chapter 2: Parameter Registration

A Parameter is just a Tensor subclass with two special properties: it has requires_grad=True by default, and Module.__setattr__ recognizes it and files it in the module's _parameters dict. That's the entire difference. But that difference is everything.

python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # This IS tracked — shows up in parameters()
        self.weight = nn.Parameter(torch.randn(10, 5))

        # This is NOT tracked — invisible to optimizer
        self.scale = torch.randn(10)

model = Demo()
print(list(model.named_parameters()))
# [('weight', Parameter containing: tensor(...))]
# Notice: 'scale' is MISSING

The scale tensor exists on the object, but the optimizer never sees it. It won't get gradients. It won't be saved in state_dict(). It won't move when you call model.to('cuda'). It's a ghost.

The rule: If a tensor should be trained by the optimizer, wrap it in nn.Parameter(). If it shouldn't be trained but should move with the model and appear in state_dict, use register_buffer(). If it's neither — it's just a local variable that shouldn't be on the Module at all.
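
The three cases in the rule can be compared side by side. This is a minimal sketch with a hypothetical ThreeKinds class. It uses a dtype conversion instead of a device move so it runs without a GPU; .to('cuda') follows the same registration logic.

```python
import torch
import torch.nn as nn

class ThreeKinds(nn.Module):
    """Hypothetical module holding one tensor of each kind."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(10, 5))  # trained
        self.register_buffer("mask", torch.ones(10))    # saved + moved
        self.scale = torch.randn(10)                    # ghost

model = ThreeKinds().to(torch.float64)

param_names = [n for n, _ in model.named_parameters()]
state_keys = list(model.state_dict().keys())

print("parameters:", param_names)
print("state_dict:", state_keys)
print("weight dtype:", model.weight.dtype)  # converted
print("mask dtype:  ", model.mask.dtype)    # converted
print("scale dtype: ", model.scale.dtype)   # left behind
```

The plain tensor keeps its original dtype because .to() only visits registered parameters and buffers.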

Let's trace through what named_parameters() actually does:

model.named_parameters() starts the recursive walk:

1. yield from self._parameters: yields ('weight', tensor) for each parameter on THIS module
2. for name, child in self._modules: recurse into each child module
3. yield (prefix + child_name + '.' + param_name, param): prefix the parameter name with the module path
4. ↻ recurse deeper
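
The walk above can be sketched as a short generator. The function name walk_named_parameters is hypothetical, and this is not PyTorch's actual implementation, which also de-duplicates shared parameters. It is only meant to show where the dotted names come from.

```python
import torch.nn as nn

def walk_named_parameters(module, prefix=""):
    """Hypothetical re-implementation of the recursive walk."""
    # Parameters owned directly by this module.
    for name, param in module._parameters.items():
        if param is not None:
            yield prefix + name, param
    # Then recurse into each child, extending the dotted prefix.
    for child_name, child in module._modules.items():
        if child is not None:
            yield from walk_named_parameters(child, prefix + child_name + ".")

model = nn.Sequential(
    nn.Linear(4, 3),
    nn.Sequential(nn.Linear(3, 3), nn.ReLU()),
    nn.Linear(3, 2),
)

ours = [name for name, _ in walk_named_parameters(model)]
theirs = [name for name, _ in model.named_parameters()]
print(ours)
```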

This is why state_dict() has dot-separated keys like "encoder.layer.0.self_attn.q_proj.weight" — they encode the full path through the module tree.

python
# state_dict keys reflect the module tree structure
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)
print(model.state_dict().keys())
# odict_keys(['0.weight', '0.bias', '2.weight', '2.bias'])
# Notice: ReLU (index 1) has NO parameters — it's stateless

Parameter vs Plain Tensor

Toggle between Parameter and plain tensor assignment. Watch which tensors the optimizer can see and which become ghosts.

You write self.mask = torch.ones(10) in __init__. What happens to self.mask?

Chapter 3: Buffers & State

Parameters are tensors the optimizer trains. But what about tensors that need to live on the model, move to GPU, get saved and loaded — but should not receive gradients? That's what buffers are for.

The classic example is BatchNorm. It has four main tensors (plus a num_batches_tracked counter, which is also a buffer):

Name | Type | Trained? | Purpose
weight (γ) | Parameter | Yes | Learned scale
bias (β) | Parameter | Yes | Learned shift
running_mean | Buffer | No | EMA of batch means (for eval mode)
running_var | Buffer | No | EMA of batch variances (for eval mode)

The running statistics are updated during training (via exponential moving average) but NOT by the optimizer — they have no gradients. They must be saved with the model (they're needed at inference) and must move to GPU with everything else.
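
You can watch this behaviour on the built-in layer. A small sketch, assuming the default momentum of nn.BatchNorm1d: the running mean moves during a training-mode forward pass, stays fixed in eval mode, and never carries a gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(16, 3) + 5.0  # batch whose mean is far from zero

before = bn.running_mean.clone()

bn.train()
bn(x)  # forward pass in training mode updates the buffers
after_train = bn.running_mean.clone()

bn.eval()
bn(x)  # forward pass in eval mode only reads the buffers
after_eval = bn.running_mean.clone()

print("before:     ", before)
print("after train:", after_train)
print("after eval: ", after_eval)
print("batches tracked:", int(bn.num_batches_tracked))
print("requires_grad:", bn.running_mean.requires_grad)
```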

python
class MyBatchNorm(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # Parameters (trained by optimizer)
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

        # Buffers (NOT trained, but saved + moved with model)
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))
        self.register_buffer('num_batches_tracked', torch.tensor(0, dtype=torch.long))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)
            # Update running stats (no grad!)
            self.running_mean = 0.9 * self.running_mean + 0.1 * mean.detach()
            self.running_var = 0.9 * self.running_var + 0.1 * var.detach()
        else:
            mean = self.running_mean
            var = self.running_var
        x_norm = (x - mean) / torch.sqrt(var + 1e-5)
        return self.weight * x_norm + self.bias

Key insight: register_buffer('name', tensor) does THREE things: (1) stores the tensor in self._buffers, (2) makes it appear in state_dict(), and (3) makes .to(device) move it. It does NOT add it to parameters(). The optimizer never touches buffers.

There's one more option: register_buffer('name', tensor, persistent=False). Non-persistent buffers move with .to() but do NOT appear in state_dict(). Use these for scratch computation buffers that can be recomputed.
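
A short sketch of the difference, using a hypothetical WithScratch module. As before, a dtype conversion stands in for a device move so the example runs on CPU.

```python
import torch
import torch.nn as nn

class WithScratch(nn.Module):
    """Hypothetical module with one persistent and one scratch buffer."""
    def __init__(self):
        super().__init__()
        self.register_buffer("running_mean", torch.zeros(4))
        self.register_buffer("scratch", torch.zeros(4), persistent=False)

m = WithScratch().to(torch.float64)

buffer_names = [n for n, _ in m.named_buffers()]
state_keys = list(m.state_dict().keys())

print("buffers:   ", buffer_names)  # both are tracked and converted
print("state_dict:", state_keys)    # only the persistent one is saved
print("scratch dtype:", m.scratch.dtype)
```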

python
# Checking what lives where
bn = nn.BatchNorm1d(64)

print("Parameters:", [n for n, _ in bn.named_parameters()])
# ['weight', 'bias']

print("Buffers:", [n for n, _ in bn.named_buffers()])
# ['running_mean', 'running_var', 'num_batches_tracked']

print("State dict keys:", list(bn.state_dict().keys()))
# ['weight', 'bias', 'running_mean', 'running_var', 'num_batches_tracked']

BatchNorm Internals

Watch how parameters (trained) and buffers (tracked but untrained) behave differently during training. Orange = parameters receiving gradients. Teal = buffers updated by EMA.

What does a buffer registered with persistent=False do?

Chapter 4: Forward & Backward Hooks

Hooks let you intercept a module's execution without modifying its source code. Think of them as wiretaps on the data flow. You can see exactly what goes into a layer, what comes out, and what gradients flow back.

There are three types of hooks:

Hook Type | Fires When | Sees | Can Modify
register_forward_pre_hook | Before forward() | input | input
register_forward_hook | After forward() | input + output | output
register_full_backward_hook | During backward, once gradients w.r.t. the module's inputs are computed | grad_input + grad_output | grad_input

python
# Forward hook: capture every layer's output
activations = {}

def save_activation(name):
    def hook(module, input, output):
        activations[name] = output.detach()
    return hook

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# Attach hooks to every layer
for name, layer in model.named_modules():
    layer.register_forward_hook(save_activation(name))

# After forward pass, activations dict has every layer's output
x = torch.randn(1, 784)
out = model(x)
print(activations.keys())  # dict_keys(['', '0', '1', '2'])

Use case: debugging NaN gradients. Register a backward hook on every layer. When a NaN appears, you instantly know which layer produced it, with no binary search and no print statements scattered through forward().

python
# Backward hook: detect NaN gradients instantly
def nan_detector(name):
    def hook(module, grad_input, grad_output):
        for i, g in enumerate(grad_output):
            if g is not None and torch.isnan(g).any():
                print(f"NaN gradient at {name}, grad_output[{i}]")
    return hook

for name, layer in model.named_modules():
    layer.register_full_backward_hook(nan_detector(name))

Other practical hook uses: per-layer gradient clipping, feature extraction for transfer learning (grab intermediate representations without rewriting the model), and activation statistics logging.
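
Two details are easy to miss: a hook can replace the value it intercepts by returning something, and every register_* call returns a handle whose remove() method detaches the hook again. A minimal sketch, where the hook functions scale_input and zero_output are hypothetical names chosen for the illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)
x = torch.randn(2, 4)
baseline = layer(x)

def scale_input(module, args):
    """Pre-hook: replace the positional inputs with doubled copies."""
    (inp,) = args
    return (inp * 2.0,)

def zero_output(module, args, output):
    """Forward hook: replace the output with zeros."""
    return torch.zeros_like(output)

pre_handle = layer.register_forward_pre_hook(scale_input)
doubled = layer(x)   # forward() actually receives 2 * x
pre_handle.remove()  # detach the pre-hook

post_handle = layer.register_forward_hook(zero_output)
zeroed = layer(x)    # caller sees zeros instead of the real output
post_handle.remove()

restored = layer(x)  # no hooks left: back to the original behaviour
print(torch.equal(restored, baseline))
```

Keeping the handles around matters in long-running sessions: hooks that are never removed keep firing, and closures that store activations keep tensors alive.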

Hooks Firing During Forward Pass

Click "Forward Pass" to watch data flow through layers. Yellow flashes show hooks firing and capturing activations. Purple flashes show backward hooks catching gradients.

You want to modify the output of a layer without changing its source code. Which hook type do you use?

Chapter 5: The __call__ Protocol

When you write output = model(x), Python calls model.__call__(x). This is NOT the same as calling model.forward(x). The __call__ method does much more than just run forward — it orchestrates the entire execution pipeline.

Here's what __call__ actually does, in order:

1. Run forward pre-hooks: each registered pre-hook can inspect or modify the input
2. Call self.forward(input): YOUR code runs here
3. Run forward hooks: each registered hook sees input + output and can modify the output
4. Wire up backward hooks (only if any are registered): connects them to the autograd nodes for this module's inputs and outputs
5. Return output: back to the caller

Critical rule: Do not call model.forward(x) directly; always call model(x). Autograd records operations inside forward() either way, so gradients are still computed, but bypassing __call__ skips every hook. Anything that depends on those hooks (debugging taps, wrappers, library integrations) silently stops working.

Here's a simplified version of what __call__ looks like internally:

python
# Simplified nn.Module.__call__ (actual source is ~100 lines)
def __call__(self, *args, **kwargs):
    # Step 1: forward pre-hooks
    for hook in self._forward_pre_hooks.values():
        result = hook(self, args)
        if result is not None:
            args = result if isinstance(result, tuple) else (result,)

    # Step 2: actual forward pass
    output = self.forward(*args, **kwargs)

    # Step 3: forward hooks
    for hook in self._forward_hooks.values():
        hook_result = hook(self, args, output)
        if hook_result is not None:
            output = hook_result

    # Step 4: backward hook registration (if any)
    if self._backward_hooks:
        # ... register gradient hooks on output tensor ...
        pass

    return output

This explains a common confusion: "Why does my hook fire when I use model(x) but not when I call model.forward(x)?" Now you know. forward() is just step 2. __call__ is the full pipeline.
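
A quick experiment makes the difference concrete. In this small sketch the hook only fires on the __call__ path; calling forward() directly skips it, while the tensor operations inside forward() are still recorded by autograd.

```python
import torch
import torch.nn as nn

calls = []
layer = nn.Linear(3, 1)
# The hook returns None, so it observes the output without changing it.
handle = layer.register_forward_hook(lambda m, args, out: calls.append("hook"))

x = torch.randn(2, 3)

layer(x)                       # full pipeline: the hook fires
count_after_call = len(calls)

layer.forward(x)               # step 2 only: the hook is skipped
count_after_forward = len(calls)

# Gradients still flow when forward() is called directly.
layer.forward(x).sum().backward()
has_grad = layer.weight.grad is not None

print(count_after_call, count_after_forward, has_grad)
handle.remove()
```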

Practical consequence: Libraries such as HuggingFace Transformers, FSDP, and torch.compile can attach behavior through hooks or through the __call__ path. If you call .forward() directly, that behavior may be skipped without any error, so the model appears to work while doing something subtly different from what you configured.

__call__ vs .forward() Execution

Compare what happens when you call the model properly vs calling forward directly. Notice which steps get skipped.

What breaks if you call model.forward(x) instead of model(x)?

Chapter 6: Module Containers

We saw in Chapter 0 that a plain Python list breaks parameter registration. PyTorch provides three container classes that properly register their contents:

Container | Use When | Access Pattern
nn.Sequential | Layers run one after another | model(x) runs all layers in order
nn.ModuleList | You need index access or iteration | self.layers[i] or for l in self.layers
nn.ModuleDict | You need string-key access | self.heads['classifier']

The key difference from plain Python containers: an nn.ModuleList is itself a Module. When you assign it to self.layers, it is registered in self._modules under the name "layers", and every Module inside it is registered in the ModuleList's own _modules. The recursive parameters() walk finds them all.

python
# BROKEN: plain Python list
class BadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(10, 10) for _ in range(3)]

bad = BadModel()
print(sum(p.numel() for p in bad.parameters()))  # 0 !!!

# CORRECT: nn.ModuleList
class GoodModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

good = GoodModel()
print(sum(p.numel() for p in good.parameters()))  # 330
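
The registration path is visible if you look one level down. A small sketch, using a hypothetical Stack class: the container shows up as a single child named layers, and its own children are keyed by their index, which is where parameter names such as layers.0.weight come from.

```python
import torch.nn as nn

class Stack(nn.Module):
    """Hypothetical model holding two layers in an nn.ModuleList."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(2)])

m = Stack()

top_level = list(m._modules.keys())      # the container itself
inner = list(m.layers._modules.keys())   # children keyed by index
names = [n for n, _ in m.named_parameters()]

print(top_level)
print(inner)
print(names)
```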

When to use which: Use Sequential when your forward pass is literally "pass x through each layer in order", in which case you don't even need to write a forward() method. Use ModuleList when you need custom logic (skip connections, branching). Use ModuleDict when you select sub-modules by name (multi-task heads).

python
# Sequential: forward() is automatic
mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
x = torch.randn(1, 784)  # dummy input: batch of 1, 784 features
out = mlp(x)  # Runs all 3 in order, no forward() needed

# ModuleList: custom forward logic
class ResBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(3)])

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))  # residual connection
        return x

# ModuleDict: select by name
class MultiTask(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(128, 64)
        self.heads = nn.ModuleDict({
            'classify': nn.Linear(64, 10),
            'regress': nn.Linear(64, 1),
        })

    def forward(self, x, task):
        features = torch.relu(self.backbone(x))
        return self.heads[task](features)

Container Registration Comparison

See how parameter counts differ between plain list, ModuleList, and Sequential. The number visible to the optimizer is what matters for training.

You have 5 Linear layers with skip connections. Which container should you use?

Chapter 7: Model Inspector

Time to put everything together. This interactive inspector lets you explore real model architectures — see the module tree, parameter counts, data shapes flowing through, and where hooks would attach.

Select a model below. Click any node in the tree to inspect its parameters, buffers, and internal state. Click "Forward Pass" to watch tensor shapes propagate through the network.

Interactive Model Inspector

Select a model architecture, then explore its internals. Click layers to inspect. Run forward to see data flow.

Try this: Select "Transformer Block" and click "Forward Pass." Watch how the input tensor (batch=2, seq=8, dim=64) flows through self-attention, add+norm, FFN, and add+norm again. Notice how shapes stay the same through residual connections but change inside the FFN.

Chapter 8: JIT & TorchScript

PyTorch models are Python objects. This is great for debugging (set breakpoints, print shapes, use Python control flow) but terrible for deployment. You can't run a Python object on a phone, in a C++ server, or in a browser. You need to export the model to a format that doesn't need Python.

TorchScript is PyTorch's solution: a statically-typed subset of Python that can be compiled and run without the Python interpreter. There are two ways to convert:

Method | How it works | Strengths | Weaknesses
torch.jit.trace | Runs your model ONCE on an example input, records the tensor operations | Works with most Python code, since only the executed ops are recorded | Misses data-dependent control flow (if/for)
torch.jit.script | Parses your Python source, compiles it | Handles control flow | Only supports a subset of Python

python
# Tracing: works for simple models
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
example_input = torch.randn(1, 10)
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")  # Can load in C++, no Python needed
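
To check that a traced module survives serialization, you can round-trip it and compare outputs. This sketch writes to an in-memory buffer instead of a file, an assumption made only to keep the example free of side effects; saving to a path works the same way.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2)).eval()
example_input = torch.randn(1, 10)

traced = torch.jit.trace(model, example_input)

# Serialize to memory, then load it back as a fresh ScriptModule.
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)
reloaded = torch.jit.load(buffer)

# The reloaded module should reproduce the eager model's output,
# including on a batch size that was not used for tracing.
batch = torch.randn(3, 10)
same_on_example = torch.allclose(reloaded(example_input), model(example_input))
same_on_batch = torch.allclose(reloaded(batch), model(batch))
print(same_on_example, same_on_batch)
```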

But tracing has a fatal flaw. It only records ONE execution path. If your model has an if statement, tracing will only capture the branch that ran during tracing:

python
# This model FAILS to trace correctly
class ConditionalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)
        self.fc_big = nn.Linear(10, 20)

    def forward(self, x):
        if x.shape[0] > 1:  # batch size check
            return self.fc_big(x)
        else:
            return self.fc(x)

model = ConditionalModel()
# Trace with batch=1 → only captures the 'else' branch!
traced = torch.jit.trace(model, torch.randn(1, 10))
# Now traced(torch.randn(4, 10)) STILL uses fc, not fc_big!
# The if-statement was baked out during tracing.
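
The baked-in branch is easy to confirm by comparing output shapes. This sketch repeats the ConditionalModel definition so it runs on its own, and silences the warning that tracing emits when a tensor is converted to a Python bool.

```python
import warnings
import torch
import torch.nn as nn

class ConditionalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)
        self.fc_big = nn.Linear(10, 20)

    def forward(self, x):
        if x.shape[0] > 1:  # Python-level branch on the batch size
            return self.fc_big(x)
        return self.fc(x)

model = ConditionalModel().eval()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # tracing warns about the branch
    traced = torch.jit.trace(model, torch.randn(1, 10))

batch = torch.randn(4, 10)
eager_shape = tuple(model(batch).shape)    # takes the fc_big branch
traced_shape = tuple(traced(batch).shape)  # still runs fc

print("eager: ", eager_shape)
print("traced:", traced_shape)
```

No error is raised at any point, which is why a mismatch like this usually shows up only as wrong predictions after deployment.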

The fix: Use torch.jit.script for models with control flow. Scripting parses the actual Python source and compiles the if-statement into the TorchScript IR. Both branches are preserved.

python
# Scripting handles control flow correctly
scripted = torch.jit.script(model)
# Both branches are compiled into the IR
print(scripted.graph)  # Shows prim::If node

# Works correctly for both batch sizes:
scripted(torch.randn(1, 10)).shape   # torch.Size([1, 5])
scripted(torch.randn(4, 10)).shape   # torch.Size([4, 20])

The tradeoff: scripting is stricter. It only supports a subset of Python (no arbitrary objects, limited list comprehensions, all variables must have inferrable types). Complex Python code often needs refactoring to be scriptable.

python
# Modern alternative: torch.export (PyTorch 2.1+)
# Built on the torch.compile tracing stack rather than TorchScript
from torch.export import export

exported = export(model, (torch.randn(1, 10),))
# Creates an ExportedProgram holding a single captured graph
# Supports dynamic shapes, but Python branches on shapes are specialized
# to the example input; data-dependent branching needs torch.cond

Trace vs Script: Control Flow

Watch how tracing captures only one path while scripting preserves both branches. The red path shows the branch that tracing missed.

Your model uses an if-statement that checks input batch size. You trace it with batch_size=1. What happens when you run the traced model with batch_size=8?

Chapter 9: Mastery & Connections

You now understand the full lifecycle of a PyTorch model: construction (parameter registration via __setattr__), execution (the __call__ protocol with hooks), state management (parameters + buffers + state_dict), and export (JIT trace vs script). Let's consolidate with patterns you'll use daily.

Custom Module Template

python
class MyCustomLayer(nn.Module):
    def __init__(self, in_dim, out_dim, use_bias=True):
        super().__init__()
        # Parameters: trained by optimizer
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        if use_bias:
            self.bias = nn.Parameter(torch.zeros(out_dim))
        else:
            self.register_parameter('bias', None)  # explicit None

        # Buffers: saved + moved, NOT trained
        self.register_buffer('running_count', torch.tensor(0))

        # Sub-modules: use containers, not lists
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        # shape: x is (batch, in_dim)
        out = x @ self.weight.T  # (batch, out_dim)
        if self.bias is not None:
            out = out + self.bias
        out = self.norm(out)
        self.running_count += 1  # buffer update (no grad)
        return out
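
The explicit register_parameter('bias', None) line in the template is worth a closer look. A minimal sketch, using a hypothetical Tiny module: the name is reserved, so self.bias exists and is None, but it is skipped by both named_parameters() and state_dict().

```python
import torch
import torch.nn as nn

class Tiny(nn.Module):
    """Hypothetical layer with an optional bias."""
    def __init__(self, use_bias):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(2, 3))
        if use_bias:
            self.bias = nn.Parameter(torch.zeros(2))
        else:
            self.register_parameter("bias", None)  # reserve the name

no_bias = Tiny(use_bias=False)
with_bias = Tiny(use_bias=True)

names_without = [n for n, _ in no_bias.named_parameters()]
names_with = [n for n, _ in with_bias.named_parameters()]
keys_without = list(no_bias.state_dict().keys())

print(no_bias.bias)       # None, but the attribute exists
print(names_without)
print(names_with)
```

Reserving the name keeps forward() simple: it can always test self.bias is not None instead of calling hasattr.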

When to Use Each Container

Situation | Container | Example
Layers in fixed order, no skip connections | nn.Sequential | MLP, encoder stack
Layers with custom iteration logic | nn.ModuleList | ResNet blocks, U-Net skip
Layers selected by name at runtime | nn.ModuleDict | Multi-task heads, routing
Single sub-module | Direct attribute | self.norm = LayerNorm(d)

Hook Recipes

python
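# Recipe 1: Per-layer gradient clipping
# Attach with: layer.register_full_backward_hook(clip_grad_hook)
# Note: this clips grad_input (gradients w.r.t. the layer's inputs),
# not the gradients stored on the layer's own parameters.
def clip_grad_hook(module, grad_input, grad_output):
    return tuple(
        g.clamp(-1.0, 1.0) if g is not None else g
        for g in grad_input
    )

# Recipe 2: Feature extraction (transfer learning)
# Assumes model has a submodule named layer3 (e.g. a torchvision ResNet)
features = {}
def extract(name):
    def hook(m, inp, out):
        features[name] = out.detach()
    return hook
model.layer3.register_forward_hook(extract('layer3'))

# Recipe 3: Structured pruning (zero out channels)
# Attach with: conv.register_forward_pre_hook(prune_hook)
# Assumes a Conv2d weight of shape (out_channels, in_channels, kH, kW)
def prune_hook(module, input):
    # Zero out bottom 20% of filters by L2 norm
    w = module.weight.data
    norms = w.norm(dim=(1,2,3))  # per-filter norm
    threshold = norms.quantile(0.2)
    mask = norms > threshold
    module.weight.data *= mask.view(-1, 1, 1, 1)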
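
Recipe 1 is worth testing on a case small enough to compute by hand. In this sketch the weights are fixed at 5.0 so the unclipped gradient with respect to every input element would be 5.0. The setup is an assumption chosen to make the effect obvious, not part of the recipe itself.

```python
import torch
import torch.nn as nn

def clip_grad_hook(module, grad_input, grad_output):
    """Clamp gradients flowing back to the module's inputs."""
    return tuple(
        g.clamp(-1.0, 1.0) if g is not None else g
        for g in grad_input
    )

layer = nn.Linear(4, 1)
with torch.no_grad():
    layer.weight.fill_(5.0)  # d(out)/dx is 5.0 for every input element

handle = layer.register_full_backward_hook(clip_grad_hook)

x = torch.ones(2, 4, requires_grad=True)
layer(x).sum().backward()

print("x.grad:     ", x.grad)             # clipped from 5.0 to 1.0
print("weight.grad:", layer.weight.grad)  # not touched by the hook
handle.remove()
```

The weight gradient is unchanged because a module backward hook only rewrites what flows to the layer's inputs. To clip parameter gradients themselves, a utility such as torch.nn.utils.clip_grad_norm_ applied after backward() is the more direct tool.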

Connections

Topic | Relationship
Autograd | Records the computational graph as tensor operations run inside forward()
Training loops | The optimizer reads parameters() to know what to update
Distributed training | FSDP/DDP wraps modules and uses hooks for gradient sync
torch.compile | Modern alternative to JIT: captures graphs without TorchScript limitations
ONNX export | The TorchScript-based exporter traces by default, so the same control-flow limitations apply

The mental model: nn.Module is a tree. Parameters are leaves. Hooks are sensors on the branches. __call__ is the signal flowing root-to-leaves. state_dict() is a snapshot of all leaf values. JIT/export freezes the tree structure into a portable format. Master the tree, master PyTorch.

You're building a model with 3 attention heads that you select by name at inference ("fast", "accurate", "balanced"). Which container?