PyTorch Internals

Chapter 0: Why Internals?

You write model = MyNet() and call model(x). Training works. You ship it. Then one day you add a custom layer and your loss stays flat. The optimizer reports zero parameters. Your gradients are all None. You stare at the code for hours.

Here is the bug. Can you spot it?

python
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [
            nn.Linear(64, 32),
            nn.Linear(32, 10)
        ]  # Bug is HERE

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = MyModel()
print(list(model.parameters()))  # [] — EMPTY!

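The model has layers. They have weights. But model.parameters() returns nothing. The optimizer has nothing to train: torch.optim rejects an empty parameter list with a ValueError, and in a larger model where only some layers sit in a plain list, those layers silently never update. Your loss is a flat line.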

The fix is one word: change the plain Python list to nn.ModuleList. But why does that matter? Why does PyTorch care whether you use a list or a ModuleList?

The answer lives in the internals. PyTorch's nn.Module is not just a Python class — it's a registration system. It tracks parameters, buffers, hooks, and child modules using special __setattr__ magic. A plain Python list bypasses that tracking entirely. Understanding this machinery lets you: debug gradient issues instantly, write custom layers that work correctly, use hooks for interpretability, and export models for deployment.
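To make the one-word fix concrete, here is a minimal sketch that builds both versions and hands their parameters to an optimizer. The class names Broken and Fixed are made up for this illustration, and the layer sizes are copied from the example above.

```python
import torch
import torch.nn as nn


class Broken(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain list: the Linear layers are never registered
        self.layers = [nn.Linear(64, 32), nn.Linear(32, 10)]


class Fixed(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList is itself an nn.Module, so it gets registered
        self.layers = nn.ModuleList([nn.Linear(64, 32), nn.Linear(32, 10)])


# With nothing registered, the optimizer refuses to be constructed
try:
    torch.optim.SGD(Broken().parameters(), lr=0.1)
    broken_error = None
except ValueError as err:
    broken_error = str(err)

# With ModuleList, the optimizer receives two weights and two biases
optimizer = torch.optim.SGD(Fixed().parameters(), lr=0.1)
n_tensors = sum(len(group["params"]) for group in optimizer.param_groups)
print(broken_error is not None, n_tensors)  # True 4
```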

This lesson takes you inside the machine. By the end, you'll know exactly what happens at every step from module construction to forward pass to JIT export.

The Registration Problem

Watch parameters get registered (or lost). Teal boxes = registered parameters the optimizer can see. Red boxes = orphaned tensors invisible to the optimizer.

Why does model.parameters() return an empty list when layers are stored in a plain Python list?

Python lists can't hold nn.Module objects

The optimizer hasn't been initialized yet

nn.Module's registration system only tracks values assigned directly as attributes (via __setattr__), not the contents of a plain list

Chapter 1: nn.Module Anatomy

Every PyTorch model inherits from nn.Module. But what IS a Module, really? Strip away the convenience methods and you find a Python object built around a handful of internal dictionaries. Four of them matter most here:

Dictionary	Contents	Purpose
`_parameters`	nn.Parameter objects	Trainable weights
`_buffers`	Tensors (non-trainable)	Running stats, masks
`_modules`	Child nn.Module objects	Sub-layers
`_forward_hooks`	Callable functions	Intercept forward pass

That's it. A Module is a container of named tensors and sub-containers, plus machinery to traverse them recursively. When you call model.parameters(), it walks _parameters on itself, then recursively on every child in _modules.
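You can peek at these dictionaries on any real module. The sketch below is only an inspection aid: the underscore-prefixed attributes are internal details, and the no-op hook exists purely to show where hooks land.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
param_keys = list(bn._parameters.keys())   # ['weight', 'bias']
buffer_keys = list(bn._buffers.keys())     # ['running_mean', 'running_var', 'num_batches_tracked']

net = nn.Sequential(nn.Linear(4, 4), bn)
child_keys = list(net._modules.keys())     # ['0', '1']
own_params = list(net._parameters.keys())  # [] (the container owns no tensors itself)

# Registering a forward hook adds one entry to _forward_hooks
handle = bn.register_forward_hook(lambda module, inputs, output: None)
hooks_registered = len(bn._forward_hooks)  # 1
handle.remove()
hooks_after_remove = len(bn._forward_hooks)  # 0

print(param_keys, buffer_keys, child_keys, own_params)
```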

Let's build a Module from scratch — no inheritance, just dicts — to see why the class is needed:

python
# A "module" without nn.Module — just raw dicts
import torch

my_module = {
    "_parameters": {
        "weight": torch.randn(10, 5, requires_grad=True),
        "bias": torch.randn(10, requires_grad=True),
    },
    "_buffers": {},
    "_modules": {},
}

def forward(module, x):
    W = module["_parameters"]["weight"]
    b = module["_parameters"]["bias"]
    return x @ W.T + b

# This works! But...
# - No recursive parameter collection
# - No .to(device) that moves everything
# - No hooks for debugging
# - No state_dict for saving/loading
# - No __setattr__ magic for registration

Key insight: nn.Module exists because neural networks are TREES of parameterized operations. You need to recursively collect parameters from all sub-modules, move them to devices together, save/load state, and intercept forward passes. A plain dict can't do any of that automatically.

The magic happens in __setattr__. When you write self.linear = nn.Linear(10, 5) inside __init__, Python calls self.__setattr__("linear", ...), which resolves to Module.__setattr__. This method checks: is the value an nn.Parameter? Put it in _parameters. Is it an nn.Module? Put it in _modules. Is it a tensor assigned to a name that register_buffer() already registered? Update _buffers. Anything else, including a plain tensor under a new name, is stored as a regular Python attribute.

python
# Simplified __setattr__ (actual PyTorch source is ~80 lines)
def __setattr__(self, name, value):
    if isinstance(value, Parameter):
        self._parameters[name] = value
    elif isinstance(value, Module):
        self._modules[name] = value
    else:
        object.__setattr__(self, name, value)  # normal Python
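The routing can be observed on a real module. This is a small sketch: the class name Routing and its attribute names are invented for the example, and it reads the internal dicts only to show where each assignment ended up.

```python
import torch
import torch.nn as nn


class Routing(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(3))         # lands in _parameters
        self.fc = nn.Linear(3, 2)                     # lands in _modules
        self.register_buffer("mask", torch.ones(3))   # lands in _buffers
        self.scratch = torch.ones(3)                  # plain attribute
        self.hidden_size = 3                          # plain attribute


m = Routing()

# Re-assigning a tensor to an already registered buffer name keeps it a buffer
m.mask = torch.zeros(3)

print(list(m._parameters), list(m._modules), list(m._buffers))
# ['w'] ['fc'] ['mask']
print("scratch" in m.__dict__, "w" in m.__dict__)  # True False
```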

Module Internal Structure

Click attributes to see which internal dict they land in. The orange path shows how __setattr__ routes each assignment.

When you write self.fc = nn.Linear(10, 5) in __init__, where does PyTorch store the Linear layer?

In self._parameters

In self._modules

In self._buffers

As a regular Python attribute

Chapter 2: Parameter Registration

A Parameter is just a Tensor subclass with two special properties: requires_grad=True by default, and Module.__setattr__ recognizes it and registers it in the Module's parameter tracking system. That's the entire difference. But that difference is everything.

python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # This IS tracked — shows up in parameters()
        self.weight = nn.Parameter(torch.randn(10, 5))

        # This is NOT tracked — invisible to optimizer
        self.scale = torch.randn(10)

model = Demo()
print(list(model.named_parameters()))
# [('weight', Parameter containing: tensor(...))]
# Notice: 'scale' is MISSING

The scale tensor exists on the object, but the optimizer never sees it. It won't get gradients. It won't be saved in state_dict(). It won't move when you call model.to('cuda'). It's a ghost.

The rule: If a tensor should be trained by the optimizer, wrap it in nn.Parameter(). If it shouldn't be trained but should move with the model and appear in state_dict, use register_buffer(). If it's neither — it's just a local variable that shouldn't be on the Module at all.
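Here is a rough sketch of the three cases side by side. The class ThreeKinds is hypothetical, and a dtype conversion stands in for a device move so the example also runs on a CPU-only machine (both go through the same .to() machinery).

```python
import torch
import torch.nn as nn


class ThreeKinds(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(4))    # trained
        self.register_buffer("mask", torch.ones(4))   # saved and moved, not trained
        self.ghost = torch.ones(4)                    # neither


model = ThreeKinds()
state_keys = list(model.state_dict().keys())
param_names = [name for name, _ in model.named_parameters()]
print(state_keys)   # ['weight', 'mask']
print(param_names)  # ['weight']

# Parameters and buffers follow the conversion, the plain tensor does not
model.to(torch.float64)
print(model.weight.dtype, model.mask.dtype, model.ghost.dtype)
# torch.float64 torch.float64 torch.float32
```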

Let's trace through what named_parameters() actually does:

model.named_parameters()

Starts the recursive walk

↓

yield from self._parameters

Yields ('weight', tensor) for each parameter on THIS module

↓

for name, child in self._modules

Recurse into each child module

↓

yield (prefix + child_name + '.' + param_name, param)

Prefix the parameter name with the module path

↻ recurse deeper

This is why state_dict() has dot-separated keys like "encoder.layer.0.self_attn.q_proj.weight" — they encode the full path through the module tree.
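The walk above fits in a few lines. Treat the following as an approximation: walk_named_parameters is a made-up helper, and the real named_parameters() additionally skips parameters it has already seen (shared weights).

```python
import torch.nn as nn


def walk_named_parameters(module, prefix=""):
    """Hypothetical re-implementation of the recursive walk."""
    # Parameters owned directly by this module
    for name, param in module._parameters.items():
        if param is not None:
            yield prefix + name, param
    # Recurse into children, extending the dotted path
    for child_name, child in module._modules.items():
        if child is not None:
            yield from walk_named_parameters(child, prefix + child_name + ".")


model = nn.Sequential(
    nn.Linear(8, 4),
    nn.ReLU(),
    nn.Sequential(nn.Linear(4, 2)),
)
ours = [name for name, _ in walk_named_parameters(model)]
theirs = [name for name, _ in model.named_parameters()]
print(ours)  # ['0.weight', '0.bias', '2.0.weight', '2.0.bias']
```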

python
# state_dict keys reflect the module tree structure
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)
print(model.state_dict().keys())
# odict_keys(['0.weight', '0.bias', '2.weight', '2.bias'])
# Notice: ReLU (index 1) has NO parameters — it's stateless

Parameter vs Plain Tensor

Toggle between Parameter and plain tensor assignment. Watch which tensors the optimizer can see and which become ghosts.

You write self.mask = torch.ones(10) in __init__. What happens to self.mask?

It exists on the object but is invisible to parameters(), state_dict(), and .to(device)

It gets registered as a parameter with requires_grad=True

It gets registered as a buffer automatically

Chapter 3: Buffers & State

Parameters are tensors the optimizer trains. But what about tensors that need to live on the model, move to GPU, get saved and loaded — but should not receive gradients? That's what buffers are for.

The classic example is BatchNorm. It has four main tensors (plus a num_batches_tracked counter, which is also a buffer):

Name	Type	Trained?	Purpose
`weight` (γ)	Parameter	Yes	Learned scale
`bias` (β)	Parameter	Yes	Learned shift
`running_mean`	Buffer	No	EMA of batch means (for eval mode)
`running_var`	Buffer	No	EMA of batch variances (for eval mode)

The running statistics are updated during training (via exponential moving average) but NOT by the optimizer — they have no gradients. They must be saved with the model (they're needed at inference) and must move to GPU with everything else.

python
class MyBatchNorm(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # Parameters (trained by optimizer)
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

        # Buffers (NOT trained, but saved + moved with model)
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))
        self.register_buffer('num_batches_tracked', torch.tensor(0, dtype=torch.long))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)
            # Update running stats (no grad!)
            self.running_mean = 0.9 * self.running_mean + 0.1 * mean.detach()
            self.running_var = 0.9 * self.running_var + 0.1 * var.detach()
        else:
            mean = self.running_mean
            var = self.running_var
        x_norm = (x - mean) / torch.sqrt(var + 1e-5)
        return self.weight * x_norm + self.bias

Key insight: register_buffer('name', tensor) does THREE things: (1) stores the tensor in self._buffers, (2) makes it appear in state_dict(), and (3) makes .to(device) move it. It does NOT add it to parameters(). The optimizer never touches buffers.

There's one more option: register_buffer('name', tensor, persistent=False). Non-persistent buffers move with .to() but do NOT appear in state_dict(). Use these for scratch computation buffers that can be recomputed.
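A quick way to compare the two kinds of buffer is a sketch like the one below. The class WithBuffers is hypothetical, and a dtype conversion again stands in for a device move.

```python
import torch
import torch.nn as nn


class WithBuffers(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("running_mean", torch.zeros(3))
        self.register_buffer("scratch", torch.zeros(3), persistent=False)


m = WithBuffers()
buffer_names = [name for name, _ in m.named_buffers()]
saved_keys = list(m.state_dict().keys())
param_count = len(list(m.parameters()))
print(buffer_names)  # ['running_mean', 'scratch']
print(saved_keys)    # ['running_mean']
print(param_count)   # 0

# Both buffers follow .to(), persistent or not
m.to(torch.float64)
print(m.running_mean.dtype, m.scratch.dtype)  # torch.float64 torch.float64
```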

python
# Checking what lives where
bn = nn.BatchNorm1d(64)

print("Parameters:", [n for n, _ in bn.named_parameters()])
# ['weight', 'bias']

print("Buffers:", [n for n, _ in bn.named_buffers()])
# ['running_mean', 'running_var', 'num_batches_tracked']

print("State dict keys:", list(bn.state_dict().keys()))
# ['weight', 'bias', 'running_mean', 'running_var', 'num_batches_tracked']

BatchNorm Internals

Watch how parameters (trained) and buffers (tracked but untrained) behave differently during training. Orange = parameters receiving gradients. Teal = buffers updated by EMA.


A buffer registered with persistent=False will:

Not move with .to(device) and not appear in state_dict

Move with .to(device) but NOT appear in state_dict

Appear in state_dict but not move with .to(device)

Chapter 4: Forward & Backward Hooks

Hooks let you intercept a module's execution without modifying its source code. Think of them as wiretaps on the data flow. You can see exactly what goes into a layer, what comes out, and what gradients flow back.

There are three types of hooks:

Hook Type	Fires When	Sees	Can Modify
`register_forward_pre_hook`	Before forward()	input	input
`register_forward_hook`	After forward()	input + output	output
`register_full_backward_hook`	When gradients w.r.t. the module's inputs have been computed	grad_input + grad_output	grad_input

python
# Forward hook: capture every layer's output
activations = {}

def save_activation(name):
    def hook(module, input, output):
        activations[name] = output.detach()
    return hook

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# Attach hooks to every layer
for name, layer in model.named_modules():
    layer.register_forward_hook(save_activation(name))

# After forward pass, activations dict has every layer's output
x = torch.randn(1, 784)
out = model(x)
print(activations.keys())  # dict_keys(['0', '1', '2', '']); the root Sequential's hook fires last

Use case: debugging NaN gradients. Register a backward hook on every layer. When a NaN appears, you instantly know which layer produced it — no binary search, no print statements scattered through forward().

python
# Backward hook: detect NaN gradients instantly
def nan_detector(name):
    def hook(module, grad_input, grad_output):
        for i, g in enumerate(grad_output):
            if g is not None and torch.isnan(g).any():
                print(f"NaN gradient at {name}, grad_output[{i}]")
    return hook

for name, layer in model.named_modules():
    layer.register_full_backward_hook(nan_detector(name))

Other practical hook uses: per-layer gradient clipping, feature extraction for transfer learning (grab intermediate representations without rewriting the model), and activation statistics logging.
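Hooks can also change data, not just observe it, and every register_* call returns a handle you can use to detach the hook again. The sketch below uses two made-up hooks (double_input and zero_output) on a single Linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(4, 4)
x = torch.randn(2, 4)


def double_input(module, inputs):
    # A pre-hook may return a replacement for the positional inputs
    return (inputs[0] * 2,)


def zero_output(module, inputs, output):
    # A forward hook may return a replacement for the output
    return torch.zeros_like(output)


pre_handle = layer.register_forward_pre_hook(double_input)
doubled = layer(x)            # same as running the layer on 2 * x
pre_handle.remove()

post_handle = layer.register_forward_hook(zero_output)
zeroed = layer(x)             # output replaced by zeros
post_handle.remove()

plain = layer(x)              # no hooks left, normal behavior
```

Removing hooks once you are done matters in practice: a forgotten hook keeps firing on every later forward pass.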

Hooks Firing During Forward Pass

Click "Forward Pass" to watch data flow through layers. Yellow flashes show hooks firing and capturing activations. Purple flashes show backward hooks catching gradients.

You want to modify the output of a layer without changing its source code. Which hook type do you use?

register_forward_pre_hook (can only modify input)

register_forward_hook (sees input+output, can modify output)

register_full_backward_hook (only sees gradients)

Chapter 5: The __call__ Protocol

When you write output = model(x), Python calls model.__call__(x). This is NOT the same as calling model.forward(x). The __call__ method does much more than just run forward — it orchestrates the entire execution pipeline.

Here's what __call__ actually does, in order:

1. Run forward pre-hooks

Each registered pre-hook can inspect or modify the input

↓

2. Call self.forward(input)

YOUR code runs here

↓

3. Run forward hooks

Each registered hook sees input + output, can modify output

↓

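4. Attach module backward hooks (if any)

Wires registered backward hooks onto the tensors so they fire during backward(); the autograd graph itself is built by the tensor operations in step 2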

↓

5. Return output

Back to the caller

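Critical rule: avoid calling model.forward(x) directly. Always call model(x). If you bypass __call__, none of the registered hooks fire. Autograd still tracks the tensor operations inside forward(), so a bare model trains the same way, but anything that depends on hooks (module backward hooks, debugging and logging hooks, wrappers that synchronize gradients) is silently skipped.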

Here's a simplified version of what __call__ looks like internally:

python
# Simplified nn.Module.__call__ (actual source is ~100 lines)
def __call__(self, *args, **kwargs):
    # Step 1: forward pre-hooks
    for hook in self._forward_pre_hooks.values():
        result = hook(self, args)
        if result is not None:
            args = result if isinstance(result, tuple) else (result,)

    # Step 2: actual forward pass
    output = self.forward(*args, **kwargs)

    # Step 3: forward hooks
    for hook in self._forward_hooks.values():
        hook_result = hook(self, args, output)
        if hook_result is not None:
            output = hook_result

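    # Step 4: attach module backward hooks, if any are registered.
    # (Simplified: the real code also wraps the inputs before forward.
    # The autograd graph itself was already built by the ops in step 2.)
    if self._backward_hooks:
        # ... hook the output tensor's grad_fn ...
        pass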

    return output

This explains a common confusion: "Why does my hook fire when I use model(x) but not when I call model.forward(x)?" Now you know. forward() is just step 2. __call__ is the full pipeline.
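You can check this in a few lines. The sketch below attaches two trivial hooks that only record that they ran, then calls the same layer both ways.

```python
import torch
import torch.nn as nn

calls = []
layer = nn.Linear(3, 2)
layer.register_forward_pre_hook(lambda module, inputs: calls.append("pre"))
layer.register_forward_hook(lambda module, inputs, output: calls.append("post"))

x = torch.randn(1, 3)

via_call = layer(x)                 # full pipeline: both hooks fire
calls_after_call = list(calls)      # ['pre', 'post']

via_forward = layer.forward(x)      # step 2 only: no hooks fire
calls_after_forward = list(calls)   # still ['pre', 'post']

# The numbers are identical, and the output carries a grad_fn either way
same_output = torch.equal(via_call, via_forward)
print(calls_after_call, calls_after_forward, same_output)
```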

Practical consequence: Libraries and wrappers such as HuggingFace Transformers, FSDP, and torch.compile hook into or wrap the module call path. If you call .forward() directly, that extra behavior can be skipped without any error. The model might appear to work while the wrapped logic never runs.

__call__ vs .forward() Execution

Compare what happens when you call the model properly vs calling forward directly. Notice which steps get skipped.

What breaks if you call model.forward(x) instead of model(x)?

The forward function throws an error

Only pre-hooks are skipped

All hooks are skipped and module backward hooks don't register

Chapter 6: Module Containers

We saw in Chapter 0 that a plain Python list breaks parameter registration. PyTorch provides three container classes that properly register their contents:

Container	Use When	Access Pattern
`nn.Sequential`	Layers run one after another	`model(x)` runs all layers in order
`nn.ModuleList`	You need index access or iteration	`self.layers[i]` or `for l in self.layers`
`nn.ModuleDict`	You need string-key access	`self.heads['classifier']`

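The key difference from plain Python containers: nn.ModuleList is itself an nn.Module. When you assign one to self.layers, it is registered in self._modules under the name "layers", and every Module inside it is registered in the ModuleList's own _modules. The recursive parameters() walk finds them all.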
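To see where a container's contents are registered, you can look at the internal dicts directly. This is just an inspection sketch with a made-up class name (Stack).

```python
import torch.nn as nn


class Stack(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(2)])


stack = Stack()
top_level = list(stack._modules.keys())       # the container itself
nested = list(stack.layers._modules.keys())   # the layers inside the container
names = [name for name, _ in stack.named_parameters()]

print(top_level)  # ['layers']
print(nested)     # ['0', '1']
print(names)
# ['layers.0.weight', 'layers.0.bias', 'layers.1.weight', 'layers.1.bias']
```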

python
# BROKEN: plain Python list
class BadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(10, 10) for _ in range(3)]

bad = BadModel()
print(sum(p.numel() for p in bad.parameters()))  # 0 !!!

# CORRECT: nn.ModuleList
class GoodModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

good = GoodModel()
print(sum(p.numel() for p in good.parameters()))  # 330

When to use which: Use Sequential when your forward pass is literally "pass x through each layer in order" — you don't even need to write a forward() method. Use ModuleList when you need custom logic (skip connections, branching). Use ModuleDict when you select sub-modules by name (multi-task heads).

python
# Sequential: forward() is automatic
mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
x = torch.randn(1, 784)  # example input
out = mlp(x)  # Runs all 3 in order, no forward() needed

# ModuleList: custom forward logic
class ResBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(3)])

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))  # residual connection
        return x

# ModuleDict: select by name
class MultiTask(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(128, 64)
        self.heads = nn.ModuleDict({
            'classify': nn.Linear(64, 10),
            'regress': nn.Linear(64, 1),
        })

    def forward(self, x, task):
        features = torch.relu(self.backbone(x))
        return self.heads[task](features)

Container Registration Comparison

See how parameter counts differ between plain list, ModuleList, and Sequential. The number visible to the optimizer is what matters for training.

You have 5 Linear layers with skip connections. Which container should you use?

nn.Sequential (runs layers in order automatically)

nn.ModuleList (need custom forward logic for skip connections)

nn.ModuleDict (need string key access)

Chapter 7: Model Inspector

Time to put everything together. This interactive inspector lets you explore real model architectures — see the module tree, parameter counts, data shapes flowing through, and where hooks would attach.

Select a model below. Click any node in the tree to inspect its parameters, buffers, and internal state. Click "Forward Pass" to watch tensor shapes propagate through the network.

Interactive Model Inspector

Select a model architecture, then explore its internals. Click layers to inspect. Run forward to see data flow.

Try this: Select "Transformer Block" and click "Forward Pass." Watch how the input tensor (batch=2, seq=8, dim=64) flows through self-attention, add+norm, FFN, and add+norm again. Notice how shapes stay the same through residual connections but change inside the FFN.

Chapter 8: JIT & TorchScript

PyTorch models are Python objects. This is great for debugging (set breakpoints, print shapes, use Python control flow) but terrible for deployment. You can't run a Python object on a phone, in a C++ server, or in a browser. You need to export the model to a format that doesn't need Python.

TorchScript is PyTorch's solution: a statically-typed subset of Python that can be compiled and run without the Python interpreter. There are two ways to convert:

Method	How it works	Strengths	Weaknesses
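`torch.jit.trace`	Runs your model ONCE on an example input, records the tensor operations	Works regardless of which Python constructs the code uses	Control flow (if/for) is frozen to the path taken during tracing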
`torch.jit.script`	Parses your Python source, compiles it	Handles control flow	Only supports a subset of Python

python
# Tracing: works for simple models
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
example_input = torch.randn(1, 10)
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")  # Can load in C++, no Python needed

But tracing has a fatal flaw. It only records ONE execution path. If your model has an if statement, tracing will only capture the branch that ran during tracing:

python
# This model FAILS to trace correctly
class ConditionalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)
        self.fc_big = nn.Linear(10, 20)

    def forward(self, x):
        if x.shape[0] > 1:  # batch size check
            return self.fc_big(x)
        else:
            return self.fc(x)

model = ConditionalModel()
# Trace with batch=1 → only captures the 'else' branch!
traced = torch.jit.trace(model, torch.randn(1, 10))
# Now traced(torch.randn(4, 10)) STILL uses fc, not fc_big!
# The if-statement was baked out during tracing.

The fix: Use torch.jit.script for models with control flow. Scripting parses the actual Python source and compiles the if-statement into the TorchScript IR. Both branches are preserved.

python
# Scripting handles control flow correctly
scripted = torch.jit.script(model)
# Both branches are compiled into the IR
print(scripted.graph)  # Shows prim::If node

# Works correctly for both batch sizes:
scripted(torch.randn(1, 10)).shape   # torch.Size([1, 5])
scripted(torch.randn(4, 10)).shape   # torch.Size([4, 20])

The tradeoff: scripting is stricter. It only supports a subset of Python (no arbitrary objects, limited list comprehensions, all variables must have inferrable types). Complex Python code often needs refactoring to be scriptable.
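The difference between the two methods can be checked end to end. The sketch below repeats the ConditionalModel from earlier in this chapter so that it runs on its own, and it assumes a PyTorch build where torch.jit is available. The warnings filter only silences the warning tracing emits about the Python-level branch.

```python
import warnings

import torch
import torch.nn as nn


class ConditionalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)
        self.fc_big = nn.Linear(10, 20)

    def forward(self, x):
        if x.shape[0] > 1:  # batch size check
            return self.fc_big(x)
        else:
            return self.fc(x)


model = ConditionalModel()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Traced with batch=1, so only the 'else' branch is recorded
    traced = torch.jit.trace(model, torch.randn(1, 10))
    # Scripted from source, so both branches survive
    scripted = torch.jit.script(model)

batch = torch.randn(4, 10)
traced_shape = tuple(traced(batch).shape)      # (4, 5): wrong branch
scripted_shape = tuple(scripted(batch).shape)  # (4, 20): correct branch
print(traced_shape, scripted_shape)
```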

python
# Modern alternative: torch.export (PyTorch 2.1+)
# Captures the whole model as a single graph (an ExportedProgram)
from torch.export import export

exported = export(model, (torch.randn(1, 10),))
# Input shapes are treated as static unless dynamic_shapes is passed,
# so the batch-size branch above is still specialized to this example.
# Data-dependent branching needs explicit operators such as torch.cond.

Trace vs Script: Control Flow

Watch how tracing captures only one path while scripting preserves both branches. The red path shows the branch that tracing missed.

Your model uses an if-statement that checks input batch size. You trace it with batch_size=1. What happens when you run the traced model with batch_size=8?

It runs the batch_size=1 branch regardless, giving wrong output

It throws a runtime error about mismatched shapes

It automatically retraces with the new batch size

Chapter 9: Mastery & Connections

You now understand the full lifecycle of a PyTorch model: construction (parameter registration via __setattr__), execution (the __call__ protocol with hooks), state management (parameters + buffers + state_dict), and export (JIT trace vs script). Let's consolidate with patterns you'll use daily.

Custom Module Template

python
class MyCustomLayer(nn.Module):
    def __init__(self, in_dim, out_dim, use_bias=True):
        super().__init__()
        # Parameters: trained by optimizer
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        if use_bias:
            self.bias = nn.Parameter(torch.zeros(out_dim))
        else:
            self.register_parameter('bias', None)  # explicit None

        # Buffers: saved + moved, NOT trained
        self.register_buffer('running_count', torch.tensor(0))

        # Sub-modules: use containers, not lists
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        # shape: x is (batch, in_dim)
        out = x @ self.weight.T  # (batch, out_dim)
        if self.bias is not None:
            out = out + self.bias
        out = self.norm(out)
        self.running_count += 1  # buffer update (no grad)
        return out
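The explicit register_parameter('bias', None) in the template is easy to overlook, so here is a trimmed-down sketch of what it buys you. The class NoBiasDemo is a hypothetical reduction of the template, without the LayerNorm.

```python
import torch
import torch.nn as nn


class NoBiasDemo(nn.Module):
    def __init__(self, in_dim, out_dim, use_bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        if use_bias:
            self.bias = nn.Parameter(torch.zeros(out_dim))
        else:
            # The attribute exists (as None), so forward() can test it safely
            self.register_parameter("bias", None)
        self.register_buffer("running_count", torch.tensor(0))

    def forward(self, x):
        out = x @ self.weight.T
        if self.bias is not None:
            out = out + self.bias
        self.running_count += 1  # in-place buffer update
        return out


layer = NoBiasDemo(4, 2, use_bias=False)
layer(torch.randn(3, 4))
layer(torch.randn(3, 4))

param_names = [name for name, _ in layer.named_parameters()]
state_keys = list(layer.state_dict().keys())
print(layer.bias, param_names, state_keys, int(layer.running_count))
# None ['weight'] ['weight', 'running_count'] 2
```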

When to Use Each Container

Situation	Container	Example
Layers in fixed order, no skip connections	`nn.Sequential`	MLP, encoder stack
Layers with custom iteration logic	`nn.ModuleList`	ResNet blocks, U-Net skip
Layers selected by name at runtime	`nn.ModuleDict`	Multi-task heads, routing
Single sub-module	Direct attribute	`self.norm = LayerNorm(d)`

Hook Recipes

python
# Recipe 1: Per-layer gradient clipping
def clip_grad_hook(module, grad_input, grad_output):
    return tuple(
        g.clamp(-1.0, 1.0) if g is not None else g
        for g in grad_input
    )

# Recipe 2: Feature extraction (transfer learning)
features = {}
def extract(name):
    def hook(m, inp, out):
        features[name] = out.detach()
    return hook
# Assumes a model with a `layer3` attribute (e.g. a torchvision ResNet)
model.layer3.register_forward_hook(extract('layer3'))

# Recipe 3: Structured pruning (zero out channels)
# Attach to a Conv2d layer with register_forward_pre_hook
def prune_hook(module, input):
    # Zero out bottom 20% of channels by magnitude
    w = module.weight.data
    norms = w.norm(dim=(1,2,3))  # per-filter norm
    threshold = norms.quantile(0.2)
    mask = norms > threshold
    module.weight.data *= mask.view(-1, 1, 1, 1)
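Recipe 1 needs to be attached with register_full_backward_hook to take effect. The sketch below does that on a single Linear layer whose weights are set to a large constant on purpose, so the unclipped gradient would be 30 everywhere. Note what is being clamped: the gradient flowing back to the layer's input, not the gradients stored on the weights.

```python
import torch
import torch.nn as nn


def clip_grad_hook(module, grad_input, grad_output):
    # Replace the gradient w.r.t. the module's inputs with a clamped copy
    return tuple(
        g.clamp(-1.0, 1.0) if g is not None else g
        for g in grad_input
    )


layer = nn.Linear(4, 3)
with torch.no_grad():
    layer.weight.fill_(10.0)  # three outputs times weight 10 gives gradient 30

x = torch.randn(2, 4, requires_grad=True)
layer.register_full_backward_hook(clip_grad_hook)
layer(x).sum().backward()

print(x.grad)  # every entry is clamped to 1.0
```

To clip the parameter gradients themselves, the usual tool is torch.nn.utils.clip_grad_norm_ (or clip_grad_value_) after backward().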

Connections

Topic	Relationship
Autograd	The computational graph is built by the tensor operations that run inside forward(); `__call__` only adds hooks around it
Training loops	The optimizer reads `parameters()` to know what to update
Distributed training	FSDP/DDP wraps modules and uses hooks for gradient sync
torch.compile	Modern alternative to JIT: captures graphs without TorchScript limitations
ONNX export	The TorchScript-based `torch.onnx.export` path traces the model by default, so the same control-flow limitations apply

The mental model: nn.Module is a tree. Parameters are leaves. Hooks are sensors on the branches. __call__ is the signal flowing root-to-leaves. state_dict() is a snapshot of all leaf values. JIT/export freezes the tree structure into a portable format. Master the tree, master PyTorch.

You're building a model with 3 attention heads that you select by name at inference ("fast", "accurate", "balanced"). Which container?

nn.Sequential

nn.ModuleList

nn.ModuleDict