Introduction
Richard Feynman wrote on his blackboard: "What I cannot create, I do not understand." This article takes that literally. We will build the entire visual pipeline of a modern VLM from scratch — starting from the definition of a pixel and ending with the tensor representations that feed into a transformer. Every operation will be derived, every dimension tracked, every design choice explained.
The history of computer vision is a history of representations. For decades, researchers hand-crafted features: SIFT (Lowe, 1999) detected scale-invariant keypoints. HOG (Dalal & Triggs, 2005) counted gradient orientations in local cells. These features were brilliant engineering, but they hit a ceiling because a human designer had to decide what to extract. The deep learning revolution, beginning with AlexNet (Krizhevsky et al., 2012), replaced hand-crafted features with learned features: stacks of convolutional layers that discover what to extract directly from data.
Every modern VLM — GPT-4V, Claude's vision, Gemini, LLaVA, Qwen-VL — uses a vision encoder descended from this tradition. Whether the encoder is a CNN (like EfficientNet in RT-1) or a Vision Transformer (like ViT in CLIP), it is fundamentally doing the same thing: converting a grid of pixel values into a sequence of high-dimensional feature vectors that capture the visual content of the image. Understanding how this conversion works — mechanically, mathematically, intuitively — is the foundation for everything else in this series.
We build from absolute first principles. What a pixel is, physically and numerically. How color spaces work. The convolution operation derived step by step with full index arithmetic. Why stacking convolutions creates feature hierarchies. What receptive fields are and how to compute them. Pooling, residual connections, feature pyramids, and normalization layers. Code to implement everything from scratch in PyTorch and numpy. By the end, you will be able to build a CNN from scratch and understand exactly what every layer does to every pixel.
What Is an Image?
To a camera sensor, an image is a collection of photon counts. Each photosite (pixel) on the sensor accumulates photons during exposure, converts the charge to a voltage, and digitizes it to an integer. A typical sensor has millions of photosites arranged in a rectangular grid.
To a computer, the result is a 2D array of integers. But a single array gives only luminance (brightness). Color requires multiple measurements at different wavelengths.
Pixels and channels
Most digital cameras use a Bayer filter — a mosaic of red, green, and blue filters placed over the sensor, so that each pixel records only one color. A demosaicing algorithm interpolates the missing colors, producing three values per pixel: red (R), green (G), and blue (B) intensity. These three values are the channels of the image.
A standard 8-bit RGB image has values in [0, 255] per channel per pixel. With 3 channels, each pixel is a triplet (R, G, B). A 224×224 image has 224 × 224 = 50,176 pixels × 3 channels = 150,528 values. This is already a very high-dimensional space — far more dimensions than most tabular datasets. Yet to a human, it's just a thumbnail.
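The arithmetic above can be checked directly in numpy:

```python
import numpy as np

# A 224x224 RGB image in the (H, W, C) uint8 convention used by PIL/numpy
img = np.zeros((224, 224, 3), dtype=np.uint8)

print(img.shape)   # (224, 224, 3)
print(224 * 224)   # 50,176 pixels
print(img.size)    # 150,528 values total
```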
Red Channel
Intensity of long-wavelength light (~620–750nm). High in fire, skin, brick. Low in sky, water, foliage.
Green Channel
Intensity of medium-wavelength light (~495–570nm). Dominates in vegetation. Human vision is most sensitive to this band.
Blue Channel
Intensity of short-wavelength light (~450–495nm). High in sky, water, shadows. Sensor noise is typically highest in this channel.
Color spaces
RGB is not the only way to decompose color. Several alternative color spaces are useful in different contexts:
- HSV (Hue, Saturation, Value): Separates chromatic content (hue) from intensity (value). Useful for color-based segmentation because hue is invariant to illumination changes. Hue is an angle [0°, 360°); saturation and value are in [0, 1].
- YCbCr: Separates luminance (Y) from chrominance (Cb, Cr). Used in JPEG compression because human vision is more sensitive to luminance than chrominance, allowing heavier compression of Cb/Cr channels. Also the native format of many video codecs.
- LAB (CIE L*a*b*): Perceptually uniform: equal numerical distances correspond to approximately equal perceived differences. L* is lightness; a* and b* are color opponents (red-green and yellow-blue). Useful for color difference metrics.
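A minimal sketch of the RGB/HSV relationship, using Python's standard-library `colorsys` (the sky-blue pixel values are illustrative):

```python
import colorsys

# RGB scaled to [0, 1]: an illustrative sky-blue pixel
r, g, b = 70 / 255, 130 / 255, 180 / 255

# HSV separates chromatic content (hue) from intensity (value)
h, s, v = colorsys.rgb_to_hsv(r, g, b)
print(f"hue={h * 360:.0f} deg, sat={s:.2f}, val={v:.2f}")  # hue ~207 deg (blue)

# The conversion is invertible: round-trip back to RGB
r2, g2, b2 = colorsys.hsv_to_rgb(h, s, v)
print(abs(r - r2) < 1e-9)  # True
```

Note that `colorsys` represents hue in [0, 1) rather than degrees; multiply by 360 to recover the angle.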
For deep learning, RGB is the standard. All major pretrained models (ResNet, ViT, CLIP, DINOv2) expect RGB input. The standard preprocessing normalizes pixel values from [0, 255] integers to [0, 1] floats, then applies channel-wise normalization with the ImageNet mean ([0.485, 0.456, 0.406]) and standard deviation ([0.229, 0.224, 0.225]) to produce zero-centered, unit-variance inputs. This normalization is not arbitrary — it matches the statistics of the pretraining data, and using different normalization with a pretrained model will degrade performance.
The image as a tensor
In PyTorch, an image is a 3D tensor of shape (C, H, W): channels first, then height
(rows), then width (columns). A batch of B images is a 4D tensor (B, C, H, W). This
"channels first" layout is PyTorch's default and differs from TensorFlow's default
(B, H, W, C) and numpy/PIL's (H, W, C).
The ordering matters for computation: convolution kernels are applied along the (H, W) dimensions and sum across C. Getting the dimension ordering wrong is one of the most common bugs in vision code.
Batch tensor: X ∈ ℝ^(B×C×H×W)
For a 224×224 RGB image: x ∈ ℝ^(3×224×224)
Why does dimension ordering matter? Modern hardware (GPUs, TPUs) is optimized for specific memory
access patterns. Channels-first layout allows contiguous memory access for convolution operations,
since each kernel computes a dot product across all channels at a spatial location. This is why
PyTorch defaults to NCHW format, though the optional channels_last memory format
(NHWC) can be faster on some GPUs due to tensor core requirements.
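A short sketch of the layout conversion, including the classic bug of reshaping instead of permuting:

```python
import numpy as np
import torch

# numpy/PIL convention: (H, W, C)
img_hwc = np.random.rand(224, 224, 3).astype(np.float32)

# PyTorch convention: (C, H, W) -- permute the axes, don't reshape
img_chw = torch.from_numpy(img_hwc).permute(2, 0, 1)
print(img_chw.shape)  # torch.Size([3, 224, 224])

# reshape produces the same SHAPE but silently scrambles the pixels:
wrong = torch.from_numpy(img_hwc).reshape(3, 224, 224)
print(torch.equal(img_chw, wrong))  # False -- a classic silent bug

# channels_last: logical shape stays NCHW, only the memory strides change
x = img_chw.unsqueeze(0).contiguous(memory_format=torch.channels_last)
print(x.shape)  # torch.Size([1, 3, 224, 224])
```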
The Convolution Operation
The convolution is the fundamental operation of all CNN-based vision models. Despite the name, what neural networks compute is technically cross-correlation (no kernel flipping), but the community universally calls it "convolution." Let's derive it from scratch.
Mechanics: what a convolution does
A convolution slides a small matrix (the kernel or filter) across the input image, computing a dot product at each position. For a single-channel input and a K×K kernel, the output at position (i, j) is:
y[i, j] = Σ_{u=0}^{K−1} Σ_{v=0}^{K−1} w[u, v] · x[i+u, j+v] + b
where w is the kernel weight matrix, x is the input, and b is a scalar bias. The output y is called a feature map. Each position in the feature map summarizes a local region of the input through the kernel's weighted sum.
For a multi-channel input (C_in channels), the kernel becomes 3D: w ∈ ℝ^(C_in×K×K). The convolution sums across all input channels:
y[i, j] = Σ_{c} Σ_{u=0}^{K−1} Σ_{v=0}^{K−1} w[c, u, v] · x[c, i+u, j+v] + b
One kernel produces one feature map. To produce C_out feature maps, we use C_out kernels, each of shape (C_in, K, K). The full weight tensor is W ∈ ℝ^(C_out×C_in×K×K).
Parameter count: A Conv2d layer with C_in=64, C_out=128, K=3 has 128 × 64 × 3 × 3 + 128 = 73,856 parameters. This is vastly fewer than a fully-connected layer connecting the same input to the same output: mapping a 64×56×56 input (200,704 values) to a 128×56×56 output (401,408 values) would need roughly 80 billion weights. The convolution's weight sharing (same kernel applied at every position) and locality (each output depends only on a small neighborhood) are what make it feasible.
Let's make this concrete. Here is a 3×3 edge-detection kernel and what it does:
The horizontal edge detector subtracts the top row from the bottom row. Where pixel values change sharply from top to bottom (a horizontal edge), the output is large. Where values are uniform, the output is zero. The Gaussian blur kernel is a weighted average that smooths the image. In a learned CNN, kernels are initialized randomly and updated by gradient descent to extract whatever features minimize the training loss.
Padding and stride
Without padding, a K×K convolution on an H×W input produces an (H-K+1)×(W-K+1) output. Each layer shrinks the feature map. Padding adds zeros (or reflects values) around the border so the output has the same spatial dimensions as the input. For a 3×3 kernel, padding=1 preserves the size: (H-3+2+1) = H.
H_out = ⌊(H − K + 2P) / S⌋ + 1, where H = input height, K = kernel size, P = padding, S = stride
Stride controls how far the kernel moves between positions. Stride=1 means the kernel moves one pixel at a time (the default). Stride=2 means it moves two pixels, halving the output dimensions. Stride-2 convolutions are the primary mechanism for spatial downsampling in modern CNNs — they reduce resolution while increasing the number of channels, trading spatial detail for semantic richness.
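The output-size formula is easy to make runnable; a minimal helper and a few familiar configurations:

```python
def conv_out_size(h: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output spatial size: floor((H - K + 2P) / S) + 1."""
    return (h - k + 2 * p) // s + 1

print(conv_out_size(224, 3))             # 222: no padding shrinks the map
print(conv_out_size(224, 3, p=1))        # 224: padding=1 preserves size
print(conv_out_size(224, 3, p=1, s=2))   # 112: stride-2 halves it
print(conv_out_size(224, 7, p=3, s=2))   # 112: ResNet's 7x7 stem conv
```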
What learned kernels look like
The first convolutional layer of a trained CNN learns kernels that resemble classical image processing filters: edge detectors at various orientations, color-opponent filters (red vs. green, blue vs. yellow), and frequency-selective filters (Gabor-like patterns). This was first clearly demonstrated by Krizhevsky et al. (2012) when they visualized AlexNet's first layer.
Deeper layers learn increasingly abstract patterns. Zeiler and Fergus (2014) showed this definitively by using deconvolutional visualization: layer 2 learns corners, textures, and simple shapes; layer 3 learns object parts (wheels, eyes, windows); layers 4–5 learn object-level representations (faces, buildings, animals). This hierarchical emergence of features — from edges to textures to parts to objects — is the core insight of deep convolutional networks.
AlexNet (2012) used 11×11 and 5×5 kernels. VGGNet (Simonyan & Zisserman, 2014) showed that stacking two 3×3 convolutions gives the same receptive field as one 5×5 (and stacking three 3×3 gives 7×7) but with fewer parameters and more nonlinearities. Two 3×3 layers: 2 × 9C² = 18C² parameters. One 5×5 layer: 25C² parameters. Same receptive field, 28% fewer parameters, and an extra ReLU between the layers adds representational capacity. Since VGGNet, nearly all CNN architectures use 3×3 kernels exclusively (with occasional 1×1 kernels for channel mixing).
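The VGG parameter comparison, spelled out (C = 64 is an illustrative channel width):

```python
C = 64  # channel width, same in and out for a fair comparison

two_3x3 = 2 * (3 * 3 * C * C)   # two stacked 3x3 conv layers: 18C^2
one_5x5 = 5 * 5 * C * C         # one 5x5 conv layer: 25C^2

print(two_3x3, one_5x5)                      # 73728 102400
print(f"{1 - two_3x3 / one_5x5:.0%} fewer")  # 28% fewer
```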
Feature Hierarchies
A single convolution layer detects simple local patterns (edges, colors). By stacking many layers, each operating on the output of the previous, the network builds a hierarchy of increasingly abstract features. This is the core computational principle of deep learning for vision.
What each layer learns
The progression is remarkably consistent across architectures and training objectives:
| Depth | Features Detected | Receptive Field | Analogy |
|---|---|---|---|
| Layer 1 | Oriented edges, color gradients, brightness contrasts | 3–7 pixels | Retinal ganglion cells (V1 simple cells) |
| Layers 2–3 | Corners, T-junctions, textures, simple shapes | 16–40 pixels | V1 complex cells, V2 |
| Layers 4–6 | Object parts: wheels, eyes, handles, windows | 50–100 pixels | V4, IT cortex |
| Layers 7+ | Whole objects, scene categories, semantic concepts | Full image | Higher visual cortex |
This hierarchy is not designed — it emerges from training. The network discovers that edges are useful building blocks for corners, corners for parts, and parts for objects. The loss function (e.g., cross-entropy for classification) only provides a signal at the final layer. All intermediate representations are learned end-to-end through backpropagation.
A deep insight: this learned hierarchy mirrors the visual processing pathway in the primate brain. The visual cortex processes information through areas V1 → V2 → V4 → IT with increasing receptive fields and abstraction levels. This parallel was first noted by Yamins et al. (2014), who showed that deep CNN representations predict neural responses in monkey IT cortex better than any previous computational model. The brain and the CNN independently converge on similar representational strategies — strong evidence that this hierarchy is not just a design choice but reflects fundamental structure in the statistics of natural images.
Visualizing learned features
Understanding what each layer represents is essential for debugging and interpreting CNN behavior. Several visualization techniques exist:
- Filter visualization: Display the weight values of the learned kernels directly. Works well for the first layer (3-channel kernels interpretable as colored patterns) but becomes opaque for deeper layers (64-channel kernels are not human-interpretable as images).
- Activation maximization: Generate a synthetic input that maximizes the activation of a specific neuron or channel. Uses gradient ascent on the input image (with regularization to keep images natural-looking). Produces "dream-like" images that reveal what each neuron is looking for.
- Grad-CAM (Selvaraju et al., 2017): Uses the gradient of the output class score with respect to the feature maps of a convolutional layer to produce a coarse localization map showing which spatial regions contributed most to the prediction. This is the standard method for model interpretability in production.
Receptive Fields
The receptive field of a neuron is the region of the input image that can influence its value. For a single 3×3 convolution, the receptive field is 3×3 pixels. For two stacked 3×3 convolutions, it's 5×5. For three, it's 7×7. Each additional layer expands the receptive field by K-1 pixels (for a K×K kernel with stride 1).
Computing receptive fields
For a stack of L convolution layers, each with kernel size K_l and stride S_l, the receptive field of a neuron in layer L is:
RF_L = 1 + Σ_{l=1}^{L} (K_l − 1) · Π_{i=1}^{l−1} S_i
For a simple stack of 3×3 convolutions with stride 1, this reduces to:
RF_L = 1 + 2L
So 5 layers of 3×3 conv → RF = 11×11. This grows linearly with depth. To achieve a large receptive field efficiently, architectures use stride-2 convolutions or pooling layers that double the effective growth rate. A typical ResNet-50 has a theoretical receptive field of 483×483 pixels — larger than the standard 224×224 input, meaning every output neuron can theoretically "see" the entire image.
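The formula above can be implemented in a few lines; a sketch, where each layer is a (kernel, stride) pair:

```python
def receptive_field(layers):
    """RF = 1 + sum_l (K_l - 1) * prod_{i<l} S_i, for (kernel, stride) pairs."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer adds (K-1) scaled by cumulative stride
        jump *= s
    return rf

# Five 3x3 stride-1 convolutions: RF grows by 2 per layer
print(receptive_field([(3, 1)] * 5))  # 11

# A stride-2 layer doubles the growth rate of every layer after it
print(receptive_field([(3, 1), (3, 2), (3, 1), (3, 1)]))  # 13 (vs 9 all-stride-1)
```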
Effective receptive field
Luo et al. (2016) showed that the effective receptive field is much smaller than the theoretical one. Center pixels contribute far more to the output than peripheral pixels, following a roughly Gaussian distribution. In practice, only about 30–50% of the theoretical receptive field is actually used. This means that even in deep networks, each neuron is primarily influenced by a local neighborhood — a property that has implications for spatial resolution in feature maps and for the design of architectures that need to capture long-range dependencies (motivating the move to attention mechanisms and Vision Transformers).
If a CNN's effective receptive field is ~60 pixels, then objects smaller than ~60 pixels in the image are below the resolution that the network's deepest features can meaningfully represent. This is why higher input resolution improves performance on fine-grained tasks: more pixels per object means the object occupies a larger fraction of each neuron's receptive field. It also explains why VLMs benefit from higher resolution: a ViT-L/14 at 224px has each patch covering 14×14 pixels (very coarse for document text), while at 336px the same patch covers the same 14×14 pixels but the image captures 2.25× more visual content.
Pooling & Downsampling
Pooling reduces the spatial dimensions of feature maps, serving two purposes: reducing computation (fewer values to process in subsequent layers) and introducing a degree of translation invariance (small shifts in the input don't change the pooled output).
Max pooling takes the maximum value in each local window (typically 2×2 with stride 2, halving both dimensions). It preserves the strongest activation in each region, acting as a "was this feature present anywhere in this region?" detector. Max pooling was the dominant downsampling method in early CNN architectures (AlexNet, VGGNet).
Average pooling computes the mean instead of the max. It is smoother and preserves more information about the average activity in a region, but discards the precise location of features. Global Average Pooling (GAP) — averaging over the entire spatial extent to produce a single value per channel — is the standard method for converting spatial feature maps to classification vectors in modern CNNs. It was introduced by Lin et al. (2013) and replaced the fully connected layers that AlexNet used.
Stride-2 convolution has largely replaced explicit pooling in modern architectures. Instead of convolving with stride 1 and then max-pooling, the network convolves with stride 2 directly. This is more parameter-efficient (one operation instead of two) and gives the network the ability to learn how to downsample rather than using a fixed max/average rule. ResNet, EfficientNet, and most modern CNNs use stride-2 convolutions for spatial reduction.
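The three downsampling options side by side, shape-checked in PyTorch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# 2x2 max pooling, stride 2: halves spatial dims, keeps channels
print(nn.MaxPool2d(2)(x).shape)                              # (1, 64, 28, 28)

# Stride-2 convolution: learned downsampling, typically doubling channels
print(nn.Conv2d(64, 128, 3, stride=2, padding=1)(x).shape)   # (1, 128, 28, 28)

# Global Average Pooling: spatial map -> one value per channel
print(nn.AdaptiveAvgPool2d(1)(x).shape)                      # (1, 64, 1, 1)
```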
Residual Networks
Training very deep CNNs (20+ layers) hit a surprising problem: deeper networks performed worse than shallower ones, even on the training set. This wasn't overfitting — training loss was higher, not just test loss. The problem was degradation: optimization struggled to learn identity mappings through many nonlinear layers.
Skip connections: the key insight
He et al. (2016) proposed a simple but transformative solution: instead of learning a function H(x) directly, learn the residual F(x) = H(x) - x. The output becomes:
y = F(x) + x
where F(x) is computed by a stack of conv-BN-ReLU layers and x is passed through a skip connection (identity shortcut) that bypasses those layers. If the optimal transformation is close to identity (which it often is in deep networks), learning a small residual F(x) ≈ 0 is much easier than learning H(x) ≈ x from scratch.
The skip connection has a second, equally important effect: it provides a direct gradient pathway from the loss to earlier layers. Without skip connections, gradients must flow through every intermediate layer, shrinking multiplicatively at each step (the vanishing gradient problem). With skip connections, gradients can flow directly through the identity shortcut, maintaining signal strength across many layers. This is why ResNets can be trained with 100+ layers while plain networks fail at 20+.
The ResNet architecture
ResNet is organized into stages, with each stage operating at a different spatial resolution:
| Stage | Output Size | Channels | Blocks (ResNet-50) | Downsampling |
|---|---|---|---|---|
| Stem | 56×56 | 64 | Conv 7×7, stride 2, MaxPool | 4× |
| Stage 1 | 56×56 | 256 | 3 bottleneck blocks | None |
| Stage 2 | 28×28 | 512 | 4 bottleneck blocks | 2× (stride-2 conv) |
| Stage 3 | 14×14 | 1024 | 6 bottleneck blocks | 2× |
| Stage 4 | 7×7 | 2048 | 3 bottleneck blocks | 2× |
| Head | 1×1 | 2048 | Global Average Pool | 7× |
A bottleneck block uses three convolutions: 1×1 (reduce channels), 3×3 (spatial processing at reduced channels), 1×1 (restore channels). This is cheaper than two 3×3 convolutions at full channel width. For ResNet-50 with bottleneck blocks, the total parameter count is ~25.6M — remarkably efficient for the accuracy it achieves.
ResNet was published in 2015 and won the ImageNet competition with 3.57% top-5 error (surpassing
human-level performance at ~5.1%). A decade later, its core idea — residual connections
— is universal. Every transformer layer uses a residual connection (y = Attention(x) + x).
Every modern VLM backbone is built on residual connections. The insight that networks should
learn deviations from identity, not the full transformation, is one of the most important ideas
in deep learning. When you see x + self.attn(self.norm(x)) in a transformer, you
are seeing ResNet's legacy.
Feature Pyramids
A CNN naturally produces features at multiple scales: early layers have high spatial resolution (many pixels, fine detail) and late layers have low spatial resolution (few pixels, coarse semantics). Object detection and segmentation require both — fine localization and strong semantics. The Feature Pyramid Network (FPN, Lin et al., 2017) combines features from all stages into a multi-scale feature representation.
FPN works by adding a top-down pathway: starting from the deepest (most semantic) features, it upsamples by 2× using nearest-neighbor interpolation and adds the result to the corresponding higher-resolution features from the bottom-up pathway. A 1×1 convolution adjusts the channel counts to match. The result is a set of feature maps at every resolution, each containing both fine spatial detail and strong semantic content.
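A minimal sketch of the top-down pathway over three backbone stages (the channel counts follow ResNet-50's stages 2–4; the FPN width of 256 is the value used in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Bottom-up features: resolution decreases, semantics strengthen
c3 = torch.randn(1, 512, 28, 28)
c4 = torch.randn(1, 1024, 14, 14)
c5 = torch.randn(1, 2048, 7, 7)

d = 256  # common FPN channel width
lat3, lat4, lat5 = (nn.Conv2d(c, d, 1) for c in (512, 1024, 2048))

# Top-down: 1x1 lateral conv matches channels, nearest-neighbor 2x upsample, add
p5 = lat5(c5)
p4 = lat4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
p3 = lat3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")

for p in (p3, p4, p5):
    print(p.shape)  # 256 channels at 28/14/7 resolution
```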
For VLMs, the FPN concept manifests in a different form: multi-layer feature extraction. Instead of combining features spatially (as FPN does), some VLM architectures extract features from multiple transformer layers and concatenate them, getting both spatial detail (from early layers) and semantic abstraction (from late layers). This is conceptually the same principle as FPN applied to the depth dimension rather than the spatial dimension.
Normalization Layers
Training deep networks is unstable without normalization. As activations flow through many layers, their distribution shifts, requiring careful learning rate tuning. Normalization layers standardize activations to have zero mean and unit variance, stabilizing training and enabling larger learning rates.
Batch Normalization
BatchNorm (Ioffe & Szegedy, 2015) normalizes across the batch dimension for each channel independently:
x̂_{b,c,h,w} = (x_{b,c,h,w} − μ_c) / √(σ_c² + ε),  y_c = γ_c · x̂_c + β_c
where μ_c = (1/BHW) Σ_{b,h,w} x_{b,c,h,w} and σ_c² = (1/BHW) Σ_{b,h,w} (x_{b,c,h,w} − μ_c)²
The learnable parameters γ (scale) and β (shift) allow the network to recover any mean and variance it needs. BatchNorm depends on the batch dimension, which creates problems: it behaves differently during training (batch statistics) and inference (running statistics), and performance degrades with small batch sizes.
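The per-channel statistics can be verified against PyTorch's implementation (affine disabled so γ and β don't enter):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 4, 4)  # (B, C, H, W)

bn = nn.BatchNorm2d(3, affine=False)
bn.train()  # training mode: use batch statistics, not running averages
y = bn(x)

# Manual computation: mean/variance per channel, over (B, H, W)
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
y_manual = (x - mu) / torch.sqrt(var + bn.eps)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```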
Layer Normalization
LayerNorm (Ba et al., 2016) normalizes across the feature dimensions for each sample independently:
y = γ ⊙ (x − μ) / √(σ² + ε) + β, where μ and σ² are computed over the feature dimension of each sample
LayerNorm computes statistics per-sample, not per-batch, so it is identical during training and inference and works with any batch size. This is why transformers use LayerNorm exclusively. Every ViT layer, every LLM layer, every VLM backbone uses LayerNorm (or the closely related RMSNorm, which drops the mean centering). When you encounter a CNN backbone (BatchNorm) connected to a transformer (LayerNorm), the normalization transition is often a source of bugs.
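The per-sample property is directly observable: each token normalizes using only its own features, so the result is independent of batch size.

```python
import torch
import torch.nn as nn

# Transformer-style input: (batch, tokens, features)
x = torch.randn(2, 5, 64)

ln = nn.LayerNorm(64)
y = ln(x)

# Statistics are per token, over the feature dimension only
print(y.mean(dim=-1).abs().max() < 1e-5)        # ~zero mean per token: True

# Batch size is irrelevant: a lone sample normalizes identically
print(torch.allclose(y[:1], ln(x[:1]), atol=1e-6))  # True
```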
From CNNs to VLMs
Everything in this article builds toward a single purpose: producing a feature representation of an image that a language model can understand. The CNN pipeline we've built — convolutions that detect local patterns, stacked into hierarchies that detect objects, with residual connections for trainability and normalization for stability — produces spatial feature maps.
But modern VLMs don't use CNNs directly. They use Vision Transformers (ViTs), which we'll derive in Article 02. The ViT takes a different approach to the same problem: instead of building features hierarchically through local convolutions, it splits the image into patches, embeds each patch as a vector, and uses self-attention to let every patch attend to every other patch from the very first layer.
Understanding CNN fundamentals is still essential because:
- ViT patch embedding is literally a single convolution (kernel=patch_size, stride=patch_size).
- Many ViTs are hybrid: a CNN stem extracts initial features, then a transformer processes them.
- The feature hierarchy that CNNs learn (edges → textures → parts → objects) also emerges in ViTs, just through a different mechanism (attention patterns rather than filter weights).
- Concepts like receptive fields, spatial resolution, and feature extraction remain central to understanding what ViTs do.
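The first bullet can be verified in a few lines (patch size 16 and embedding width 768 are ViT-B/16-style values, used here for illustration):

```python
import torch
import torch.nn as nn

# ViT patch embedding is a single convolution: kernel = stride = patch size
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)
feat = patch_embed(x)                      # (1, 768, 14, 14): one vector per patch
tokens = feat.flatten(2).transpose(1, 2)   # (1, 196, 768): the token sequence
print(feat.shape, tokens.shape)
```

Because the stride equals the kernel size, the windows never overlap: each 16×16 patch is mapped to one 768-dimensional vector, exactly as a linear projection of the flattened patch would do.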
CNNs: Hierarchical Local → Global
Start with small receptive fields, build up through stacking. Translation equivariant by construction (same kernel everywhere). Strong inductive bias for spatial locality. O(K²CHW) per layer.
ViTs: Global from the Start
Every patch attends to every other patch at every layer. No built-in locality bias — must learn spatial relationships from data. O(N²D) per layer where N = number of patches. Requires more data to match CNN performance.
Code Examples
Let's build everything we've discussed from scratch. These examples are designed so you can run them, modify them, and verify that you understand each operation mechanically.
Image as a tensor — loading and preprocessing
import torch
import numpy as np
from PIL import Image
from torchvision import transforms
# Load an image and inspect its raw format
img = Image.open("photo.jpg")
print(f"PIL Image: mode={img.mode}, size={img.size}") # RGB, (W, H)
# Convert to numpy: shape is (H, W, C) with values in [0, 255]
img_np = np.array(img)
print(f"Numpy array: shape={img_np.shape}, dtype={img_np.dtype}")
print(f" Pixel [100, 50]: R={img_np[100,50,0]}, G={img_np[100,50,1]}, B={img_np[100,50,2]}")
# Standard ImageNet preprocessing pipeline
preprocess = transforms.Compose([
transforms.Resize(256), # Resize shortest edge to 256
transforms.CenterCrop(224), # Crop center 224x224
transforms.ToTensor(), # (H,W,C) uint8 -> (C,H,W) float32 in [0,1]
transforms.Normalize(
mean=[0.485, 0.456, 0.406], # ImageNet channel means
std=[0.229, 0.224, 0.225] # ImageNet channel stds
),
])
tensor = preprocess(img)
print(f"\nPreprocessed tensor: shape={tensor.shape}, dtype={tensor.dtype}")
print(f" Channel 0 (R): min={tensor[0].min():.3f}, max={tensor[0].max():.3f}")
print(f" Channel 1 (G): min={tensor[1].min():.3f}, max={tensor[1].max():.3f}")
print(f" Channel 2 (B): min={tensor[2].min():.3f}, max={tensor[2].max():.3f}")
# The batch dimension: add it for model input
batch = tensor.unsqueeze(0) # (1, 3, 224, 224)
print(f"Batch tensor: {batch.shape}")
Convolution from scratch — numpy implementation
import numpy as np
def conv2d_numpy(input_: np.ndarray, kernel: np.ndarray,
padding: int = 0, stride: int = 1) -> np.ndarray:
"""
2D convolution from scratch. No libraries, no tricks.
Args:
input_: (C_in, H, W)
kernel: (C_out, C_in, K, K)
padding: zero-padding on each side
stride: step size
Returns:
output: (C_out, H_out, W_out)
"""
C_out, C_in, K, _ = kernel.shape
_, H, W = input_.shape
# Apply zero-padding
if padding > 0:
input_ = np.pad(input_, ((0,0), (padding,padding), (padding,padding)))
_, H_pad, W_pad = input_.shape
H_out = (H_pad - K) // stride + 1
W_out = (W_pad - K) // stride + 1
output = np.zeros((C_out, H_out, W_out))
# The triple loop: this is EXACTLY what a GPU parallelizes
for co in range(C_out): # for each output channel
for i in range(H_out): # for each output row
for j in range(W_out): # for each output column
# Extract the local patch
h_start = i * stride
w_start = j * stride
patch = input_[:, h_start:h_start+K, w_start:w_start+K]
# Dot product with kernel
output[co, i, j] = np.sum(patch * kernel[co])
return output
# Example: apply edge detection to a grayscale image
# Create a simple 8x8 image with a vertical edge
image = np.zeros((1, 8, 8))
image[0, :, 4:] = 1.0 # right half is white
# Vertical edge detector
kernel = np.array([[[[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1]]]]) # (1, 1, 3, 3)
result = conv2d_numpy(image, kernel, padding=1)
print("Input (8x8, vertical edge at column 4):")
print(image[0].astype(int))
print("\nOutput (edge detected):")
print(np.round(result[0], 1))
Verifying against PyTorch
import torch
import torch.nn as nn
# Create a Conv2d layer and inspect its dimensions
conv = nn.Conv2d(
in_channels=3,
out_channels=64,
kernel_size=3,
stride=1,
padding=1,
bias=True
)
print(f"Weight shape: {conv.weight.shape}") # (64, 3, 3, 3)
print(f"Bias shape: {conv.bias.shape}") # (64,)
print(f"Parameter count: {sum(p.numel() for p in conv.parameters())}")
# 64 * 3 * 3 * 3 + 64 = 1,792
# Forward pass
x = torch.randn(1, 3, 224, 224) # batch of 1 RGB image
y = conv(x)
print(f"\nInput: {x.shape}") # (1, 3, 224, 224)
print(f"Output: {y.shape}") # (1, 64, 224, 224) — same spatial size (padding=1)
# Verify our numpy implementation matches PyTorch
conv_test = nn.Conv2d(1, 1, 3, padding=1, bias=False)
with torch.no_grad():
conv_test.weight.copy_(torch.tensor([[[[-1.,0.,1.],[-1.,0.,1.],[-1.,0.,1.]]]]))
x_test = torch.tensor(image, dtype=torch.float32).unsqueeze(0) # (1,1,8,8)
y_torch = conv_test(x_test).squeeze().numpy()
y_numpy = result[0]
print(f"\nMax difference: {np.max(np.abs(y_torch - y_numpy)):.10f}") # ~0.0
Building a ResNet block from scratch
import torch
import torch.nn as nn
class ResidualBlock(nn.Module):
"""A basic residual block: two 3x3 convolutions with a skip connection."""
def __init__(self, channels: int):
super().__init__()
self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(channels)
self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(channels)
self.relu = nn.ReLU(inplace=True)
def forward(self, x: torch.Tensor) -> torch.Tensor:
identity = x # save for skip connection
out = self.conv1(x) # conv -> BN -> ReLU
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out) # conv -> BN
out = self.bn2(out)
out = out + identity # SKIP CONNECTION: F(x) + x
out = self.relu(out) # final ReLU
return out
class BottleneckBlock(nn.Module):
"""ResNet-50 style bottleneck: 1x1 -> 3x3 -> 1x1 with expansion=4."""
expansion = 4
def __init__(self, in_channels: int, bottleneck_channels: int,
stride: int = 1):
super().__init__()
out_channels = bottleneck_channels * self.expansion
self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1, bias=False)
self.bn1 = nn.BatchNorm2d(bottleneck_channels)
self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3,
stride=stride, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(bottleneck_channels)
self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, 1, bias=False)
self.bn3 = nn.BatchNorm2d(out_channels)
self.relu = nn.ReLU(inplace=True)
# If dimensions change, need a projection shortcut
self.shortcut = nn.Identity()
if stride != 1 or in_channels != out_channels:
self.shortcut = nn.Sequential(
nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
nn.BatchNorm2d(out_channels),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
identity = self.shortcut(x)
out = self.relu(self.bn1(self.conv1(x))) # 1x1 reduce
out = self.relu(self.bn2(self.conv2(out))) # 3x3 spatial
out = self.bn3(self.conv3(out)) # 1x1 expand
out = out + identity # residual
return self.relu(out)
# Build a mini ResNet and count parameters
block = BottleneckBlock(in_channels=256, bottleneck_channels=64)
x = torch.randn(1, 256, 56, 56)
y = block(x)
print(f"Input: {x.shape}") # (1, 256, 56, 56)
print(f"Output: {y.shape}") # (1, 256, 56, 56)
print(f"Params: {sum(p.numel() for p in block.parameters()):,}")
# With downsampling
block_down = BottleneckBlock(256, 128, stride=2)
y_down = block_down(x)
print(f"\nDownsampled: {y_down.shape}") # (1, 512, 28, 28)
Grad-CAM visualization
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image
import numpy as np
def grad_cam(model, input_tensor, target_class, target_layer):
"""
Compute Grad-CAM heatmap for a given class and layer.
This implements: L_c = ReLU(sum_k(alpha_k * A^k))
where alpha_k = (1/Z) * sum_ij (dY_c / dA^k_ij)
"""
activations = {}
gradients = {}
# Register hooks to capture activations and gradients
def forward_hook(module, input, output):
activations['value'] = output.detach()
def backward_hook(module, grad_input, grad_output):
gradients['value'] = grad_output[0].detach()
handle_fwd = target_layer.register_forward_hook(forward_hook)
handle_bwd = target_layer.register_full_backward_hook(backward_hook)
# Forward pass
output = model(input_tensor)
class_score = output[0, target_class]
# Backward pass
model.zero_grad()
class_score.backward()
# Compute Grad-CAM
A = activations['value'] # (1, C, H, W) - feature maps
dY_dA = gradients['value'] # (1, C, H, W) - gradients
# Global average pooling of gradients -> channel importance weights
alpha = dY_dA.mean(dim=(2, 3), keepdim=True) # (1, C, 1, 1)
# Weighted sum of feature maps
cam = (alpha * A).sum(dim=1, keepdim=True) # (1, 1, H, W)
cam = F.relu(cam) # ReLU: only positive contributions
# Normalize to [0, 1]
cam = cam - cam.min()
cam = cam / (cam.max() + 1e-8)
# Resize to input dimensions
cam = F.interpolate(cam, size=input_tensor.shape[2:],
mode='bilinear', align_corners=False)
handle_fwd.remove()
handle_bwd.remove()
return cam.squeeze().numpy()
# Usage
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)  # pretrained=True is deprecated
model.eval()
img = Image.open("cat.jpg")
tensor = transforms.Compose([
transforms.Resize(224),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize([0.485,0.456,0.406], [0.229,0.224,0.225])
])(img).unsqueeze(0)
heatmap = grad_cam(model, tensor, target_class=281, # tabby cat
target_layer=model.layer4[-1])
print(f"Heatmap shape: {heatmap.shape}") # (224, 224)
References
Seminal papers and key works referenced in this article.
- Krizhevsky et al. "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS, 2012.
- Simonyan & Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR, 2015.
- He et al. "Deep Residual Learning for Image Recognition." CVPR, 2016.
- Zeiler & Fergus. "Visualizing and Understanding Convolutional Networks." ECCV, 2014.
- Selvaraju et al. "Grad-CAM: Visual Explanations from Deep Networks." ICCV, 2017.
- Lin et al. "Feature Pyramid Networks for Object Detection." CVPR, 2017.
- Luo et al. "Understanding the Effective Receptive Field in Deep Convolutional Neural Networks." NeurIPS, 2016.
- Ioffe & Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML, 2015.
- Ba et al. "Layer Normalization." arXiv, 2016.
- Yamins et al. "Performance-optimized hierarchical models predict neural responses in higher visual cortex." PNAS, 2014.