Introduction
Richard Feynman wrote on his blackboard: "What I cannot create, I do not understand." This article takes that literally. We will build the entire visual pipeline of a modern VLM from scratch — starting from the definition of a pixel and ending with the tensor representations that feed into a transformer. Every operation will be derived, every dimension tracked, every design choice explained.
The history of computer vision is a history of representations. For decades, researchers hand-crafted features: SIFT (Lowe, 1999) detected scale-invariant keypoints. HOG (Dalal & Triggs, 2005) counted gradient orientations in local cells. These features were brilliant engineering, but they hit a ceiling because a human designer had to decide what to extract. The deep learning revolution, beginning with AlexNet (Krizhevsky et al., 2012), replaced hand-crafted features with learned features: stacks of convolutional layers that discover what to extract directly from data.
Every modern VLM — GPT-4V, Claude's vision, Gemini, LLaVA, Qwen-VL — uses a vision encoder descended from this tradition. Whether the encoder is a CNN (like EfficientNet in RT-1) or a Vision Transformer (like ViT in CLIP), it is fundamentally doing the same thing: converting a grid of pixel values into a sequence of high-dimensional feature vectors that capture the visual content of the image. Understanding how this conversion works — mechanically, mathematically, intuitively — is the foundation for everything else in this series.
We build from absolute first principles. What a pixel is, physically and numerically. How color spaces work. The convolution operation derived step by step with full index arithmetic. Why stacking convolutions creates feature hierarchies. What receptive fields are and how to compute them. Pooling, residual connections, feature pyramids, and normalization layers. Code to implement everything from scratch in PyTorch and numpy. By the end, you will be able to build a CNN from scratch and understand exactly what every layer does to every pixel.
What Is an Image?
To a camera sensor, an image is a collection of photon counts. Each photosite (pixel) on the sensor accumulates photons during exposure, converts the charge to a voltage, and digitizes it to an integer. A typical sensor has millions of photosites arranged in a rectangular grid.
To a computer, the result is a 2D array of integers. But a single array gives only luminance (brightness). Color requires multiple measurements at different wavelengths.
Pixels and channels
Most digital cameras use a Bayer filter — a mosaic of red, green, and blue filters placed over the sensor, so that each pixel records only one color. A demosaicing algorithm interpolates the missing colors, producing three values per pixel: red (R), green (G), and blue (B) intensity. These three values are the channels of the image.
A standard 8-bit RGB image has values in [0, 255] per channel per pixel. With 3 channels, each pixel is a triplet (R, G, B). A 224×224 image has 224 × 224 = 50,176 pixels × 3 channels = 150,528 values. This is already a very high-dimensional space — far more dimensions than most tabular datasets. Yet to a human, it's just a thumbnail.
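The arithmetic above can be checked directly in numpy:

```python
import numpy as np

# A 224x224 RGB image in the (H, W, C) uint8 convention used by PIL/numpy
img = np.zeros((224, 224, 3), dtype=np.uint8)

print(img.shape)   # (224, 224, 3)
print(224 * 224)   # 50,176 pixels
print(img.size)    # 150,528 values total
```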
Red Channel
Intensity of long-wavelength light (~620–750nm). High in fire, skin, brick. Low in sky, water, foliage.
Green Channel
Intensity of medium-wavelength light (~495–570nm). Dominates in vegetation. Human vision is most sensitive to this band.
Blue Channel
Intensity of short-wavelength light (~450–495nm). High in sky, water, shadows. Sensor noise is typically highest in this channel.
Color spaces
RGB is not the only way to decompose color. Several alternative color spaces are useful in different contexts:
- HSV (Hue, Saturation, Value): Separates chromatic content (hue) from intensity (value). Useful for color-based segmentation because hue is invariant to illumination changes. Hue is an angle [0°, 360°); saturation and value are in [0, 1].
- YCbCr: Separates luminance (Y) from chrominance (Cb, Cr). Used in JPEG compression because human vision is more sensitive to luminance than chrominance, allowing heavier compression of Cb/Cr channels. Also the native format of many video codecs.
- LAB (CIE L*a*b*): Perceptually uniform: equal numerical distances correspond to approximately equal perceived differences. L* is lightness; a* and b* are color opponents (red-green and yellow-blue). Useful for color difference metrics.
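A minimal sketch of the RGB/HSV relationship, using Python's standard-library `colorsys` (the sky-blue pixel values are illustrative):

```python
import colorsys

# RGB scaled to [0, 1]: an illustrative sky-blue pixel
r, g, b = 70 / 255, 130 / 255, 180 / 255

# HSV separates chromatic content (hue) from intensity (value)
h, s, v = colorsys.rgb_to_hsv(r, g, b)
print(f"hue={h * 360:.0f} deg, sat={s:.2f}, val={v:.2f}")  # hue ~207 deg (blue)

# The conversion is invertible: round-trip back to RGB
r2, g2, b2 = colorsys.hsv_to_rgb(h, s, v)
print(abs(r - r2) < 1e-9)  # True
```

Note that `colorsys` represents hue in [0, 1) rather than degrees; multiply by 360 to recover the angle.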
For deep learning, RGB is the standard. All major pretrained models (ResNet, ViT, CLIP, DINOv2) expect RGB input. The standard preprocessing normalizes pixel values from [0, 255] integers to [0, 1] floats, then applies channel-wise normalization with the ImageNet mean ([0.485, 0.456, 0.406]) and standard deviation ([0.229, 0.224, 0.225]) to produce zero-centered, unit-variance inputs. This normalization is not arbitrary — it matches the statistics of the pretraining data, and using different normalization with a pretrained model will degrade performance.
The image as a tensor
In PyTorch, an image is a 3D tensor of shape (C, H, W): channels first, then height
(rows), then width (columns). A batch of B images is a 4D tensor (B, C, H, W). This
"channels first" layout is PyTorch's default and differs from TensorFlow's default
(B, H, W, C) and numpy/PIL's (H, W, C).
The ordering matters for computation: convolution kernels are applied along the (H, W) dimensions and sum across C. Getting the dimension ordering wrong is one of the most common bugs in vision code.
Batch tensor: X ∈ ℝ^(B×C×H×W)
For a 224×224 RGB image: x ∈ ℝ^(3×224×224)
Why does dimension ordering matter? Modern hardware (GPUs, TPUs) is optimized for specific memory
access patterns. Channels-first layout allows contiguous memory access for convolution operations,
since each kernel computes a dot product across all channels at a spatial location. This is why
PyTorch defaults to NCHW format, though the optional channels_last memory format
(NHWC) can be faster on some GPUs due to tensor core requirements.
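A short sketch of the layout conversion, including the classic bug of reshaping instead of permuting:

```python
import numpy as np
import torch

# numpy/PIL convention: (H, W, C)
img_hwc = np.random.rand(224, 224, 3).astype(np.float32)

# PyTorch convention: (C, H, W) -- permute the axes, don't reshape
img_chw = torch.from_numpy(img_hwc).permute(2, 0, 1)
print(img_chw.shape)  # torch.Size([3, 224, 224])

# reshape produces the same SHAPE but silently scrambles the pixels:
wrong = torch.from_numpy(img_hwc).reshape(3, 224, 224)
print(torch.equal(img_chw, wrong))  # False -- a classic silent bug

# channels_last: logical shape stays NCHW, only the memory strides change
x = img_chw.unsqueeze(0).contiguous(memory_format=torch.channels_last)
print(x.shape)  # torch.Size([1, 3, 224, 224])
```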
The Convolution Operation
The convolution is the fundamental operation of all CNN-based vision models. Despite the name, what neural networks compute is technically cross-correlation (no kernel flipping), but the community universally calls it "convolution." Let's derive it from scratch.
Mechanics: what a convolution does
A convolution slides a small matrix (the kernel or filter) across the input image, computing a dot product at each position. For a single-channel input and a K×K kernel, the output at position (i, j) is:
y[i, j] = Σ_{u=0}^{K−1} Σ_{v=0}^{K−1} w[u, v] · x[i+u, j+v] + b
where w is the kernel weight matrix, x is the input, and b is a scalar bias. The output y is called a feature map. Each position in the feature map summarizes a local region of the input through the kernel's weighted sum.
For a multi-channel input (C_in channels), the kernel becomes 3D: w ∈ ℝ^(C_in×K×K). The convolution sums across all input channels:
y[i, j] = Σ_{c} Σ_{u=0}^{K−1} Σ_{v=0}^{K−1} w[c, u, v] · x[c, i+u, j+v] + b
One kernel produces one feature map. To produce C_out feature maps, we use C_out kernels, each of shape (C_in, K, K). The full weight tensor is W ∈ ℝ^(C_out×C_in×K×K).
Parameter count: A Conv2d layer with C_in=64, C_out=128, K=3 has 128 × 64 × 3 × 3 + 128 = 73,856 parameters. This is vastly fewer than a fully-connected layer connecting the same input to the same output: mapping a 64×56×56 input (200,704 values) to a 128×56×56 output (401,408 values) would need roughly 80 billion weights. The convolution's weight sharing (same kernel applied at every position) and locality (each output depends only on a small neighborhood) are what make it feasible.
Let's make this concrete. Here is a 3×3 edge-detection kernel and what it does:
The horizontal edge detector subtracts the top row from the bottom row. Where pixel values change sharply from top to bottom (a horizontal edge), the output is large. Where values are uniform, the output is zero. The Gaussian blur kernel is a weighted average that smooths the image. In a learned CNN, kernels are initialized randomly and updated by gradient descent to extract whatever features minimize the training loss.
Padding and stride
Without padding, a K×K convolution on an H×W input produces an (H-K+1)×(W-K+1) output. Each layer shrinks the feature map. Padding adds zeros (or reflects values) around the border so the output has the same spatial dimensions as the input. For a 3×3 kernel, padding=1 preserves the size: (H-3+2+1) = H.
H_out = ⌊(H − K + 2P) / S⌋ + 1, where H = input height, K = kernel size, P = padding, S = stride
Stride controls how far the kernel moves between positions. Stride=1 means the kernel moves one pixel at a time (the default). Stride=2 means it moves two pixels, halving the output dimensions. Stride-2 convolutions are the primary mechanism for spatial downsampling in modern CNNs — they reduce resolution while increasing the number of channels, trading spatial detail for semantic richness.
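The output-size formula is easy to make runnable; a minimal helper and a few familiar configurations:

```python
def conv_out_size(h: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output spatial size: floor((H - K + 2P) / S) + 1."""
    return (h - k + 2 * p) // s + 1

print(conv_out_size(224, 3))             # 222: no padding shrinks the map
print(conv_out_size(224, 3, p=1))        # 224: padding=1 preserves size
print(conv_out_size(224, 3, p=1, s=2))   # 112: stride-2 halves it
print(conv_out_size(224, 7, p=3, s=2))   # 112: ResNet's 7x7 stem conv
```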
What learned kernels look like
The first convolutional layer of a trained CNN learns kernels that resemble classical image processing filters: edge detectors at various orientations, color-opponent filters (red vs. green, blue vs. yellow), and frequency-selective filters (Gabor-like patterns). This was first clearly demonstrated by Krizhevsky et al. (2012) when they visualized AlexNet's first layer.
Deeper layers learn increasingly abstract patterns. Zeiler and Fergus (2014) showed this definitively by using deconvolutional visualization: layer 2 learns corners, textures, and simple shapes; layer 3 learns object parts (wheels, eyes, windows); layers 4–5 learn object-level representations (faces, buildings, animals). This hierarchical emergence of features — from edges to textures to parts to objects — is the core insight of deep convolutional networks.
AlexNet (2012) used 11×11 and 5×5 kernels. VGGNet (Simonyan & Zisserman, 2014) showed that stacking two 3×3 convolutions gives the same receptive field as one 5×5 (and stacking three 3×3 gives 7×7) but with fewer parameters and more nonlinearities. Two 3×3 layers: 2 × 9C² = 18C² parameters. One 5×5 layer: 25C² parameters. Same receptive field, 28% fewer parameters, and an extra ReLU between the layers adds representational capacity. Since VGGNet, nearly all CNN architectures use 3×3 kernels exclusively (with occasional 1×1 kernels for channel mixing).
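The VGG parameter comparison, spelled out (C = 64 is an illustrative channel width):

```python
C = 64  # channel width, same in and out for a fair comparison

two_3x3 = 2 * (3 * 3 * C * C)   # two stacked 3x3 conv layers: 18C^2
one_5x5 = 5 * 5 * C * C         # one 5x5 conv layer: 25C^2

print(two_3x3, one_5x5)                      # 73728 102400
print(f"{1 - two_3x3 / one_5x5:.0%} fewer")  # 28% fewer
```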
Feature Hierarchies
A single convolution layer detects simple local patterns (edges, colors). By stacking many layers, each operating on the output of the previous, the network builds a hierarchy of increasingly abstract features. This is the core computational principle of deep learning for vision.
What each layer learns
The progression is remarkably consistent across architectures and training objectives:
| Depth | Features Detected | Receptive Field | Analogy |
|---|---|---|---|
| Layer 1 | Oriented edges, color gradients, brightness contrasts | 3–7 pixels | Retinal ganglion cells (V1 simple cells) |
| Layers 2–3 | Corners, T-junctions, textures, simple shapes | 16–40 pixels | V1 complex cells, V2 |
| Layers 4–6 | Object parts: wheels, eyes, handles, windows | 50–100 pixels | V4, IT cortex |
| Layers 7+ | Whole objects, scene categories, semantic concepts | Full image | Higher visual cortex |
This hierarchy is not designed — it emerges from training. The network discovers that edges are useful building blocks for corners, corners for parts, and parts for objects. The loss function (e.g., cross-entropy for classification) only provides a signal at the final layer. All intermediate representations are learned end-to-end through backpropagation.
A deep insight: this learned hierarchy mirrors the visual processing pathway in the primate brain. The visual cortex processes information through areas V1 → V2 → V4 → IT with increasing receptive fields and abstraction levels. This parallel was first noted by Yamins et al. (2014), who showed that deep CNN representations predict neural responses in monkey IT cortex better than any previous computational model. The brain and the CNN independently converge on similar representational strategies — strong evidence that this hierarchy is not just a design choice but reflects fundamental structure in the statistics of natural images.
Visualizing learned features
Understanding what each layer represents is essential for debugging and interpreting CNN behavior. Several visualization techniques exist:
- Filter visualization: Display the weight values of the learned kernels directly. Works well for the first layer (3-channel kernels interpretable as colored patterns) but becomes opaque for deeper layers (64-channel kernels are not human-interpretable as images).
- Activation maximization: Generate a synthetic input that maximizes the activation of a specific neuron or channel. Uses gradient ascent on the input image (with regularization to keep images natural-looking). Produces "dream-like" images that reveal what each neuron is looking for.
- Grad-CAM (Selvaraju et al., 2017): Uses the gradient of the output class score with respect to the feature maps of a convolutional layer to produce a coarse localization map showing which spatial regions contributed most to the prediction. This is the standard method for model interpretability in production.
Receptive Fields
The receptive field of a neuron is the region of the input image that can influence its value. For a single 3×3 convolution, the receptive field is 3×3 pixels. For two stacked 3×3 convolutions, it's 5×5. For three, it's 7×7. Each additional layer expands the receptive field by K-1 pixels (for a K×K kernel with stride 1).
Computing receptive fields
For a stack of L convolution layers, each with kernel size K_l and stride S_l, the receptive field of a neuron in layer L is:
RF_L = 1 + Σ_{l=1}^{L} (K_l − 1) · Π_{i=1}^{l−1} S_i
For a simple stack of 3×3 convolutions with stride 1, this reduces to:
RF_L = 1 + 2L
So 5 layers of 3×3 conv → RF = 11×11. This grows linearly with depth. To achieve a large receptive field efficiently, architectures use stride-2 convolutions or pooling layers that double the effective growth rate. A typical ResNet-50 has a theoretical receptive field of 483×483 pixels — larger than the standard 224×224 input, meaning every output neuron can theoretically "see" the entire image.
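The formula above can be implemented in a few lines; a sketch, where each layer is a (kernel, stride) pair:

```python
def receptive_field(layers):
    """RF = 1 + sum_l (K_l - 1) * prod_{i<l} S_i, for (kernel, stride) pairs."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer adds (K-1) scaled by cumulative stride
        jump *= s
    return rf

# Five 3x3 stride-1 convolutions: RF grows by 2 per layer
print(receptive_field([(3, 1)] * 5))  # 11

# A stride-2 layer doubles the growth rate of every layer after it
print(receptive_field([(3, 1), (3, 2), (3, 1), (3, 1)]))  # 13 (vs 9 all-stride-1)
```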
Effective receptive field
Luo et al. (2016) showed that the effective receptive field is much smaller than the theoretical one. Center pixels contribute far more to the output than peripheral pixels, following a roughly Gaussian distribution. In practice, only about 30–50% of the theoretical receptive field is actually used. This means that even in deep networks, each neuron is primarily influenced by a local neighborhood — a property that has implications for spatial resolution in feature maps and for the design of architectures that need to capture long-range dependencies (motivating the move to attention mechanisms and Vision Transformers).
If a CNN's effective receptive field is ~60 pixels, then objects smaller than ~60 pixels in the image are below the resolution that the network's deepest features can meaningfully represent. This is why higher input resolution improves performance on fine-grained tasks: more pixels per object means the object occupies a larger fraction of each neuron's receptive field. It also explains why VLMs benefit from higher resolution: a ViT-L/14 at 224px has each patch covering 14×14 pixels (very coarse for document text), while at 336px the same patch covers the same 14×14 pixels but the image captures 2.25× more visual content.
Pooling & Downsampling
Pooling reduces the spatial dimensions of feature maps, serving two purposes: reducing computation (fewer values to process in subsequent layers) and introducing a degree of translation invariance (small shifts in the input don't change the pooled output).
Max pooling takes the maximum value in each local window (typically 2×2 with stride 2, halving both dimensions). It preserves the strongest activation in each region, acting as a "was this feature present anywhere in this region?" detector. Max pooling was the dominant downsampling method in early CNN architectures (AlexNet, VGGNet).
Average pooling computes the mean instead of the max. It is smoother and preserves more information about the average activity in a region, but discards the precise location of features. Global Average Pooling (GAP) — averaging over the entire spatial extent to produce a single value per channel — is the standard method for converting spatial feature maps to classification vectors in modern CNNs. It was introduced by Lin et al. (2013) and replaced the fully connected layers that AlexNet used.
Stride-2 convolution has largely replaced explicit pooling in modern architectures. Instead of convolving with stride 1 and then max-pooling, the network convolves with stride 2 directly. This is more parameter-efficient (one operation instead of two) and gives the network the ability to learn how to downsample rather than using a fixed max/average rule. ResNet, EfficientNet, and most modern CNNs use stride-2 convolutions for spatial reduction.
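The three downsampling options side by side, shape-checked in PyTorch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# 2x2 max pooling, stride 2: halves spatial dims, keeps channels
print(nn.MaxPool2d(2)(x).shape)                              # (1, 64, 28, 28)

# Stride-2 convolution: learned downsampling, typically doubling channels
print(nn.Conv2d(64, 128, 3, stride=2, padding=1)(x).shape)   # (1, 128, 28, 28)

# Global Average Pooling: spatial map -> one value per channel
print(nn.AdaptiveAvgPool2d(1)(x).shape)                      # (1, 64, 1, 1)
```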
Residual Networks
Training very deep CNNs (20+ layers) hit a surprising problem: deeper networks performed worse than shallower ones, even on the training set. This wasn't overfitting — training loss was higher, not just test loss. The problem was degradation: optimization struggled to learn identity mappings through many nonlinear layers.
Skip connections: the key insight
He et al. (2016) proposed a simple but transformative solution: instead of learning a function H(x) directly, learn the residual F(x) = H(x) - x. The output becomes:
y = F(x) + x
where F(x) is computed by a stack of conv-BN-ReLU layers and x is passed through a skip connection (identity shortcut) that bypasses those layers. If the optimal transformation is close to identity (which it often is in deep networks), learning a small residual F(x) ≈ 0 is much easier than learning H(x) ≈ x from scratch.
The skip connection has a second, equally important effect: it provides a direct gradient pathway from the loss to earlier layers. Without skip connections, gradients must flow through every intermediate layer, shrinking multiplicatively at each step (the vanishing gradient problem). With skip connections, gradients can flow directly through the identity shortcut, maintaining signal strength across many layers. This is why ResNets can be trained with 100+ layers while plain networks fail at 20+.
The ResNet architecture
ResNet is organized into stages, with each stage operating at a different spatial resolution:
| Stage | Output Size | Channels | Blocks (ResNet-50) | Downsampling |
|---|---|---|---|---|
| Stem | 56×56 | 64 | Conv 7×7, stride 2, MaxPool | 4× |
| Stage 1 | 56×56 | 256 | 3 bottleneck blocks | None |
| Stage 2 | 28×28 | 512 | 4 bottleneck blocks | 2× (stride-2 conv) |
| Stage 3 | 14×14 | 1024 | 6 bottleneck blocks | 2× |
| Stage 4 | 7×7 | 2048 | 3 bottleneck blocks | 2× |
| Head | 1×1 | 2048 | Global Average Pool | 7× |
A bottleneck block uses three convolutions: 1×1 (reduce channels), 3×3 (spatial processing at reduced channels), 1×1 (restore channels). This is cheaper than two 3×3 convolutions at full channel width. For ResNet-50 with bottleneck blocks, the total parameter count is ~25.6M — remarkably efficient for the accuracy it achieves.
ResNet was published in 2015 and won the ImageNet competition with 3.57% top-5 error (surpassing
human-level performance at ~5.1%). A decade later, its core idea — residual connections
— is universal. Every transformer layer uses a residual connection (y = Attention(x) + x).
Every modern VLM backbone is built on residual connections. The insight that networks should
learn deviations from identity, not the full transformation, is one of the most important ideas
in deep learning. When you see x + self.attn(self.norm(x)) in a transformer, you
are seeing ResNet's legacy.
Feature Pyramids
A CNN naturally produces features at multiple scales: early layers have high spatial resolution (many pixels, fine detail) and late layers have low spatial resolution (few pixels, coarse semantics). Object detection and segmentation require both — fine localization and strong semantics. The Feature Pyramid Network (FPN, Lin et al., 2017) combines features from all stages into a multi-scale feature representation.
FPN works by adding a top-down pathway: starting from the deepest (most semantic) features, it upsamples by 2× using nearest-neighbor interpolation and adds the result to the corresponding higher-resolution features from the bottom-up pathway. A 1×1 convolution adjusts the channel counts to match. The result is a set of feature maps at every resolution, each containing both fine spatial detail and strong semantic content.
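A minimal sketch of the top-down pathway over three backbone stages (the channel counts follow ResNet-50's stages 2–4; the FPN width of 256 is the value used in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Bottom-up features: resolution decreases, semantics strengthen
c3 = torch.randn(1, 512, 28, 28)
c4 = torch.randn(1, 1024, 14, 14)
c5 = torch.randn(1, 2048, 7, 7)

d = 256  # common FPN channel width
lat3, lat4, lat5 = (nn.Conv2d(c, d, 1) for c in (512, 1024, 2048))

# Top-down: 1x1 lateral conv matches channels, nearest-neighbor 2x upsample, add
p5 = lat5(c5)
p4 = lat4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
p3 = lat3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")

for p in (p3, p4, p5):
    print(p.shape)  # 256 channels at 28/14/7 resolution
```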
For VLMs, the FPN concept manifests in a different form: multi-layer feature extraction. Instead of combining features spatially (as FPN does), some VLM architectures extract features from multiple transformer layers and concatenate them, getting both spatial detail (from early layers) and semantic abstraction (from late layers). This is conceptually the same principle as FPN applied to the depth dimension rather than the spatial dimension.
Normalization Layers
Training deep networks is unstable without normalization. As activations flow through many layers, their distribution shifts, requiring careful learning rate tuning. Normalization layers standardize activations to have zero mean and unit variance, stabilizing training and enabling larger learning rates.
Batch Normalization
BatchNorm (Ioffe & Szegedy, 2015) normalizes across the batch dimension for each channel independently:
x̂_{b,c,h,w} = (x_{b,c,h,w} − μ_c) / √(σ_c² + ε),  y_c = γ_c · x̂_c + β_c
where μ_c = (1/BHW) Σ_{b,h,w} x_{b,c,h,w} and σ_c² = (1/BHW) Σ_{b,h,w} (x_{b,c,h,w} − μ_c)²
The learnable parameters γ (scale) and β (shift) allow the network to recover any mean and variance it needs. BatchNorm depends on the batch dimension, which creates problems: it behaves differently during training (batch statistics) and inference (running statistics), and performance degrades with small batch sizes.
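The per-channel statistics can be verified against PyTorch's implementation (affine disabled so γ and β don't enter):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 4, 4)  # (B, C, H, W)

bn = nn.BatchNorm2d(3, affine=False)
bn.train()  # training mode: use batch statistics, not running averages
y = bn(x)

# Manual computation: mean/variance per channel, over (B, H, W)
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
y_manual = (x - mu) / torch.sqrt(var + bn.eps)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```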
Layer Normalization
LayerNorm (Ba et al., 2016) normalizes across the feature dimensions for each sample independently:
y = γ ⊙ (x − μ) / √(σ² + ε) + β, where μ and σ² are computed over the feature dimension of each sample
LayerNorm computes statistics per-sample, not per-batch, so it is identical during training and inference and works with any batch size. This is why transformers use LayerNorm exclusively. Every ViT layer, every LLM layer, every VLM backbone uses LayerNorm (or the closely related RMSNorm, which drops the mean centering). When you encounter a CNN backbone (BatchNorm) connected to a transformer (LayerNorm), the normalization transition is often a source of bugs.
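The per-sample property is directly observable: each token normalizes using only its own features, so the result is independent of batch size.

```python
import torch
import torch.nn as nn

# Transformer-style input: (batch, tokens, features)
x = torch.randn(2, 5, 64)

ln = nn.LayerNorm(64)
y = ln(x)

# Statistics are per token, over the feature dimension only
print(y.mean(dim=-1).abs().max() < 1e-5)        # ~zero mean per token: True

# Batch size is irrelevant: a lone sample normalizes identically
print(torch.allclose(y[:1], ln(x[:1]), atol=1e-6))  # True
```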
From CNNs to VLMs
Everything in this article builds toward a single purpose: producing a feature representation of an image that a language model can understand. The CNN pipeline we've built — convolutions that detect local patterns, stacked into hierarchies that detect objects, with residual connections for trainability and normalization for stability — produces spatial feature maps.
But modern VLMs don't use CNNs directly. They use Vision Transformers (ViTs), which we'll derive in Article 02. The ViT takes a different approach to the same problem: instead of building features hierarchically through local convolutions, it splits the image into patches, embeds each patch as a vector, and uses self-attention to let every patch attend to every other patch from the very first layer.
Understanding CNN fundamentals is still essential because:
- ViT patch embedding is literally a single convolution (kernel=patch_size, stride=patch_size).
- Many ViTs are hybrid: a CNN stem extracts initial features, then a transformer processes them.
- The feature hierarchy that CNNs learn (edges → textures → parts → objects) also emerges in ViTs, just through a different mechanism (attention patterns rather than filter weights).
- Concepts like receptive fields, spatial resolution, and feature extraction remain central to understanding what ViTs do.
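The first bullet can be verified in a few lines (patch size 16 and embedding width 768 are ViT-B/16-style values, used here for illustration):

```python
import torch
import torch.nn as nn

# ViT patch embedding is a single convolution: kernel = stride = patch size
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)
feat = patch_embed(x)                      # (1, 768, 14, 14): one vector per patch
tokens = feat.flatten(2).transpose(1, 2)   # (1, 196, 768): the token sequence
print(feat.shape, tokens.shape)
```

Because the stride equals the kernel size, the windows never overlap: each 16×16 patch is mapped to one 768-dimensional vector, exactly as a linear projection of the flattened patch would do.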
CNNs: Hierarchical Local → Global
Start with small receptive fields, build up through stacking. Translation equivariant by construction (same kernel everywhere). Strong inductive bias for spatial locality. O(K²CHW) per layer.
ViTs: Global from the Start
Every patch attends to every other patch at every layer. No built-in locality bias — must learn spatial relationships from data. O(N²D) per layer where N = number of patches. Requires more data to match CNN performance.
Code Examples
Let's build everything we've discussed from scratch. These examples are designed so you can run them, modify them, and verify that you understand each operation mechanically.
Image as a tensor — loading and preprocessing
import torch
import numpy as np
from PIL import Image
from torchvision import transforms
# Load an image and inspect its raw format
img = Image.open("photo.jpg")
print(f"PIL Image: mode={img.mode}, size={img.size}") # RGB, (W, H)
# Convert to numpy: shape is (H, W, C) with values in [0, 255]
img_np = np.array(img)
print(f"Numpy array: shape={img_np.shape}, dtype={img_np.dtype}")
print(f" Pixel [100, 50]: R={img_np[100,50,0]}, G={img_np[100,50,1]}, B={img_np[100,50,2]}")
# Standard ImageNet preprocessing pipeline
preprocess = transforms.Compose([
transforms.Resize(256), # Resize shortest edge to 256
transforms.CenterCrop(224), # Crop center 224x224
transforms.ToTensor(), # (H,W,C) uint8 -> (C,H,W) float32 in [0,1]
transforms.Normalize(
mean=[0.485, 0.456, 0.406], # ImageNet channel means
std=[0.229, 0.224, 0.225] # ImageNet channel stds
),
])
tensor = preprocess(img)
print(f"\nPreprocessed tensor: shape={tensor.shape}, dtype={tensor.dtype}")
print(f" Channel 0 (R): min={tensor[0].min():.3f}, max={tensor[0].max():.3f}")
print(f" Channel 1 (G): min={tensor[1].min():.3f}, max={tensor[1].max():.3f}")
print(f" Channel 2 (B): min={tensor[2].min():.3f}, max={tensor[2].max():.3f}")
# The batch dimension: add it for model input
batch = tensor.unsqueeze(0) # (1, 3, 224, 224)
print(f"Batch tensor: {batch.shape}")
Convolution from scratch — numpy implementation
import numpy as np
def conv2d_numpy(input_: np.ndarray, kernel: np.ndarray,
padding: int = 0, stride: int = 1) -> np.ndarray:
"""
2D convolution from scratch. No libraries, no tricks.
Args:
input_: (C_in, H, W)
kernel: (C_out, C_in, K, K)
padding: zero-padding on each side
stride: step size
Returns:
output: (C_out, H_out, W_out)
"""
C_out, C_in, K, _ = kernel.shape
_, H, W = input_.shape
# Apply zero-padding
if padding > 0:
input_ = np.pad(input_, ((0,0), (padding,padding), (padding,padding)))
_, H_pad, W_pad = input_.shape
H_out = (H_pad - K) // stride + 1
W_out = (W_pad - K) // stride + 1
output = np.zeros((C_out, H_out, W_out))
# The triple loop: this is EXACTLY what a GPU parallelizes
for co in range(C_out): # for each output channel
for i in range(H_out): # for each output row
for j in range(W_out): # for each output column
# Extract the local patch
h_start = i * stride
w_start = j * stride
patch = input_[:, h_start:h_start+K, w_start:w_start+K]
# Dot product with kernel
output[co, i, j] = np.sum(patch * kernel[co])
return output
# Example: apply edge detection to a grayscale image
# Create a simple 8x8 image with a vertical edge
image = np.zeros((1, 8, 8))
image[0, :, 4:] = 1.0 # right half is white
# Vertical edge detector
kernel = np.array([[[[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1]]]]) # (1, 1, 3, 3)
result = conv2d_numpy(image, kernel, padding=1)
print("Input (8x8, vertical edge at column 4):")
print(image[0].astype(int))
print("\nOutput (edge detected):")
print(np.round(result[0], 1))
Verifying against PyTorch
import torch
import torch.nn as nn
# Create a Conv2d layer and inspect its dimensions
conv = nn.Conv2d(
in_channels=3,
out_channels=64,
kernel_size=3,
stride=1,
padding=1,
bias=True
)
print(f"Weight shape: {conv.weight.shape}") # (64, 3, 3, 3)
print(f"Bias shape: {conv.bias.shape}") # (64,)
print(f"Parameter count: {sum(p.numel() for p in conv.parameters())}")
# 64 * 3 * 3 * 3 + 64 = 1,792
# Forward pass
x = torch.randn(1, 3, 224, 224) # batch of 1 RGB image
y = conv(x)
print(f"\nInput: {x.shape}") # (1, 3, 224, 224)
print(f"Output: {y.shape}") # (1, 64, 224, 224) — same spatial size (padding=1)
# Verify our numpy implementation matches PyTorch
conv_test = nn.Conv2d(1, 1, 3, padding=1, bias=False)
with torch.no_grad():
conv_test.weight.copy_(torch.tensor([[[[-1.,0.,1.],[-1.,0.,1.],[-1.,0.,1.]]]]))
x_test = torch.tensor(image, dtype=torch.float32).unsqueeze(0) # (1,1,8,8)
y_torch = conv_test(x_test).squeeze().numpy()
y_numpy = result[0]
print(f"\nMax difference: {np.max(np.abs(y_torch - y_numpy)):.10f}") # ~0.0
Building a ResNet block from scratch
import torch
import torch.nn as nn
class ResidualBlock(nn.Module):
"""A basic residual block: two 3x3 convolutions with a skip connection."""
def __init__(self, channels: int):
super().__init__()
self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(channels)
self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(channels)
self.relu = nn.ReLU(inplace=True)
def forward(self, x: torch.Tensor) -> torch.Tensor:
identity = x # save for skip connection
out = self.conv1(x) # conv -> BN -> ReLU
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out) # conv -> BN
out = self.bn2(out)
out = out + identity # SKIP CONNECTION: F(x) + x
out = self.relu(out) # final ReLU
return out
class BottleneckBlock(nn.Module):
"""ResNet-50 style bottleneck: 1x1 -> 3x3 -> 1x1 with expansion=4."""
expansion = 4
def __init__(self, in_channels: int, bottleneck_channels: int,
stride: int = 1):
super().__init__()
out_channels = bottleneck_channels * self.expansion
self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1, bias=False)
self.bn1 = nn.BatchNorm2d(bottleneck_channels)
self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3,
stride=stride, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(bottleneck_channels)
self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, 1, bias=False)
self.bn3 = nn.BatchNorm2d(out_channels)
self.relu = nn.ReLU(inplace=True)
# If dimensions change, need a projection shortcut
self.shortcut = nn.Identity()
if stride != 1 or in_channels != out_channels:
self.shortcut = nn.Sequential(
nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
nn.BatchNorm2d(out_channels),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
identity = self.shortcut(x)
out = self.relu(self.bn1(self.conv1(x))) # 1x1 reduce
out = self.relu(self.bn2(self.conv2(out))) # 3x3 spatial
out = self.bn3(self.conv3(out)) # 1x1 expand
out = out + identity # residual
return self.relu(out)
# Build a mini ResNet and count parameters
block = BottleneckBlock(in_channels=256, bottleneck_channels=64)
x = torch.randn(1, 256, 56, 56)
y = block(x)
print(f"Input: {x.shape}") # (1, 256, 56, 56)
print(f"Output: {y.shape}") # (1, 256, 56, 56)
print(f"Params: {sum(p.numel() for p in block.parameters()):,}")
# With downsampling
block_down = BottleneckBlock(256, 128, stride=2)
y_down = block_down(x)
print(f"\nDownsampled: {y_down.shape}") # (1, 512, 28, 28)
Grad-CAM visualization
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image
import numpy as np
def grad_cam(model, input_tensor, target_class, target_layer):
"""
Compute Grad-CAM heatmap for a given class and layer.
This implements: L_c = ReLU(sum_k(alpha_k * A^k))
where alpha_k = (1/Z) * sum_ij (dY_c / dA^k_ij)
"""
activations = {}
gradients = {}
# Register hooks to capture activations and gradients
def forward_hook(module, input, output):
activations['value'] = output.detach()
def backward_hook(module, grad_input, grad_output):
gradients['value'] = grad_output[0].detach()
handle_fwd = target_layer.register_forward_hook(forward_hook)
handle_bwd = target_layer.register_full_backward_hook(backward_hook)
# Forward pass
output = model(input_tensor)
class_score = output[0, target_class]
# Backward pass
model.zero_grad()
class_score.backward()
# Compute Grad-CAM
A = activations['value'] # (1, C, H, W) - feature maps
dY_dA = gradients['value'] # (1, C, H, W) - gradients
# Global average pooling of gradients -> channel importance weights
alpha = dY_dA.mean(dim=(2, 3), keepdim=True) # (1, C, 1, 1)
# Weighted sum of feature maps
cam = (alpha * A).sum(dim=1, keepdim=True) # (1, 1, H, W)
cam = F.relu(cam) # ReLU: only positive contributions
# Normalize to [0, 1]
cam = cam - cam.min()
cam = cam / (cam.max() + 1e-8)
# Resize to input dimensions
cam = F.interpolate(cam, size=input_tensor.shape[2:],
mode='bilinear', align_corners=False)
handle_fwd.remove()
handle_bwd.remove()
return cam.squeeze().numpy()
# Usage
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)  # pretrained=True is deprecated
model.eval()
img = Image.open("cat.jpg")
tensor = transforms.Compose([
transforms.Resize(224),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize([0.485,0.456,0.406], [0.229,0.224,0.225])
])(img).unsqueeze(0)
heatmap = grad_cam(model, tensor, target_class=281, # tabby cat
target_layer=model.layer4[-1])
print(f"Heatmap shape: {heatmap.shape}") # (224, 224)
References
Seminal papers and key works referenced in this article.
- Krizhevsky et al. "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS, 2012.
- Simonyan & Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." ICLR, 2015.
- He et al. "Deep Residual Learning for Image Recognition." CVPR, 2016.
- Zeiler & Fergus. "Visualizing and Understanding Convolutional Networks." ECCV, 2014.
- Selvaraju et al. "Grad-CAM: Visual Explanations from Deep Networks." ICCV, 2017.
- Lin et al. "Feature Pyramid Networks for Object Detection." CVPR, 2017.
- Luo et al. "Understanding the Effective Receptive Field in Deep Convolutional Neural Networks." NeurIPS, 2016.
- Ioffe & Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML, 2015.
- Ba et al. "Layer Normalization." arXiv, 2016.
- Yamins et al. "Performance-optimized hierarchical models predict neural responses in higher visual cortex." PNAS, 2014.