Computer Vision Workbook — Engineermaxxing

Chapter 0: Convolution Math

You're debugging a CNN and the shapes don't match. The feature map coming out of layer 3 is 14×14 but the next layer expects 7×7. What happened? To debug CNN architectures, you need one formula burned into memory:

Convolution output size:
O = ⌊(W − K + 2P) / S⌋ + 1

W = input spatial dimension (width or height)
K = kernel size
P = padding (pixels added to each side)
S = stride

This formula governs every convolutional layer in every CNN ever built. A 3×3 kernel on a 5×5 input with no padding and stride 1: O = (5 − 3 + 0)/1 + 1 = 3. The output is 3×3.

The output size formula is the single most important equation in CNN engineering. Every shape mismatch, every padding choice, every downsampling decision comes back to ⌊(W − K + 2P)/S⌋ + 1. Memorize it.

How does convolution actually compute a single output pixel? The kernel slides across the input. At each position, we element-wise multiply the kernel with the overlapping input region, then sum all products. For a 3×3 kernel at position (0,0) of a 5×5 input:

output[0,0] = ∑_i=0..2 ∑_j=0..2 input[i,j] × kernel[i,j]

Exercise 0.1: Basic Output Size Derive

A 3×3 kernel is applied to a 32×32 input with stride=1, padding=0. What is the output spatial dimension (one side)?

pixels

Show derivation

O = ⌊(32 − 3 + 0) / 1⌋ + 1 = 29 + 1 = 30

Without padding, a 3×3 kernel shrinks each dimension by 2 pixels (K−1). This is why "valid" convolution produces smaller outputs.

Exercise 0.2: Stride and Padding Derive

A 3×3 kernel, stride=2, padding=1, applied to a 224×224 input. Output size?

pixels

Show derivation

O = ⌊(224 − 3 + 2×1) / 2⌋ + 1 = ⌊223 / 2⌋ + 1 = 111 + 1 = 112

Stride=2 with padding=1 is the standard "halving" recipe. 224→112→56→28→14→7 — this is the exact downsampling chain in ResNet. Each layer halves the spatial dimension.

Exercise 0.3: Trace One Output Pixel Trace

Given a 5×5 input and a 3×3 kernel (no padding, stride=1), compute output[1,1].

Input:
1 0 2 1 3
0 1 1 0 2
2 1 0 2 1
1 0 3 1 0
0 2 1 0 1

Kernel:
1 0 −1
0 1 0
−1 0 1

output[1,1] overlaps input rows 1-3, cols 1-3. Multiply element-wise and sum.

(integer)

Show derivation

Patch at (1,1) = input[1:4, 1:4]:

[[1, 1, 0], [1, 0, 2], [0, 3, 1]]

Kernel: [[1, 0, −1], [0, 1, 0], [−1, 0, 1]]

Products: 1(1) + 1(0) + 0(−1) + 1(0) + 0(1) + 2(0) + 0(−1) + 3(0) + 1(1)

= 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 = 2

This kernel detects diagonal edges — it has +1 on the main diagonal and −1 on the anti-diagonal. For this particular patch, most values align with the zero entries of the kernel, so only the two diagonal corners contribute. The result is 2.

Exercise 0.4: "Same" Padding Derive

"Same" padding means the output size equals the input size (when stride=1). For a 5×5 kernel with stride=1, how much padding P do you need on each side?

Set O = W and solve: W = (W − K + 2P)/1 + 1 ⇒ P = ?

pixels per side

Show derivation

W = (W − 5 + 2P)/1 + 1

W − 1 = W − 5 + 2P

2P = 4 ⇒ P = 2

The general rule: for any kernel K with stride=1, "same" padding requires P = (K−1)/2. For K=3: P=1. For K=5: P=2. For K=7: P=3. This is why odd kernel sizes are universal — they give integer padding for "same" convolutions.

Exercise 0.5: Conv Layer Parameters Derive

A conv layer: 3×3 kernel, 64 input channels, 128 output channels, with bias. How many learnable parameters?

Each output filter has shape [C_in, K, K] plus one bias.

parameters

Show derivation

Weights per filter = C_in × K × K = 64 × 3 × 3 = 536

Total weights = C_out × 536 = 128 × 536 = 73,728

Biases = C_out = 128

Total = 73,728 + 128 = 73,856

The parameter count formula for a conv layer is C_out × (C_in × K × K + 1). Notice it doesn't depend on the spatial size of the input — a 3×3 conv with 64→128 channels uses the same 73K parameters whether the input is 224×224 or 7×7. This is the fundamental advantage of convolutions over fully-connected layers.

Exercise 0.6: Implement conv2d() Build

Write a 2D convolution for a single-channel input. No padding, stride=1.

Return a 2D array. Use nested loops: for each output position, sum the element-wise product of the patch and kernel.

Show solution

javascript
function conv2d(input, kernel) {
  const H = input.length, W = input[0].length;
  const Kh = kernel.length, Kw = kernel[0].length;
  const Oh = H - Kh + 1, Ow = W - Kw + 1;
  const out = [];
  for (let i = 0; i < Oh; i++) {
    out[i] = [];
    for (let j = 0; j < Ow; j++) {
      let s = 0;
      for (let ki = 0; ki < Kh; ki++)
        for (let kj = 0; kj < Kw; kj++)
          s += input[i+ki][j+kj] * kernel[ki][kj];
      out[i][j] = s;
    }
  }
  return out;
}

Chapter 1: Receptive Field

A single pixel in a deep feature map "sees" a region of the original input. That region is the receptive field. Understanding it tells you whether your network can even detect objects at a given scale. If your receptive field is 32×32 pixels but the target object is 200×200, the network can only see fragments.

Receptive field after layer i:
rf_i = rf_i−1 + (K_i − 1) × ∏_j=1..i−1 S_j

rf₀ = 1 (a single input pixel)
K_i = kernel size at layer i
S_j = stride at layer j
The stride product multiplies all strides before the current layer.

Why the stride product matters. Each stride-2 layer doubles the "zoom level" of subsequent layers. A 3×3 kernel after one stride-2 layer covers (3−1)×2 = 4 extra input pixels. After two stride-2 layers, that same 3×3 kernel covers (3−1)×4 = 8 extra input pixels. Strides are multiplicative — they compound like interest.

Exercise 1.1: Three Stacked Convolutions Derive

Three consecutive 3×3 conv layers, all stride=1, no pooling. What is the receptive field?

Start with rf=1. Each 3×3 stride-1 layer adds (3−1)×1 = 2.

pixels × pixels

Show derivation

rf₀ = 1

rf₁ = 1 + (3−1)×1 = 3

rf₂ = 3 + (3−1)×1 = 5

rf₃ = 5 + (3−1)×1 = 7

Three 3×3 layers give a 7×7 receptive field — the same field of view as a single 7×7 kernel, but with fewer parameters (3×9 = 27 vs 49) and more nonlinearities. This is the key insight behind VGG: stack small kernels to get large receptive fields cheaply.

Exercise 1.2: With Stride-2 Layer Derive

Layer 1: 3×3, stride=2. Layer 2: 3×3, stride=1. Layer 3: 3×3, stride=1. What is the receptive field after layer 3?

pixels

Show derivation

Start: rf = 1, jump = 1

Layer 1 (3×3, s=2): rf = 1 + (3−1)×1 = 3, jump = 1×2 = 2

Layer 2 (3×3, s=1): rf = 3 + (3−1)×2 = 7, jump = 2×1 = 2

Layer 3 (3×3, s=1): rf = 7 + (3−1)×2 = 11, jump = 2

The stride-2 in layer 1 doubles the "jump" — every subsequent 3×3 kernel covers 2×(3−1) = 4 extra input pixels instead of 2. Compare to three stride-1 layers which give rf=7. One stride-2 layer early on grows the receptive field much faster: 11 vs 7 with the same number of layers.

Exercise 1.3: ResNet First 5 Layers Derive

ResNet-50 starts with: conv1 (7×7, stride=2), maxpool (3×3, stride=2), then three 3×3 stride-1 convs in the first residual block. What is the receptive field after these 5 layers?

Track the jump (cumulative stride product) at each layer. Layer i adds (K_i−1) × jump to the receptive field, then multiply jump by S_i.

pixels

Show derivation

Start: rf=1, jump=1

conv1 (7×7, s=2): rf = 1 + 6×1 = 7, jump = 1×2 = 2

maxpool (3×3, s=2): rf = 7 + 2×2 = 11, jump = 2×2 = 4

conv (3×3, s=1): rf = 11 + 2×4 = 19, jump = 4

conv (3×3, s=1): rf = 19 + 2×4 = 27, jump = 4

conv (3×3, s=1): rf = 27 + 2×4 = 35, jump = 4

After just the stem and first residual block, each feature map pixel already sees a 35×35 region of the original 224×224 input. By the end of ResNet-50, the receptive field exceeds the entire input image — meaning each final feature map pixel is theoretically influenced by all input pixels.

Exercise 1.4: Dilation Expands RF Trace

A 3×3 kernel with dilation=2 (atrous convolution) has an effective kernel size of K_eff = K + (K−1)×(d−1). What is K_eff for a 3×3 kernel with dilation=2?

4 5 6 7

Show derivation

K_eff = 3 + (3−1)×(2−1) = 3 + 2 = 5

Dilated convolutions skip every (d−1) pixels. A 3×3 kernel with dilation=2 samples a 5×5 region but only uses 9 weights. This is how DeepLab achieves large receptive fields in segmentation without increasing parameters or reducing resolution.

Exercise 1.5: Why Not One Big Kernel? Trace

Three 3×3 conv layers (64 channels each) give a 7×7 receptive field. A single 7×7 conv layer also gives 7×7. Which uses fewer parameters?

Three 3×3 layers: 3×(64×64×9) = 110,592 vs one 7×7: 64×64×49 = 200,704 One 7×7 layer: it's a single matrix They use the same number of parameters

Show derivation

3 × (64 × 64 × 3 × 3) = 3 × 36,864 = 110,592

1 × (64 × 64 × 7 × 7) = 200,704

Three stacked 3×3 layers use 45% fewer parameters than one 7×7 layer while achieving the same receptive field. Plus they add two extra ReLU nonlinearities, making the function more expressive. This is the VGGNet insight (2014) that killed large kernels.

Chapter 2: Pooling & Feature Maps

Pooling layers reduce spatial dimensions without learnable parameters. They summarize local regions — max pooling takes the strongest activation, average pooling takes the mean. The output size formula is the same as convolution: O = ⌊(W − K + 2P) / S⌋ + 1.

Pooling is convolution's non-learned cousin. Same sliding window, same output size formula, but no parameters to train. Max pooling selects the strongest feature; average pooling summarizes. Modern architectures often replace pooling with stride-2 convolutions, gaining learnability at the cost of extra parameters.

Exercise 2.1: Max Pool a Feature Map Trace

Apply 2×2 max pooling (stride=2) to this 4×4 feature map. What is the value at output position (0,1)?

Feature map:
1 3 2 4
5 6 7 8
3 2 1 0
1 0 3 2

(value)

Show derivation

Output (0,1) covers input rows 0-1, cols 2-3: [[2,4],[7,8]]

max(2, 4, 7, 8) = 8

The full 2×2 output is: [[6, 8], [3, 3]]. Max pooling with 2×2 stride 2 halves each dimension and keeps the strongest activation in each quadrant.

Exercise 2.2: Global Average Pooling Derive

The final conv layer of ResNet-50 outputs a 7×7×2048 feature map. Global Average Pooling (GAP) averages each channel over the entire spatial dimension. What is the GAP output shape?

7×7×1 1×1×2048 (a 2048-dim vector) 1×1×1 7×7×2048 (unchanged)

Show reasoning

GAP averages over the spatial dimensions (H and W) independently for each channel. Input [7, 7, 2048] → each of the 2048 channels gets averaged over 7×7=49 positions → output [1, 1, 2048]. This eliminates 7×7 − 1 = 48 spatial positions per channel, leaving a single number per channel.

Exercise 2.3: GAP vs Fully Connected Derive

Compare: (A) GAP then a 2048→1000 FC layer vs (B) Flatten 7×7×2048 then a 100,352→1000 FC layer. How many parameters does approach B use? How many does A use?

Just the FC layer parameters (weights + bias).

million params saved by GAP

Show derivation

Approach A (GAP + FC): 2048 × 1000 + 1000 = 2,049,000

Approach B (Flatten + FC): 100,352 × 1000 + 1000 = 100,353,000

Savings = 100,353,000 − 2,049,000 = 98,304,000 ≈ 98.3M

GAP saves ~98.3M parameters — a 49x reduction. This is why every modern classifier uses GAP: it collapses spatial information into a compact vector with zero learned parameters, then lets a tiny FC layer do the classification. Introduced by GoogLeNet (2014), GAP also acts as a regularizer since it forces the network to learn spatially-distributed features.

Exercise 2.4: Feature Map Memory Derive

A conv layer outputs 56×56×256 feature maps (float32). How much memory does this activation tensor consume per image?

Show derivation

Elements = 56 × 56 × 256 = 802,816

Bytes = 802,816 × 4 = 3,211,264 bytes

MB = 3,211,264 / (1024 × 1024) ≈ 3.06 MB

During training, every layer's activations must be stored for backpropagation. ResNet-50 with batch_size=32 stores ~3 GB of activations across all layers. This is why activation checkpointing (recomputing activations instead of storing them) is critical for training large CNNs on limited GPU memory.

Exercise 2.5: Depthwise Separable Conv Derive

A standard 3×3 conv: 256 input channels, 256 output channels. A depthwise separable conv splits this into: (1) depthwise 3×3 conv (one filter per channel, 256 filters) + (2) pointwise 1×1 conv (256→256). What is the parameter ratio: standard / depthwise_separable?

× (ratio)

Show derivation

Standard: 256 × 256 × 3 × 3 = 589,824

Depthwise: 256 × 1 × 3 × 3 = 2,304

Pointwise: 256 × 256 × 1 × 1 = 65,536

Separable total = 2,304 + 65,536 = 67,840

Ratio = 589,824 / 67,840 ≈ 8.7×

In general, the ratio is approximately K² for large channel counts (here 3² = 9, close to our 8.7). MobileNet uses depthwise separable convolutions throughout, achieving ~8-9x parameter reduction per layer with minimal accuracy loss. The general formula: standard/(depthwise+pointwise) = C_out×C_in×K² / (C_in×K² + C_in×C_out) ≈ K² when C is large.

Chapter 3: Object Detection: Anchors

How does a detection network know where to look? It doesn't search randomly. Instead, it tiles the image with thousands of predefined anchor boxes — rectangles of various sizes and aspect ratios — and predicts offsets from each anchor to refine the location. The anchor system converts "find any object anywhere" into "classify and refine each of these specific boxes."

Anchor generation (Faster R-CNN style):
Scales: {128, 256, 512} // 3 sizes
Aspect ratios: {1:2, 1:1, 2:1} // 3 shapes
Anchors per location = scales × ratios = 9
For a feature map of size H×W: total anchors = H × W × 9

Anchors are the detection network's hypothesis space. Each anchor says "there might be an object here, at roughly this scale and shape." The network's job is reduced to: for each anchor, (1) is there an object? (2) how much should I shift/resize this box? This is why anchor-based detectors are fast — they amortize the search problem into a fixed set of classifications.

To evaluate how well a predicted box matches the ground truth, we use Intersection over Union (IoU):

IoU(A, B) = Area(A ∩ B) / Area(A ∪ B) = Area(intersection) / (Area(A) + Area(B) − Area(intersection))

Exercise 3.1: Total Anchors Derive

A Faster R-CNN uses 3 scales × 3 aspect ratios = 9 anchors per location. The backbone outputs a 32×32 feature map. How many total anchor boxes?

anchors

Show derivation

Total = 32 × 32 × 9 = 9,216 anchors

And this is for a single feature map scale. FPN-based detectors like RetinaNet use 5 feature map levels (P3-P7), with sizes like 100×100, 50×50, 25×25, 13×13, 7×7. Total anchors across all levels: 9 × (10000+2500+625+169+49) = 9 × 13,343 = ~120K anchors per image. The vast majority are negative (no object) — this class imbalance motivated focal loss.

Exercise 3.2: IoU by Hand Derive

Compute IoU for boxes A=[0, 0, 10, 10] and B=[5, 5, 15, 15] (format: [x1, y1, x2, y2]).

IoU (0 to 1)

Show derivation

Intersection: x1=max(0,5)=5, y1=max(0,5)=5, x2=min(10,15)=10, y2=min(10,15)=10

Intersection area = (10−5)×(10−5) = 25

Area A = (10−0)×(10−0) = 100

Area B = (15−5)×(15−5) = 100

Union = 100 + 100 − 25 = 175

IoU = 25 / 175 = 1/7 ≈ 0.143

An IoU of 0.143 means only 14.3% overlap. In COCO evaluation, a detection is considered a true positive at IoU ≥ 0.5. The COCO primary metric (AP) averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05.

Exercise 3.3: Implement computeIoU() Build

Write a function that computes IoU between two boxes in [x1, y1, x2, y2] format.

Find intersection coordinates with max/min. Clamp intersection dimensions to 0. Compute areas. IoU = intersection / union.

Show solution

javascript
function computeIoU(a, b) {
  const ix1 = Math.max(a[0], b[0]);
  const iy1 = Math.max(a[1], b[1]);
  const ix2 = Math.min(a[2], b[2]);
  const iy2 = Math.min(a[3], b[3]);
  const iw = Math.max(0, ix2 - ix1);
  const ih = Math.max(0, iy2 - iy1);
  const inter = iw * ih;
  const areaA = (a[2]-a[0]) * (a[3]-a[1]);
  const areaB = (b[2]-b[0]) * (b[3]-b[1]);
  return inter / (areaA + areaB - inter);
}

Exercise 3.4: Anchor Matching Trace

In Faster R-CNN, an anchor is labeled positive if its IoU with any ground-truth box is ≥ 0.7, and negative if its max IoU with any ground-truth box is < 0.3. What happens to anchors with IoU between 0.3 and 0.7?

They are labeled as positive with reduced weight They are labeled as negative They are ignored during training (neither positive nor negative) They are used for regression only, not classification

Show explanation

Anchors in the "gray zone" (IoU 0.3-0.7) are ignored — they contribute no loss during training. This prevents ambiguous examples from corrupting the classifier. Of the ~9K anchors from a 32×32 feature map, typically only ~256 are sampled for training (128 positive + 128 negative), with the rest discarded.

Exercise 3.5: Multi-Scale Anchors Derive

An FPN-based detector uses 5 feature map levels with spatial sizes: P3=80×80, P4=40×40, P5=20×20, P6=10×10, P7=5×5. Each level has 9 anchors per location. Total anchors?

total anchors

Show derivation

P3: 80×80 = 6,400 locations

P4: 40×40 = 1,600

P5: 20×20 = 400

P6: 10×10 = 100

P7: 5×5 = 25

Total locations = 8,525

Total anchors = 8,525 × 9 = 76,725

~84% of anchors come from P3 alone (the finest level). This heavy skew toward small objects is intentional — small objects are harder to detect and need denser coverage. But it also means ~75K of these ~77K anchors are negative, which is the class imbalance problem that focal loss addresses.

Exercise 3.6: Find the Bug Debug

This IoU function returns values greater than 1 for some inputs. Click the buggy line.

function computeIoU(a, b) {
  const ix1 = Math.max(a[0], b[0]);
  const iy1 = Math.max(a[1], b[1]);
  const ix2 = Math.min(a[2], b[2]);
  const iy2 = Math.min(a[3], b[3]);
  const inter = (ix2 - ix1) * (iy2 - iy1);
  const areaA = (a[2]-a[0]) * (a[3]-a[1]);
  const areaB = (b[2]-b[0]) * (b[3]-b[1]);
  return inter / (areaA + areaB - inter);
}

Show explanation

Line 6 is the bug. When boxes don't overlap, ix2 - ix1 or iy2 - iy1 can be negative. Two negative numbers multiplied give a positive intersection area, which corrupts the IoU. The fix: clamp both dimensions to zero with Math.max(0, ix2 - ix1) and Math.max(0, iy2 - iy1). Without this clamp, non-overlapping boxes can produce IoU > 1 because the "intersection" is positive while the union is smaller than the sum of areas.

Chapter 4: Non-Maximum Suppression

A detector with 77K anchors might produce hundreds of detections for a single object. Non-Maximum Suppression (NMS) prunes these duplicates: keep the highest-confidence detection, remove all others that overlap it significantly, repeat.

NMS Algorithm:
1. Sort all detections by confidence score (descending)
2. Pick the top detection, add it to the output list
3. Compute IoU of this detection with all remaining detections
4. Remove any detection with IoU ≥ threshold (e.g. 0.5)
5. Repeat from step 2 until no detections remain

NMS is greedy and imperfect. It assumes that if two boxes overlap heavily, they must be detecting the same object. This fails for occluded objects standing side-by-side. Soft-NMS (reducing scores instead of removing) and end-to-end set prediction (DETR) are modern alternatives.

Exercise 4.1: Trace NMS Trace

5 detections with scores and boxes. IoU threshold = 0.5. Trace NMS and count how many survive.

Det A: score=0.9, box=[10,10,50,50]
Det B: score=0.85, box=[12,12,52,52]
Det C: score=0.7, box=[100,100,140,140]
Det D: score=0.6, box=[105,102,145,142]
Det E: score=0.3, box=[200,200,240,240]

A and B overlap heavily (IoU ≈ 0.68). C and D overlap heavily (IoU ≈ 0.65). E is isolated.

1 box survives (only A) 2 boxes survive (A and C) 3 boxes survive (A, C, and E) 5 boxes survive (no suppression)

Show trace

Round 1: Pick A (score=0.9). IoU(A,B)≈0.68 ≥ 0.5 → suppress B. IoU(A,C)≈0 < 0.5 → keep. IoU(A,D)≈0 < 0.5 → keep. IoU(A,E)≈0 < 0.5 → keep. Remaining: {C, D, E}.

Round 2: Pick C (score=0.7). IoU(C,D)≈0.65 ≥ 0.5 → suppress D. IoU(C,E)≈0 < 0.5 → keep. Remaining: {E}.

Round 3: Pick E (score=0.3). No remaining detections. Done.

Survivors: A, C, E — three distinct detections for three distinct regions.

Exercise 4.2: NMS Failure Case Trace

Two people standing side-by-side have bounding boxes with IoU = 0.55. Your NMS threshold is 0.5. What happens?

The lower-scoring person is suppressed — NMS deletes a true positive Both are kept because NMS handles this case Both are suppressed

Show explanation

Since IoU (0.55) ≥ threshold (0.5), the lower-scoring box is suppressed. This is a false negative caused by NMS. Solutions: (1) raise the threshold to 0.6 or 0.7, but this lets more duplicate detections through, (2) use Soft-NMS which decays the score instead of removing, or (3) use DETR-style set prediction which avoids NMS entirely by predicting a fixed set of unique objects.

Exercise 4.3: Implement nms() Build

Implement NMS. Input: array of detections [{box:[x1,y1,x2,y2], score:float}], IoU threshold. Return surviving detections in order.

Sort by score descending. Greedy loop: pick top, remove overlapping, repeat.

Show solution

javascript
function nms(detections, iouThreshold) {
  function iou(a, b) {
    const ix1=Math.max(a[0],b[0]), iy1=Math.max(a[1],b[1]);
    const ix2=Math.min(a[2],b[2]), iy2=Math.min(a[3],b[3]);
    const inter = Math.max(0,ix2-ix1)*Math.max(0,iy2-iy1);
    const aa=(a[2]-a[0])*(a[3]-a[1]);
    const ab=(b[2]-b[0])*(b[3]-b[1]);
    return inter/(aa+ab-inter);
  }
  const sorted = [...detections].sort((a,b) => b.score-a.score);
  const keep = [];
  const suppressed = new Set();
  for (let i=0; i<sorted.length; i++) {
    if (suppressed.has(i)) continue;
    keep.push(sorted[i]);
    for (let j=i+1; j<sorted.length; j++) {
      if (!suppressed.has(j) && iou(sorted[i].box,sorted[j].box) >= iouThreshold)
        suppressed.add(j);
    }
  }
  return keep;
}

Exercise 4.4: NMS Complexity Derive

NMS processes N detections. In the worst case (no suppression — every detection is isolated), how many IoU computations does the greedy algorithm perform?

IoU ops for N=100

Show derivation

Round 1: compare top with 99 remaining = 99 IoU ops

Round 2: compare with 98 remaining = 98 IoU ops

...

Total = 99 + 98 + ... + 1 = N(N−1)/2 = 100×99/2 = 4,950

O(N²) complexity. With N=77K anchors this would be ~3 billion IoU ops. In practice, a confidence threshold (e.g., score > 0.05) pre-filters to a few hundred detections before NMS, making it cheap. GPU-accelerated NMS (torchvision.ops.nms) handles the remaining cost.

Exercise 4.5: Arrange the Detection Pipeline Design

Put these detection post-processing steps in the correct order.

→

Decode box offsets to coordinates Apply confidence threshold Per-class NMS Keep top-K results Raw network output (logits + offsets)

Show correct order

1. Raw network output → 2. Decode box offsets (apply anchor transforms to get actual coordinates) → 3. Apply confidence threshold (remove low-score detections) → 4. Per-class NMS (remove duplicates within each class) → 5. Keep top-K (limit final output count)

Chapter 5: Image Classification Architectures

The core innovation of ResNet is embarrassingly simple: add the input of a block to its output. Instead of learning H(x), the block learns F(x) = H(x) − x, and the output is F(x) + x. This is the residual connection (or skip connection).

ResNet residual block:
output = F(x) + x

F(x) = W₂ · ReLU(W₁ · x) // learned residual
x // identity shortcut

Why gradients flow better. During backprop, ∂output/∂x = ∂F(x)/∂x + 1. That "+1" is the identity gradient from the skip connection. Even if the learned gradients ∂F/∂x vanish, the gradient never drops below 1 along the skip path. This is how 152-layer networks train successfully — the gradient has a "highway" through the skip connections.

Exercise 5.1: ResNet-50 Parameter Count Derive

ResNet-50 has 4 stages with bottleneck blocks. Each bottleneck has three convolutions: 1×1 (reduce), 3×3 (process), 1×1 (expand). Stage dimensions:

Stage 1: 3 blocks, channels 64→64→256
Stage 2: 4 blocks, channels 128→128→512
Stage 3: 6 blocks, channels 256→256→1024
Stage 4: 3 blocks, channels 512→512→2048
Plus: conv1 (7×7, 3→64) + FC (2048→1000)

Per bottleneck block: 1×1 conv (C_in→C_mid) + 3×3 conv (C_mid→C_mid) + 1×1 conv (C_mid→C_out). Compute total for Stage 3 (6 blocks, first block has C_in=512 from Stage 2).

million params (Stage 3 only)

Show derivation

Block 1 (512→256→256→1024): 512×256 + 256×256×9 + 256×1024 = 131K + 590K + 262K = 983K

+ 1×1 downsample shortcut: 512×1024 = 524K. Block 1 total ≈ 1,507K

Blocks 2-6 (1024→256→256→1024 each): 1024×256 + 256×256×9 + 256×1024 = 262K + 590K + 262K = 1,114K each

Stage 3 total = 1,507K + 5 × 1,114K = 1,507K + 5,530K = 7,077K ≈ 7.1M

ResNet-50 total is ~25.6M parameters. Stage 3 (1024-channel blocks) and Stage 4 (2048-channel blocks) dominate because parameter count scales with C². The full count: conv1 (9.4K) + Stage1 (0.2M) + Stage2 (1.2M) + Stage3 (7.1M) + Stage4 (15.0M) + FC (2.0M) ≈ 25.6M.

Exercise 5.2: FLOPs for One Conv Layer Derive

A 3×3 conv layer: 256 input channels, 256 output channels, applied to a 56×56 feature map. How many multiply-accumulate operations (MACs)?

Each output element requires C_in × K × K multiplications. Total MACs = output_elements × macs_per_element = (H_out × W_out × C_out) × (C_in × K × K).

GMACs (billions of MACs)

Show derivation

Output spatial size (same padding): 56 × 56

MACs per output element = 256 × 3 × 3 = 2,304

Total output elements = 56 × 56 × 256 = 802,816

Total MACs = 802,816 × 2,304 = 1,850,490,880 ≈ 1.85 GMACs

Note: FLOPs ≈ 2 × MACs (one multiply + one add per MAC). So this layer uses ~3.7 GFLOPs. The key insight: FLOPs scale with spatial area. The same conv layer at 28×28 uses 4x fewer FLOPs than at 56×56. This is why downsampling early is computationally efficient — most FLOPs are spent in the early, high-resolution layers.

Exercise 5.3: Skip Connection Gradient Trace

In a 100-layer network without skip connections, the gradient through the chain rule is a product of ~100 terms. If each term is 0.9, the gradient magnitude at the first layer is 0.9¹⁰⁰ ≈ ?

~0.000027 — essentially zero (vanishing gradient) ~0.9 — barely diminished ~0.5 — halved ~0.01 — small but trainable

Show computation

0.9¹⁰⁰ = e^100×ln(0.9) = e^{100×(−0.1054)} = e^−10.54 ≈ 0.0000265

Even with per-layer gradients of 0.9 (close to 1!), 100 layers of multiplication shrink the signal to near zero. This is the vanishing gradient problem. ResNet's skip connections bypass this chain by providing an additive gradient path: the gradient through the identity shortcut is always 1, regardless of depth.

Exercise 5.4: Find the Bug Debug

This ResNet block implementation crashes with a shape mismatch on the addition. Click the buggy line.

function residualBlock(x, W1, W2) {
  // x: [batch, 256, 56, 56]
  let out = conv2d(x, W1);     // W1: 3x3, 256→512
  out = relu(out);              // [batch, 512, 56, 56]
  out = conv2d(out, W2);        // W2: 3x3, 512→256
  return out + x;               // CRASH: [batch,256,56,56]
}

Show explanation

Line 3 is the bug. The first conv changes channels from 256 to 512, but the second conv brings it back to 256. While the final shape matches x, the issue is that W1 projects to 512 channels unnecessarily. In a standard ResNet basic block, both convolutions should maintain the same channel count (256→256, 256→256). If you need to change channels, you must add a 1×1 projection shortcut: return out + conv1x1(x, Ws). The code as written doesn't crash on shapes (both are [batch,256,56,56]), but the intermediate 512-channel expansion is wrong for a basic block. For a bottleneck block, the expansion goes 256→64→64→256 (reduce-process-expand), not 256→512→256.

Exercise 5.5: ResNet-50 Total FLOPs Derive

A rough estimate: ResNet-50 processes a 224×224 input. The spatial dimensions halve at each stage while channels double. Approximate total GMACs (use the rule that most FLOPs are in Stage 3 at 14×14 resolution).

~0.5 GMACs ~4 GMACs (the standard number) ~25 GMACs ~100 GMACs

Show reasoning

ResNet-50 uses ~4.1 GMACs (~8.2 GFLOPs) for a 224×224 input. The majority of compute is in Stage 2 (56×56, 128 channels) and Stage 3 (28×28, 256 channels). Despite Stage 4 having the most parameters (512 channels), it operates at 7×7 resolution, so its FLOP contribution is small. This is a general principle: parameters concentrate in deep layers, FLOPs concentrate in early layers.

Chapter 6: Segmentation

Classification gives one label per image. Semantic segmentation gives one label per pixel. The challenge: the encoder downsamples (224×224 → 7×7), and the decoder must upsample back to full resolution. This requires transposed convolution (sometimes called "deconvolution").

Transposed convolution output size:
O = (W − 1) × S − 2P + K

This is the inverse of the convolution formula.
When S=2, K=4, P=1: O = (W−1)×2 − 2 + 4 = 2W
// Exactly doubles the spatial dimension

U-Net's key insight: skip connections for segmentation. Downsampling loses spatial detail. Upsampling alone produces blurry boundaries. U-Net concatenates encoder features (which have fine spatial detail) with decoder features (which have semantic context) at each resolution level. This gives sharp, semantically meaningful per-pixel predictions.

Exercise 6.1: Transposed Conv Output Size Derive

Transposed convolution with K=4, S=2, P=1 applied to a 14×14 feature map. Output size?

pixels

Show derivation

O = (14 − 1) × 2 − 2×1 + 4 = 13×2 − 2 + 4 = 26 − 2 + 4 = 28

The combination K=4, S=2, P=1 is the standard "doubling" transposed convolution. It exactly doubles the spatial dimension: 7→14, 14→28, 28→56, 56→112. This is the mirror image of stride-2 convolution's halving.

Exercise 6.2: U-Net Feature Map Sizes Trace

A U-Net encoder: input 256×256, then 4 max-pool 2×2 (stride 2) operations. What are the feature map sizes at each level?

256 → 128 → 64 → 32 → 16 (halving each time) 256 → 254 → 252 → 250 → 248 (losing 2 per pool) 256 → 64 → 16 → 4 → 1 (quartering each time)

Show derivation

Each 2×2 maxpool with stride 2: O = W/2

256 → 128 → 64 → 32 → 16

The bottleneck is 16×16. The decoder path mirrors this: 16 → 32 → 64 → 128 → 256, with skip connections concatenating encoder features at each level. The channel counts typically double on the way down (64→128→256→512→1024) and halve on the way up.

Exercise 6.3: U-Net Skip Concatenation Derive

In the U-Net decoder, at 64×64 resolution: the upsampled decoder feature has 256 channels. The skip connection from the encoder also has 256 channels. After concatenation, a 3×3 conv reduces channels back to 256. How many parameters in this conv (with bias)?

parameters

Show derivation

Input channels after concat = 256 + 256 = 512

Conv: 3×3, 512→256

Weights = 512 × 256 × 3 × 3 = 1,179,648

Bias = 256

Total = 1,179,904

The skip concatenation doubles the input channels, which quadruples the conv parameters compared to a conv without skip (256×256×9 = 590K vs 512×256×9 = 1.18M). This is the parameter cost of skip connections, but the boundary quality improvement is dramatic.

Exercise 6.4: Segmentation Output Shape Trace

A semantic segmentation network for 21 classes (including background) takes a 512×512 input and produces per-pixel predictions. What is the output tensor shape?

[512, 512, 1] — one class index per pixel [512, 512, 21] — logits for each class at each pixel [21, 1] — one probability per class [64, 64, 21] — downsampled predictions

Show explanation

The output is [512, 512, 21] — a full-resolution prediction with 21 logits per pixel. After applying argmax along the channel dimension, you get a [512, 512] label map. The final layer is a 1×1 convolution that maps from the decoder's channel dimension to the number of classes. Loss: per-pixel cross-entropy between predicted logits and ground-truth labels.

Exercise 6.5: Bilinear Upsampling vs Transposed Conv Derive

To upsample 14×14×512 to 28×28×256: Option A uses transposed conv (K=4, S=2, P=1). Option B uses bilinear upsampling (2x) followed by 3×3 conv. How many parameters does each approach use?

million more params for transposed conv

Show derivation

Transposed conv (4×4, 512→256): 512 × 256 × 4 × 4 = 2,097,152

Bilinear up (0 params) + 3×3 conv (512→256): 512 × 256 × 3 × 3 = 1,179,648

Difference = 2,097,152 − 1,179,648 = 917,504 ≈ 0.92M

Bilinear + conv is cheaper and often preferred because transposed convolutions can produce "checkerboard" artifacts when K is not divisible by S. Modern segmentation networks (e.g., DeepLabV3+) typically use bilinear upsampling followed by convolution.

Chapter 7: Stereo Vision

Two cameras, separated by a known distance, observe the same scene. An object appears at different horizontal positions in the left and right images. The difference in position is disparity, and it encodes depth: closer objects have larger disparity, farther objects have smaller disparity.

The stereo depth equation:
d = f × B / Z

d = disparity (pixels)
f = focal length (pixels)
B = baseline (meters, distance between cameras)
Z = depth (meters, distance to object)

Rearranged: Z = f × B / d

Disparity is inversely proportional to depth. This means stereo depth estimation has high precision for nearby objects (large disparity, many pixels to work with) and poor precision for distant objects (small disparity, sub-pixel accuracy needed). A 1-pixel error in disparity at d=100 pixels means a 1% depth error, but at d=5 pixels it means a 20% depth error.

Exercise 7.1: Compute Disparity Derive

f = 500 pixels, B = 0.12 meters, Z = 5 meters. What is the disparity in pixels?

pixels

Show derivation

d = f × B / Z = 500 × 0.12 / 5 = 60 / 5 = 12 pixels

12 pixels of disparity is easy to detect reliably. For comparison, an object at 50 meters would have d = 500 × 0.12 / 50 = 1.2 pixels — much harder, requiring sub-pixel matching algorithms.

Exercise 7.2: Depth from Disparity Derive

Same camera setup (f=500px, B=0.12m). You measure a disparity of 8 pixels. What is the object's depth?

meters

Show derivation

Z = f × B / d = 500 × 0.12 / 8 = 60 / 8 = 7.5 meters

Exercise 7.3: Depth Error from Disparity Error Derive

With f=500, B=0.12, true depth Z=10m: if the disparity measurement has a ±1 pixel error, what is the depth error range?

First find true disparity d, then compute Z for d±1.

meters (total range: Z_max − Z_min)

Show derivation

True disparity: d = 500 × 0.12 / 10 = 6 pixels

d − 1 = 5: Z_max = 60/5 = 12.0 m

d + 1 = 7: Z_min = 60/7 = 8.53 m

Range = 12.0 − 8.53 = 3.43 m

A ±1 pixel error at 10m depth produces ~3.4m of uncertainty. Compare to an object at 2m (d=30 pixels): Z_min=60/31=1.94m, Z_max=60/29=2.07m, range=0.13m. Depth precision degrades quadratically with distance: ΔZ ≈ Z²/(fB) × Δd.

Exercise 7.4: Epipolar Constraint Trace

The epipolar constraint says: for a point x in the left image, its corresponding point x' in the right image lies on a line (the epipolar line). For a perfectly rectified stereo pair (horizontal baseline), what does this constraint simplify to?

x and x' have the same y-coordinate — search only horizontally x and x' have the same x-coordinate — search only vertically x' can be anywhere in the right image

Show explanation

For a rectified stereo pair, the epipolar lines are horizontal — every corresponding point has the same y-coordinate. This reduces stereo matching from a 2D search to a 1D search along the same scanline. Rectification is a preprocessing step that warps both images so that epipolar lines become horizontal, using the camera calibration parameters. Mathematically, the fundamental matrix F of a rectified pair has the form F = [0, 0, 0; 0, 0, -1; 0, 1, 0] (up to scale), enforcing y' = y.

Exercise 7.5: Disparity Search Range Derive

Your stereo camera (f=500, B=0.12) must detect objects from 1m to 30m. What range of disparities must you search?

pixels (d_max − d_min)

Show derivation

d_max (nearest, 1m) = 500 × 0.12 / 1 = 60 pixels

d_min (farthest, 30m) = 500 × 0.12 / 30 = 2 pixels

Range = 60 − 2 = 58 pixels

A search range of 58 pixels means for each pixel in the left image, you must compare 58 candidate positions in the right image. For a 640×480 image: 640 × 480 × 58 = ~17.8 million comparisons. Modern stereo networks (e.g., RAFT-Stereo) construct a 3D cost volume of this size and process it with 3D convolutions.

Chapter 9: Capstone: Detection System Design

You're designing a YOLOv8-style single-shot detector. This capstone ties together everything from this workbook: convolution sizing, receptive fields, feature pyramids, anchor math, NMS, and latency budgets. You need to know every number in the pipeline.

A real detection system is an end-to-end latency budget. Image preprocessing, backbone inference, neck (FPN), head prediction, post-processing (decode + NMS) — each stage has a compute cost, and the total must fit within your frame deadline. At 30 FPS, you have 33ms per frame. Every layer you add eats into that budget.

Exercise 9.1: Backbone FLOPs Derive

Your YOLOv8-Medium backbone is similar to CSPDarknet53-medium. Approximate it as: 5 stages of 3×3 convolutions with channels [48, 96, 192, 384, 536] at spatial sizes [320, 160, 80, 40, 20] (input 640×640). Each stage has ~6 conv layers. Estimate backbone GMACs.

Per conv: MACs ≈ H×W×C_out×C_in×K². Sum across all stages.

GMACs (approximate)

Show derivation

Stage 1 (320×320, 48ch, 6 layers): 6 × 320×320×48×48×9 ≈ 6 × 0.21G = 1.27G

Stage 2 (160×160, 96ch, 6 layers): 6 × 160×160×96×96×9 ≈ 6 × 0.21G = 1.27G

Stage 3 (80×80, 192ch, 6 layers): 6 × 80×80×192×192×9 ≈ 6 × 0.21G = 1.27G

Stage 4 (40×40, 384ch, 6 layers): 6 × 40×40×384×384×9 ≈ 6 × 0.21G = 1.27G

Stage 5 (20×20, 536ch, 6 layers): 6 × 20×20×536×536×9 ≈ 6 × 0.12G = 0.72G

Total ≈ 5.8 GMACs

The actual YOLOv8-M backbone uses ~8-12 GMACs depending on exact block structure (CSP bottlenecks have fewer MACs than plain convs). The key takeaway: most compute is in the early high-resolution stages. Stage 1 at 320×320 uses as many FLOPs as Stage 5 at 20×20 despite having 12x fewer channels, because spatial area dominates.

Exercise 9.2: Total Detection Anchors Derive

YOLOv8 is anchor-free but still predicts on 3 feature map scales: 80×80, 40×40, 20×20. It predicts 1 box per grid cell (no anchor presets). How many total predictions?

total predictions

Show derivation

80×80 = 6,400

40×40 = 1,600

20×20 = 400

Total = 6,400 + 1,600 + 400 = 8,400

Compare to anchor-based Faster R-CNN: 9 anchors per location would give 9 × 8,400 = 75,600 predictions. YOLOv8's anchor-free design reduces predictions by 9x, which dramatically speeds up post-processing (NMS). Each prediction includes: 4 box coordinates + 1 objectness + C class scores = 4 + 1 + 80 = 85 values for COCO.

Exercise 9.3: NMS Latency Derive

After confidence thresholding (score > 0.25), about 200 detections remain. NMS is O(N²). If each IoU computation takes 10 nanoseconds, how long does NMS take?

milliseconds

Show derivation

IoU computations = 200 × 199 / 2 = 19,900

Time = 19,900 × 10 ns = 199,000 ns = 0.199 ms

NMS is negligible at 0.2ms. The bottleneck in a detection pipeline is always the backbone forward pass (5-15ms on GPU), not post-processing. This is why optimizing the backbone (pruning, quantization, TensorRT) gives the biggest latency wins.

Exercise 9.4: GPU Inference Time Derive

Your YOLOv8-M has ~20 GMACs total (backbone + neck + head). An NVIDIA RTX 4090 delivers ~83 TFLOPS (FP16, with tensor cores). What is the theoretical minimum inference time?

Remember: 1 MAC ≈ 2 FLOPs. And real utilization is ~50% of peak due to memory bandwidth bottlenecks.

ms (with 50% utilization)

Show derivation

FLOPs = 2 × 20 GMACs = 40 GFLOPs

Peak throughput = 83 TFLOPS = 83,000 GFLOPs/s

Theoretical time = 40 / 83,000 = 0.00048 seconds = 0.48 ms

With 50% utilization: 0.48 / 0.5 = 0.96 ms

Under 1ms for the forward pass — easily supports real-time detection at 30+ FPS even with post-processing overhead. In practice, YOLOv8-M achieves ~2-3ms on RTX 4090 because of memory transfer latency, kernel launch overhead, and some layers being memory-bandwidth-bound rather than compute-bound. TensorRT optimization can bring this closer to the theoretical limit.

Exercise 9.5: End-to-End Latency Budget Design

Fill in a realistic latency budget for a 30 FPS detection pipeline on RTX 4090. Target: ≤ 33ms total.

→

Preprocess: resize + normalize (~1ms) Backbone + FPN forward pass (~3ms) Detection head predictions (~0.5ms) Decode + NMS post-processing (~0.5ms) Draw boxes + display (~1ms)

Show correct order

Preprocess (~1ms) → Backbone + FPN (~3ms) → Head predictions (~0.5ms) → Decode + NMS (~0.5ms) → Draw + display (~1ms). Total ≈ 6ms, well under 33ms. The backbone dominates at ~50% of total latency, confirming it's the optimization target. With pipelining (process frame N+1 while displaying frame N), you can overlap stages and further reduce effective latency.

Exercise 9.6: Quantization Savings Derive

Your YOLOv8-M model has 25M parameters in FP32. You quantize to INT8. How much model size do you save, and approximately how much faster is inference (assume memory-bandwidth-bound layers)?

MB saved

Show derivation

FP32 size: 25M × 4 bytes = 100 MB

INT8 size: 25M × 1 byte = 25 MB

Savings: 100 − 25 = 75 MB (4x smaller)

For memory-bandwidth-bound layers, throughput scales linearly with model size reduction: ~4x faster. For compute-bound layers, INT8 tensor cores give ~2x the throughput of FP16. In practice, INT8 YOLOv8 achieves ~2-3x end-to-end speedup with minimal accuracy loss (<1% mAP drop on COCO). TensorRT's post-training quantization handles the conversion automatically.

The proof of work. If you completed every exercise in this workbook — traced convolutions by hand, computed receptive fields, implemented IoU and NMS, estimated FLOPs and latency budgets — you can design, debug, and optimize a real detection pipeline. These are the exact calculations CV engineers do daily, from choosing the right backbone to meeting a latency SLA. "What I cannot create, I do not understand."

Related Lessons

Topic	Lesson
Vision-Language Models	VLM — From Absolute Zero
Diffusion Models	Diffusion — From Absolute Zero
NeRF & 3D Gaussian Splatting	NeRF & 3DGS — From Absolute Zero
Contrastive Learning (CLIP)	Contrastive Learning — From Absolute Zero
Visual Language Actions	VLA — From Absolute Zero

Computer Vision

Chapter 0: Convolution Math

Chapter 1: Receptive Field

Chapter 2: Pooling & Feature Maps

Chapter 3: Object Detection: Anchors

Chapter 4: Non-Maximum Suppression

Chapter 5: Image Classification Architectures

Chapter 6: Segmentation

Chapter 7: Stereo Vision

Chapter 8: Homography

Chapter 9: Capstone: Detection System Design

Related Lessons