Workbook — Computer Vision

Computer Vision

From convolution arithmetic to full detection pipelines. Every formula a CV engineer needs — output sizes, receptive fields, anchor math, IoU, NMS, disparity, homographies — all derivable by hand, all testable in-browser.

Prerequisites: Basic linear algebra (matrix multiply, dot products) + Arithmetic (floor division). That's it.
10
Chapters
53
Exercises
5
Exercise Types
Mastery
0 / 53 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Convolution Math

You're debugging a CNN and the shapes don't match. The feature map coming out of layer 3 is 14×14 but the next layer expects 7×7. What happened? To debug CNN architectures, you need one formula burned into memory:

Convolution output size:
O = ⌊(W − K + 2P) / S⌋ + 1

W = input spatial dimension (width or height)
K = kernel size
P = padding (pixels added to each side)
S = stride

This formula governs every convolutional layer in every CNN ever built. A 3×3 kernel on a 5×5 input with no padding and stride 1: O = (5 − 3 + 0)/1 + 1 = 3. The output is 3×3.

The output size formula is the single most important equation in CNN engineering. Every shape mismatch, every padding choice, every downsampling decision comes back to ⌊(W − K + 2P)/S⌋ + 1. Memorize it.

How does convolution actually compute a single output pixel? The kernel slides across the input. At each position, we element-wise multiply the kernel with the overlapping input region, then sum all products. For a 3×3 kernel at position (0,0) of a 5×5 input:

output[0,0] = ∑i=0..2j=0..2 input[i,j] × kernel[i,j]
Exercise 0.1: Basic Output Size Derive

A 3×3 kernel is applied to a 32×32 input with stride=1, padding=0. What is the output spatial dimension (one side)?

pixels
Show derivation
O = ⌊(32 − 3 + 0) / 1⌋ + 1 = 29 + 1 = 30

Without padding, a 3×3 kernel shrinks each dimension by 2 pixels (K−1). This is why "valid" convolution produces smaller outputs.

Exercise 0.2: Stride and Padding Derive

A 3×3 kernel, stride=2, padding=1, applied to a 224×224 input. Output size?

pixels
Show derivation
O = ⌊(224 − 3 + 2×1) / 2⌋ + 1 = ⌊223 / 2⌋ + 1 = 111 + 1 = 112

Stride=2 with padding=1 is the standard "halving" recipe. 224→112→56→28→14→7 — this is the exact downsampling chain in ResNet. Each layer halves the spatial dimension.

Exercise 0.3: Trace One Output Pixel Trace

Given a 5×5 input and a 3×3 kernel (no padding, stride=1), compute output[1,1].

Input:
1  0  2  1  3
0  1  1  0  2
2  1  0  2  1
1  0  3  1  0
0  2  1  0  1

Kernel:
1  0  −1
0  1   0
−1  0  1

output[1,1] overlaps input rows 1-3, cols 1-3. Multiply element-wise and sum.

(integer)
Show derivation
Patch at (1,1) = input[1:4, 1:4]:
[[1, 1, 0], [1, 0, 2], [0, 3, 1]]
Kernel: [[1, 0, −1], [0, 1, 0], [−1, 0, 1]]
Products: 1(1) + 1(0) + 0(−1) + 1(0) + 0(1) + 2(0) + 0(−1) + 3(0) + 1(1)
= 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 = 2

This kernel detects diagonal edges — it has +1 on the main diagonal and −1 on the anti-diagonal. For this particular patch, most values align with the zero entries of the kernel, so only the two diagonal corners contribute. The result is 2.

Exercise 0.4: "Same" Padding Derive

"Same" padding means the output size equals the input size (when stride=1). For a 5×5 kernel with stride=1, how much padding P do you need on each side?

Set O = W and solve: W = (W − K + 2P)/1 + 1 ⇒ P = ?

pixels per side
Show derivation
W = (W − 5 + 2P)/1 + 1
W − 1 = W − 5 + 2P
2P = 4 ⇒ P = 2

The general rule: for any kernel K with stride=1, "same" padding requires P = (K−1)/2. For K=3: P=1. For K=5: P=2. For K=7: P=3. This is why odd kernel sizes are universal — they give integer padding for "same" convolutions.

Exercise 0.5: Conv Layer Parameters Derive

A conv layer: 3×3 kernel, 64 input channels, 128 output channels, with bias. How many learnable parameters?

Each output filter has shape [Cin, K, K] plus one bias.

parameters
Show derivation
Weights per filter = Cin × K × K = 64 × 3 × 3 = 536
Total weights = Cout × 536 = 128 × 536 = 73,728
Biases = Cout = 128
Total = 73,728 + 128 = 73,856

The parameter count formula for a conv layer is Cout × (Cin × K × K + 1). Notice it doesn't depend on the spatial size of the input — a 3×3 conv with 64→128 channels uses the same 73K parameters whether the input is 224×224 or 7×7. This is the fundamental advantage of convolutions over fully-connected layers.

Exercise 0.6: Implement conv2d() Build

Write a 2D convolution for a single-channel input. No padding, stride=1.

Return a 2D array. Use nested loops: for each output position, sum the element-wise product of the patch and kernel.
Show solution
javascript
function conv2d(input, kernel) {
  const H = input.length, W = input[0].length;
  const Kh = kernel.length, Kw = kernel[0].length;
  const Oh = H - Kh + 1, Ow = W - Kw + 1;
  const out = [];
  for (let i = 0; i < Oh; i++) {
    out[i] = [];
    for (let j = 0; j < Ow; j++) {
      let s = 0;
      for (let ki = 0; ki < Kh; ki++)
        for (let kj = 0; kj < Kw; kj++)
          s += input[i+ki][j+kj] * kernel[ki][kj];
      out[i][j] = s;
    }
  }
  return out;
}

Chapter 1: Receptive Field

A single pixel in a deep feature map "sees" a region of the original input. That region is the receptive field. Understanding it tells you whether your network can even detect objects at a given scale. If your receptive field is 32×32 pixels but the target object is 200×200, the network can only see fragments.

Receptive field after layer i:
rfi = rfi−1 + (Ki − 1) × ∏j=1..i−1 Sj

rf0 = 1 (a single input pixel)
Ki = kernel size at layer i
Sj = stride at layer j
The stride product multiplies all strides before the current layer.
Why the stride product matters. Each stride-2 layer doubles the "zoom level" of subsequent layers. A 3×3 kernel after one stride-2 layer covers (3−1)×2 = 4 extra input pixels. After two stride-2 layers, that same 3×3 kernel covers (3−1)×4 = 8 extra input pixels. Strides are multiplicative — they compound like interest.
Exercise 1.1: Three Stacked Convolutions Derive

Three consecutive 3×3 conv layers, all stride=1, no pooling. What is the receptive field?

Start with rf=1. Each 3×3 stride-1 layer adds (3−1)×1 = 2.

pixels × pixels
Show derivation
rf0 = 1
rf1 = 1 + (3−1)×1 = 3
rf2 = 3 + (3−1)×1 = 5
rf3 = 5 + (3−1)×1 = 7

Three 3×3 layers give a 7×7 receptive field — the same field of view as a single 7×7 kernel, but with fewer parameters (3×9 = 27 vs 49) and more nonlinearities. This is the key insight behind VGG: stack small kernels to get large receptive fields cheaply.

Exercise 1.2: With Stride-2 Layer Derive

Layer 1: 3×3, stride=2. Layer 2: 3×3, stride=1. Layer 3: 3×3, stride=1. What is the receptive field after layer 3?

pixels
Show derivation
Start: rf = 1, jump = 1
Layer 1 (3×3, s=2): rf = 1 + (3−1)×1 = 3, jump = 1×2 = 2
Layer 2 (3×3, s=1): rf = 3 + (3−1)×2 = 7, jump = 2×1 = 2
Layer 3 (3×3, s=1): rf = 7 + (3−1)×2 = 11, jump = 2

The stride-2 in layer 1 doubles the "jump" — every subsequent 3×3 kernel covers 2×(3−1) = 4 extra input pixels instead of 2. Compare to three stride-1 layers which give rf=7. One stride-2 layer early on grows the receptive field much faster: 11 vs 7 with the same number of layers.

Exercise 1.3: ResNet First 5 Layers Derive

ResNet-50 starts with: conv1 (7×7, stride=2), maxpool (3×3, stride=2), then three 3×3 stride-1 convs in the first residual block. What is the receptive field after these 5 layers?

Track the jump (cumulative stride product) at each layer. Layer i adds (Ki−1) × jump to the receptive field, then multiply jump by Si.

pixels
Show derivation
Start: rf=1, jump=1
conv1 (7×7, s=2): rf = 1 + 6×1 = 7, jump = 1×2 = 2
maxpool (3×3, s=2): rf = 7 + 2×2 = 11, jump = 2×2 = 4
conv (3×3, s=1): rf = 11 + 2×4 = 19, jump = 4
conv (3×3, s=1): rf = 19 + 2×4 = 27, jump = 4
conv (3×3, s=1): rf = 27 + 2×4 = 35, jump = 4

After just the stem and first residual block, each feature map pixel already sees a 35×35 region of the original 224×224 input. By the end of ResNet-50, the receptive field exceeds the entire input image — meaning each final feature map pixel is theoretically influenced by all input pixels.

Exercise 1.4: Dilation Expands RF Trace
A 3×3 kernel with dilation=2 (atrous convolution) has an effective kernel size of Keff = K + (K−1)×(d−1). What is Keff for a 3×3 kernel with dilation=2?
Show derivation
Keff = 3 + (3−1)×(2−1) = 3 + 2 = 5

Dilated convolutions skip every (d−1) pixels. A 3×3 kernel with dilation=2 samples a 5×5 region but only uses 9 weights. This is how DeepLab achieves large receptive fields in segmentation without increasing parameters or reducing resolution.

Exercise 1.5: Why Not One Big Kernel? Trace
Three 3×3 conv layers (64 channels each) give a 7×7 receptive field. A single 7×7 conv layer also gives 7×7. Which uses fewer parameters?
Show derivation
3 × (64 × 64 × 3 × 3) = 3 × 36,864 = 110,592
1 × (64 × 64 × 7 × 7) = 200,704

Three stacked 3×3 layers use 45% fewer parameters than one 7×7 layer while achieving the same receptive field. Plus they add two extra ReLU nonlinearities, making the function more expressive. This is the VGGNet insight (2014) that killed large kernels.

Chapter 2: Pooling & Feature Maps

Pooling layers reduce spatial dimensions without learnable parameters. They summarize local regions — max pooling takes the strongest activation, average pooling takes the mean. The output size formula is the same as convolution: O = ⌊(W − K + 2P) / S⌋ + 1.

Pooling is convolution's non-learned cousin. Same sliding window, same output size formula, but no parameters to train. Max pooling selects the strongest feature; average pooling summarizes. Modern architectures often replace pooling with stride-2 convolutions, gaining learnability at the cost of extra parameters.
Exercise 2.1: Max Pool a Feature Map Trace

Apply 2×2 max pooling (stride=2) to this 4×4 feature map. What is the value at output position (0,1)?

Feature map:
1  3  2  4
5  6  7  8
3  2  1  0
1  0  3  2
(value)
Show derivation
Output (0,1) covers input rows 0-1, cols 2-3: [[2,4],[7,8]]
max(2, 4, 7, 8) = 8

The full 2×2 output is: [[6, 8], [3, 3]]. Max pooling with 2×2 stride 2 halves each dimension and keeps the strongest activation in each quadrant.

Exercise 2.2: Global Average Pooling Derive

The final conv layer of ResNet-50 outputs a 7×7×2048 feature map. Global Average Pooling (GAP) averages each channel over the entire spatial dimension. What is the GAP output shape?

Show reasoning

GAP averages over the spatial dimensions (H and W) independently for each channel. Input [7, 7, 2048] → each of the 2048 channels gets averaged over 7×7=49 positions → output [1, 1, 2048]. This eliminates 7×7 − 1 = 48 spatial positions per channel, leaving a single number per channel.

Exercise 2.3: GAP vs Fully Connected Derive

Compare: (A) GAP then a 2048→1000 FC layer vs (B) Flatten 7×7×2048 then a 100,352→1000 FC layer. How many parameters does approach B use? How many does A use?

Just the FC layer parameters (weights + bias).

million params saved by GAP
Show derivation
Approach A (GAP + FC): 2048 × 1000 + 1000 = 2,049,000
Approach B (Flatten + FC): 100,352 × 1000 + 1000 = 100,353,000
Savings = 100,353,000 − 2,049,000 = 98,304,000 ≈ 98.3M

GAP saves ~98.3M parameters — a 49x reduction. This is why every modern classifier uses GAP: it collapses spatial information into a compact vector with zero learned parameters, then lets a tiny FC layer do the classification. Introduced by GoogLeNet (2014), GAP also acts as a regularizer since it forces the network to learn spatially-distributed features.

Exercise 2.4: Feature Map Memory Derive

A conv layer outputs 56×56×256 feature maps (float32). How much memory does this activation tensor consume per image?

MB
Show derivation
Elements = 56 × 56 × 256 = 802,816
Bytes = 802,816 × 4 = 3,211,264 bytes
MB = 3,211,264 / (1024 × 1024) ≈ 3.06 MB

During training, every layer's activations must be stored for backpropagation. ResNet-50 with batch_size=32 stores ~3 GB of activations across all layers. This is why activation checkpointing (recomputing activations instead of storing them) is critical for training large CNNs on limited GPU memory.

Exercise 2.5: Depthwise Separable Conv Derive

A standard 3×3 conv: 256 input channels, 256 output channels. A depthwise separable conv splits this into: (1) depthwise 3×3 conv (one filter per channel, 256 filters) + (2) pointwise 1×1 conv (256→256). What is the parameter ratio: standard / depthwise_separable?

× (ratio)
Show derivation
Standard: 256 × 256 × 3 × 3 = 589,824
Depthwise: 256 × 1 × 3 × 3 = 2,304
Pointwise: 256 × 256 × 1 × 1 = 65,536
Separable total = 2,304 + 65,536 = 67,840
Ratio = 589,824 / 67,840 ≈ 8.7×

In general, the ratio is approximately K² for large channel counts (here 3² = 9, close to our 8.7). MobileNet uses depthwise separable convolutions throughout, achieving ~8-9x parameter reduction per layer with minimal accuracy loss. The general formula: standard/(depthwise+pointwise) = Cout×Cin×K² / (Cin×K² + Cin×Cout) ≈ K² when C is large.

Chapter 3: Object Detection: Anchors

How does a detection network know where to look? It doesn't search randomly. Instead, it tiles the image with thousands of predefined anchor boxes — rectangles of various sizes and aspect ratios — and predicts offsets from each anchor to refine the location. The anchor system converts "find any object anywhere" into "classify and refine each of these specific boxes."

Anchor generation (Faster R-CNN style):
Scales: {128, 256, 512}    // 3 sizes
Aspect ratios: {1:2, 1:1, 2:1}    // 3 shapes
Anchors per location = scales × ratios = 9
For a feature map of size H×W: total anchors = H × W × 9
Anchors are the detection network's hypothesis space. Each anchor says "there might be an object here, at roughly this scale and shape." The network's job is reduced to: for each anchor, (1) is there an object? (2) how much should I shift/resize this box? This is why anchor-based detectors are fast — they amortize the search problem into a fixed set of classifications.

To evaluate how well a predicted box matches the ground truth, we use Intersection over Union (IoU):

IoU(A, B) = Area(A ∩ B) / Area(A ∪ B) = Area(intersection) / (Area(A) + Area(B) − Area(intersection))
Exercise 3.1: Total Anchors Derive

A Faster R-CNN uses 3 scales × 3 aspect ratios = 9 anchors per location. The backbone outputs a 32×32 feature map. How many total anchor boxes?

anchors
Show derivation
Total = 32 × 32 × 9 = 9,216 anchors

And this is for a single feature map scale. FPN-based detectors like RetinaNet use 5 feature map levels (P3-P7), with sizes like 100×100, 50×50, 25×25, 13×13, 7×7. Total anchors across all levels: 9 × (10000+2500+625+169+49) = 9 × 13,343 = ~120K anchors per image. The vast majority are negative (no object) — this class imbalance motivated focal loss.

Exercise 3.2: IoU by Hand Derive

Compute IoU for boxes A=[0, 0, 10, 10] and B=[5, 5, 15, 15] (format: [x1, y1, x2, y2]).

IoU (0 to 1)
Show derivation
Intersection: x1=max(0,5)=5, y1=max(0,5)=5, x2=min(10,15)=10, y2=min(10,15)=10
Intersection area = (10−5)×(10−5) = 25
Area A = (10−0)×(10−0) = 100
Area B = (15−5)×(15−5) = 100
Union = 100 + 100 − 25 = 175
IoU = 25 / 175 = 1/7 ≈ 0.143

An IoU of 0.143 means only 14.3% overlap. In COCO evaluation, a detection is considered a true positive at IoU ≥ 0.5. The COCO primary metric (AP) averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05.

Exercise 3.3: Implement computeIoU() Build

Write a function that computes IoU between two boxes in [x1, y1, x2, y2] format.

Find intersection coordinates with max/min. Clamp intersection dimensions to 0. Compute areas. IoU = intersection / union.
Show solution
javascript
function computeIoU(a, b) {
  const ix1 = Math.max(a[0], b[0]);
  const iy1 = Math.max(a[1], b[1]);
  const ix2 = Math.min(a[2], b[2]);
  const iy2 = Math.min(a[3], b[3]);
  const iw = Math.max(0, ix2 - ix1);
  const ih = Math.max(0, iy2 - iy1);
  const inter = iw * ih;
  const areaA = (a[2]-a[0]) * (a[3]-a[1]);
  const areaB = (b[2]-b[0]) * (b[3]-b[1]);
  return inter / (areaA + areaB - inter);
}
Exercise 3.4: Anchor Matching Trace
In Faster R-CNN, an anchor is labeled positive if its IoU with any ground-truth box is ≥ 0.7, and negative if its max IoU with any ground-truth box is < 0.3. What happens to anchors with IoU between 0.3 and 0.7?
Show explanation

Anchors in the "gray zone" (IoU 0.3-0.7) are ignored — they contribute no loss during training. This prevents ambiguous examples from corrupting the classifier. Of the ~9K anchors from a 32×32 feature map, typically only ~256 are sampled for training (128 positive + 128 negative), with the rest discarded.

Exercise 3.5: Multi-Scale Anchors Derive

An FPN-based detector uses 5 feature map levels with spatial sizes: P3=80×80, P4=40×40, P5=20×20, P6=10×10, P7=5×5. Each level has 9 anchors per location. Total anchors?

total anchors
Show derivation
P3: 80×80 = 6,400 locations
P4: 40×40 = 1,600
P5: 20×20 = 400
P6: 10×10 = 100
P7: 5×5 = 25
Total locations = 8,525
Total anchors = 8,525 × 9 = 76,725

~84% of anchors come from P3 alone (the finest level). This heavy skew toward small objects is intentional — small objects are harder to detect and need denser coverage. But it also means ~75K of these ~77K anchors are negative, which is the class imbalance problem that focal loss addresses.

Exercise 3.6: Find the Bug Debug

This IoU function returns values greater than 1 for some inputs. Click the buggy line.

function computeIoU(a, b) {
  const ix1 = Math.max(a[0], b[0]);
  const iy1 = Math.max(a[1], b[1]);
  const ix2 = Math.min(a[2], b[2]);
  const iy2 = Math.min(a[3], b[3]);
  const inter = (ix2 - ix1) * (iy2 - iy1);
  const areaA = (a[2]-a[0]) * (a[3]-a[1]);
  const areaB = (b[2]-b[0]) * (b[3]-b[1]);
  return inter / (areaA + areaB - inter);
}
Show explanation

Line 6 is the bug. When boxes don't overlap, ix2 - ix1 or iy2 - iy1 can be negative. Two negative numbers multiplied give a positive intersection area, which corrupts the IoU. The fix: clamp both dimensions to zero with Math.max(0, ix2 - ix1) and Math.max(0, iy2 - iy1). Without this clamp, non-overlapping boxes can produce IoU > 1 because the "intersection" is positive while the union is smaller than the sum of areas.

Chapter 4: Non-Maximum Suppression

A detector with 77K anchors might produce hundreds of detections for a single object. Non-Maximum Suppression (NMS) prunes these duplicates: keep the highest-confidence detection, remove all others that overlap it significantly, repeat.

NMS Algorithm:
1. Sort all detections by confidence score (descending)
2. Pick the top detection, add it to the output list
3. Compute IoU of this detection with all remaining detections
4. Remove any detection with IoU ≥ threshold (e.g. 0.5)
5. Repeat from step 2 until no detections remain
NMS is greedy and imperfect. It assumes that if two boxes overlap heavily, they must be detecting the same object. This fails for occluded objects standing side-by-side. Soft-NMS (reducing scores instead of removing) and end-to-end set prediction (DETR) are modern alternatives.
Exercise 4.1: Trace NMS Trace

5 detections with scores and boxes. IoU threshold = 0.5. Trace NMS and count how many survive.

Det A: score=0.9, box=[10,10,50,50]
Det B: score=0.85, box=[12,12,52,52]
Det C: score=0.7, box=[100,100,140,140]
Det D: score=0.6, box=[105,102,145,142]
Det E: score=0.3, box=[200,200,240,240]

A and B overlap heavily (IoU ≈ 0.68). C and D overlap heavily (IoU ≈ 0.65). E is isolated.

Show trace

Round 1: Pick A (score=0.9). IoU(A,B)≈0.68 ≥ 0.5 → suppress B. IoU(A,C)≈0 < 0.5 → keep. IoU(A,D)≈0 < 0.5 → keep. IoU(A,E)≈0 < 0.5 → keep. Remaining: {C, D, E}.

Round 2: Pick C (score=0.7). IoU(C,D)≈0.65 ≥ 0.5 → suppress D. IoU(C,E)≈0 < 0.5 → keep. Remaining: {E}.

Round 3: Pick E (score=0.3). No remaining detections. Done.

Survivors: A, C, E — three distinct detections for three distinct regions.

Exercise 4.2: NMS Failure Case Trace
Two people standing side-by-side have bounding boxes with IoU = 0.55. Your NMS threshold is 0.5. What happens?
Show explanation

Since IoU (0.55) ≥ threshold (0.5), the lower-scoring box is suppressed. This is a false negative caused by NMS. Solutions: (1) raise the threshold to 0.6 or 0.7, but this lets more duplicate detections through, (2) use Soft-NMS which decays the score instead of removing, or (3) use DETR-style set prediction which avoids NMS entirely by predicting a fixed set of unique objects.

Exercise 4.3: Implement nms() Build

Implement NMS. Input: array of detections [{box:[x1,y1,x2,y2], score:float}], IoU threshold. Return surviving detections in order.

Sort by score descending. Greedy loop: pick top, remove overlapping, repeat.
Show solution
javascript
function nms(detections, iouThreshold) {
  function iou(a, b) {
    const ix1=Math.max(a[0],b[0]), iy1=Math.max(a[1],b[1]);
    const ix2=Math.min(a[2],b[2]), iy2=Math.min(a[3],b[3]);
    const inter = Math.max(0,ix2-ix1)*Math.max(0,iy2-iy1);
    const aa=(a[2]-a[0])*(a[3]-a[1]);
    const ab=(b[2]-b[0])*(b[3]-b[1]);
    return inter/(aa+ab-inter);
  }
  const sorted = [...detections].sort((a,b) => b.score-a.score);
  const keep = [];
  const suppressed = new Set();
  for (let i=0; i<sorted.length; i++) {
    if (suppressed.has(i)) continue;
    keep.push(sorted[i]);
    for (let j=i+1; j<sorted.length; j++) {
      if (!suppressed.has(j) && iou(sorted[i].box,sorted[j].box) >= iouThreshold)
        suppressed.add(j);
    }
  }
  return keep;
}
Exercise 4.4: NMS Complexity Derive

NMS processes N detections. In the worst case (no suppression — every detection is isolated), how many IoU computations does the greedy algorithm perform?

IoU ops for N=100
Show derivation
Round 1: compare top with 99 remaining = 99 IoU ops
Round 2: compare with 98 remaining = 98 IoU ops
...
Total = 99 + 98 + ... + 1 = N(N−1)/2 = 100×99/2 = 4,950

O(N²) complexity. With N=77K anchors this would be ~3 billion IoU ops. In practice, a confidence threshold (e.g., score > 0.05) pre-filters to a few hundred detections before NMS, making it cheap. GPU-accelerated NMS (torchvision.ops.nms) handles the remaining cost.

Exercise 4.5: Arrange the Detection Pipeline Design

Put these detection post-processing steps in the correct order.

?
?
?
?
?
Decode box offsets to coordinates Apply confidence threshold Per-class NMS Keep top-K results Raw network output (logits + offsets)
Show correct order

1. Raw network output2. Decode box offsets (apply anchor transforms to get actual coordinates) → 3. Apply confidence threshold (remove low-score detections) → 4. Per-class NMS (remove duplicates within each class) → 5. Keep top-K (limit final output count)

Chapter 5: Image Classification Architectures

The core innovation of ResNet is embarrassingly simple: add the input of a block to its output. Instead of learning H(x), the block learns F(x) = H(x) − x, and the output is F(x) + x. This is the residual connection (or skip connection).

ResNet residual block:
output = F(x) + x

F(x) = W2 · ReLU(W1 · x)    // learned residual
x                               // identity shortcut
Why gradients flow better. During backprop, ∂output/∂x = ∂F(x)/∂x + 1. That "+1" is the identity gradient from the skip connection. Even if the learned gradients ∂F/∂x vanish, the gradient never drops below 1 along the skip path. This is how 152-layer networks train successfully — the gradient has a "highway" through the skip connections.
Exercise 5.1: ResNet-50 Parameter Count Derive

ResNet-50 has 4 stages with bottleneck blocks. Each bottleneck has three convolutions: 1×1 (reduce), 3×3 (process), 1×1 (expand). Stage dimensions:

Stage 1: 3 blocks, channels 64→64→256
Stage 2: 4 blocks, channels 128→128→512
Stage 3: 6 blocks, channels 256→256→1024
Stage 4: 3 blocks, channels 512→512→2048
Plus: conv1 (7×7, 3→64) + FC (2048→1000)

Per bottleneck block: 1×1 conv (Cin→Cmid) + 3×3 conv (Cmid→Cmid) + 1×1 conv (Cmid→Cout). Compute total for Stage 3 (6 blocks, first block has Cin=512 from Stage 2).

million params (Stage 3 only)
Show derivation
Block 1 (512→256→256→1024): 512×256 + 256×256×9 + 256×1024 = 131K + 590K + 262K = 983K
+ 1×1 downsample shortcut: 512×1024 = 524K. Block 1 total ≈ 1,507K
Blocks 2-6 (1024→256→256→1024 each): 1024×256 + 256×256×9 + 256×1024 = 262K + 590K + 262K = 1,114K each
Stage 3 total = 1,507K + 5 × 1,114K = 1,507K + 5,530K = 7,077K ≈ 7.1M

ResNet-50 total is ~25.6M parameters. Stage 3 (1024-channel blocks) and Stage 4 (2048-channel blocks) dominate because parameter count scales with C². The full count: conv1 (9.4K) + Stage1 (0.2M) + Stage2 (1.2M) + Stage3 (7.1M) + Stage4 (15.0M) + FC (2.0M) ≈ 25.6M.

Exercise 5.2: FLOPs for One Conv Layer Derive

A 3×3 conv layer: 256 input channels, 256 output channels, applied to a 56×56 feature map. How many multiply-accumulate operations (MACs)?

Each output element requires Cin × K × K multiplications. Total MACs = output_elements × macs_per_element = (Hout × Wout × Cout) × (Cin × K × K).

GMACs (billions of MACs)
Show derivation
Output spatial size (same padding): 56 × 56
MACs per output element = 256 × 3 × 3 = 2,304
Total output elements = 56 × 56 × 256 = 802,816
Total MACs = 802,816 × 2,304 = 1,850,490,880 ≈ 1.85 GMACs

Note: FLOPs ≈ 2 × MACs (one multiply + one add per MAC). So this layer uses ~3.7 GFLOPs. The key insight: FLOPs scale with spatial area. The same conv layer at 28×28 uses 4x fewer FLOPs than at 56×56. This is why downsampling early is computationally efficient — most FLOPs are spent in the early, high-resolution layers.

Exercise 5.3: Skip Connection Gradient Trace
In a 100-layer network without skip connections, the gradient through the chain rule is a product of ~100 terms. If each term is 0.9, the gradient magnitude at the first layer is 0.9100 ≈ ?
Show computation
0.9100 = e100×ln(0.9) = e100×(−0.1054) = e−10.54 ≈ 0.0000265

Even with per-layer gradients of 0.9 (close to 1!), 100 layers of multiplication shrink the signal to near zero. This is the vanishing gradient problem. ResNet's skip connections bypass this chain by providing an additive gradient path: the gradient through the identity shortcut is always 1, regardless of depth.

Exercise 5.4: Find the Bug Debug

This ResNet block implementation crashes with a shape mismatch on the addition. Click the buggy line.

function residualBlock(x, W1, W2) {
  // x: [batch, 256, 56, 56]
  let out = conv2d(x, W1);     // W1: 3x3, 256→512
  out = relu(out);              // [batch, 512, 56, 56]
  out = conv2d(out, W2);        // W2: 3x3, 512→256
  return out + x;               // CRASH: [batch,256,56,56]
}
Show explanation

Line 3 is the bug. The first conv changes channels from 256 to 512, but the second conv brings it back to 256. While the final shape matches x, the issue is that W1 projects to 512 channels unnecessarily. In a standard ResNet basic block, both convolutions should maintain the same channel count (256→256, 256→256). If you need to change channels, you must add a 1×1 projection shortcut: return out + conv1x1(x, Ws). The code as written doesn't crash on shapes (both are [batch,256,56,56]), but the intermediate 512-channel expansion is wrong for a basic block. For a bottleneck block, the expansion goes 256→64→64→256 (reduce-process-expand), not 256→512→256.

Exercise 5.5: ResNet-50 Total FLOPs Derive

A rough estimate: ResNet-50 processes a 224×224 input. The spatial dimensions halve at each stage while channels double. Approximate total GMACs (use the rule that most FLOPs are in Stage 3 at 14×14 resolution).

Show reasoning

ResNet-50 uses ~4.1 GMACs (~8.2 GFLOPs) for a 224×224 input. The majority of compute is in Stage 2 (56×56, 128 channels) and Stage 3 (28×28, 256 channels). Despite Stage 4 having the most parameters (512 channels), it operates at 7×7 resolution, so its FLOP contribution is small. This is a general principle: parameters concentrate in deep layers, FLOPs concentrate in early layers.

Chapter 6: Segmentation

Classification gives one label per image. Semantic segmentation gives one label per pixel. The challenge: the encoder downsamples (224×224 → 7×7), and the decoder must upsample back to full resolution. This requires transposed convolution (sometimes called "deconvolution").

Transposed convolution output size:
O = (W − 1) × S − 2P + K

This is the inverse of the convolution formula.
When S=2, K=4, P=1: O = (W−1)×2 − 2 + 4 = 2W
// Exactly doubles the spatial dimension
U-Net's key insight: skip connections for segmentation. Downsampling loses spatial detail. Upsampling alone produces blurry boundaries. U-Net concatenates encoder features (which have fine spatial detail) with decoder features (which have semantic context) at each resolution level. This gives sharp, semantically meaningful per-pixel predictions.
Exercise 6.1: Transposed Conv Output Size Derive

Transposed convolution with K=4, S=2, P=1 applied to a 14×14 feature map. Output size?

pixels
Show derivation
O = (14 − 1) × 2 − 2×1 + 4 = 13×2 − 2 + 4 = 26 − 2 + 4 = 28

The combination K=4, S=2, P=1 is the standard "doubling" transposed convolution. It exactly doubles the spatial dimension: 7→14, 14→28, 28→56, 56→112. This is the mirror image of stride-2 convolution's halving.

Exercise 6.2: U-Net Feature Map Sizes Trace

A U-Net encoder: input 256×256, then 4 max-pool 2×2 (stride 2) operations. What are the feature map sizes at each level?

Show derivation
Each 2×2 maxpool with stride 2: O = W/2
256 → 128 → 64 → 32 → 16

The bottleneck is 16×16. The decoder path mirrors this: 16 → 32 → 64 → 128 → 256, with skip connections concatenating encoder features at each level. The channel counts typically double on the way down (64→128→256→512→1024) and halve on the way up.

Exercise 6.3: U-Net Skip Concatenation Derive

In the U-Net decoder, at 64×64 resolution: the upsampled decoder feature has 256 channels. The skip connection from the encoder also has 256 channels. After concatenation, a 3×3 conv reduces channels back to 256. How many parameters in this conv (with bias)?

parameters
Show derivation
Input channels after concat = 256 + 256 = 512
Conv: 3×3, 512→256
Weights = 512 × 256 × 3 × 3 = 1,179,648
Bias = 256
Total = 1,179,904

The skip concatenation doubles the input channels, which quadruples the conv parameters compared to a conv without skip (256×256×9 = 590K vs 512×256×9 = 1.18M). This is the parameter cost of skip connections, but the boundary quality improvement is dramatic.

Exercise 6.4: Segmentation Output Shape Trace
A semantic segmentation network for 21 classes (including background) takes a 512×512 input and produces per-pixel predictions. What is the output tensor shape?
Show explanation

The output is [512, 512, 21] — a full-resolution prediction with 21 logits per pixel. After applying argmax along the channel dimension, you get a [512, 512] label map. The final layer is a 1×1 convolution that maps from the decoder's channel dimension to the number of classes. Loss: per-pixel cross-entropy between predicted logits and ground-truth labels.

Exercise 6.5: Bilinear Upsampling vs Transposed Conv Derive

To upsample 14×14×512 to 28×28×256: Option A uses transposed conv (K=4, S=2, P=1). Option B uses bilinear upsampling (2x) followed by 3×3 conv. How many parameters does each approach use?

million more params for transposed conv
Show derivation
Transposed conv (4×4, 512→256): 512 × 256 × 4 × 4 = 2,097,152
Bilinear up (0 params) + 3×3 conv (512→256): 512 × 256 × 3 × 3 = 1,179,648
Difference = 2,097,152 − 1,179,648 = 917,504 ≈ 0.92M

Bilinear + conv is cheaper and often preferred because transposed convolutions can produce "checkerboard" artifacts when K is not divisible by S. Modern segmentation networks (e.g., DeepLabV3+) typically use bilinear upsampling followed by convolution.

Chapter 7: Stereo Vision

Two cameras, separated by a known distance, observe the same scene. An object appears at different horizontal positions in the left and right images. The difference in position is disparity, and it encodes depth: closer objects have larger disparity, farther objects have smaller disparity.

The stereo depth equation:
d = f × B / Z

d = disparity (pixels)
f = focal length (pixels)
B = baseline (meters, distance between cameras)
Z = depth (meters, distance to object)

Rearranged: Z = f × B / d
Disparity is inversely proportional to depth. This means stereo depth estimation has high precision for nearby objects (large disparity, many pixels to work with) and poor precision for distant objects (small disparity, sub-pixel accuracy needed). A 1-pixel error in disparity at d=100 pixels means a 1% depth error, but at d=5 pixels it means a 20% depth error.
Exercise 7.1: Compute Disparity Derive

f = 500 pixels, B = 0.12 meters, Z = 5 meters. What is the disparity in pixels?

pixels
Show derivation
d = f × B / Z = 500 × 0.12 / 5 = 60 / 5 = 12 pixels

12 pixels of disparity is easy to detect reliably. For comparison, an object at 50 meters would have d = 500 × 0.12 / 50 = 1.2 pixels — much harder, requiring sub-pixel matching algorithms.

Exercise 7.2: Depth from Disparity Derive

Same camera setup (f=500px, B=0.12m). You measure a disparity of 8 pixels. What is the object's depth?

meters
Show derivation
Z = f × B / d = 500 × 0.12 / 8 = 60 / 8 = 7.5 meters
Exercise 7.3: Depth Error from Disparity Error Derive

With f=500, B=0.12, true depth Z=10m: if the disparity measurement has a ±1 pixel error, what is the depth error range?

First find true disparity d, then compute Z for d±1.

meters (total range: Zmax − Zmin)
Show derivation
True disparity: d = 500 × 0.12 / 10 = 6 pixels
d − 1 = 5: Zmax = 60/5 = 12.0 m
d + 1 = 7: Zmin = 60/7 = 8.53 m
Range = 12.0 − 8.53 = 3.43 m

A ±1 pixel error at 10m depth produces ~3.4m of uncertainty. Compare to an object at 2m (d=30 pixels): Zmin=60/31=1.94m, Zmax=60/29=2.07m, range=0.13m. Depth precision degrades quadratically with distance: ΔZ ≈ Z²/(fB) × Δd.

Exercise 7.4: Epipolar Constraint Trace
The epipolar constraint says: for a point x in the left image, its corresponding point x' in the right image lies on a line (the epipolar line). For a perfectly rectified stereo pair (horizontal baseline), what does this constraint simplify to?
Show explanation

For a rectified stereo pair, the epipolar lines are horizontal — every corresponding point has the same y-coordinate. This reduces stereo matching from a 2D search to a 1D search along the same scanline. Rectification is a preprocessing step that warps both images so that epipolar lines become horizontal, using the camera calibration parameters. Mathematically, the fundamental matrix F of a rectified pair has the form F = [0, 0, 0; 0, 0, -1; 0, 1, 0] (up to scale), enforcing y' = y.

Exercise 7.5: Disparity Search Range Derive

Your stereo camera (f=500, B=0.12) must detect objects from 1m to 30m. What range of disparities must you search?

pixels (dmax − dmin)
Show derivation
dmax (nearest, 1m) = 500 × 0.12 / 1 = 60 pixels
dmin (farthest, 30m) = 500 × 0.12 / 30 = 2 pixels
Range = 60 − 2 = 58 pixels

A search range of 58 pixels means for each pixel in the left image, you must compare 58 candidate positions in the right image. For a 640×480 image: 640 × 480 × 58 = ~17.8 million comparisons. Modern stereo networks (e.g., RAFT-Stereo) construct a 3D cost volume of this size and process it with 3D convolutions.

Chapter 8: Homography

A homography is a 3×3 matrix H that maps points from one plane to another. If you photograph a flat surface (a poster, a building facade, a tabletop) from two different viewpoints, the relationship between the two images is a homography. It encodes rotation, translation, scaling, and perspective distortion.

Homography transformation:
[x', y', w']T = H × [x, y, 1]T

H = [[h11, h12, h13],
     [h21, h22, h23],
     [h31, h32, h33]]

Output point: (x'/w', y'/w') // divide by w' to get Cartesian coords
H has 8 degrees of freedom (9 entries, but scale is arbitrary)
4 point correspondences determine a homography. Each point gives 2 equations (one for x', one for y'). With 8 unknowns, you need 4 points × 2 equations = 8 equations. In practice, you use more points + RANSAC to handle noisy matches.
Exercise 8.1: Apply a Homography Trace

Given H and point [1, 0, 1]T, compute the transformed point.

H = [[2, 0, 1],
     [0, 1, 0],
     [0, 0, 1]]

Multiply H × [1, 0, 1]T, then divide by the third component.

Show derivation
H × [1, 0, 1]T = [2×1 + 0×0 + 1×1, 0×1 + 1×0 + 0×1, 0×1 + 0×0 + 1×1] = [3, 0, 1]
Cartesian: (3/1, 0/1) = (3, 0)

This H applies 2x scaling in x (h11=2) plus translation by 1 in x (h13=1). Point (1,0) becomes (2×1+1, 0) = (3, 0). The h31 and h32 entries (both 0 here) create perspective distortion — when they're nonzero, the denominator w' varies across the image, creating foreshortening.

Exercise 8.2: Degrees of Freedom Trace
A homography has 8 DoF (9 entries minus scale). How many equations does each point correspondence provide, and how many point pairs are needed for a minimal solution?
Show explanation

Each point correspondence (x,y) → (x',y') provides 2 equations (one constraining x', one constraining y'). With 8 unknowns, the minimal solution requires 8/2 = 4 point pairs. This is the Direct Linear Transform (DLT) algorithm. In practice, RANSAC picks random 4-point subsets, solves H, counts inliers, and repeats to find the best H robust to outlier matches.

Exercise 8.3: RANSAC Iterations Derive

You have 100 feature matches, 40% are outliers (60% inlier ratio). RANSAC samples 4 points per iteration. What probability that at least one iteration samples 4 inliers after N iterations?

P(all 4 inliers in one draw) = 0.64. P(success in N draws) = 1 − (1 − 0.64)N. How many iterations for 99% confidence?

iterations (for 99% confidence)
Show derivation
P(4 inliers) = 0.64 = 0.1296
P(failure in one draw) = 1 − 0.1296 = 0.8704
Want: 1 − 0.8704N ≥ 0.99
0.8704N ≤ 0.01
N ≥ ln(0.01) / ln(0.8704) = −4.605 / −0.1389 = 33.2
N = 34 iterations (round up)

Only 34 iterations needed even with 40% outliers. This is why RANSAC is so practical — even heavily contaminated matches can be handled quickly. With 80% outliers, the same calculation gives N = ln(0.01)/ln(1−0.24) = ~2871 iterations, which is still fast for modern hardware.

Exercise 8.4: Arrange the Stitching Pipeline Design

Put these image stitching steps in the correct order.

?
?
?
?
?
Detect keypoints (SIFT/ORB) Match descriptors between images RANSAC to estimate homography H Warp one image using H Blend warped images together
Show correct order

1. Detect keypoints (SIFT, ORB, or SuperPoint) → 2. Match descriptors (nearest-neighbor in descriptor space) → 3. RANSAC + homography estimation (robust fitting from matches) → 4. Warp image (apply H to align to reference frame) → 5. Blend (feather/multiband blend for seamless compositing)

Exercise 8.5: When Homography Fails Trace
A homography maps one plane to another. In which scenario does a single homography not correctly model the image transformation?
Show explanation

A homography maps a single plane. When the scene contains objects at multiple depths and the camera translates (not just rotates), nearby and distant objects shift by different amounts (parallax). No single homography can capture this — you'd need one homography per depth plane, or a full 3D model. This is why panorama stitching works best with pure rotation (all points are "at infinity" relative to the rotation center) or for flat scenes.

Chapter 9: Capstone: Detection System Design

You're designing a YOLOv8-style single-shot detector. This capstone ties together everything from this workbook: convolution sizing, receptive fields, feature pyramids, anchor math, NMS, and latency budgets. You need to know every number in the pipeline.

A real detection system is an end-to-end latency budget. Image preprocessing, backbone inference, neck (FPN), head prediction, post-processing (decode + NMS) — each stage has a compute cost, and the total must fit within your frame deadline. At 30 FPS, you have 33ms per frame. Every layer you add eats into that budget.
Exercise 9.1: Backbone FLOPs Derive

Your YOLOv8-Medium backbone is similar to CSPDarknet53-medium. Approximate it as: 5 stages of 3×3 convolutions with channels [48, 96, 192, 384, 536] at spatial sizes [320, 160, 80, 40, 20] (input 640×640). Each stage has ~6 conv layers. Estimate backbone GMACs.

Per conv: MACs ≈ H×W×Cout×Cin×K². Sum across all stages.

GMACs (approximate)
Show derivation
Stage 1 (320×320, 48ch, 6 layers): 6 × 320×320×48×48×9 ≈ 6 × 0.21G = 1.27G
Stage 2 (160×160, 96ch, 6 layers): 6 × 160×160×96×96×9 ≈ 6 × 0.21G = 1.27G
Stage 3 (80×80, 192ch, 6 layers): 6 × 80×80×192×192×9 ≈ 6 × 0.21G = 1.27G
Stage 4 (40×40, 384ch, 6 layers): 6 × 40×40×384×384×9 ≈ 6 × 0.21G = 1.27G
Stage 5 (20×20, 536ch, 6 layers): 6 × 20×20×536×536×9 ≈ 6 × 0.12G = 0.72G
Total ≈ 5.8 GMACs

The actual YOLOv8-M backbone uses ~8-12 GMACs depending on exact block structure (CSP bottlenecks have fewer MACs than plain convs). The key takeaway: most compute is in the early high-resolution stages. Stage 1 at 320×320 uses as many FLOPs as Stage 5 at 20×20 despite having 12x fewer channels, because spatial area dominates.

Exercise 9.2: Total Detection Anchors Derive

YOLOv8 is anchor-free but still predicts on 3 feature map scales: 80×80, 40×40, 20×20. It predicts 1 box per grid cell (no anchor presets). How many total predictions?

total predictions
Show derivation
80×80 = 6,400
40×40 = 1,600
20×20 = 400
Total = 6,400 + 1,600 + 400 = 8,400

Compare to anchor-based Faster R-CNN: 9 anchors per location would give 9 × 8,400 = 75,600 predictions. YOLOv8's anchor-free design reduces predictions by 9x, which dramatically speeds up post-processing (NMS). Each prediction includes: 4 box coordinates + 1 objectness + C class scores = 4 + 1 + 80 = 85 values for COCO.

Exercise 9.3: NMS Latency Derive

After confidence thresholding (score > 0.25), about 200 detections remain. NMS is O(N²). If each IoU computation takes 10 nanoseconds, how long does NMS take?

milliseconds
Show derivation
IoU computations = 200 × 199 / 2 = 19,900
Time = 19,900 × 10 ns = 199,000 ns = 0.199 ms

NMS is negligible at 0.2ms. The bottleneck in a detection pipeline is always the backbone forward pass (5-15ms on GPU), not post-processing. This is why optimizing the backbone (pruning, quantization, TensorRT) gives the biggest latency wins.

Exercise 9.4: GPU Inference Time Derive

Your YOLOv8-M has ~20 GMACs total (backbone + neck + head). An NVIDIA RTX 4090 delivers ~83 TFLOPS (FP16, with tensor cores). What is the theoretical minimum inference time?

Remember: 1 MAC ≈ 2 FLOPs. And real utilization is ~50% of peak due to memory bandwidth bottlenecks.

ms (with 50% utilization)
Show derivation
FLOPs = 2 × 20 GMACs = 40 GFLOPs
Peak throughput = 83 TFLOPS = 83,000 GFLOPs/s
Theoretical time = 40 / 83,000 = 0.00048 seconds = 0.48 ms
With 50% utilization: 0.48 / 0.5 = 0.96 ms

Under 1ms for the forward pass — easily supports real-time detection at 30+ FPS even with post-processing overhead. In practice, YOLOv8-M achieves ~2-3ms on RTX 4090 because of memory transfer latency, kernel launch overhead, and some layers being memory-bandwidth-bound rather than compute-bound. TensorRT optimization can bring this closer to the theoretical limit.

Exercise 9.5: End-to-End Latency Budget Design

Fill in a realistic latency budget for a 30 FPS detection pipeline on RTX 4090. Target: ≤ 33ms total.

?
?
?
?
?
Preprocess: resize + normalize (~1ms) Backbone + FPN forward pass (~3ms) Detection head predictions (~0.5ms) Decode + NMS post-processing (~0.5ms) Draw boxes + display (~1ms)
Show correct order

Preprocess (~1ms) → Backbone + FPN (~3ms) → Head predictions (~0.5ms) → Decode + NMS (~0.5ms) → Draw + display (~1ms). Total ≈ 6ms, well under 33ms. The backbone dominates at ~50% of total latency, confirming it's the optimization target. With pipelining (process frame N+1 while displaying frame N), you can overlap stages and further reduce effective latency.

Exercise 9.6: Quantization Savings Derive

Your YOLOv8-M model has 25M parameters in FP32. You quantize to INT8. How much model size do you save, and approximately how much faster is inference (assume memory-bandwidth-bound layers)?

MB saved
Show derivation
FP32 size: 25M × 4 bytes = 100 MB
INT8 size: 25M × 1 byte = 25 MB
Savings: 100 − 25 = 75 MB (4x smaller)

For memory-bandwidth-bound layers, throughput scales linearly with model size reduction: ~4x faster. For compute-bound layers, INT8 tensor cores give ~2x the throughput of FP16. In practice, INT8 YOLOv8 achieves ~2-3x end-to-end speedup with minimal accuracy loss (<1% mAP drop on COCO). TensorRT's post-training quantization handles the conversion automatically.

The proof of work. If you completed every exercise in this workbook — traced convolutions by hand, computed receptive fields, implemented IoU and NMS, estimated FLOPs and latency budgets — you can design, debug, and optimize a real detection pipeline. These are the exact calculations CV engineers do daily, from choosing the right backbone to meeting a latency SLA. "What I cannot create, I do not understand."

Related Lessons

TopicLesson
Vision-Language ModelsVLM — From Absolute Zero
Diffusion ModelsDiffusion — From Absolute Zero
NeRF & 3D Gaussian SplattingNeRF & 3DGS — From Absolute Zero
Contrastive Learning (CLIP)Contrastive Learning — From Absolute Zero
Visual Language ActionsVLA — From Absolute Zero