From convolution arithmetic to full detection pipelines. Every formula a CV engineer needs — output sizes, receptive fields, anchor math, IoU, NMS, disparity, homographies — all derivable by hand, all testable in-browser.
You're debugging a CNN and the shapes don't match. The feature map coming out of layer 3 is 14×14 but the next layer expects 7×7. What happened? To debug CNN architectures, you need one formula burned into memory:
This formula governs every convolutional layer in every CNN ever built. A 3×3 kernel on a 5×5 input with no padding and stride 1: O = (5 − 3 + 0)/1 + 1 = 3. The output is 3×3.
How does convolution actually compute a single output pixel? The kernel slides across the input. At each position, we element-wise multiply the kernel with the overlapping input region, then sum all products. For a 3×3 kernel at position (0,0) of a 5×5 input:
A 3×3 kernel is applied to a 32×32 input with stride=1, padding=0. What is the output spatial dimension (one side)?
Without padding, a 3×3 kernel shrinks each dimension by 2 pixels (K−1). This is why "valid" convolution produces smaller outputs.
A 3×3 kernel, stride=2, padding=1, applied to a 224×224 input. Output size?
Stride=2 with padding=1 is the standard "halving" recipe. 224→112→56→28→14→7 — this is the exact downsampling chain in ResNet. Each layer halves the spatial dimension.
Given a 5×5 input and a 3×3 kernel (no padding, stride=1), compute output[1,1].
output[1,1] overlaps input rows 1-3, cols 1-3. Multiply element-wise and sum.
This kernel detects diagonal edges — it has +1 on the main diagonal and −1 on the anti-diagonal. For this particular patch, most values align with the zero entries of the kernel, so only the two diagonal corners contribute. The result is 2.
"Same" padding means the output size equals the input size (when stride=1). For a 5×5 kernel with stride=1, how much padding P do you need on each side?
Set O = W and solve: W = (W − K + 2P)/1 + 1 ⇒ P = ?
The general rule: for any kernel K with stride=1, "same" padding requires P = (K−1)/2. For K=3: P=1. For K=5: P=2. For K=7: P=3. This is why odd kernel sizes are universal — they give integer padding for "same" convolutions.
A conv layer: 3×3 kernel, 64 input channels, 128 output channels, with bias. How many learnable parameters?
Each output filter has shape [Cin, K, K] plus one bias.
The parameter count formula for a conv layer is Cout × (Cin × K × K + 1). Notice it doesn't depend on the spatial size of the input — a 3×3 conv with 64→128 channels uses the same 73K parameters whether the input is 224×224 or 7×7. This is the fundamental advantage of convolutions over fully-connected layers.
Write a 2D convolution for a single-channel input. No padding, stride=1.
javascript function conv2d(input, kernel) { const H = input.length, W = input[0].length; const Kh = kernel.length, Kw = kernel[0].length; const Oh = H - Kh + 1, Ow = W - Kw + 1; const out = []; for (let i = 0; i < Oh; i++) { out[i] = []; for (let j = 0; j < Ow; j++) { let s = 0; for (let ki = 0; ki < Kh; ki++) for (let kj = 0; kj < Kw; kj++) s += input[i+ki][j+kj] * kernel[ki][kj]; out[i][j] = s; } } return out; }
A single pixel in a deep feature map "sees" a region of the original input. That region is the receptive field. Understanding it tells you whether your network can even detect objects at a given scale. If your receptive field is 32×32 pixels but the target object is 200×200, the network can only see fragments.
Three consecutive 3×3 conv layers, all stride=1, no pooling. What is the receptive field?
Start with rf=1. Each 3×3 stride-1 layer adds (3−1)×1 = 2.
Three 3×3 layers give a 7×7 receptive field — the same field of view as a single 7×7 kernel, but with fewer parameters (3×9 = 27 vs 49) and more nonlinearities. This is the key insight behind VGG: stack small kernels to get large receptive fields cheaply.
Layer 1: 3×3, stride=2. Layer 2: 3×3, stride=1. Layer 3: 3×3, stride=1. What is the receptive field after layer 3?
The stride-2 in layer 1 doubles the "jump" — every subsequent 3×3 kernel covers 2×(3−1) = 4 extra input pixels instead of 2. Compare to three stride-1 layers which give rf=7. One stride-2 layer early on grows the receptive field much faster: 11 vs 7 with the same number of layers.
ResNet-50 starts with: conv1 (7×7, stride=2), maxpool (3×3, stride=2), then three 3×3 stride-1 convs in the first residual block. What is the receptive field after these 5 layers?
Track the jump (cumulative stride product) at each layer. Layer i adds (Ki−1) × jump to the receptive field, then multiply jump by Si.
After just the stem and first residual block, each feature map pixel already sees a 35×35 region of the original 224×224 input. By the end of ResNet-50, the receptive field exceeds the entire input image — meaning each final feature map pixel is theoretically influenced by all input pixels.
Dilated convolutions skip every (d−1) pixels. A 3×3 kernel with dilation=2 samples a 5×5 region but only uses 9 weights. This is how DeepLab achieves large receptive fields in segmentation without increasing parameters or reducing resolution.
Three stacked 3×3 layers use 45% fewer parameters than one 7×7 layer while achieving the same receptive field. Plus they add two extra ReLU nonlinearities, making the function more expressive. This is the VGGNet insight (2014) that killed large kernels.
Pooling layers reduce spatial dimensions without learnable parameters. They summarize local regions — max pooling takes the strongest activation, average pooling takes the mean. The output size formula is the same as convolution: O = ⌊(W − K + 2P) / S⌋ + 1.
Apply 2×2 max pooling (stride=2) to this 4×4 feature map. What is the value at output position (0,1)?
The full 2×2 output is: [[6, 8], [3, 3]]. Max pooling with 2×2 stride 2 halves each dimension and keeps the strongest activation in each quadrant.
The final conv layer of ResNet-50 outputs a 7×7×2048 feature map. Global Average Pooling (GAP) averages each channel over the entire spatial dimension. What is the GAP output shape?
GAP averages over the spatial dimensions (H and W) independently for each channel. Input [7, 7, 2048] → each of the 2048 channels gets averaged over 7×7=49 positions → output [1, 1, 2048]. This eliminates 7×7 − 1 = 48 spatial positions per channel, leaving a single number per channel.
Compare: (A) GAP then a 2048→1000 FC layer vs (B) Flatten 7×7×2048 then a 100,352→1000 FC layer. How many parameters does approach B use? How many does A use?
Just the FC layer parameters (weights + bias).
GAP saves ~98.3M parameters — a 49x reduction. This is why every modern classifier uses GAP: it collapses spatial information into a compact vector with zero learned parameters, then lets a tiny FC layer do the classification. Introduced by GoogLeNet (2014), GAP also acts as a regularizer since it forces the network to learn spatially-distributed features.
A conv layer outputs 56×56×256 feature maps (float32). How much memory does this activation tensor consume per image?
During training, every layer's activations must be stored for backpropagation. ResNet-50 with batch_size=32 stores ~3 GB of activations across all layers. This is why activation checkpointing (recomputing activations instead of storing them) is critical for training large CNNs on limited GPU memory.
A standard 3×3 conv: 256 input channels, 256 output channels. A depthwise separable conv splits this into: (1) depthwise 3×3 conv (one filter per channel, 256 filters) + (2) pointwise 1×1 conv (256→256). What is the parameter ratio: standard / depthwise_separable?
In general, the ratio is approximately K² for large channel counts (here 3² = 9, close to our 8.7). MobileNet uses depthwise separable convolutions throughout, achieving ~8-9x parameter reduction per layer with minimal accuracy loss. The general formula: standard/(depthwise+pointwise) = Cout×Cin×K² / (Cin×K² + Cin×Cout) ≈ K² when C is large.
How does a detection network know where to look? It doesn't search randomly. Instead, it tiles the image with thousands of predefined anchor boxes — rectangles of various sizes and aspect ratios — and predicts offsets from each anchor to refine the location. The anchor system converts "find any object anywhere" into "classify and refine each of these specific boxes."
To evaluate how well a predicted box matches the ground truth, we use Intersection over Union (IoU):
A Faster R-CNN uses 3 scales × 3 aspect ratios = 9 anchors per location. The backbone outputs a 32×32 feature map. How many total anchor boxes?
And this is for a single feature map scale. FPN-based detectors like RetinaNet use 5 feature map levels (P3-P7), with sizes like 100×100, 50×50, 25×25, 13×13, 7×7. Total anchors across all levels: 9 × (10000+2500+625+169+49) = 9 × 13,343 = ~120K anchors per image. The vast majority are negative (no object) — this class imbalance motivated focal loss.
Compute IoU for boxes A=[0, 0, 10, 10] and B=[5, 5, 15, 15] (format: [x1, y1, x2, y2]).
An IoU of 0.143 means only 14.3% overlap. In COCO evaluation, a detection is considered a true positive at IoU ≥ 0.5. The COCO primary metric (AP) averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
Write a function that computes IoU between two boxes in [x1, y1, x2, y2] format.
javascript function computeIoU(a, b) { const ix1 = Math.max(a[0], b[0]); const iy1 = Math.max(a[1], b[1]); const ix2 = Math.min(a[2], b[2]); const iy2 = Math.min(a[3], b[3]); const iw = Math.max(0, ix2 - ix1); const ih = Math.max(0, iy2 - iy1); const inter = iw * ih; const areaA = (a[2]-a[0]) * (a[3]-a[1]); const areaB = (b[2]-b[0]) * (b[3]-b[1]); return inter / (areaA + areaB - inter); }
Anchors in the "gray zone" (IoU 0.3-0.7) are ignored — they contribute no loss during training. This prevents ambiguous examples from corrupting the classifier. Of the ~9K anchors from a 32×32 feature map, typically only ~256 are sampled for training (128 positive + 128 negative), with the rest discarded.
An FPN-based detector uses 5 feature map levels with spatial sizes: P3=80×80, P4=40×40, P5=20×20, P6=10×10, P7=5×5. Each level has 9 anchors per location. Total anchors?
~84% of anchors come from P3 alone (the finest level). This heavy skew toward small objects is intentional — small objects are harder to detect and need denser coverage. But it also means ~75K of these ~77K anchors are negative, which is the class imbalance problem that focal loss addresses.
This IoU function returns values greater than 1 for some inputs. Click the buggy line.
function computeIoU(a, b) { const ix1 = Math.max(a[0], b[0]); const iy1 = Math.max(a[1], b[1]); const ix2 = Math.min(a[2], b[2]); const iy2 = Math.min(a[3], b[3]); const inter = (ix2 - ix1) * (iy2 - iy1); const areaA = (a[2]-a[0]) * (a[3]-a[1]); const areaB = (b[2]-b[0]) * (b[3]-b[1]); return inter / (areaA + areaB - inter); }
Line 6 is the bug. When boxes don't overlap, ix2 - ix1 or iy2 - iy1 can be negative. Two negative numbers multiplied give a positive intersection area, which corrupts the IoU. The fix: clamp both dimensions to zero with Math.max(0, ix2 - ix1) and Math.max(0, iy2 - iy1). Without this clamp, non-overlapping boxes can produce IoU > 1 because the "intersection" is positive while the union is smaller than the sum of areas.
A detector with 77K anchors might produce hundreds of detections for a single object. Non-Maximum Suppression (NMS) prunes these duplicates: keep the highest-confidence detection, remove all others that overlap it significantly, repeat.
5 detections with scores and boxes. IoU threshold = 0.5. Trace NMS and count how many survive.
A and B overlap heavily (IoU ≈ 0.68). C and D overlap heavily (IoU ≈ 0.65). E is isolated.
Round 1: Pick A (score=0.9). IoU(A,B)≈0.68 ≥ 0.5 → suppress B. IoU(A,C)≈0 < 0.5 → keep. IoU(A,D)≈0 < 0.5 → keep. IoU(A,E)≈0 < 0.5 → keep. Remaining: {C, D, E}.
Round 2: Pick C (score=0.7). IoU(C,D)≈0.65 ≥ 0.5 → suppress D. IoU(C,E)≈0 < 0.5 → keep. Remaining: {E}.
Round 3: Pick E (score=0.3). No remaining detections. Done.
Survivors: A, C, E — three distinct detections for three distinct regions.
Since IoU (0.55) ≥ threshold (0.5), the lower-scoring box is suppressed. This is a false negative caused by NMS. Solutions: (1) raise the threshold to 0.6 or 0.7, but this lets more duplicate detections through, (2) use Soft-NMS which decays the score instead of removing, or (3) use DETR-style set prediction which avoids NMS entirely by predicting a fixed set of unique objects.
Implement NMS. Input: array of detections [{box:[x1,y1,x2,y2], score:float}], IoU threshold. Return surviving detections in order.
javascript function nms(detections, iouThreshold) { function iou(a, b) { const ix1=Math.max(a[0],b[0]), iy1=Math.max(a[1],b[1]); const ix2=Math.min(a[2],b[2]), iy2=Math.min(a[3],b[3]); const inter = Math.max(0,ix2-ix1)*Math.max(0,iy2-iy1); const aa=(a[2]-a[0])*(a[3]-a[1]); const ab=(b[2]-b[0])*(b[3]-b[1]); return inter/(aa+ab-inter); } const sorted = [...detections].sort((a,b) => b.score-a.score); const keep = []; const suppressed = new Set(); for (let i=0; i<sorted.length; i++) { if (suppressed.has(i)) continue; keep.push(sorted[i]); for (let j=i+1; j<sorted.length; j++) { if (!suppressed.has(j) && iou(sorted[i].box,sorted[j].box) >= iouThreshold) suppressed.add(j); } } return keep; }
NMS processes N detections. In the worst case (no suppression — every detection is isolated), how many IoU computations does the greedy algorithm perform?
O(N²) complexity. With N=77K anchors this would be ~3 billion IoU ops. In practice, a confidence threshold (e.g., score > 0.05) pre-filters to a few hundred detections before NMS, making it cheap. GPU-accelerated NMS (torchvision.ops.nms) handles the remaining cost.
Put these detection post-processing steps in the correct order.
1. Raw network output → 2. Decode box offsets (apply anchor transforms to get actual coordinates) → 3. Apply confidence threshold (remove low-score detections) → 4. Per-class NMS (remove duplicates within each class) → 5. Keep top-K (limit final output count)
The core innovation of ResNet is embarrassingly simple: add the input of a block to its output. Instead of learning H(x), the block learns F(x) = H(x) − x, and the output is F(x) + x. This is the residual connection (or skip connection).
ResNet-50 has 4 stages with bottleneck blocks. Each bottleneck has three convolutions: 1×1 (reduce), 3×3 (process), 1×1 (expand). Stage dimensions:
Per bottleneck block: 1×1 conv (Cin→Cmid) + 3×3 conv (Cmid→Cmid) + 1×1 conv (Cmid→Cout). Compute total for Stage 3 (6 blocks, first block has Cin=512 from Stage 2).
ResNet-50 total is ~25.6M parameters. Stage 3 (1024-channel blocks) and Stage 4 (2048-channel blocks) dominate because parameter count scales with C². The full count: conv1 (9.4K) + Stage1 (0.2M) + Stage2 (1.2M) + Stage3 (7.1M) + Stage4 (15.0M) + FC (2.0M) ≈ 25.6M.
A 3×3 conv layer: 256 input channels, 256 output channels, applied to a 56×56 feature map. How many multiply-accumulate operations (MACs)?
Each output element requires Cin × K × K multiplications. Total MACs = output_elements × macs_per_element = (Hout × Wout × Cout) × (Cin × K × K).
Note: FLOPs ≈ 2 × MACs (one multiply + one add per MAC). So this layer uses ~3.7 GFLOPs. The key insight: FLOPs scale with spatial area. The same conv layer at 28×28 uses 4x fewer FLOPs than at 56×56. This is why downsampling early is computationally efficient — most FLOPs are spent in the early, high-resolution layers.
Even with per-layer gradients of 0.9 (close to 1!), 100 layers of multiplication shrink the signal to near zero. This is the vanishing gradient problem. ResNet's skip connections bypass this chain by providing an additive gradient path: the gradient through the identity shortcut is always 1, regardless of depth.
This ResNet block implementation crashes with a shape mismatch on the addition. Click the buggy line.
function residualBlock(x, W1, W2) { // x: [batch, 256, 56, 56] let out = conv2d(x, W1); // W1: 3x3, 256→512 out = relu(out); // [batch, 512, 56, 56] out = conv2d(out, W2); // W2: 3x3, 512→256 return out + x; // CRASH: [batch,256,56,56] }
Line 3 is the bug. The first conv changes channels from 256 to 512, but the second conv brings it back to 256. While the final shape matches x, the issue is that W1 projects to 512 channels unnecessarily. In a standard ResNet basic block, both convolutions should maintain the same channel count (256→256, 256→256). If you need to change channels, you must add a 1×1 projection shortcut: return out + conv1x1(x, Ws). The code as written doesn't crash on shapes (both are [batch,256,56,56]), but the intermediate 512-channel expansion is wrong for a basic block. For a bottleneck block, the expansion goes 256→64→64→256 (reduce-process-expand), not 256→512→256.
A rough estimate: ResNet-50 processes a 224×224 input. The spatial dimensions halve at each stage while channels double. Approximate total GMACs (use the rule that most FLOPs are in Stage 3 at 14×14 resolution).
ResNet-50 uses ~4.1 GMACs (~8.2 GFLOPs) for a 224×224 input. The majority of compute is in Stage 2 (56×56, 128 channels) and Stage 3 (28×28, 256 channels). Despite Stage 4 having the most parameters (512 channels), it operates at 7×7 resolution, so its FLOP contribution is small. This is a general principle: parameters concentrate in deep layers, FLOPs concentrate in early layers.
Classification gives one label per image. Semantic segmentation gives one label per pixel. The challenge: the encoder downsamples (224×224 → 7×7), and the decoder must upsample back to full resolution. This requires transposed convolution (sometimes called "deconvolution").
Transposed convolution with K=4, S=2, P=1 applied to a 14×14 feature map. Output size?
The combination K=4, S=2, P=1 is the standard "doubling" transposed convolution. It exactly doubles the spatial dimension: 7→14, 14→28, 28→56, 56→112. This is the mirror image of stride-2 convolution's halving.
A U-Net encoder: input 256×256, then 4 max-pool 2×2 (stride 2) operations. What are the feature map sizes at each level?
The bottleneck is 16×16. The decoder path mirrors this: 16 → 32 → 64 → 128 → 256, with skip connections concatenating encoder features at each level. The channel counts typically double on the way down (64→128→256→512→1024) and halve on the way up.
In the U-Net decoder, at 64×64 resolution: the upsampled decoder feature has 256 channels. The skip connection from the encoder also has 256 channels. After concatenation, a 3×3 conv reduces channels back to 256. How many parameters in this conv (with bias)?
The skip concatenation doubles the input channels, which quadruples the conv parameters compared to a conv without skip (256×256×9 = 590K vs 512×256×9 = 1.18M). This is the parameter cost of skip connections, but the boundary quality improvement is dramatic.
The output is [512, 512, 21] — a full-resolution prediction with 21 logits per pixel. After applying argmax along the channel dimension, you get a [512, 512] label map. The final layer is a 1×1 convolution that maps from the decoder's channel dimension to the number of classes. Loss: per-pixel cross-entropy between predicted logits and ground-truth labels.
To upsample 14×14×512 to 28×28×256: Option A uses transposed conv (K=4, S=2, P=1). Option B uses bilinear upsampling (2x) followed by 3×3 conv. How many parameters does each approach use?
Bilinear + conv is cheaper and often preferred because transposed convolutions can produce "checkerboard" artifacts when K is not divisible by S. Modern segmentation networks (e.g., DeepLabV3+) typically use bilinear upsampling followed by convolution.
Two cameras, separated by a known distance, observe the same scene. An object appears at different horizontal positions in the left and right images. The difference in position is disparity, and it encodes depth: closer objects have larger disparity, farther objects have smaller disparity.
f = 500 pixels, B = 0.12 meters, Z = 5 meters. What is the disparity in pixels?
12 pixels of disparity is easy to detect reliably. For comparison, an object at 50 meters would have d = 500 × 0.12 / 50 = 1.2 pixels — much harder, requiring sub-pixel matching algorithms.
Same camera setup (f=500px, B=0.12m). You measure a disparity of 8 pixels. What is the object's depth?
With f=500, B=0.12, true depth Z=10m: if the disparity measurement has a ±1 pixel error, what is the depth error range?
First find true disparity d, then compute Z for d±1.
A ±1 pixel error at 10m depth produces ~3.4m of uncertainty. Compare to an object at 2m (d=30 pixels): Zmin=60/31=1.94m, Zmax=60/29=2.07m, range=0.13m. Depth precision degrades quadratically with distance: ΔZ ≈ Z²/(fB) × Δd.
For a rectified stereo pair, the epipolar lines are horizontal — every corresponding point has the same y-coordinate. This reduces stereo matching from a 2D search to a 1D search along the same scanline. Rectification is a preprocessing step that warps both images so that epipolar lines become horizontal, using the camera calibration parameters. Mathematically, the fundamental matrix F of a rectified pair has the form F = [0, 0, 0; 0, 0, -1; 0, 1, 0] (up to scale), enforcing y' = y.
Your stereo camera (f=500, B=0.12) must detect objects from 1m to 30m. What range of disparities must you search?
A search range of 58 pixels means for each pixel in the left image, you must compare 58 candidate positions in the right image. For a 640×480 image: 640 × 480 × 58 = ~17.8 million comparisons. Modern stereo networks (e.g., RAFT-Stereo) construct a 3D cost volume of this size and process it with 3D convolutions.
A homography is a 3×3 matrix H that maps points from one plane to another. If you photograph a flat surface (a poster, a building facade, a tabletop) from two different viewpoints, the relationship between the two images is a homography. It encodes rotation, translation, scaling, and perspective distortion.
Given H and point [1, 0, 1]T, compute the transformed point.
Multiply H × [1, 0, 1]T, then divide by the third component.
This H applies 2x scaling in x (h11=2) plus translation by 1 in x (h13=1). Point (1,0) becomes (2×1+1, 0) = (3, 0). The h31 and h32 entries (both 0 here) create perspective distortion — when they're nonzero, the denominator w' varies across the image, creating foreshortening.
Each point correspondence (x,y) → (x',y') provides 2 equations (one constraining x', one constraining y'). With 8 unknowns, the minimal solution requires 8/2 = 4 point pairs. This is the Direct Linear Transform (DLT) algorithm. In practice, RANSAC picks random 4-point subsets, solves H, counts inliers, and repeats to find the best H robust to outlier matches.
You have 100 feature matches, 40% are outliers (60% inlier ratio). RANSAC samples 4 points per iteration. What probability that at least one iteration samples 4 inliers after N iterations?
P(all 4 inliers in one draw) = 0.64. P(success in N draws) = 1 − (1 − 0.64)N. How many iterations for 99% confidence?
Only 34 iterations needed even with 40% outliers. This is why RANSAC is so practical — even heavily contaminated matches can be handled quickly. With 80% outliers, the same calculation gives N = ln(0.01)/ln(1−0.24) = ~2871 iterations, which is still fast for modern hardware.
Put these image stitching steps in the correct order.
1. Detect keypoints (SIFT, ORB, or SuperPoint) → 2. Match descriptors (nearest-neighbor in descriptor space) → 3. RANSAC + homography estimation (robust fitting from matches) → 4. Warp image (apply H to align to reference frame) → 5. Blend (feather/multiband blend for seamless compositing)
A homography maps a single plane. When the scene contains objects at multiple depths and the camera translates (not just rotates), nearby and distant objects shift by different amounts (parallax). No single homography can capture this — you'd need one homography per depth plane, or a full 3D model. This is why panorama stitching works best with pure rotation (all points are "at infinity" relative to the rotation center) or for flat scenes.
You're designing a YOLOv8-style single-shot detector. This capstone ties together everything from this workbook: convolution sizing, receptive fields, feature pyramids, anchor math, NMS, and latency budgets. You need to know every number in the pipeline.
Your YOLOv8-Medium backbone is similar to CSPDarknet53-medium. Approximate it as: 5 stages of 3×3 convolutions with channels [48, 96, 192, 384, 536] at spatial sizes [320, 160, 80, 40, 20] (input 640×640). Each stage has ~6 conv layers. Estimate backbone GMACs.
Per conv: MACs ≈ H×W×Cout×Cin×K². Sum across all stages.
The actual YOLOv8-M backbone uses ~8-12 GMACs depending on exact block structure (CSP bottlenecks have fewer MACs than plain convs). The key takeaway: most compute is in the early high-resolution stages. Stage 1 at 320×320 uses as many FLOPs as Stage 5 at 20×20 despite having 12x fewer channels, because spatial area dominates.
YOLOv8 is anchor-free but still predicts on 3 feature map scales: 80×80, 40×40, 20×20. It predicts 1 box per grid cell (no anchor presets). How many total predictions?
Compare to anchor-based Faster R-CNN: 9 anchors per location would give 9 × 8,400 = 75,600 predictions. YOLOv8's anchor-free design reduces predictions by 9x, which dramatically speeds up post-processing (NMS). Each prediction includes: 4 box coordinates + 1 objectness + C class scores = 4 + 1 + 80 = 85 values for COCO.
After confidence thresholding (score > 0.25), about 200 detections remain. NMS is O(N²). If each IoU computation takes 10 nanoseconds, how long does NMS take?
NMS is negligible at 0.2ms. The bottleneck in a detection pipeline is always the backbone forward pass (5-15ms on GPU), not post-processing. This is why optimizing the backbone (pruning, quantization, TensorRT) gives the biggest latency wins.
Your YOLOv8-M has ~20 GMACs total (backbone + neck + head). An NVIDIA RTX 4090 delivers ~83 TFLOPS (FP16, with tensor cores). What is the theoretical minimum inference time?
Remember: 1 MAC ≈ 2 FLOPs. And real utilization is ~50% of peak due to memory bandwidth bottlenecks.
Under 1ms for the forward pass — easily supports real-time detection at 30+ FPS even with post-processing overhead. In practice, YOLOv8-M achieves ~2-3ms on RTX 4090 because of memory transfer latency, kernel launch overhead, and some layers being memory-bandwidth-bound rather than compute-bound. TensorRT optimization can bring this closer to the theoretical limit.
Fill in a realistic latency budget for a 30 FPS detection pipeline on RTX 4090. Target: ≤ 33ms total.
Preprocess (~1ms) → Backbone + FPN (~3ms) → Head predictions (~0.5ms) → Decode + NMS (~0.5ms) → Draw + display (~1ms). Total ≈ 6ms, well under 33ms. The backbone dominates at ~50% of total latency, confirming it's the optimization target. With pipelining (process frame N+1 while displaying frame N), you can overlap stages and further reduce effective latency.
Your YOLOv8-M model has 25M parameters in FP32. You quantize to INT8. How much model size do you save, and approximately how much faster is inference (assume memory-bandwidth-bound layers)?
For memory-bandwidth-bound layers, throughput scales linearly with model size reduction: ~4x faster. For compute-bound layers, INT8 tensor cores give ~2x the throughput of FP16. In practice, INT8 YOLOv8 achieves ~2-3x end-to-end speedup with minimal accuracy loss (<1% mAP drop on COCO). TensorRT's post-training quantization handles the conversion automatically.
| Topic | Lesson |
|---|---|
| Vision-Language Models | VLM — From Absolute Zero |
| Diffusion Models | Diffusion — From Absolute Zero |
| NeRF & 3D Gaussian Splatting | NeRF & 3DGS — From Absolute Zero |
| Contrastive Learning (CLIP) | Contrastive Learning — From Absolute Zero |
| Visual Language Actions | VLA — From Absolute Zero |