FCN — Veanors

Chapter 0: The Problem

An image classification network looks at a photo and says “cat.” Useful, but crude. It tells you what is in the image, but nothing about where.

What if you need to know exactly which pixels belong to the cat? Which pixels are background? Which pixels are a dog standing next to the cat? This is semantic segmentation — assigning a class label to every single pixel in the image.

Before 2014, approaches to semantic segmentation were complex Rube Goldberg machines: generate region proposals, extract hand-crafted features, classify each proposal with an SVM, then stitch everything together with CRFs and superpixels. Slow. Fragile. Not end-to-end learnable.

Classification CNNs like AlexNet and VGGNet were dominating ImageNet — but they had a fatal structural limitation: fully connected layers. These layers flatten the spatial feature maps into a 1D vector, destroying all location information. The network outputs a single label, not a spatial map.

The core question: Can we take a powerful classification CNN — one that has already learned rich hierarchical features from millions of ImageNet images — and surgically modify it so that instead of outputting a single label, it outputs a label for every pixel? Long, Shelhamer, and Darrell showed the answer is yes, and the modification is surprisingly simple.

Classification vs. Segmentation

Classification gives one label for the whole image. Segmentation labels every pixel.

Why can't a standard classification CNN (like VGGNet) directly produce a segmentation map?

Its fully connected layers flatten spatial feature maps into a 1D vector, destroying all pixel location information and outputting a single label

It doesn't have enough layers

CNNs can only process grayscale images

Chapter 1: The Key Insight

The paper's central insight is stunningly elegant: a fully connected layer is just a convolution with a kernel that covers the entire input.

Think about it. A fully connected layer takes, say, a 7×7×512 feature map, flattens it to a 25,088-element vector, and multiplies by a weight matrix. But this is mathematically identical to convolving the 7×7×512 input with a 7×7×512 filter. The output: a single number per filter. Stack 4096 such filters and you get the equivalent of fc6 in VGGNet.

Here is the trick: once you view fc6 as a 7×7 convolution, there is nothing stopping you from feeding in a larger input. Give it a 14×14×512 feature map instead. The 7×7 filter slides over the input with stride 1 and produces an 8×8 output (14 − 7 + 1 = 8). Give it 21×21? You get 15×15. The network now outputs a spatial map of predictions, not a single label.
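The equivalence is easy to check numerically. The sketch below is a minimal pure-Python illustration, not the paper's code: it uses a tiny single-channel 3×3 feature map and one filter instead of 7×7×512, and the helper names `fc_output` and `conv_valid` are made up for this example. It shows that the flattened dot product equals the full-size convolution, then slides the same weights over a larger map.

```python
def fc_output(feature_map, weights):
    """Fully connected view: flatten both grids and take a dot product."""
    flat_x = [v for row in feature_map for v in row]
    flat_w = [v for row in weights for v in row]
    return sum(x * w for x, w in zip(flat_x, flat_w))


def conv_valid(feature_map, kernel):
    """Convolutional view: slide the kernel with stride 1 and no padding."""
    k = len(kernel)
    out_size = len(feature_map) - k + 1
    return [
        [
            sum(
                feature_map[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


weights = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # one 3x3 "FC" filter (toy values)
small = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]        # the input size the FC layer expects

# Same-size input: the convolution yields one value, equal to the FC output.
assert conv_valid(small, weights) == [[fc_output(small, weights)]]

# Larger input: the same weights now produce a spatial map of outputs.
large = [[r * 5 + c for c in range(5)] for r in range(5)]
score_map = conv_valid(large, weights)
print(len(score_map), len(score_map[0]))  # 3 3, since 5 - 3 + 1 = 3
```

The output size follows the usual rule for an unpadded stride-1 convolution: input size minus kernel size plus one.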

Convolutionalization: Replace every FC layer with its equivalent 1×1 or k×k convolution. The resulting network is “fully convolutional” — it accepts inputs of any spatial size and produces correspondingly-sized spatial outputs. A classification network becomes a dense prediction engine with zero architectural changes to the convolutional backbone — only the FC layers are reinterpreted.

The final layer becomes a 1×1 convolution with C output channels (one per class). Each spatial location in the output is a per-pixel class prediction. The entire network — from raw pixels to per-pixel labels — is differentiable and trainable end-to-end.

There is one catch: the repeated pooling layers in VGGNet reduce spatial resolution by a factor of 32. A 224×224 input produces a 7×7 output map. Each output cell covers a 32×32 patch of the input. That is very coarse — we need to get back to full resolution. The next chapters solve this.

What is “convolutionalization” in the context of FCN?

Replacing fully connected layers with equivalent convolutional layers, so the network accepts any input size and outputs a spatial prediction map

Adding more convolutional layers to the network

Removing all pooling layers

Chapter 2: From Classification to Dense Prediction

Let us trace the full transformation from a classification CNN to a dense prediction FCN, step by step.

Step 1: The original classifier

VGG-16 takes a 224×224×3 image. It passes through 13 convolutional layers with 5 max-pooling layers (each halving spatial dimensions: 224 → 112 → 56 → 28 → 14 → 7). Then three fully connected layers: fc6 (4096), fc7 (4096), fc8 (1000 classes). Output: a single 1000-element probability vector.

Step 2: Convolutionalize the FC layers

fc6 (25088→4096) becomes a 7×7 conv with 4096 filters. fc7 (4096→4096) becomes a 1×1 conv with 4096 filters. fc8 (4096→1000) becomes a 1×1 conv with C filters (C = number of segmentation classes, e.g. 21 for PASCAL VOC).

Step 3: Feed a larger image

Now give the network a 500×500 image. The convolutional backbone produces a 16×16×512 feature map at pool5. The convolutionalized fc6 slides its 7×7 kernel over this, producing a 10×10 map. fc7 and fc8 are 1×1 convolutions, preserving spatial size. Result: a 10×10×C score map, where each location holds class scores for a 32×32 patch of the input.
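The size arithmetic in this step can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: it assumes each of the five pooling layers rounds odd sizes up (which reproduces the 16×16 figure above), that the 3×3 convolutions preserve spatial size, and that fc6 is applied without padding. The function names are hypothetical.

```python
import math


def pool5_size(input_size, ceil_mode=True):
    """Spatial size after five stride-2 poolings (3x3 convs assumed size-preserving)."""
    rounding = math.ceil if ceil_mode else math.floor
    size = input_size
    for _ in range(5):
        size = rounding(size / 2)
    return size


def score_map_size(input_size, fc6_kernel=7, ceil_mode=True):
    """Size after an unpadded fc6 convolution; the 1x1 layers leave it unchanged."""
    return pool5_size(input_size, ceil_mode) - fc6_kernel + 1


for n in (224, 500):
    print(n, pool5_size(n), score_map_size(n))
# 224 7 1
# 500 16 10
```

With `ceil_mode=False` the 500×500 case gives a 15×15 map at pool5 instead, so the exact numbers depend on the rounding convention of the pooling layers.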

Efficiency gain: The naive approach would run the classifier separately on 100 overlapping input windows, one per cell of the 10×10 grid, with neighboring windows offset by 32 pixels. The FCN computes all 100 predictions in a single forward pass by sharing computation across overlapping receptive fields. For a 500×500 image, this is more than 5× faster than patch-wise classification (22ms vs. roughly 120ms for the AlexNet-based FCN).

Step 4: The coarse output

The 10×10 output is much smaller than the 500×500 input. Each output pixel covers a 32×32 region. The prediction is coarse — it captures the right semantics (it knows where the cat is, roughly) but lacks fine detail (object boundaries are blocky). This is the FCN-32s baseline — “32s” because the output stride is 32.

Convolutionalization: FC → Conv

See how replacing FC layers with convolutions turns a fixed-size classifier into a spatially flexible dense predictor.

If VGG-16 is convolutionalized and given a 384×384 input, what is the approximate spatial size of its output (given pool5 stride = 32)?

12×12: the input is divided by the stride factor of 32

384×384: the output matches the input

1×1: just like the original classifier

Chapter 3: Upsampling

The convolutionalized network produces a coarse output — a 10×10 map for a 500×500 image. We need to upsample this back to full resolution. The paper uses transposed convolution (also called “deconvolution”, though this is a misnomer).

What is transposed convolution?

Normal convolution with stride 2 takes a 4×4 input and produces a 2×2 output (downsampling). Transposed convolution with stride 2 takes a 2×2 input and produces a 4×4 output (upsampling). It is not an inverse that recovers the original values; it reverses the shape mapping by swapping the forward and backward passes of convolution.

More precisely, imagine inserting zeros between the input pixels (spacing them out), then applying a regular convolution. The result is a larger output. The filter weights determine how the upsampling interpolates — and critically, these weights are learnable.
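One way to make this concrete is the equivalent "scatter" view: each input value is multiplied by the whole kernel and added into the output at stride-spaced positions. The following is a minimal single-channel sketch in pure Python; the function name `transposed_conv2d` and the all-ones kernel are illustrative choices, not anything from the paper.

```python
def transposed_conv2d(x, kernel, stride):
    """Scatter each input value, scaled by the kernel, into a larger output grid."""
    n, k = len(x), len(kernel)
    out_size = (n - 1) * stride + k
    out = [[0.0] * out_size for _ in range(out_size)]
    for r in range(n):
        for c in range(n):
            for i in range(k):
                for j in range(k):
                    # Contributions that land on the same output cell are summed.
                    out[r * stride + i][c * stride + j] += x[r][c] * kernel[i][j]
    return out


coarse = [[1.0, 2.0], [3.0, 4.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]  # 2x2 kernel: no overlap at stride 2
up = transposed_conv2d(coarse, ones, stride=2)
for row in up:
    print(row)
# [1.0, 1.0, 2.0, 2.0]
# [1.0, 1.0, 2.0, 2.0]
# [3.0, 3.0, 4.0, 4.0]
# [3.0, 3.0, 4.0, 4.0]
```

With this particular kernel the result is plain pixel replication. A larger kernel makes neighboring contributions overlap, which is where interpolation-like behavior comes from.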

Bilinear initialization

The paper initializes the transposed convolution filters to perform bilinear interpolation — a simple weighted average of neighboring pixels. But because the filters are part of the network, they can be fine-tuned by backpropagation. The network learns to upsample in a way that is optimized for the segmentation task, not just generic interpolation.
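A bilinear kernel is simple to construct. The sketch below follows a widely used recipe for building such a kernel; whether it matches the paper's released code line for line is an assumption, and `bilinear_kernel` is a name chosen here for illustration.

```python
def bilinear_kernel(size):
    """Return a size x size kernel whose weights perform bilinear interpolation."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    # 1D triangle weights; the 2D kernel is their outer product.
    weights_1d = [1 - abs(i - center) / factor for i in range(size)]
    return [[wr * wc for wc in weights_1d] for wr in weights_1d]


# For 2x upsampling, a common choice is a 4x4 kernel applied with stride 2.
for row in bilinear_kernel(4):
    print(row)
# [0.0625, 0.1875, 0.1875, 0.0625]
# [0.1875, 0.5625, 0.5625, 0.1875]
# [0.1875, 0.5625, 0.5625, 0.1875]
# [0.0625, 0.1875, 0.1875, 0.0625]
```

Loading these values into a transposed convolution makes it start out as ordinary bilinear upsampling; training can then move the weights away from this starting point.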

Why not just use bilinear interpolation? You could. But learned upsampling adapts to the data. The network might learn to sharpen edges, enhance boundaries between classes, or suppress noise — things a fixed interpolation kernel cannot do. In practice, the paper found that learning the upsampling filters improved results, especially when combined with skip connections (Chapter 4).

Transposed Convolution

A 2×2 coarse prediction is upsampled to 4×4 via transposed convolution. Each input value is multiplied by the kernel and placed in the output with stride spacing. Overlapping regions are summed.

In FCN-32s, the entire upsampling from the coarse output to full resolution is done in one giant 32× transposed convolution. This produces a full-resolution prediction, but the boundaries are still blobby because all the fine spatial detail was lost in the 32× downsampling. This motivates the key innovation of the paper: skip connections.

Why is transposed convolution preferred over fixed bilinear upsampling in FCN?

The upsampling filters are learnable: the network can adapt them via backpropagation to sharpen boundaries and improve segmentation quality

It is faster to compute

It produces smaller output

Chapter 4: Skip Connections

This is the most important architectural contribution of the paper. The insight: deep layers know what, shallow layers know where. Combine them.

The problem with FCN-32s

FCN-32s upsamples directly from the final prediction layer (stride 32). By the time the signal reaches this layer, it has passed through five pooling operations. The network knows there is a cat in the upper-left of the image, but the exact boundaries are lost. The 32× upsampling produces a blobby segmentation.

FCN-16s: Adding pool4

Pool4 has stride 16 — its feature maps are 2× finer than pool5. The paper adds a 1×1 convolution on pool4 to produce class predictions at stride 16. Then it fuses these predictions with the pool5 predictions: upsample the pool5 predictions by 2× (transposed convolution), add them element-wise to the pool4 predictions, then upsample the fused result by 16× to full resolution.

Result: a 3.0 point improvement in mean IU (59.4 → 62.4). The boundaries become noticeably sharper.

FCN-8s: Adding pool3

Pool3 has stride 8 — even finer. The paper adds another skip: fuse pool3 predictions with the already-fused pool4+pool5 predictions. Upsample the fused result by 8×.

Result: another 0.3 points (62.4 → 62.7). Small but visible improvement in fine details. Below pool3, returns diminish — the paper stopped here.
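The fusion steps for FCN-16s and FCN-8s can be sketched on toy score maps. This is a simplified illustration with one class channel and tiny sizes; nearest-neighbor replication stands in for the learned 2× transposed convolution, and the names `upsample2x` and `fuse` are hypothetical.

```python
def upsample2x(scores):
    """Stand-in for the learned 2x transposed convolution (nearest-neighbor here)."""
    out = []
    for row in scores:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(coarse_scores, skip_scores):
    """Upsample the coarser score map 2x and add the finer skip scores element-wise."""
    up = upsample2x(coarse_scores)
    return [[u + s for u, s in zip(ur, sr)] for ur, sr in zip(up, skip_scores)]


# Toy sizes: the stride-32 map is 1x1, the pool4 scores 2x2, the pool3 scores 4x4.
score_pool5 = [[1.0]]
score_pool4 = [[0.5, -0.5], [0.0, 0.25]]
score_pool3 = [[0.1 * (r + c) for c in range(4)] for r in range(4)]

fcn16 = fuse(score_pool5, score_pool4)  # stride 16; FCN-16s upsamples this 16x
fcn8 = fuse(fcn16, score_pool3)         # stride 8; FCN-8s upsamples this 8x
print(fcn16)                     # [[1.5, 0.5], [1.0, 1.25]]
print(len(fcn8), len(fcn8[0]))   # 4 4
```

Each fusion doubles the resolution of the score map before the final upsampling, so the last step has less detail to invent.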

The deep jet: Deep layers have large receptive fields and capture semantic meaning (what objects are). Shallow layers have small receptive fields and capture fine spatial detail (where edges are). The skip architecture creates a “deep jet” — a nonlinear multiscale feature hierarchy where predictions at each scale are fused to produce output that is both semantically correct and spatially precise. This idea became the foundation of virtually every subsequent segmentation architecture (U-Net, DeepLab, PSPNet, FPN).

Skip Architecture: FCN-32s → 16s → 8s

Adding skip connections from earlier pooling layers progressively refines the segmentation output across the three FCN variants (FCN-32s, FCN-16s, FCN-8s).

Why element-wise addition, not concatenation?

The paper uses summation to fuse skip connections, not concatenation (which U-Net would later use). Summation is parameter-free and keeps the channel count constant. The skip predictions are zero-initialized so the network starts as FCN-32s and gradually learns to incorporate finer information. This makes training stable — each refinement stage starts from a working model.

What is the fundamental tension that skip connections resolve in FCN?

Deep layers capture what (semantic meaning) but lose where (spatial precision); skip connections fuse deep coarse features with shallow fine features to get both

Deeper networks train faster but are less accurate

Skip connections reduce the number of parameters

Chapter 5: Architecture

The full FCN-8s architecture is an adapted VGG-16 backbone with three key modifications: convolutionalized FC layers, learned upsampling, and skip connections from pool3 and pool4.

The backbone: VGG-16

The paper tried three backbones: AlexNet (39.8 mean IU), GoogLeNet (42.5), and VGG-16 (56.0). VGG-16 won decisively despite being the slowest (210ms vs 50ms for AlexNet). Its uniform 3×3 filter architecture and depth created the best features for dense prediction.

Layer-by-layer walkthrough

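For a 500×500 input with 21 PASCAL VOC classes. Sizes below are rounded down at each pooling step, and conv6 is shown with padding that keeps the spatial size; the 16×16 → 10×10 figures in Chapter 2 instead round up at each pooling step and apply the 7×7 kernel without padding: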

conv1

500×500×3 → 500×500×64 (two 3×3 conv layers)

↓ pool1 (stride 2)

conv2

250×250×64 → 250×250×128 (two 3×3 conv layers)

↓ pool2 (stride 2)

conv3

125×125×128 → 125×125×256 (three 3×3 conv layers)

↓ pool3 (stride 2) — skip to FCN-8s

conv4

62×62×256 → 62×62×512 (three 3×3 conv layers)

↓ pool4 (stride 2) — skip to FCN-16s

conv5

31×31×512 → 31×31×512 (three 3×3 conv layers)

↓ pool5 (stride 2)

fc6 → conv6

15×15×512 → 15×15×4096 (7×7 conv, was FC)

↓

fc7 → conv7

15×15×4096 → 15×15×4096 (1×1 conv, was FC)

↓

score

15×15×4096 → 15×15×21 (1×1 conv, C classes)

↓ fuse skip from pool4, pool3, then upsample 8×

output

500×500×21 (per-pixel class scores)

Parameter count

The convolutionalized VGG-16 has 134M parameters; most are in the converted fc6 layer (7×7×512×4096 ≈ 103M weights). The 1×1 convolutions for skip predictions add negligible parameters. The transposed convolution filters for upsampling are initialized to bilinear and fine-tuned.
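The parameter count can be reproduced with straightforward arithmetic. The sketch below counts weights plus biases for the 13 VGG-16 convolutions, the two converted FC layers, and a 21-class scoring layer; skip and upsampling layers are left out, and `conv_params` is a helper introduced only for this example.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch


# (in_channels, out_channels) of the 13 VGG-16 3x3 conv layers.
backbone = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
backbone_total = sum(conv_params(3, i, o) for i, o in backbone)
conv6 = conv_params(7, 512, 4096)    # converted fc6
conv7 = conv_params(1, 4096, 4096)   # converted fc7
score = conv_params(1, 4096, 21)     # 21 PASCAL VOC classes

total = backbone_total + conv6 + conv7 + score
print(f"{backbone_total:,} {conv6:,} {conv7:,} {score:,} {total:,}")
# 14,714,688 102,764,544 16,781,312 86,037 134,346,581
```

The converted fc6 alone accounts for roughly three quarters of the total, which is why the backbone convolutions look small by comparison.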

Transfer learning: All 13 convolutional layers are initialized from ImageNet-pretrained VGG-16. The converted FC layers use the same weights, just reshaped into convolutional filters. Only the final scoring layer (1×1 conv to 21 classes) and the skip prediction layers have no pretrained weights; the paper zero-initializes them rather than drawing them at random. This massive transfer of learned features is what makes FCN work; training the full network from scratch was not feasible.

Why did FCN use VGG-16 over AlexNet and GoogLeNet as its backbone?

VGG-16 achieved 56.0 mean IU vs 39.8 (AlexNet) and 42.5 (GoogLeNet); its deep, uniform 3×3 architecture produced the best dense prediction features

VGG-16 was the fastest

VGG-16 had the fewest parameters

Chapter 6: Training

Training FCN requires careful attention to loss functions, optimization, and the fine-tuning strategy.

Per-pixel cross-entropy loss

The loss is a standard multinomial logistic loss (cross-entropy), computed independently at every pixel:

L(θ) = −(1/N) ∑_{(i,j)} log p(y_ij | x; θ)

Where y_ij is the ground-truth class at pixel (i,j), p is the softmax probability for that class, and N is the total number of valid pixels. Pixels marked as ambiguous or difficult in the ground truth are masked out and ignored.
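The loss can be written out directly from this definition. The sketch below is a plain-Python illustration rather than a training-ready implementation; the ignore value 255 is an assumed convention for masked pixels, and `pixelwise_cross_entropy` is a hypothetical name.

```python
import math

IGNORE = 255  # assumed label value for masked-out (ambiguous) pixels


def pixelwise_cross_entropy(scores, labels):
    """scores[i][j] is a list of C class scores; labels[i][j] is a class index."""
    total, valid = 0.0, 0
    for score_row, label_row in zip(scores, labels):
        for logits, y in zip(score_row, label_row):
            if y == IGNORE:
                continue  # masked pixels contribute nothing to the loss or to N
            # Numerically stable log of the softmax normalizer.
            m = max(logits)
            log_norm = m + math.log(sum(math.exp(s - m) for s in logits))
            total -= logits[y] - log_norm
            valid += 1
    return total / valid if valid else 0.0


# A 1x3 image with 2 classes; the last pixel is ignored.
scores = [[[0.0, 0.0], [2.0, 0.0], [5.0, -5.0]]]
labels = [[0, 0, IGNORE]]
loss = pixelwise_cross_entropy(scores, labels)
print(round(loss, 4))  # 0.41
```

Only the two valid pixels are averaged: the undecided first pixel costs log 2 ≈ 0.693 and the confident, correct second pixel costs about 0.127.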

Optimization

SGD with momentum 0.9, weight decay 5×10⁻⁴, minibatch size 20 images. Learning rate: 10⁻⁴ for VGG-16 backbone. Biases get 2× the learning rate. The class scoring layer is zero-initialized (not random — the paper found no benefit from random init).

Staged training

Stage 1

Train FCN-32s from ImageNet-pretrained VGG-16. Fine-tune all layers. ~3 days on one GPU.

↓

Stage 2

Initialize FCN-16s from trained FCN-32s. Add pool4 skip with zero-initialized 1×1 conv. Learning rate ×0.01. ~1 day.

↓

Stage 3

Initialize FCN-8s from trained FCN-16s. Add pool3 skip with zero-initialized 1×1 conv. ~1 day.

Why zero-initialize skips? When a skip prediction layer is zero-initialized, the fused output initially equals the coarser model’s output. The network starts from a known-good state (FCN-32s) and gradually learns to refine it with finer features. This is much more stable than random initialization, which would corrupt the already-trained coarse predictions.
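The effect of zero initialization can be checked on toy data. In this sketch the feature values, shapes, and helper names (`conv1x1`, `add_maps`) are all made up for illustration; the only point is that a zero-weight 1×1 scorer contributes nothing, so the fused output equals the coarser model's upsampled scores.

```python
def conv1x1(features, weights):
    """features[i][j] is a channel vector; weights[k] holds class k's channel weights."""
    return [
        [[sum(w * f for w, f in zip(wk, px)) for wk in weights] for px in row]
        for row in features
    ]


def add_maps(a, b):
    """Element-wise sum of two score maps of identical shape."""
    return [
        [[x + y for x, y in zip(pa, pb)] for pa, pb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


# Toy pool4 features (2x2 positions, 3 channels) and upsampled coarse scores (2 classes).
pool4 = [[[0.3, -1.2, 0.8], [1.0, 0.0, 2.0]], [[0.5, 0.5, 0.5], [-0.7, 0.1, 0.9]]]
coarse_up = [[[2.0, -1.0], [0.5, 0.5]], [[-3.0, 4.0], [1.0, 0.0]]]

zero_weights = [[0.0] * 3 for _ in range(2)]  # zero-initialized skip scorer
fused = add_maps(coarse_up, conv1x1(pool4, zero_weights))
assert fused == coarse_up  # the finer model starts exactly where the coarser one ended
```

Any nonzero weights would change the fused scores immediately, which is the disruption that zero initialization avoids at the start of each stage.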

Whole-image training

Unlike prior work that randomly sampled patches, FCN trains on whole images. Each image is effectively a batch of all overlapping patches. The paper showed this is just as effective as patch sampling, but faster because convolutional computation is shared. No data augmentation was used (random mirroring and jittering yielded no improvement).

Class balancing

PASCAL VOC labels are mildly unbalanced (~3/4 background). The paper found class balancing unnecessary — the per-pixel loss naturally handles the imbalance for this dataset. This is a notable simplification over prior methods that required careful class-frequency weighting.

Why are the skip connection prediction layers zero-initialized instead of randomly initialized?

So the network starts with the exact output of the coarser model: zero-initialized skips add nothing initially, then gradually learn to refine the predictions

Random initialization is too slow to converge

To reduce the number of parameters

Chapter 7: Results

FCN was evaluated on three benchmarks: PASCAL VOC (the main benchmark), NYUDv2 (RGB-D indoor scenes), and SIFT Flow (outdoor scenes with geometric labels).

PASCAL VOC 2012

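FCN-8s achieved 62.2% mean IU on the test set, a 20% relative improvement over the previous state-of-the-art (SDS at 51.6%). Inference time: ~175ms per image, compared to ~50 seconds for SDS. That is a 286× speedup. (The 59.4, 62.4, and 62.7 figures in the progression below were measured on a validation subset, which is why FCN-8s shows 62.7 there.)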
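Both headline figures follow from the numbers quoted above, as this short check shows. The inference times are the approximate values given in the text, so the speedup is approximate too.

```python
fcn8s_miou, sds_miou = 62.2, 51.6        # mean IU on the test set, in percent
fcn_seconds, sds_seconds = 0.175, 50.0   # approximate inference time per image

relative_gain = (fcn8s_miou - sds_miou) / sds_miou * 100
speedup = sds_seconds / fcn_seconds
print(f"{relative_gain:.1f}% relative improvement, {speedup:.0f}x faster")
# 20.5% relative improvement, 286x faster
```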

The skip architecture progression

FCN-32s

59.4 mean IU — coarse, blobby output

↓ +pool4 skip

FCN-16s

62.4 mean IU — sharper boundaries (+3.0)

↓ +pool3 skip

FCN-8s

62.7 mean IU — fine detail (+0.3, diminishing returns)

FCN Results on PASCAL VOC

Mean IU scores comparing FCN variants and prior methods. FCN-8s achieves a 20% relative improvement over the previous state-of-the-art.
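For readers unfamiliar with the metric, mean IU (mean intersection over union) averages, over classes, the ratio of correctly labeled pixels to the union of predicted and ground-truth pixels for that class. The sketch below computes it from a confusion matrix; skipping classes that never appear is a choice made here for robustness and is an assumption, not something stated in the text.

```python
def mean_iu(confusion):
    """confusion[i][j] counts pixels of true class i predicted as class j."""
    n = len(confusion)
    ious = []
    for i in range(n):
        true_pos = confusion[i][i]
        total_true = sum(confusion[i])
        total_pred = sum(confusion[r][i] for r in range(n))
        union = total_true + total_pred - true_pos
        if union > 0:  # assumption: skip classes absent from both labels and predictions
            ious.append(true_pos / union)
    return sum(ious) / len(ious)


# Two classes, 8 pixels: 3 correct per class, 1 confused in each direction.
print(mean_iu([[3, 1], [1, 3]]))  # 0.6
```

Pixel accuracy for the same example would be 6/8 = 0.75, which illustrates that mean IU penalizes errors more heavily than raw accuracy.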

NYUDv2

FCN-16s with RGB-HHA late fusion achieved 34.0 mean IU, improving over Gupta et al.’s 28.6. The “HHA” encoding converts depth into horizontal disparity, height above ground, and surface normal angle — a richer representation than raw depth.

SIFT Flow

FCN-16s with a two-headed architecture simultaneously predicted 33 semantic classes and 3 geometric classes, achieving 85.2% pixel accuracy on semantics and 94.3% on geometry — state-of-the-art on both tasks with essentially the speed of a single model.

What the numbers mean: The 20% relative improvement over SDS is significant, but the real story is the simplicity. SDS required proposals, feature extraction, SVM classifiers, and CRF refinement — a complex multi-stage pipeline. FCN is a single neural network, trained end-to-end, that runs in 175ms. It made semantic segmentation a solved-enough problem that the field could move on to harder variants (instance segmentation, panoptic segmentation).

How much faster was FCN-8s compared to the prior state-of-the-art (SDS) on PASCAL VOC?

About 286× faster: 175ms vs ~50 seconds, because FCN runs a single forward pass while SDS requires proposals, feature extraction, and refinement

About 2× faster

They had similar speeds

Chapter 8: What the Network Learns

The FCN paper reveals a beautiful organizing principle: different layers of the network encode fundamentally different types of information, and the skip architecture exploits this structure.

Shallow layers: WHERE

Pool1 and pool2 features respond to edges, textures, and colors. They have small receptive fields (a few pixels) and high spatial resolution. These features know where boundaries are, but have no idea what object they belong to. A cat’s ear edge looks the same as a car’s fender edge at this level.

Mid layers: WHAT + WHERE

Pool3 and pool4 features respond to parts and patterns — eyes, wheels, window grids. They have medium receptive fields and medium resolution. These are the sweet spot: they carry enough semantic information to distinguish object classes, while retaining enough spatial detail to localize boundaries. This is why pool3 and pool4 are the optimal skip sources.

Deep layers: WHAT

Pool5 and the convolutionalized FC layers respond to whole objects and scenes. They have huge receptive fields (hundreds of pixels) but very low spatial resolution. They know what is in the image (a cat sitting on a chair in a room) but can only localize it to a 32×32 pixel block.
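The receptive field sizes behind these three tiers can be computed with the standard recurrence: each layer grows the receptive field by (kernel − 1) times the cumulative stride so far. The sketch below assumes the VGG-16 layout described in Chapter 2 (3×3 convolutions, 2×2 stride-2 pooling) followed by the 7×7 convolutionalized fc6; the layer names are labels chosen for this example.

```python
# (name, kernel, stride) for VGG-16 followed by the convolutionalized fc6.
layers = []
for block, n_convs in enumerate((2, 2, 3, 3, 3), start=1):
    layers += [(f"conv{block}_{i + 1}", 3, 1) for i in range(n_convs)]
    layers.append((f"pool{block}", 2, 2))
layers.append(("fc6", 7, 1))

rf, jump = 1, 1  # receptive field size and cumulative stride, in input pixels
report = {}
for name, kernel, stride in layers:
    rf += (kernel - 1) * jump
    jump *= stride
    report[name] = (rf, jump)

for name in ("pool1", "pool2", "pool3", "pool4", "pool5", "fc6"):
    print(name, report[name])
# pool1 (6, 2)
# pool2 (16, 4)
# pool3 (44, 8)
# pool4 (100, 16)
# pool5 (212, 32)
# fc6 (404, 32)
```

The receptive field grows from 6 pixels at pool1 to 404 pixels after fc6 while the stride rises to 32, which is the what-versus-where trade-off in numbers.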

The coarse-to-fine principle: The skip architecture is not arbitrary — it mirrors a fundamental property of deep convolutional networks. Hierarchical features naturally organize from spatial precision (shallow) to semantic abstraction (deep). The skip connections simply make this hierarchy available to the prediction, combining the “what” from deep layers with the “where” from shallow layers. Every successful segmentation architecture since FCN has exploited this same principle.

Coarse-to-Fine Feature Hierarchy

Visualize what each layer “sees.” Shallow layers detect edges and textures (WHERE). Deep layers recognize objects (WHAT). The skip architecture fuses both.

Failure modes

The paper shows an illuminating failure case: lifejackets on a boat are misclassified as people. This reveals that even with skip connections, the network occasionally relies too heavily on appearance (bright-colored blob in a boat context) over geometric structure (lifejackets are not human-shaped). Later architectures with CRF post-processing (DeepLab) or attention mechanisms partially address this.

Why are pool3 and pool4 (not pool1 or pool2) used as skip connection sources in FCN?

Pool3/pool4 have the best balance of semantic meaning and spatial precision: pool1/pool2 are too low-level (edges only, no semantic content), while pool5 is too coarse

Pool1 and pool2 have too many channels

Pool3/pool4 are faster to compute

Chapter 9: Connections

What FCN built on

VGGNet (Simonyan & Zisserman, 2014): The backbone architecture. VGG-16’s uniform 3×3 convolutions and ImageNet-pretrained features provided the foundation for FCN’s dense predictions. The paper also tested AlexNet and GoogLeNet backbones.

OverFeat (Sermanet et al., 2013): Introduced the shift-and-stitch trick for dense prediction with convnets. FCN considered this approach but found learned upsampling more effective.

Transfer learning (Donahue et al., 2014): Demonstrated that features learned on ImageNet transfer to other visual tasks. FCN extended this from classification to pixel-level prediction.

What FCN enabled

U-Net (Ronneberger et al., 2015): Extended FCN’s skip connections into a symmetric encoder-decoder with skip concatenation (not addition). Became the standard for biomedical image segmentation and many other domains.

DeepLab (Chen et al., 2014-2018): Combined FCN-style dense prediction with atrous (dilated) convolutions and CRF post-processing for sharper boundaries. DeepLabv3+ remains widely used.

Feature Pyramid Network (FPN) (Lin et al., 2017): Generalized FCN’s multiscale skip architecture for object detection, creating a top-down pathway with lateral connections at every pyramid level.

PSPNet (Zhao et al., 2017): Added a pyramid pooling module on top of FCN features to capture multi-scale context, winning the ImageNet Scene Parsing Challenge.

Mask R-CNN (He et al., 2017): Combined FPN with an FCN-style mask prediction head for instance segmentation, enabling per-object pixel masks.

FCN’s legacy: Before FCN, semantic segmentation was a pipeline of hand-crafted components. After FCN, it became end-to-end learnable with a single network. Modern segmentation architectures such as U-Net, DeepLab, PSPNet, and Mask R-CNN build directly on FCN, and later models like Segment Anything still inherit its framing of segmentation as dense, per-pixel prediction. The three core ideas (convolutionalization, learned upsampling, skip connections) remain the DNA of dense prediction.

Cheat sheet

Core idea

Replace FC layers with equivalent convolutions (7×7 for fc6, 1×1 for fc7 and fc8) → any-size input, spatial output

Key innovation

Skip connections fuse deep (what) + shallow (where) for precise segmentation

Variants

FCN-32s (59.4), FCN-16s (62.4), FCN-8s (62.7 mean IU on PASCAL VOC)

Upsampling

Learned transposed convolution, initialized to bilinear interpolation

Impact

Foundation of all modern segmentation: U-Net, DeepLab, FPN, Mask R-CNN

How does U-Net extend FCN’s skip connection idea?

U-Net uses skip concatenation (not addition) in a symmetric encoder-decoder, passing full feature maps from the encoder to the decoder at each level for richer spatial information

U-Net removes all skip connections

U-Net uses a different backbone

Fully Convolutional Networks