The first architecture to adapt classification CNNs for dense per-pixel prediction: it replaces fully connected layers with equivalent convolutions, fuses coarse-to-fine features via skip connections, and learns end-to-end upsampling for semantic segmentation.
An image classification network looks at a photo and says “cat.” Useful, but crude. It tells you what is in the image, but nothing about where.
What if you need to know exactly which pixels belong to the cat? Which pixels are background? Which pixels are a dog standing next to the cat? This is semantic segmentation — assigning a class label to every single pixel in the image.
Before 2014, approaches to semantic segmentation were complex Rube Goldberg machines: generate region proposals, extract hand-crafted features, classify each proposal with an SVM, then stitch everything together with CRFs and superpixels. Slow. Fragile. Not end-to-end learnable.
Classification CNNs like AlexNet and VGGNet were dominating ImageNet — but they had a fatal structural limitation: fully connected layers. These layers flatten the spatial feature maps into a 1D vector, destroying all location information. The network outputs a single label, not a spatial map.
Classification gives one label for the whole image. Segmentation labels every pixel.
The paper's central insight is stunningly elegant: a fully connected layer is just a convolution with a kernel that covers the entire input.
Think about it. A fully connected layer takes, say, a 7×7×512 feature map, flattens it to a 25,088-element vector, and multiplies by a weight matrix. But this is mathematically identical to convolving the 7×7×512 input with a 7×7×512 filter. The output: a single number per filter. Stack 4096 such filters and you get the equivalent of fc6 in VGGNet.
Here is the trick: once you view fc6 as a 7×7 convolution, there is nothing stopping you from feeding in a larger input. Give it a 14×14×512 feature map instead. The 7×7 filter slides over the input with stride 1 and produces an 8×8 output (14 − 7 + 1 = 8). Give it 21×21? You get 15×15. The network now outputs a spatial map of predictions, not a single label.
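The equivalence can be checked numerically. The sketch below is a minimal NumPy illustration, not the paper's code: the channel and filter counts are shrunk (8 channels, 5 filters instead of 512 and 4096) purely to keep it fast, and the names `fc_forward` and `conv_forward` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for speed); fc6 in the text uses 7x7x512 inputs and 4096 filters.
K, C_IN, N_OUT = 7, 8, 5
fc_weight = rng.standard_normal((N_OUT, K * K * C_IN))  # one row per output unit


def fc_forward(feat):
    """Fully connected layer: flatten a (K, K, C_IN) map and multiply by the weights."""
    return fc_weight @ feat.reshape(-1)


def conv_forward(feat):
    """The same weights viewed as N_OUT filters of shape (K, K, C_IN), slid with stride 1."""
    filters = fc_weight.reshape(N_OUT, K, K, C_IN)
    h, w, _ = feat.shape
    out = np.empty((h - K + 1, w - K + 1, N_OUT))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feat[i:i + K, j:j + K, :]
            # Dot product of every filter with the current window.
            out[i, j] = np.tensordot(filters, window, axes=([1, 2, 3], [0, 1, 2]))
    return out


small = rng.standard_normal((K, K, C_IN))    # the size the FC layer was built for
large = rng.standard_normal((14, 14, C_IN))  # a larger feature map

fc_out = fc_forward(small)         # shape (N_OUT,)
conv_small = conv_forward(small)   # shape (1, 1, N_OUT): same numbers as fc_out
conv_large = conv_forward(large)   # one prediction per 7x7 window, side = 14 - 7 + 1

print(np.allclose(fc_out, conv_small[0, 0]))
print(conv_large.shape)
```

Each spatial position of `conv_large` is exactly what the fully connected layer would have produced for the corresponding 7×7 window, but the overlapping computation is shared instead of repeated.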
The final layer becomes a 1×1 convolution with C output channels (one per class). Each spatial location in the output is a per-pixel class prediction. The entire network — from raw pixels to per-pixel labels — is differentiable and trainable end-to-end.
There is one catch: the five pooling layers in VGGNet reduce spatial resolution by a factor of 32 (2⁵). A 224×224 input produces only a 7×7 feature map at pool5, so neighboring cells sit 32 input pixels apart. That is very coarse, and we need to get back to full resolution. The next chapters solve this.
Let us trace the full transformation from a classification CNN to a dense prediction FCN, step by step.
VGG-16 takes a 224×224×3 image. It passes through 13 convolutional layers with 5 max-pooling layers (each halving spatial dimensions: 224 → 112 → 56 → 28 → 14 → 7). Then three fully connected layers: fc6 (4096), fc7 (4096), fc8 (1000 classes). Output: a single 1000-element probability vector.
fc6 (25088→4096) becomes a 7×7 conv with 4096 filters. fc7 (4096→4096) becomes a 1×1 conv with 4096 filters. fc8 (4096→1000) becomes a 1×1 conv with C filters (C = number of segmentation classes, e.g. 21 for PASCAL VOC).
Now give the network a 500×500 image. The convolutional backbone produces a 16×16×512 feature map at pool5 (500 / 32 ≈ 15.6, which assumes each pooling step rounds odd sizes up). The convolutionalized fc6 slides its 7×7 kernel over this, producing a 10×10 map (16 − 7 + 1 = 10). fc7 and fc8 are 1×1 convolutions, preserving spatial size. Result: a 10×10×C score map whose locations are spaced 32 input pixels apart, each holding the class scores for the region around it.
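The size arithmetic in these steps can be traced with a few lines of Python. This is a sketch under two assumptions: the 3×3 convolutions are padded so they preserve size, and pooling rounds odd sizes up (which is what the 16×16 figure implies). The helper names are hypothetical.

```python
import math


def pooled_sides(side, ceil_mode=True):
    """Side length after each of VGG-16's five 2x2, stride-2 poolings.

    The padded 3x3 convolutions are assumed to keep the size unchanged.
    """
    rounder = math.ceil if ceil_mode else math.floor
    sizes = [side]
    for _ in range(5):
        side = rounder(side / 2)
        sizes.append(side)
    return sizes


def score_map_side(input_side, fc6_kernel=7, ceil_mode=True):
    """Side of the score map after the convolutionalized fc6 (stride 1, no padding)."""
    pool5 = pooled_sides(input_side, ceil_mode)[-1]
    return pool5 - fc6_kernel + 1


print(pooled_sides(224))   # [224, 112, 56, 28, 14, 7]
print(pooled_sides(500))   # [500, 250, 125, 63, 32, 16]
print(score_map_side(224), score_map_side(500))   # 1 10
```

With rounding down instead (`ceil_mode=False`), the same 500×500 input would give a 15×15 pool5 map and a 9×9 score map, so the exact numbers depend on the framework's pooling convention.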
The 10×10 output is much smaller than the 500×500 input. Each output pixel covers a 32×32 region. The prediction is coarse — it captures the right semantics (it knows where the cat is, roughly) but lacks fine detail (object boundaries are blocky). This is the FCN-32s baseline — “32s” because the output stride is 32.
Replacing FC layers with convolutions turns a fixed-size classifier into a spatially flexible dense predictor whose output size follows the input size.
The convolutionalized network produces a coarse output — a 10×10 map for a 500×500 image. We need to upsample this back to full resolution. The paper uses transposed convolution (also called “deconvolution”, though this is a misnomer).
Normal convolution with stride 2 takes a 4×4 input and produces a 2×2 output (downsampling). Transposed convolution with stride 2 takes a 2×2 input and produces a 4×4 output (upsampling). It is not a true inverse, because it does not recover the original values; rather, it swaps the forward and backward passes of an ordinary convolution, so shapes map in the opposite direction.
More precisely, imagine inserting zeros between the input pixels (spacing them out), then applying a regular convolution. The result is a larger output. The filter weights determine how the upsampling interpolates — and critically, these weights are learnable.
The paper initializes the transposed convolution filters to perform bilinear interpolation — a simple weighted average of neighboring pixels. But because the filters are part of the network, they can be fine-tuned by backpropagation. The network learns to upsample in a way that is optimized for the segmentation task, not just generic interpolation.
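A bilinear initialization can be written down in closed form. The sketch below uses a formula commonly seen in FCN implementations; the function name `bilinear_kernel` and the choice of kernel size equal to twice the stride are assumptions for illustration, not details stated in the text above.

```python
import numpy as np


def bilinear_kernel(size):
    """2D bilinear interpolation weights for a square kernel of the given size."""
    factor = (size + 1) // 2
    # Centre of the kernel: an integer index for odd sizes, a half index for even sizes.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return (1 - np.abs(rows - center) / factor) * (1 - np.abs(cols - center) / factor)


k4 = bilinear_kernel(4)    # assumed pairing: 4x4 kernel for 2x upsampling
k64 = bilinear_kernel(64)  # assumed pairing: 64x64 kernel for 32x upsampling

print(k4[:, 0] / k4[0, 0])  # 1D profile along one axis: 1, 3, 3, 1
print(k4.sum(), k64.sum())  # each sums to stride squared: 4 and 1024
```

In a network, one such kernel would be copied into each class channel of the transposed convolution, after which training is free to move the weights away from pure interpolation.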
A 2×2 coarse prediction is upsampled to 4×4 via transposed convolution. Each input value is multiplied by the kernel and placed in the output with stride spacing. Overlapping regions are summed.
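Both descriptions of transposed convolution (scatter each input value through the kernel, or insert zeros and then convolve) can be written in a few lines of NumPy and compared. This is a single-channel sketch with no cropping or padding options; the function names are illustrative only.

```python
import numpy as np


def transposed_conv2d(x, kernel, stride):
    """Scatter view: each input value stamps a scaled kernel copy; overlaps are summed."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out


def zero_insert_then_conv(x, kernel, stride):
    """Zero-insertion view: space the inputs out, pad, then slide the flipped kernel."""
    h, w = x.shape
    k = kernel.shape[0]
    spaced = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    spaced[::stride, ::stride] = x
    padded = np.pad(spaced, k - 1)
    flipped = kernel[::-1, ::-1]
    out = np.empty((padded.shape[0] - k + 1, padded.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * flipped)
    return out


coarse = np.array([[1.0, 2.0], [3.0, 4.0]])

# 2x2 kernel, stride 2: the stamped copies tile the 4x4 output without overlap.
up_tiled = transposed_conv2d(coarse, np.ones((2, 2)), stride=2)

# 3x3 kernel, stride 2: neighbouring copies overlap by one pixel and are summed.
up_overlap = transposed_conv2d(coarse, np.ones((3, 3)), stride=2)

print(up_tiled)
print(up_overlap[2, 2])  # the centre pixel receives all four inputs
```

The output side is `(input_side - 1) * stride + kernel_size`. Frameworks usually add padding or cropping arguments on top of this to hit an exact target size.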
In FCN-32s, the entire upsampling from the coarse output to full resolution is done in one giant 32× transposed convolution. This produces a full-resolution prediction, but the boundaries are still blobby because all the fine spatial detail was lost in the 32× downsampling. This motivates the key innovation of the paper: skip connections.
This is the most important architectural contribution of the paper. The insight: deep layers know what, shallow layers know where. Combine them.
FCN-32s upsamples directly from the final prediction layer (stride 32). By the time the signal reaches this layer, it has passed through five pooling operations. The network knows there is a cat in the upper-left of the image, but the exact boundaries are lost. The 32× upsampling produces a blobby segmentation.
Pool4 has stride 16 — its feature maps are 2× finer than pool5. The paper adds a 1×1 convolution on pool4 to produce class predictions at stride 16. Then it fuses these predictions with the pool5 predictions: upsample the pool5 predictions by 2× (transposed convolution), add them element-wise to the pool4 predictions, then upsample the fused result by 16× to full resolution.
Result: a 3.0 point improvement in mean IU on the paper's validation subset (59.4 → 62.4). The boundaries become noticeably sharper.
Pool3 has stride 8 — even finer. The paper adds another skip: fuse pool3 predictions with the already-fused pool4+pool5 predictions. Upsample the fused result by 8×.
Result: another 0.3 points (62.4 → 62.7). Small but visible improvement in fine details. Below pool3, returns diminish — the paper stopped here.
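The three variants differ only in where the fusion chain stops. The sketch below traces that chain on toy score maps. It makes two simplifying assumptions that should be read as such: nearest-neighbour repetition stands in for the learned transposed convolutions, and the maps are sized so that they align exactly, which sidesteps any cropping a real implementation may need.

```python
import numpy as np


def upsample(scores, factor):
    """Stand-in for a learned transposed convolution: nearest-neighbour repeat."""
    return scores.repeat(factor, axis=1).repeat(factor, axis=2)


def fcn_outputs(score5, score4, score3):
    """Full-resolution class scores for FCN-32s, FCN-16s and FCN-8s."""
    fcn32 = upsample(score5, 32)                # no skips: one 32x jump
    fused16 = upsample(score5, 2) + score4      # stride 32 -> 16, add pool4 scores
    fcn16 = upsample(fused16, 16)
    fused8 = upsample(fused16, 2) + score3      # stride 16 -> 8, add pool3 scores
    fcn8 = upsample(fused8, 8)
    return fcn32, fcn16, fcn8


rng = np.random.default_rng(0)
C, h, w = 21, 3, 4                               # toy stride-32 score map
score5 = rng.standard_normal((C, h, w))          # scores from the deepest layer
score4 = rng.standard_normal((C, 2 * h, 2 * w))  # 1x1-conv scores on pool4
score3 = rng.standard_normal((C, 4 * h, 4 * w))  # 1x1-conv scores on pool3

fcn32, fcn16, fcn8 = fcn_outputs(score5, score4, score3)
print(fcn32.shape, fcn16.shape, fcn8.shape)      # all (21, 96, 128)
```

All three outputs have the same shape; what changes is how much fine-stride information is mixed in before the final upsampling.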
Across the three FCN variants, adding skip connections from earlier pooling layers progressively refines the segmentation output.
The paper uses summation to fuse skip connections, not concatenation (which U-Net would later use). Summation is parameter-free and keeps the channel count constant. The skip predictions are zero-initialized so the network starts as FCN-32s and gradually learns to incorporate finer information. This makes training stable — each refinement stage starts from a working model.
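A small numerical illustration of these two points, using made-up feature values: a zero-initialized 1×1 scoring layer leaves the coarse prediction untouched at the start of training, and summation keeps the channel count at C where concatenation would double it. The array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 21

pool4_feat = rng.standard_normal((512, 6, 8))  # toy stride-16 features
up_score5 = rng.standard_normal((C, 6, 8))     # coarse scores, already upsampled 2x

# A 1x1 convolution is a per-pixel matrix multiply over channels.
w_skip = np.zeros((C, 512))                    # zero-initialized skip scorer
b_skip = np.zeros(C)
score4 = np.einsum("oc,chw->ohw", w_skip, pool4_feat) + b_skip[:, None, None]

fused_sum = up_score5 + score4                            # fusion by summation
fused_cat = np.concatenate([up_score5, score4], axis=0)   # concatenation, for contrast

print(np.array_equal(fused_sum, up_score5))  # True: the skip starts as a no-op
print(fused_sum.shape, fused_cat.shape)      # (21, 6, 8) versus (42, 6, 8)
```

Once gradients flow, `w_skip` moves away from zero and the pool4 branch begins to correct the coarse scores.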
The full FCN-8s architecture is an adapted VGG-16 backbone with three key modifications: convolutionalized FC layers, learned upsampling, and skip connections from pool3 and pool4.
The paper tried three backbones: AlexNet (39.8 mean IU), GoogLeNet (42.5), and VGG-16 (56.0). VGG-16 won decisively despite being the slowest (210ms vs 50ms for AlexNet). Its uniform 3×3 filter architecture and depth created the best features for dense prediction.
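For a 500×500 input with 21 PASCAL VOC classes, the stride-32 score map is 10×10×21, and the pool4 and pool3 skip branches contribute 21-channel score maps at strides 16 and 8.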
The convolutionalized VGG-16 has 134M parameters, most of them in the converted fc6 layer (7×7×512×4096 ≈ 103M weights). The 1×1 convolutions for skip predictions add negligible parameters. The transposed convolution filters for upsampling are initialized to bilinear and fine-tuned.
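The parameter figures can be reproduced by counting weights and biases layer by layer. The sketch assumes the standard VGG-16 channel configuration and counts one bias per output channel; the function and dictionary keys are named for this example only.

```python
# (in_channels, out_channels) for the thirteen 3x3 convolutions of VGG-16.
CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]


def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def fcn_vgg16_params(num_classes=21):
    counts = {
        "backbone": sum(conv_params(3, i, o) for i, o in CONV_CHANNELS),
        "fc6": conv_params(7, 512, 4096),        # former 25088 -> 4096 FC layer
        "fc7": conv_params(1, 4096, 4096),
        "score": conv_params(1, 4096, num_classes),
    }
    counts["total"] = sum(counts.values())
    # Skip scorers, reported separately because they are tiny by comparison.
    counts["skip_pool4"] = conv_params(1, 512, num_classes)
    counts["skip_pool3"] = conv_params(1, 256, num_classes)
    return counts


counts = fcn_vgg16_params()
for name, value in counts.items():
    print(f"{name:>10}: {value:>12,}")
```

For 21 classes the total comes to 134,346,581, of which fc6 alone contributes 102,764,544, while the two skip scorers together add about sixteen thousand.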
Training FCN requires careful attention to loss functions, optimization, and the fine-tuning strategy.
The loss is a standard multinomial logistic loss (cross-entropy), computed independently at every pixel and averaged over the valid pixels:

$$L = -\frac{1}{N} \sum_{(i,j)} \log p_{ij}(y_{ij})$$

where $y_{ij}$ is the ground-truth class at pixel $(i,j)$, $p_{ij}(y_{ij})$ is the softmax probability the network assigns to that class, the sum runs over valid pixels, and $N$ is the total number of valid pixels. Pixels marked as ambiguous or difficult in the ground truth are masked out and ignored.
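A per-pixel cross-entropy with masking can be sketched in NumPy as follows. The ignore value 255 and the function name are assumptions made for this example, and the loss is averaged over valid pixels to match the definition of N given here.

```python
import numpy as np

IGNORE_LABEL = 255  # assumed marker for ambiguous pixels


def pixelwise_cross_entropy(logits, labels, ignore_label=IGNORE_LABEL):
    """Mean cross-entropy over valid pixels.

    logits: (C, H, W) raw class scores; labels: (H, W) integer class indices.
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))

    valid = labels != ignore_label
    n_valid = valid.sum()
    if n_valid == 0:
        return 0.0  # nothing to learn from in this image

    # Replace ignored labels by a dummy class so indexing stays in range.
    safe_labels = np.where(valid, labels, 0)
    picked = np.take_along_axis(log_probs, safe_labels[None], axis=0)[0]
    return float(-picked[valid].sum() / n_valid)


logits = np.zeros((21, 4, 4))          # uniform prediction over 21 classes
labels = np.full((4, 4), 3)
labels[0, 0] = IGNORE_LABEL            # one masked pixel

print(pixelwise_cross_entropy(logits, labels))  # log(21), about 3.04
```

A uniform prediction over 21 classes gives a loss of log 21 regardless of how many pixels are masked, which is a handy sanity check at the start of training with a zero-initialized scoring layer.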
SGD with momentum 0.9, weight decay 5×10⁻⁴, minibatch size 20 images. Learning rate: 10⁻⁴ for the VGG-16 backbone. Biases get 2× the learning rate. The class scoring layer is zero-initialized (not random: the paper found no benefit from random init).
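As a rough sketch of how these hyperparameters enter one update step, the snippet below implements a common form of SGD with momentum and L2 weight decay. The exact update rule of the original training framework is not given here, and applying weight decay to biases is a simplification of this example, so treat both as assumptions.

```python
import numpy as np

BASE_LR = 1e-4        # learning rate quoted for the VGG-16 backbone
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4


def sgd_momentum_step(param, grad, velocity, lr,
                      momentum=MOMENTUM, weight_decay=WEIGHT_DECAY):
    """One update: the velocity accumulates the decayed gradient, then moves the parameter."""
    velocity = momentum * velocity - lr * (grad + weight_decay * param)
    return param + velocity, velocity


# Zero-initialized scoring layer: 21 classes on 4096 input channels.
weights, w_vel = np.zeros((21, 4096)), np.zeros((21, 4096))
bias, b_vel = np.zeros(21), np.zeros(21)

# Placeholder gradients; in training these come from backpropagation.
w_grad, b_grad = np.ones((21, 4096)), np.ones(21)

weights, w_vel = sgd_momentum_step(weights, w_grad, w_vel, lr=BASE_LR)
bias, b_vel = sgd_momentum_step(bias, b_grad, b_vel, lr=2 * BASE_LR)  # biases: 2x rate

print(weights[0, 0], bias[0])  # -0.0001 and -0.0002 after one step
```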
Unlike prior work that randomly sampled patches, FCN trains on whole images. Each image is effectively a batch of all overlapping patches. The paper showed this is just as effective as patch sampling, but faster because convolutional computation is shared. No data augmentation was used (random mirroring and jittering yielded no improvement).
PASCAL VOC labels are mildly unbalanced (~3/4 background). The paper found class balancing unnecessary — the per-pixel loss naturally handles the imbalance for this dataset. This is a notable simplification over prior methods that required careful class-frequency weighting.
FCN was evaluated on three benchmarks: PASCAL VOC (the main benchmark), NYUDv2 (RGB-D indoor scenes), and SIFT Flow (outdoor scenes with geometric labels).
FCN-8s achieved 62.2% mean IU on the PASCAL VOC 2012 test set, a 20% relative improvement over the previous state-of-the-art (SDS at 51.6%). Inference time: ~175ms per image, compared to ~50 seconds for SDS. That is roughly a 286× speedup.
Mean IU scores comparing FCN variants and prior methods.
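Mean IU is the intersection-over-union of each class, averaged across classes. The sketch below computes it from a confusion matrix on a tiny made-up example and then rechecks the headline arithmetic; the helper names and the ignore value 255 are assumptions of this example.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes, ignore_label=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = target != ignore_label
    idx = num_classes * target[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def mean_iu(conf):
    """Average of intersection / union over the classes that occur at all."""
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0
    return float((intersection[present] / union[present]).mean())


target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])   # one class-0 pixel mislabelled as class 1

conf = confusion_matrix(pred, target, num_classes=2)
print(mean_iu(conf))                # (1/2 + 2/3) / 2, about 0.583

# Rechecking the reported comparison with SDS.
relative_gain = (62.2 - 51.6) / 51.6   # about 0.205, i.e. roughly 20%
speedup = 50.0 / 0.175                 # about 286
print(round(relative_gain, 3), round(speedup))
```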
FCN-16s with RGB-HHA late fusion achieved 34.0 mean IU, improving over Gupta et al.’s 28.6. The “HHA” encoding converts depth into three channels: horizontal disparity, height above ground, and the angle between the local surface normal and the inferred gravity direction. This is a richer representation than raw depth.
FCN-16s with a two-headed architecture simultaneously predicted 33 semantic classes and 3 geometric classes, achieving 85.2% pixel accuracy on semantics and 94.3% on geometry — state-of-the-art on both tasks with essentially the speed of a single model.
The FCN paper reveals a beautiful organizing principle: different layers of the network encode fundamentally different types of information, and the skip architecture exploits this structure.
Pool1 and pool2 features respond to edges, textures, and colors. They have small receptive fields (a few pixels) and high spatial resolution. These features know where boundaries are, but have no idea what object they belong to. A cat’s ear edge looks the same as a car’s fender edge at this level.
Pool3 and pool4 features respond to parts and patterns — eyes, wheels, window grids. They have medium receptive fields and medium resolution. These are the sweet spot: they carry enough semantic information to distinguish object classes, while retaining enough spatial detail to localize boundaries. This is why pool3 and pool4 are the optimal skip sources.
Pool5 and the convolutionalized FC layers respond to whole objects and scenes. They have huge receptive fields (hundreds of pixels) but very low spatial resolution. They know what is in the image (a cat sitting on a chair in a room) but can only localize it to a 32×32 pixel block.
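The receptive field and stride at each stage can be computed with a standard recursion over kernel sizes and strides. This sketch assumes the usual VGG-16 layout (3×3 convolutions with stride 1, 2×2 poolings with stride 2, and a 7×7 convolutionalized fc6); the layer names follow the common convention and the functions are illustrative.

```python
def vgg16_layers():
    """(name, kernel, stride) for every layer up to the convolutionalized fc6."""
    layers = []
    for block, n_convs in enumerate([2, 2, 3, 3, 3], start=1):
        layers += [(f"conv{block}_{i}", 3, 1) for i in range(1, n_convs + 1)]
        layers.append((f"pool{block}", 2, 2))
    layers.append(("fc6", 7, 1))
    return layers


def receptive_fields(layers):
    """Map each layer name to (receptive field in input pixels, cumulative stride)."""
    rf, jump, out = 1, 1, {}
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump   # each extra tap reaches `jump` input pixels further
        jump *= stride              # spacing between neighbouring outputs, in input pixels
        out[name] = (rf, jump)
    return out


fields = receptive_fields(vgg16_layers())
for name in ["pool1", "pool2", "pool3", "pool4", "pool5", "fc6"]:
    rf, stride = fields[name]
    print(f"{name}: receptive field {rf}x{rf}, stride {stride}")
```

Under these assumptions the receptive field grows from 6 pixels at pool1 to 212 at pool5 and 404 after fc6, while the stride goes from 2 to 32, which is the what-versus-where trade-off described above in numbers.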
Shallow layers detect edges and textures (where). Deep layers recognize objects (what). The skip architecture fuses both.
The paper shows an illuminating failure case: lifejackets on a boat are misclassified as people. This reveals that even with skip connections, the network occasionally relies too heavily on appearance (bright-colored blob in a boat context) over geometric structure (lifejackets are not human-shaped). Later architectures with CRF post-processing (DeepLab) or attention mechanisms partially address this.
VGGNet (Simonyan & Zisserman, 2014): The backbone architecture. VGG-16’s uniform 3×3 convolutions and ImageNet-pretrained features provided the foundation for FCN’s dense predictions. The paper also tested AlexNet and GoogLeNet backbones.
OverFeat (Sermanet et al., 2013): Introduced the shift-and-stitch trick for dense prediction with convnets. FCN considered this approach but found learned upsampling more effective.
Transfer learning (Donahue et al., 2014): Demonstrated that features learned on ImageNet transfer to other visual tasks. FCN extended this from classification to pixel-level prediction.
U-Net (Ronneberger et al., 2015): Extended FCN’s skip connections into a symmetric encoder-decoder with skip concatenation (not addition). Became the standard for biomedical image segmentation and many other domains.
DeepLab (Chen et al., 2014-2018): Combined FCN-style dense prediction with atrous (dilated) convolutions and CRF post-processing for sharper boundaries. DeepLabv3+ remains widely used.
Feature Pyramid Network (FPN) (Lin et al., 2017): Generalized FCN’s multiscale skip architecture for object detection, creating a top-down pathway with lateral connections at every pyramid level.
PSPNet (Zhao et al., 2017): Added a pyramid pooling module on top of FCN features to capture multi-scale context, winning the ImageNet Scene Parsing Challenge.
Mask R-CNN (He et al., 2017): Combined FPN with an FCN-style mask prediction head for instance segmentation, enabling per-object pixel masks.