Yu & Koltun — Princeton / Intel Labs, ICLR 2016 • arXiv:1511.07122

Multi-Scale Context Aggregation by Dilated Convolutions

How do you see fine detail and big-picture context at the same time? This paper grows the receptive field exponentially — without ever losing a single pixel of resolution.

Prerequisites: Convolution basics + Semantic segmentation intuition (we derive the rest)
10 Chapters · 2 Simulations · 10 Quizzes

Chapter 0: The Dilemma

You want to label every single pixel in a photograph. This pixel is "cat." That pixel is "grass." The one next to it is "sky." This task is called semantic segmentation, and it requires two things that seem to contradict each other.

First, you need fine-grained detail. The boundary between the cat's ear and the sky is only a few pixels wide. You cannot afford to blur it.

Second, you need big-picture context. To know that a dark region is a cat and not a shadow, you need to see the whole animal — the ears, the body, the tail. That means your network needs a wide receptive field: the region of the input image that influences each output prediction.

The fundamental tension: Classification networks get context by pooling and downsampling — shrinking the image until one neuron sees everything. But segmentation needs pixel-level output. Every time you downsample, you lose resolution. Every time you keep resolution, you limit how far each neuron can see. How do you get both?

By 2015, the standard approach was to take a classification network (like VGG-16), remove the fully-connected layers, and bolt on upsampling to recover the lost resolution. This worked, but it was a compromise. The network threw away spatial information through pooling, then tried desperately to reconstruct it.

Yu and Koltun asked a different question: what if we never lose the resolution in the first place?

Why is semantic segmentation harder than image classification?

Chapter 1: What Is Dense Prediction?

In image classification, the network looks at an entire image and produces one label: "cat," "dog," "airplane." The output is a single vector of class probabilities.

In dense prediction, the network produces a label for every pixel. The output is not a vector but a full-resolution map — the same spatial size as the input. Semantic segmentation is one example. Others include depth estimation (predict the distance of each pixel) and optical flow (predict the motion of each pixel).

Think of the difference like this. Classification is like glancing at a photo and saying "beach." Segmentation is like taking a colored pen and carefully outlining every object: this region is sand, this is water, that is a person, that is an umbrella. You need to get the boundaries exactly right.

The resolution requirement: If your input is 512×512, your output should also be 512×512 (or close to it). Every pixel needs its own prediction. This is what makes dense prediction fundamentally different from classification — and why classification architectures are a poor fit without modification.

The metric for semantic segmentation is mean Intersection over Union (mIoU). For each class, you compute the overlap between your prediction and the ground truth, divided by their union. Then you average across all classes. A blurry prediction that gets the rough region right but misses the boundaries will score poorly.
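The mIoU computation described above can be sketched in a few lines. This is a minimal illustration on a single toy label map, not the official benchmark scorer: the function name `mean_iou` is hypothetical, and skipping classes that appear in neither map is an assumption (benchmarks typically accumulate intersections and unions over the whole dataset).

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                inter += (p == c) and (t == c)
                union += (p == c) or (t == c)
        if union > 0:  # assumption: classes absent from both maps are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious)


target = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
pred = [[0, 0, 0, 1],   # one boundary pixel is mislabeled
        [0, 0, 1, 1]]
score = mean_iou(pred, target, num_classes=2)  # (4/5 + 3/4) / 2 = 0.775
```

A single wrong boundary pixel out of eight already pulls the score down to 0.775, which is why blurred boundaries are penalized so heavily.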

| Task | Input | Output | Resolution |
|---|---|---|---|
| Classification | Image | Single label | 1×1 |
| Segmentation | Image | Per-pixel labels | H×W |
| Depth estimation | Image | Per-pixel depth | H×W |
| Optical flow | Image pair | Per-pixel motion | H×W |
What distinguishes dense prediction from image classification?

Chapter 2: The Pooling Trap

Classification networks like VGG-16 use successive pooling layers to build up context. Each 2×2 max-pool halves the spatial dimensions and doubles the receptive field. After five pooling layers, a 224×224 image is reduced to 7×7 feature maps. Each of those 49 neurons effectively "sees" the entire image.

This is perfect for classification. You want to compress spatial information down to a single global decision. But for segmentation, you have a problem.

Input
512 × 512 pixels — full resolution, every detail preserved
↓ pool (2×)
256 × 256
Half resolution. Fine edges start to blur.
↓ pool (2×)
128 × 128
Quarter resolution. Small objects start to vanish.
↓ pool ×3 more
16 × 16
1/32 resolution. Spatial detail is gone. Wide context, but where did the cat's whiskers go?
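The shrinking resolution in the diagram above is simple arithmetic. The helper below is a hypothetical sketch that assumes each pooling layer divides the side length by exactly 2.

```python
def resolution_after_pools(size, num_pools, factor=2):
    """Side length of the feature map before and after each pooling layer."""
    sizes = [size]
    for _ in range(num_pools):
        size //= factor
        sizes.append(size)
    return sizes


sizes = resolution_after_pools(512, 5)    # [512, 256, 128, 64, 32, 16]
pixels_per_cell = (512 // sizes[-1]) ** 2  # 1024 input pixels share one output cell
```

After five pools, each output cell stands in for a 32×32 block of input pixels, so any structure smaller than that block cannot be localized from the final feature map alone.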

The dominant approach in 2015 was FCN (Fully Convolutional Networks) by Long et al. They took VGG-16, converted the fully-connected layers into convolutional layers, and used "deconvolution" (transposed convolution) to upsample the output back to input resolution. FCN-8s fused predictions from three different scales to recover some detail.

Chen et al. (DeepLab) kept the pooling layers but replaced the stride with dilation in later layers, and added a CRF (conditional random field) to sharpen boundaries in post-processing.

Both approaches are workarounds. FCN says: "lose the resolution, then try to reconstruct it." DeepLab says: "lose the resolution, then fix the boundaries with a separate model." Yu and Koltun's insight is that you should never lose the resolution at all. The question is how to get a large receptive field without pooling.
Why do pooling layers hurt dense prediction performance?

Chapter 3: Dilated Convolutions

A standard 3×3 convolution looks at 9 adjacent pixels. The kernel slides across the feature map, and at each position, it multiplies the 3×3 patch by its 9 weights and sums the result. The receptive field — the region of the input that influences each output — is exactly 3×3.

Now imagine spacing out the kernel elements. Instead of touching 9 adjacent pixels, we skip pixels between them. This is a dilated convolution (also called "atrous convolution," from the French à trous, "with holes").

Formally, given a discrete input F, a 3×3 filter k, and a dilation factor l, the dilated convolution is:

$$(F *_l k)(p) = \sum_{s + l t = p} F(s)\, k(t)$$

When l = 1, this is the standard convolution — no gaps. When l = 2, we skip every other pixel. The kernel still has only 9 weights, but it covers a 5×5 area. When l = 4, the 9 weights span a 9×9 area.
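The definition can be transcribed almost literally into code. The sketch below is a plain-Python illustration rather than an efficient implementation; the function name `dilated_conv2d` is hypothetical, and treating positions outside the input as zero is an assumption that the text does not specify.

```python
def dilated_conv2d(F, k, l):
    """(F *_l k)(p) = sum over s + l*t = p of F(s) * k(t), with t in {-1, 0, 1}^2.

    F is a list of rows; k is a 3x3 list indexed as k[ty + 1][tx + 1].
    Positions outside F are treated as zero (an assumption).
    """
    H, W = len(F), len(F[0])
    out = [[0.0] * W for _ in range(H)]
    for py in range(H):
        for px in range(W):
            total = 0.0
            for ty in (-1, 0, 1):
                for tx in (-1, 0, 1):
                    sy, sx = py - l * ty, px - l * tx  # s = p - l*t
                    if 0 <= sy < H and 0 <= sx < W:
                        total += F[sy][sx] * k[ty + 1][tx + 1]
            out[py][px] = total
    return out


# An impulse at the center reveals which offsets the operator touches.
F = [[0.0] * 7 for _ in range(7)]
F[3][3] = 1.0
ones = [[1.0] * 3 for _ in range(3)]
response = dilated_conv2d(F, ones, l=2)
touched = sorted((y - 3, x - 3)
                 for y in range(7) for x in range(7) if response[y][x] != 0)
```

With l = 2 the impulse response has exactly nine nonzero entries at offsets -2, 0 and 2 in each direction: nine weights spread over a 5×5 area, with an output the same size as the input.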

| Dilation Rate (l) | Kernel Size | Parameters | Effective Coverage |
|---|---|---|---|
| 1 | 3×3 | 9 | 3×3 |
| 2 | 3×3 | 9 | 5×5 |
| 4 | 3×3 | 9 | 9×9 |
| 8 | 3×3 | 9 | 17×17 |
| 16 | 3×3 | 9 | 33×33 |
The magic: Same number of parameters. Same computational cost. But the filter "sees" a much larger region. We are not constructing a bigger filter and filling it with zeros — the convolution operator itself is modified to sample at wider spacings. No information is lost because the output has the same spatial resolution as the input.

The name "dilated convolution" is deliberate. The paper emphasizes that no "dilated filter" is actually constructed. The operator simply accesses input elements at stride l instead of stride 1. This is more efficient than literally building a sparse (2l+1)×(2l+1) kernel filled with zeros.
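To make the efficiency point concrete, the following sketch checks that sampling the input at spacing l gives the same result as an ordinary convolution with a zero-filled (2l+1)×(2l+1) kernel, while doing far fewer multiplications. The helper names `conv2d` and `zero_stuffed` are hypothetical, and zero padding at the borders is an assumption.

```python
import random


def conv2d(F, k, step=1):
    """Zero-padded 2D convolution whose kernel offsets are multiplied by `step`.

    k is square with odd side. step=1 is an ordinary convolution;
    step=l with a 3x3 kernel is the dilated operator.
    """
    H, W, r = len(F), len(F[0]), len(k) // 2
    out = [[0.0] * W for _ in range(H)]
    for py in range(H):
        for px in range(W):
            for ty in range(-r, r + 1):
                for tx in range(-r, r + 1):
                    sy, sx = py - step * ty, px - step * tx
                    if 0 <= sy < H and 0 <= sx < W:
                        out[py][px] += F[sy][sx] * k[ty + r][tx + r]
    return out


def zero_stuffed(k, l):
    """Build the sparse (2l+1)x(2l+1) kernel that the operator avoids constructing."""
    side = 2 * l + 1
    big = [[0.0] * side for _ in range(side)]
    for ty in range(3):
        for tx in range(3):
            big[ty * l][tx * l] = k[ty][tx]
    return big


random.seed(0)
l = 4
F = [[random.random() for _ in range(12)] for _ in range(12)]
k = [[random.random() for _ in range(3)] for _ in range(3)]

direct = conv2d(F, k, step=l)             # 9 multiplications per output
stuffed = conv2d(F, zero_stuffed(k, l))   # (2l+1)^2 = 81 multiplications per output
same = all(abs(a - b) < 1e-12
           for ra, rb in zip(direct, stuffed) for a, b in zip(ra, rb))
```

Both paths produce the same feature map, but the zero-filled version spends 72 of its 81 multiplications per output on zeros.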

A 3×3 convolution with dilation rate 4 covers the same area as what size standard convolution?

Chapter 4: Exponential Receptive Fields

Here is where the idea becomes powerful. Stack multiple dilated convolutions with exponentially increasing dilation rates: 1, 2, 4, 8, 16, ...

Apply a 3×3 convolution at each layer, doubling the dilation each time:

$$F_{i+1} = F_i *_{2^i} k_i \quad \text{for } i = 0, 1, 2, \ldots$$

After layer i+1, each element has a receptive field of size $(2^{i+2} - 1) \times (2^{i+2} - 1)$. The receptive field grows exponentially, but the number of parameters per filter at each layer is constant: just 9 weights.

Layer 1 (dilation 1)
Receptive field: 3×3
Layer 2 (dilation 2)
Receptive field: 7×7
Layer 3 (dilation 4)
Receptive field: 15×15
Layer 4 (dilation 8)
Receptive field: 31×31
Layer 5 (dilation 16)
Receptive field: 63×63
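The sizes listed above follow from one rule: a stride-1 layer with a 3×3 kernel and dilation d widens the receptive field by 2d. The helper below is a hypothetical sketch of that bookkeeping.

```python
def receptive_fields(dilations, kernel=3):
    """Side length of the receptive field after each stacked stride-1 layer."""
    rf, sizes = 1, []
    for d in dilations:
        rf += (kernel - 1) * d  # each layer extends the field by (kernel - 1) * d
        sizes.append(rf)
    return sizes


dilated = receptive_fields([1, 2, 4, 8, 16])  # [3, 7, 15, 31, 63]
standard = receptive_fields([1] * 5)          # [3, 5, 7, 9, 11]
```

Passing `[1] * 31` shows that 31 undilated layers are needed to reach the same 63×63 field.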

Compare this to standard convolutions without pooling. A stack of n layers of 3×3 convolutions (dilation 1) gives a receptive field of (2n+1) × (2n+1) — linear growth. You would need 31 layers of standard 3×3 convolutions to get a 63×63 receptive field. The dilated approach does it in 5 layers.

Linear vs exponential: Standard convolutions grow the receptive field by 2 pixels per layer. Dilated convolutions with doubling rates roughly double it at each layer. After n layers: standard gives $(2n+1) \times (2n+1)$, dilated gives $(2^{n+1}-1) \times (2^{n+1}-1)$. This exponential growth is what makes large receptive fields practical without pooling.
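Both closed forms follow from a one-line recurrence. Writing r_n for the side length of the receptive field after n layers (a symbol introduced here for illustration), a 3×3 layer with dilation d adds 2d to the side length:

```latex
\begin{aligned}
\text{standard:}\quad & r_{i+1} = r_i + 2, \quad r_0 = 1
  &&\Longrightarrow\quad r_n = 2n + 1, \\
\text{dilated:}\quad & r_{i+1} = r_i + 2 \cdot 2^{i}, \quad r_0 = 1
  &&\Longrightarrow\quad r_n = 1 + 2\sum_{i=0}^{n-1} 2^{i} = 1 + 2\,(2^{n} - 1) = 2^{n+1} - 1 .
\end{aligned}
```

For n = 5 this gives 11 for the standard stack and 63 for the dilated stack.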

Crucially, this exponential expansion happens without any loss of resolution or coverage. Every input pixel contributes to the computation. There are no gaps in the receptive field — each position in the input is covered. The paper proves this by showing the receptive field is a dense square at every layer, not a sparse set of scattered points.
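The no-gaps claim can be checked by brute force in one dimension (the 2D field is the product of two such 1D fields). The sketch below enumerates every input offset that can reach a single output through a stack of 3-tap layers. The function name is hypothetical, and the second schedule, which starts at dilation 4, is an illustrative contrast that does not come from the paper.

```python
from itertools import product


def reachable_offsets(dilations):
    """1D input offsets that can influence one output of stacked 3-tap dilated layers."""
    offsets = set()
    for taps in product((-1, 0, 1), repeat=len(dilations)):
        offsets.add(sum(t * d for t, d in zip(taps, dilations)))
    return offsets


doubling = reachable_offsets([1, 2, 4, 8])
dense = doubling == set(range(-15, 16))  # every offset in the 31-wide field is hit

gappy = reachable_offsets([4, 8])        # contrast: only multiples of 4 are reached
```

Starting at dilation 1 and doubling is what fills the field: each layer's gaps are exactly the width already covered by the layers before it.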

After three layers of 3×3 dilated convolutions with rates [1, 2, 4], what is the receptive field size?

Chapter 5: The Context Module

The paper's central contribution is a plug-in module that can be added to any dense prediction architecture. The context module takes C feature maps as input and produces C feature maps as output — same shape in, same shape out. You can insert it between a front-end predictor and a final classifier without changing anything else.

The basic context module has 8 layers:

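| Layer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Kernel | 3×3 | 3×3 | 3×3 | 3×3 | 3×3 | 3×3 | 3×3 | 1×1 |
| Dilation | 1 | 1 | 2 | 4 | 8 | 16 | 1 | 1 |
| Receptive field | 3×3 | 5×5 | 9×9 | 17×17 | 33×33 | 65×65 | 67×67 | 67×67 |
| Channels (basic) | C | C | C | C | C | C | C | C |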

Each of the first seven layers applies a 3×3×C convolution (processing all channels) followed by a ReLU activation. The final 1×1 layer produces the output. The dilation rates of the 3×3 layers are 1, 1, 2, 4, 8, 16, 1: an exponential ramp followed by one undilated layer. This gives a 67×67 receptive field with only about 64C² parameters in total (seven layers of 9C² weights plus C² for the 1×1 layer).
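The receptive-field row and the 64C² parameter estimate can both be reproduced from the layer list. This is a bookkeeping sketch only: biases are ignored, and C = 21 is chosen here as an example value.

```python
# Basic context module: (kernel size, dilation) for each of the 8 layers.
LAYERS = [(3, 1), (3, 1), (3, 2), (3, 4), (3, 8), (3, 16), (3, 1), (1, 1)]


def module_summary(layers, C):
    """Return per-layer receptive field sizes and the total weight count (no biases)."""
    rf, fields, weights = 1, [], 0
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
        fields.append(rf)
        weights += kernel * kernel * C * C  # C input channels, C output channels
    return fields, weights


fields, weights = module_summary(LAYERS, C=21)
# fields  -> [3, 5, 9, 17, 33, 65, 67, 67]
# weights -> 7 * 9 * 21**2 + 21**2 = 64 * 21**2 = 28224
```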

Identity initialization: Standard random initialization did not work for this module. The authors found that identity initialization was essential: set all filters so each layer passes the input directly to the next. Formally, $k^b(t, a) = 1_{[t=0]}\, 1_{[a=b]}$, where a indexes the input feature map and b the output feature map: the weight is 1 at the kernel center when input and output channel match, and 0 otherwise. The module starts by doing nothing, then backpropagation gradually learns to aggregate context. Le et al. advocated a similar initialization for recurrent networks.
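A small sketch shows why this initialization makes the module a no-op at the start of training. It uses 1D signals with 3-tap filters for brevity (a simplification of the 2D case), the helper names are hypothetical, and the input is taken to be non-negative so that ReLU leaves it untouched.

```python
def identity_filters(C):
    """k[b][a][t + 1] = 1 if t == 0 and a == b else 0, for 3-tap filters over C channels."""
    return [[[1.0 if (t == 0 and a == b) else 0.0 for t in (-1, 0, 1)]
             for a in range(C)] for b in range(C)]


def dilated_layer(x, k, dilation):
    """One zero-padded 1D dilated conv layer followed by ReLU; x[a] is channel a."""
    C, n = len(x), len(x[0])
    out = []
    for b in range(C):
        row = []
        for p in range(n):
            total = 0.0
            for a in range(C):
                for t in (-1, 0, 1):
                    s = p - dilation * t
                    if 0 <= s < n:
                        total += x[a][s] * k[b][a][t + 1]
            row.append(max(total, 0.0))  # ReLU
        out.append(row)
    return out


x = [[0.5, 1.0, 0.0, 2.0, 3.0],
     [1.5, 0.0, 4.0, 0.25, 1.0]]  # non-negative, as after a previous ReLU
y = x
for d in (1, 1, 2, 4, 8, 16, 1):
    y = dilated_layer(y, identity_filters(2), d)
unchanged = (y == x)
```

Seven stacked layers return the input exactly, so gradients at the start of training see the front-end output unchanged and the module can only move away from that baseline by learning.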

The authors also designed a large context module that widens the channel count in deeper layers: 2C, 2C, 4C, 8C, 16C, 32C, 32C, C across the eight layers. More channels in deeper layers capture richer multi-scale features, at the cost of more parameters.

The beauty of this design: it is a rectangular prism of convolutions. No pooling. No subsampling. No upsampling. No skip connections. Just convolutions with increasing dilation, stacked in a box. Every intermediate feature map has the same spatial resolution as the input.

Why was identity initialization critical for the context module?

Chapter 6: The Front End

The context module needs input feature maps. Where do they come from? The paper builds a front-end module by adapting VGG-16 for dense prediction — and simplifying it by removing components that were designed for classification but actually hurt segmentation.

The key modifications to VGG-16:

Remove last 2 pooling layers
Pool-4 and Pool-5 are deleted entirely. No striding either. This keeps the feature maps at 1/8 resolution instead of 1/32.
Dilate subsequent convolutions
Every conv after the removed pools gets dilation 2 for each removed pool. Convs after both removed pools use dilation 4.
Remove intermediate padding
Padding was used in VGG for classification but is neither necessary nor justified for dense prediction.
Output: 64×64 feature maps
21 channels (one per VOC class). Spatial detail is kept at the 1/8 scale instead of being pooled down to 1/32.
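The first two modifications above amount to a simple rule: walk through the VGG-16 blocks, and whenever a pool is deleted, double the dilation of everything after it instead of halving the resolution. The sketch below is a hypothetical planner, not the authors' code. It assumes the standard VGG-16 layout of 2, 2, 3, 3, 3 convolutions per block and does not model padding removal.

```python
# VGG-16 layout: number of 3x3 convs in each block; every block ends with a pool.
VGG16_BLOCKS = [2, 2, 3, 3, 3]


def front_end_plan(blocks, removed_pools):
    """Return (dilation of each conv by block, output stride, dilation after last block).

    removed_pools holds 1-based pool indices. Each deleted pool doubles the
    dilation of every later convolution instead of halving the resolution.
    """
    dilation, stride, plan = 1, 1, []
    for i, n_convs in enumerate(blocks, start=1):
        plan.append([dilation] * n_convs)
        if i in removed_pools:
            dilation *= 2
        else:
            stride *= 2
    return plan, stride, dilation


plan, stride, tail_dilation = front_end_plan(VGG16_BLOCKS, removed_pools={4, 5})
# plan          -> [[1, 1], [1, 1], [1, 1, 1], [1, 1, 1], [2, 2, 2]]
# stride        -> 8   (1/8 resolution)
# tail_dilation -> 4   (layers that follow both removed pools)
```

Calling it with `removed_pools=set()` recovers the original classification network: no dilation anywhere and an output stride of 32.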

This is simpler than both FCN-8s (which kept the pooling and added multi-scale skip connections) and DeepLab (which kept the pooling layers but replaced stride with dilation). The authors found that removing these vestiges of the classification architecture actually increased accuracy.

| Model | Approach | mIoU (VOC test) |
|---|---|---|
| FCN-8s | Keep pools, add deconv + skip fusions | 62.2% |
| DeepLab | Keep pools, replace stride with dilation | 62.1% |
| DeepLab-MSc | + multi-scale input | 62.9% |
| This paper | Remove pools entirely + dilate | 67.6% |
Simplification improves accuracy. The front-end module alone — without the context module or any CRF post-processing — outperformed DeepLab+CRF (66.4%). Sometimes the best thing you can do is remove the parts that do not belong. The pooling layers were vestigial organs from classification: useful for their original purpose, harmful when repurposed for dense prediction.
How did the authors adapt VGG-16 for their front-end module?

Chapter 7: Showcase — Receptive Field Explorer

Now you can see the core idea for yourself. The simulation below shows two approaches side-by-side: standard convolutions with pooling (left) vs dilated convolutions (right).

On the left, standard 3×3 convolutions are stacked. To match the dilated network's receptive field, pooling is required — but watch the resolution shrink. On the right, dilated convolutions grow the receptive field exponentially while keeping every pixel.

Use the Layers slider to add layers and watch the receptive field grow. The highlighted cells show which input pixels influence one output pixel.

Standard + Pooling vs Dilated Convolutions

Left: standard conv with pooling — receptive field grows, but resolution shrinks. Right: dilated conv — receptive field grows exponentially, resolution unchanged. Green cells show the receptive field; blue cells show sampled positions at the current layer.


The key takeaway: after 4 layers, the dilated network sees a 31×31 region of the input from every single output pixel — while the standard network either (a) only sees a 9×9 region without pooling, or (b) sees the 31×31 region but at 1/8 the resolution with pooling. Dilated convolutions give you the best of both worlds.

Receptive Field Growth: Standard vs Dilated

Watch the receptive field width grow over layers. With standard convolutions it grows linearly. With doubling dilation rates it grows exponentially.

After 5 layers, what is the receptive field of dilated convolutions with rates [1, 2, 4, 8, 16] vs 5 standard 3×3 convolutions?

Chapter 8: The Experiments

The paper evaluates on Pascal VOC 2012, the standard benchmark for semantic segmentation. The experiments are carefully controlled: each component is tested in isolation, and the context module is plugged into three different architectures to show it helps consistently.

Experiment 1: Front-end alone. The simplified front-end (VGG-16 with pools removed) already beats FCN-8s and DeepLab by 5+ points mIoU. Removing vestigial classification components helps.

Experiment 2: Adding the context module. The context module is plugged into three different setups: (1) front-end alone, (2) front-end + dense CRF, (3) front-end + CRF-RNN. In every case, the context module improves accuracy. The large context module helps more than the basic one.

| Architecture | No Context | + Basic | + Large |
|---|---|---|---|
| Front-end only | 69.8% | 70.9% | 71.7% |
| + Dense CRF | 72.1% | 72.7% | 73.3% |
| + CRF-RNN | 71.6% | 72.5% | 73.5% |
Context + CRF are synergistic. The context module helps with or without subsequent structured prediction. And structured prediction helps with or without the context module. They address different aspects: the context module aggregates multi-scale information, the CRF sharpens boundaries. Combining them gives the best results.

Experiment 3: Test set results. On the VOC-2012 test set, the full system (front-end + large context + CRF-RNN) achieves 75.3% mIoU, outperforming all prior work at the time.

Experiment 4: Additional datasets. The paper also evaluates on KITTI and Cityscapes. On Cityscapes (2048×1024 images), they add two more dilated layers (dilation 32 and 64), creating a 10-layer context module called Dilation10. The model outperformed all prior work in the Cityscapes benchmark evaluation by Cordts et al.

| Dataset | Model | mIoU |
|---|---|---|
| VOC 2012 (test) | Front-end + Large ctx + CRF-RNN | 75.3% |
| KITTI | Dilation7 | Outperforms DeepLab-LFOV |
| Cityscapes (test) | Dilation10 | 67.1% (category: 86.5%) |
What happens when the context module is combined with CRF-based structured prediction?

Chapter 9: Connections

Dilated convolutions did not just improve one benchmark. They became a fundamental building block for dense prediction across computer vision.

Dilated convolutions and DeepLab. Chen et al. had already used dilation in DeepLab (calling it "atrous convolution"), but only to simplify the adapted classification network. Yu and Koltun went further: they designed a module from scratch specifically for multi-scale context aggregation, with the exponentially increasing dilation rates as the core architectural principle. DeepLabv2 and v3 later adopted multi-scale dilation rates (ASPP — Atrous Spatial Pyramid Pooling) directly inspired by this work.

Dilated convolutions and WaveNet. Van den Oord et al. (2016) used the same idea for audio generation: stacked dilated causal convolutions with rates 1, 2, 4, ..., 512 give WaveNet a receptive field of thousands of audio samples while keeping sample-level resolution. The architecture is strikingly similar to this paper's context module, applied to a 1D signal.
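The 1D causal variant is easy to sketch. The code below assumes 2-tap filters (one tap on the current sample, one tap `dilation` samples in the past), which is how WaveNet is usually drawn. The function names are hypothetical, and this is an illustration of the idea rather than WaveNet's actual layer, which also includes gating and residual connections.

```python
def causal_dilated_conv(x, w, dilation):
    """y[t] = w[0] * x[t - dilation] + w[1] * x[t], with zeros before the signal starts."""
    return [w[0] * (x[t - dilation] if t >= dilation else 0.0) + w[1] * x[t]
            for t in range(len(x))]


def causal_receptive_field(dilations, kernel=2):
    """Number of past-and-present samples that influence one output sample."""
    return 1 + sum((kernel - 1) * d for d in dilations)


rates = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
rf = causal_receptive_field(rates)   # 1024 samples for one stack of ten layers

# Causality check: an impulse at t = 5 never affects outputs before t = 5.
x = [0.0] * 16
x[5] = 1.0
y = x
for d in (1, 2, 4):
    y = causal_dilated_conv(y, [1.0, 1.0], d)
```

One stack of ten layers already covers 1024 samples under these assumptions. The output length equals the input length at every layer, mirroring the resolution-preserving property of the 2D context module.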

Dilated convolutions and the algorithme à trous. The dilated convolution operator comes from wavelet theory. Holschneider et al. (1987) used it in the algorithme à trous for multi-resolution signal decomposition. The paper carefully distinguishes: the algorithme à trous uses dilated convolutions, but is not equivalent to them. The operator is a general tool; the algorithm is a specific application.

Dilated Convolutions (2016)
Exponential receptive fields without resolution loss for dense prediction
↓ adopted by
DeepLabv2/v3, PSPNet
Multi-scale dilated pooling (ASPP) becomes the standard for segmentation
↓ same idea for audio
WaveNet (2016)
Dilated causal convolutions for autoregressive audio generation
↓ enabling
Modern dense prediction
Panoptic segmentation, depth estimation, point cloud processing: areas where dilated convolutions are widely used
The lasting insight. You do not need pooling to see the big picture. Dilation widens the view of each filter at no extra cost: no resolution loss, no extra weights per filter, no upsampling to undo the damage. This principle, that you can expand the receptive field without contracting the representation, has influenced many segmentation architectures since 2016.

Paper details. "Multi-Scale Context Aggregation by Dilated Convolutions," Fisher Yu, Vladlen Koltun. ICLR 2016. arXiv:1511.07122. First submitted November 2015.


Which 2016 audio model used the same exponentially increasing dilation rates for 1D signals?