How do you see fine detail and big-picture context at the same time? This paper grows the receptive field exponentially — without ever losing a single pixel of resolution.
You want to label every single pixel in a photograph. This pixel is "cat." That pixel is "grass." The one next to it is "sky." This task is called semantic segmentation, and it requires two things that seem to contradict each other.
First, you need fine-grained detail. The boundary between the cat's ear and the sky is only a few pixels wide. You cannot afford to blur it.
Second, you need big-picture context. To know that a dark region is a cat and not a shadow, you need to see the whole animal — the ears, the body, the tail. That means your network needs a wide receptive field: the region of the input image that influences each output prediction.
By 2015, the standard approach was to take a classification network (like VGG-16), remove the fully-connected layers, and bolt on upsampling to recover the lost resolution. This worked, but it was a compromise. The network threw away spatial information through pooling, then tried desperately to reconstruct it.
Yu and Koltun asked a different question: what if we never lose the resolution in the first place?
In image classification, the network looks at an entire image and produces one label: "cat," "dog," "airplane." The output is a single vector of class probabilities.
In dense prediction, the network produces a label for every pixel. The output is not a vector but a full-resolution map — the same spatial size as the input. Semantic segmentation is one example. Others include depth estimation (predict the distance of each pixel) and optical flow (predict the motion of each pixel).
Think of the difference like this. Classification is like glancing at a photo and saying "beach." Segmentation is like taking a colored pen and carefully outlining every object: this region is sand, this is water, that is a person, that is an umbrella. You need to get the boundaries exactly right.
The metric for semantic segmentation is mean Intersection over Union (mIoU). For each class, you compute the overlap between your prediction and the ground truth, divided by their union. Then you average across all classes. A blurry prediction that gets the rough region right but misses the boundaries will score poorly.
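As a rough illustration of the metric, the sketch below computes mIoU with NumPy for two integer label maps. The function name `mean_iou` and the choice to skip classes that appear in neither map are assumptions of this example, not the official scoring code of any benchmark (Pascal VOC, for instance, accumulates counts over the whole dataset rather than per image).

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for two integer label maps of equal shape."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (an assumption)
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 4x4 example: class 0 = background, class 1 = object.
target = np.array([[0, 0, 0, 0],
                   [0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 0]])
# Prediction shifted one pixel to the right: right region, wrong boundary.
pred = np.array([[0, 0, 0, 0],
                 [0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 0, 0]])

score = mean_iou(pred, target, num_classes=2)
print(round(score, 3))  # 0.524
```

A one-pixel shift of a small object is enough to drop the object's IoU to 1/3, which is why boundary errors are so costly under this metric.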
| Task | Input | Output | Resolution |
|---|---|---|---|
| Classification | Image | Single label | 1×1 |
| Segmentation | Image | Per-pixel labels | H×W |
| Depth estimation | Image | Per-pixel depth | H×W |
| Optical flow | Image pair | Per-pixel motion | H×W |
Classification networks like VGG-16 use successive pooling layers to build up context. Each 2×2 max-pool halves the spatial dimensions and doubles the receptive field. After five pooling layers, a 224×224 image is reduced to 7×7 feature maps. Each of those 49 neurons effectively "sees" the entire image.
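This is perfect for classification, where you want to compress spatial information down to a single global decision. For segmentation it is a problem: the 7×7 map is 32 times coarser than the input along each axis, so the pixel-level boundaries are gone.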
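The resolution loss is easy to tabulate. The sketch below assumes each pooling layer simply halves the height and width and ignores the convolutions in between; `pooled_sizes` is a hypothetical helper written for this illustration.

```python
def pooled_sizes(size, num_pools, stride=2):
    """Spatial size after each successive pooling layer with the given stride."""
    sizes = [size]
    for _ in range(num_pools):
        size //= stride
        sizes.append(size)
    return sizes

sizes = pooled_sizes(224, num_pools=5)
print(sizes)  # [224, 112, 56, 28, 14, 7]

# Each cell of the final map stands for a 32x32 block of input pixels.
downsampling = sizes[0] // sizes[-1]
print(downsampling)  # 32
```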
The dominant approach in 2015 was FCN (Fully Convolutional Networks) by Long et al. They took VGG-16, converted the fully-connected layers into convolutional layers, and used "deconvolution" (transposed convolution) to upsample the output back to input resolution. FCN-8s fused predictions from three different scales to recover some detail.
Chen et al. (DeepLab) kept the pooling layers but replaced the stride with dilation in later layers, and added a CRF (conditional random field) to sharpen boundaries in post-processing.
A standard 3×3 convolution looks at 9 adjacent pixels. The kernel slides across the feature map, and at each position, it multiplies the 3×3 patch by its 9 weights and sums the result. The receptive field — the region of the input that influences each output — is exactly 3×3.
Now imagine spacing out the kernel elements. Instead of touching 9 adjacent pixels, we skip pixels between them. This is a dilated convolution (also called "atrous convolution," from the French à trous, "with holes").
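Formally, given a discrete input F, a 3×3 filter k, and a dilation factor l, the dilated convolution $*_l$ is:

$$(F *_l k)(\mathbf{p}) = \sum_{\mathbf{s} + l\,\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$$

Here p and s range over pixel positions in F, and t ranges over the positions of the filter taps.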
When l = 1, this is the standard convolution — no gaps. When l = 2, we skip every other pixel. The kernel still has only 9 weights, but it covers a 5×5 area. When l = 4, the 9 weights span a 9×9 area.
| Dilation Rate (l) | Kernel Size | Parameters | Effective Coverage |
|---|---|---|---|
| 1 | 3×3 | 9 | 3×3 |
| 2 | 3×3 | 9 | 5×5 |
| 4 | 3×3 | 9 | 9×9 |
| 8 | 3×3 | 9 | 17×17 |
| 16 | 3×3 | 9 | 33×33 |
The name "dilated convolution" is deliberate. The paper emphasizes that no "dilated filter" is actually constructed. The operator simply reads input elements spaced l pixels apart instead of adjacent ones. This is more efficient than literally building a sparse (2l+1)×(2l+1) kernel filled with zeros.
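A minimal NumPy sketch of the operator makes this concrete. It reads the input at spacing l directly and, for comparison only, also builds the zero-filled kernel that the operator avoids. Two assumptions of this example: it uses the cross-correlation convention common in deep learning frameworks (no kernel flip), and it computes only "valid" outputs with no padding. The function names are hypothetical.

```python
import numpy as np

def dilated_conv2d(F, k, l=1):
    """'Valid' 2-D dilated correlation of input F with a square, odd-sized filter k.

    Kernel taps are read from F at spacing l; no enlarged kernel is built.
    """
    span = (k.shape[0] - 1) * l          # extent covered beyond one pixel
    out_h, out_w = F.shape[0] - span, F.shape[1] - span
    out = np.zeros((out_h, out_w))
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            # Shifted view of F for tap (i, j): offsets are multiples of l.
            out += k[i, j] * F[i * l:i * l + out_h, j * l:j * l + out_w]
    return out

def zero_filled_kernel(k, l):
    """The explicit sparse kernel that the operator above never constructs."""
    size = (k.shape[0] - 1) * l + 1
    big = np.zeros((size, size))
    big[::l, ::l] = k
    return big

rng = np.random.default_rng(0)
F = rng.standard_normal((12, 12))
k = rng.standard_normal((3, 3))

direct = dilated_conv2d(F, k, l=2)                          # 9 taps, spacing 2
explicit = dilated_conv2d(F, zero_filled_kernel(k, 2), 1)   # dense 5x5 kernel
print(np.allclose(direct, explicit))  # True
```

Both routes give the same output, but the direct version does 9 multiply-adds per output element instead of 25.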
Here is where the idea becomes powerful. Stack multiple dilated convolutions with exponentially increasing dilation rates: 1, 2, 4, 8, 16, ...
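Apply a 3×3 convolution at each layer, doubling the dilation each time:

$$F_{i+1} = F_i *_{2^i} k_i \quad \text{for } i = 0, 1, \ldots, n-2$$

Here $F_0$ is the input, each $k_i$ is a 3×3 filter, and $*_{2^i}$ denotes convolution with dilation factor $2^i$.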
After layer i+1, each element has a receptive field of size $(2^{i+2} - 1) \times (2^{i+2} - 1)$. The receptive field grows exponentially, but the number of parameters at each layer is constant: just 9 weights.
Compare this to standard convolutions without pooling. A stack of n layers of 3×3 convolutions (dilation 1) gives a receptive field of (2n+1) × (2n+1) — linear growth. You would need 31 layers of standard 3×3 convolutions to get a 63×63 receptive field. The dilated approach does it in 5 layers.
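The comparison can be checked with a few lines of arithmetic. The sketch below relies on the standard rule that a stride-1 convolution with kernel size k and dilation d widens the receptive field by (k − 1)·d; the helper name `receptive_field` is an assumption of this example.

```python
def receptive_field(dilations, kernel_size=3):
    """Side length of the receptive field after stacking stride-1 convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d   # each layer adds (k - 1) * dilation
    return rf

standard = [receptive_field([1] * n) for n in range(1, 6)]
dilated = [receptive_field([2 ** i for i in range(n)]) for n in range(1, 6)]
print(standard)  # [3, 5, 7, 9, 11]
print(dilated)   # [3, 7, 15, 31, 63]

# Number of plain 3x3 layers needed to match the 5-layer dilated stack.
layers_needed = next(n for n in range(1, 100)
                     if receptive_field([1] * n) >= dilated[-1])
print(layers_needed)  # 31
```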
Crucially, this exponential expansion happens without any loss of resolution or coverage. Every input pixel contributes to the computation. There are no gaps in the receptive field: each position in the input is covered. The paper illustrates this by showing that the receptive field is a dense square at every layer, not a sparse set of scattered points.
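The no-gaps property can be checked by enumerating which input offsets reach a single output. The sketch below works in 1-D with 3-tap filters; because a 3×3 kernel is a full grid, the 2-D receptive field is the product of two such 1-D sets. The helper names are hypothetical, and the second schedule is only a contrast case added for illustration.

```python
def reachable_offsets(dilations):
    """1-D input offsets that influence one output of stacked 3-tap dilated convs."""
    offsets = {0}
    for d in dilations:
        offsets = {o + t * d for o in offsets for t in (-1, 0, 1)}
    return offsets

def is_dense(dilations):
    """True if the reachable offsets form a contiguous interval (no holes)."""
    offsets = reachable_offsets(dilations)
    return offsets == set(range(min(offsets), max(offsets) + 1))

doubling = [1, 2, 4, 8, 16]
print(is_dense(doubling), len(reachable_offsets(doubling)))  # True 63

# Contrast: a schedule that skips dilation 1 reaches only even offsets.
print(is_dense([2, 4, 8]))  # False
```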
The paper's central contribution is a plug-in module that can be added to any dense prediction architecture. The context module takes C feature maps as input and produces C feature maps as output — same shape in, same shape out. You can insert it between a front-end predictor and a final classifier without changing anything else.
The basic context module has 8 layers:
| Layer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Kernel | 3×3 | 3×3 | 3×3 | 3×3 | 3×3 | 3×3 | 3×3 | 1×1 |
| Dilation | 1 | 1 | 2 | 4 | 8 | 16 | 1 | 1 |
| Receptive field | 3×3 | 5×5 | 9×9 | 17×17 | 33×33 | 65×65 | 67×67 | 67×67 |
| Channels (basic) | C | C | C | C | C | C | C | C |
Each of the first seven layers applies a 3×3×C convolution (processing all channels) followed by a ReLU activation. The final 1×1 layer produces the output. The 3×3 layers follow the exponential dilation scheme (1, 1, 2, 4, 8, 16), giving a 67×67 receptive field with only about $64C^2$ parameters total.
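Both numbers can be re-derived from the layer table. The sketch below counts weights only (no biases) and uses C = 21 purely as an example value; the function name `context_module_stats` is an assumption of this illustration.

```python
# Layer layout of the basic context module, as listed in the table above.
KERNELS = [3, 3, 3, 3, 3, 3, 3, 1]
DILATIONS = [1, 1, 2, 4, 8, 16, 1, 1]

def context_module_stats(C):
    """Return (receptive-field side per layer, total weight count) for C channels."""
    rf, fields, weights = 1, [], 0
    for k, d in zip(KERNELS, DILATIONS):
        rf += (k - 1) * d
        fields.append(rf)
        weights += k * k * C * C   # C input channels x C output channels
    return fields, weights

fields, weights = context_module_stats(C=21)
print(fields)                    # [3, 5, 9, 17, 33, 65, 67, 67]
print(weights == 64 * 21 ** 2)   # True
```

Seven 3×3 layers contribute 9C² weights each (63C² in total) and the 1×1 layer adds C², which is where the 64C² figure comes from.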
The authors also designed a large context module that widens the channel count in deeper layers. Its eight layers output 2C, 2C, 4C, 8C, 16C, 32C, 32C, and C channels. More channels in deeper layers capture richer multi-scale features, at the cost of more parameters.
The beauty of this design: it is a rectangular prism of convolutions. No pooling. No subsampling. No upsampling. No skip connections. Just convolutions with increasing dilation, stacked in a box. Every intermediate feature map has the same spatial resolution as the input.
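To see the "same shape in, same shape out" property, here is a minimal NumPy forward pass through the basic layer layout with random weights. It is a sketch under stated assumptions, not the authors' implementation: it zero-pads each layer so the spatial size stays fixed, omits biases, applies ReLU after every layer except the last, and uses small hypothetical sizes (C = 4, 40×40 input).

```python
import numpy as np

def dilated_layer(x, w, d):
    """Same-size dilated convolution. x: (C_in, H, W); w: (C_out, C_in, K, K)."""
    K = w.shape[-1]
    pad = d * (K // 2)                  # zero padding that keeps H and W fixed
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H, W = x.shape[1:]
    out = np.zeros((w.shape[0], H, W))
    for i in range(K):
        for j in range(K):
            # Input window seen by kernel tap (i, j), taps spaced d apart.
            patch = xp[:, i * d:i * d + H, j * d:j * d + W]
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], patch)
    return out

def context_module(x, weights, dilations):
    """Stack of dilated layers with ReLU after every layer except the last."""
    for n, (w, d) in enumerate(zip(weights, dilations)):
        x = dilated_layer(x, w, d)
        if n < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

C, H, W = 4, 40, 40
dilations = [1, 1, 2, 4, 8, 16, 1, 1]
kernels = [3, 3, 3, 3, 3, 3, 3, 1]
rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((C, C, k, k)) for k in kernels]

x = rng.standard_normal((C, H, W))
y = context_module(x, weights, dilations)
print(x.shape, y.shape)  # (4, 40, 40) (4, 40, 40)
```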
The context module needs input feature maps. Where do they come from? The paper builds a front-end module by adapting VGG-16 for dense prediction — and simplifying it by removing components that were designed for classification but actually hurt segmentation.
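The key modifications to VGG-16: the last two pooling and striding layers are removed entirely, the convolutions in all subsequent layers are dilated by a factor of 2 for each removed pooling layer (so the final layers are dilated by 4), and the padding of the intermediate feature maps is dropped.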
This is simpler than both FCN-8s (which kept the pooling and added multi-scale skip connections) and DeepLab (which kept the pooling layers but replaced stride with dilation). The authors found that removing these vestiges of the classification architecture actually increased accuracy.
| Model | Approach | mIoU (VOC test) |
|---|---|---|
| FCN-8s | Keep pools, add deconv + skip fusions | 62.2% |
| DeepLab | Keep pools, replace stride with dilation | 62.1% |
| DeepLab-MSc | + multi-scale input | 62.9% |
| This paper | Remove pools entirely + dilate | 67.6% |
[Interactive figure: two networks side by side, with a slider for the number of layers. Left: stacked standard 3×3 convolutions with pooling; the receptive field grows, but the resolution shrinks. Right: dilated convolutions; the receptive field grows exponentially and the resolution is unchanged. Green cells mark the input pixels that influence one output pixel; blue cells mark the positions sampled at the current layer.]
The key takeaway: after 4 layers, the dilated network sees a 31×31 region of the input from every single output pixel — while the standard network either (a) only sees a 9×9 region without pooling, or (b) sees the 31×31 region but at 1/8 the resolution with pooling. Dilated convolutions give you the best of both worlds.
[Interactive chart: receptive field size versus number of layers. With standard convolutions the side length grows linearly; with dilated convolutions it grows exponentially.]
The paper evaluates on Pascal VOC 2012, the standard benchmark for semantic segmentation. The experiments are carefully controlled: each component is tested in isolation, and the context module is plugged into three different architectures to show it helps consistently.
Experiment 1: Front-end alone. The simplified front-end (VGG-16 with pools removed) already beats FCN-8s and DeepLab by 5+ points mIoU. Removing vestigial classification components helps.
Experiment 2: Adding the context module. The context module is plugged into three different setups: (1) front-end alone, (2) front-end + dense CRF, (3) front-end + CRF-RNN. In every case, the context module improves accuracy. The large context module helps more than the basic one.
| Architecture | No Context | + Basic | + Large |
|---|---|---|---|
| Front-end only | 69.8% | 70.9% | 71.7% |
| + Dense CRF | 72.1% | 72.7% | 73.3% |
| + CRF-RNN | 71.6% | 72.5% | 73.5% |
Experiment 3: Test set results. On the VOC-2012 test set, the full system (front-end + large context + CRF-RNN) achieves 75.3% mIoU, outperforming all prior work at the time.
Experiment 4: Additional datasets. The paper also evaluates on KITTI and Cityscapes. On Cityscapes (2048×1024 images), they add two more dilated layers (dilation 32 and 64), creating a 10-layer context module called Dilation10. The model outperformed all prior work in the Cityscapes benchmark evaluation by Cordts et al.
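The effect of the two extra layers on the receptive field can be estimated with the same arithmetic as before. The layer order below is an assumption of this sketch: it takes the 8-layer schedule and inserts dilations 32 and 64 right after 16.

```python
def receptive_field(dilations, kernel_sizes):
    """Side length of the receptive field for stacked stride-1 convolutions."""
    rf = 1
    for d, k in zip(dilations, kernel_sizes):
        rf += (k - 1) * d
    return rf

# 8-layer basic module: seven 3x3 layers, then a 1x1 layer.
basic = receptive_field([1, 1, 2, 4, 8, 16, 1, 1], [3] * 7 + [1])

# Assumed 10-layer layout: dilations 32 and 64 inserted after 16.
dilation10 = receptive_field([1, 1, 2, 4, 8, 16, 32, 64, 1, 1], [3] * 9 + [1])

print(basic, dilation10)  # 67 259
```

Under that assumption, two more layers nearly quadruple the side of the receptive field, which suits the much larger Cityscapes images.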
| Dataset | Model | mIoU |
|---|---|---|
| VOC 2012 (test) | Front-end + Large ctx + CRF-RNN | 75.3% |
| KITTI | Dilation7 | Outperforms DeepLab-LFOV |
| Cityscapes (test) | Dilation10 | 67.1% (category: 86.5%) |
Dilated convolutions did not just improve one benchmark. They became a fundamental building block for dense prediction across computer vision.
Dilated convolutions and DeepLab. Chen et al. had already used dilation in DeepLab (calling it "atrous convolution"), but only to simplify the adapted classification network. Yu and Koltun went further: they designed a module from scratch specifically for multi-scale context aggregation, with the exponentially increasing dilation rates as the core architectural principle. DeepLabv2 and v3 later built multiple dilation rates into their ASPP (Atrous Spatial Pyramid Pooling) module, in the same spirit as this work.
Dilated convolutions and WaveNet. Van den Oord et al. (2016) used the same idea for audio generation: stacked dilated causal convolutions with rates 1, 2, 4, ..., 512 give WaveNet a receptive field of thousands of audio samples while keeping sample-level resolution. The architecture is strikingly similar to this paper's context module, applied to a 1D signal.
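The 1-D causal variant is easy to sketch in NumPy. This example measures the receptive field of one stack of layers by pushing a unit impulse through it. The two-tap filters and all-ones weights are assumptions of this illustration, not details taken from the text above, and the function name is hypothetical.

```python
import numpy as np

def causal_dilated_conv1d(x, w, d):
    """y[t] = sum_j w[j] * x[t - j*d]; samples before t = 0 count as zero."""
    K, T = len(w), len(x)
    xp = np.concatenate([np.zeros((K - 1) * d), x])   # left padding only
    return sum(w[j] * xp[(K - 1 - j) * d:(K - 1 - j) * d + T] for j in range(K))

dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
x = np.zeros(2048)
x[100] = 1.0                              # unit impulse at t = 100

y = x
for d in dilations:
    y = causal_dilated_conv1d(y, np.ones(2), d)

receptive_field = int(np.count_nonzero(y))
is_causal = bool(np.all(y[:100] == 0))
print(receptive_field, is_causal)  # 1024 True
```

Ten layers cover 1024 samples per stack while every time step keeps its own output, and nothing leaks backwards in time.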
Dilated convolutions and the algorithme à trous. The dilated convolution operator comes from wavelet theory. Holschneider et al. (1987) used it in the algorithme à trous for multi-resolution signal decomposition. The paper carefully distinguishes: the algorithme à trous uses dilated convolutions, but is not equivalent to them. The operator is a general tool; the algorithm is a specific application.
Paper details. "Multi-Scale Context Aggregation by Dilated Convolutions," Fisher Yu, Vladlen Koltun. ICLR 2016. arXiv:1511.07122. First submitted November 2015.