VGGNet — Veanors

Chapter 0: The Problem

It's 2012. AlexNet has just shattered ImageNet records with a deep convolutional neural network — 5 convolutional layers followed by 3 fully connected layers. The computer vision world is electrified. But a question lingers: was 5 conv layers the right number?

AlexNet used large filters — 11×11 in the first layer, 5×5 in the second — to capture spatial patterns. The architecture worked, but the design choices felt somewhat arbitrary. Why those filter sizes? Why that depth? Nobody had systematically studied the effect of depth on performance.

ZFNet (Zeiler & Fergus, 2013) improved on AlexNet by tweaking filter sizes and strides, but stayed at roughly the same depth. The question remained: how deep should a CNN be?

The core question: If we keep everything else constant and just add more layers, does performance keep improving? And if so, what's the simplest way to go deeper without the architecture becoming unwieldy?

Simonyan and Zisserman at Oxford's Visual Geometry Group (VGG) set out to answer this with a systematic experiment. They fixed the filter size to the smallest meaningful value, 3×3, and simply added more layers. The result was VGGNet, which showed that depth is one of the most important factors in CNN performance.

What was the main architectural limitation of AlexNet that VGGNet aimed to address?

The depth was shallow (5 conv layers) and nobody had systematically studied whether going deeper would improve performance
AlexNet used too many parameters
AlexNet couldn't process color images

Chapter 1: The Key Insight

VGGNet's insight is elegantly simple: replace large convolutional filters with stacks of small 3×3 filters.

Consider a single 5×5 convolution. It looks at a 5×5 patch of the input and produces one output value. Now consider two 3×3 convolutions stacked on top of each other. The first 3×3 filter looks at a 3×3 patch. The second 3×3 filter looks at a 3×3 patch of the first layer's output — but each of those pixels already "saw" a 3×3 region. So the second layer effectively sees a 5×5 region of the original input.

Same receptive field. But two crucial differences:

More non-linearities: Two ReLU activations instead of one, making the decision function more discriminative
Fewer parameters: 2 × (3×3) = 18 weights versus 1 × (5×5) = 25 weights per channel pair

The simplicity principle: Don't design complex filter configurations. Just use the smallest meaningful filter (3×3 — the minimum size that captures left/right, up/down, center), stack many of them, and let depth do the work. Simplicity scales; complexity doesn't.

This extends to 7×7 filters too: three stacked 3×3 layers have the same receptive field as one 7×7 layer, but with three non-linearities and 27C² parameters instead of 49C² — a 45% reduction.

How many stacked 3×3 conv layers are needed to match the receptive field of a single 7×7 conv layer?

Two layers
Four layers
Three layers

Chapter 2: Why 3×3?

Let's work through the receptive field analysis carefully. A receptive field is the region of the input image that influences a single output pixel. For a convolution with stride 1 and filter size k, each output pixel "sees" k pixels of the input along each spatial dimension.

The receptive field formula

For N stacked convolutions with filter size k and stride 1:

Receptive field = N × (k − 1) + 1

For k = 3:

N = 1: RF = 1×2 + 1 = 3, i.e. a 3×3 region
N = 2: RF = 2×2 + 1 = 5, i.e. a 5×5 region (matches one 5×5 filter)
N = 3: RF = 3×2 + 1 = 7, i.e. a 7×7 region (matches one 7×7 filter)
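
The formula is easy to sanity-check in code. The sketch below uses hypothetical function names (they are not from the paper) and assumes stride 1 throughout. It computes the receptive field from the closed form and again by growing it one layer at a time.

```python
def stacked_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Closed form: side length of the receptive field of N stride-1 convs."""
    return num_layers * (kernel_size - 1) + 1


def receptive_field_by_layer(kernel_sizes) -> int:
    """Grow the receptive field one stride-1 layer at a time."""
    rf = 1  # a single pixel sees only itself
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the view by k - 1 pixels
    return rf


for n in (1, 2, 3):
    rf = stacked_receptive_field(n)
    assert rf == receptive_field_by_layer([3] * n)
    print(f"N = {n}: receptive field {rf}x{rf}")
```

Both functions agree because every additional stride-1 layer adds exactly k − 1 pixels to each side of the view.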

The parameter advantage

Assume both the input and output have C channels. The number of parameters per configuration:

One 5×5 layer: 5²C² = 25C²
Two 3×3 layers: 2 × 3²C² = 18C² (28% fewer)

One 7×7 layer: 7²C² = 49C²
Three 3×3 layers: 3 × 3²C² = 27C² (45% fewer)
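
The same arithmetic can be reproduced in a few lines. This is a minimal sketch with hypothetical helper names. It ignores biases, as the counts above do, and picks C = 256 purely for illustration, since the savings ratio does not depend on C.

```python
def conv_stack_weights(kernel_size: int, channels: int, num_layers: int = 1) -> int:
    """Weights (biases ignored) in a stack of k x k convs with C in/out channels."""
    return num_layers * kernel_size ** 2 * channels ** 2


def percent_saved(small: int, large: int) -> float:
    """How much smaller `small` is than `large`, in percent."""
    return 100.0 * (1 - small / large)


C = 256  # illustrative channel count
one_5x5 = conv_stack_weights(5, C)
two_3x3 = conv_stack_weights(3, C, num_layers=2)
one_7x7 = conv_stack_weights(7, C)
three_3x3 = conv_stack_weights(3, C, num_layers=3)

print(f"one 5x5: {one_5x5:,}  two 3x3: {two_3x3:,}  "
      f"({percent_saved(two_3x3, one_5x5):.0f}% fewer)")
print(f"one 7x7: {one_7x7:,}  three 3x3: {three_3x3:,}  "
      f"({percent_saved(three_3x3, one_7x7):.0f}% fewer)")
```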

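Double benefit: Stacking small filters gives you more non-linearity (more ReLU activations between layers) with fewer parameters. For the same receptive field, the stack is both more expressive and more parameter-efficient. The trade-off is that the extra layers produce more intermediate activations to compute and store.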

The regularization effect

There's a subtle third benefit. Decomposing a 7×7 filter into three 3×3 filters imposes a structural constraint — the effective 7×7 filter must be expressible as a composition of three smaller filters. This acts as implicit regularization, preventing overfitting by restricting the space of learnable filters.
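
To make the structural constraint concrete, the sketch below drops the ReLUs and the channel dimension (both simplifications that the real network does not make) and composes three random 3×3 kernels. The result is a single 7×7 kernel, but it is defined by only 27 numbers rather than 49. The helper `full_conv2d` is written here for illustration and assumes NumPy is available.

```python
import numpy as np


def full_conv2d(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Full 2-D convolution of two small kernels; each side grows by len(b) - 1."""
    out = np.zeros((a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1))
    for i in range(b.shape[0]):
        for j in range(b.shape[1]):
            # Shift a copy of `a` by (i, j) and weight it by b[i, j].
            out[i:i + a.shape[0], j:j + a.shape[1]] += b[i, j] * a
    return out


rng = np.random.default_rng(0)
k1, k2, k3 = (rng.standard_normal((3, 3)) for _ in range(3))

# Without non-linearities, three 3x3 filters collapse into one 7x7 filter.
effective = full_conv2d(full_conv2d(k1, k2), k3)

free_params_stack = 3 * 3 * 3  # numbers that define the stack
free_params_dense = 7 * 7      # numbers that define an unconstrained 7x7 filter
print(effective.shape, free_params_stack, free_params_dense)
```

Only 7×7 kernels that factor this way are reachable by the linear stack, which is the sense in which the decomposition restricts the space of filters. With ReLUs in between, the stack is no longer a single linear filter, so this is an illustration of the constraint rather than a model of the real layers.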

Three stacked 3×3 conv layers have what percentage fewer parameters than a single 7×7 conv layer (same channels)?

28% fewer
45% fewer (27C² vs 49C²)
60% fewer

Chapter 3: Architecture Configurations

VGGNet's experimental methodology was systematic. Simonyan and Zisserman defined six configurations (A, A-LRN, B, C, D, and E) that differ mainly in depth. Everything else stayed the same: filter size (3×3, apart from the 1×1 layers tested in Config C), pooling (2×2 max), and FC layers (three of them).

The configuration family

Config A (VGG-11): 8 conv layers + 3 FC = 11 weight layers. The baseline.
Config A-LRN: Same as A but with Local Response Normalization after the first conv layer. Tested whether LRN (used in AlexNet) helps. Spoiler: it doesn't.
Config B (VGG-13): 10 conv layers + 3 FC = 13 weight layers. One extra conv layer in each of the first two blocks.
Config C: 13 conv layers + 3 FC = 16 weight layers. Added 1×1 conv layers (borrowed from Network in Network) to add non-linearity without changing receptive fields.
Config D (VGG-16): 13 conv layers + 3 FC = 16 weight layers. Replaced the 1×1 layers with 3×3, which performed better because it adds spatial context as well as non-linearity.
Config E (VGG-19): 16 conv layers + 3 FC = 19 weight layers. The deepest.
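
The family can be written down as data. The sketch below is one hypothetical encoding, not the authors' code. Each block is a list of (kernel size, output channels) pairs, a max-pool is assumed after every block, and the per-block layout follows the configuration table in the paper. A-LRN is omitted because it has the same conv layers as A.

```python
# Each configuration is five blocks; each block lists (kernel_size, out_channels).
VGG_CONFIGS = {
    "A": [[(3, 64)], [(3, 128)], [(3, 256)] * 2, [(3, 512)] * 2, [(3, 512)] * 2],
    "B": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 2,
          [(3, 512)] * 2, [(3, 512)] * 2],
    "C": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 2 + [(1, 256)],
          [(3, 512)] * 2 + [(1, 512)], [(3, 512)] * 2 + [(1, 512)]],
    "D": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 3,
          [(3, 512)] * 3, [(3, 512)] * 3],
    "E": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 4,
          [(3, 512)] * 4, [(3, 512)] * 4],
}
NUM_FC_LAYERS = 3  # FC-4096, FC-4096, FC-1000 in every configuration


def conv_layers(config) -> int:
    return sum(len(block) for block in config)


def weight_layers(config) -> int:
    """Conv layers plus FC layers; pooling has no weights and is not counted."""
    return conv_layers(config) + NUM_FC_LAYERS


for name, config in VGG_CONFIGS.items():
    print(f"Config {name}: {conv_layers(config)} conv + {NUM_FC_LAYERS} FC "
          f"= {weight_layers(config)} weight layers")
```

Encoding C and D side by side makes the comparison explicit: the two differ only in the kernel size of the last conv layer in blocks 3 to 5.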

Key finding: LRN (used in AlexNet) provided no benefit. Config A-LRN was no more accurate than Config A (the paper reports 29.7% vs 29.6% top-1 error), so the community could safely drop it. Meanwhile, Config D outperformed Config C despite having the same depth, indicating that 3×3 filters (with spatial context) beat 1×1 filters (without it).

Why does Config D (VGG-16 with 3×3 filters) outperform Config C (same depth, but with 1×1 filters)?

3×3 filters capture spatial context (left/right/up/down) while 1×1 filters only do channel mixing; both add non-linearity but 3×3 also adds spatial reasoning
1×1 filters have too many parameters
3×3 filters are faster to compute

Chapter 4: The VGG-16 Architecture

VGG-16 (Config D) became the most famous variant. Its architecture is a study in repetition: five blocks of convolutions, each followed by max-pooling, then three fully connected layers. The channel count doubles after each pool until it reaches 512: 64 → 128 → 256 → 512 → 512.

The full VGG-16 pipeline: Input 224×224×3 → [conv3-64]×2 → pool → [conv3-128]×2 → pool → [conv3-256]×3 → pool → [conv3-512]×3 → pool → [conv3-512]×3 → pool → FC-4096 → FC-4096 → FC-1000 → softmax. That's 13 conv layers and 3 FC layers = 16 weight layers total, with 138 million parameters.

Block-by-block breakdown

Each convolutional block follows the same pattern: convolution (3×3, stride 1, pad 1) → ReLU → convolution → ReLU → [optional third conv+ReLU] → 2×2 max-pool (stride 2).

Block 1: 224×224×3 → 224×224×64 → 112×112×64 after pool
Block 2: 112×112×64 → 112×112×128 → 56×56×128 after pool
Block 3: 56×56×128 → 56×56×256 → 28×28×256 after pool
Block 4: 28×28×256 → 28×28×512 → 14×14×512 after pool
Block 5: 14×14×512 → 14×14×512 → 7×7×512 after pool

Then the classifier head: 7×7×512 = 25,088 values flatten into FC-4096 → FC-4096 → FC-1000.
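
The shapes in the breakdown follow from the standard output-size formula for convolution and pooling. The sketch below traces them with hypothetical helper names, assuming the 3×3/stride 1/pad 1 convolutions and 2×2/stride 2 pooling described above.

```python
def conv_output_size(size: int, kernel: int = 3, stride: int = 1, pad: int = 1) -> int:
    return (size + 2 * pad - kernel) // stride + 1


def pool_output_size(size: int, kernel: int = 2, stride: int = 2) -> int:
    return (size - kernel) // stride + 1


# (number of conv layers, output channels) for the five VGG-16 blocks
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_shapes(input_size: int = 224, blocks=VGG16_BLOCKS):
    """Return (shape before pool, shape after pool) for every block."""
    size = input_size
    shapes = []
    for num_convs, channels in blocks:
        for _ in range(num_convs):
            size = conv_output_size(size)  # pad 1 keeps the spatial size
        before_pool = (size, size, channels)
        size = pool_output_size(size)      # the pool halves it
        shapes.append((before_pool, (size, size, channels)))
    return shapes


shapes = trace_shapes()
for i, (before, after) in enumerate(shapes, start=1):
    print(f"Block {i}: {before} -> {after} after pool")

final_h, final_w, final_c = shapes[-1][1]
flattened = final_h * final_w * final_c
print(f"Flattened input to the first FC layer: {flattened}")
```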

Where are the parameters? The conv layers contain only ~15M parameters. The FC layers contain ~124M. The first FC layer alone (25,088 → 4,096) has ~102.8M parameters, about 74% of the entire network. This imbalance became one of VGG's most criticized features, and later architectures largely replaced the large FC layers with global average pooling.

What fraction of VGG-16's 138M parameters are in the fully connected layers?

About 50%
About 75%
About 90% (~124M of 138M)

Chapter 5: Training Details

Training VGGNet required careful engineering. With 16–19 layers, naive random initialization would cause gradients to vanish or explode. Simonyan and Zisserman used a clever bootstrapping strategy.

Weight initialization

They first trained the shallowest network (Config A, 11 layers) with random initialization — shallow enough that random init still works. Then for deeper networks (B through E), they initialized the first four conv layers and last three FC layers using the trained weights from Config A. Intermediate layers were initialized randomly.

Post-publication update: After submission, they discovered that Xavier initialization (Glorot & Bengio, 2010) could replace this bootstrapping entirely. But the sequential training approach reveals an important insight: depth makes optimization harder, and this challenge motivated later innovations like BatchNorm and residual connections.

Multi-scale training

The training images were rescaled so the shorter side had length S, then a random 224×224 crop was extracted. Two strategies:

Single-scale: Fixed S = 256 or S = 384. Every crop comes from the same magnification level.
Multi-scale (scale jittering): S randomly sampled from [256, 512] for each image. This forces the network to learn to recognize objects at different scales — a natural data augmentation.
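
The geometry of scale jittering can be sketched without any image library. The function below is a hypothetical illustration, not the authors' pipeline. It samples S, works out the rescaled image size, and picks a random 224×224 crop box. The actual resizing and cropping of pixels is left out.

```python
import random


def jittered_crop_box(width, height, s_min=256, s_max=512, crop=224, rng=random):
    """Sample a training scale S and a random crop box in the rescaled image."""
    s = rng.randint(s_min, s_max)      # S drawn uniformly from [s_min, s_max]
    scale = s / min(width, height)     # the shorter side becomes S
    new_w, new_h = round(width * scale), round(height * scale)
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)
    return s, (new_w, new_h), (left, top, left + crop, top + crop)


rng = random.Random(0)  # seeded only to make the demo repeatable
for _ in range(3):
    s, size, box = jittered_crop_box(500, 375, rng=rng)
    print(f"S = {s}: rescaled to {size}, crop box {box}")
```

Setting s_min equal to s_max gives the single-scale case. With a large S the fixed-size crop covers a small part of the object, and with a small S it covers most of the image, which is how the network ends up seeing objects at different scales.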

Other training details

Optimizer: SGD with momentum 0.9, batch size 256
Weight decay: L2 regularization with multiplier 5×10⁻⁴
Dropout: 0.5 on the first two FC layers
Learning rate: Started at 10⁻², divided by 10 when validation error plateaued (3 times total)
Duration: 74 epochs, 2–3 weeks on 4 NVIDIA Titan Black GPUs
Augmentation: Random horizontal flips + random RGB color shift
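
The learning-rate rule can be sketched as a small plateau scheduler. This is a hypothetical illustration. The starting rate, the factor of 10, and the limit of three drops come from the list above, but the `patience` criterion for deciding that validation error has plateaued is an assumption made here, since the text does not specify one.

```python
from dataclasses import dataclass


@dataclass
class PlateauSchedule:
    lr: float = 1e-2        # initial learning rate
    factor: float = 0.1     # divide by 10 on each drop
    max_drops: int = 3      # the rate was lowered three times in total
    patience: int = 2       # ASSUMPTION: epochs without improvement before a drop
    best_error: float = float("inf")
    stale_epochs: int = 0
    drops: int = 0

    def step(self, val_error: float) -> float:
        """Record one epoch's validation error and return the rate to use next."""
        if val_error < self.best_error:
            self.best_error = val_error
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        if self.stale_epochs >= self.patience and self.drops < self.max_drops:
            self.lr *= self.factor
            self.drops += 1
            self.stale_epochs = 0
        return self.lr


schedule = PlateauSchedule()
for epoch, err in enumerate([0.30, 0.28, 0.28, 0.28, 0.27, 0.27, 0.27], start=1):
    print(f"epoch {epoch}: val error {err:.2f} -> lr {schedule.step(err):.0e}")
```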

Dense evaluation at test time

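At test time, the FC layers were converted to convolutional layers (the first FC layer becomes a 7×7 conv layer, the last two become 1×1 conv layers), making the network fully convolutional. This let them apply the network to images of any size, producing a spatial map of class scores that was then average-pooled. They also averaged predictions over the horizontally flipped image and over several test scales Q (the shorter side of the rescaled test image). In the paper's comparison, multi-crop testing was slightly more accurate than dense evaluation on its own, and combining the two gave the best results because they are complementary.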
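
The FC-to-conv conversion is just a reshape of the weight matrix. The sketch below shows the idea with deliberately tiny, made-up dimensions (4 channels and 5 output units instead of 512 and 4096) and assumes NumPy is available. An FC layer applied to a flattened 7×7 window gives the same numbers as a 7×7 convolution at that position, so sliding it over a larger feature map yields a spatial score map.

```python
import numpy as np

WINDOW, CHANNELS, UNITS = 7, 4, 5  # toy sizes for illustration only


def dense_apply(weights_fc: np.ndarray, feature_map: np.ndarray) -> np.ndarray:
    """Apply FC weights as a WINDOW x WINDOW convolution (stride 1, no padding)."""
    h, w, c = feature_map.shape
    units = weights_fc.shape[0]
    # Reinterpret each row of the FC matrix as a WINDOW x WINDOW x C kernel.
    kernel = weights_fc.reshape(units, WINDOW, WINDOW, c)
    out = np.empty((h - WINDOW + 1, w - WINDOW + 1, units))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feature_map[i:i + WINDOW, j:j + WINDOW, :]
            out[i, j] = np.tensordot(kernel, patch, axes=3)
    return out


rng = np.random.default_rng(0)
weights_fc = rng.standard_normal((UNITS, WINDOW * WINDOW * CHANNELS))

# On a 7x7 input the conv form reproduces the FC output exactly.
small = rng.standard_normal((WINDOW, WINDOW, CHANNELS))
fc_out = weights_fc @ small.ravel()
conv_out = dense_apply(weights_fc, small)

# On a larger input it produces a map of scores, which is then average-pooled.
large = rng.standard_normal((9, 9, CHANNELS))
score_map = dense_apply(weights_fc, large)
pooled = score_map.mean(axis=(0, 1))
print(conv_out.shape, score_map.shape, pooled.shape)
```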

Why is multi-scale training (scale jittering) beneficial?

It forces the network to recognize objects at different sizes, acting as natural data augmentation that improves generalization
It makes training faster
It reduces the number of required parameters

Chapter 6: Results

The ILSVRC 2014 competition was fierce. GoogLeNet took first place with 6.7% top-5 error using a complex 22-layer Inception architecture. VGGNet came in second with 7.3% — remarkable given its far simpler design.

The depth ladder

The most important result wasn't the competition ranking but the systematic depth analysis:

VGG-11 (A): 29.6% top-1 / 10.4% top-5 error
VGG-13 (B): 28.7% top-1 / 9.9% top-5 error
VGG-16 (D): 25.6% top-1 / 8.1% top-5 error (with scale jittering)
VGG-19 (E): 25.5% top-1 / 8.0% top-5 error (with scale jittering)

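Increasing depth brought consistent improvement. From 11 to 19 layers, top-5 error dropped from 10.4% to 8.0%, a 23% relative improvement. Note that the VGG-16 and VGG-19 figures above also benefit from scale jittering, so not all of that gain comes from depth alone.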
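
The relative-improvement figure is simple arithmetic on the top-5 errors listed above. A minimal sketch, with a hypothetical helper name:

```python
# Top-5 error (%) from the depth ladder above
TOP5_ERROR = {
    "VGG-11 (A)": 10.4,
    "VGG-13 (B)": 9.9,
    "VGG-16 (D)": 8.1,
    "VGG-19 (E)": 8.0,
}


def relative_improvement(before: float, after: float) -> float:
    """Percentage reduction in error going from `before` to `after`."""
    return 100.0 * (before - after) / before


overall = relative_improvement(TOP5_ERROR["VGG-11 (A)"], TOP5_ERROR["VGG-19 (E)"])
last_step = relative_improvement(TOP5_ERROR["VGG-16 (D)"], TOP5_ERROR["VGG-19 (E)"])
print(f"11 -> 19 layers: {overall:.1f}% relative reduction in top-5 error")
print(f"16 -> 19 layers: {last_step:.1f}% relative reduction in top-5 error")
```

The second number is much smaller than the first, which is the saturation effect in numerical form.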

Diminishing returns: VGG-16 and VGG-19 performed nearly identically (8.1% vs 8.0% top-5). The error rate "saturated" at 19 layers. Going deeper with this simple architecture didn't help further — a limitation that ResNet would later break through with skip connections.

Competition results (with ensembling)

By ensembling multiple VGGNet models and combining dense evaluation with multi-crop testing, the team achieved:

Classification: 7.3% top-5 error (2nd place, behind GoogLeNet's 6.7%)
Localization: 25.3% error (1st place)

Transfer learning performance

VGG features generalized remarkably well. On VOC-2007 and VOC-2012 (image classification), Caltech-101, and Caltech-256, VGG features matched or outperformed the best prior methods, often by a large margin, even with simple linear classifiers on top of the frozen features.

What was VGGNet's key finding about depth and performance?

Every increase in depth (11 to 19 layers) consistently reduced error, with a 23% relative improvement in top-5 error, though returns diminished beyond 19 layers
Deeper networks are always worse due to vanishing gradients
Only the number of FC layers matters

Chapter 7: VGG as Feature Extractor

VGGNet's greatest legacy isn't its ImageNet accuracy — it's what happened after the competition. For years, pretrained VGG features became the default backbone for nearly every computer vision task.

Why VGG features work so well

The uniform architecture creates a natural feature hierarchy:

Early layers (conv1, conv2): Edges, corners, color gradients — low-level texture
Middle layers (conv3, conv4): Textures, patterns, object parts
Late layers (conv5): Object-level representations, semantic meaning

Applications that relied on VGG

Object detection (R-CNN, Fast R-CNN): VGG-16 replaced AlexNet as the feature backbone, immediately boosting mAP by several points
Neural style transfer (Gatys et al., 2015): Used VGG's layer activations to define "content loss" and "style loss" — the algorithm that launched a thousand art apps
Perceptual loss (Johnson et al., 2016): Instead of pixel-level L2 loss, compare images in VGG feature space. Still widely used in super-resolution, image synthesis, and GANs
Semantic segmentation (FCN, 2015): Converted VGG to fully convolutional for dense pixel-wise prediction
Image captioning and visual question answering: VGG features fed into RNNs/LSTMs that generate captions or answers

The "ImageNet moment": VGG helped demonstrate that features learned on ImageNet transfer powerfully to other tasks and datasets. This fueled the transfer learning revolution: the idea that you don't train from scratch, you fine-tune a pretrained model. This same idea later drove the rise of pretrained language models (BERT, GPT).

Why did VGG become the default backbone for neural style transfer?

VGG's uniform architecture creates a clean feature hierarchy where different layers capture different levels of abstraction (edges → textures → objects), making it ideal for separating content from style
VGG was the fastest network available
VGG was the only network with public weights

Chapter 8: The Parameter Problem

VGG-16's 138 million parameters were a serious practical concern. The model weights alone take up ~528 MB in float32. In 2014, this was enormous.

Where the parameters live

The parameter distribution is strikingly uneven:

Conv layers (13 layers): ~14.7M parameters — just 11% of the total
FC1 (7×7×512 → 4096): ~102.8M parameters — 74% of the total
FC2 (4096 → 4096): ~16.8M parameters — 12%
FC3 (4096 → 1000): ~4.1M parameters — 3%
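
These counts can be rebuilt from the layer shapes. The sketch below uses hypothetical helper names, counts weights plus biases, and assumes 4 bytes per parameter (float32). The resulting size of roughly 528 is in MiB (2²⁰ bytes), which is about 553 MB in decimal units.

```python
# (in_channels, out_channels) for the 13 conv layers of VGG-16
VGG16_CONVS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# (inputs, outputs) for the three FC layers
VGG16_FCS = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    return kernel * kernel * c_in * c_out + c_out  # weights + biases


def fc_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out  # weights + biases


conv_total = sum(conv_params(c_in, c_out) for c_in, c_out in VGG16_CONVS)
fc_counts = [fc_params(n_in, n_out) for n_in, n_out in VGG16_FCS]
total = conv_total + sum(fc_counts)
weights_mib = total * 4 / 2 ** 20  # float32 weights, in MiB

print(f"Conv layers: {conv_total:,} ({100 * conv_total / total:.0f}%)")
for name, count in zip(("FC1", "FC2", "FC3"), fc_counts):
    print(f"{name}: {count:,} ({100 * count / total:.0f}%)")
print(f"Total: {total:,} parameters, about {weights_mib:.0f} MiB in float32")
```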

The bottleneck: That first FC layer is the problem. Flattening a 7×7×512 feature map and fully connecting it to 4096 neurons creates about 102 million weights, which account for most of the model's memory footprint even though nearly all of the compute is in the conv layers. GoogLeNet (concurrent with VGG) avoided this kind of layer by applying global average pooling before its classifier, and the whole network has only about 6.8M parameters compared with VGG-16's 138M.

VGG's lasting limitations

Memory: ~528 MB for weights alone (float32), plus activations during training
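Compute: ~15.5 billion multiply-accumulate operations per forward pass, often loosely quoted as FLOPs (vs ~1.5B for GoogLeNet)
Slow inference: The 13 conv layers account for nearly all of that compute, while the FC layers dominate the memory footprint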
Depth ceiling: Accuracy saturated at 19 layers, and much deeper plain networks without skip connections were later shown to suffer from degradation (not overfitting: the training error also increased, a problem that ResNet addressed)
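
The compute figure can be estimated by counting multiply-accumulate operations (MACs) per layer, which is the unit in which the ~15.5 billion number is usually quoted. The sketch below is a rough count under stated assumptions: a 224×224 input, one MAC per weight per output position, and biases, ReLU, and pooling ignored.

```python
# (spatial size of the block's feature maps, [(in_channels, out_channels), ...])
VGG16_LAYOUT = [
    (224, [(3, 64), (64, 64)]),
    (112, [(64, 128), (128, 128)]),
    (56, [(128, 256), (256, 256), (256, 256)]),
    (28, [(256, 512), (512, 512), (512, 512)]),
    (14, [(512, 512), (512, 512), (512, 512)]),
]
VGG16_FCS = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

# A 3x3 conv does 9 * c_in MACs for each of its size * size * c_out outputs.
conv_macs = sum(
    size * size * 9 * c_in * c_out
    for size, layers in VGG16_LAYOUT
    for c_in, c_out in layers
)
# An FC layer does one MAC per weight.
fc_macs = sum(n_in * n_out for n_in, n_out in VGG16_FCS)
total_macs = conv_macs + fc_macs

print(f"Conv layers: {conv_macs / 1e9:.2f} G MACs")
print(f"FC layers:   {fc_macs / 1e9:.2f} G MACs "
      f"({100 * fc_macs / total_macs:.1f}% of the total)")
print(f"Total:       {total_macs / 1e9:.2f} G MACs")
```

The split is the mirror image of the parameter distribution: the FC layers hold most of the weights but perform under 1% of the arithmetic, because each FC weight is used once while each conv weight is reused at every spatial position.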

The legacy

Despite these limitations, VGG established one of the most important principles in deep learning: depth matters. The architectures that followed (GoogLeNet, ResNet, DenseNet) went deeper. They just found smarter ways to do it.

What architectural change did later networks (GoogLeNet, ResNet) adopt to address VGG's parameter problem?

Replaced fully connected layers with global average pooling, eliminating the FC parameter bottleneck entirely
Used larger convolution filters
Reduced the number of convolutional layers

Chapter 9: Connections

VGGNet sits at a pivotal moment in deep learning history. It crystallized the lessons of AlexNet, ran concurrently with GoogLeNet, and set the stage for ResNet's breakthrough.

Predecessors

AlexNet (Krizhevsky et al., 2012)

The architecture VGGNet systematically improved upon. AlexNet's 5 conv layers with large filters (11×11, 5×5) proved that deep CNNs work; VGG showed they work much better when made deeper with small filters.

ZFNet (Zeiler & Fergus, 2013)

Won ILSVRC 2013 by visualizing and tweaking AlexNet's filters. Showed that smaller first-layer filters (7×7 instead of 11×11) helped, foreshadowing VGG's extreme version of this insight.

Contemporaries

GoogLeNet / Inception (Szegedy et al., 2014)

Developed independently, also went deep (22 layers). Used a more complex "Inception module" with parallel 1×1, 3×3, 5×5 filters and global average pooling. Won ILSVRC 2014 with 6.7% vs VGG's 7.3%, but VGG's simpler design proved more influential for downstream tasks.

Successors

Batch Normalization (Ioffe & Szegedy, 2015)

Eased VGG's initialization problem. By normalizing activations within each mini-batch, BatchNorm allowed much higher learning rates and reduced the need for careful weight initialization and, in some cases, for dropout.

ResNet (He et al., 2015)

Broke through VGG's depth ceiling. Skip connections allowed training networks with 50, 101, even 152 layers, addressing the degradation problem that plain VGG-style stacks run into when made much deeper. Won ILSVRC 2015 with 3.6% top-5 error.

Modern CNNs (EfficientNet, ConvNeXt)

VGG's "simplicity principle" echoes in ConvNeXt (2022), which showed that a modernized ResNet with uniform 7×7 depthwise convolutions can match Vision Transformers. The lesson endures: simple, uniform designs are powerful.

Paper at a glance

Core contribution

Systematic study showing that depth (16–19 layers) with 3×3 filters achieves SOTA on ImageNet

Design principle

Simplicity: one filter size (3×3), doubling channels after each pool, uniform blocks

Results

7.3% top-5 error (ILSVRC 2014, 2nd place classification, 1st place localization)

Legacy

Pretrained VGG features became the default backbone for transfer learning, style transfer, and perceptual losses for years

What fundamental problem prevented VGGNet from going deeper than 19 layers, which ResNet later solved?

The degradation problem: beyond ~19 layers, both training and test error increased with depth (not due to overfitting). ResNet's skip connections solved this by allowing gradients to flow directly through shortcut paths.
GPU memory limitations
The dataset was too small

Very Deep Convolutional Networks