What if you treated every image as a sentence? The Vision Transformer does exactly that, and given enough pre-training data it beats CNNs at their own game.
Look at a photo of a dog jumping to catch a frisbee. Where is the frisbee? Your eye doesn't scan the image pixel by pixel — it immediately connects the dog's outstretched paws to the object in the air, even though they're on opposite sides of the frame. That's global reasoning: understanding relationships between things that are far apart.
For decades, computer vision relied on Convolutional Neural Networks (CNNs). A CNN slides a small filter — maybe 3×3 pixels — across the image. That filter can only see its tiny local neighborhood. To "understand" that the dog and the frisbee are related, a CNN needs many stacked layers, each one expanding the receptive field a little more. It's like trying to understand a sentence by reading three letters at a time.
In 2020, a team at Google Brain asked a radical question: what if we just used a Transformer on images? The Transformer's self-attention mechanism lets every element directly attend to every other element — in a single layer. No local windows, no hierarchical buildup. Patch 1 (top-left corner) can directly communicate with patch 196 (bottom-right corner) in one shot.
The trick is turning an image into a sequence. Chop the image into a grid of fixed-size patches, flatten each patch into a vector, and feed that vector sequence to a Transformer. That's the entire Vision Transformer (ViT) idea.
Before we can feed an image to a Transformer, we need to turn it into a sequence. Transformers expect a 1D sequence of vectors — like a sentence of word tokens. But an image is 2D: a grid of pixels. How do we make that into a sequence?
The answer is patchification: divide the image into a regular grid of non-overlapping square patches, then flatten each patch into a single long vector. That vector becomes one "token" in the sequence.
Take a standard ImageNet-sized image: 224 × 224 pixels, with 3 color channels (RGB). ViT uses a patch size of 16 × 16 pixels. Let's count:
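The counting can be sketched in a few lines of Python. The helper name `patch_stats` is hypothetical, introduced only for illustration, and it assumes a square image whose side is an exact multiple of the patch size.

```python
def patch_stats(image_size=224, patch_size=16, channels=3):
    """Return (patches_per_side, num_patches, values_per_patch).

    Hypothetical helper for illustration; assumes a square image whose
    side length is an exact multiple of the patch size.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size                     # 224 / 16 = 14
    num_patches = per_side * per_side                       # 14 * 14 = 196
    values_per_patch = patch_size * patch_size * channels   # 16 * 16 * 3 = 768
    return per_side, num_patches, values_per_patch


print(patch_stats())  # (14, 196, 768)
```

Calling `patch_stats(patch_size=8)` or `patch_stats(patch_size=32)` shows how quickly the token count moves with patch size.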
So: one 224×224 RGB image becomes a sequence of 196 tokens, each a flat vector of dimension 768. That's our "sentence." The sequence length is 196, which is perfectly manageable for a Transformer.
Patch size is a trade-off. Smaller patches (8×8) give more tokens (784), and because self-attention cost grows as O(n²) in the number of tokens, four times as many tokens means roughly sixteen times the attention compute. Larger patches (32×32) give fewer tokens (49) but lose fine-grained detail. The ViT paper tested 16×16 and 32×32 patches, and 16×16 consistently performed better, especially on detailed recognition tasks.
We have 196 flat patch vectors, each of dimension 768. But a Transformer expects tokens of dimension d_model — let's say 768 for ViT-Base. So each patch vector is already 768-dimensional... but we still project it. Why?
The raw flattened patch is just a list of pixel values: no semantic meaning, no learned structure. We multiply each patch by a learned linear projection matrix E of shape [P²·C, d_model], which here is [768, 768]. The paper writes this as a plain matrix multiplication with no bias term; implementations commonly include one, as PyTorch's nn.Conv2d does by default. After this projection, each token lives in the model's learned embedding space, not raw pixel space.
```python
import torch
import torch.nn as nn

# Patchify + project in one shot using a strided convolution
# image: [B, 3, 224, 224] (random stand-in batch so the snippet runs)
image = torch.randn(2, 3, 224, 224)

# This Conv2d extracts 16x16 patches and projects each to d_model
patch_embed = nn.Conv2d(
    in_channels=3,
    out_channels=768,  # d_model
    kernel_size=16,    # patch size
    stride=16,         # non-overlapping
)

# Forward pass:
x = patch_embed(image)   # [B, 768, 14, 14]
x = x.flatten(2)         # [B, 768, 196]
x = x.transpose(1, 2)    # [B, 196, 768] = (batch, seq_len, d_model)
```
Here's a subtlety: after patchification, we have 196 patch tokens. But what token summarizes the whole image for classification? ViT borrows an idea from BERT: prepend a special learnable token called the [CLS] token (short for "classification"). It has shape [1, d_model]. Its input embedding is a single learned parameter shared by every image; what differs from image to image is its output, which aggregates information from the patch tokens through self-attention.
After prepending, we have a sequence of length 197: 1 CLS token + 196 patch tokens.
Transformers are inherently position-unaware: self-attention treats the input as a set, not a sequence. Without position information, the model could not tell whether a patch came from the top-left corner or the bottom-right, so shuffling the patches would not change what it computes for each one. So we add a learnable position embedding to each token (including CLS). These are learned vectors of shape [197, 768] that encode the position of each patch.
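Putting the three steps together (patch projection, CLS token, position embeddings) might look like the sketch below. The module name `ViTEmbeddings` and the truncated-normal initialization are assumptions made for illustration; the shapes follow the ViT-Base/16 numbers used in the text.

```python
import torch
import torch.nn as nn


class ViTEmbeddings(nn.Module):
    """Patch projection + [CLS] token + learned position embeddings.

    Hypothetical module name; defaults follow the ViT-Base/16 numbers in the text.
    """

    def __init__(self, image_size=224, patch_size=16, channels=3, d_model=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 196
        self.patch_embed = nn.Conv2d(channels, d_model,
                                     kernel_size=patch_size, stride=patch_size)
        # One shared learnable [CLS] vector, expanded over the batch in forward()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # One learnable vector per position: CLS plus every patch
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)  # assumed init, not from the text

    def forward(self, image):
        x = self.patch_embed(image)                       # [B, d_model, 14, 14]
        x = x.flatten(2).transpose(1, 2)                  # [B, 196, d_model]
        cls = self.cls_token.expand(x.shape[0], -1, -1)   # [B, 1, d_model]
        x = torch.cat([cls, x], dim=1)                    # [B, 197, d_model]
        return x + self.pos_embed                         # broadcast over the batch


tokens = ViTEmbeddings()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```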
Here's the beautiful part: the Transformer Encoder in ViT is identical to the one used for language. No modifications. The same architecture that processes "The cat sat on the mat" now processes 197 image patch tokens. Vision is just another sequence.
Each Transformer Encoder block has two sub-layers: multi-head self-attention followed by a position-wise MLP. Both are wrapped in residual connections, with Layer Normalization applied before each sub-layer (the "Pre-LN" variant, which is more stable to train):
```python
import torch
import torch.nn as nn


class ViTBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * mlp_ratio),
            nn.GELU(),
            nn.Linear(d_model * mlp_ratio, d_model),
        )

    def forward(self, x):
        # Pre-LN: normalize, attend, then add back to the residual stream
        x_norm = self.norm1(x)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm)
        x = x + attn_out                  # residual
        x = x + self.mlp(self.norm2(x))   # residual
        return x
```
In language, attention heads learn things like: "this pronoun refers to that noun." On images, attention heads learn similarly interpretable patterns. Early layers tend to attend locally (similar to CNN kernels), while deeper layers attend globally — connecting distant patches that are semantically related. A head processing a bird photo might learn to link the beak, wings, and tail even though they're far apart spatially.
After L Transformer Encoder blocks, we have 197 output tokens, each of shape [d_model]. These tokens have now "talked" to each other through L rounds of attention. Which token do we use to classify the image?
We use the CLS token — the special [CLS] token we prepended back in Chapter 2. After passing through all the Encoder layers, the CLS token's output vector has accumulated global information about the image by attending to all 196 patch tokens. We feed only this vector into a small MLP head (also called the classification head) that maps from [d_model] to [num_classes].
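Here $y = \mathrm{MLP}(\mathrm{LN}(z_L^0))$, where $z_L^0$ is the CLS token (index 0) at the final encoder layer $L$ and LN is the final Layer Normalization. The MLP head is just a single linear layer during fine-tuning (and an MLP with one hidden layer during pre-training).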
There's an alternative: instead of using the CLS token, average all 196 patch token outputs, which is global average pooling (GAP). Many follow-up works, such as Swin Transformer, pool this way; DeiT, by contrast, keeps the class token. The ViT paper tested both and found they perform similarly, but CLS is more "Transformer-native" while GAP slightly simplifies the architecture.
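The two pooling strategies differ by only a line of code. The sketch below uses a hypothetical helper `pool_tokens` and a random tensor as a stand-in for the encoder output; the order of normalization and pooling varies between implementations, so treat this as one possible arrangement.

```python
import torch
import torch.nn as nn


def pool_tokens(tokens, strategy="cls"):
    """Reduce encoder output [B, 1 + N, D] to one vector per image [B, D]."""
    if strategy == "cls":
        return tokens[:, 0]                # CLS output only
    if strategy == "gap":
        return tokens[:, 1:].mean(dim=1)   # average the N patch tokens, skip CLS
    raise ValueError(f"unknown strategy: {strategy}")


d_model, num_classes = 768, 1000
final_norm = nn.LayerNorm(d_model)
head = nn.Linear(d_model, num_classes)      # single linear layer, as in fine-tuning

encoder_out = torch.randn(2, 197, d_model)  # stand-in for the last block's output
logits_cls = head(final_norm(pool_tokens(encoder_out, "cls")))
logits_gap = head(final_norm(pool_tokens(encoder_out, "gap")))
print(logits_cls.shape, logits_gap.shape)   # torch.Size([2, 1000]) torch.Size([2, 1000])
```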
When the ViT paper came out, it had a surprising finding: trained on ImageNet-1k (~1.3M images), ViT-Base performed worse than ResNet. On ImageNet-21k (~14M images), they were comparable. Only on JFT-300M (300 million images) did ViT clearly win. Why does a Transformer need so much data to beat a CNN?
It comes down to inductive biases. A CNN is pre-loaded with assumptions: nearby pixels are related, patterns should be translation-invariant (a cat is a cat wherever it appears in the image). These biases are baked into the architecture by construction. The model doesn't need to learn them from data. ViT has far fewer of these built-in assumptions, so it has to learn spatial structure from the data itself, and that takes many more examples.
The solution isn't to collect 300M images for every project; it's transfer learning. Pre-train a large ViT on JFT-300M (or ImageNet-21k), then fine-tune on your target dataset with the classification head replaced. The Transformer's weights already encode rich image representations, so fine-tuning can work well even with only a few thousand examples.
During fine-tuning, one wrinkle appears: the pre-trained position embeddings were learned for a 14×14 grid (196 patches at 16×16 on 224×224). If you fine-tune at a higher resolution (say, 384×384), you suddenly have 576 patches — the model has never seen position embeddings for positions 197–576. The fix is 2D interpolation: bilinearly interpolate the 14×14 position embedding grid to the new 24×24 grid. It works surprisingly well.
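A minimal sketch of that interpolation, assuming position embeddings stored as [1, 1 + 14·14, d_model] with the CLS position first. The helper name `resize_pos_embed` is hypothetical, and libraries may use a different interpolation mode (bicubic, for example) than the bilinear one shown here.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """Interpolate [1, 1 + old_grid**2, D] position embeddings to a new grid.

    Hypothetical helper; the CLS embedding is kept as-is and only the
    patch grid is resized.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    d_model = patch_pos.shape[-1]
    # [1, 196, D] -> [1, D, 14, 14] so interpolate sees a 2D grid
    grid = patch_pos.reshape(1, old_grid, old_grid, d_model).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bilinear", align_corners=False)
    # [1, D, 24, 24] -> [1, 576, D]
    patch_pos = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d_model)
    return torch.cat([cls_pos, patch_pos], dim=1)


old = torch.randn(1, 197, 768)
new = resize_pos_embed(old)
print(new.shape)  # torch.Size([1, 577, 768])
```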
```python
# Transfer learning: fine-tune ViT-B/16 on a custom 10-class dataset
import timm
import torch

# Load pretrained ViT-Base/16 (ImageNet-21k-pretrained weights; the exact
# checkpoint behind this name depends on the installed timm version)
model = timm.create_model(
    'vit_base_patch16_224',
    pretrained=True,
    num_classes=10,  # replace head for 10 classes
)

# The backbone weights are already rich feature extractors
# Fine-tune everything (or just the head for very small datasets)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
```
Self-attention is a set operation: if you shuffle the input tokens, the output shuffles identically. The model sees no difference between "patch 5 is at position 5" and "patch 5 is at position 50." This is great for parallelism, terrible for understanding images — because where a patch is matters enormously.
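This shuffle property is easy to check numerically. The sketch below uses a small `nn.MultiheadAttention` layer with no position embeddings; the sizes (10 tokens, 64 dimensions, 4 heads) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(1, 10, 64)   # 10 tokens, no position embeddings added
perm = torch.randperm(10)    # a random reordering of the tokens

with torch.no_grad():
    out, _ = attn(x, x, x)
    out_shuffled, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Shuffling the input shuffles the output the same way (up to float rounding)
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))  # True
```

Each token's output is the same wherever it sits in the sequence, which is why position has to be injected separately.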
ViT adds a learned position embedding vector to each token. There are 197 position embeddings (one for CLS, one per patch), each of dimension d_model=768. These vectors are random at initialization and learned end-to-end during training. The model must figure out, from the task signal alone, what spatial structure to encode in these vectors.
When researchers visualized the cosine similarity between position embeddings, they found a striking pattern: nearby patches have similar embeddings, and the similarity decays with distance. The model spontaneously learns 2D spatial structure even from 1D position indices. It even partially recovers row/column structure — position embeddings in the same row are more similar to each other than to those in different rows.
The original ViT used 1D learned embeddings: simply number the patches 1 through 196 and learn one vector per position. The paper's ablation also tested 2D encodings (separate row and column embeddings, each of half the model dimension, concatenated), relative position embeddings, and no position information at all. Surprisingly, 1D learned embeddings performed about as well as the 2D and relative variants. The model's self-attention mechanism is powerful enough to infer 2D structure from 1D positions.
| Position Type | Params | ImageNet 5-shot linear acc. (ViT-B/16) | Notes |
|---|---|---|---|
| No position info | 0 | ~61% | Bag of patches; clearly worse |
| 1D learned | 197 × 768 | ~64% | ViT default; surprisingly good |
| 2D learned | 2 × 14 × 384 | ~64% | On par with 1D |
| Relative | varies | ~64% | No clear gain over 1D |
One of the most exciting results in the ViT paper: performance kept improving as the models grew, with no sign of saturating in the range tested, provided there was enough pre-training data. The ViT paper defined three main model sizes; the suffix after the slash is the patch size:
| Model | Layers | d_model | Heads | MLP dim | Params |
|---|---|---|---|---|---|
| ViT-B/16 | 12 | 768 | 12 | 3072 | 86M |
| ViT-L/16 | 24 | 1024 | 16 | 4096 | 307M |
| ViT-H/14 | 32 | 1280 | 16 | 5120 | 632M |
Notice ViT-H uses 14×14 patches (not 16×16) — smaller patches give more tokens (256 instead of 196) and finer resolution at the cost of longer sequences. The H variant trades compute for detail.
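The parameter counts in the table can be roughly reproduced from the layer dimensions alone. The sketch below counts only the attention and MLP weight matrices in the encoder and deliberately ignores biases, LayerNorm, embeddings, and the classification head, so it lands slightly under the published figures. The function name `encoder_params` is hypothetical.

```python
def encoder_params(layers, d_model, mlp_dim):
    """Rough encoder weight count, ignoring biases, LayerNorm, embeddings, and head."""
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 2 * d_model * mlp_dim         # two linear layers
    return layers * (attention + mlp)


configs = {
    "ViT-B/16": (12, 768, 3072),
    "ViT-L/16": (24, 1024, 4096),
    "ViT-H/14": (32, 1280, 5120),
}
for name, cfg in configs.items():
    print(f"{name}: ~{encoder_params(*cfg) / 1e6:.0f}M")
# ViT-B/16: ~85M
# ViT-L/16: ~302M
# ViT-H/14: ~629M
```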
The data-hunger problem was a real barrier. In 2021, Facebook Research introduced DeiT (Data-efficient Image Transformers). The key insight: instead of massive datasets, use knowledge distillation from a strong CNN teacher (a RegNet or EfficientNet).
DeiT adds a second special token: the distillation token. Like CLS, it's prepended and attends to all patches. But instead of predicting class labels, it's trained to match the output distribution of the teacher CNN. The CLS token still predicts ground-truth classes. Both losses are combined. Result: a ViT-Base trained on ImageNet-1k alone, without any extra data, that matches or exceeds EfficientNet-B4.
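The combined objective can be sketched as follows. This follows the soft, distribution-matching description above; the DeiT paper also describes a hard-label variant in which the distillation head is trained on the teacher's argmax prediction. The function name `distillation_loss` and the hyperparameters `alpha` and `tau` are illustrative assumptions, not values taken from the text.

```python
import torch
import torch.nn.functional as F


def distillation_loss(cls_logits, dist_logits, teacher_logits, labels,
                      alpha=0.5, tau=1.0):
    """Label loss on the CLS head plus teacher-matching loss on the distillation head.

    alpha (loss mix) and tau (softmax temperature) are illustrative hyperparameters.
    """
    # CLS head: ordinary cross-entropy against ground-truth labels
    ce = F.cross_entropy(cls_logits, labels)
    # Distillation head: KL divergence to the teacher's softened distribution
    kd = F.kl_div(
        F.log_softmax(dist_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau * tau)
    return (1 - alpha) * ce + alpha * kd


cls_logits = torch.randn(4, 10)
dist_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)   # would come from a frozen CNN teacher
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
print(loss.item())
```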
ViT opened a door. Once it was clear that a pure Transformer could match and beat CNNs, researchers started asking: what else can we do with this architecture? The answer turned out to be: almost everything in vision.
| Model | Year | Key Innovation | Impact |
|---|---|---|---|
| ViT | 2020 | Pure Transformer on image patches | Proved Transformers work in vision |
| DeiT | 2021 | CNN distillation token | Made ViT trainable without JFT-300M |
| Swin | 2021 | Hierarchical windows, shifted attention | Strong on detection/segmentation |
| MAE | 2022 | Mask 75% patches, reconstruct pixels | Self-supervised ViT that scales |
| DINOv2 | 2023 | Self-distillation, no labels | Universal visual features |
| DiT | 2023 | ViT as diffusion denoiser | State-of-the-art image generation |
So should you pick a CNN or a ViT? As of 2023, the honest answer is: "it depends." CNNs (especially ConvNeXt, which redesigned them with Transformer-era best practices) remain competitive on limited data and edge devices. ViT-family models tend to dominate when large-scale pre-training is available. For many production applications, a pre-trained ViT from timm is a common starting point.
The deeper lesson: both CNNs and Transformers are computing feature hierarchies from images. CNNs do it with local operators and implicit global pooling. ViT does it with global attention from the start. Neither is "objectively correct" — they embody different priors about the world.
"We find that large Vision Transformers match or exceed the state of the art on many image recognition benchmarks, while also being conceptually simpler and more scalable than CNNs."
— Dosovitskiy et al., An Image is Worth 16×16 Words (2020)