Vision Transformer (ViT) — From Patches to Predictions

Chapter 0: Why Does Vision Need a New Idea?

Look at a photo of a dog jumping to catch a frisbee. Where is the frisbee? Your eye doesn't scan the image pixel by pixel — it immediately connects the dog's outstretched paws to the object in the air, even though they're on opposite sides of the frame. That's global reasoning: understanding relationships between things that are far apart.

For decades, computer vision relied on Convolutional Neural Networks (CNNs). A CNN slides a small filter — maybe 3×3 pixels — across the image. That filter can only see its tiny local neighborhood. To "understand" that the dog and the frisbee are related, a CNN needs many stacked layers, each one expanding the receptive field a little more. It's like trying to understand a sentence by reading three letters at a time.

The CNN limitation: A CNN processes local patches first, then gradually combines them into global understanding. Distant pixels can only "talk" to each other after passing through many layers. Long-range dependencies are expensive and slow.
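To make the "many stacked layers" point concrete, here is a small sketch of how the receptive field grows in a stack of stride-1 convolutions. It assumes no pooling, striding, or dilation (real CNNs use all three to grow the field faster), and the helper names are illustrative rather than taken from any library.

```python
import math

def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field (pixels along one axis) of stacked stride-1 convolutions."""
    # Each stride-1 layer widens the field by (kernel_size - 1) pixels.
    return 1 + num_layers * (kernel_size - 1)

def layers_to_span(width: int, kernel_size: int = 3) -> int:
    """Smallest number of stride-1 layers whose receptive field covers `width` pixels."""
    return math.ceil((width - 1) / (kernel_size - 1))

for n in (1, 5, 20):
    print(n, "layers ->", receptive_field(n), "pixels")

print("layers needed to span 224 pixels:", layers_to_span(224))
```

Under these simplifying assumptions, a plain stack of 3×3 filters needs over a hundred layers before one output position can see both edges of a 224-pixel-wide image.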

What If Every Patch Could See Every Other Patch?

In 2020, a team at Google Brain asked a radical question: what if we just used a Transformer on images? The Transformer's self-attention mechanism lets every element directly attend to every other element — in a single layer. No local windows, no hierarchical buildup. Patch 1 (top-left corner) can directly communicate with patch 196 (bottom-right corner) in one shot.

The trick is turning an image into a sequence. Chop the image into a grid of fixed-size patches, flatten each patch into a vector, and feed that vector sequence to a Transformer. That's the entire Vision Transformer (ViT) idea.

Interactive figure: CNN vs. ViT, How They See. Compares how a CNN and a ViT process the same image region.

Key insight: The difference isn't just architecture — it's a philosophy. CNNs are built around the inductive bias that nearby pixels are more related than distant ones. ViT makes no such assumption. It lets the data teach the model which patches matter for each other.

What is the core limitation of CNNs that ViT addresses?

CNNs can only process grayscale images, not color
CNNs struggle with long-range dependencies between distant image regions
CNNs require too much memory to run on GPUs

Chapter 1: Patchification — Images as Sentences

Before we can feed an image to a Transformer, we need to turn it into a sequence. Transformers expect a 1D sequence of vectors — like a sentence of word tokens. But an image is 2D: a grid of pixels. How do we make that into a sequence?

The answer is patchification: divide the image into a regular grid of non-overlapping square patches, then flatten each patch into a single long vector. That vector becomes one "token" in the sequence.

The Worked Math

Take a standard ImageNet-sized image: 224 × 224 pixels, with 3 color channels (RGB). ViT uses a patch size of 16 × 16 pixels. Let's count:

How many patches fit along one dimension? 224 ÷ 16 = 14 patches
Total patches in the 2D grid: 14 × 14 = 196 patches
Each patch contains: 16 × 16 × 3 = 768 values (pixels × channels)

So: one 224×224 RGB image becomes a sequence of 196 tokens, each a flat vector of dimension 768. That's our "sentence." The sequence length is 196, which is perfectly manageable for a Transformer.

224×224×3 → 196 patches × 768 values/patch
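The counting above can be checked with a short NumPy sketch. It assumes a channels-last image array of shape [H, W, C]; the `patchify` helper is a hypothetical name written for this illustration, not a library function.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an [H, W, C] image into flattened, non-overlapping square patches.

    Returns an array of shape [num_patches, patch_size * patch_size * C].
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # [gh, P, gw, P, C] -> [gh, gw, P, P, C] -> [gh * gw, P * P * C]
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch_size * patch_size * c)

image = np.random.rand(224, 224, 3).astype(np.float32)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Patches come out in row-major order, so token 0 is the top-left patch and token 195 is the bottom-right one.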

Compare to language: GPT-2 processes sequences of up to 1024 subword tokens. ViT-Base processes sequences of 196 image patches. Images, when broken into patches, are actually shorter "sentences" than many text passages.

Interactive figure: Image Patchification. A slider changes the patch size and shows the resulting token count (at 16 px: 196 patches × 768 values each).

Why 16×16?

It's a balance. Smaller patches (8×8) give more tokens (784), making attention O(n²) much more expensive. Larger patches (32×32) give fewer tokens (49) but lose fine-grained detail. The ViT paper tested 16×16 and 32×32 patches — 16×16 consistently performed better, especially on detailed recognition tasks.
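The trade-off can be made concrete with a few lines of arithmetic. This sketch only counts tokens and token pairs for a 224×224 input; it is a rough proxy for attention cost, not a FLOP count, and `token_count` is an illustrative helper.

```python
def token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image and square patches."""
    per_side = image_size // patch_size
    return per_side * per_side

for p in (8, 16, 32):
    n = token_count(224, p)
    # Self-attention scores every token against every token: n * n pairs.
    print(f"patch {p:>2}: {n:>3} tokens, {n * n:>7,} attention pairs per head")
```

Halving the patch size from 16 to 8 quadruples the token count and multiplies the number of attention pairs by 16.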

The naming convention: ViT models are named like "ViT-B/16" — the B stands for Base (model size), and /16 is the patch size. ViT-L/32 would be a Large model with 32×32 patches. We'll cover the size variants in Chapter 7.

A 224×224 image with 32×32 patches gives how many tokens?

98 tokens
196 tokens
49 tokens

Chapter 2: Linear Projection + the CLS Token

We have 196 flat patch vectors, each of dimension 768. But a Transformer expects tokens of dimension d_model — let's say 768 for ViT-Base. So each patch vector is already 768-dimensional... but we still project it. Why?

The raw flattened patch is just a list of pixel values, with no semantic meaning and no learned structure. We multiply each patch by a learned linear projection matrix E of shape [768, 768]. The ViT paper writes this projection without a bias term; most implementations, including the Conv2d version below, add one. After this projection, each token lives in the model's learned embedding space, not raw pixel space.

z_i = E · x_i where x_i ∈ ℝ⁷⁶⁸, E ∈ ℝ^{d × 768}

python
import torch
import torch.nn as nn

# Patchify + project in one shot using a strided convolution
image = torch.randn(2, 3, 224, 224)  # dummy batch: [B, 3, 224, 224]

# This Conv2d extracts 16x16 patches and projects each to d_model
# (bias=True by default, so a learned bias is added after the projection)
patch_embed = nn.Conv2d(
    in_channels=3,
    out_channels=768,   # d_model
    kernel_size=16,     # patch size
    stride=16           # non-overlapping
)

# Forward pass:
x = patch_embed(image)  # [B, 768, 14, 14]
x = x.flatten(2)        # [B, 768, 196]
x = x.transpose(1, 2)   # [B, 196, 768]: (batch, seq_len, d_model)

The CLS Token

Here's a subtlety: after patchification, we have 196 patch tokens. But what token summarizes the whole image for classification? ViT borrows an idea from BERT: prepend a special learnable token called the [CLS] token (short for "classification"). It has shape [1, d_model], and its input embedding is the same learned vector for every image. What differs from image to image is its output: through self-attention it learns to aggregate information from the patch tokens.

After prepending, we have a sequence of length 197: 1 CLS token + 196 patch tokens.

Position Embeddings

Transformers are inherently position-unaware: self-attention treats the input as a set, not a sequence. Without position information, the model could not tell whether a given patch came from the top-left corner (patch 1) or the bottom-right corner (patch 196). So we add a learnable position embedding to each token (including CLS). These are learned vectors of shape [197, 768], one per position.

Raw image

224×224×3

↓

Patchify + project

196 × 768 patch embeddings

↓

Prepend CLS

197 × 768 tokens

↓

Add position embeddings

197 × 768 (+ learned positions)

↓

Transformer Encoder input

z₀ ∈ ℝ^{197 × 768}
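The last two stages of the pipeline (prepend CLS, add position embeddings) can be sketched in NumPy for a single image. Random arrays stand in for the learned parameters and for the projected patches, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_model = 196, 768

# Stand-ins for learned parameters and projected patches (random here).
patch_tokens = rng.standard_normal((num_patches, d_model))   # output of the projection
cls_token = rng.standard_normal((1, d_model))                # one shared learnable vector
pos_embed = rng.standard_normal((num_patches + 1, d_model))  # one row per position, CLS included

# Prepend CLS, then add position embeddings element-wise.
tokens = np.concatenate([cls_token, patch_tokens], axis=0)   # [197, 768]
z0 = tokens + pos_embed                                      # [197, 768]

print(tokens.shape, z0.shape)
```

Position embeddings are added rather than concatenated, so the sequence keeps the shape 197 × 768 all the way into the encoder.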

Interactive figure: Full Embedding Pipeline. Shows how the tensor shape changes at each stage.

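Key insight: The strided Conv2d trick is elegant: it does patchification and projection in a single convolution call, which is efficient on GPUs. Mathematically, it's equivalent to manually flattening each patch and multiplying by E (plus the convolution's bias, if enabled), provided each patch is flattened in the same order as the kernel weights.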
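The equivalence can be checked numerically. The sketch below uses NumPy rather than PyTorch, writes the strided convolution as explicit loops, and uses deliberately tiny sizes; the kernel layout [D, C, P, P] mirrors a Conv2d weight, and every name here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, P, D = 3, 8, 8, 4, 5               # small sizes so the loops stay fast
image = rng.standard_normal((C, H, W))
kernel = rng.standard_normal((D, C, P, P))  # same layout as a Conv2d weight
bias = rng.standard_normal(D)

# (1) Strided convolution written out directly: one window per patch.
conv_out = np.zeros((D, H // P, W // P))
for i in range(H // P):
    for j in range(W // P):
        window = image[:, i * P:(i + 1) * P, j * P:(j + 1) * P]  # [C, P, P]
        conv_out[:, i, j] = (kernel * window).sum(axis=(1, 2, 3)) + bias

# (2) Flatten each patch in (C, P, P) order, then multiply by the flattened kernel.
patches = image.reshape(C, H // P, P, W // P, P).transpose(1, 3, 0, 2, 4)
patches = patches.reshape(-1, C * P * P)   # [N, C*P*P]
E = kernel.reshape(D, C * P * P)           # [D, C*P*P]
linear_out = patches @ E.T + bias          # [N, D]

# Same numbers, different layout: [D, gh, gw] versus [N, D].
same = np.allclose(conv_out.reshape(D, -1).T, linear_out)
print(same)
```

The match depends on flattening each patch in the same channel-first order as the kernel; a different flattening order would correspond to a permuted E.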

After patchification and adding the CLS token, what is the sequence length fed to the Transformer?

196 (just the patches)
197 (patches + CLS)
768 (the embedding dimension)

Chapter 3: The Transformer Encoder on Patches

Here's the beautiful part: the Transformer Encoder in ViT is essentially the one used for language, with no vision-specific modifications. The same architecture that processes "The cat sat on the mat" now processes 197 image patch tokens. Vision is just another sequence.

Each Transformer Encoder block has two sub-layers, both with residual connections and Layer Normalization applied before each sub-layer (the "Pre-LN" variant, which is more stable to train):

Multi-Head Self-Attention (MSA): Every patch attends to every other patch. The CLS token attends to all patches. All patches attend to CLS.
MLP Block: A two-layer feedforward network with GELU activation, applied independently to each token.

z'_l = MSA(LN(z_{l-1})) + z_{l-1}
z_l = MLP(LN(z'_l)) + z'_l

python
import torch.nn as nn

class ViTBlock(nn.Module):
    """One Pre-LN encoder block: attention and MLP, each wrapped in a residual."""
    def __init__(self, d_model=768, n_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp   = nn.Sequential(
            nn.Linear(d_model, d_model * mlp_ratio),
            nn.GELU(),
            nn.Linear(d_model * mlp_ratio, d_model)
        )

    def forward(self, x):
        # x: [B, seq_len, d_model]
        x_norm = self.norm1(x)                           # LayerNorm before attention
        attn_out, _ = self.attn(x_norm, x_norm, x_norm)  # self-attention: Q = K = V
        x = x + attn_out                                 # residual
        x = x + self.mlp(self.norm2(x))                  # LayerNorm, MLP, residual
        return x

What Does Attention Look Like on Images?

In language, attention heads learn things like: "this pronoun refers to that noun." On images, attention heads learn similarly interpretable patterns. In early layers, many heads attend locally (similar to CNN kernels), although some already attend across most of the image; in deeper layers, attention is mostly global, connecting distant patches that are semantically related. A head processing a bird photo might learn to link the beak, wings, and tail even though they're far apart spatially.

The MLP ratio: The hidden dimension of the MLP is d_model × 4. So for d_model=768, the MLP has a 3072-dimensional hidden layer. This ratio of 4 is standard from the original Transformer paper and ViT keeps it. It's a knob that trades parameters for capacity.

Interactive figure: Attention in a Transformer Block. Shows how a single Transformer block processes 8 tokens through attention and MLP.

In a ViT Transformer block, where is Layer Normalization applied?

Before each sub-layer (Pre-LN), for training stability
After each sub-layer (Post-LN), as in the original 2017 Transformer
Only after the final layer, once

Chapter 4: Classification — The CLS Token Payoff

After L Transformer Encoder blocks, we have 197 output tokens, each of shape [d_model]. These tokens have now "talked" to each other through L rounds of attention. Which token do we use to classify the image?

We use the CLS token — the special [CLS] token we prepended back in Chapter 2. After passing through all the Encoder layers, the CLS token's output vector has accumulated global information about the image by attending to all 196 patch tokens. We feed only this vector into a small MLP head (also called the classification head) that maps from [d_model] to [num_classes].

y = MLP_head(LN(z_L⁰))

Where z_L⁰ is the CLS token (index 0) at the final encoder layer L. The MLP head is just a single linear layer during fine-tuning (and a two-layer MLP during pre-training).

CLS vs. Global Average Pooling

There's an alternative: instead of using the CLS token, average all 196 patch token outputs, known as global average pooling (GAP). Several follow-up architectures, such as Swin Transformer, pool this way. (DeiT, covered in Chapter 7, keeps the CLS token.) The ViT paper tested both and found they perform similarly once the learning rate is tuned for each; CLS is more "Transformer-native" while GAP slightly simplifies the architecture.

CLS Token

Borrowed from BERT
One designated "aggregator" token
Attends to all patches through L layers
Final CLS output = image representation
Extra parameter: the learned CLS embedding

Global Avg Pooling

Common in CNNs (GoogLeNet, ResNet)
Average all 196 patch outputs
No special token needed
Mean of all patch representations
Simpler, slightly fewer parameters
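The two strategies differ only in how one vector is pulled out of the encoder output. A minimal NumPy sketch, assuming a single image, a random stand-in for the encoder output, a randomly initialized linear head, and omitting the final LayerNorm for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_classes = 768, 10
encoder_out = rng.standard_normal((197, d_model))  # stand-in for the final encoder output

# Option 1: take the CLS token (index 0).
cls_repr = encoder_out[0]                # [768]
# Option 2: global average pooling over the 196 patch tokens (CLS excluded).
gap_repr = encoder_out[1:].mean(axis=0)  # [768]

# Either vector goes through the same kind of linear head (random weights here).
W_head = rng.standard_normal((d_model, num_classes)) * 0.02
b_head = np.zeros(num_classes)
logits_cls = cls_repr @ W_head + b_head  # [10]
logits_gap = gap_repr @ W_head + b_head  # [10]
print(logits_cls.shape, logits_gap.shape)
```

Both paths end in a vector of length d_model, so the classification head is the same size either way.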

Interactive figure: CLS vs. GAP Classification. Shows how the two strategies extract an image-level representation from patch tokens.

Why does the CLS token work? Because self-attention is global. The CLS token can directly attend to every patch in every layer. By the final layer, it has had L opportunities to gather information from all 196 patches. It's the "editor" who read all the pages and writes the summary.

In ViT-Base, what is used for final classification?

The output of the CLS token after the final Encoder block
The output of the first patch token (top-left corner)
The sum of all 197 output tokens

Chapter 5: Training — Why ViT Needs More Data

When the ViT paper came out, it had a surprising finding: trained on ImageNet-1k (~1.3M images), ViT-Base performed worse than ResNet. On ImageNet-21k (~14M images), they were comparable. Only on JFT-300M (300 million images) did ViT clearly win. Why does a Transformer need so much data to beat a CNN?

It comes down to inductive biases. A CNN is pre-loaded with assumptions: nearby pixels are related (locality), and the same filter is applied at every location, so a pattern is detected wherever it appears (translation equivariance: a cat is a cat wherever it appears in the image). These biases are baked into the architecture by construction. The model doesn't need to learn them from data.

The inductive bias tradeoff: CNNs get translation equivariance and locality for free. ViT gets almost nothing for free: apart from the patch grid and the per-token MLPs, it must learn spatial relationships from data. That takes more data. But with enough data, the less constrained model can overtake the CNN, because it is free to learn relationships that a fixed local prior makes hard to represent.

Transfer Learning: The Practical Fix

The solution isn't to collect 300M images for every project; it's transfer learning. Pre-train a large ViT on JFT-300M (or ImageNet-21k), then fine-tune on your target dataset with the classification head replaced. The Transformer's weights already encode rich image representations, so fine-tuning can work well even with only a few thousand examples.

During fine-tuning, one wrinkle appears: the pre-trained position embeddings were learned for a 14×14 grid (196 patches at 16×16 on 224×224). If you fine-tune at a higher resolution (say, 384×384), you suddenly have 24×24 = 576 patches, so 380 of the patch positions have no pre-trained embedding. The fix is 2D interpolation: reshape the 196 patch position embeddings into their 14×14 grid, bilinearly interpolate to the new 24×24 grid, and carry the CLS position embedding over unchanged. It works surprisingly well.
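The interpolation step can be sketched in plain NumPy. This version assumes the CLS position embedding sits at index 0 and uses the "align corners" coordinate convention; libraries differ on that convention (and some use bicubic instead of bilinear), so treat it as an illustration of the idea rather than a drop-in replacement. Both function names are hypothetical.

```python
import numpy as np

def resize_bilinear(grid: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Bilinearly resize a [H, W, D] grid to [new_h, new_w, D] (corners preserved)."""
    h, w, _ = grid.shape
    ys = np.linspace(0, h - 1, new_h)   # source row coordinate for each output row
    xs = np.linspace(0, w - 1, new_w)   # source column coordinate for each output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]       # vertical blend weights
    wx = (xs - x0)[None, :, None]       # horizontal blend weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def resize_pos_embed(pos_embed: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """Resize [1 + old_grid**2, D] position embeddings to [1 + new_grid**2, D]."""
    cls_pos, patch_pos = pos_embed[:1], pos_embed[1:]   # CLS is kept as-is
    d = patch_pos.shape[1]
    grid = patch_pos.reshape(old_grid, old_grid, d)
    resized = resize_bilinear(grid, new_grid, new_grid).reshape(new_grid * new_grid, d)
    return np.concatenate([cls_pos, resized], axis=0)

# Small embedding width (32) to keep the demo light; ViT-Base would use 768.
pos_embed = np.random.default_rng(0).standard_normal((1 + 14 * 14, 32))
new_pos = resize_pos_embed(pos_embed, old_grid=14, new_grid=24)
print(new_pos.shape)  # (577, 32)
```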

Interactive figure: ViT vs. CNN Data Efficiency. Compares accuracy as training set size grows (approximate, from the ViT paper).

python
# Transfer learning: fine-tune ViT-B/16 on a custom 10-class dataset
import timm
import torch

# Load pretrained ViT-Base/16. In timm these weights are typically ImageNet-21k
# pre-trained and then fine-tuned on ImageNet-1k; the exact checkpoint depends
# on the timm version.
model = timm.create_model(
    'vit_base_patch16_224',
    pretrained=True,
    num_classes=10   # replace head for 10 classes
)

# The backbone weights are already rich feature extractors
# Fine-tune everything (or just the head for very small datasets)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

Rule of thumb: With fewer than ~10k samples, freeze the ViT backbone and train only the head. With 10k–100k samples, fine-tune all layers with a small learning rate. With more than that, full fine-tuning is the clear default. A ViT pre-trained on large data is fairly robust to overfitting, but treat these thresholds as starting points and check them against a validation set.

Why does ViT underperform CNNs when trained only on ImageNet-1k (~1.3M images)?

ViT lacks useful inductive biases (locality, translation equivariance) that CNNs have built in, so it needs more data to learn spatial structure from scratch
ViT is too small for ImageNet-1k; you need a larger model
ImageNet-1k images are too high-resolution for a Transformer

Chapter 6: Position Embeddings — Where Is This Patch?

Self-attention is a set operation: if you shuffle the input tokens, the output shuffles identically. The model sees no difference between "patch 5 is at position 5" and "patch 5 is at position 50." This is great for parallelism, terrible for understanding images — because where a patch is matters enormously.
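The shuffle claim is easy to verify. The sketch below implements a single attention head in NumPy with random weights and no position embeddings (all names are illustrative), then checks that permuting the input rows permutes the output rows in exactly the same way.

```python
import numpy as np

def self_attention(x: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """Single-head self-attention over tokens x of shape [n, d], no position info."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(x, Wq, Wk, Wv)
out_shuffled = self_attention(x[perm], Wq, Wk, Wv)

# Shuffling the inputs just shuffles the outputs the same way.
print(np.allclose(out[perm], out_shuffled))
```

Each token's output is unchanged by the shuffle; only its row index moves, which is why position has to be injected separately.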

ViT adds a learned position embedding vector to each token. There are 197 position embeddings (one for CLS, one per patch), each of dimension d_model=768. These vectors are random at initialization and learned end-to-end during training. The model must figure out, from the task signal alone, what spatial structure to encode in these vectors.

What Does ViT Actually Learn?

When researchers visualized the cosine similarity between position embeddings, they found a striking pattern: nearby patches have similar embeddings, and the similarity decays with distance. The model spontaneously learns 2D spatial structure even from 1D position indices. It even partially recovers row/column structure — position embeddings in the same row are more similar to each other than to those in different rows.

Interactive figure: Learned Position Embedding Similarity. Clicking a patch position shows which other patches have similar position embeddings (brighter = more similar), mirroring what ViT actually learns.

1D vs. 2D Position Encodings

The original ViT used 1D learned embeddings: simply number the patches 1 through 196 and learn one vector per position. The paper also tested 2D encodings (separate row and column embeddings, then concatenated or added) and relative position embeddings. Surprisingly, 1D learned embeddings performed about as well as the 2D and relative variants in that ablation. The model's self-attention mechanism is powerful enough to infer 2D structure from 1D positions.

Position Type	Params	ImageNet 5-shot linear acc. (ViT-B/16)	Notes
No position info	0	~61%	Bag of patches, clearly worse
1D learned	197 × 768	~64.2%	ViT default, surprisingly good
2D learned	2 × 14 × 384	~64.0%	On par with 1D
Relative	varies	~64.0%	On par with 1D

Key insight: Position embeddings don't need to be geometrically perfect — they just need to be consistently different for each position so the model can distinguish them. The model's attention mechanism does the rest of the spatial reasoning work.

What does ViT use for position embeddings by default?

Learned 1D position embeddings: one vector per patch position, trained end-to-end
Fixed sinusoidal encodings from the original Transformer paper
2D convolutional position features derived from patch coordinates

Chapter 7: Scaling — ViT-B, L, H and DeiT

One of the most exciting results in the ViT paper: with enough pre-training data, ViT performance keeps improving as the model grows, with no obvious saturation in the ranges tested. The ViT paper defined three main model sizes:

Model	Layers	d_model	Heads	MLP dim	Params
ViT-B/16	12	768	12	3072	86M
ViT-L/16	24	1024	16	4096	307M
ViT-H/14	32	1280	16	5120	632M

Notice ViT-H uses 14×14 patches (not 16×16) — smaller patches give more tokens (256 instead of 196) and finer resolution at the cost of longer sequences. The H variant trades compute for detail.
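The parameter counts in the table can be roughly reproduced from the layer dimensions alone. This sketch assumes a standard layout (biases on every linear layer, two LayerNorms per block, a final LayerNorm, and a single 1000-class linear head at 224×224 input); `vit_param_count` is a hypothetical helper, and the totals land close to, but not exactly on, the published figures.

```python
def vit_param_count(depth, d_model, mlp_dim, patch_size,
                    image_size=224, channels=3, num_classes=1000):
    """Approximate parameter count for a standard ViT classifier."""
    num_patches = (image_size // patch_size) ** 2
    patch_dim = patch_size * patch_size * channels

    patch_embed = patch_dim * d_model + d_model          # projection + bias
    cls_and_pos = d_model + (num_patches + 1) * d_model  # CLS token + position embeddings

    attn = 4 * (d_model * d_model + d_model)             # Q, K, V, and output projections
    mlp = (d_model * mlp_dim + mlp_dim) + (mlp_dim * d_model + d_model)
    norms = 2 * (2 * d_model)                            # two LayerNorms per block
    block = attn + mlp + norms

    final_norm = 2 * d_model
    head = d_model * num_classes + num_classes           # single linear head
    return patch_embed + cls_and_pos + depth * block + final_norm + head

variants = {
    "ViT-B/16": dict(depth=12, d_model=768, mlp_dim=3072, patch_size=16),
    "ViT-L/16": dict(depth=24, d_model=1024, mlp_dim=4096, patch_size=16),
    "ViT-H/14": dict(depth=32, d_model=1280, mlp_dim=5120, patch_size=14),
}
for name, cfg in variants.items():
    tokens = (224 // cfg["patch_size"]) ** 2
    print(f"{name}: {tokens} patch tokens, ~{vit_param_count(**cfg) / 1e6:.0f}M parameters")
```

The number of heads does not appear in the formula: splitting d_model across more heads changes how attention is computed, not how many weights it has.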

DeiT: Training ViT Without 300M Images

The data-hunger problem was a real barrier. In 2021, Facebook Research introduced DeiT (Data-efficient Image Transformers). The key insight: instead of massive datasets, use knowledge distillation from a strong CNN teacher (a RegNet or EfficientNet).

DeiT adds a second special token: the distillation token. Like CLS, it's a learned token that attends to all patches. But instead of predicting ground-truth labels, it's trained to match the teacher CNN: either the teacher's full output distribution (soft distillation) or just its predicted class (hard distillation, which the DeiT paper reports works best). The CLS token still predicts ground-truth classes. Both losses are combined. Result: a ViT-Base trained on ImageNet-1k alone, without any extra data, that matches or exceeds EfficientNet-B4.

Teacher CNN (frozen)

e.g. RegNetY-16GF — produces soft labels

↓ distillation loss

Distillation Token

Learns to mimic teacher's output distribution

CLS Token

Learns from ground-truth labels

↓

Combined DeiT model

Matches EfficientNet on ImageNet-1k alone
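The two-headed objective in the diagram can be sketched as follows. This illustrates the hard-label variant, in which the distillation head is trained on the teacher's argmax class rather than its full distribution; a soft variant would replace the second term with a divergence to the teacher's softened probabilities. The equal 0.5/0.5 weighting and all function names are assumptions of this sketch, and random logits stand in for real model outputs.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy for logits [B, K] and integer class targets [B]."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Average of the label loss (CLS head) and the teacher loss (distillation head)."""
    teacher_labels = teacher_logits.argmax(axis=1)           # teacher's hard decision
    loss_cls = cross_entropy(cls_logits, labels)             # CLS head vs. ground truth
    loss_dist = cross_entropy(dist_logits, teacher_labels)   # distillation head vs. teacher
    return 0.5 * loss_cls + 0.5 * loss_dist

rng = np.random.default_rng(0)
B, K = 4, 10
cls_logits, dist_logits, teacher_logits = (rng.standard_normal((B, K)) for _ in range(3))
labels = rng.integers(0, K, size=B)
print(hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels))
```

The teacher only supplies targets here; no gradient flows into it, which matches the "frozen" label in the diagram.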

Interactive figure: ViT Model Sizes. Visualizes the architecture differences across ViT variants.

Scaling law insight: For CNNs, adding more layers past a certain depth tends to give diminishing returns. Large Transformers have shown smooth, roughly power-law improvements as compute, data, and model size grow together. This is a major reason the field moved to Transformers: the scaling recipe has kept working.

What is the main innovation in DeiT compared to the original ViT?

DeiT uses 2D position embeddings instead of 1D
DeiT uses knowledge distillation from a CNN teacher, enabling ViT to train on ImageNet-1k without massive external data
DeiT replaces self-attention with convolutions in lower layers

Chapter 8: Showcase — ViT in Action

Time to see the full ViT pipeline come alive. This showcase has three interactive panels: watch an image get split into patches, explore how attention maps connect distant patches, and see how the CLS token accumulates class evidence layer by layer.

Panel 1: Patchification Explorer

Draw on the left (or select a scene), then watch it split into patches on the right. Hovering over a patch highlights it in both views; clicking selects it for the attention view.

Panel 2: Attention Map Viewer

Select a source patch to see which patches it attends to most strongly. A layer slider and a head selector show attention patterns evolving from local (early layers) to global, semantic attention (deep layers).

Panel 3: CLS Token Evidence Accumulation

Watch how the CLS token's "confidence" in the correct class builds layer by layer as it gathers information from all patches. The CLS token starts as noise and converges to a class prediction over 12 layers.

Chapter 9: Connections — Where ViT Goes Next

ViT opened a door. Once it was clear that a pure Transformer could match and beat CNNs, researchers started asking: what else can we do with this architecture? The answer turned out to be: almost everything in vision.

The ViT Family Tree

ViT (2020)

An Image is Worth 16×16 Words — pure Transformer on image patches, needs large data

↓

DeiT (2021)

Data-efficient training via CNN teacher distillation — works on ImageNet-1k alone

↓

MAE (2022)

Masked Autoencoders — mask 75% of patches, predict the missing ones. Self-supervised ViT pre-training beats supervised on many tasks

↓

DINO / DINOv2 (2021/2023)

Self-supervised ViT via self-distillation. No labels needed. DINOv2 features transfer to depth estimation, segmentation, classification

↓

DiT (2023)

Diffusion Transformers — replace the U-Net in diffusion models with a ViT backbone. Powers Stable Diffusion 3, FLUX

Key Innovations at Each Step

Model	Year	Key Innovation	Impact
ViT	2020	Pure Transformer on image patches	Proved Transformers work in vision
DeiT	2021	CNN distillation token	Made ViT trainable without JFT-300M
Swin	2021	Hierarchical windows, shifted attention	Strong backbone for detection/segmentation
MAE	2022	Mask 75% patches, reconstruct pixels	Self-supervised ViT that scales
DINOv2	2023	Self-distillation, no labels	Universal visual features
DiT	2023	ViT as diffusion denoiser	State-of-the-art image generation
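As one concrete example from the table, MAE's "mask 75% of patches" step amounts to keeping a random quarter of the patch tokens. The sketch below covers only that masking step, with random stand-in tokens; the encoder, decoder, and reconstruction loss are omitted, and `random_masking` is an illustrative helper name.

```python
import numpy as np

def random_masking(tokens: np.ndarray, mask_ratio: float, rng: np.random.Generator):
    """Keep a random subset of patch tokens; return the kept tokens and a boolean mask."""
    n = tokens.shape[0]
    num_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:num_keep])
    mask = np.ones(n, dtype=bool)   # True = masked (hidden from the encoder)
    mask[keep_idx] = False
    return tokens[keep_idx], mask

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((196, 768))
visible, mask = random_masking(patch_tokens, mask_ratio=0.75, rng=rng)
print(visible.shape, int(mask.sum()))  # (49, 768) 147
```

Only the 49 visible tokens would be fed to the encoder, which is a large part of why this style of pre-training is cheap relative to processing all 196.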

Interactive figure: The ViT Lineage. An interactive timeline of ViT-derived architectures.

ViT vs. CNN: Who Wins?

By 2023, the honest answer is: "it depends." CNNs (especially ConvNeXt, which redesigned them with Transformer-era best practices) remain competitive on limited data and edge devices. ViT-family models dominate when large-scale pre-training is available. For many production applications today, a pre-trained ViT from the timm library is a common starting point.

The deeper lesson: both CNNs and Transformers are computing feature hierarchies from images. CNNs do it with local operators and implicit global pooling. ViT does it with global attention from the start. Neither is "objectively correct" — they embody different priors about the world.

What to learn next: If ViT clicked for you, explore: VLMs (adding language to vision), the Diffusion lesson (where DiT comes from), and MAE (self-supervised learning). The transformer backbone you just learned carries through all of them.

"Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train."
Dosovitskiy et al., An Image is Worth 16×16 Words (2020)

MAE (Masked Autoencoders) trains ViT by:

Predicting ImageNet class labels from masked patches
Masking 75% of image patches and predicting their pixel values, a self-supervised task requiring no labels
Distilling from a teacher CNN on masked images

See Images as Sequences with ViT