An Image is Worth 16x16 Words

Chapter 0: Why Patches?

You have a 224×224 pixel image. You want to classify it -- is it a cat, a car, a sunset? Since AlexNet won ImageNet in 2012, the default answer has been convolutional neural networks (CNNs). Stacks of small filters slide across the image, learning edges, then textures, then object parts, then whole objects. It works. It powers most of modern computer vision.

But transformers have eaten NLP alive. BERT, GPT, T5 -- the same architecture dominates every language task. Transformers have one superpower CNNs lack: global attention. Every token can attend to every other token, no matter how far apart. A CNN needs dozens of layers before a pixel in the top-left can "see" the bottom-right. A transformer does it in one layer.

The problem? A 224×224 image has 50,176 pixels. Self-attention scales quadratically with sequence length. Attending every pixel to every pixel would require 50,176² ≈ 2.5 billion operations per layer. That's absurd.

The core insight of ViT: Don't treat pixels as tokens. Treat patches as tokens. Chop the image into a grid of 16×16 patches. A 224×224 image becomes just 196 patches -- a perfectly manageable sequence length for a standard transformer. Each patch is a "word." The image is a "sentence."
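The arithmetic behind this choice is easy to check. Below is a minimal sketch that counts tokens and query-key pairs for pixel-level and patch-level tokenization. The helper name `sequence_stats` is ours for illustration and does not come from the paper.

```python
def sequence_stats(height, width, patch):
    """Return (num_tokens, num_attention_pairs) for square patches of side `patch`."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens  # every token attends to every token


for patch in (1, 16):  # patch=1 means one token per pixel
    tokens, pairs = sequence_stats(224, 224, patch)
    print(f"patch={patch:>2}: {tokens:>6} tokens, {pairs:>13,} attention pairs")
```

With one token per pixel the pair count is about 2.5 billion; with 16×16 patches it is 38,416.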

This is what Dosovitskiy et al. showed in 2020: a pure transformer, applied directly to sequences of image patches, with no convolutional layers at all, matches or beats the best CNNs on image classification. The catch? You need a lot of data. With ImageNet alone (1.3M images), ViT underperforms ResNets. With JFT-300M (300M images), it dominates -- at a fraction of the training cost.

The full data flow at a glance: Image (224×224×3) → split into 196 patches of 16×16×3 → flatten each patch to a 768-dim vector → linear projection to D-dim embedding → prepend [CLS] token (197 tokens total) → add learned position embeddings → 12 transformer encoder layers → take CLS token output → LayerNorm → MLP head → 1000-dim logits (ImageNet classes). Every tensor shape is known. Every operation is a matrix multiply or element-wise op. There is no magic.

Prior work had tried applying attention to images in limited ways -- local attention windows, sparse attention patterns, attention augmenting CNN feature maps. The radical move of ViT was to say: forget all of that. Take the exact same transformer from NLP. Feed it image patches. See what happens. The answer changed computer vision forever.

[Interactive figure: Pixels vs Patches: Sequence Length. A patch-size control (default 16) shows how sharply the sequence length drops as patches grow.]

Why can't we just treat every pixel as a token in a standard transformer?

• Self-attention scales quadratically with sequence length -- 50,176 pixels would require ~2.5 billion operations per layer, which is computationally infeasible
• Pixels don't contain enough information individually
• Transformers can only process text, not images

Chapter 1: Patch Tokenization

Here is the first step of ViT, and it's beautifully simple. Take your image x ∈ R^{H×W×C} (height H, width W, C color channels). Chop it into a grid of non-overlapping patches, each of size P×P pixels.

For a 224×224 RGB image with P=16:

N = HW / P² = (224 × 224) / (16 × 16) = 196 patches

Each patch is a small 16×16×3 block of pixels -- 768 raw numbers. Flatten it into a single vector: x_p ∈ R⁷⁶⁸. That vector is the "word." The sequence of 196 such vectors is the "sentence." The transformer reads this sentence and classifies the image.
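The slicing step can be written as a pair of reshapes. This is a NumPy sketch under our own assumptions: a channels-last image, patches ordered row by row across the grid, and a helper name `patchify` that is illustrative, not from the paper.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches,
    ordered row by row across the patch grid."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (grid_rows, grid_cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)


image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = patchify(image, 16)
print(patches.shape)  # (196, 768)
```

Row 0 of `patches` is the top-left 16×16 block, row 1 is the block to its right, and row 14 starts the second row of the grid.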

Why 16×16? It's a sweet spot. Smaller patches (8×8) give 784 tokens -- still manageable, but that is 4× the tokens and roughly 16× the attention cost. Larger patches (32×32) give only 49 tokens -- fast but each patch is so big that local detail is lost. The paper names its models by patch size: ViT-B/16 means Base model with 16×16 patches. ViT-L/32 means Large model with 32×32 patches.

The engineering tradeoff in numbers: Self-attention is O(N²·D) where N is sequence length. At P=8: N=784, cost ∝ 784²=614,656 per head. At P=16: N=196, cost ∝ 196²=38,416 per head. At P=32: N=49, cost ∝ 49²=2,401 per head. Going from 16 to 8 is a 16× increase in attention cost for a 2× increase in spatial resolution. The paper found that ViT-B/16 clearly outperforms ViT-B/32 (80.73%), but ViT-L/16 takes 3.6× more compute than ViT-L/32. In practice the 16×16 choice is a good accuracy-per-FLOP compromise.
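To see how these ratios fall out, here is a rough sketch that scores each patch size by N² and by an approximate attention FLOP count. The 4·N²·D estimate is our simplifying assumption: it covers two matrix products (query-key scores and the weighted sum of values) at 2 FLOPs per multiply-add. It ignores the Q/K/V projections, the softmax, and the MLP.

```python
def attention_cost(image_size, patch, dim=768):
    """Approximate per-layer attention cost; the 4*N^2*D model is a simplification."""
    n = (image_size // patch) ** 2
    return {"tokens": n, "pairs": n * n, "approx_flops": 4 * n * n * dim}


baseline = attention_cost(224, 16)
for patch in (8, 16, 32):
    cost = attention_cost(224, patch)
    ratio = cost["pairs"] / baseline["pairs"]
    print(f"P={patch:>2}: N={cost['tokens']:>3}, N^2={cost['pairs']:>7,}, "
          f"{ratio:g}x the P=16 attention cost")
```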

Notice what's missing: no sliding window, no stride, no padding, no learned filters. Just a hard grid. A 5-year-old could do this with scissors. The entire spatial inductive bias of CNNs -- locality, translation equivariance, hierarchical feature extraction -- is thrown away. The transformer must learn all spatial relationships from scratch, using only the data.

There's a subtle but important detail here. The patches are non-overlapping. In a CNN, the first convolutional layer slides a small filter across the image with overlap (stride < kernel size). This means neighboring pixels share computations. ViT's patches have no overlap -- each pixel belongs to exactly one patch. Any information sharing between neighboring patches must happen in the transformer layers, through attention.

The naming convention encodes the patch size into the model name. ViT-B/16 means Base architecture with 16×16 patches (196 tokens from a 224×224 image). ViT-L/32 means Large architecture with 32×32 patches (49 tokens). Smaller patches = more tokens = more compute but finer spatial resolution.

[Interactive figure: Image Patch Grid. The image is sliced into a grid of patches (default patch size P = 16); each colored square becomes one token.]

A 224×224 image with 32×32 patches produces how many tokens?

• 196
• 49 -- because (224/32)² = 7² = 49
• 784

Chapter 2: Linear Projection

Each flattened patch is a 768-dimensional vector (for 16×16×3). But the transformer's hidden dimension D might be different -- 768 for ViT-Base, 1024 for ViT-Large, 1280 for ViT-Huge. We need to map each patch vector into the transformer's embedding space.

This is done with a single trainable linear projection -- a matrix multiply:

e_i = x_pⁱ · E, E ∈ R^{(P²·C) × D}

For ViT-Base with 16×16 RGB patches: E is a 768×768 matrix. Each raw patch vector (768 numbers of pixel values) gets multiplied by this matrix to produce a 768-dimensional patch embedding.

Think of it this way: The linear projection is doing what the first convolutional layer of a CNN does -- extracting features from a local image region. But instead of learned convolutional filters (which impose locality and weight sharing), it's a single dense matrix applied independently to each patch. The paper shows that the learned projection weights resemble Gabor-like filters and color blobs -- the same low-level features CNNs learn, discovered purely from data.

After projection, each of the N patches is a D-dimensional vector. The image is now a sequence of N embedding vectors, just like a sentence of N word embeddings. From here, the transformer doesn't need to know these came from an image.

Tensor shapes through projection: Input image: (224, 224, 3). After splitting: (196, 16, 16, 3). After flattening patches: (196, 768). After multiplying by E: (196, 768) for ViT-Base. The E matrix has 768×768 = 589,824 parameters -- exactly the weight count of a 16×16 conv layer with 3 input channels and 768 output channels (16·16·3·768). This is not a coincidence: mathematically, the linear projection on non-overlapping patches is identical to a convolution with kernel size = stride = patch size. PyTorch implementations commonly write it as nn.Conv2d(3, 768, kernel_size=16, stride=16).
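The equivalence between the dense projection and a strided convolution can be checked numerically. The sketch below uses NumPy only, with random stand-in weights and a small D for speed. The strided convolution is spelled out as an `einsum` rather than a framework call, and the row ordering of E (row, column, channel) is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
P, C, D = 16, 3, 8  # small D keeps the demo fast
S = 224 // P        # patches per side
image = rng.standard_normal((224, 224, C))
E = rng.standard_normal((P * P * C, D))  # rows ordered (row, col, channel)

# Route 1: cut patches, flatten each one, multiply by E.
blocks = image.reshape(S, P, S, P, C)
flat = blocks.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
tokens_linear = flat @ E  # (196, D)

# Route 2: view E as D kernels of shape (P, P, C) and slide them with stride P.
kernels = E.reshape(P, P, C, D)
tokens_conv = np.einsum("iajbc,abcd->ijd", blocks, kernels).reshape(-1, D)

print(np.allclose(tokens_linear, tokens_conv))  # True
```

Both routes compute the same dot products; they differ only in how the pixels are indexed.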

The paper inspects the learned projection weights and finds something striking. The top principal components of E look like Gabor filters, oriented edges, and color blobs -- the same features that the first layers of CNNs learn. The transformer discovers these low-level feature extractors purely from data, without any convolutional structure forcing it.

There is also a hybrid variant: instead of raw image patches, feed the output feature maps of a CNN (like ResNet) into the transformer. In this hybrid model, the CNN handles local feature extraction and the transformer handles global reasoning. The paper finds hybrids are slightly better at small compute budgets, but the advantage vanishes at scale. Pure ViT catches up.

Raw patch

16×16×3 = 768 pixel values, flattened into a vector

↓ multiply by E (768×768)

Patch embedding

768-dimensional learned representation capturing local structure

What does the linear projection matrix E learn to do?

• It reduces the number of patches
• It maps raw pixel values from each patch into a learned embedding space, extracting local features similar to what a first convolutional layer learns
• It classifies each patch independently

Chapter 3: Position Embeddings

After linear projection, the transformer receives 196 patch embeddings. But it has no idea where they came from in the image. Patch 0 could be top-left or bottom-right -- the transformer treats sequences as unordered sets unless you add positional information.
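The "unordered set" claim can be demonstrated directly: without position information, shuffling the input tokens only shuffles the outputs. This is a single-head NumPy sketch with random stand-in weights; the function and variable names are illustrative.

```python
import numpy as np


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over tokens x of shape (N, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
n, d = 196, 16
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

perm = rng.permutation(n)  # shuffle the patch order
out = self_attention(x, wq, wk, wv)
out_shuffled = self_attention(x[perm], wq, wk, wv)

# Shuffling the inputs just shuffles the outputs: attention has no notion of position.
print(np.allclose(out[perm], out_shuffled))
```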

ViT adds learnable 1D position embeddings. One embedding vector per position, learned from scratch during training:

z₀ = [x_class; x₁E; x₂E; …; x_NE] + E_pos, E_pos ∈ R^{(N+1)×D}

The "+1" is for the CLS token (next chapter). Each position gets its own learned D-dimensional vector, added element-wise to the corresponding patch embedding.
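In array terms, building z₀ is one concatenation and one addition. This is a sketch with random stand-ins for the learned parameters; the 0.02 scale and the zero-initialized class token are our choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, patch_dim, D = 196, 768, 768

patches = rng.standard_normal((N, patch_dim))    # flattened patches x_1..x_N
E = rng.standard_normal((patch_dim, D)) * 0.02   # patch projection (stand-in)
x_class = np.zeros((1, D))                       # learnable class token (stand-in)
E_pos = rng.standard_normal((N + 1, D)) * 0.02   # one row per position, class token included

z0 = np.concatenate([x_class, patches @ E], axis=0) + E_pos
print(z0.shape)  # (197, 768)
```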

1D, not 2D -- and it works! You might expect that 2D position embeddings (encoding row and column separately) would help, since images have 2D structure. The paper tested this. No improvement. The model learns the 2D grid structure automatically from 1D positions. When you visualize the similarity between position embeddings, patches in the same row or column naturally cluster together. The model rediscovers 2D topology from pure data.

This is a key insight about inductive bias: the 2D structure of images is learnable, not something you need to hard-code. Given enough data, the model figures it out. This finding foreshadows a broader theme in deep learning -- less inductive bias, more data, better results.

Why learned, not sinusoidal? The original NLP transformer used fixed sinusoidal position encodings. ViT uses learned embeddings instead. Why? Sinusoidal encodings assume a linear sequence where position 5 is "between" positions 4 and 6. But image patches have 2D structure -- position 5 (row 0, col 5) is spatially adjacent not only to positions 4 and 6 but also to position 19 (row 1, col 5) in a 14×14 grid, a neighbor that a 1D ordering places 14 steps away. Learned embeddings can capture this 2D adjacency. The E_pos matrix is (197, 768) = 151,296 parameters -- tiny compared to the model's 86M total. Each position embedding is trained alongside the model, so positions that should be "close" develop similar vectors.
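The mismatch between 1D order and 2D adjacency is easy to make concrete. The sketch below uses 0-based patch indices that exclude the CLS token; the helper names are ours.

```python
GRID = 14  # 224 / 16 patches per side


def to_row_col(index, grid=GRID):
    """Convert a row-major patch index into (row, col)."""
    return divmod(index, grid)


def spatial_neighbors(index, grid=GRID):
    """Indices of patches that share an edge with `index` in the 2D grid."""
    row, col = to_row_col(index, grid)
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return sorted(r * grid + c for r, c in candidates
                  if 0 <= r < grid and 0 <= c < grid)


print(to_row_col(5))          # (0, 5)
print(spatial_neighbors(5))   # [4, 6, 19]
print(spatial_neighbors(13))  # [12, 27]
```

Index 13 sits at the end of the first row, so index 14 is adjacent in the 1D sequence but lies on the opposite edge of the image.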

There's a practical consequence of learned position embeddings: they are tied to the pre-training resolution. When fine-tuning at higher resolution than pre-training (say, 384×384 instead of 224×224), the number of patches increases. With 16×16 patches, you go from 196 to 576 tokens. The pre-trained position embeddings no longer match the sequence length and need to be interpolated. Because the 1D embeddings have learned 2D structure, you can reshape them to the patch grid and perform 2D bicubic interpolation. The paper finds this works well -- together with patch extraction, it is one of only two points where 2D image structure is manually injected.
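The resizing step can be sketched as follows. To stay dependency-free this version swaps in bilinear interpolation with aligned corners, not the bicubic interpolation mentioned above, so treat it as an illustration of the reshape-resize-flatten pattern, not of any particular implementation. The function names are ours.

```python
import numpy as np


def resize_grid_bilinear(grid, new_size):
    """Bilinearly resize an (S, S, D) grid to (new_size, new_size, D), aligning corners."""
    old_size = grid.shape[0]
    coords = np.linspace(0, old_size - 1, new_size)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_size - 1)
    frac = coords - lo
    # Interpolate along rows first, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    return rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]


def resize_pos_embeddings(pos, old_grid, new_grid):
    """pos is (1 + old_grid**2, D) with the CLS row first; returns (1 + new_grid**2, D)."""
    cls_row, patch_rows = pos[:1], pos[1:]
    grid = patch_rows.reshape(old_grid, old_grid, -1)
    resized = resize_grid_bilinear(grid, new_grid).reshape(new_grid * new_grid, -1)
    return np.concatenate([cls_row, resized], axis=0)  # CLS embedding is left untouched


pos_224 = np.random.default_rng(0).standard_normal((197, 768))
pos_384 = resize_pos_embeddings(pos_224, old_grid=14, new_grid=24)
print(pos_384.shape)  # (577, 768)
```

The CLS embedding has no spatial location, so it is carried over unchanged while only the 14×14 grid of patch embeddings is resized to 24×24.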

[Interactive figure: Position Embedding Similarity. Selecting a patch position highlights the positions with similar embeddings, showing that the model learns 2D structure from 1D indices.]

Why do 2D-aware position embeddings fail to improve over learned 1D embeddings?

• The model learns 2D spatial relationships automatically from the 1D position embeddings -- nearby patches develop similar embeddings, and row-column structure emerges without being hard-coded
• 2D embeddings use too much memory
• Images don't actually have 2D structure

Chapter 4: The CLS Token

We now have 196 patch embeddings with position information. But we need a single vector to classify the image. Which patch should represent the whole image? The top-left? The center? None of them -- each patch only sees a 16×16 region.

ViT borrows BERT's trick: prepend a special learnable [CLS] token to the sequence. This token has no visual content. It starts as a random vector and is trained end-to-end. Its job is to aggregate information from all patches through self-attention and serve as the image-level representation.

z₀ = [x_class; x₁E; x₂E; …; x_NE] + E_pos

After L layers of transformer processing, the CLS token's output z_L⁰ (layer L, token 0) has attended to every patch across every layer. It's the image's "summary." A classification head (a small MLP) maps this to class probabilities:

y = MLP(LN(z_L⁰))

CLS vs global average pooling: An alternative is to skip the CLS token and simply average all patch outputs (global average pooling, or GAP). The paper finds both work comparably when tuned properly. But CLS is the default because it mirrors the NLP convention and doesn't require special treatment of the spatial tokens.
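The two readouts differ by a single line. In the sketch below the encoder output is a random stand-in. For simplicity the GAP variant keeps a CLS row in the sequence and skips it when averaging; that is our assumption, and a GAP-only model could drop the CLS token entirely.

```python
import numpy as np


def cls_readout(tokens):
    """tokens: (1 + N, D) encoder output with the CLS token in row 0."""
    return tokens[0]


def gap_readout(tokens):
    """Average the N patch tokens, ignoring the CLS row."""
    return tokens[1:].mean(axis=0)


tokens = np.random.default_rng(0).standard_normal((197, 768))
print(cls_readout(tokens).shape, gap_readout(tokens).shape)  # (768,) (768,)
```

Either vector can then be passed through the final LayerNorm and the classification head.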

The CLS token is position 0 in the sequence. It gets its own position embedding. During training, it learns to be a "query" that pulls information from the patch tokens. During inference, it's the only token whose output matters for classification.

During pre-training (on JFT-300M or ImageNet-21k), the classification head is a small MLP with one hidden layer (hidden dim = 3072 for ViT-Base, GELU activation, tanh on the output). During fine-tuning on a downstream task, the pre-trained head is discarded and replaced with a single linear layer: (768,) → (K,) mapping from D dimensions to K classes. This is the only part of the model that changes shape between tasks.

What is frozen vs trained: During pre-training (JFT-300M): everything is trained from scratch -- patch projection E, position embeddings, all transformer layers, CLS token, and MLP head. During fine-tuning (ImageNet, at higher resolution): the entire backbone is fine-tuned (not frozen), and the pre-training MLP head is replaced. In this setup all layers are updated; the fine-tuning dataset (ImageNet, 1.3M images) is large enough that parameter-efficient fine-tuning (PEFT) tricks are not needed.

When fine-tuning at higher resolution (e.g., 512×512 instead of 224×224), the number of patches increases but the CLS token stays the same. It's just one more token in the sequence. The transformer handles variable-length sequences natively -- the only adjustment needed is interpolating the position embeddings, not restructuring the model.

What is the purpose of the CLS token in ViT?

• It carries positional information about the image center
• It detects edges in the image
• It's a learnable token that aggregates information from all patches through attention, producing a single image-level representation for classification

Chapter 5: Transformer Encoder

From here, ViT uses a completely standard transformer encoder. No modifications. No vision-specific layers. The same architecture that processes sentences now processes image patch sequences.

Each layer has two sub-blocks with pre-norm residual connections:

z'_ℓ = MSA(LN(z_{ℓ-1})) + z_{ℓ-1}
z_ℓ = MLP(LN(z'_ℓ)) + z'_ℓ

Multi-head self-attention (MSA) lets every patch attend to every other patch. The CLS token attends to all patches. Patch 195 in the bottom-right can attend to patch 0 in the top-left -- global context in one layer. This is what CNNs need dozens of layers to achieve.

The MLP block has two linear layers with GELU activation and a hidden dimension 4× the embedding dimension (e.g., 3072 for ViT-Base with D=768). The MLP processes each token independently -- it's the attention that mixes information across patches.
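The two equations above translate almost line for line into code. This is a NumPy sketch of one pre-norm encoder block with random stand-in weights. The parameter dictionary layout, the small demo sizes, and the tanh approximation of GELU are our assumptions for illustration.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def msa(x, p, heads):
    """Multi-head self-attention over tokens x of shape (N, D)."""
    n, d = x.shape
    hd = d // heads

    def split(t):  # (N, D) -> (heads, N, hd)
        return t.reshape(n, heads, hd).transpose(1, 0, 2)

    q, k, v = (split(x @ p[w] + p[b])
               for w, b in (("wq", "bq"), ("wk", "bk"), ("wv", "bv")))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd))  # (heads, N, N)
    merged = (attn @ v).transpose(1, 0, 2).reshape(n, d)    # concatenate heads
    return merged @ p["wo"] + p["bo"]


def mlp(x, p):
    return gelu(x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]


def encoder_block(z, p, heads):
    z = msa(layer_norm(z, p["ln1_g"], p["ln1_b"]), p, heads) + z  # z' = MSA(LN(z)) + z
    return mlp(layer_norm(z, p["ln2_g"], p["ln2_b"]), p) + z      # z  = MLP(LN(z')) + z'


def init_block(d, hidden, rng, scale=0.02):
    def w(*shape):
        return rng.standard_normal(shape) * scale

    return {
        "wq": w(d, d), "wk": w(d, d), "wv": w(d, d), "wo": w(d, d),
        "bq": np.zeros(d), "bk": np.zeros(d), "bv": np.zeros(d), "bo": np.zeros(d),
        "w1": w(d, hidden), "b1": np.zeros(hidden),
        "w2": w(hidden, d), "b2": np.zeros(d),
        "ln1_g": np.ones(d), "ln1_b": np.zeros(d),
        "ln2_g": np.ones(d), "ln2_b": np.zeros(d),
    }


rng = np.random.default_rng(0)
d, heads = 64, 4                      # small demo sizes; ViT-Base uses 768 and 12
params = init_block(d, 4 * d, rng)    # MLP hidden size is 4x the embedding size
z = rng.standard_normal((197, d))
out = encoder_block(z, params, heads)
print(out.shape)  # (197, 64)
```

Because both sub-blocks are residual, zeroing the output projections `wo` and `w2` turns the block into the identity.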

Note the pre-norm design: LayerNorm is applied before each sub-block, not after. This differs from the original Vaswani et al. transformer which uses post-norm. Pre-norm makes training more stable, especially at large scale. It's become the default in modern transformers.

One important detail: the self-attention layers are the only place information flows between patches. In a CNN, local information flows implicitly through overlapping receptive fields that gradually expand layer by layer. In ViT, every patch can talk to every other patch in every layer. This global connectivity is both ViT's superpower (long-range reasoning in one layer) and its weakness (no built-in locality bias, so it needs massive data to learn what's local and what's distant).

The deliberate simplicity: The paper makes a point of changing as little as possible from the NLP transformer. No multi-scale features, no windowed attention, no convolutional stems. The goal is to show that the standard transformer architecture is sufficient for vision -- all you need is data and scale.

Training and inference numbers:
• Pre-training ViT-B/16 on JFT-300M: batch size 4096, Adam optimizer (base lr=8e-4, warmup 10K steps, linear decay), trained for 7 epochs. The 300-epoch cosine schedule applies to the paper's ImageNet-only pre-training, and the ~2,500 TPUv3-core-days figure applies to ViT-H/14, not ViT-B/16.
• Fine-tuning on ImageNet: batch size 512, SGD with momentum 0.9, lr=0.003 with cosine decay, 20K steps (~8 epochs, since 20,000 × 512 ≈ 10.2M images seen). Resolution increased to 384×384 (576 patches). Takes ~1 day on a TPUv3 pod.
• Inference latency (ViT-B/16): ~4.4ms per image on a V100 GPU at 224×224. Throughput: ~230 images/sec.
• Comparison: ResNet-152 at similar accuracy takes 11.5ms/image. ViT is 2.6× faster at the same accuracy tier -- the quadratic attention cost is offset by highly parallelizable matrix multiplies.

Model	Layers	Hidden D	MLP size	Heads	Params
ViT-Base	12	768	3072	12	86M
ViT-Large	24	1024	4096	16	307M
ViT-Huge	32	1280	5120	16	632M
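The parameter counts in the table can be approximated from the other columns. The sketch below assumes biases on every linear layer, two LayerNorms per block plus a final one, a single linear 1000-class head, and a 224×224 input. Exact totals depend on what is counted (for example the head), so small differences from the table are expected.

```python
def vit_param_count(layers, dim, mlp_dim, patch, image=224, channels=3, classes=1000):
    """Count trainable parameters under the assumptions stated above."""
    tokens = (image // patch) ** 2 + 1                   # patches + CLS
    patch_embed = patch * patch * channels * dim + dim   # projection E plus bias
    attention = 4 * (dim * dim + dim)                    # Q, K, V, output projections
    mlp = dim * mlp_dim + mlp_dim + mlp_dim * dim + dim  # two linear layers
    norms = 2 * 2 * dim                                  # two LayerNorms per block
    block = attention + mlp + norms
    final_norm = 2 * dim
    head = dim * classes + classes
    cls_token = dim
    pos_embed = tokens * dim
    return patch_embed + cls_token + pos_embed + layers * block + final_norm + head


configs = {
    "ViT-Base/16": (12, 768, 3072, 16),
    "ViT-Large/16": (24, 1024, 4096, 16),
    "ViT-Huge/14": (32, 1280, 5120, 14),
}
for name, cfg in configs.items():
    print(f"{name}: {vit_param_count(*cfg) / 1e6:.1f}M parameters")
```

The number of heads does not appear in the formula because splitting D across heads leaves the size of the projection matrices unchanged.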

What is architecturally different between ViT's transformer and the original NLP transformer?

• ViT uses convolutional attention instead of standard attention
• ViT uses a decoder instead of an encoder
• Essentially nothing -- ViT deliberately uses a standard transformer encoder with no vision-specific modifications

Chapter 6: Data Hunger

Here is the twist that makes ViT interesting beyond just "transformers work on images." When trained on ImageNet alone (1.3M images), ViT-Large underperforms ViT-Base. The bigger model is worse. Why?

CNNs have strong inductive biases baked into their architecture: locality (each filter sees only a small patch), translation equivariance (the same filter detects a cat ear whether it's top-left or bottom-right), and hierarchical structure (low-level features compose into high-level ones). These biases act as a form of implicit regularization -- they tell the model how images work before it sees any data.

ViT has almost none of this. The only image-specific bias is the initial patch extraction. Everything else -- spatial relationships, locality, hierarchy -- must be learned from data. With limited data, the model overfits. It memorizes rather than generalizes.

The scaling law of ViT: On ImageNet (1.3M images), ResNet beats ViT. On ImageNet-21k (14M images), they're roughly equal. On JFT-300M (300M images), ViT dominates -- and at a fraction of the training cost. Large-scale training trumps inductive bias.

The paper also shows a remarkable efficiency result. ViT-H/14 pre-trained on JFT-300M reaches 88.55% ImageNet accuracy using 2,500 TPUv3-core-days. The comparable CNN (BiT-L, ResNet-152x4) uses 9,900 TPUv3-core-days for 87.54%. ViT is 4× more compute-efficient at the frontier -- it just needs the data to get there.

What degrades with insufficient data: On ImageNet-1K alone, ViT-Large (307M params) gets worse than ViT-Base (86M params): 76.5% vs 77.9%. The larger model memorizes -- training accuracy reaches 99%+ while validation stalls. This is classic overfitting from too many parameters relative to data. CNNs don't have this problem because their inductive bias (locality, weight sharing) acts as implicit regularization. On ImageNet-21K (14M images), the gap closes. On JFT-300M, ViT-Large dominates: 87.76% vs 85.49% for ViT-Base. The lesson: ViT's lack of inductive bias means it needs ~100× more data than a CNN to start outperforming.

The scaling study reveals a clean pattern: at the same compute budget, ViT consistently outperforms ResNets on transfer learning. The gap widens as compute increases. And unlike ResNets, ViT shows no sign of saturation -- performance keeps improving with more compute, suggesting even larger models would do even better. This observation anticipated the scaling law era.

For practitioners, the takeaway is clear. If you have a small dataset (< 1M images), use a CNN or a ViT pre-trained on a large dataset and fine-tuned. If you have hundreds of millions of images and the compute to match, ViT will give you better results per FLOP.

[Interactive figure: Data Size vs Accuracy. A dataset-size control (in millions of images, default 1) shows the CNN ahead on small data and ViT pulling ahead on large data.]

Why does ViT-Large underperform ViT-Base when trained only on ImageNet?

• Without CNN-like inductive biases, the larger model overfits on limited data -- it has more parameters but less implicit regularization, so it memorizes rather than generalizes
• ViT-Large has a bug in its implementation
• Larger models always perform worse

Chapter 7: Patch Attention Explorer

This is the full ViT pipeline in action. Below, you'll see an image split into 16×16 patches. Each patch is projected into an embedding, given a position, and passed through a single self-attention layer. Click any patch to see which other patches attend to it most strongly.

The attention patterns reveal what the model has learned. Early layers tend to show local attention -- nearby patches attend to each other, mimicking the locality of convolutions. Deeper layers show global patterns -- semantically related patches attend to each other regardless of distance. The CLS token (position 0) gradually learns to attend to the most informative patches.

What to look for: Click patches in different parts of the "image." Notice how attention is strongest for patches that share similar content (same color region). The model learns that spatially and semantically similar patches should communicate, without any architectural bias telling it to do so.

[Interactive figure: Interactive ViT Pipeline. Selecting a patch shows its attention pattern (brighter = stronger attention); a layer toggle shows how attention evolves.]

In the real ViT, this attention operation happens 12 times (for ViT-Base) with 12 heads per layer. Each head learns different attention patterns -- one head might focus on texture similarity, another on spatial proximity, a third on color. The heads' outputs are concatenated and projected, then the MLP processes each token independently before the next layer.

The paper analyzed real attention patterns and found a striking progression. In early layers, many attention heads show local behavior -- they attend primarily to nearby patches, effectively mimicking the local receptive fields of a CNN. In deeper layers, attention becomes global -- heads attend to semantically related patches regardless of distance. Some heads even specialize: the paper found heads that attend primarily along rows or columns, heads that attend to patches of similar color, and heads that attend to edges.

The attention distance metric: The authors measured the average distance (in patch-space) of each attention head's focus. In layer 1, most heads attend to patches within 5 positions. By layer 12, heads routinely attend across the entire 14×14 grid. This confirms that ViT learns the local-to-global hierarchy that CNNs have hard-coded -- it just needs data to discover it.
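A metric of this kind can be sketched in a few lines. The version below measures distance in patch units over a single attention matrix with the CLS token removed. Those choices, and the function name, are our simplifications and not the authors' exact procedure.

```python
import numpy as np


def mean_attention_distance(attn, grid):
    """attn: (N, N) attention weights over N = grid*grid patches, rows summing to 1.
    Returns the attention-weighted distance in patch units, averaged over queries."""
    rows, cols = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([rows, cols], axis=1).astype(float)               # (N, 2)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)   # (N, N)
    return float((attn * dist).sum(axis=1).mean())


grid = 14
n = grid * grid
local = np.eye(n)                    # each patch attends only to itself
uniform = np.full((n, n), 1.0 / n)   # each patch attends equally to all patches

print(mean_attention_distance(local, grid))    # 0.0
print(mean_attention_distance(uniform, grid))
```

A head with purely local attention scores near zero, while a head that spreads attention evenly over the grid scores several patch widths.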

This discovery has a profound implication. The hierarchical processing of visual information -- local edges composing into textures, textures into parts, parts into objects -- is not an artifact of the CNN architecture. It's a property of visual data itself. Give a sufficiently powerful model enough data, and it will rediscover this hierarchy on its own.

Chapter 8: Connections

ViT is a milestone paper, but it's also a starting point. Its deliberate simplicity -- standard transformer, no vision-specific tricks -- proved that the transformer architecture is universal. Everything that followed built on this foundation.

Method	Key Idea	Relation to ViT
DeiT (2021)	Knowledge distillation from CNNs	Makes ViT work with less data
Swin Transformer (2021)	Shifted window attention	Adds back locality for efficiency
MAE (2022)	Masked autoencoder pre-training	Self-supervised ViT pre-training
CLIP (2021)	Vision-language contrastive learning	Uses ViT as the image encoder
DINO (2021)	Self-distillation with no labels	ViT learns object segmentation without supervision

The deepest lesson of ViT is about the relationship between inductive bias and data. CNNs encode strong assumptions about images -- locality, translation equivariance, hierarchy. These assumptions help when data is scarce, but limit the model when data is abundant. ViT makes almost no assumptions, leaving the model free to discover whatever structure exists in the data. This requires more data but ultimately leads to better representations.

The paper also briefly explored self-supervised pre-training, predicting masked patches (like BERT's masked language modeling). The self-supervised ViT-B/16 reached 79.9% on ImageNet -- promising but below the supervised version. This planted the seed for MAE (He et al., 2022), which fully realized this idea and achieved state-of-the-art results with masked image modeling.

Perhaps most remarkably, DINO (Caron et al., 2021) showed that ViT trained with self-distillation (no labels at all) learns features that naturally segment objects -- the attention maps in the last layer cleanly separate foreground from background. The DINO authors report that this behavior does not emerge as clearly in supervised ViTs or in convolutional networks. Something about the global attention mechanism, combined with self-supervised training at sufficient scale, enables ViT to discover object boundaries as a side effect of representation learning.

The ViT Pipeline in Code

python
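# The ViT forward pass as a NumPy sketch (single image; the `params` layout is illustrative)
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm without the learnable scale and shift, for brevity
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def vit_forward(image, params, patch_size=16):
    H, W, C = image.shape
    P = patch_size

    # 1. Split into patches: (224,224,3) -> (196, 768)
    patches = (image.reshape(H // P, P, W // P, P, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, P * P * C))

    # 2. Linear projection: (196, 768) -> (196, D)
    embeddings = patches @ params["E"]  # E is (768, D)

    # 3. Prepend CLS token: (196, D) -> (197, D); cls_token is (1, D)
    embeddings = np.concatenate([params["cls_token"], embeddings], axis=0)

    # 4. Add position embeddings: (197, D) + (197, D)
    embeddings = embeddings + params["pos_embeddings"]

    # 5. Transformer encoder: L layers, each a callable (197, D) -> (197, D)
    for layer in params["transformer_layers"]:
        embeddings = layer(embeddings)

    # 6. Classification: normalize the CLS output, then the head -> class logits
    return params["mlp_head"](layer_norm(embeddings[0]))

That's it. The entire ViT forward pass fits in a few dozen lines. There is no pooling, no batch normalization, no convolutional layer, no multi-scale feature pyramid. Just patches, a linear projection, positional embeddings, and a standard transformer. The simplicity is the point.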

Key results from the paper: ViT-H/14 on JFT-300M achieves 88.55% ImageNet top-1 accuracy, 94.55% CIFAR-100, and 77.63% on the VTAB suite of 19 tasks -- matching or exceeding every CNN baseline at 4× less compute. Training configuration: 632M parameters, trained on 303M images for 14 epochs on Google's TPU infrastructure. Total compute estimated at ~2,500 TPUv3-core-days (~$125K at public cloud rates). For comparison, a single A100 GPU fine-tuning run on ImageNet takes ~1 day and costs ~$30.

"Large scale training trumps inductive bias."

— The central finding of ViT
