Introduction

For nearly a decade after AlexNet (Krizhevsky et al., 2012), the architecture of choice for visual recognition was the convolutional neural network. CNNs are built on a powerful inductive bias: locality. A 3×3 kernel can only see 9 pixels at a time, and the assumption is that neighboring pixels carry the most relevant information. This bias works well — until it doesn't.

The locality bottleneck manifests in three ways. First, building long-range dependencies requires stacking many layers — a ResNet-50 needs 50 layers for its deepest features to span the full image. Second, the effective receptive field of a CNN is much smaller than the theoretical one (Luo et al., 2016), meaning most neurons are heavily biased toward their spatial center. Third, the architecture is rigid: convolutions operate on fixed grids, making it awkward to handle varying resolutions or aspect ratios.

Meanwhile, in NLP, the Transformer (Vaswani et al., 2017) solved long-range dependencies trivially: self-attention lets every token attend to every other token in a single layer. The natural question was: can we do the same for images?

The answer came in October 2020 from a team at Google Brain. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, and colleagues published An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. The key insight was brutal in its simplicity: split an image into fixed-size patches, treat each patch as a "word," and apply a standard transformer encoder. No convolutions, no pooling, no feature pyramids. They called it the Vision Transformer (ViT).

The results were striking. When pre-trained on sufficiently large datasets (JFT-300M, with 300 million images), ViT matched or exceeded the state of the art on ImageNet, CIFAR-100, and VTAB, while being cheaper to train than comparable CNNs. The paper demonstrated that the inductive biases of convolutions are not necessary — they can be learned from data, provided you have enough of it.

This article builds the entire ViT from first principles. We derive every dimension, explain every design choice, and trace the lineage from the original ViT through DINOv2 and SigLIP — the vision encoders that power today's vision-language models.

Patch Embedding

A transformer operates on a sequence of vectors. A text transformer receives a sequence of token embeddings. A vision transformer needs the same thing: a sequence of vectors, each representing a chunk of the image. The patch embedding is the mechanism that converts a raw image tensor into this sequence.

Splitting into Patches

Start with an image of spatial dimensions H × W and C color channels (typically C=3 for RGB). Choose a patch size P. Divide the image into a non-overlapping grid of patches, each of size P × P pixels. The number of patches along the height is H/P, and along the width is W/P. The total number of patches is:

N = (H / P) × (W / P)

For the standard ViT-B/16 configuration: the input image is 224 × 224 pixels, P = 16. So:

N = (224 / 16) × (224 / 16) = 14 × 14 = 196 patches

For ViT-L/14: the input image is 224 × 224, P = 14. So:

N = (224 / 14) × (224 / 14) = 16 × 16 = 256 patches

Each patch is a small image of shape (P, P, C). Flatten it into a single vector of length P²·C. For ViT-B/16, each patch becomes a vector of 16² × 3 = 768 values. For ViT-L/14, each patch becomes 14² × 3 = 588 values.
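The patch counts and flattened dimensions above can be checked in a few lines:

```python
# Patch-count arithmetic for the two standard configurations at 224px input.
H = W = 224

for P, C in [(16, 3), (14, 3)]:
    n = (H // P) * (W // P)   # number of patches N
    flat = P * P * C          # flattened patch vector length
    print(f"P={P}: N={n}, flattened dim={flat}")
# P=16: N=196, flattened dim=768
# P=14: N=256, flattened dim=588
```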

Linear Projection

The flattened patch vector has dimension P²·C, but the transformer expects vectors of dimension D (the model's hidden dimension). A learned linear projection maps each flattened patch into the transformer's embedding space:

x_patch ∈ ℝ^(P²·C) ⟶ z = E · x_patch + b ∈ ℝ^D
where E ∈ ℝ^(D × P²·C), b ∈ ℝ^D

For ViT-B/16: D = 768, so the projection matrix E has shape (768, 768) — a coincidence arising from 16² × 3 = 768. For ViT-L/14: D = 1024, so E has shape (1024, 588). The full output of patch embedding is a matrix Z ∈ ℝ^(N×D).

Conv2d Equivalence

Here is a fact that surprises many practitioners: the patch embedding is exactly equivalent to a single convolutional layer with kernel size P and stride P. A Conv2d(in_channels=3, out_channels=D, kernel_size=P, stride=P) does the following: it slides a P × P kernel across the image with stride P (no overlap), and at each position it computes a dot product between the kernel weights and the patch pixels, producing D output values. The output spatial dimensions are H/P × W/P = 14 × 14 for ViT-B/16. Reshaping the output from (D, 14, 14) to (196, D) gives exactly the same result as flattening patches and multiplying by E.

ℹ Implementation Detail
In practice, every major ViT implementation uses Conv2d for the patch embedding rather than manually reshaping and projecting. It is numerically identical but computationally more efficient because the convolution kernel is optimized in CUDA. The original ViT paper describes the linear projection, but the code uses Conv2d.
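The Conv2d equivalence is easy to verify numerically. The sketch below cuts patches by hand, applies the convolution's own weights as a linear projection, and compares against the Conv2d output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
P, C, D = 16, 3, 768
img = torch.randn(1, C, 224, 224)

conv = nn.Conv2d(C, D, kernel_size=P, stride=P)

# Path 1: Conv2d, then flatten the 14x14 grid into 196 tokens.
out_conv = conv(img).flatten(2).transpose(1, 2)        # (1, 196, 768)

# Path 2: manually cut patches, flatten, and apply the same weights.
patches = img.unfold(2, P, P).unfold(3, P, P)          # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, C * P * P)
E = conv.weight.reshape(D, C * P * P)                  # (768, 768)
out_manual = patches @ E.T + conv.bias

print(torch.allclose(out_conv, out_manual, atol=1e-4))  # True
```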
💡 Key Insight
The patch embedding is the only operation in ViT that is spatially local. Everything after it — all attention layers — operates globally across all patches. This means ViT has minimal spatial inductive bias. The model must learn spatial relationships from data rather than having them hardcoded by architecture.

The [CLS] Token

Why It Exists

After patch embedding, we have N patch tokens, each representing a spatial region of the image. For classification, we need a single vector that summarizes the entire image. BERT (Devlin et al., 2019) solved an analogous problem in NLP by prepending a special [CLS] token to the input sequence. ViT adopts the same strategy.

A learnable vector x_cls ∈ ℝ^D is prepended to the sequence of patch embeddings, making the total sequence length N + 1. During training, this token has no spatial bias — it doesn't correspond to any particular image region. Through the self-attention layers, it aggregates information from all patches. At the final layer, the [CLS] token's output is used as the image representation for classification.

z_0 = [x_cls ; z_1 ; z_2 ; … ; z_N] + E_pos
Sequence length: N + 1 = 197 for ViT-B/16, 257 for ViT-L/14

The [CLS] token is randomly initialized and learned during training. Its initial value has no semantic content — all its information comes from attending to the patch tokens across multiple transformer layers.

Mean Pooling Alternative

An alternative to [CLS] is global average pooling (GAP): simply average all N patch token outputs at the final layer. This was standard in CNNs (e.g., after ResNet's last convolutional block). Some ViT variants found that GAP performs comparably to [CLS] and sometimes better, particularly when the model is trained with sufficient data.

DINOv2 (Oquab et al., 2024) uses both: the [CLS] token output and the mean of all patch tokens are concatenated or used separately depending on the downstream task. The [CLS] token captures a holistic summary, while mean-pooled patch tokens preserve more fine-grained spatial information. For dense prediction tasks (segmentation, depth estimation), the individual patch tokens are used directly, since each one retains correspondence to a spatial region.
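A minimal sketch of the three pooling options — [CLS] only, mean pooling, and the DINOv2-style combination (the shapes are for ViT-B/16; the concatenation is one illustrative way to combine the two):

```python
import torch

# tokens: transformer output, (batch, 1 + N, D), with the [CLS] token first.
tokens = torch.randn(2, 197, 768)

cls_feat = tokens[:, 0]                          # (2, 768) — holistic summary
gap_feat = tokens[:, 1:].mean(dim=1)             # (2, 768) — mean of patch tokens
both = torch.cat([cls_feat, gap_feat], dim=-1)   # (2, 1536) — combined feature
```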

Positional Embeddings

Self-attention is permutation-equivariant: if you shuffle the input tokens, the output tokens are shuffled in exactly the same way. The attention operation itself has no notion of order or position. This means that without positional information, a ViT cannot distinguish between a patch in the top-left corner and one in the bottom-right — it treats the sequence as a bag of patches.

To inject spatial information, positional embeddings are added to the patch embeddings before they enter the transformer. Each position i in the sequence gets an embedding e_pos(i) ∈ ℝ^D, which is added element-wise to the token at that position.

Learned 1D Positional Embeddings

The original ViT uses the simplest approach: a learnable lookup table. Create a parameter matrix E_pos ∈ ℝ^((N+1) × D) (one row for each position, including the [CLS] token). These embeddings are randomly initialized and trained with the rest of the model via backpropagation.

Dosovitskiy et al. (2020) found that learned 1D positional embeddings work just as well as more sophisticated 2D-aware alternatives. The model learns to encode 2D spatial structure in the 1D position embeddings: visualizing the cosine similarity between position embeddings reveals a clear 2D grid pattern, showing the model discovers the row/column structure of patches on its own.

2D Sinusoidal Positional Embeddings

An alternative is to use fixed sinusoidal functions, extended to 2D. For a patch at grid position (r, c), encode the row and column separately using sine and cosine functions at different frequencies, then concatenate:

e_pos(r, c) = [sin(ω_0 r), cos(ω_0 r), …, sin(ω_0 c), cos(ω_0 c), …]    with ω_i = 1 / 10000^(2i/d)

where d is the number of channels allotted to each axis (half the embedding dimension).

This provides the model with explicit 2D spatial information without any learned parameters. The ViT paper found no significant advantage over learned 1D embeddings, but 2D sinusoidal encodings are used in some later works (e.g., MAE by He et al., 2022).
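A sketch of the 2D sin-cos construction (the function name and channel split are illustrative; MAE's reference code follows the same half-for-rows, half-for-columns layout):

```python
import torch

def sincos_2d(grid_size, dim):
    """Fixed 2D sin-cos positional embeddings (MAE-style sketch).

    Half the channels encode the row index, half the column index.
    """
    assert dim % 4 == 0
    d = dim // 2                                             # channels per axis
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2) / d))     # ω_i = 1/10000^(2i/d)
    pos = torch.arange(grid_size).float()

    def encode(p):                                           # (grid_size, d)
        angles = p[:, None] * freqs[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    row, col = encode(pos), encode(pos)
    # Broadcast row/column codes over the grid, then concatenate per patch.
    r = row[:, None, :].expand(grid_size, grid_size, d)
    c = col[None, :, :].expand(grid_size, grid_size, d)
    return torch.cat([r, c], dim=-1).reshape(grid_size * grid_size, dim)

pe = sincos_2d(14, 768)
print(pe.shape)   # torch.Size([196, 768])
```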

Resolution Interpolation

A practical problem: if you train ViT at 224 × 224 (196 patches with P=16) but want to fine-tune at 384 × 384 (576 patches), the positional embedding table has the wrong number of entries. The solution is bicubic interpolation of the positional embeddings.

Reshape the 1D position embeddings back to a 2D grid (14 × 14 for the original resolution), apply 2D bicubic interpolation to upsample to the new grid size (24 × 24 for 384 × 384 resolution), then flatten back to 1D. The [CLS] token's positional embedding is kept unchanged. This works surprisingly well in practice, though there is typically a brief dip in performance at the start of fine-tuning before the model adapts to the interpolated positions.

ℹ Why Interpolation Works
Positional embeddings learned at 14 × 14 resolution encode smooth spatial relationships. Bicubic interpolation preserves this smoothness. The semantic content of "position (5, 5) relative to position (5, 6)" doesn't fundamentally change when the grid is finer — the spatial relationship is the same, just sampled at higher resolution.
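The reshape-interpolate-flatten procedure can be sketched directly (the function name is ours; the steps mirror what common ViT codebases do when changing resolution):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """Resize learned position embeddings to a new patch grid.

    pos_embed: (1, 1 + old_grid**2, D), with the [CLS] embedding first.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pe.shape[-1]
    # Reshape 1D sequence back to its 2D grid, channels first.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    # Flatten back to a 1D sequence and re-attach the unchanged [CLS] entry.
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pe, patch_pe], dim=1)

pe = torch.randn(1, 197, 768)                    # trained at 224px (14x14 grid)
print(interpolate_pos_embed(pe).shape)           # torch.Size([1, 577, 768]) — 384px
```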

Self-Attention on Patches

Self-attention is the core operation that gives ViT its power. Every patch can directly attend to every other patch in a single layer — no stacking required. Let us derive the full mechanism from scratch.

Q, K, V Derivation

Let X ∈ ℝ^((N+1) × D) be the input sequence (N patch tokens plus the [CLS] token, each of dimension D). Three learned weight matrices project X into queries, keys, and values:

Q = X W_Q ,    K = X W_K ,    V = X W_V
where W_Q, W_K, W_V ∈ ℝ^(D × D)

Q, K, and V each have shape (N+1) × D. The query at position i asks "what am I looking for?". The key at position j announces "here is what I contain." The value at position j holds "here is the information I'll contribute if attended to." The attention weight between positions i and j is the dot product of query i with key j:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

The matrix Q Kᵀ has shape (N+1) × (N+1), where entry (i, j) is the raw attention score between tokens i and j. The division by √d_k is critical: without it, the dot products grow in magnitude with the dimension d_k, pushing the softmax into regions of extremely small gradients. Dividing by √d_k keeps the variance of the dot products approximately 1, regardless of d_k.

The softmax normalizes each row to sum to 1, producing a probability distribution over all positions for each query. Multiplying by V computes a weighted average of the value vectors, where the weights are the attention probabilities.
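The whole mechanism fits in a few lines. A minimal single-head sketch (random weights, ViT-B/16 shapes):

```python
import torch

def attention(Q, K, V):
    """Single-head scaled dot-product attention, as in the equation above."""
    dk = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / dk ** 0.5   # (N+1, N+1) raw scores
    weights = scores.softmax(dim=-1)               # each row sums to 1
    return weights @ V, weights                    # weighted average of values

X = torch.randn(197, 768)                          # 196 patches + [CLS]
WQ, WK, WV = (torch.randn(768, 768) * 0.02 for _ in range(3))
out, w = attention(X @ WQ, X @ WK, X @ WV)
print(out.shape, w.shape)   # torch.Size([197, 768]) torch.Size([197, 197])
```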

Multi-Head Attention

A single attention head can only capture one type of relationship at a time. Multi-head attention runs h attention heads in parallel, each with its own Q, K, V projections, operating on a subspace of dimension d_k = D / h.

head_i = Attention(X W_Q^i, X W_K^i, X W_V^i)
where W_Q^i, W_K^i, W_V^i ∈ ℝ^(D × d_k)

MultiHead(X) = Concat(head_1, …, head_h) W_O
where W_O ∈ ℝ^(D × D)

For ViT-B/16: D = 768, h = 12 heads, d_k = 768 / 12 = 64. Each head operates on 64-dimensional queries, keys, and values. The outputs of all 12 heads are concatenated (back to dimension 768) and projected through W_O.

Different heads learn to attend to different things. In ViT, some heads learn local attention patterns (nearby patches), some learn global patterns (distant patches), and some specialize in attending to the [CLS] token. This diversity of attention patterns is analogous to having multiple convolutional kernels at different receptive fields, but learned rather than hardcoded.
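In practice the h heads are computed with one projection and a reshape rather than h separate matrix multiplies. A sketch with random weights and ViT-B/16 shapes:

```python
import torch

B, N, D, h = 1, 197, 768, 12
dk = D // h                                        # 64
x = torch.randn(B, N, D)
Wq, Wk, Wv, Wo = (torch.randn(D, D) * 0.02 for _ in range(4))

# Project once, then split the channel dim into h heads of size dk.
q = (x @ Wq).reshape(B, N, h, dk).transpose(1, 2)  # (B, h, N, dk)
k = (x @ Wk).reshape(B, N, h, dk).transpose(1, 2)
v = (x @ Wv).reshape(B, N, h, dk).transpose(1, 2)

attn = (q @ k.transpose(-2, -1) / dk ** 0.5).softmax(-1)   # (B, h, N, N)
out = (attn @ v).transpose(1, 2).reshape(B, N, D) @ Wo     # concat + project
print(out.shape)   # torch.Size([1, 197, 768])
```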

Computational Cost

The dominant cost in self-attention is computing Q Kᵀ — a product of matrices of shape (N+1, d_k) and (d_k, N+1) — giving O(N² d_k) operations per head, or O(N² D) across all heads (since h × d_k = D). The memory cost for storing the attention matrix is O(N²) per head.

Attention cost: O(N² · D)    where N = number of patches

For ViT-B/16 (N = 196): the attention matrix is 197 × 197 ≈ 39K entries per head, times 12 heads = ~466K entries. This is manageable. For ViT-L/14 (N = 256): 257 × 257 ≈ 66K per head, times 16 heads = ~1.06M entries. Still tractable.

But the quadratic scaling with N is why ViTs struggle with high-resolution images. At resolution 1024 × 1024 with P=16, N = 4096 patches, and the attention matrix has ~16.8M entries per head. This is the fundamental bottleneck that motivates efficient attention variants (Swin Transformer, etc.), though the standard ViT we describe here uses full attention.

For comparison, a convolutional layer with kernel size K on a feature map of size H × W with C channels costs O(K² · C · H · W). The key difference: the CNN cost is linear in spatial resolution (H × W) but the receptive field is limited to K. The ViT cost is quadratic in spatial resolution but the receptive field is global.

💡 The Fundamental Trade-off
CNNs: O(K² · C · HW) — linear in spatial size, local receptive field.
ViTs: O(N² · D) — quadratic in spatial size, global receptive field.
This is not a flaw of either architecture but a trade-off. ViTs pay a quadratic cost to get global attention in every layer. CNNs pay a linear cost but need many layers for global context.
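The quadratic blow-up is easy to tabulate (P = 16, ignoring the [CLS] token):

```python
# Attention-matrix growth with input resolution for patch size 16.
for res in [224, 384, 1024]:
    N = (res // 16) ** 2
    print(f"{res}px: N = {N}, attention entries per head = {N * N:,}")
# 224px: N = 196, attention entries per head = 38,416
# 384px: N = 576, attention entries per head = 331,776
# 1024px: N = 4096, attention entries per head = 16,777,216
```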

ViT Architecture

With patch embedding, positional encoding, and self-attention defined, we can now assemble the complete Vision Transformer. The architecture follows the standard transformer encoder from Vaswani et al. (2017) with minor modifications.

The Transformer Block

Each transformer block applies two sub-layers with residual connections and pre-normalization (LayerNorm before the operation, not after — this is the "Pre-LN" variant used by ViT, which is more stable during training than Post-LN):

y = x + MHSA(LayerNorm(x))
z = y + MLP(LayerNorm(y))

The residual connections are critical. Without them, gradients must flow through the attention and MLP operations at every layer, making deep networks hard to train. With residuals, gradients have a direct path through the network (the "skip connection highway"), and each block only needs to learn a correction to the identity mapping.

LayerNorm normalizes each token independently: for a vector x ∈ ℝ^D, compute the mean and variance across its D dimensions, then normalize and apply a learnable scale and shift:

LayerNorm(x) = γ ⊙ (x − μ) / √(σ² + ε) + β
where μ = mean(x), σ² = var(x), and γ, β ∈ ℝ^D are learned
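The formula can be checked against PyTorch's implementation (at initialization γ = 1 and β = 0, so only the normalization remains; note the biased variance):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(197, 768)

ln = nn.LayerNorm(768)                      # γ = 1, β = 0 at init
manual = (x - x.mean(-1, keepdim=True)) / torch.sqrt(
    x.var(-1, keepdim=True, unbiased=False) + ln.eps)

print(torch.allclose(ln(x), manual, atol=1e-5))   # True
```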

The MLP Block

The MLP (sometimes called the Feed-Forward Network, FFN) consists of two linear layers with a GELU nonlinearity in between. The first layer expands the dimension by a factor of 4 (the "expansion ratio"), and the second projects back:

MLP(x) = W_2 · GELU(W_1 · x + b_1) + b_2
where W_1 ∈ ℝ^(4D × D), W_2 ∈ ℝ^(D × 4D)

For ViT-B: the MLP maps 768 → 3072 → 768. For ViT-L: 1024 → 4096 → 1024. The expansion to 4× the dimension gives the model a wider space in which to compute nonlinear transformations before projecting back to the residual stream.

GELU (Gaussian Error Linear Unit, Hendrycks & Gimpel, 2016) is defined as:

GELU(x) = x · Φ(x) ≈ x · σ(1.702x)

where Φ is the CDF of the standard normal distribution. Unlike ReLU, GELU is smooth everywhere and allows small negative gradients, which empirically helps training stability.
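The sigmoid approximation is close enough for most purposes; a quick numerical comparison:

```python
import torch

x = torch.linspace(-4, 4, 101)
exact = torch.nn.functional.gelu(x)        # x · Φ(x)
approx = x * torch.sigmoid(1.702 * x)      # sigmoid approximation

print(f"max abs error: {(exact - approx).abs().max():.4f}")
```

The maximum deviation is on the order of 10⁻², negligible relative to typical activation magnitudes.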

Full Stack

The complete ViT forward pass:

  1. Patch Embed: Image ∈ ℝ^(H×W×3) → Z ∈ ℝ^(N×D) via Conv2d(3, D, P, P)
  2. Prepend [CLS]: Z → [x_cls ; Z] ∈ ℝ^((N+1)×D)
  3. Add Position: Z = Z + E_pos
  4. Transformer Blocks: Repeat L times: LayerNorm → MHSA → Residual → LayerNorm → MLP → Residual
  5. Final LayerNorm: Apply LayerNorm to the output
  6. Extract [CLS]: Take the first token as the image representation
  7. Classification Head: Linear layer maps D → num_classes
| Component | ViT-B/16 | ViT-L/14 |
| --- | --- | --- |
| Patch size P | 16 | 14 |
| Number of patches N | 196 (14×14) | 256 (16×16) |
| Hidden dim D | 768 | 1024 |
| Transformer layers L | 12 | 24 |
| Attention heads h | 12 | 16 |
| Head dim d_k | 64 | 64 |
| MLP dim | 3072 | 4096 |
| Parameters | 86M | 304M |
| Sequence length (with [CLS]) | 197 | 257 |

DINOv2

The original ViT was trained with supervised classification on labeled datasets. But the most powerful vision encoders for VLMs don't use labels at all. DINOv2 (Oquab et al., 2024) is a self-supervised Vision Transformer that produces features with remarkable quality — without seeing a single text label during training.

Self-Distillation with No Labels

DINO (Caron et al., 2021) introduced self-distillation with no labels (hence the acronym). The idea: maintain two copies of the network — a student and a teacher. Both see the same image under different augmentations (crops, color jitter, etc.). The student is trained to match the teacher's output distribution. The teacher receives no gradients — instead, its weights are an exponential moving average (EMA) of the student's weights:

θ_teacher ← λ · θ_teacher + (1 − λ) · θ_student
where λ follows a cosine schedule from 0.996 to 1.0

The training objective is cross-entropy between the teacher's and student's output distributions. Crucially, the student sees local crops (small regions, 96×96) while the teacher sees global crops (large regions, 224×224). This forces the student to infer global context from local patches — a powerful learning signal.

A centering and sharpening mechanism prevents mode collapse (the trivial solution where both networks output the same constant). The teacher's output is centered by subtracting a running mean, and sharpened by using a low temperature in the softmax.
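The pieces above fit together as follows. This is a hedged sketch of the DINO objective, not the paper's exact code: the temperatures (t_s = 0.1, t_t = 0.04) and prototype dimension (65536) match the paper's defaults, but the batch, centering, and models are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, t_s=0.1, t_t=0.04):
    # Teacher: center (subtract running mean), then sharpen (low temperature).
    t = F.softmax((teacher_logits - center) / t_t, dim=-1).detach()
    s = F.log_softmax(student_logits / t_s, dim=-1)
    return -(t * s).sum(-1).mean()          # cross-entropy H(teacher, student)

@torch.no_grad()
def ema_update(teacher, student, lam=0.996):
    # θ_teacher ← λ·θ_teacher + (1−λ)·θ_student
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(lam).add_(ps, alpha=1 - lam)

K = 65536                                   # DINO's output prototype dimension
s_out = torch.randn(8, K)                   # student outputs (local crops)
t_out = torch.randn(8, K)                   # teacher outputs (global crops)
loss = dino_loss(s_out, t_out, center=t_out.mean(0))

teacher, student = nn.Linear(16, K), nn.Linear(16, K)
ema_update(teacher, student)
print(f"loss = {loss.item():.3f}")
```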

DINOv2 extends DINO with several improvements:

  • Data: The LVD-142M dataset, a curated collection of 142 million images retrieved and deduplicated from a pool of 1.2 billion web images. No manual labeling.
  • Architecture: ViT-g/14 (the "giant" variant) with 1.1 billion parameters, D = 1536, L = 40 layers, 24 attention heads.
  • Combined objective: Both the DINO self-distillation loss and an iBOT masked image modeling loss (predicting masked patch tokens) are used simultaneously.
  • Distillation cascade: After training the ViT-g, smaller models (ViT-S, ViT-B, ViT-L) are trained via distillation from the ViT-g teacher, transferring its representation quality to efficient models.

Feature Quality

DINOv2 features exhibit remarkable spatial correspondence: the patch tokens from two different images of the same object category have similar features at corresponding spatial locations. For example, the patch token at the "left eye" position of one cat image is similar to the "left eye" token of another cat image — even without any supervision telling the model what an eye is.

This emerges because the self-distillation objective forces the model to build representations that are invariant to viewpoint and appearance changes (captured by the different augmentations) while preserving spatial structure (because the student must reconstruct global context from local crops). The result is a feature space where semantic similarity aligns with spatial correspondence.

DINOv2 features achieve state-of-the-art performance on dense prediction tasks (monocular depth, semantic segmentation) with simple linear probes — a single learned linear layer on top of the frozen features. This indicates the features already encode rich spatial and semantic information without any task-specific training.

SigLIP

While DINOv2 trains with vision-only self-supervision, another family of vision encoders learns from image-text pairs. CLIP (Radford et al., 2021) pioneered contrastive language-image pretraining: given a batch of (image, text) pairs, learn an image encoder and a text encoder such that matching pairs have high similarity and non-matching pairs have low similarity. SigLIP (Zhai et al., 2023) improves CLIP's training loss to enable better scaling.

Sigmoid vs. Softmax Contrastive Loss

CLIP uses a softmax-based contrastive loss (InfoNCE). For a batch of B image-text pairs, compute the B × B similarity matrix between all image and text embeddings. The loss treats each row (and column) as a classification problem: the correct pair should have the highest similarity.

L_CLIP = −1/(2B) Σ_i [ log( exp(sim(I_i, T_i)/τ) / Σ_j exp(sim(I_i, T_j)/τ) ) + log( exp(sim(I_i, T_i)/τ) / Σ_j exp(sim(I_j, T_i)/τ) ) ]

The softmax couples all B samples in the denominator, requiring them to be computed on the same device. This limits effective batch sizes and requires complex multi-device synchronization.

SigLIP replaces softmax with sigmoid. Each (image, text) pair is treated independently as a binary classification: is this pair a match or not?

L_SigLIP = −Σ_{i,j} [ y_ij log σ(z_ij) + (1 − y_ij) log(1 − σ(z_ij)) ]
where z_ij = t · sim(I_i, T_j) + b,   y_ij = 1 if i = j else 0,
and t (temperature) and b (bias) are learned scalars

Because each pair is evaluated independently (no softmax denominator coupling the batch), the loss decomposes naturally across devices. This allows much larger effective batch sizes — SigLIP was trained with batch sizes up to 32,768 image-text pairs. Larger batches provide more negative examples per step, improving representation quality.
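Both losses can be written in a few lines, which makes the structural difference concrete. A sketch with random embeddings (τ = 0.07 is CLIP's typical initialization; t = 10, b = −10 match SigLIP's initialization, though here they are fixed rather than learned):

```python
import torch
import torch.nn.functional as F

def clip_loss(img, txt, tau=0.07):
    """Softmax (InfoNCE): each row and column is a B-way classification."""
    logits = img @ txt.T / tau                        # (B, B) similarity matrix
    labels = torch.arange(img.shape[0])               # matching pairs on diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Sigmoid: every (i, j) pair is an independent binary classification."""
    logits = img @ txt.T * t + b                      # (B, B)
    labels = torch.eye(img.shape[0])                  # 1 on the diagonal, else 0
    return F.binary_cross_entropy_with_logits(logits, labels)

B, D = 8, 768
img = F.normalize(torch.randn(B, D), dim=-1)          # unit-norm image embeddings
txt = F.normalize(torch.randn(B, D), dim=-1)          # unit-norm text embeddings
print(clip_loss(img, txt).item(), siglip_loss(img, txt).item())
```

Note that `siglip_loss` has no term coupling different rows, which is exactly why it decomposes cleanly across devices.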

SigLIP in Practice

SigLIP-SO400M ("Shape-Optimized, 400M parameters") uses a ViT-SO400M architecture: a ViT with modified hidden dimensions and head counts optimized via architecture search. It is trained on the WebLI dataset with roughly 4 billion image-text pairs. The resulting vision encoder achieves ImageNet zero-shot accuracy competitive with much larger CLIP models.

SigLIP is the vision encoder used in PaLI-3 (Chen et al., 2023) and OpenVLA (Kim et al., 2024). In OpenVLA, the SigLIP vision encoder processes the robot's camera image: the image passes through the ViT, the patch tokens (or [CLS] token) are projected into the language model's embedding space, and the language model conditions on these visual tokens to generate robot actions. The quality of the vision encoder directly determines how well the VLA understands the visual scene.

ℹ SigLIP vs. CLIP in VLMs
SigLIP is preferred over CLIP in many recent VLMs for three reasons: (1) the sigmoid loss produces slightly better features on average, (2) training is simpler to scale across many devices, and (3) the SO400M architecture is more parameter-efficient than CLIP's ViT-L. PaLI-3 and PaliGemma both adopted SigLIP as their vision encoder.

Scaling ViTs

Model Variants

The ViT family follows a naming convention: ViT-{size}/{patch_size}. The size controls the hidden dimension, number of layers, and number of heads. The patch size controls the sequence length (and thus the resolution of the visual representation).

| Model | Layers (L) | Hidden Dim (D) | Heads (h) | MLP Dim | Params | Patches (224px) |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-S/16 | 12 | 384 | 6 | 1536 | 22M | 196 |
| ViT-B/16 | 12 | 768 | 12 | 3072 | 86M | 196 |
| ViT-L/14 | 24 | 1024 | 16 | 4096 | 304M | 256 |
| ViT-H/14 | 32 | 1280 | 16 | 5120 | 632M | 256 |
| ViT-g/14 | 40 | 1536 | 24 | 6144 | 1.1B | 256 |

Smaller patch sizes (e.g., /14 vs. /16) create more patches, giving finer spatial resolution at the cost of longer sequences and quadratically higher attention cost. The trend in modern VLMs is to use /14 patch size for the best trade-off between resolution and cost.

Data Requirements

A critical finding of the original ViT paper: ViTs need more training data than CNNs. When trained on ImageNet-1K alone (1.28M images), ViT-B underperforms a comparable ResNet. On ImageNet-21K (14M images), they are roughly equal. On JFT-300M (300M images), ViT-B surpasses the best ResNets.

The reason is the inductive bias gap. Convolutions hardcode locality and translation equivariance, giving CNNs a strong prior that compensates for limited data. ViTs lack these biases and must learn spatial structure from examples. With insufficient data, the model overfits or fails to discover useful spatial patterns. With sufficient data, the lack of inductive bias becomes an advantage: the model can learn representations unconstrained by the architecture's assumptions.

Subsequent work has partially addressed this data hunger:

  • DeiT (Touvron et al., 2021): Showed ViTs can match CNNs on ImageNet-1K alone with strong data augmentation (RandAugment, CutMix, Mixup) and regularization (stochastic depth, repeated augmentation).
  • MAE (He et al., 2022): Masked autoencoder pre-training — mask 75% of patches and train the model to reconstruct them. This self-supervised approach produces excellent features from ImageNet alone.
  • LAION-5B (Schuhmann et al., 2022): An open-source dataset of 5.8 billion image-text pairs, enabling CLIP/SigLIP-style training at massive scale.

The scaling law for ViTs approximately follows: performance improves log-linearly with both model size and dataset size. Doubling the model parameters requires roughly 4× the data to fully utilize the increased capacity. This is analogous to the Chinchilla scaling laws (Hoffmann et al., 2022) observed in LLMs, though the exact constants differ for vision.

💡 Why This Matters for VLMs
The vision encoder in a VLM determines the quality of the visual tokens that the language model sees. A ViT-L/14 pretrained on billions of image-text pairs (SigLIP) or 142M curated images (DINOv2) produces tokens that encode rich semantic and spatial information. The language model then reasons over these tokens exactly as it reasons over word tokens. Better vision encoder means better grounding, better spatial reasoning, and better visual question answering.

Code Examples

Patch Embedding from Scratch

Python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Convert image to sequence of patch embeddings.

    Equivalent to: flatten P×P patches, then linear project.
    Implemented as Conv2d for efficiency.
    """
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 196 for ViT-B/16
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size  # no overlap
        )

    def forward(self, x):
        # x: (B, 3, 224, 224)
        x = self.proj(x)          # (B, 768, 14, 14)
        x = x.flatten(2)          # (B, 768, 196)
        x = x.transpose(1, 2)    # (B, 196, 768)
        return x

# Verify shapes
embed = PatchEmbedding()
img = torch.randn(1, 3, 224, 224)
patches = embed(img)
print(f"Patches shape: {patches.shape}")  # (1, 196, 768)

ViT Transformer Block

Python
class ViTBlock(nn.Module):
    """Single ViT transformer block with Pre-LN."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0, drop=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),   # 768 -> 3072
            nn.GELU(),
            nn.Dropout(drop),
            nn.Linear(int(dim * mlp_ratio), dim),   # 3072 -> 768
            nn.Dropout(drop),
        )

    def forward(self, x):
        # Pre-LN: normalize before attention
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)     # self-attention: Q=K=V=h
        x = x + h                      # residual connection

        # Pre-LN: normalize before MLP
        h = self.norm2(x)
        h = self.mlp(h)
        x = x + h                      # residual connection
        return x

block = ViTBlock()
tokens = torch.randn(1, 197, 768)      # 196 patches + 1 [CLS]
out = block(tokens)
print(f"Output shape: {out.shape}")     # (1, 197, 768)

Full ViT Forward Pass

Python
class ViT(nn.Module):
    """Vision Transformer (ViT-B/16 by default)."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 num_classes=1000, embed_dim=768, depth=12,
                 num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        num_patches = self.patch_embed.num_patches

        # Learnable [CLS] token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        nn.init.trunc_normal_(self.cls_token, std=0.02)

        # Learnable positional embeddings (1D)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

        # Stack of transformer blocks
        self.blocks = nn.Sequential(*[
            ViTBlock(embed_dim, num_heads, mlp_ratio) for _ in range(depth)
        ])

        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]

        # 1. Patch embedding
        x = self.patch_embed(x)                           # (B, 196, 768)

        # 2. Prepend [CLS] token
        cls = self.cls_token.expand(B, -1, -1)            # (B, 1, 768)
        x = torch.cat([cls, x], dim=1)                    # (B, 197, 768)

        # 3. Add positional embeddings
        x = x + self.pos_embed                            # (B, 197, 768)

        # 4. Transformer blocks
        x = self.blocks(x)                                # (B, 197, 768)

        # 5. Final LayerNorm
        x = self.norm(x)

        # 6. Extract [CLS] token output
        cls_out = x[:, 0]                                 # (B, 768)

        # 7. Classification head
        logits = self.head(cls_out)                       # (B, 1000)
        return logits

model = ViT()
img = torch.randn(2, 3, 224, 224)
logits = model(img)
print(f"Logits shape: {logits.shape}")                    # (2, 1000)
print(f"Total params: {sum(p.numel() for p in model.parameters()):,}")  # ~86M

Feature Extraction with HuggingFace

Python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# === DINOv2 Feature Extraction ===
dino_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino_model = AutoModel.from_pretrained("facebook/dinov2-base")

image = Image.open("example.jpg")
inputs = dino_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = dino_model(**inputs)
    cls_token = outputs.last_hidden_state[:, 0]           # (1, 768) — global
    patch_tokens = outputs.last_hidden_state[:, 1:]       # (1, 256, 768) — spatial
    print(f"DINOv2 CLS: {cls_token.shape}")
    print(f"DINOv2 patches: {patch_tokens.shape}")

# === SigLIP Feature Extraction ===
from transformers import AutoModel, AutoProcessor

siglip_processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
siglip_model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

inputs = siglip_processor(images=image, text=["a photo of a cat"], return_tensors="pt")

with torch.no_grad():
    outputs = siglip_model(**inputs)
    image_embeds = outputs.image_embeds                   # (1, 768) — normalized
    text_embeds = outputs.text_embeds                     # (1, 768) — normalized
    similarity = (image_embeds @ text_embeds.T)           # cosine similarity
    print(f"SigLIP image-text similarity: {similarity.item():.4f}")

References

Seminal papers and key works referenced in this article.

  1. Dosovitskiy et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR, 2021. arXiv
  2. Touvron et al. "Training data-efficient image transformers & distillation through attention." ICML, 2021. arXiv
  3. Oquab et al. "DINOv2: Learning Robust Visual Features without Supervision." TMLR, 2024. arXiv
  4. Zhai et al. "Sigmoid Loss for Language Image Pre-Training." ICCV, 2023. arXiv
  5. He et al. "Masked Autoencoders Are Scalable Vision Learners." CVPR, 2022. arXiv