Introduction
For nearly a decade after AlexNet (Krizhevsky et al., 2012), the architecture of choice for visual recognition was the convolutional neural network. CNNs are built on a powerful inductive bias: locality. A 3×3 kernel can only see 9 pixels at a time, and the assumption is that neighboring pixels carry the most relevant information. This bias works well — until it doesn't.
The locality bottleneck manifests in three ways. First, building long-range dependencies requires stacking many layers — a ResNet-50 needs 50 layers for its deepest features to span the full image. Second, the effective receptive field of a CNN is much smaller than the theoretical one (Luo et al., 2016), meaning most neurons are heavily biased toward their spatial center. Third, the architecture is rigid: convolutions operate on fixed grids, making it awkward to handle varying resolutions or aspect ratios.
Meanwhile, in NLP, the Transformer (Vaswani et al., 2017) solved long-range dependencies trivially: self-attention lets every token attend to every other token in a single layer. The natural question was: can we do the same for images?
The answer came in October 2020 from a team at Google Brain. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, and colleagues published An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. The key insight was brutal in its simplicity: split an image into fixed-size patches, treat each patch as a "word," and apply a standard transformer encoder. No convolutions, no pooling, no feature pyramids. They called it the Vision Transformer (ViT).
The results were striking. When pre-trained on sufficiently large datasets (JFT-300M, with 300 million images), ViT matched or exceeded the state of the art on ImageNet, CIFAR-100, and VTAB, while being cheaper to train than comparable CNNs. The paper demonstrated that the inductive biases of convolutions are not necessary — they can be learned from data, provided you have enough of it.
This article builds the entire ViT from first principles. We derive every dimension, explain every design choice, and trace the lineage from the original ViT through DINOv2 and SigLIP — the vision encoders that power today's vision-language models.
Patch Embedding
A transformer operates on a sequence of vectors. A text transformer receives a sequence of token embeddings. A vision transformer needs the same thing: a sequence of vectors, each representing a chunk of the image. The patch embedding is the mechanism that converts a raw image tensor into this sequence.
Splitting into Patches
Start with an image of spatial dimensions H × W and C color channels (typically C=3 for RGB). Choose a patch size P. Divide the image into a non-overlapping grid of patches, each of size P × P pixels. The number of patches along the height is H/P, and along the width is W/P. The total number of patches is:

N = (H/P) × (W/P)
For the standard ViT-B/16 configuration: the input image is 224 × 224 pixels, P = 16. So:

N = (224/16) × (224/16) = 14 × 14 = 196
For ViT-L/14: the input image is 224 × 224, P = 14. So:

N = (224/14) × (224/14) = 16 × 16 = 256
Each patch is a small image of shape (P, P, C). Flatten it into a single vector of length P² · C. For ViT-B/16, each patch becomes a vector of 16² × 3 = 768 values. For ViT-L/14, each patch becomes 14² × 3 = 588 values.
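This split-and-flatten step can be verified in a few lines of PyTorch (a minimal sketch using the ViT-B/16 numbers above, with reshapes instead of any real model code):

```python
import torch

H = W = 224; P = 16; C = 3
img = torch.randn(C, H, W)

# Split into a non-overlapping grid: (C, H/P, P, W/P, P)
patches = img.reshape(C, H // P, P, W // P, P)
# Reorder to (H/P, W/P, P, P, C), then flatten each patch to P*P*C values
patches = patches.permute(1, 3, 2, 4, 0).reshape(-1, P * P * C)

print(patches.shape)  # torch.Size([196, 768])
```

The 196 rows are the patches, and each row is the flattened 16×16×3 pixel block.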
Linear Projection
The flattened patch vector has dimension P² · C, but the transformer expects vectors of dimension D (the model's hidden dimension). A learned linear projection maps each flattened patch into the transformer's embedding space:

z_i = E x_i + b, where E ∈ ℝ^(D × (P²·C)), b ∈ ℝ^D

For ViT-B/16: D = 768, so the projection matrix E has shape (768, 768) — a coincidence arising from 16² × 3 = 768. For ViT-L/14: D = 1024, so E has shape (1024, 588). The full output of patch embedding is a matrix Z ∈ ℝ^(N × D).
Conv2d Equivalence
Here is a fact that surprises many practitioners: the patch embedding is exactly
equivalent to a single convolutional layer with kernel size P and stride P. A
Conv2d(in_channels=3, out_channels=D, kernel_size=P, stride=P) does the following:
it slides a P × P kernel across the image with stride P (no overlap), and at each position
it computes a dot product between the kernel weights and the patch pixels, producing D output
values. The output spatial dimensions are H/P × W/P = 14 × 14 for ViT-B/16.
Reshaping the output from (D, 14, 14) to (196, D) gives exactly the same result as flattening
patches and multiplying by E.
In practice, every major implementation uses Conv2d for the patch embedding rather than manually reshaping and projecting. It is numerically identical but faster, because convolution kernels are heavily optimized in libraries like cuDNN. The original ViT paper describes the linear projection, but the code uses Conv2d.
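The equivalence is easy to check numerically. The sketch below (illustrative, not taken from any paper's code) extracts patches with `unfold`, reuses the Conv2d's own weights as the projection matrix E, and compares the two outputs:

```python
import torch
import torch.nn as nn

P, C, D = 16, 3, 768
conv = nn.Conv2d(C, D, kernel_size=P, stride=P)

img = torch.randn(1, C, 224, 224)
out_conv = conv(img).flatten(2).transpose(1, 2)  # (1, 196, 768)

# Manual version: extract P×P patches, flatten, multiply by the same weights
patches = img.unfold(2, P, P).unfold(3, P, P)          # (1, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, C * P * P)
E = conv.weight.reshape(D, C * P * P)                  # (768, 768)
out_manual = patches @ E.T + conv.bias

print(torch.allclose(out_conv, out_manual, atol=1e-4))  # True
```

Both paths compute the same dot products; only the summation order (and thus tiny float error) differs.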
The [CLS] Token
Why It Exists
After patch embedding, we have N patch tokens, each representing a spatial region of the image.
For classification, we need a single vector that summarizes the entire image. BERT
(Devlin et al., 2019) solved an analogous problem in NLP by prepending a special
[CLS] token to the input sequence. ViT adopts the same strategy.
A learnable vector x_cls ∈ ℝ^D is prepended to the sequence
of patch embeddings, making the total sequence length N + 1. During training, this token has
no spatial bias — it doesn't correspond to any particular image region. Through the self-attention
layers, it aggregates information from all patches. At the final layer, the [CLS] token's output
is used as the image representation for classification.
Sequence length: N + 1 = 197 for ViT-B/16, 257 for ViT-L/14
The [CLS] token is randomly initialized and learned during training. Its initial value has no semantic content — all its information comes from attending to the patch tokens across multiple transformer layers.
Mean Pooling Alternative
An alternative to [CLS] is global average pooling (GAP): simply average all N patch token outputs at the final layer. This was standard in CNNs (e.g., after ResNet's last convolutional block). Some ViT variants found that GAP performs comparably to [CLS] and sometimes better, particularly when the model is trained with sufficient data.
DINOv2 (Oquab et al., 2024) uses both: the [CLS] token output and the mean of all patch tokens are concatenated or used separately depending on the downstream task. The [CLS] token captures a holistic summary, while mean-pooled patch tokens preserve more fine-grained spatial information. For dense prediction tasks (segmentation, depth estimation), the individual patch tokens are used directly, since each one retains correspondence to a spatial region.
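As a minimal sketch, both summaries are simple slices of the final hidden states (shapes follow ViT-B/16; the concatenation mirrors the DINOv2-style combination described above):

```python
import torch

# Final-layer outputs: batch of 2, 1 [CLS] token + 196 patch tokens, dim 768
hidden = torch.randn(2, 197, 768)

cls_summary = hidden[:, 0]            # (2, 768) — [CLS] token output
mean_summary = hidden[:, 1:].mean(1)  # (2, 768) — GAP over patch tokens

# Combined representation for downstream heads (DINOv2-style)
combined = torch.cat([cls_summary, mean_summary], dim=-1)  # (2, 1536)
print(cls_summary.shape, mean_summary.shape, combined.shape)
```

For dense prediction, `hidden[:, 1:]` would be used directly, since each of the 196 tokens still maps to a 16×16 image region.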
Positional Embeddings
Self-attention is permutation-equivariant: if you shuffle the input tokens, the output tokens are shuffled in exactly the same way. The attention operation itself has no notion of order or position. This means that without positional information, a ViT cannot distinguish between a patch in the top-left corner and one in the bottom-right — it treats the sequence as a bag of patches.
To inject spatial information, positional embeddings are added to the patch embeddings before
they enter the transformer. Each position i in the sequence gets an embedding
e_pos(i) ∈ ℝ^D, which is added element-wise to the token
at that position.
Learned 1D Positional Embeddings
The original ViT uses the simplest approach: a learnable lookup table. Create a parameter matrix
E_pos ∈ ℝ^((N+1) × D) (one row for each position,
including the [CLS] token). These embeddings are randomly initialized and trained with the
rest of the model via backpropagation.
Dosovitskiy et al. (2020) found that learned 1D positional embeddings work just as well as more sophisticated 2D-aware alternatives. The model learns to encode 2D spatial structure in the 1D position embeddings: visualizing the cosine similarity between position embeddings reveals a clear 2D grid pattern, showing the model discovers the row/column structure of patches on its own.
2D Sinusoidal Positional Embeddings
An alternative is to use fixed sinusoidal functions, extended to 2D. For a patch at grid position (r, c), encode the row and column separately using sine and cosine functions at different frequencies, then concatenate:

e_pos(r, c) = [e_sin(r) ; e_sin(c)], where each half is a standard D/2-dimensional sinusoidal encoding
This provides the model with explicit 2D spatial information without any learned parameters. The ViT paper found no significant advantage over learned 1D embeddings, but 2D sinusoidal encodings are used in some later works (e.g., MAE by He et al., 2022).
Resolution Interpolation
A practical problem: if you train ViT at 224 × 224 (196 patches with P=16) but want to fine-tune at 384 × 384 (576 patches), the positional embedding table has the wrong number of entries. The solution is bicubic interpolation of the positional embeddings.
Reshape the 1D position embeddings back to a 2D grid (14 × 14 for the original resolution), apply 2D bicubic interpolation to upsample to the new grid size (24 × 24 for 384 × 384 resolution), then flatten back to 1D. The [CLS] token's positional embedding is kept unchanged. This works surprisingly well in practice, though there is typically a brief dip in performance at the start of fine-tuning before the model adapts to the interpolated positions.
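The procedure can be sketched as follows (a minimal example for the 224 → 384 case above; `pos_embed` is random here purely for illustration):

```python
import torch
import torch.nn.functional as F

D = 768
pos_embed = torch.randn(1, 197, D)  # [CLS] + 14×14 grid, trained at 224px

cls_pos, grid_pos = pos_embed[:, :1], pos_embed[:, 1:]

# (1, 196, D) -> (1, D, 14, 14) -> bicubic upsample -> (1, D, 24, 24)
grid_pos = grid_pos.transpose(1, 2).reshape(1, D, 14, 14)
grid_pos = F.interpolate(grid_pos, size=(24, 24), mode="bicubic",
                         align_corners=False)
grid_pos = grid_pos.reshape(1, D, 24 * 24).transpose(1, 2)  # (1, 576, D)

# [CLS] position embedding passes through unchanged
new_pos_embed = torch.cat([cls_pos, grid_pos], dim=1)
print(new_pos_embed.shape)  # torch.Size([1, 577, 768])
```

The new table has 576 + 1 entries, matching the 384 × 384 patch grid plus [CLS].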
Self-Attention on Patches
Self-attention is the core operation that gives ViT its power. Every patch can directly attend to every other patch in a single layer — no stacking required. Let us derive the full mechanism from scratch.
Q, K, V Derivation
Let X ∈ ℝ^((N+1) × D) be the input sequence (N patch tokens plus the [CLS] token, each of dimension D). Three learned weight matrices project X into queries, keys, and values:

Q = X W_Q,  K = X W_K,  V = X W_V, where W_Q, W_K, W_V ∈ ℝ^(D × D)
Q, K, and V each have shape (N+1) × D. The query at position i asks "what am I looking for?". The key at position j announces "here is what I contain." The value at position j holds "here is the information I'll contribute if attended to." The attention weight between positions i and j is the dot product of query i with key j:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

The matrix Q Kᵀ has shape (N+1) × (N+1), where entry (i, j) is the raw attention score between tokens i and j. The division by √d_k is critical: without it, the dot products grow in magnitude with the dimension d_k, pushing the softmax into regions of extremely small gradients. Dividing by √d_k keeps the variance of the dot products approximately 1, regardless of d_k.
The softmax normalizes each row to sum to 1, producing a probability distribution over all positions for each query. Multiplying by V computes a weighted average of the value vectors, where the weights are the attention probabilities.
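The full operation is a few lines from scratch (a single head, no batch dimension, with ViT-B/16 shapes):

```python
import math
import torch

def attention(Q, K, V):
    # scores: (N, N); scale by sqrt(d_k) to keep variance ~1
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = scores.softmax(dim=-1)  # each row sums to 1
    return weights @ V, weights

N, d_k = 197, 64
Q, K, V = torch.randn(N, d_k), torch.randn(N, d_k), torch.randn(N, d_k)
out, w = attention(Q, K, V)
print(out.shape, w.shape)  # (197, 64) and (197, 197)
```

Each output row is a convex combination of all 197 value vectors, which is exactly the "global receptive field in one layer" property.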
Multi-Head Attention
A single attention head can only capture one type of relationship at a time. Multi-head attention runs h attention heads in parallel, each with its own Q, K, V projections, operating on a subspace of dimension d_k = D / h.

head_i = Attention(X W_Q^i, X W_K^i, X W_V^i), where W_Q^i, W_K^i, W_V^i ∈ ℝ^(D × d_k)

MultiHead(X) = Concat(head_1, ..., head_h) W_O, where W_O ∈ ℝ^(D × D)

For ViT-B/16: D = 768, h = 12 heads, d_k = 768 / 12 = 64. Each head operates on 64-dimensional queries, keys, and values. The outputs of all 12 heads are concatenated (back to dimension 768) and projected through W_O.
Different heads learn to attend to different things. In ViT, some heads learn local attention patterns (nearby patches), some learn global patterns (distant patches), and some specialize in attending to the [CLS] token. This diversity of attention patterns is analogous to having multiple convolutional kernels at different receptive fields, but learned rather than hardcoded.
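Mechanically, the head split is just a reshape: project once to dimension D, view the result as h heads of d_k dimensions, and merge back after attention. A minimal shape-only sketch with ViT-B/16 numbers:

```python
import torch

B, N, D, h = 1, 197, 768, 12
d_k = D // h  # 64

x = torch.randn(B, N, D)
W_q = torch.randn(D, D)  # packs all 12 heads' query projections in one matrix

# Project once, then split the 768 dims into 12 heads of 64
Q = (x @ W_q).reshape(B, N, h, d_k).transpose(1, 2)  # (B, 12, 197, 64)
print(Q.shape)

# After per-head attention, concatenate heads back to D dims
merged = Q.transpose(1, 2).reshape(B, N, D)  # (B, 197, 768)
print(merged.shape)
```

The same reshape pattern applies to K and V; the final `merged` tensor is what gets multiplied by W_O.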
Computational Cost
The dominant cost in self-attention is computing Q Kᵀ, which multiplies Q ∈ ℝ^((N+1) × d_k) by Kᵀ ∈ ℝ^(d_k × (N+1)), giving O(N² d_k) operations per head, or O(N² D) across all heads (since h × d_k = D). The memory cost for storing the attention matrix is O(N²) per head.
For ViT-B/16 (N = 196): the attention matrix is 197 × 197 ≈ 39K entries per head, times 12 heads = ~466K entries. This is manageable. For ViT-L/14 (N = 256): 257 × 257 ≈ 66K per head, times 16 heads = ~1.06M entries. Still tractable.
But the quadratic scaling with N is why ViTs struggle with high-resolution images. At resolution 1024 × 1024 with P=16, N = 4096 patches, and the attention matrix has ~16.8M entries per head. This is the fundamental bottleneck that motivates efficient attention variants (Swin Transformer, etc.), though the standard ViT we describe here uses full attention.
For comparison, a convolutional layer with kernel size K on a feature map of size H × W with C channels costs O(K² · C · H · W). The key difference: the CNN cost is linear in spatial resolution (H × W) but the receptive field is limited to K. The ViT cost is quadratic in spatial resolution but the receptive field is global.

CNNs: O(K² · C · H · W) — linear in spatial size, local receptive field.
ViTs: O(N² · D) — quadratic in spatial size, global receptive field.
This is not a flaw of either architecture but a trade-off. ViTs pay a quadratic cost to get global attention in every layer. CNNs pay a linear cost but need many layers for global context.
ViT Architecture
With patch embedding, positional encoding, and self-attention defined, we can now assemble the complete Vision Transformer. The architecture follows the standard transformer encoder from Vaswani et al. (2017) with minor modifications.
The Transformer Block
Each transformer block applies two sub-layers with residual connections and pre-normalization (LayerNorm before the operation, not after — this is the "Pre-LN" variant used by ViT, which is more stable during training than Post-LN):

y = x + MHSA(LayerNorm(x))
z = y + MLP(LayerNorm(y))
The residual connections are critical. Without them, gradients must flow through the attention and MLP operations at every layer, making deep networks hard to train. With residuals, gradients have a direct path through the network (the "skip connection highway"), and each block only needs to learn a correction to the identity mapping.
LayerNorm normalizes each token independently: for a vector x ∈ ℝ^D, compute the mean and variance across its D dimensions, then normalize and apply learnable scale and shift:

LN(x) = γ ⊙ (x − μ) / √(σ² + ε) + β, where μ = mean(x), σ² = var(x), and γ, β ∈ ℝ^D are learned
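The per-token formula can be checked against PyTorch's `nn.LayerNorm` (note the biased variance, and that γ initializes to 1 and β to 0):

```python
import torch
import torch.nn as nn

x = torch.randn(197, 768)   # one token per row
ln = nn.LayerNorm(768)

# Manual: mean/variance over the 768 feature dims of each token
mu = x.mean(-1, keepdim=True)
var = x.var(-1, unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

print(torch.allclose(manual, ln(x), atol=1e-5))  # True
```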
The MLP Block
The MLP (sometimes called the Feed-Forward Network, FFN) consists of two linear layers with a GELU nonlinearity in between. The first layer expands the dimension by a factor of 4 (the "expansion ratio"), and the second projects back:

MLP(x) = W_2 GELU(W_1 x + b_1) + b_2, where W_1 ∈ ℝ^(4D × D), W_2 ∈ ℝ^(D × 4D)
For ViT-B: the MLP maps 768 → 3072 → 768. For ViT-L: 1024 → 4096 → 1024. The expansion to 4× the dimension gives the model a wider space in which to compute nonlinear transformations before projecting back to the residual stream.
GELU (Gaussian Error Linear Unit, Hendrycks & Gimpel, 2016) is defined as:

GELU(x) = x · Φ(x)

where Φ is the CDF of the standard normal distribution. Unlike ReLU, GELU is smooth everywhere and has a nonzero gradient for small negative inputs, which empirically helps training stability.
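A quick check that x · Φ(x) matches PyTorch's exact GELU (`torch.nn.functional.gelu` defaults to the erf-based form):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

# Φ(x), the standard normal CDF, expressed via the error function
phi = 0.5 * (1 + torch.erf(x / 2 ** 0.5))
manual = x * phi

print(torch.allclose(manual, F.gelu(x), atol=1e-6))  # True
```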
Full Stack
The complete ViT forward pass:
- Patch Embed: Image ∈ ℝ^(H×W×3) → Z ∈ ℝ^(N×D) via Conv2d(3, D, P, P)
- Prepend [CLS]: Z → [x_cls ; Z] ∈ ℝ^((N+1)×D)
- Add Position: Z = Z + E_pos
- Transformer Blocks: Repeat L times: LayerNorm → MHSA → Residual → LayerNorm → MLP → Residual
- Final LayerNorm: Apply LayerNorm to the output
- Extract [CLS]: Take the first token as the image representation
- Classification Head: Linear layer maps D → num_classes
| Component | ViT-B/16 | ViT-L/14 |
|---|---|---|
| Patch size P | 16 | 14 |
| Number of patches N | 196 (14×14) | 256 (16×16) |
| Hidden dim D | 768 | 1024 |
| Transformer layers L | 12 | 24 |
| Attention heads h | 12 | 16 |
| Head dim dk | 64 | 64 |
| MLP dim | 3072 | 4096 |
| Parameters | 86M | 304M |
| Sequence length (with [CLS]) | 197 | 257 |
DINOv2
The original ViT was trained with supervised classification on labeled datasets. But the most powerful vision encoders for VLMs don't use labels at all. DINOv2 (Oquab et al., 2024) is a self-supervised Vision Transformer that produces features with remarkable quality — without seeing a single text label during training.
Self-Distillation with No Labels
DINO (Caron et al., 2021) introduced self-distillation with no labels (the acronym). The idea: maintain two copies of the network — a student and a teacher. Both see the same image but with different augmentations (crops, color jitter, etc.). The student is trained to match the teacher's output distribution. The teacher is not trained with gradients — instead, its weights are an exponential moving average (EMA) of the student's weights:

θ_teacher ← λ θ_teacher + (1 − λ) θ_student, where λ follows a cosine schedule from 0.996 to 1.0
The training objective is cross-entropy between the teacher's and student's output distributions. Crucially, the student sees local crops (small regions, 96×96) while the teacher sees global crops (large regions, 224×224). This forces the student to infer global context from local patches — a powerful learning signal.
A centering and sharpening mechanism prevents mode collapse (the trivial solution where both networks output the same constant). The teacher's output is centered by subtracting a running mean, and sharpened by using a low temperature in the softmax.
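These two mechanisms fit in a few lines. The sketch below uses toy linear networks and illustrative constants (the temperature and momentum values are assumptions, not DINO's training code):

```python
import torch
import torch.nn as nn

student = nn.Linear(8, 4)
teacher = nn.Linear(8, 4)
teacher.load_state_dict(student.state_dict())  # start identical

@torch.no_grad()
def ema_update(lam=0.996):
    # teacher <- lam * teacher + (1 - lam) * student; no gradients flow
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(lam).add_(s, alpha=1 - lam)

center = torch.zeros(4)  # running mean of teacher outputs

def teacher_targets(x, temp=0.04, momentum=0.9):
    global center
    with torch.no_grad():
        logits = teacher(x)
        center = momentum * center + (1 - momentum) * logits.mean(0)
        # centering (subtract running mean) + sharpening (low temperature)
        return ((logits - center) / temp).softmax(-1)

targets = teacher_targets(torch.randn(16, 8))
ema_update()
print(targets.shape)  # torch.Size([16, 4])
```

Centering pushes the outputs away from any single dominant dimension, while the low temperature sharpens them; together they rule out the constant-output collapse.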
DINOv2 extends DINO with several improvements:
- Data: The LVD-142M dataset, a curated collection of 142 million images retrieved and deduplicated from a pool of 1.2 billion web images. No manual labeling.
- Architecture: ViT-g/14 (the "giant" variant) with 1.1 billion parameters, D = 1536, L = 40 layers, 24 attention heads.
- Combined objective: Both the DINO self-distillation loss and an iBOT masked image modeling loss (predicting masked patch tokens) are used simultaneously.
- Distillation cascade: After training the ViT-g, smaller models (ViT-S, ViT-B, ViT-L) are trained via distillation from the ViT-g teacher, transferring its representation quality to efficient models.
Feature Quality
DINOv2 features exhibit remarkable spatial correspondence: the patch tokens from two different images of the same object category have similar features at corresponding spatial locations. For example, the patch token at the "left eye" position of one cat image is similar to the "left eye" token of another cat image — even without any supervision telling the model what an eye is.
This emerges because the self-distillation objective forces the model to build representations that are invariant to viewpoint and appearance changes (captured by the different augmentations) while preserving spatial structure (because the student must reconstruct global context from local crops). The result is a feature space where semantic similarity aligns with spatial correspondence.
DINOv2 features achieve state-of-the-art performance on dense prediction tasks (monocular depth, semantic segmentation) with simple linear probes — a single learned linear layer on top of the frozen features. This indicates the features already encode rich spatial and semantic information without any task-specific training.
SigLIP
While DINOv2 trains with vision-only self-supervision, another family of vision encoders learns from image-text pairs. CLIP (Radford et al., 2021) pioneered contrastive language-image pretraining: given a batch of (image, text) pairs, learn an image encoder and a text encoder such that matching pairs have high similarity and non-matching pairs have low similarity. SigLIP (Zhai et al., 2023) improves CLIP's training loss to enable better scaling.
Sigmoid vs. Softmax Contrastive Loss
CLIP uses a softmax-based contrastive loss (InfoNCE). For a batch of B image-text pairs, compute the B × B similarity matrix between all image and text embeddings. The loss treats each row (and column) as a classification problem: the correct pair should have the highest similarity.
The softmax couples all B samples in the denominator, requiring them to be computed on the same device. This limits effective batch sizes and requires complex multi-device synchronization.
SigLIP replaces softmax with sigmoid. Each (image, text) pair is treated independently as a binary classification: is this pair a match or not?

L = −(1/B) Σ_i Σ_j [ y_ij log σ(z_ij) + (1 − y_ij) log(1 − σ(z_ij)) ], where z_ij = (sim(I_i, T_j) − b) / τ and y_ij = 1 if i = j, else 0
Because each pair is evaluated independently (no softmax denominator coupling the batch), the loss decomposes naturally across devices. This allows much larger effective batch sizes — SigLIP was trained with batch sizes up to 32,768 image-text pairs. Larger batches provide more negative examples per step, improving representation quality.
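A minimal sketch of the pairwise loss under the convention above (toy embeddings; in the real model b and τ are learned scalars and the batch is far larger):

```python
import torch
import torch.nn.functional as F

B, D = 8, 16
img = F.normalize(torch.randn(B, D), dim=-1)  # image embeddings
txt = F.normalize(torch.randn(B, D), dim=-1)  # text embeddings

tau, b = 0.07, 0.0                 # temperature and bias (learned in practice)
z = (img @ txt.T - b) / tau        # (B, B) pairwise logits
y = torch.eye(B)                   # y_ij = 1 iff the pair matches

# Independent binary cross-entropy per pair — no softmax coupling the batch
loss = F.binary_cross_entropy_with_logits(z, y)
print(loss.item())
```

Because the loss is a plain sum over pairs, each device can score its local image-text block independently and the partial losses simply add up.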
SigLIP in Practice
SigLIP-SO400M ("Shape-Optimized, 400M parameters") uses a ViT-SO400M architecture: a ViT with modified hidden dimensions and head counts optimized via architecture search. It is trained on the WebLI dataset with roughly 4 billion image-text pairs. The resulting vision encoder achieves ImageNet zero-shot accuracy competitive with much larger CLIP models.
SigLIP is the vision encoder used in PaLI-3 (Chen et al., 2023) and OpenVLA (Kim et al., 2024). In OpenVLA, the SigLIP vision encoder processes the robot's camera image: the image passes through the ViT, the patch tokens (or [CLS] token) are projected into the language model's embedding space, and the language model conditions on these visual tokens to generate robot actions. The quality of the vision encoder directly determines how well the VLA understands the visual scene.
Scaling ViTs
Model Variants
The ViT family follows a naming convention: ViT-{size}/{patch_size}. The size
controls the hidden dimension, number of layers, and number of heads. The patch size controls
the sequence length (and thus the resolution of the visual representation).
| Model | Layers (L) | Hidden Dim (D) | Heads (h) | MLP Dim | Params | Patches (224px) |
|---|---|---|---|---|---|---|
| ViT-S/16 | 12 | 384 | 6 | 1536 | 22M | 196 |
| ViT-B/16 | 12 | 768 | 12 | 3072 | 86M | 196 |
| ViT-L/14 | 24 | 1024 | 16 | 4096 | 304M | 256 |
| ViT-H/14 | 32 | 1280 | 16 | 5120 | 632M | 256 |
| ViT-g/14 | 40 | 1536 | 24 | 6144 | 1.1B | 256 |
Smaller patch sizes (e.g., /14 vs. /16) create more patches, giving finer spatial resolution at the cost of longer sequences and quadratically higher attention cost. The trend in modern VLMs is to use /14 patch size for the best trade-off between resolution and cost.
Data Requirements
A critical finding of the original ViT paper: ViTs need more training data than CNNs. When trained on ImageNet-1K alone (1.28M images), ViT-B underperforms a comparable ResNet. On ImageNet-21K (14M images), they are roughly equal. On JFT-300M (300M images), ViT-B surpasses the best ResNets.
The reason is the inductive bias gap. Convolutions hardcode locality and translation equivariance, giving CNNs a strong prior that compensates for limited data. ViTs lack these biases and must learn spatial structure from examples. With insufficient data, the model overfits or fails to discover useful spatial patterns. With sufficient data, the lack of inductive bias becomes an advantage: the model can learn representations unconstrained by the architecture's assumptions.
Subsequent work has partially addressed this data hunger:
- DeiT (Touvron et al., 2021): Showed ViTs can match CNNs on ImageNet-1K alone with strong data augmentation (RandAugment, CutMix, Mixup) and regularization (stochastic depth, repeated augmentation).
- MAE (He et al., 2022): Masked autoencoder pre-training — mask 75% of patches and train the model to reconstruct them. This self-supervised approach produces excellent features from ImageNet alone.
- LAION-5B (Schuhmann et al., 2022): An open-source dataset of 5.8 billion image-text pairs, enabling CLIP/SigLIP-style training at massive scale.
The scaling law for ViTs approximately follows: performance improves log-linearly with both model size and dataset size. Doubling the model parameters requires roughly 4× the data to fully utilize the increased capacity. This is analogous to the Chinchilla scaling laws (Hoffmann et al., 2022) observed in LLMs, though the exact constants differ for vision.
Code Examples
Patch Embedding from Scratch
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Convert image to sequence of patch embeddings.

    Equivalent to: flatten P×P patches, then linear project.
    Implemented as Conv2d for efficiency.
    """
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 196 for ViT-B/16
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size  # no overlap
        )

    def forward(self, x):
        # x: (B, 3, 224, 224)
        x = self.proj(x)       # (B, 768, 14, 14)
        x = x.flatten(2)       # (B, 768, 196)
        x = x.transpose(1, 2)  # (B, 196, 768)
        return x

# Verify shapes
embed = PatchEmbedding()
img = torch.randn(1, 3, 224, 224)
patches = embed(img)
print(f"Patches shape: {patches.shape}")  # (1, 196, 768)
ViT Transformer Block
class ViTBlock(nn.Module):
    """Single ViT transformer block with Pre-LN."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0, drop=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),  # 768 -> 3072
            nn.GELU(),
            nn.Dropout(drop),
            nn.Linear(int(dim * mlp_ratio), dim),  # 3072 -> 768
            nn.Dropout(drop),
        )

    def forward(self, x):
        # Pre-LN: normalize before attention
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)  # self-attention: Q=K=V=h
        x = x + h                  # residual connection
        # Pre-LN: normalize before MLP
        h = self.norm2(x)
        h = self.mlp(h)
        x = x + h                  # residual connection
        return x

block = ViTBlock()
tokens = torch.randn(1, 197, 768)  # 196 patches + 1 [CLS]
out = block(tokens)
print(f"Output shape: {out.shape}")  # (1, 197, 768)
Full ViT Forward Pass
class ViT(nn.Module):
    """Vision Transformer (ViT-B/16 by default)."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 num_classes=1000, embed_dim=768, depth=12,
                 num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        num_patches = self.patch_embed.num_patches
        # Learnable [CLS] token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        # Learnable positional embeddings (1D)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        # Stack of transformer blocks
        self.blocks = nn.Sequential(*[
            ViTBlock(embed_dim, num_heads, mlp_ratio) for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]
        # 1. Patch embedding
        x = self.patch_embed(x)                 # (B, 196, 768)
        # 2. Prepend [CLS] token
        cls = self.cls_token.expand(B, -1, -1)  # (B, 1, 768)
        x = torch.cat([cls, x], dim=1)          # (B, 197, 768)
        # 3. Add positional embeddings
        x = x + self.pos_embed                  # (B, 197, 768)
        # 4. Transformer blocks
        x = self.blocks(x)                      # (B, 197, 768)
        # 5. Final LayerNorm
        x = self.norm(x)
        # 6. Extract [CLS] token output
        cls_out = x[:, 0]                       # (B, 768)
        # 7. Classification head
        logits = self.head(cls_out)             # (B, 1000)
        return logits

model = ViT()
img = torch.randn(2, 3, 224, 224)
logits = model(img)
print(f"Logits shape: {logits.shape}")  # (2, 1000)
print(f"Total params: {sum(p.numel() for p in model.parameters()):,}")  # ~86M
Feature Extraction with HuggingFace
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# === DINOv2 Feature Extraction ===
dino_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino_model = AutoModel.from_pretrained("facebook/dinov2-base")

image = Image.open("example.jpg")
inputs = dino_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = dino_model(**inputs)

cls_token = outputs.last_hidden_state[:, 0]      # (1, 768) — global
patch_tokens = outputs.last_hidden_state[:, 1:]  # (1, 256, 768) — spatial
print(f"DINOv2 CLS: {cls_token.shape}")
print(f"DINOv2 patches: {patch_tokens.shape}")

# === SigLIP Feature Extraction ===
from transformers import AutoModel, AutoProcessor

siglip_processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
siglip_model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

inputs = siglip_processor(images=image, text=["a photo of a cat"], return_tensors="pt")
with torch.no_grad():
    outputs = siglip_model(**inputs)

image_embeds = outputs.image_embeds  # (1, 768) — normalized
text_embeds = outputs.text_embeds    # (1, 768) — normalized
similarity = (image_embeds @ text_embeds.T)  # cosine similarity
print(f"SigLIP image-text similarity: {similarity.item():.4f}")
References
Seminal papers and key works referenced in this article.
- Dosovitskiy et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR, 2021. arXiv
- Touvron et al. "Training data-efficient image transformers & distillation through attention." ICML, 2021. arXiv
- Oquab et al. "DINOv2: Learning Robust Visual Features without Supervision." TMLR, 2024. arXiv
- Zhai et al. "Sigmoid Loss for Language Image Pre-Training." ICCV, 2023. arXiv
- He et al. "Masked Autoencoders Are Scalable Vision Learners." CVPR, 2022. arXiv