Introduction
For the first fifty years of computer vision, images and text were different universes. An image was a grid of pixel values. A sentence was a sequence of token IDs. They shared nothing — no common representation, no shared space, no way to compute the "distance" between a photo and a description. To connect them, you needed a task-specific model: an image captioner, a visual question answerer, an object detector trained on a fixed label set. Every new task required new training data, new architectures, new supervised labels.
Then, in January 2021, OpenAI published a paper titled "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021). The model was called CLIP — Contrastive Language–Image Pre-training. The idea was almost embarrassingly simple: take 400 million image-text pairs scraped from the internet, train two encoders (one for images, one for text) to map matching pairs close together and non-matching pairs far apart, and see what happens.
What happened changed the field. CLIP matched the performance of a fully supervised ResNet-50 on ImageNet — without ever seeing a single ImageNet label. It could classify images into categories it had never been explicitly trained on, just by comparing image embeddings to text embeddings of natural language descriptions. It gave birth to the idea that the same model could handle any visual task, if you just described the task in words.
Every major VLM today — GPT-4V, Gemini, Claude's vision, LLaVA, Qwen-VL — inherits from this lineage. The vision encoder in these systems is almost always a CLIP-trained (or CLIP-descended) model. Understanding how contrastive learning works, why it produces such powerful representations, and where it fails is not optional background — it is the core theory of modern vision-language systems.
This article derives everything from scratch. We start with the abstract idea of contrastive learning, derive the loss functions from first principles, walk through every architectural choice in CLIP, and build working code. By the end, you will understand not just what CLIP does, but why each piece is the way it is.
We derive contrastive learning from information theory. We build NCE and InfoNCE losses from scratch, walk through every component of the CLIP architecture, explain zero-shot classification, analyze the shared embedding space geometry, catalog CLIP's known failure modes, and survey the models that came after it. Four interactive visualizations let you experiment with each concept.
Contrastive Learning Theory
Before we touch CLIP, we need to understand the idea underneath it: contrastive learning. The core insight is deceptively simple. You do not need labels to learn useful representations. You just need to know which things should be similar and which should be different.
Consider a thought experiment. You have never seen a dog or a cat. Someone shows you two photos and says "these two are the same kind of thing" (two dogs), then shows you another pair and says "these two are different kinds of things" (a dog and a cat). With enough such pairs, you would develop an internal representation that separates dogs from cats — without ever being told what "dog" or "cat" means. You learned by contrast.
Positive and negative pairs
Contrastive learning formalizes this. Given a dataset, we construct:
- Positive pairs (x, x+): two items that should have similar representations. In self-supervised vision, these might be two augmented views of the same image. In CLIP, they are an image and its matching caption.
- Negative pairs (x, x−): two items that should have different representations. In a batch of B pairs, every non-matching combination is a negative.
The model learns an encoder f(x) → z ∈ R^d that maps inputs to a d-dimensional embedding space. The learning objective pushes positive pairs together and negative pairs apart in this space. That is the entire idea. Everything else is details about how to push and pull, and those details turn out to matter enormously.
Embedding space geometry
What does "close" and "far" mean in embedding space? There are two standard choices:
Euclidean distance measures straight-line distance: d(a, b) = ||a − b||₂. This is intuitive but has a problem: vectors of different magnitudes can be close in direction but far in Euclidean distance.
Cosine similarity measures the angle between vectors:

sim(a, b) = (a · b) / (||a|| · ||b||)
Cosine similarity ranges from −1 (opposite directions) to +1 (same direction). It is invariant to vector magnitude — only the direction matters. This is why CLIP and nearly all contrastive learning systems L2-normalize their embeddings before computing similarity. After normalization, ||z|| = 1 for all embeddings, and cosine similarity reduces to the dot product: sim(a, b) = a · b.
Geometrically, normalized embeddings live on a unit hypersphere S^(d−1) in R^d. Learning pushes positive pairs toward the same point on this sphere and negative pairs toward distant points. The hypersphere has a beautiful property: it is compact (no point can "escape" to infinity), and every point has the same local geometry, which prevents degenerate solutions where all embeddings collapse to a single point or a low-dimensional subspace.
Without normalization, a trivial solution exists: the model could make all embeddings identical (collapse). With L2 normalization on the sphere, making all embeddings identical means they all have the same direction, which maximizes similarity between all pairs — including negatives. The loss function penalizes high negative-pair similarity, so collapse is no longer a minimum. The hypersphere geometry forces the model to spread representations out.
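A few lines of PyTorch make the geometry concrete (a standalone sketch, not CLIP code): two vectors with the same direction but different magnitudes are far apart in Euclidean distance, yet identical under cosine similarity after L2 normalization.

```python
import torch
import torch.nn.functional as F

# Two embeddings pointing in the same direction, with different magnitudes.
a = torch.tensor([3.0, 4.0])
b = torch.tensor([0.3, 0.4])

# Euclidean distance is large even though the directions are identical.
euclidean = torch.norm(a - b)

# After L2 normalization, both vectors land on the unit circle...
a_n = F.normalize(a, dim=0)
b_n = F.normalize(b, dim=0)

# ...and cosine similarity reduces to a plain dot product.
cosine = a_n @ b_n
print(f"euclidean={euclidean.item():.2f}, cosine={cosine.item():.4f}")  # 4.50, 1.0000
```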
Noise Contrastive Estimation (NCE)
The loss function used in CLIP did not appear from nowhere. It has a precise lineage that begins with Noise Contrastive Estimation (Gutmann & Hyvärinen, 2010). To understand InfoNCE, we must first understand NCE. The derivation is worth following carefully because it reveals why contrastive losses work at all.
NCE derivation
The problem NCE solves: you want to estimate a probability distribution pdata(x) over some data, but computing the normalizing constant (the partition function Z) is intractable. This is common in language modeling, where the vocabulary is huge, and in energy-based models.
NCE's trick: instead of estimating pdata directly, convert the problem into binary classification. We have:
- Real data samples x ~ pdata(x), labeled as positive (y = 1)
- Noise samples x ~ pnoise(x), labeled as negative (y = 0)
We draw k noise samples for every real sample. By Bayes' rule, the posterior probability that a sample is real (given that it could be from either distribution) is:

P(y = 1 | x) = pdata(x) / (pdata(x) + k · pnoise(x))

We parameterize our model as pθ(x) and define the log-density ratio:

G(x) = log pθ(x) − log pnoise(x)

Then the binary classification loss is:

L(θ) = −E_{x~pdata}[log σ(G(x) − log k)] − k · E_{x~pnoise}[log(1 − σ(G(x) − log k))]
where σ is the sigmoid function. Gutmann and Hyvärinen proved that as the number of noise samples k → ∞, the optimal θ* satisfies pθ*(x) = pdata(x). The partition function is estimated implicitly through the classification objective. This is the key insight: you can learn distributions by learning to distinguish real data from noise.
Connection to mutual information
NCE has a deep connection to information theory that explains why contrastive learning produces such good representations. The mutual information between two random variables X and Y is:

I(X; Y) = E_{p(x,y)}[ log ( p(x, y) / (p(x) p(y)) ) ]
This measures how much knowing X tells you about Y (and vice versa). Computing I(X; Y) directly requires knowing the joint and marginal distributions, which is typically intractable. But contrastive losses provide a lower bound on mutual information.
The connection, formalized by Oord et al. (2018) in the CPC paper: an optimal contrastive model with k negative samples achieves a loss that lower-bounds I(X; Y) as:

I(X; Y) ≥ log(k + 1) − L_InfoNCE

where k + 1 is the number of candidates (one positive plus k negatives).
This means that minimizing a contrastive loss is equivalent to maximizing a lower bound on the mutual information between the positive pair members. For CLIP, this means the training objective maximizes the mutual information between image content and text descriptions — forcing the model to learn the shared semantic structure between visual and linguistic data.
The mutual information bound is capped at the log of the number of candidates: with a batch of B pairs, log(B). For CLIP's batch size of 32,768, that is log(32768) ≈ 10.4 nats. Larger batches therefore allow the model to capture more mutual information. This is one reason CLIP used such massive batch sizes — it is not just a computational convenience, it is a fundamental limit on what the model can learn.
InfoNCE Loss
The InfoNCE loss (Oord et al., 2018) is the specific contrastive loss used by CLIP. It generalizes NCE from binary classification (one positive vs. one negative) to a (1+k)-way classification (one positive vs. k negatives). Let us derive it step by step.
From NCE to InfoNCE
Consider a batch of B image-text pairs: {(I1, T1), (I2, T2), ..., (IB, TB)}. Each pair (Ii, Ti) is a positive pair — the image and text match. All cross-pairs (Ii, Tj) for i ≠ j are negative pairs — the image and text do not match.
For a given image Ii, we want to identify its matching text Ti from among all B texts in the batch. This is a B-way classification problem. Using a softmax over similarities:

P(Ti | Ii) = exp(sim(Ii, Ti)/τ) / Σ_{j=1}^{B} exp(sim(Ii, Tj)/τ)

The loss is the negative log-likelihood of the correct match:

L_{I→T} = −(1/B) Σ_{i=1}^{B} log [ exp(sim(Ii, Ti)/τ) / Σ_{j=1}^{B} exp(sim(Ii, Tj)/τ) ]

This is the image-to-text direction: for each image, classify which text matches. The symmetric text-to-image direction is identical with the roles swapped:

L_{T→I} = −(1/B) Σ_{i=1}^{B} log [ exp(sim(Ii, Ti)/τ) / Σ_{j=1}^{B} exp(sim(Ij, Ti)/τ) ]

The final CLIP loss averages both directions:

L_CLIP = (L_{I→T} + L_{T→I}) / 2
The B×B similarity matrix
The computation is best understood as a matrix operation. Encode all images and texts in a batch:

I = [fimg(I1), ..., fimg(IB)] ∈ R^(B×d)
T = [ftxt(T1), ..., ftxt(TB)] ∈ R^(B×d)

Compute the similarity matrix:

S = I T^T / τ ∈ R^(B×B)
Entry Sij is the similarity between image i and text j, divided by temperature. The diagonal entries Sii are the positive pairs. Off-diagonal entries are negatives. The loss applies softmax cross-entropy along each row (image-to-text) and each column (text-to-image), with the diagonal as the target.
This is extraordinarily efficient. With B = 32,768, we get B² − B ≈ 1 billion negative pairs from a single matrix multiply. Each image is contrasted against 32,767 texts, and each text against 32,767 images. The quadratic scaling of negatives with batch size is why contrastive learning benefits so much from large batches.
Temperature parameter τ
The temperature τ is a scalar (CLIP learns it as a log-parameterized value, initialized to log(1/0.07) ≈ 2.66). It controls the sharpness of the softmax distribution:
- Low τ (e.g., 0.01): the softmax becomes very peaked. The model is confident and focuses on hard negatives — the few examples that are most similar to the positive. This can lead to training instability.
- High τ (e.g., 1.0): the softmax is uniform. All negatives are weighted equally, and the model gets weak gradients. Learning is slow.
- τ ≈ 0.07 (CLIP's learned value): a sweet spot. The distribution is concentrated enough to learn from hard negatives but soft enough for stable training.
Critically, τ also controls the uniformity of the embedding space. Lower temperature pushes embeddings to spread more uniformly on the hypersphere (because the penalty for any two non-matching embeddings being close is amplified). Higher temperature allows more clustering. CLIP makes τ learnable so the model can find its own balance during training.
In traditional metric learning, you set an explicit margin α: positive pairs must be closer than negative pairs by at least α. Temperature in InfoNCE plays an analogous role. Dividing by a small τ amplifies all similarity differences, effectively requiring a larger gap between positive and negative similarities for the loss to be small. But unlike a fixed margin, the learned temperature adapts to the difficulty of the task.
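A toy softmax row shows the effect of τ directly. The similarity values below are made up for illustration: one positive (0.9) ahead of four negatives.

```python
import torch

# One row of a similarity matrix: the positive pair (0.9) plus negatives.
sims = torch.tensor([0.9, 0.7, 0.5, 0.2, -0.1])

for tau in [0.01, 0.07, 1.0]:
    probs = (sims / tau).softmax(dim=0)
    # Low tau concentrates mass on the positive; high tau flattens the row.
    print(f"tau={tau:<4}: positive={probs[0].item():.3f}, "
          f"hardest negative={probs[1].item():.3f}")
```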
Visualize the similarity matrix for a batch of image-text pairs. The diagonal contains positive pairs (matching image-text). Adjust the temperature to see how it sharpens or flattens the softmax probabilities. Hover over cells to inspect individual values.
CLIP Architecture (Radford et al., 2021)
CLIP is a dual encoder model. It has two completely separate encoders — one for images and one for text — that map their respective inputs into a shared embedding space. The encoders share no weights. They are connected only through the contrastive loss that trains them jointly.
ViT-L/14
Vision Transformer, Large variant, with 14×14 pixel patches. 24 transformer layers, 1024-d hidden size, 16 attention heads. Input: 224×224 image → 256 patch tokens (16×16 grid) + 1 [CLS] token. The [CLS] token output is projected to the shared space. ~304M parameters.
GPT-2 Style Transformer
12-layer transformer with masked self-attention (causal, left-to-right). 768-d hidden size, 12 attention heads. BPE tokenizer with 49,152 vocab size. Max sequence length: 76 tokens. The [EOS] token representation (last token) is projected to the shared space. ~123M parameters.
Vision encoder: ViT-L/14
The flagship CLIP model uses a Vision Transformer (ViT) rather than a CNN. Here is the exact processing pipeline:
- Patch embedding: the 224×224×3 image is split into 14×14 pixel patches, giving a 16×16 grid of 256 patches. Each patch is flattened to a 588-d vector (14×14×3) and linearly projected to 1024-d.
- Position embeddings: 257 learned position embeddings (256 patches + 1 [CLS]) are added. These encode spatial location, since the transformer has no inherent notion of position.
- Transformer layers: 24 layers of multi-head self-attention (16 heads) and feed-forward networks (MLP with GELU activation, 4096-d hidden layer). Pre-LayerNorm variant. Each patch attends to all others, building global context.
- Projection: the final [CLS] token representation (1024-d) passes through a learned linear projection Wimg ∈ R^(1024×768) to produce a 768-d embedding, which is L2-normalized.
Text encoder
The text encoder follows the GPT-2 architecture with causal (left-to-right) masking:
- Tokenization: BPE with a 49,152-token vocabulary. Text is lowercased and tokenized. [SOS] and [EOS] tokens are prepended and appended.
- Embedding + position: token embeddings (768-d) + learned positional embeddings, max length 76 tokens.
- Transformer layers: 12 layers of causal multi-head attention (12 heads) and feed-forward networks. The causal mask means each token can only attend to preceding tokens, giving the [EOS] token a summary of the entire sequence.
- Projection: the [EOS] token representation (768-d) is projected by Wtxt ∈ R^(768×768) and L2-normalized.
A critical design choice: the text encoder uses causal masking, not bidirectional attention. This means the [EOS] token must compress the entire meaning of the caption into a single vector by reading left-to-right. Bidirectional encoders (like BERT) could potentially produce better text representations, but causal masking is simpler and was sufficient for CLIP's purposes. Later models like SigLIP experimented with both.
Training at scale
The scale of CLIP's training is part of its story:
- Dataset: WebImageText (WIT), 400 million image-text pairs collected from the internet. Not publicly released. Constructed by searching for images whose associated text (alt text, titles, descriptions) contained one of 500,000 queries derived from Wikipedia article titles and WordNet synsets.
- Batch size: 32,768 image-text pairs per batch. This means each update computes a 32,768×32,768 similarity matrix with over 1 billion negative pairs.
- Hardware: the largest ResNet (RN50x64) took 18 days to train on 592 V100 GPUs; the largest Vision Transformer (ViT-L/14) took 12 days on 256 V100 GPUs.
- Optimizer: Adam with decoupled weight decay, cosine learning rate schedule, warmup over the first 2000 steps.
- Mixed precision: fp16 training with loss scaling to prevent underflow.
- Augmentation: only random square crop from resized image. No color jitter, no flipping, no complex augmentation pipeline. The diversity of web data provided enough variation.
The batch size is not arbitrary. Recall that the mutual information lower bound is capped at log(B). With B = 32,768, the model can capture up to log(32768) ≈ 10.4 nats of mutual information between images and text. Smaller batches (e.g., B = 256) cap at log(256) ≈ 5.5 nats, potentially losing fine-grained distinctions. The batch size also sets the number of negatives, and larger batches naturally contain harder negatives. This is one reason CLIP's performance degrades with smaller batch sizes.
Zero-Shot Classification
Zero-shot classification is CLIP's breakthrough capability and the most direct demonstration of what a shared embedding space enables. The idea is beautifully simple:
- Take a set of class names: {"dog", "cat", "car", "airplane", ...}
- Convert each to a text prompt: "a photo of a dog", "a photo of a cat", ...
- Encode all prompts through the text encoder to get text embeddings
- Encode the query image through the vision encoder to get an image embedding
- Compute cosine similarity between the image and each text embedding
- The class with highest similarity is the prediction
No fine-tuning. No task-specific layers. No training on the target dataset. The model classifies images into categories it has never been explicitly trained on, purely by leveraging the shared structure of the embedding space.
CLIP achieved 76.2% top-1 accuracy on ImageNet zero-shot — matching the performance of the original supervised ResNet-50 that was trained on 1.28 million labeled ImageNet images. This was shocking. A model that had never seen ImageNet labels performed as well as one that was specifically trained on them.
Prompt engineering turned out to matter significantly. "a photo of a {class}" consistently outperformed just "{class}" because it provides context. CLIP's training data consists of natural language descriptions, and "a photo of a dog" is closer to web captions than just "dog". OpenAI found that using 80 prompt templates and averaging the text embeddings (prompt ensembling) boosted ImageNet accuracy by about 3.5 percentage points over a single prompt.
Some example prompt templates from the CLIP paper:
- "a photo of a {class}."
- "a bad photo of a {class}."
- "a sculpture of a {class}."
- "a photo of the large {class}."
- "a photo of a {class} in a video game."
- "art of a {class}."
- "a photo of the small {class}."
Averaging embeddings across templates makes the text representation more robust to phrasing variations, capturing the concept rather than any specific description of it.
Select a simulated image category, then see how cosine similarity scores distribute across candidate text prompts. The highest-similarity prompt is the zero-shot prediction.
A simulated 2D projection of CLIP's shared embedding space. Image embeddings (circles) and text embeddings (diamonds) cluster by semantic category. Matching pairs are connected. Drag to explore, toggle categories to focus on specific clusters.
CLIP's Limitations
Understanding CLIP's failure modes is crucial because these limitations are inherited by every VLM that uses a CLIP-style vision encoder. They explain why VLMs sometimes produce confident but wrong answers about visual content.
- Counting: CLIP cannot reliably count objects. "Three apples" and "seven apples" produce similar embeddings because the contrastive objective only needs to distinguish "apples" from "not apples" — the exact count rarely determines whether an image-text pair matches. Web captions rarely specify exact counts.
- Spatial relationships: "A cat sitting on a mat" and "a mat sitting on a cat" have nearly identical CLIP embeddings. The bag-of-concepts nature of single-vector representations discards relational structure. The [CLS] token captures what is present but not where things are relative to each other.
- OCR and text reading: CLIP can read large, prominent text in images but fails on small or stylized text. With 14×14 pixel patches, a small character occupies only a fraction of a single patch or is split across patch boundaries, leaving too little signal to decode.
- Fine-grained categories: CLIP struggles to distinguish between similar subcategories (e.g., different bird species, car models, mushroom varieties). Web text rarely provides the fine-grained labels needed to learn these distinctions — most captions say "bird" not "cerulean warbler".
- Typographic attacks: Placing text on an object can override CLIP's visual understanding. An apple with "iPod" written on it may be classified as an iPod, because the text embedding for "iPod" is pulled toward the visual features of the text rather than the object. This reveals that CLIP does not truly "understand" images — it matches surface patterns.
- Social biases: CLIP inherits biases from its web-scraped training data. Images of people show disparate classification behavior across demographics. The model associates certain occupations, activities, and attributes with specific demographic groups in ways that reflect (and potentially amplify) societal biases.
- Compositional understanding: "A red cube on a blue sphere" vs. "a blue cube on a red sphere" — CLIP treats these as nearly identical because it captures the set of attributes and objects but not their bindings. This is sometimes called the "attribute binding" problem.
When a VLM uses a CLIP vision encoder, these failure modes propagate. If the vision encoder cannot count objects in an image, the language model cannot count them either — no amount of language reasoning can recover information that was lost during visual encoding. This is why modern VLMs are moving toward higher-resolution encoders, multi-crop strategies, and encoders trained with objectives beyond contrastive matching (e.g., DINOv2's self-supervised objective preserves more spatial information).
Beyond CLIP
CLIP opened the floodgates. In the years since its release, numerous models have addressed its limitations while preserving its core idea. Here are the most important successors:
SigLIP (Zhai et al., 2023)
Replaces the softmax-based InfoNCE loss with a sigmoid loss: each pair in the B×B matrix is independently classified as matching or not, using binary cross-entropy. This removes the need for all-to-all comparison across GPUs, enabling training on larger batches. SigLIP achieves better zero-shot performance than CLIP at smaller batch sizes and has become the preferred vision encoder for modern VLMs (e.g., PaLI, Gemini).
ALIGN (Jia et al., 2021)
Tests the hypothesis: what if you simply used more data, even if it is noisy? ALIGN trains on 1.8 billion image-alt-text pairs with minimal filtering (just frequency-based filtering). Despite the noise, it achieves state-of-the-art performance, demonstrating that data scale can compensate for curation quality. The key insight: the contrastive loss is robust to noisy pairs because they are effectively random negatives.
OpenCLIP (Ilharco et al., 2021)
An open-source reproduction of CLIP trained on LAION-2B (2 billion image-text pairs). Matches or exceeds OpenAI CLIP on many benchmarks. Critical for the research community because OpenAI never released CLIP's training data or code. OpenCLIP models are the backbone of most open VLMs (LLaVA, InternVL, etc.).
EVA-CLIP (Fang et al., 2023)
Uses a pre-trained EVA vision encoder (initialized from masked image modeling, not random) as the starting point, then applies CLIP-style contrastive training. The pre-training gives the model better spatial understanding and fine-grained features. EVA-02-CLIP-E achieves 82.0% ImageNet zero-shot — the highest at the time.
MetaCLIP (Xu et al., 2023)
Focuses on data curation rather than architecture. Reverse-engineers the likely curation strategy behind CLIP's WIT dataset, then applies it to CommonCrawl. The key finding: careful data balancing across concepts matters more than raw data size. MetaCLIP with 400M pairs outperforms CLIP trained on 400M pairs, purely through better data selection.
DFN (Fang et al., 2023)
Data Filtering Networks use a pre-trained CLIP model to filter training data for a new CLIP model. The insight: use image-text similarity scores from an existing model to select high-quality pairs. This bootstrapping approach produces better models than training on unfiltered data, creating a virtuous cycle of data quality improvement.
The trend is clear: the contrastive learning framework CLIP established is robust. Improvements come from better loss functions (SigLIP), more data (ALIGN), better data curation (MetaCLIP, DFN), and better initialization (EVA-CLIP). The core idea — learning a shared image-text space through contrastive training — remains the foundation of vision-language AI.
Compare what CLIP and DINOv2 features capture. CLIP excels at semantic/categorical features (matching text descriptions) while DINOv2 excels at spatial/structural features (object parts, boundaries). Toggle between feature types to see the difference.
Code Examples
Let us build the key components from scratch, then use the real CLIP model.
InfoNCE loss from scratch
```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_embeddings: torch.Tensor,
                  text_embeddings: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """
    Compute the symmetric InfoNCE loss (CLIP-style).

    Args:
        image_embeddings: (B, d) L2-normalized image embeddings
        text_embeddings: (B, d) L2-normalized text embeddings
        temperature: scalar temperature parameter

    Returns:
        Scalar loss value
    """
    # Compute B x B similarity matrix
    # Since embeddings are L2-normalized, dot product = cosine similarity
    logits = image_embeddings @ text_embeddings.T / temperature  # (B, B)

    # Labels: the diagonal is the positive pair
    # For each row i, the correct column is i
    B = logits.shape[0]
    labels = torch.arange(B, device=logits.device)  # [0, 1, 2, ..., B-1]

    # Image-to-text: for each image (row), classify which text matches
    loss_i2t = F.cross_entropy(logits, labels)
    # Text-to-image: for each text (column), classify which image matches
    loss_t2i = F.cross_entropy(logits.T, labels)

    # Symmetric loss
    loss = (loss_i2t + loss_t2i) / 2
    return loss

# --- Demonstration ---
B, d = 8, 512  # batch of 8, embedding dim 512

# Random normalized embeddings (simulating encoder outputs)
img_emb = F.normalize(torch.randn(B, d), dim=-1)
txt_emb = F.normalize(torch.randn(B, d), dim=-1)

# Before training: embeddings are random, loss is high
loss_random = info_nce_loss(img_emb, txt_emb, temperature=0.07)
print(f"Loss (random embeddings): {loss_random.item():.4f}")
print(f"Expected for random: {torch.log(torch.tensor(B, dtype=torch.float)).item():.4f}")  # log(B)

# Simulate "trained" embeddings: make positives similar
txt_emb_trained = img_emb + 0.1 * torch.randn(B, d)
txt_emb_trained = F.normalize(txt_emb_trained, dim=-1)
loss_trained = info_nce_loss(img_emb, txt_emb_trained, temperature=0.07)
print(f"Loss (similar embeddings): {loss_trained.item():.4f}")

# Effect of temperature
for tau in [0.01, 0.07, 0.5, 1.0]:
    loss = info_nce_loss(img_emb, txt_emb_trained, temperature=tau)
    print(f"  tau={tau:.2f} -> loss={loss.item():.4f}")
```
CLIP inference with OpenAI's model
```python
import torch
import clip
from PIL import Image

# Load CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Encode an image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)  # (1, 768)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

# Encode text
texts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)  # (3, 768)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Cosine similarity = dot product (both are normalized)
similarity = (image_features @ text_features.T).squeeze(0)  # (3,)
print("Similarities:", similarity.cpu().numpy())

# Softmax over temperature-scaled similarities for probabilities;
# model.logit_scale holds CLIP's learned temperature (exp(logit_scale) ~ 100)
probs = (model.logit_scale.exp() * similarity).softmax(dim=0)
for text, prob in zip(texts, probs):
    print(f"  {text}: {prob.item():.1%}")
```
Zero-shot classification with prompt ensembling
```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# ImageNet-style prompt templates (subset of the 80 used by OpenAI)
PROMPT_TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of the large {}.",
    "a photo of the small {}.",
    "a photo of a {} in a video game.",
    "art of a {}.",
    "a drawing of a {}.",
    "a photo of the {}.",
    "itap of a {}.",  # "I took a picture of a"
    "a good photo of a {}.",
    "a bad photo of a {}.",
    "a photo of many {}.",
    "a sculpture of a {}.",
    "a photo of the hard to see {}.",
    "a rendition of a {}.",
    "a cropped photo of the {}.",
]

def build_classifier(class_names: list[str]) -> torch.Tensor:
    """Build zero-shot classifier weights using prompt ensembling."""
    all_weights = []
    for class_name in class_names:
        # Generate all prompt variants for this class
        prompts = [template.format(class_name) for template in PROMPT_TEMPLATES]
        tokens = clip.tokenize(prompts).to(device)
        with torch.no_grad():
            text_features = model.encode_text(tokens)  # (num_templates, 768)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        # Average across templates, then re-normalize
        class_embedding = text_features.mean(dim=0)  # (768,)
        class_embedding = class_embedding / class_embedding.norm()
        all_weights.append(class_embedding)
    # Stack into classifier weight matrix: (num_classes, 768)
    return torch.stack(all_weights)

# Build classifier for a few classes
classes = ["golden retriever", "tabby cat", "sports car", "airliner", "pizza"]
classifier = build_classifier(classes)
print(f"Classifier shape: {classifier.shape}")  # (5, 768)

# Classify an image
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

# Scaled similarity logits, as in CLIP's own zero-shot evaluation
logits = (100.0 * image_features @ classifier.T).squeeze(0)  # (5,)
probs = logits.softmax(dim=0)
for name, prob in sorted(zip(classes, probs), key=lambda x: -x[1].item()):
    bar = "#" * int(prob.item() * 40)
    print(f"  {name:20s} {prob.item():6.1%} {bar}")
```
Embedding space visualization
```python
import torch
import clip
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Encode a set of text concepts
concepts = {
    "animals": ["dog", "cat", "bird", "fish", "horse", "elephant", "tiger", "rabbit"],
    "vehicles": ["car", "truck", "airplane", "bicycle", "boat", "motorcycle", "train", "bus"],
    "food": ["pizza", "hamburger", "sushi", "salad", "cake", "ice cream", "pasta", "taco"],
    "nature": ["mountain", "ocean", "forest", "desert", "river", "sunset", "rainbow", "volcano"],
}

all_texts = []
all_labels = []
all_categories = []
for category, items in concepts.items():
    for item in items:
        # Use a natural-language prompt for each concept
        all_texts.append(f"a photo of a {item}")
        all_labels.append(item)
        all_categories.append(category)

tokens = clip.tokenize(all_texts).to(device)
with torch.no_grad():
    embeddings = model.encode_text(tokens)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
embeddings_np = embeddings.cpu().numpy()

# t-SNE projection to 2D
tsne = TSNE(n_components=2, perplexity=8, random_state=42)
coords = tsne.fit_transform(embeddings_np)

# Plot
colors = {"animals": "#ef4444", "vehicles": "#3b82f6", "food": "#22c55e", "nature": "#f59e0b"}
fig, ax = plt.subplots(figsize=(10, 8))
for i, (label, cat) in enumerate(zip(all_labels, all_categories)):
    ax.scatter(coords[i, 0], coords[i, 1], c=colors[cat], s=80, alpha=0.8,
               edgecolors='white', linewidth=0.5)
    ax.annotate(label, (coords[i, 0] + 1, coords[i, 1] + 1), fontsize=7, color='gray')

# Legend
for cat, color in colors.items():
    ax.scatter([], [], c=color, s=60, label=cat)
ax.legend(loc='upper right')
ax.set_title("CLIP Text Embedding Space (t-SNE)")
ax.set_xticks([]); ax.set_yticks([])
plt.tight_layout()
plt.savefig("clip_embedding_space.png", dpi=150)
print("Saved embedding space visualization")
```
References
Seminal papers and key works referenced in this article.
- Radford et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML, 2021. arXiv
- Chen et al. "A Simple Framework for Contrastive Learning of Visual Representations." ICML, 2020. arXiv
- Grill et al. "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning." NeurIPS, 2020. arXiv
- Caron et al. "Emerging Properties in Self-Supervised Vision Transformers." ICCV, 2021. arXiv
- van den Oord et al. "Representation Learning with Contrastive Predictive Coding." 2018. arXiv