The Complete Beginner's Path

Understand Contrastive Learning & CLIP

How neural networks learn what "similar" means, and how CLIP taught machines to see through the lens of language.

Prerequisites: Neural network basics + Embedding intuition. That's it.
10 Chapters · 9+ Interactives · 0 Assumed Knowledge

Chapter 0: What Is Representation?

Raw data — pixels, audio samples, characters — is not directly useful for reasoning. A 224×224 image is 150,528 numbers, but those numbers don't tell you "this is a dog" or "this looks like a Van Gogh painting." A representation is a transformation that turns raw data into useful numbers.

A good representation puts similar things close together and different things far apart in an embedding space. A photo of a golden retriever and a labrador should have nearby embeddings. A golden retriever and a car should be far apart.
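
As a minimal sketch of what "close together" and "far apart" mean numerically, the snippet below compares cosine similarities between three hand-made vectors. The vectors and their names are made-up illustrations, not outputs of any real encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (hand-picked, not from a trained model).
golden_retriever = [0.9, 0.8, 0.1]
labrador = [0.8, 0.9, 0.2]
car = [0.1, 0.0, 0.9]

print(cosine_similarity(golden_retriever, labrador))  # close to 1
print(cosine_similarity(golden_retriever, car))       # much lower
```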

The key question: How do you learn a good embedding function without manually labeling millions of examples? Contrastive learning answers this: you learn by comparing pairs.
Embedding Space

Points are data samples. Colors are categories. A good representation clusters similar items together. Click "Shuffle" to see a bad (random) embedding; click "Organize" to see a learned one.

Check: What makes a good representation?

Chapter 1: Contrastive Learning

Contrastive learning is the art of learning embeddings by comparing pairs. For each sample (the anchor), you need a positive (something similar) and negatives (things that are different). The loss pulls anchor and positive together and pushes anchor and negatives apart.

The beauty: you don't need class labels. The "positive" can be a different augmentation of the same image: flip it, crop it, change the color. Any two views of the same image are a positive pair. Everything else in the batch is a negative.

Anchor
Original image
↓ augment
Positive
Augmented version (same content)
↓ contrast with
Negatives
All other images in the batch
Pull and Push

The anchor is pulled toward the positive and pushed away from negatives. Click "Step" to see the forces applied. Watch embeddings organize over steps.

Self-supervised: Contrastive learning is self-supervised — the data provides its own supervision through augmentations. No human labels needed. This is why it scales to billions of images.
Check: In contrastive learning, what is a "positive pair"?

Chapter 2: SimCLR — Simple Framework

SimCLR (Simple Contrastive Learning of Representations) showed how effective contrastive learning can be with a clean, minimal design. Take a batch of N images. Augment each twice to get 2N views. For each image, its two augmented views form the positive pair. The other 2(N-1) views are negatives.
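
The bookkeeping of positives and negatives can be sketched in a few lines. The layout below (views 2k and 2k+1 come from image k) is an assumption made for illustration; real implementations may order the views differently.

```python
def simclr_pairs(n_images):
    """Map each of the 2N view indices to (positive index, negative indices).

    Assumed layout: views 2k and 2k+1 are the two augmentations of image k.
    """
    n_views = 2 * n_images
    pairs = {}
    for i in range(n_views):
        positive = i ^ 1  # flips the last bit: 0<->1, 2<->3, ...
        negatives = [k for k in range(n_views) if k not in (i, positive)]
        pairs[i] = (positive, negatives)
    return pairs

pairs = simclr_pairs(4)           # N = 4 images, so 8 views
positive, negatives = pairs[2]
print(positive)                   # 3
print(len(negatives))             # 6, which is 2 * (4 - 1)
```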

The architecture: a ResNet encoder maps each view to an embedding, then a small projection head (MLP) maps it to the space where the contrastive loss is applied. After training, the projection head is thrown away — the encoder's output is the useful representation.

Image x
Original image from batch
↓ two random augmentations
xi, xj
Two augmented views
↓ encoder f (ResNet)
hi, hj
Representations (keep these)
↓ projection head g (MLP)
zi, zj
Projected embeddings (contrastive loss here)
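Why throw away the projection head? The contrastive loss rewards z for being invariant to the augmentations, so the projection head learns to discard whatever the augmentations change (color, orientation, exact crop). The encoder output h, one step earlier, retains that information, which makes it the richer feature for downstream tasks.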
| Augmentation | What it does | Why it helps |
| --- | --- | --- |
| Random crop + resize | Different spatial views | Forces spatial invariance |
| Color jitter | Adjust brightness, contrast, saturation | Prevents color-based shortcuts |
| Horizontal flip | Mirror the image | Orientation invariance |
| Gaussian blur | Smooth the image | Prevents texture-based shortcuts |
Realization (the data flow): An image [3, 224, 224] enters the encoder (ResNet-50). The encoder outputs a representation h of dimension 2048. The projection head (2-layer MLP with ReLU) maps this to z of dimension 128. Contrastive loss is computed on z, but downstream tasks use h. The projection head is a lossy funnel: it discards information the contrastive objective does not need (such as color or orientation), and some of that information is exactly what downstream tasks rely on.
Check: In SimCLR, for a batch of N images, how many positive pairs does each image have?

Chapter 3: InfoNCE Loss

The workhorse of contrastive learning is the InfoNCE loss (called NT-Xent in SimCLR). For a positive pair (i, j), it is the negative log probability that j is picked as the match for i among all other candidates in the batch. It is essentially a softmax cross-entropy over similarity scores.

$$L_i = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

The temperature τ is crucial. Low τ makes the distribution sharper (hard negatives matter more). High τ makes it smoother (all negatives contribute more equally). The similarity function is cosine similarity: sim(a, b) = a·b / (||a|| ||b||).

Realization (temperature in numbers): Say two text embeddings have cosine similarities of 0.30 and 0.29 with an image. At τ=0.01, the softmax probabilities become ~0.73 vs ~0.27, so the model sharply distinguishes them. At τ=1.0, they become ~0.502 vs ~0.498, nearly identical. In CLIP, the temperature is learned through a log-parameterized scale: the logits are multiplied by exp(log_temperature) = 1/τ, with τ initialized at 0.07 (a scale of about 14.3). The model discovers how discriminative it needs to be.
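
The effect of temperature on two nearly tied candidates can be checked with a few lines of plain Python. This is a minimal sketch; the helper names `softmax` and `match_probabilities` are chosen here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def match_probabilities(similarities, tau):
    """Softmax over cosine similarities divided by temperature tau."""
    return softmax([s / tau for s in similarities])

sims = [0.30, 0.29]  # two captions with almost the same similarity
for tau in (0.01, 0.07, 1.0):
    probs = match_probabilities(sims, tau)
    print(tau, [round(p, 3) for p in probs])
```

Lower τ stretches the 0.01 gap between the two similarities into a large gap between logits, which is what makes the distribution sharp.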
B×B Similarity Matrix

A batch of B images creates a B×B similarity matrix. Diagonal = positive pairs (should be high). Off-diagonal = negatives (should be low). Adjust temperature to see the effect.

Temperature matters: τ=0.07 (CLIP's initial value) is quite sharp. The model focuses intensely on the hardest negatives. τ=1.0 is flat, so all negatives contribute almost equally. Too low = training instability. Too high = weak signal.
Check: What does a lower temperature τ do to the contrastive loss?
🔨 Derivation: InfoNCE is a lower bound on mutual information

The InfoNCE loss was introduced in the CPC paper (van den Oord et al., 2018) with a surprising theoretical justification: minimizing InfoNCE is equivalent to maximizing a lower bound on the mutual information I(X; Y) between two views.

Your task: Show that the optimal InfoNCE loss is bounded below by log(N) − I(X; Y), where N is the number of negatives + 1. Rearranged, log(N) − L is a lower bound on the mutual information, so minimizing InfoNCE maximizes that bound.

InfoNCE asks: given a query x, identify the matching positive y+ among N−1 negatives y. This is an N-way classification problem. The probability of success for a random classifier is 1/N, giving loss = log(N). A perfect classifier achieves loss = 0.
The optimal critic function f*(x,y) is proportional to the density ratio p(y|x)/p(y). When the model is optimal, the softmax probability of the positive is p(y+|x) / [p(y+|x) + (N−1)p(y+)]. This connects to pointwise mutual information.
Take the expectation of −log(softmax probability). The optimal value satisfies L* ≥ log(N) − I(X;Y), so the estimate log(N) − L can never exceed log(N). InfoNCE can therefore certify at most log(N) nats of mutual information, which is why batch size matters!

Full derivation:

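The InfoNCE loss for a positive pair (x, y+) with N−1 negatives y_1, ..., y_{N−1} is:

$$L = -\mathbb{E}\left[\log \frac{\exp(f(x, y^+))}{\exp(f(x, y^+)) + \sum_{k=1}^{N-1} \exp(f(x, y_k))}\right]$$

The optimal critic is f*(x, y) = log(p(y|x)/p(y)) + c(x). Substitute it, and replace the sum over negatives by its expected value (each negative's density ratio p(y_k|x)/p(y_k) has mean 1, so the sum is approximately N−1):

$$L^* \approx -\mathbb{E}\left[\log \frac{p(y^+|x)/p(y^+)}{p(y^+|x)/p(y^+) + (N-1)}\right]$$

$$= -\mathbb{E}\left[\log \frac{1}{1 + (N-1)\, p(y^+)/p(y^+|x)}\right]$$

$$\geq -\log \frac{1}{1 + (N-1)\exp(-I(X;Y))}$$

The inequality is Jensen's: log(1 + (N−1)e^{−t}) is convex in t, and the expectation of t = log(p(y+|x)/p(y+)) is I(X;Y). Since I(X;Y) ≥ 0, we have 1 + (N−1)e^{−I} ≥ N e^{−I}, and rearranging gives:

$$L^* \geq \log(N) - I(X;Y)$$

Therefore:

$$I(X;Y) \geq \log(N) - L_{\mathrm{InfoNCE}}$$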

The key insight: the bound log(N) − L can certify at most log(N) of mutual information. With batch size 32,768, that's log2(32768) = 15 bits (about 10.4 nats). This is one reason CLIP uses such enormous batches: with a small batch, the bound saturates well below the information actually shared between modalities.
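
To see how the ceiling grows with batch size, here is a small sketch that evaluates log(N) in both nats and bits. The function names are illustrative, and the batch sizes other than 32,768 are arbitrary examples.

```python
import math

def infonce_mi_ceiling(n_candidates):
    """Largest value the bound log(N) - L can take, as (nats, bits)."""
    return math.log(n_candidates), math.log2(n_candidates)

def mi_lower_bound(loss_nats, n_candidates):
    """Lower bound on I(X;Y) in nats implied by an InfoNCE loss value."""
    return math.log(n_candidates) - loss_nats

for n in (128, 4096, 32768):
    nats, bits = infonce_mi_ceiling(n)
    print(n, round(nats, 2), round(bits, 2))
```

Because the ceiling is logarithmic, doubling the batch size adds only one bit to what the bound can certify.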

🔨 Derivation: Temperature controls gradient magnitude, not just sharpness

Temperature τ is usually described as "sharpening the distribution." But its effect on learning is more profound: it scales the gradient magnitude. When τ is small, gradients from hard negatives are amplified exponentially.

Your task: Compute ∂L/∂zi (the gradient of InfoNCE w.r.t. the anchor embedding) and show how τ modulates which negatives contribute most.

pj = exp(sim(zi, zj)/τ) / ∑k exp(sim(zi, zk)/τ). This is the probability the model assigns to sample j being the positive.
For cross-entropy with softmax, ∂L/∂logitk = pk − 1[k=positive]. The gradient w.r.t. the logit of each negative is just its softmax probability. Now substitute logit = sim/τ.
As τ → 0, the softmax concentrates ALL mass on the single hardest negative (highest similarity). The gradient becomes a delta function — only the hardest negative gets pushed away. At τ → ∞, all negatives contribute equally (uniform gradient).

Full derivation:

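Let s_k = sim(z_i, z_k)/τ be the scaled similarity. The loss is L = −log softmax(s)_pos.

The gradient w.r.t. the anchor embedding z_i is:

$$\frac{\partial L}{\partial z_i} = \frac{1}{\tau}\left[\sum_{k \neq \mathrm{pos}} p_k \frac{\partial\, \mathrm{sim}(z_i, z_k)}{\partial z_i} - (1 - p_{\mathrm{pos}}) \frac{\partial\, \mathrm{sim}(z_i, z_{\mathrm{pos}})}{\partial z_i}\right]$$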

Each negative's contribution is weighted by pk = softmax(sim/τ)k. As τ shrinks:

• The (1/τ) prefactor amplifies ALL gradients

• The softmax concentrates on the hardest negative (max sim)

• Easy negatives (low sim) contribute exponentially less

Numerically: if two negatives have sim = 0.9 and 0.1, at τ=0.07 the ratio of their gradient weights is exp(0.9/0.07) / exp(0.1/0.07) = exp(0.8/0.07) ≈ exp(11.43) ≈ 92,000.
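
The ratio can be evaluated directly for several temperatures. This is a minimal sketch; `negative_weight_ratio` is a name invented here, and the softmax normalizer cancels out of the ratio, so only the exponentials matter.

```python
import math

def negative_weight_ratio(sim_hard, sim_easy, tau):
    """Ratio of softmax weights exp(sim / tau) for two negatives."""
    return math.exp((sim_hard - sim_easy) / tau)

for tau in (1.0, 0.5, 0.07):
    print(tau, negative_weight_ratio(0.9, 0.1, tau))
```

At τ=1.0 the hard negative counts only a little over twice as much as the easy one; at τ=0.07 the ratio is on the order of 10^5.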

The key insight: Low temperature creates a curriculum in which the model learns mostly from its hardest confusions and largely ignores easy negatives. But if τ is too low, a single outlier negative dominates all gradients, causing instability. CLIP's τ, initialized at 0.07 and then learned, is meant to find that balance.

🔨 Derivation: Why cosine similarity, not raw dot product?

CLIP L2-normalizes all embeddings before computing similarities. This converts dot products into cosine similarities. Why not use raw dot products? The answer involves the interaction between embedding magnitude and the loss landscape.

Your task: Show that without normalization, the model can minimize contrastive loss by simply increasing embedding magnitudes (a degenerate solution), and that normalization prevents this.

L = −log(exp(zi · zj / τ) / ∑ exp(zi · zk / τ)). If we scale zi by a factor α, ALL dot products scale by α. How does this affect the softmax?
softmax(α · s / τ) = softmax(s / (τ/α)). Scaling embeddings by α is equivalent to dividing temperature by α. The model can "learn" a sharp distribution by making embeddings larger, without actually improving the similarity structure.
After L2 normalization, ||z|| = 1 for all embeddings. The dot product zi · zj is constrained to [−1, 1]. The model can only reduce loss by improving the geometric arrangement of embeddings on the unit hypersphere, not by inflating magnitudes.

Full derivation:

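Without normalization, suppose every embedding has norm r, so z_i · z_j = r² cos(θ_ij). The softmax logit is r² cos(θ)/τ.

The gradient w.r.t. r (the magnitude) is ∂L/∂r = −(2r/τ) · [(1−p_pos) cos(θ_pos) − ∑_{k≠pos} p_k cos(θ_k)]. The bracket is positive whenever the positive has a higher cosine than the negatives, so the gradient is negative and gradient descent increases r.

Even if all angles are fixed, increasing r sharpens the softmax, which reduces the loss as long as the positive is already the most similar candidate. The model can then push the loss arbitrarily close to zero by letting r → ∞ without changing any angular relationships.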

With L2 normalization: ẑ = z/||z||, so ẑi · ẑj = cos(θij) ∈ [−1, 1].

Now the only way to reduce loss is to raise cos(θ_pos) toward 1 and push each cos(θ_neg) as low as the geometry allows. The model MUST learn meaningful geometric structure.

The key insight: L2 normalization decouples "how hard the model tries" (magnitude) from "how well it organizes" (angles). Without it, gradient descent takes the easy path of inflating norms. With it, sharpness is controlled only by the explicit temperature, and the remaining path to lower loss is better representations. This is why most modern contrastive methods normalize.
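
The degenerate solution is easy to reproduce numerically. The sketch below uses hand-picked 2-D vectors (purely illustrative) and scales all of them by α: the unnormalized loss falls, while the normalized loss does not move.

```python
import math

def info_nce(anchor, candidates, positive_index, tau=1.0):
    """InfoNCE for one anchor: -log softmax(dot products / tau)[positive]."""
    logits = [sum(a * c for a, c in zip(anchor, cand)) / tau
              for cand in candidates]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[positive_index]

def scale(v, alpha):
    return [alpha * x for x in v]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical embeddings; candidate 0 is the positive.
anchor = [1.0, 0.2]
candidates = [[0.9, 0.3], [0.5, 0.8], [-0.2, 1.0]]

for alpha in (1.0, 2.0, 4.0):
    raw = info_nce(scale(anchor, alpha),
                   [scale(c, alpha) for c in candidates], 0)
    normed = info_nce(normalize(scale(anchor, alpha)),
                      [normalize(scale(c, alpha)) for c in candidates], 0)
    print(alpha, round(raw, 4), round(normed, 4))
```

The angles between the vectors are identical in every row of the output; only the norms change.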

🔗 Pattern Recognition
Contrastive Loss IS Cross-Modal Attention
InfoNCE (this lesson)
softmax(z_i · z_j^T / τ): one image attends to all texts, picks the match.
Self-Attention (Transformer)
softmax(Q · K^T / √d): one query attends to all keys, picks relevant ones. → Transformer lesson

Both are softmax over scaled dot products. The difference: attention makes a soft assignment (weighted average of all values), while contrastive makes a hard assignment (only one correct match). Contrastive learning is like attention where you train the model to put all weight on the single correct key.

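Can you see why temperature in contrastive learning plays the same role as √d in attention? Both set the scale of the logits and therefore how peaked the softmax is. Dividing by √d keeps dot products from growing with dimension and saturating the softmax early in training; τ is chosen small enough to emphasize hard negatives, but not so small that the softmax collapses to a one-hot vector.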

Checkpoint — Before you move on
Explain in your own words: why are negative pairs essential? What would happen if you trained only with positive pairs (pulling matching pairs together without pushing non-matching pairs apart)?
Model Answer

Without negatives, the model collapses: it maps ALL inputs to the same point (or a small region). Why? Because mapping everything to a single vector z* = [1,0,0,...] makes every pair have similarity 1.0, achieving perfect "alignment" with zero loss. This is called representation collapse. Negatives prevent this by penalizing the model when non-matching pairs are close. They force the model to SPREAD different concepts apart, creating an informative embedding space rather than a trivial one. The loss requires both attraction (positives) and repulsion (negatives) to find a meaningful equilibrium.

Chapter 4: CLIP — Connecting Vision and Language

CLIP (Contrastive Language-Image Pre-training) applies contrastive learning across modalities: images and text. Instead of two augmented views of the same image, the positive pair is an image and its caption. CLIP was trained on 400 million image-text pairs from the internet.

Two separate encoders: a vision encoder (ViT or ResNet) maps images to embeddings, and a text encoder (Transformer) maps captions to embeddings. The contrastive loss pulls matching (image, caption) pairs together and pushes non-matching pairs apart.

Image Encoder
ViT / ResNet → image embedding
 
Shared Embedding Space
cos(image, text) = similarity
 
Text Encoder
Transformer → text embedding

Let's trace the exact data flow. The image path: a batch of images [batch, 3, 224, 224] enters ViT-L/14. Each image is split into patches of 14×14 pixels, a 16×16 grid, and each patch is flattened and projected to 1024 dims, giving 256 patch tokens plus one class token: 257 tokens total. After 24 transformer layers, the class token (a single 1024-dim vector) is extracted and passed through a linear projection head that maps it to 768 dimensions. The result is L2-normalized: [batch, 768].

The text path: a caption is tokenized with BPE (byte-pair encoding, 49,152-token vocabulary), padded/truncated to 77 tokens. A 12-layer transformer processes these tokens. The embedding at the [EOS] token position (analogous to the class token) is extracted and projected to 768 dimensions. Also L2-normalized: [batch, 768].

Key detail: The image encoder outputs 1024-dim vectors, while the shared space is 768-dim. The text transformer's width depends on the model size (512 in the smaller CLIP models, 768 alongside ViT-L/14). The projection heads exist specifically to map both encoders into the same space. Without them, you couldn't compute a dot product between image and text whenever the widths differ.
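
As a quick arithmetic check on the token counts, the sketch below computes how many tokens a ViT sees for a given image and patch size. The function name is illustrative; the "/14" and "/16" in model names refer to the patch side length in pixels.

```python
def vit_token_count(image_size, patch_size, class_token=True):
    """Number of tokens a ViT processes for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    n_patches = patches_per_side ** 2
    return n_patches + (1 if class_token else 0)

print(vit_token_count(224, 14))  # ViT-L/14: 16 x 16 patches + class token = 257
print(vit_token_count(224, 16))  # ViT-B/16: 14 x 14 patches + class token = 197
```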
CLIP Embedding Space

Images (squares) and text (circles) in a shared space. Matching pairs are connected. CLIP learns to align them.

The breakthrough: CLIP learns a shared space where "a photo of a dog" and an actual photo of a dog are neighbors. This means you can classify images using text descriptions alone — no task-specific training required.
Check: What are CLIP's positive pairs?

Chapter 5: Training CLIP

CLIP's training is elegant. Take a batch of N (image, text) pairs. Compute all N×N cosine similarities. The diagonal contains matching pairs. Apply cross-entropy loss to make each row and column peak at the diagonal. That's it.

In code, this is shockingly simple. You have image_embeds of shape [N, 768] and text_embeds of shape [N, 768], both L2-normalized. The similarity matrix is one matrix multiply:

logits = image_embeds @ text_embeds.T * exp(log_temperature)

This gives an [N, N] matrix. The diagonal entries are the N positive pairs (image_i matched with text_i). The off-diagonal entries are the N²−N negative pairs. For a concrete example with N=4: you get 4 positive pairs and 12 negatives. The loss is symmetric cross-entropy: cross-entropy on the rows (which text matches each image?) AND on the columns (which image matches each text?), averaged.

$$L = \tfrac{1}{2}\left(\mathrm{CE}_{\mathrm{rows}}(\text{logits}, \text{labels}) + \mathrm{CE}_{\mathrm{cols}}(\text{logits}, \text{labels})\right), \quad \text{labels} = [0, 1, 2, \ldots, N-1]$$

The scale: 400 million image-text pairs scraped from the internet (called WIT — WebImageText). Training on 256 V100 GPUs for ~12 days. Batch size 32,768 — meaning each image competes against 32,767 negative text captions per step. Why so large? More negatives make the contrastive task harder, forcing the model to learn finer-grained distinctions.

The N×N Training Matrix

Each cell is the cosine similarity between image i and text j. The loss tries to make the diagonal bright (high similarity) and everything else dark (low similarity).

| Hyperparameter | Value | Why |
| --- | --- | --- |
| Batch size | 32,768 | More negatives = harder contrastive task |
| Temperature τ | 0.07 at initialization (learned) | Sharpens softmax to focus on hard negatives |
| Image encoder | ViT-L/14 | Large Vision Transformer for best quality |
| Text encoder | 12-layer Transformer | Standard text encoding |
| Embedding dim | 512 or 768 | Shared dimension for image-text space |
Check: In CLIP training, what does each row of the N×N matrix represent?
💻 Build It: Implement CLIP's Symmetric Contrastive Loss
You have a batch of image embeddings and text embeddings, both L2-normalized. Implement the full CLIP loss: compute the similarity matrix, apply learned temperature, compute cross-entropy in both directions (image-to-text and text-to-image), and average.
Signature:

```python
def clip_loss(image_embeds, text_embeds, log_temperature):
    """
    Args:
        image_embeds: (N, D) tensor, L2-normalized
        text_embeds: (N, D) tensor, L2-normalized
        log_temperature: scalar (learnable parameter)
    Returns:
        loss: scalar, symmetric contrastive loss
    """
```
Test case
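N=4, D=8, log_temperature=log(1/0.07)≈2.66. Chance level is log(4)≈1.39, which is the loss when all logits are equal. With random normalized embeddings the loss is usually above that, because the 14.3× scale turns random similarities into confident wrong guesses. After training converges, loss → 0.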
logits = image_embeds @ text_embeds.T * exp(log_temperature). That's it. The temperature is applied as a multiplicative scale, not a divisor (CLIP parameterizes it as exp(log_temp) to keep it positive).
```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, log_temperature):
    # Logit scale = 1/τ (learned, initialized to 1/0.07 ≈ 14.3)
    logit_scale = torch.exp(log_temperature)

    # Similarity matrix: (N, N); entry [i, j] compares image i with text j
    logits = (image_embeds @ text_embeds.T) * logit_scale

    # Labels: the correct match for row i is column i (the diagonal)
    N = logits.shape[0]
    labels = torch.arange(N, device=logits.device)

    # Symmetric cross-entropy
    loss_i2t = F.cross_entropy(logits, labels)       # rows: image -> text
    loss_t2i = F.cross_entropy(logits.T, labels)     # cols: text -> image

    return (loss_i2t + loss_t2i) / 2
```
Bonus: Why does CLIP use exp(log_temperature) instead of just a raw temperature scalar? Because the scale must be positive. Parameterizing in log-space and exponentiating guarantees positivity without needing a clamp, and gives better gradient flow for small values.
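
For checking numbers by hand without a deep learning framework, here is a dependency-free sketch of the same symmetric loss on plain Python lists. The name `clip_loss_lists` and the two toy inputs are illustrative assumptions: one batch is perfectly aligned, the other is fully collapsed (every embedding identical).

```python
import math

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_loss_lists(image_embeds, text_embeds, log_temperature):
    """Symmetric contrastive loss on lists of L2-normalized vectors."""
    logit_scale = math.exp(log_temperature)  # equals 1 / tau
    logits = [[logit_scale * sum(a * b for a, b in zip(img, txt))
               for txt in text_embeds] for img in image_embeds]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))

log_t = math.log(1 / 0.07)
# Perfectly aligned batch: image i and text i share one axis.
aligned = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Collapsed batch: every embedding is the same unit vector.
collapsed = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]

print(clip_loss_lists(aligned, aligned, log_t))      # very close to 0
print(clip_loss_lists(collapsed, collapsed, log_t))  # log(4), chance level
```

The collapsed case shows why identical embeddings are useless under this loss: every logit is equal, so the loss sits exactly at chance.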
💥 Break-It Lab: What Dies When You Remove Components?
A contrastive learning system trains on image-text pairs. The chart shows loss curves and embedding quality (measured by retrieval accuracy). Toggle off components to see specific failure modes.
Remove temperature scaling (τ=1)
Failure mode: Without temperature, the softmax is flat. All negatives contribute equally to the gradient. The model can't focus on hard negatives (the ones it's confusing with the positive). Learning is extremely slow because gradients from thousands of easy negatives dilute the signal from the few informative hard negatives. Convergence takes 5−10x longer.
Remove L2 normalization
Failure mode: Without normalization, embedding magnitudes grow without bound. The model discovers it can reduce loss by simply making vectors longer (sharpening the softmax without improving representations). Training loss drops but retrieval accuracy plateaus, which is a degenerate solution. Embedding magnitudes reach 100+ while cosine similarity stays random.
Remove negatives (positives only)
Failure mode: Complete representation collapse. Without repulsion from negatives, all embeddings converge to the same point. Loss reaches zero instantly (all pairs have similarity 1.0) but the representation is completely useless — every image and every text map to the same vector. Retrieval accuracy = random chance (1/N).
⚔ Adversarial: Your CLIP model's training loss plateaus at 4.2 (approximately log(65) ≈ 4.17, the chance-level loss for batch_size=65). The loss won't decrease further despite training for more epochs. What's happening?
You're training CLIP with batch size 65. After 10 epochs, loss = 4.2. After 100 epochs, loss is still 4.2. The embeddings are L2-normalized. Temperature is fixed at 0.07.
🏗 Design Challenge: You're the Architect: Distributed CLIP Training
You're training CLIP on 400M image-text pairs. Batch size is crucial for contrastive learning (more negatives = better). You have 256 A100 GPUs (80GB each). Design the distributed training strategy.
Hardware
256x A100-80GB, NVLink within nodes (8 GPUs/node), InfiniBand across nodes
Model size
ViT-L/14 (304M params) + Text Transformer (63M params) ≈ 367M params = ~0.73GB in fp16 (~1.5GB in fp32)
Per-sample cost
Image: [3,224,224] → 588KB in fp32. Embeddings: [1,768] → 1.5KB in fp16. Activations for backprop: ~2GB/batch
Contrastive requirement
InfoNCE needs ALL-to-ALL similarity computation across the full batch
Target
Effective batch size 32,768. Training completed in 12 days.
1. How do you distribute the batch? Each GPU can hold ~128 image-text pairs in memory (with activations for backprop). 256 GPUs × 128 = 32,768. But the loss needs a 32K × 32K similarity matrix. How do you compute this without storing it on one GPU?
2. All-gather vs gradient accumulation: Should you compute the loss over the full 32K batch (requiring all-gather of embeddings) or accumulate gradients over micro-batches (losing the full negative set)?
3. Mixed precision: The similarity matrix has 32K×32K = 1 billion entries. At fp32 that's 4GB. At fp16 it's 2GB. Does precision matter for the loss computation?

OpenAI's approach (and the field's consensus):

1. All-gather embeddings, not images. Each GPU processes its local micro-batch (128 pairs) through both encoders to get embeddings of shape [128, 768]. Then an all-gather collects ALL embeddings from ALL GPUs: [32768, 768]. This is only 32768 × 768 × 2 bytes = 48MB per modality, trivial over InfiniBand. Each GPU then computes only the similarities its local batch needs, a [128, 32768] slice per direction, so the full 32K×32K matrix never has to sit on one device.

2. All-gather, not gradient accumulation. Gradient accumulation would mean each micro-batch only sees 128 negatives, not 32K. The entire point of large batches is the full negative set. All-gather is essential. The communication cost (48MB × 2 modalities = 96MB per step) is negligible compared to the forward/backward pass.

3. fp16 embeddings, fp32 loss. Embeddings are computed in fp16 (saves memory). But the similarity matrix and cross-entropy are computed in fp32 to avoid numerical instability in the softmax (exp of large values overflows fp16). The loss is cast back to fp16 for the backward pass.

Key insight: The brilliant trick is that embeddings are tiny (768 dims) compared to images (150K dims). All-gathering embeddings is 200x cheaper than all-gathering images. This is why the two-encoder architecture works so well for distributed training.
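
The memory figures in this design can be recomputed with a short sketch. The function and dictionary keys are illustrative; the byte widths assume fp16 embeddings and fp32 logits, as in the answer above.

```python
def all_gather_budget(n_gpus, per_gpu_batch, embed_dim,
                      embed_bytes=2, logit_bytes=4):
    """Sizes in MiB for gathered embeddings and similarity logits."""
    global_batch = n_gpus * per_gpu_batch
    mib = 2 ** 20
    return {
        "global_batch": global_batch,
        # One modality's embeddings after all-gather (fp16 by default).
        "gathered_embeddings_mib": global_batch * embed_dim * embed_bytes / mib,
        # Similarity rows for the local batch only (fp32 by default).
        "local_logits_mib": per_gpu_batch * global_batch * logit_bytes / mib,
        # The full matrix, if one device had to hold all of it.
        "full_logits_mib": global_batch * global_batch * logit_bytes / mib,
    }

budget = all_gather_budget(n_gpus=256, per_gpu_batch=128, embed_dim=768)
for name, value in budget.items():
    print(name, value)
```

Holding only the local rows cuts the logit memory by a factor equal to the number of GPUs.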

Chapter 6: Zero-Shot Transfer

CLIP's superpower: zero-shot classification. To classify an image, create text prompts for each class: "a photo of a cat", "a photo of a dog", "a photo of a car." Embed all prompts. Embed the image. Pick the text with highest cosine similarity. No training on the target task.

This works because CLIP's shared embedding space aligns concepts across modalities. "A photo of a dog" is near actual dog photos, even if CLIP never saw this specific classification task during training.

Here's exactly how zero-shot ImageNet classification works. Take all 1,000 ImageNet class names. Create a text template for each: "a photo of a {class}". Run all 1,000 through the text encoder to get 1,000 text embeddings of shape [1000, 768]. These are your "classifier weights" — computed once, cached forever. To classify an image: encode it to get [1, 768], dot product with all 1,000 text embeddings, argmax. That's the prediction. No fine-tuning, no training on ImageNet. The "classifier" is literally text embeddings.
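
The procedure above can be sketched with toy vectors standing in for encoder outputs. Everything numeric here is a made-up illustration: real CLIP embeddings come from the trained encoders and have hundreds of dimensions.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_embed, text_embeds, class_names):
    """Return the class whose prompt embedding has the highest cosine similarity."""
    image_embed = normalize(image_embed)
    scores = [sum(a * b for a, b in zip(image_embed, normalize(t)))
              for t in text_embeds]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return class_names[best], scores

class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

# Hypothetical stand-ins for text-encoder outputs, one per prompt.
# In practice these are computed once and cached.
text_embeds = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.1, 0.9]]
# Hypothetical stand-in for the image-encoder output.
image_embed = [0.2, 0.8, 0.1]

label, scores = zero_shot_classify(image_embed, text_embeds, class_names)
print(label)  # dog
```

Adding a new class only requires embedding one more prompt; nothing is retrained.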

How well does it work? CLIP zero-shot matches a fully-supervised ResNet-50 on ImageNet (76.2% top-1 accuracy). It never saw a single ImageNet training image. Prompt engineering helps: ensembling templates like "a photo of a {class}", "a bad photo of a {class}", "a sculpture of a {class}" boosts accuracy by ~3.5% because it captures more of CLIP's knowledge.
Image
Encode with CLIP vision encoder
↓ cosine similarity
Text Prompts
"a photo of a {class}" for each class
↓ argmax
Prediction
Class with highest similarity
Zero-Shot Classification

An image is compared against text prompts for 5 classes. The bar chart shows similarity scores. The highest score wins. Click to generate a new random scenario.

Prompt engineering matters: "a photo of a dog" works better than just "dog" because CLIP was trained on natural captions. Task-specific templates like "a photo of a {class}" or "a centered satellite image of {class}" can boost accuracy by several percentage points.
Check: Why does zero-shot classification work with CLIP?
⚔ Adversarial: Your CLIP model achieves 76% top-1 on ImageNet zero-shot. But when you test it on counting tasks ("an image of 3 dogs"), accuracy drops to near-random. Why?
You prompt CLIP with "a photo of 1 dog", "a photo of 2 dogs", "a photo of 3 dogs" and test on images containing specific counts. The model scores the three prompts almost identically, so its pick barely depends on the actual count. Increasing the image resolution doesn't help.

Chapter 7: SigLIP — Sigmoid Beats Softmax

CLIP uses a softmax-based loss: each positive competes with all negatives in the batch. SigLIP replaces this with independent sigmoid losses per pair. Each (image, text) pair gets a binary "match or not" prediction — no need to normalize across the batch.

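Why does this matter? The softmax needs a normalizer over the entire batch, so every device must see every embedding before it can compute its loss. The sigmoid loss has no such global term: each pair's loss depends only on that pair, so the pairwise terms can be computed in chunks as embeddings are passed between devices. This lets SigLIP scale to much larger batch sizes (up to 1M) at lower memory cost, and it achieves better performance at the same compute.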

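$$L = -\sum_{i,j}\left[\, y_{ij} \log \sigma(z_{ij}) + (1 - y_{ij}) \log\bigl(1 - \sigma(z_{ij})\bigr) \right]$$

Here y_ij = 1 if image i and text j are a matching pair (i = j) and 0 otherwise, z_ij is the logit for that pair, and σ is the sigmoid.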
CLIP (softmax): "Which text in the batch matches this image?" Competition across all pairs. Requires global batch communication.
SigLIP (sigmoid): "Does this specific (image, text) pair match?" Independent per-pair decision. Trivially parallelizable.
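
A minimal sketch of the per-pair sigmoid loss, written to match the formula in this chapter term by term. The logits are taken as given (already scaled); the published SigLIP recipe has further details, such as how the logits are produced, that are not modeled here.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_sigmoid_loss(logits):
    """Sum of binary cross-entropies over all (image i, text j) pairs.

    y_ij = 1 on the diagonal (matching pair), 0 elsewhere.
    """
    total = 0.0
    for i, row in enumerate(logits):
        for j, z in enumerate(row):
            if i == j:
                total -= log_sigmoid(z)    # -log sigma(z)
            else:
                total -= log_sigmoid(-z)   # -log(1 - sigma(z))
    return total

confident = [[10.0, -10.0], [-10.0, 10.0]]
undecided = [[0.0, 0.0], [0.0, 0.0]]
print(pairwise_sigmoid_loss(confident))  # close to 0
print(pairwise_sigmoid_loss(undecided))  # 4 * log(2)
```

Each term depends on a single logit, so any subset of the pairs can be evaluated on its own and the partial sums added later.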
Softmax vs Sigmoid

Left: softmax normalizes across the row (probabilities sum to 1). Right: sigmoid treats each cell independently. Both want the diagonal to be bright.

| Property | CLIP (softmax) | SigLIP (sigmoid) |
| --- | --- | --- |
| Loss type | Cross-entropy over row/col | Binary cross-entropy per pair |
| Batch scaling | Needs all-gather | Embarrassingly parallel |
| Max batch size | ~32K (practical limit) | >1M demonstrated |
| Performance | Strong baseline | Better at same compute |
Check: What advantage does SigLIP's sigmoid loss have over CLIP's softmax?

Chapter 8: DINO & Self-Supervised Vision

DINO (self-DIstillation with NO labels) takes contrastive ideas in a different direction: instead of text-image pairs, it uses self-distillation. A student network and a teacher network (exponential moving average of the student) both see augmented views of the same image. The student learns to match the teacher's output distribution.

The magic result: DINO learns features with remarkable spatial awareness. Its attention maps naturally segment objects without ever seeing segmentation labels. This makes DINO features excellent for robotics, dense prediction, and spatial reasoning.

Student
Sees local crops + global crops
↓ match distributions
Teacher (EMA)
Sees only global crops
↓ centering + sharpening
Cross-Entropy Loss
Student matches teacher's output distribution
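
The three ingredients in the diagram (EMA teacher, centering, sharpening) can be sketched on plain lists. The momentum and temperature defaults below are assumed values for illustration, and parameters are flattened to a list of floats for simplicity.

```python
import math

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher <- momentum * teacher + (1 - momentum) * student, elementwise."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def log_softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher target and the student prediction."""
    # Centering: subtract a running mean of teacher outputs.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    # Sharpening: a low teacher temperature makes the target peaked.
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

center = [0.0, 0.0, 0.0]
teacher_out = [2.0, 0.0, -1.0]
print(dino_loss([2.0, 0.0, -1.0], teacher_out, center))  # small: student agrees
print(dino_loss([-1.0, 0.0, 2.0], teacher_out, center))  # large: student disagrees
print(ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9))
```

No gradient flows through the teacher in this scheme; it only changes through the EMA update.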
DINO Attention Maps

DINO's self-attention naturally segments objects. Each color represents a different attention head focusing on different parts. The model discovers object boundaries without labels.

DINO v1: ViT + self-distillation. Excellent attention maps. Widely used in research.
DINOv2: Scaled up, combined with iBOT. State-of-the-art visual features for dense tasks. Powers many robotics pipelines.
DINO vs CLIP: CLIP learns image-text alignment (good for classification, retrieval). DINO learns spatial features (good for segmentation, depth, robotics). Many VLMs use both.
Check: What makes DINO's learned features special?

Chapter 9: Representations Everywhere

CLIP and contrastive learning have become foundational infrastructure. CLIP embeddings power vision-language models (VLMs) like LLaVA. DINO features power robotics (spatial reasoning) and video understanding. Together, they're how many AI systems see the world.

The dependency chain is remarkable. CLIP's vision encoder is the visual backbone of LLaVA and other VLMs, which freeze it and connect it to a language model. CLIP's text encoder guides image generation in Stable Diffusion, and CLIP embeddings condition DALL-E 2: the text embedding steers the denoising process. Open-vocabulary detection models like OWLv2 and DETIC use CLIP to detect objects from text descriptions instead of fixed class lists. In robotics, CLIPort uses CLIP to ground language commands ("pick up the red cup") in visual affordances. None of these systems would exist in its current form without CLIP's shared vision-language space.

The Representation Ecosystem

How contrastive representations flow through modern AI systems. Each arrow shows a dependency.

| System | Uses | For |
| --- | --- | --- |
| LLaVA and similar VLMs | CLIP vision encoder | Image understanding for VLMs |
| Stable Diffusion | CLIP text encoder | Text-guided image generation |
| Segment Anything (SAM) | Self-supervised (MAE) image encoder; CLIP embeddings for text prompts | Universal segmentation |
| Robotics perception pipelines | DINO/DINOv2 features | Spatial perception and manipulation |
| Image search | CLIP embeddings | Semantic search across billions of images |
| Video understanding | DINO + CLIP | Action recognition, tracking |
The theme: Learn representations once, use them everywhere. Contrastive pre-training is expensive (hundreds of GPU-days), but the resulting encoders are reused across dozens of downstream applications.
🔗 Pattern Recognition
CLIP's Vision Encoder IS the VLM's Eyes
CLIP (this lesson)
ViT encodes image → [N_patches, 1024] → project to 768-dim → contrastive loss aligns with text.
LLaVA-style VLM
Same ViT encodes image → [N_patches, 1024] → linear projection to LLM's dim → concatenate with text tokens → LLM generates. → VLM lesson

The VLM doesn't train its own vision encoder from scratch. It takes CLIP's frozen ViT and plugs it directly into a language model. CLIP's contrastive pre-training already aligned visual concepts with language — the VLM exploits this by feeding CLIP's patch tokens as "visual words" into the LLM. The bridge between modalities was already built by contrastive learning; the VLM just walks across it.

Why do VLMs freeze the CLIP encoder instead of fine-tuning it? Think about what happens to the representation space if you update the vision encoder with a language modeling loss.

"The features are the foundation. Everything else is built on top."
— common wisdom in representation learning

You now understand how machines learn to represent the world: by comparing, contrasting, and aligning. Every VLM, every image search, every AI that "sees" — it starts here.