microCLIP — From Contrastive Learning to Vision-Language Models

Chapter 2: SimCLR — Simple Framework

SimCLR (Simple Contrastive Learning of Representations) showed how effective contrastive learning can be with a clean, minimal design. Take a batch of N images. Augment each twice to get 2N views. For each image, its two augmented views form the positive pair. The other 2(N-1) views are negatives.
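The counting above can be checked with a few lines of Python. This is a minimal sketch: the function names and the interleaved view layout (`[x0_a, x0_b, x1_a, x1_b, ...]`) are assumptions made for illustration, not part of SimCLR itself.

```python
def simclr_pair_counts(n_images: int) -> dict:
    """Count views, positives and negatives per anchor view for a batch of n_images."""
    n_views = 2 * n_images
    return {
        "views": n_views,
        "positives_per_view": 1,
        # Every view except the anchor itself and its partner is a negative.
        "negatives_per_view": n_views - 2,
    }


def positive_index(view: int) -> int:
    """Index of the partner view, assuming views are stored as [x0_a, x0_b, x1_a, x1_b, ...]."""
    return view ^ 1  # flips the last bit: 0<->1, 2<->3, ...


print(simclr_pair_counts(4))
print([positive_index(v) for v in range(8)])
```

With N = 4 images there are 8 views, and each view has 2(N-1) = 6 negatives.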

The architecture: a ResNet encoder maps each view to an embedding, then a small projection head (MLP) maps it to the space where the contrastive loss is applied. After training, the projection head is thrown away — the encoder's output is the useful representation.

Image x

Original image from batch

↓ two random augmentations

x_i, x_j

Two augmented views

↓ encoder f (ResNet)

h_i, h_j

Representations (keep these)

↓ projection head g (MLP)

z_i, z_j

Projected embeddings (contrastive loss here)

Why throw away the projection head? The projection head learns to discard information that's irrelevant to the contrastive task (like color jitter). The encoder retains richer features useful for downstream tasks.

Augmentation	What it does	Why it helps
Random crop + resize	Different spatial views	Forces spatial invariance
Color jitter	Adjust brightness, contrast, saturation	Prevents color-based shortcuts
Horizontal flip	Mirror the image	Orientation invariance
Gaussian blur	Smooth the image	Prevents texture-based shortcuts

Realization (the data flow): An image [3, 224, 224] enters the encoder (ResNet-50). The encoder outputs a representation h of dimension 2048. The projection head (2-layer MLP with ReLU) maps this to z of dimension 128. Contrastive loss is computed on z, but downstream tasks use h. The projection head is a lossy funnel: it is free to discard information the contrastive task does not need (such as color or orientation, which the augmentations teach it to ignore), while h, one layer earlier, still carries that information for downstream tasks.

Check: In SimCLR, for a batch of N images, how many positive pairs does each image have?

(a) N pairs
(b) 1 pair (its two augmented views)
(c) 2N pairs

Chapter 3: InfoNCE Loss

The workhorse of contrastive learning: the InfoNCE loss (also called NT-Xent in SimCLR). For a positive pair (i, j), it's the negative log probability that j is the correct positive among all negatives. It's essentially a softmax over similarity scores.

L_i = −log ( exp(sim(z_i,z_j)/τ) / ∑_k≠i exp(sim(z_i,z_k)/τ) )

The temperature τ is crucial. Low τ makes the distribution sharper (hard negatives matter more). High τ makes it smoother (all negatives contribute equally). The similarity function is cosine similarity: sim(a,b) = a·b / (||a|| ||b||).
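The loss for a single anchor can be sketched in plain Python as follows. The helper names (`cosine`, `info_nce`) and the toy 2-D vectors are made up for illustration; the candidate list is assumed to already exclude the anchor itself, matching the k≠i sum in the formula.

```python
import math


def cosine(a, b):
    """Cosine similarity a·b / (||a|| ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(anchor, candidates, positive_idx, tau=0.1):
    """-log of the softmax probability assigned to the positive among the candidates."""
    logits = [cosine(anchor, c) / tau for c in candidates]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_idx]


anchor = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]  # index 0 is the positive
for tau in (1.0, 0.1):
    print(f"tau={tau}: loss={info_nce(anchor, candidates, 0, tau):.5f}")
```

Because the positive is already the most similar candidate here, lowering τ sharpens the softmax around it and the loss drops.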

Realization (temperature in numbers): Say two text embeddings have cosine similarities of 0.30 and 0.29 with an image. At τ=0.01, the softmax probabilities become ~0.73 vs ~0.27, so the model sharply distinguishes them. At τ=1.0, they become ~0.502 vs ~0.498, nearly identical. In CLIP, the logit scale 1/τ is learned as exp(log_temperature), initialized so that τ = 0.07 (a scale of about 14.3). The model discovers how discriminative it needs to be.
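The two-way comparison above is easy to reproduce. A small sketch, with a hand-written `softmax` helper (an illustrative name) over the two similarities 0.30 and 0.29:

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


sims = [0.30, 0.29]
for tau in (0.01, 0.07, 1.0):
    probs = softmax([s / tau for s in sims])
    print(f"tau={tau}: {probs[0]:.3f} vs {probs[1]:.3f}")
```

The same 0.01 gap in cosine similarity becomes a gap of 1.0 in logit space at τ=0.01, but only 0.01 at τ=1.0.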

B×B Similarity Matrix

A batch of B images creates a B×B similarity matrix. Diagonal = positive pairs (should be high). Off-diagonal = negatives (should be low).

Temperature matters: τ=0.07 (CLIP's initial value) is quite sharp, so the model focuses on the hardest negatives. τ=1.0 is nearly flat on cosine similarities, so all negatives contribute almost equally. Too low = training instability. Too high = weak signal.

Check: What does a lower temperature τ do to the contrastive loss?

(a) Makes the distribution sharper, so the model focuses on hard negatives
(b) Makes training slower
(c) Reduces the batch size

🔨 Derivation: InfoNCE is a lower bound on mutual information

The InfoNCE loss was introduced in the CPC paper (van den Oord et al., 2018) with a surprising theoretical justification: minimizing InfoNCE is equivalent to maximizing a lower bound on the mutual information I(X; Y) between two views.

Your task: Show that the optimal InfoNCE loss satisfies L* ≥ log(N) − I(X; Y), where N is the number of negatives + 1. Therefore, minimizing InfoNCE maximizes a lower bound on mutual information.

InfoNCE asks: given a query x, identify the matching positive y⁺ among N−1 negatives y⁻. This is an N-way classification problem. The probability of success for a random classifier is 1/N, giving loss = log(N). A perfect classifier achieves loss = 0.

The optimal critic function f*(x,y) is proportional to the density ratio p(y|x)/p(y). When the model is optimal, the softmax probability of the positive is p(y⁺|x) / [p(y⁺|x) + (N−1)p(y⁺)]. This connects to pointwise mutual information.

Take the expectation of −log(softmax probability). The optimal value is at least log(N) − I(X;Y), and the loss itself cannot go below 0, so the bound log(N) − L never exceeds log(N). This means InfoNCE can certify at most log(N) nats (log₂(N) bits) of mutual information, which is why batch size matters!

Full derivation:

The InfoNCE loss for a positive pair (x, y⁺) with N−1 negatives y₁⁻...y_N-1⁻ is:

L = −E[log(exp(f(x,y⁺)) / (exp(f(x,y⁺)) + ∑_k exp(f(x,y_k⁻))))]

The optimal critic is f*(x,y) = log(p(y|x)/p(y)) + c(x). Substituting:

L* = −E[log(p(y⁺|x)/p(y⁺) / (p(y⁺|x)/p(y⁺) + (N−1)))]

= −E[log(1 / (1 + (N−1) · p(y⁺)/p(y⁺|x)))]

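≥ log(1 + (N−1) · exp(−I(X;Y)))   (Jensen's inequality: log(1 + (N−1)·e^(−u)) is convex in u = log(p(y⁺|x)/p(y⁺)), and E[u] = I(X;Y))

≥ log(N · exp(−I(X;Y))) = log(N) − I(X;Y)   (because exp(−I(X;Y)) ≤ 1)

So L* ≥ log(N) − I(X;Y).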

Therefore: I(X;Y) ≥ log(N) − L_InfoNCE

The key insight: InfoNCE can certify at most log(N) of mutual information. With batch size 32,768, that's log₂(32768) = 15 bits (about 10.4 nats). This is one motivation for CLIP's enormous batches: with a small batch, the bound saturates at log(N) no matter how much information the two modalities actually share.
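The bound I(X;Y) ≥ log(N) − L is simple enough to compute directly. The sketch below uses hypothetical helper names and only restates the inequality; it does not estimate mutual information from data.

```python
import math


def mi_lower_bound_nats(info_nce_loss: float, n_candidates: int) -> float:
    """Lower bound on I(X;Y) in nats implied by an InfoNCE loss measured with natural logs."""
    return math.log(n_candidates) - info_nce_loss


def max_certifiable_bits(n_candidates: int) -> float:
    """The bound is largest when the loss is 0, which gives log2(N) bits."""
    return math.log2(n_candidates)


for n in (256, 4096, 32768):
    print(f"N={n}: at most {max_certifiable_bits(n):.1f} bits "
          f"({math.log(n):.2f} nats)")

# A loss equal to log(N) (random guessing) certifies nothing.
print(mi_lower_bound_nats(math.log(1024), 1024))
```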

🔨 Derivation: Temperature controls gradient magnitude, not just sharpness

Temperature τ is usually described as "sharpening the distribution." But its effect on learning is more profound: it scales the gradient magnitude. When τ is small, gradients from hard negatives are amplified exponentially.

Your task: Compute ∂L/∂z_i (the gradient of InfoNCE w.r.t. the anchor embedding) and show how τ modulates which negatives contribute most.

p_j = exp(sim(z_i, z_j)/τ) / ∑_k exp(sim(z_i, z_k)/τ). This is the probability the model assigns to sample j being the positive.

For cross-entropy with softmax, ∂L/∂logit_k = p_k − 1_[k=positive]. The gradient w.r.t. the logit of each negative is just its softmax probability. Now substitute logit = sim/τ.

As τ → 0, the softmax concentrates ALL mass on the single hardest negative (highest similarity). The gradient becomes a delta function — only the hardest negative gets pushed away. At τ → ∞, all negatives contribute equally (uniform gradient).

Full derivation:

Let s_k = sim(z_i, z_k)/τ be the scaled similarity. The loss is L = −log(softmax(s)_pos).

The gradient w.r.t. the anchor embedding z_i is:

∂L/∂z_i = (1/τ) · [∑_k≠pos p_k · ∂sim(z_i,z_k)/∂z_i − (1−p_pos) · ∂sim(z_i,z_pos)/∂z_i]

Each negative's contribution is weighted by p_k = softmax(sim/τ)_k. As τ shrinks:

• The (1/τ) prefactor amplifies ALL gradients

• The softmax concentrates on the hardest negative (max sim)

• Easy negatives (low sim) contribute exponentially less

Numerically: if two negatives have sim = 0.9 and 0.1, at τ=0.07 the ratio of their gradient weights is exp(0.9/0.07) / exp(0.1/0.07) = exp(0.8/0.07) ≈ exp(11.43) ≈ 92,000x.

The key insight: Low temperature creates a curriculum in which the model learns mostly from its hardest confusions and nearly ignores easy negatives. But if τ is too low, a single outlier negative dominates all gradients, causing instability. CLIP initializes τ at 0.07 and lets training adjust it; the CLIP paper also clips the logit scale 1/τ at 100 to keep training stable.
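The softmax-gradient identity used in this derivation can be checked numerically. The sketch below differentiates with respect to the similarities rather than the embeddings (a simplification chosen here), so the analytic gradient is (p_k − 1[k = pos]) / τ. The three similarity values are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_from_sims(sims, pos, tau):
    """InfoNCE written as a function of the similarity scores."""
    return -math.log(softmax([s / tau for s in sims])[pos])


def grad_wrt_sims(sims, pos, tau):
    """Analytic gradient: (p_k - 1[k == pos]) / tau for each similarity."""
    probs = softmax([s / tau for s in sims])
    return [(p - (1.0 if k == pos else 0.0)) / tau for k, p in enumerate(probs)]


sims = [0.8, 0.9, 0.1]  # index 0 is the positive; 0.9 is a hard negative, 0.1 an easy one
tau = 0.07
grads = grad_wrt_sims(sims, pos=0, tau=tau)
print("gradient per similarity:", grads)
print("hard/easy negative weight ratio:", grads[1] / grads[2])
```

The ratio printed on the last line is exactly exp((0.9 − 0.1)/τ), which is the hard-versus-easy imbalance described above.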

🔨 Derivation: Why cosine similarity, not raw dot product?

CLIP L2-normalizes all embeddings before computing similarities. This converts dot products into cosine similarities. Why not use raw dot products? The answer involves the interaction between embedding magnitude and the loss landscape.

Your task: Show that without normalization, the model can minimize contrastive loss by simply increasing embedding magnitudes (a degenerate solution), and that normalization prevents this.

L = −log(exp(z_i · z_j / τ) / ∑ exp(z_i · z_k / τ)). If we scale z_i by a factor α, ALL dot products scale by α. How does this affect the softmax?

softmax(α · s / τ) = softmax(s / (τ/α)). Scaling embeddings by α is equivalent to dividing temperature by α. The model can "learn" a sharp distribution by making embeddings larger, without actually improving the similarity structure.

After L2 normalization, ||z|| = 1 for all embeddings. The dot product z_i · z_j is constrained to [−1, 1]. The model can only reduce loss by improving the geometric arrangement of embeddings on the unit hypersphere, not by inflating magnitudes.

Full derivation:

Without normalization, suppose every embedding has norm r, so z_i · z_j = r² cos(θ_ij). The softmax logit is r² cos(θ)/τ.

The gradient w.r.t. r (the magnitude) is ∂L/∂r = (2r/τ) · [∑_k≠pos p_k cos(θ_k) − (1−p_pos) cos(θ_pos)] = (2r/τ) · ∑_k≠pos p_k (cos(θ_k) − cos(θ_pos))

Whenever the positive already has the largest cosine, every term in the sum is negative, so ∂L/∂r < 0: even with all angles fixed, increasing r sharpens the softmax and reduces the loss. The model can push the loss arbitrarily close to zero by r → ∞ without changing any angular relationships.

With L2 normalization: ẑ = z/||z||, so ẑ_i · ẑ_j = cos(θ_ij) ∈ [−1, 1].

Now, at a fixed temperature, the only way to reduce loss is to raise cos(θ_pos) toward 1 and lower cos(θ_neg) for the negatives. The model MUST learn meaningful geometric structure.

The key insight: L2 normalization decouples "how hard the model tries" (magnitude) from "how well it organizes" (angles). Without it, gradient descent can take the easy path of inflating norms. With it, sharpness is controlled only by τ, and the path to lower loss is better representations. This is why most modern contrastive methods normalize their embeddings.
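The degenerate "inflate the norms" solution can be seen on a toy example. In this sketch the vectors, the scale factors and the helper names are all made up for illustration; the angles between embeddings never change, only their lengths.

```python
import math


def contrastive_loss(anchor, candidates, pos, tau=1.0):
    """InfoNCE on raw dot products (no normalization inside the loss)."""
    logits = [sum(a * c for a, c in zip(anchor, cand)) / tau for cand in candidates]
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[pos]


def scale(v, alpha):
    return [alpha * x for x in v]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


anchor = [1.0, 0.0]
candidates = [[0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]  # index 0 is the positive
for alpha in (1.0, 2.0, 4.0):
    raw = contrastive_loss(scale(anchor, alpha), [scale(c, alpha) for c in candidates], pos=0)
    normed = contrastive_loss(
        l2_normalize(scale(anchor, alpha)),
        [l2_normalize(scale(c, alpha)) for c in candidates],
        pos=0,
    )
    print(f"alpha={alpha}: raw={raw:.4f} normalized={normed:.4f}")
```

The raw loss falls as alpha grows even though nothing about the geometry improved, while the normalized loss stays constant.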

🔗 Pattern Recognition

Contrastive Loss IS Cross-Modal Attention

InfoNCE (this lesson)

softmax(z_i · z_j^T / τ) — one image attends to all texts, picks the match.

Self-Attention (Transformer)

softmax(Q · K^T / √d) — one query attends to all keys, picks relevant ones. → Transformer lesson

Both are softmax over scaled dot products. The difference: attention makes a soft assignment (weighted average of all values), while contrastive makes a hard assignment (only one correct match). Contrastive learning is like attention where you train the model to put all weight on the single correct key.

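Can you see why temperature in contrastive learning plays the same role as √d in attention? Both rescale the logits before the softmax and so set how peaked it is. Dividing by √d keeps attention logits from growing with dimension and saturating the softmax early in training; dividing cosine similarities, which are confined to [−1, 1], by a small τ stretches them enough for the softmax to separate positives from negatives.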

Checkpoint — Before you move on

Explain in your own words: why are negative pairs essential? What would happen if you trained only with positive pairs (pulling matching pairs together without pushing non-matching pairs apart)?


Model Answer

Without negatives, the model collapses: it maps ALL inputs to the same point (or a small region). Why? Because mapping everything to a single vector z* = [1,0,0,...] makes every pair have similarity 1.0, achieving perfect "alignment" with zero loss. This is called representation collapse. Negatives prevent this by penalizing the model when non-matching pairs are close. They force the model to SPREAD different concepts apart, creating an informative embedding space rather than a trivial one. The loss requires both attraction (positives) and repulsion (negatives) to find a meaningful equilibrium.
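Collapse is easy to demonstrate numerically. The sketch below compares a positives-only objective with InfoNCE on fully collapsed embeddings; the particular alignment loss (mean of 1 − cosine) and the function names are assumptions chosen for illustration.

```python
import math


def alignment_loss(pairs):
    """Positives-only objective: mean (1 - cosine) over matching pairs of unit vectors."""
    return sum(1.0 - sum(a * b for a, b in zip(u, v)) for u, v in pairs) / len(pairs)


def info_nce_batch(images, texts, tau=0.07):
    """Mean InfoNCE where texts[i] is the positive for images[i]."""
    total = 0.0
    for i, img in enumerate(images):
        logits = [sum(a * b for a, b in zip(img, t)) / tau for t in texts]
        m = max(logits)
        total += m + math.log(sum(math.exp(l - m) for l in logits)) - logits[i]
    return total / len(images)


n = 8
collapsed = [[1.0, 0.0, 0.0] for _ in range(n)]  # every input maps to the same unit vector
print("positives-only loss:", alignment_loss(list(zip(collapsed, collapsed))))
print("InfoNCE loss:", info_nce_batch(collapsed, collapsed), "vs log(n):", math.log(n))
```

The positives-only loss is perfectly satisfied by the collapsed solution, whereas InfoNCE scores it no better than random guessing.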

Chapter 4: CLIP — Connecting Vision and Language

CLIP (Contrastive Language-Image Pre-training) applies contrastive learning across modalities: images and text. Instead of two augmented views of the same image, the positive pair is an image and its caption. The model is trained on 400 million image-text pairs from the internet.

Two separate encoders: a vision encoder (ViT or ResNet) maps images to embeddings, and a text encoder (Transformer) maps captions to embeddings. The contrastive loss pulls matching (image, caption) pairs together and pushes non-matching pairs apart.

Image Encoder

ViT / ResNet → image embedding

Shared Embedding Space

cos(image, text) = similarity

Text Encoder

Transformer → text embedding

Let's trace the exact data flow. The image path: a batch of images [batch, 3, 224, 224] enters ViT-L/14. Each image is split into patches of 14×14 pixels (a 16×16 grid), each flattened and projected to 1024 dims, giving 256 patch tokens plus one class token, 257 tokens total. After 24 transformer layers, the class token (a single 1024-dim vector) is extracted and passed through a linear projection head that maps it to 768 dimensions. The result is L2-normalized: [batch, 768].

The text path: a caption is tokenized with BPE (byte-pair encoding, 49,152-token vocabulary), padded/truncated to 77 tokens. A 12-layer transformer processes these tokens. The embedding at the [EOS] token position (analogous to the class token) is extracted and projected to 768 dimensions. Also L2-normalized: [batch, 768].

Key detail: In this ViT-L/14 configuration the image encoder outputs 1024-dim vectors while the text encoder outputs 768-dim vectors (in the smaller ViT-B models the text width is 512). The projection heads exist specifically to map both into the same shared space, 768-dim here. Without them, you can't compute a dot product between image and text because the dimensions don't match.
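The token count on the image path follows from the image size and the patch size alone. In ViT naming, the number after the slash is the patch size in pixels. A small sketch with a hypothetical helper name:

```python
def vit_token_count(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Number of tokens a ViT processes for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size  # patches per side
    return grid * grid + (1 if class_token else 0)


print("ViT-B/16 at 224px:", vit_token_count(224, 16))  # 14x14 grid of 16-pixel patches
print("ViT-L/14 at 224px:", vit_token_count(224, 14))  # 16x16 grid of 14-pixel patches
```

A 14×14 grid (196 patches) belongs to a /16 model at 224 pixels, while a /14 model at the same resolution produces a 16×16 grid (256 patches).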

CLIP Embedding Space

Images (squares) and text (circles) in a shared space. Matching pairs are connected. CLIP learns to align them.

The breakthrough: CLIP learns a shared space where "a photo of a dog" and an actual photo of a dog are neighbors. This means you can classify images using text descriptions alone — no task-specific training required.

Check: What are CLIP's positive pairs?

(a) An image and its matching caption
(b) Two augmented views of the same image
(c) Two similar captions

Chapter 5: Training CLIP

CLIP's training is elegant. Take a batch of N (image, text) pairs. Compute all N×N cosine similarities. The diagonal contains matching pairs. Apply cross-entropy loss to make each row and column peak at the diagonal. That's it.

In code, this is shockingly simple. You have image_embeds of shape [N, 768] and text_embeds of shape [N, 768], both L2-normalized. The similarity matrix is one matrix multiply:

logits = image_embeds @ text_embeds^T × exp(log_temperature)

This gives an [N, N] matrix. The diagonal entries are the N positive pairs (image_i matched with text_i). The off-diagonal entries are the N²−N negative pairs. For a concrete example with N=4: you get 4 positive pairs and 12 negatives. The loss is symmetric cross-entropy — cross-entropy on the rows (which text matches each image?) AND on the columns (which image matches each text?), averaged.

L = ½(CE_rows(logits, labels) + CE_cols(logits, labels)) where labels = [0, 1, 2, ..., N−1]
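To see the symmetric loss on concrete numbers, here is a dependency-free sketch for N=4. The similarity values are invented for illustration, and `scale` stands in for exp(log_temperature).

```python
import math


def cross_entropy_rows(logits, labels):
    """Mean of -log softmax(row)[label] over the rows of a matrix."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)
        total += m + math.log(sum(math.exp(x - m) for x in row)) - row[label]
    return total / len(logits)


def symmetric_clip_loss(sims, scale):
    """sims[i][j] = cosine(image_i, text_j); the diagonal holds the positive pairs."""
    n = len(sims)
    logits = [[scale * s for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]  # columns become rows
    labels = list(range(n))
    return 0.5 * (cross_entropy_rows(logits, labels) + cross_entropy_rows(transposed, labels))


sims = [
    [0.9, 0.1, 0.0, 0.2],
    [0.2, 0.8, 0.1, 0.0],
    [0.0, 0.3, 0.7, 0.1],
    [0.1, 0.0, 0.2, 0.9],
]
print("diagonal-dominant batch:", symmetric_clip_loss(sims, scale=1 / 0.07))
print("uninformative batch:    ", symmetric_clip_loss([[0.5] * 4 for _ in range(4)], scale=1 / 0.07))
```

When every similarity is identical the loss equals log(4); a clearly dominant diagonal drives it toward zero.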

The scale: 400 million image-text pairs scraped from the internet (called WIT — WebImageText). Training on 256 V100 GPUs for ~12 days. Batch size 32,768 — meaning each image competes against 32,767 negative text captions per step. Why so large? More negatives make the contrastive task harder, forcing the model to learn finer-grained distinctions.

The N×N Training Matrix

Each cell is the cosine similarity between image i and text j. The loss tries to make the diagonal bright (high similarity) and everything else dark (low similarity).


Hyperparameter	Value	Why
Batch size	32,768	More negatives = harder contrastive task
Temperature τ	0.07 (learned)	Sharpens softmax to focus on hard negatives
Image encoder	ViT-L/14	Large Vision Transformer for best quality
Text encoder	12-layer Transformer	Standard text encoding
Embedding dim	512 or 768	Shared dimension for image-text space

Why so large a batch? With batch size 32,768, each positive pair competes against 32,767 negatives. This makes the task extremely hard, forcing the model to learn fine-grained distinctions.

Check: In CLIP training, what does each row of the N×N matrix represent?

(a) One image's similarity to all N text captions in the batch
(b) The weights of the image encoder
(c) The learning rate schedule

💻 Build It: Implement CLIP's Symmetric Contrastive Loss

You have a batch of image embeddings and text embeddings, both L2-normalized. Implement the full CLIP loss: compute the similarity matrix, apply learned temperature, compute cross-entropy in both directions (image-to-text and text-to-image), and average.

Signature:

def clip_loss(image_embeds, text_embeds, log_temperature):
    """
    Args:
        image_embeds: (N, D) tensor, L2-normalized
        text_embeds: (N, D) tensor, L2-normalized
        log_temperature: scalar (learnable parameter)
    Returns:
        loss: scalar, symmetric contrastive loss
    """

Test case

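N=4, D=8. With log_temperature=log(1/0.07)≈2.66, identical (collapsed) embeddings give exactly log(4)≈1.39, the loss of a uniform guess. Random normalized embeddings score worse than that on average, because the 14.3× scale amplifies chance similarities into confident wrong predictions. After training converges, loss → 0.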

logits = image_embeds @ text_embeds.T * exp(log_temperature). That's it. The temperature is applied as a multiplicative scale, not a divisor (CLIP parameterizes it as exp(log_temp) to keep it positive).

python
import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, log_temperature):
    # Logit scale = 1/τ (learned; initialized to 1/0.07 ≈ 14.3)
    logit_scale = torch.exp(log_temperature)

    # Similarity matrix: (N, N)
    logits = image_embeds @ text_embeds.T * logit_scale

    # Labels: diagonal is the correct match
    N = logits.shape[0]
    labels = torch.arange(N, device=logits.device)

    # Symmetric cross-entropy
    loss_i2t = F.cross_entropy(logits, labels)       # rows
    loss_t2i = F.cross_entropy(logits.T, labels)     # cols

    return (loss_i2t + loss_t2i) / 2

Bonus: Why does CLIP use exp(log_temperature) instead of just a raw temperature scalar? Because the scale must be positive. Parameterizing in log-space and exponentiating guarantees positivity without needing a clamp, and gives better gradient flow for small values.

💥 Break-It Lab: What Dies When You Remove Components?

A contrastive learning system trains on image-text pairs. Each ablation below removes one component and describes the failure mode seen in the loss curve and in embedding quality (measured by retrieval accuracy).

Remove temperature scaling (τ=1)

Failure mode: Without temperature, the softmax is flat. All negatives contribute equally to the gradient. The model can't focus on hard negatives (the ones it's confusing with the positive). Learning is extremely slow because gradients from thousands of easy negatives dilute the signal from the few informative hard negatives. Convergence takes 5−10x longer.

Remove L2 normalization

Failure mode: Without normalization, embedding magnitudes grow unboundedly. The model discovers it can reduce loss by simply making vectors longer (sharpening the softmax without improving representations). Training loss drops but retrieval accuracy plateaus, a degenerate solution. Embedding magnitudes reach 100+ while cosine similarity stays random.

Remove negatives (positives only)

Failure mode: Complete representation collapse. Without repulsion from negatives, all embeddings converge to the same point. Loss reaches zero instantly (all pairs have similarity 1.0) but the representation is completely useless — every image and every text map to the same vector. Retrieval accuracy = random chance (1/N).

⚔ Adversarial: Your CLIP model's training loss plateaus at 4.2 (approximately log(65) ≈ 4.17, the log of the batch size). The loss won't decrease further despite training for more epochs. What's happening?

You're training CLIP with batch size 65. After 10 epochs, loss = 4.2. After 100 epochs, loss is still 4.2. The embeddings are L2-normalized. Temperature is fixed at 0.07.

(a) The learning rate is too low
(b) The batch size is too large
(c) The embeddings have collapsed: all outputs are identical

🏗 Design Challenge: You're the Architect: Distributed CLIP Training

You're training CLIP on 400M image-text pairs. Batch size is crucial for contrastive learning (more negatives = better). You have 256 A100 GPUs (80GB each). Design the distributed training strategy.

Hardware

256x A100-80GB, NVLink within nodes (8 GPUs/node), InfiniBand across nodes

Model size

ViT-L/14 (304M params) + Text Transformer (63M params) = 367M params, ~0.73GB in fp16 (~1.5GB in fp32)

Per-sample cost

Image: [3,224,224] → 588KB (fp32). Embeddings: [1,768] → 1.5KB (fp16). Activations for backprop: ~2GB/batch

Contrastive requirement

InfoNCE needs ALL-to-ALL similarity computation across the full batch

Target

Effective batch size 32,768. Training completed in 12 days.

1. How do you distribute the batch? Each GPU can hold ~128 image-text pairs in memory (with activations for backprop). 256 GPUs × 128 = 32,768. But the loss needs a 32K × 32K similarity matrix. How do you compute this without storing it on one GPU?

2. All-gather vs gradient accumulation: Should you compute the loss over the full 32K batch (requiring all-gather of embeddings) or accumulate gradients over micro-batches (losing the full negative set)?

3. Mixed precision: The similarity matrix has 32K×32K = 1 billion entries. At fp32 that's 4GB. At fp16 it's 2GB. Does precision matter for the loss computation?

OpenAI's approach (and the field's consensus):

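1. All-gather embeddings, not images. Each GPU processes its local micro-batch (128 pairs) through both encoders to get embeddings of shape [128, 768]. Then an all-gather collects ALL embeddings from ALL GPUs: [32768, 768]. This is only 32768 × 768 × 2 bytes = 48MB per modality, which is trivial over InfiniBand. Each GPU then computes only the slice of the similarity matrix it needs for its own samples: its 128 local images against all 32,768 texts (a [128, 32768] block, 16MB in fp32) and its 128 local texts against all images. The CLIP paper describes sharding the similarity computation this way, so no GPU ever materializes the full 32K×32K matrix.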

2. All-gather, not gradient accumulation. Gradient accumulation would mean each micro-batch only sees 128 negatives, not 32K. The entire point of large batches is the full negative set. All-gather is essential. The communication cost (48MB × 2 modalities = 96MB per step) is negligible compared to the forward/backward pass.

3. fp16 embeddings, fp32 loss. Embeddings are computed in fp16 (saves memory). But the similarity matrix and cross-entropy are computed in fp32 to avoid numerical instability in the softmax (exp of large values overflows fp16). The loss is cast back to fp16 for the backward pass.
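The difference between all-gather and gradient accumulation can be illustrated with a toy, single-process simulation. Nothing here performs real collective communication: the "workers" are just slices of a Python list, and the function names, vectors and scale are assumptions made for the sketch. Only the image-to-text direction is shown.

```python
import math


def row_losses(local_images, candidate_texts, first_label, scale):
    """Image-to-text cross-entropy for one worker's images against the given texts."""
    losses = []
    for offset, img in enumerate(local_images):
        logits = [scale * sum(a * b for a, b in zip(img, t)) for t in candidate_texts]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[first_label + offset])
    return losses


def sharded_i2t_loss(images, texts, n_workers, scale):
    """Each simulated worker owns a contiguous slice of images; texts are 'all-gathered'.

    Assumes len(images) is divisible by n_workers.
    """
    per_worker = len(images) // n_workers
    all_losses = []
    for w in range(n_workers):
        start = w * per_worker
        all_losses += row_losses(images[start:start + per_worker], texts, start, scale)
    return sum(all_losses) / len(all_losses)


images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
texts = [[0.8, 0.6], [-0.6, 0.8], [-0.8, -0.6], [0.6, -0.8]]  # unit vectors near their image
full = sharded_i2t_loss(images, texts, n_workers=1, scale=10.0)
sharded = sharded_i2t_loss(images, texts, n_workers=2, scale=10.0)
# Gradient-accumulation analogue: worker 0 only sees its own 2 texts as candidates.
local_only = sum(row_losses(images[:2], texts[:2], 0, 10.0)) / 2
print("full batch:", full, "sharded rows:", sharded, "local negatives only:", local_only)
```

Sharding the rows reproduces the full-batch loss exactly, while restricting each worker to its local texts yields an easier, lower loss because most negatives are missing.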

Key insight: The brilliant trick is that embeddings are tiny (768 dims) compared to images (150K dims). All-gathering embeddings is 200x cheaper than all-gathering images. This is why the two-encoder architecture works so well for distributed training.
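The sizes quoted in this design challenge can be recomputed in a few lines. This is back-of-the-envelope arithmetic under the stated assumptions (fp32 = 4 bytes, fp16 = 2 bytes, binary units); the 128-row local block is included as an extra illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def tensor_bytes(shape, bytes_per_element):
    """Size in bytes of a dense tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element


batch, dim = 32_768, 768
image_fp32 = tensor_bytes((3, 224, 224), 4)
gathered_fp16 = tensor_bytes((batch, dim), 2)
full_sim_fp32 = tensor_bytes((batch, batch), 4)
local_sim_fp32 = tensor_bytes((128, batch), 4)

print(f"one image, fp32:          {image_fp32 / 1024:.0f} KiB")
print(f"gathered embeddings fp16: {gathered_fp16 / MIB:.0f} MiB per modality")
print(f"full similarity fp32:     {full_sim_fp32 / GIB:.0f} GiB")
print(f"local 128-row block fp32: {local_sim_fp32 / MIB:.0f} MiB")
print(f"image values per embedding value: {3 * 224 * 224 / dim:.0f}x")
```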

Chapter 6: Zero-Shot Transfer

CLIP's superpower: zero-shot classification. To classify an image, create text prompts for each class: "a photo of a cat", "a photo of a dog", "a photo of a car." Embed all prompts. Embed the image. Pick the text with highest cosine similarity. No training on the target task.

This works because CLIP's shared embedding space aligns concepts across modalities. "A photo of a dog" is near actual dog photos, even if CLIP never saw this specific classification task during training.

Here's exactly how zero-shot ImageNet classification works. Take all 1,000 ImageNet class names. Create a text template for each: "a photo of a {class}". Run all 1,000 through the text encoder to get 1,000 text embeddings of shape [1000, 768]. These are your "classifier weights" — computed once, cached forever. To classify an image: encode it to get [1, 768], dot product with all 1,000 text embeddings, argmax. That's the prediction. No fine-tuning, no training on ImageNet. The "classifier" is literally text embeddings.
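The classification step itself is just a nearest-neighbor lookup over cached text embeddings. The sketch below uses made-up 3-dim vectors in place of real encoder outputs, and `zero_shot_classify` is a hypothetical helper name.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_embed, class_text_embeds, class_names):
    """Return the class whose normalized text embedding has the highest cosine with the image."""
    img = l2_normalize(image_embed)
    scores = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in class_text_embeds]
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best], scores


# Toy vectors standing in for encoder outputs (invented for illustration).
class_names = ["cat", "dog", "car"]
text_embeds = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.1, 0.9]]  # the cached "classifier weights"
image_embed = [0.2, 0.8, 0.1]

label, scores = zero_shot_classify(image_embed, text_embeds, class_names)
print(label, [round(s, 3) for s in scores])
```

Adding a new class only requires embedding one more prompt; no image-side retraining is involved.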

How well does it work? CLIP zero-shot matches a fully-supervised ResNet-50 on ImageNet (76.2% top-1 accuracy). It never saw a single ImageNet training image. Prompt engineering helps: ensembling templates like "a photo of a {class}", "a bad photo of a {class}", "a sculpture of a {class}" boosts accuracy by ~3.5% because it captures more of CLIP's knowledge.
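Prompt ensembling can be sketched as averaging in embedding space: normalize the embedding of each template, average them per class, and renormalize to get a single class vector. The vectors below are toy values and the function name is hypothetical.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_class_embedding(template_embeds):
    """Average the normalized per-template embeddings for one class, then renormalize."""
    normed = [l2_normalize(e) for e in template_embeds]
    dim = len(normed[0])
    mean = [sum(e[d] for e in normed) / len(normed) for d in range(dim)]
    return l2_normalize(mean)


# Stand-ins for the embeddings of "a photo of a dog", "a bad photo of a dog", "a sculpture of a dog".
dog_templates = [[0.1, 0.9, 0.1], [0.2, 0.8, 0.0], [0.0, 0.7, 0.3]]
dog_embedding = ensemble_class_embedding(dog_templates)
print([round(x, 3) for x in dog_embedding])
```

The resulting vector replaces the single-prompt embedding as the class's row in the zero-shot classifier, so inference cost is unchanged.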

Image

Encode with CLIP vision encoder

↓ cosine similarity

Text Prompts

"a photo of a {class}" for each class

↓ argmax

Prediction

Class with highest similarity

Zero-Shot Classification

An image is compared against text prompts for 5 classes. The bar chart shows similarity scores. The highest score wins.

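Prompt engineering matters: "a photo of a dog" works better than just "dog" because CLIP was trained on natural captions. Templates like "a photo of a {class}" or "a centered satellite image of {class}" can boost accuracy by several percentage points; the CLIP paper reports almost 5 points on ImageNet from prompt engineering and ensembling combined.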

Check: Why does zero-shot classification work with CLIP?

(a) CLIP was pre-trained on every possible classification task
(b) The text encoder memorizes all class labels
(c) CLIP's shared space aligns image and text concepts, so text descriptions match visual content

⚔ Adversarial: Your CLIP model achieves 76% top-1 on ImageNet zero-shot. But when you test it on counting tasks ("an image of 3 dogs"), accuracy drops to near-random. Why?

You prompt CLIP with "a photo of 1 dog", "a photo of 2 dogs", "a photo of 3 dogs" and test on images containing specific counts. The model's choice among the prompts is almost unrelated to the actual count. Increasing the image resolution doesn't help.

(a) CLIP's training data had no images with multiple objects
(b) Contrastive learning matches whole images to whole captions, so it learns a "bag of concepts," not compositional structure
(c) The text encoder can't represent numbers

Chapter 7: SigLIP — Sigmoid Beats Softmax

CLIP uses a softmax-based loss: each positive competes with all negatives in the batch. SigLIP replaces this with independent sigmoid losses per pair. Each (image, text) pair gets a binary "match or not" prediction — no need to normalize across the batch.

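Why does this matter? Softmax needs a normalizer computed over the whole batch, so every similarity must be available before any loss term can be evaluated. Sigmoid has no such global term: each pair's loss depends only on its own logit, so the batch can be processed in independent chunks. This lets SigLIP scale to much larger batch sizes (up to 1M) with simple data parallelism, and the SigLIP authors report better performance than softmax, especially at smaller batch sizes.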

L = −∑_i,j [ y_ij log σ(z_ij) + (1−y_ij) log(1−σ(z_ij)) ]

CLIP (softmax): "Which text in the batch matches this image?" Competition across all pairs. Requires global batch communication.

SigLIP (sigmoid): "Does this specific (image, text) pair match?" Independent per-pair decision. Trivially parallelizable.
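The per-pair loss above can be written out directly. This sketch implements only the binary cross-entropy formula shown in this chapter, with y_ij = 1 on the diagonal; the learnable temperature and bias that SigLIP applies when forming the logits are left out, and the function names are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(logits):
    """Sum of independent binary cross-entropies; the diagonal entries are the matches."""
    total = 0.0
    for i, row in enumerate(logits):
        for j, z in enumerate(row):
            # For non-matching pairs, log(1 - sigmoid(z)) equals log_sigmoid(-z).
            total -= log_sigmoid(z) if i == j else log_sigmoid(-z)
    return total


good = [[10.0, -10.0], [-10.0, 10.0]]   # matches scored high, non-matches low
bad = [[-10.0, 10.0], [10.0, -10.0]]    # the opposite
print(pairwise_sigmoid_loss(good), pairwise_sigmoid_loss(bad))
```

Each cell contributes its own term with no row or column normalizer, which is what makes chunked evaluation straightforward.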

Softmax vs Sigmoid

Left: softmax normalizes across the row (probabilities sum to 1). Right: sigmoid treats each cell independently. Both want the diagonal to be bright.

Property	CLIP (softmax)	SigLIP (sigmoid)
Loss type	Cross-entropy over row/col	Binary cross-entropy per pair
Batch scaling	Needs all-gather	Embarrassingly parallel
Max batch size	~32K (practical limit)	>1M demonstrated
Performance	Strong baseline	Better at same compute

Check: What advantage does SigLIP's sigmoid loss have over CLIP's softmax?

(a) It scales to much larger batch sizes because pairs are independent
(b) It uses less data
(c) It doesn't need a text encoder

Chapter 8: DINO & Self-Supervised Vision

DINO (self-DIstillation with NO labels) takes contrastive ideas in a different direction: instead of text-image pairs, it uses self-distillation. A student network and a teacher network (exponential moving average of the student) both see augmented views of the same image. The student learns to match the teacher's output distribution.

The magic result: DINO learns features with remarkable spatial awareness. Its attention maps naturally segment objects without ever seeing segmentation labels. This makes DINO features excellent for robotics, dense prediction, and spatial reasoning.

Student

Sees local crops + global crop

↓ match distributions

Teacher (EMA)

Sees only global crops

↓ centering + sharpening

Cross-Entropy Loss

Student matches teacher's output distribution
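The three ingredients in this diagram (EMA teacher, centering plus sharpening, cross-entropy) can be sketched in plain Python. The temperatures and momentum below are assumed example values in the range commonly reported for DINO, parameters are reduced to flat lists of floats, and the function names are hypothetical.

```python
import math


def log_softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights track the student as an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher_params, student_params)]


def dino_loss(student_logits, teacher_logits, center, student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from the centered, sharpened teacher distribution to the student."""
    centered = [t - c for t, c in zip(teacher_logits, center)]           # centering
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]  # sharpening
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


teacher_out, center = [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]
print("student agrees:   ", dino_loss([2.0, 0.0, 0.0], teacher_out, center))
print("student disagrees:", dino_loss([0.0, 2.0, 0.0], teacher_out, center))
print("teacher after one EMA step:", ema_update([1.0, 1.0], [0.0, 2.0]))
```

No gradient flows through the teacher in this scheme; it only changes through the EMA update.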

DINO Attention Maps

DINO's self-attention naturally segments objects. Each color represents a different attention head focusing on different parts. The model discovers object boundaries without labels.

DINO v1: ViT + self-distillation. Excellent attention maps. Used a lot in research.

DINOv2: Scaled up, combined with iBOT. State-of-the-art visual features for dense tasks. Powers many robotics pipelines.

DINO vs CLIP: CLIP learns image-text alignment (good for classification, retrieval). DINO learns spatial features (good for segmentation, depth, robotics). Many VLMs use both.

Check: What makes DINO's learned features special?

(a) They have excellent spatial awareness and naturally segment objects
(b) They can generate images
(c) They are the smallest possible embeddings

Chapter 9: Representations Everywhere

CLIP and contrastive learning have become foundational infrastructure. CLIP embeddings power vision-language models (VLMs) like LLaVA and GPT-4V. DINO features power robotics (spatial reasoning) and video understanding. Together, they're how AI systems see the world.

The dependency chain is remarkable. CLIP's vision encoder is the visual backbone of LLaVA and other VLMs, which connect it to a language model. CLIP's text encoder guides image generation in Stable Diffusion, and CLIP embeddings condition DALL-E 2: the text embedding steers the denoising process. Open-vocabulary detection models like OWLv2 and Detic use CLIP to detect objects from text descriptions instead of fixed class lists. In robotics, CLIPort uses CLIP to ground language commands ("pick up the red cup") in what the robot sees. Each of these systems builds directly on CLIP's shared vision-language space.

The Representation Ecosystem

How contrastive representations flow through modern AI systems. Each arrow shows a dependency.

System	Uses	For
LLaVA, GPT-4V	CLIP vision encoder	Image understanding for VLMs
Stable Diffusion	CLIP text encoder	Text-guided image generation
Segment Anything (SAM)	MAE-pretrained ViT; CLIP embeddings for text prompts	Universal segmentation
Robotics (RT-2, etc.)	DINO/DINOv2 features	Spatial perception and manipulation
Image search	CLIP embeddings	Semantic search across billions of images
Video understanding	DINO + CLIP	Action recognition, tracking

The theme: Learn representations once, use them everywhere. Contrastive pre-training is expensive (CLIP's ViT-L/14 run of 256 GPUs for 12 days is roughly 3,000 GPU-days), but the resulting encoders are reused across dozens of downstream applications.

🔗 Pattern Recognition

CLIP's Vision Encoder IS the VLM's Eyes

CLIP (this lesson)

ViT encodes image → [N_patches, 1024] → project to 768-dim → contrastive loss aligns with text.

LLaVA / GPT-4V (VLM)

Same ViT encodes image → [N_patches, 1024] → linear projection to LLM's dim → concatenate with text tokens → LLM generates. → VLM lesson

The VLM doesn't train its own vision encoder from scratch. It takes CLIP's frozen ViT and plugs it directly into a language model. CLIP's contrastive pre-training already aligned visual concepts with language — the VLM exploits this by feeding CLIP's patch tokens as "visual words" into the LLM. The bridge between modalities was already built by contrastive learning; the VLM just walks across it.

Why do VLMs freeze the CLIP encoder instead of fine-tuning it? Think about what happens to the representation space if you update the vision encoder with a language modeling loss.

"The features are the foundation. Everything else is built on top."

— common wisdom in representation learning

You now understand how machines learn to represent the world: by comparing, contrasting, and aligning. Every VLM, every image search, every AI that "sees" — it starts here.

Understand Contrastive Learning & CLIP

Chapter 0: What Is Representation?

Chapter 1: Contrastive Learning

Chapter 2: SimCLR — Simple Framework

Chapter 3: InfoNCE Loss

Chapter 4: CLIP — Connecting Vision and Language

Chapter 5: Training CLIP

Chapter 6: Zero-Shot Transfer

Chapter 7: SigLIP — Sigmoid Beats Softmax

Chapter 8: DINO & Self-Supervised Vision

Chapter 9: Representations Everywhere
