How neural networks learn what "similar" means, and how CLIP taught machines to see through the lens of language.
Raw data — pixels, audio samples, characters — is not directly useful for reasoning. A 224×224 image is 150,528 numbers, but those numbers don't tell you "this is a dog" or "this looks like a Van Gogh painting." A representation is a transformation that turns raw data into useful numbers.
A good representation puts similar things close together and different things far apart in an embedding space. A photo of a golden retriever and a labrador should have nearby embeddings. A golden retriever and a car should be far apart.
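"Close together" is usually measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors as stand-ins for real embeddings; the numbers are assumptions chosen only to illustrate the idea, not outputs of any model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, chosen by hand for illustration only.
golden_retriever = np.array([0.9, 0.8, 0.1])
labrador = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.0, 1.0])

sim_dogs = cosine_similarity(golden_retriever, labrador)
sim_dog_car = cosine_similarity(golden_retriever, car)
print(f"retriever vs labrador: {sim_dogs:.2f}")
print(f"retriever vs car:      {sim_dog_car:.2f}")
```

In a good embedding space the first score is close to 1 and the second is much lower.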
Points are data samples and colors are categories. A good (learned) representation clusters similar items together; a bad (random) embedding scatters each category across the space.
Contrastive learning is the art of learning embeddings by comparing pairs. For each sample (the anchor), you need a positive (something similar) and negatives (things that are different). The loss pulls anchor and positive together and pushes anchor and negatives apart.
The beauty: you don't need class labels. The "positive" can be a different augmentation of the same image: flip it, crop it, change the color. Any two views of the same image are a positive pair. Everything else in the batch is a negative.
The anchor is pulled toward the positive and pushed away from the negatives. Repeating this step over many batches gradually organizes the embeddings.
SimCLR (Simple Contrastive Learning of Representations) showed how effective contrastive learning can be with a clean, minimal design. Take a batch of N images. Augment each twice to get 2N views. For each image, its two augmented views form the positive pair. The other 2(N-1) views are negatives.
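The bookkeeping for positives and negatives can be sketched in a few lines. The layout below, where view k and view k + N come from the same image, is one common convention and an assumption here; the helper name `simclr_pairs` is hypothetical.

```python
def simclr_pairs(n_images):
    """Index bookkeeping for a SimCLR-style batch of 2N views.

    Assumes views are ordered so that view k and view k + n_images
    are the two augmentations of the same image.
    """
    n_views = 2 * n_images
    positive_of = [(k + n_images) % n_views for k in range(n_views)]
    negatives_of = [
        [j for j in range(n_views) if j != k and j != positive_of[k]]
        for k in range(n_views)
    ]
    return positive_of, negatives_of

positive_of, negatives_of = simclr_pairs(n_images=4)
print("positive of each view:", positive_of)
print("negatives per view:", len(negatives_of[0]))
```

Each view has exactly one positive and 2(N-1) negatives, matching the count in the text.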
The architecture: a ResNet encoder maps each view to an embedding, then a small projection head (MLP) maps it to the space where the contrastive loss is applied. After training, the projection head is thrown away — the encoder's output is the useful representation.
| Augmentation | What it does | Why it helps |
|---|---|---|
| Random crop + resize | Different spatial views | Forces spatial invariance |
| Color jitter | Adjust brightness, contrast, saturation | Prevents color-based shortcuts |
| Horizontal flip | Mirror the image | Orientation invariance |
| Gaussian blur | Smooth the image | Prevents texture-based shortcuts |
An image of shape [3, 224, 224] enters the encoder (ResNet-50). The encoder outputs a representation h of dimension 2048. The projection head (2-layer MLP with ReLU) maps this to z of dimension 128. Contrastive loss is computed on z, but downstream tasks use h. The projection head is a lossy funnel: the contrastive objective rewards z for discarding whatever the augmentations change (color, orientation), and that information can still matter downstream, so it is better read from h.

The workhorse of contrastive learning is the InfoNCE loss (called NT-Xent in SimCLR). For a positive pair (i, j), it is the negative log probability that j is picked as the match for i among all candidates (the positive plus every negative). It is essentially a softmax cross-entropy over similarity scores.
The temperature τ is crucial. Low τ makes the distribution sharper (hard negatives matter more). High τ makes it smoother (negatives contribute more equally). The similarity function is cosine similarity: sim(a,b) = a·b / (||a|| ||b||).
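A minimal NumPy sketch of this loss may help. It assumes the 2N views are stacked so that rows k and k + N form a positive pair; the function name `nt_xent_loss` and the default τ are illustrative choices, not taken from any particular codebase.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent over 2N views; rows k and k+N of z are assumed to be a positive pair."""
    n_views = z.shape[0]
    n = n_views // 2
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit vectors -> cosine similarity
    logits = z @ z.T / tau                                # (2N, 2N) scaled similarities
    np.fill_diagonal(logits, -np.inf)                     # a view is never its own candidate
    positives = (np.arange(n_views) + n) % n_views        # index of each row's positive
    # Numerically stable log-softmax over each row.
    row_max = logits.max(axis=1, keepdims=True)
    log_prob = logits - row_max - np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(n_views), positives].mean())

# Toy batch: 4 "images" whose two views are identical and orthogonal to all other images.
views = np.concatenate([np.eye(4), np.eye(4)])
loss = nt_xent_loss(views, tau=0.5)
print(f"loss = {loss:.4f}")
```

In this toy batch every positive has similarity 1 and every negative has similarity 0, so the loss depends only on τ and the number of negatives; lowering τ drives it toward zero.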
The temperature can also be learned: tau = exp(log_temperature), initialized at ~0.07, which keeps tau positive. The model discovers how discriminative it needs to be.

A batch of B images creates a B×B similarity matrix. Diagonal = positive pairs (should be high). Off-diagonal = negatives (should be low). Lowering the temperature sharpens the contrast between the two after the softmax.
The InfoNCE loss was introduced in the CPC paper (van den Oord et al., 2018) with a surprising theoretical justification: minimizing InfoNCE is equivalent to maximizing a lower bound on the mutual information I(X; Y) between two views.
Your task: Show that the optimal InfoNCE loss is bounded below by log(N) − I(X; Y), where N is the number of negatives + 1. Therefore, minimizing InfoNCE maximizes a lower bound on mutual information.
Full derivation:
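The InfoNCE loss for a positive pair $(x, y^+)$ with $N-1$ negatives $y_1^-, \dots, y_{N-1}^-$ is:

$$L = -\mathbb{E}\left[\log \frac{\exp f(x, y^+)}{\exp f(x, y^+) + \sum_{k=1}^{N-1} \exp f(x, y_k^-)}\right]$$

The optimal critic is $f^*(x, y) = \log \frac{p(y \mid x)}{p(y)} + c(x)$. Substituting (the $c(x)$ term cancels, and each negative's density ratio is replaced by its expectation under $p(y)$, which is 1):

$$L^* \approx -\mathbb{E}\left[\log \frac{p(y^+ \mid x)/p(y^+)}{p(y^+ \mid x)/p(y^+) + (N-1)}\right]$$

$$= -\mathbb{E}\left[\log \frac{1}{1 + (N-1)\, p(y^+)/p(y^+ \mid x)}\right]$$

$$\geq -\mathbb{E}\left[\log \frac{1}{1 + (N-1)\, e^{-I(X;Y)}}\right]$$

The inequality is Jensen's: $\log\left(1 + (N-1)e^{-u}\right)$ is convex in $u = \log \frac{p(y^+ \mid x)}{p(y^+)}$, and $\mathbb{E}[u] = I(X;Y)$. Since $I(X;Y) \geq 0$ implies $1 \geq e^{-I(X;Y)}$, the right-hand side is at least $\log\left(N e^{-I(X;Y)}\right)$, so:

$$L^* \geq \log N - I(X;Y)$$

Therefore: $I(X;Y) \geq \log N - L_{\text{InfoNCE}}$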
The key insight: the InfoNCE bound can certify at most log(N) of mutual information, because the loss cannot drop below zero. With batch size 32,768, that is log₂(32768) = 15 bits (about 10.4 nats). This is one reason CLIP uses such enormous batches: with a small batch, the bound saturates before it can reflect how much information the two modalities share.
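The ceiling is easy to tabulate. The helper below (a hypothetical name, `infonce_mi_ceiling`) just evaluates log(N) in nats and in bits for a few candidate counts:

```python
import math

def infonce_mi_ceiling(n_candidates):
    """Largest MI value the bound log(N) - L can certify (reached when L = 0)."""
    nats = math.log(n_candidates)
    return nats, nats / math.log(2)   # (nats, bits)

for n in (256, 4096, 32768):
    nats, bits = infonce_mi_ceiling(n)
    print(f"N = {n:>6}: ceiling = {nats:.2f} nats = {bits:.1f} bits")
```

Note that the ceiling grows only logarithmically: going from 256 to 32,768 candidates raises it from 8 to 15 bits.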
Temperature τ is usually described as "sharpening the distribution." But its effect on learning is more profound: it scales the gradient magnitude. When τ is small, gradients from hard negatives are amplified exponentially.
Your task: Compute ∂L/∂zi (the gradient of InfoNCE w.r.t. the anchor embedding) and show how τ modulates which negatives contribute most.
Full derivation:
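Let $s_k = \mathrm{sim}(z_i, z_k)/\tau$ be the scaled similarity and $p_k = \mathrm{softmax}(s)_k$. The loss is $L = -\log p_{\text{pos}}$.
The gradient w.r.t. the anchor embedding $z_i$ is:
$$\frac{\partial L}{\partial z_i} = \frac{1}{\tau}\left[\sum_{k \neq \text{pos}} p_k \frac{\partial\, \mathrm{sim}(z_i, z_k)}{\partial z_i} - (1 - p_{\text{pos}}) \frac{\partial\, \mathrm{sim}(z_i, z_{\text{pos}})}{\partial z_i}\right]$$
Each negative's contribution is weighted by $p_k = \mathrm{softmax}(\mathrm{sim}/\tau)_k$. As τ shrinks: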
• The (1/τ) prefactor amplifies ALL gradients
• The softmax concentrates on the hardest negative (max sim)
• Easy negatives (low sim) contribute exponentially less
Numerically: if two negatives have sim = 0.9 and 0.1, at τ=0.07 their gradient weights differ by a factor of exp(0.9/0.07) / exp(0.1/0.07) = exp(0.8/0.07) ≈ exp(11.43) ≈ 92,000.
The key insight: low temperature acts like a built-in hard-negative curriculum. The model learns mostly from its hardest confusions and largely ignores easy negatives. But if τ is too low, a single outlier negative dominates the gradient, causing instability. CLIP avoids hand-tuning by initializing τ at 0.07 and learning it, with the logit scale 1/τ clipped at 100 to keep training stable.
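To see how quickly the weighting concentrates, the ratio of softmax weights for a hard and an easy negative can be computed directly. This is a small sketch; `negative_weight_ratio` is a hypothetical helper and the similarity values are the ones from the example above.

```python
import math

def negative_weight_ratio(sim_hard, sim_easy, tau):
    """Ratio of softmax weights exp(sim / tau) for a hard versus an easy negative."""
    return math.exp((sim_hard - sim_easy) / tau)

for tau in (1.0, 0.5, 0.1, 0.07):
    ratio = negative_weight_ratio(0.9, 0.1, tau)
    print(f"tau = {tau:<4}: hard negative weighted {ratio:,.1f}x more than the easy one")
```

At τ = 1 the two negatives matter almost equally; by τ = 0.07 the easy negative is effectively invisible to the gradient.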
CLIP L2-normalizes all embeddings before computing similarities. This converts dot products into cosine similarities. Why not use raw dot products? The answer involves the interaction between embedding magnitude and the loss landscape.
Your task: Show that without normalization, the model can minimize contrastive loss by simply increasing embedding magnitudes (a degenerate solution), and that normalization prevents this.
Full derivation:
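Without normalization, let $\|z_i\| = r$ for every embedding. The dot product is $z_i \cdot z_j = r^2 \cos\theta_{ij}$, so the softmax logit is $r^2 \cos\theta_{ij}/\tau$.
The gradient w.r.t. the magnitude $r$ is
$$\frac{\partial L}{\partial r} = -\frac{2r}{\tau}\left[(1 - p_{\text{pos}})\cos\theta_{\text{pos}} - \sum_{k \neq \text{pos}} p_k \cos\theta_k\right] = -\frac{2r}{\tau}\sum_{k \neq \text{pos}} p_k\left(\cos\theta_{\text{pos}} - \cos\theta_k\right),$$
where the second form uses $\sum_{k \neq \text{pos}} p_k = 1 - p_{\text{pos}}$. This is negative whenever the positive is angularly closer to the anchor than the negatives are.
So even if all angles are fixed, increasing r sharpens the softmax and reduces the loss. Once the positive is the closest candidate, the model can achieve arbitrarily low loss by r → ∞ without changing any angular relationships.
With L2 normalization: $\hat{z} = z/\|z\|$, so $\hat{z}_i \cdot \hat{z}_j = \cos\theta_{ij} \in [-1, 1]$.
Now, for a given τ, the only way to reduce loss is to raise cos(θpos) toward 1 and lower cos(θneg). The model MUST learn meaningful geometric structure.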
The key insight: L2 normalization decouples "how hard the model tries" (magnitude) from "how well it organizes" (angles). Without it, gradient descent takes the easy path of inflating norms. With it, the only path to lower loss is better representations. This is why most modern contrastive methods normalize their embeddings.
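The degenerate solution is easy to reproduce numerically. The sketch below uses hand-picked 2-dimensional embeddings (assumptions for illustration) and scales all of them by a common factor while leaving the angles untouched; `info_nce` is a hypothetical single-anchor helper.

```python
import numpy as np

def info_nce(anchor, candidates, pos_index, tau=1.0, normalize=False):
    """InfoNCE for one anchor against a set of candidates, one of which is the positive."""
    if normalize:
        anchor = anchor / np.linalg.norm(anchor)
        candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = candidates @ anchor / tau
    logits = logits - logits.max()                       # numerical stability
    return float(-(logits[pos_index] - np.log(np.exp(logits).sum())))

# The positive is angularly closest to the anchor, but only slightly.
anchor = np.array([1.0, 0.0])
candidates = np.array([[0.8, 0.6],    # positive (cos = 0.8)
                       [0.6, 0.8],    # negative (cos = 0.6)
                       [0.0, 1.0]])   # negative (cos = 0.0)

for scale in (1, 3, 10):
    raw = info_nce(scale * anchor, scale * candidates, pos_index=0)
    unit = info_nce(scale * anchor, scale * candidates, pos_index=0, normalize=True)
    print(f"scale = {scale:>2}: raw dot-product loss = {raw:.4f}, normalized loss = {unit:.4f}")
```

The raw dot-product loss falls toward zero as the scale grows even though no angle changes, while the normalized loss stays fixed.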
Attention and the contrastive loss are both softmax over scaled dot products. The difference: attention makes a soft assignment (weighted average of all values), while contrastive makes a hard assignment (only one correct match). Contrastive learning is like attention where you train the model to put all weight on the single correct key.
Can you see why temperature in contrastive learning plays the same role as √d in attention? Both prevent the softmax from collapsing to a one-hot vector too early in training.
Without negatives, the model collapses: it maps ALL inputs to the same point (or a small region). Why? Because mapping everything to a single vector z* = [1,0,0,...] makes every pair have similarity 1.0, achieving perfect "alignment" with zero loss. This is called representation collapse. Negatives prevent this by penalizing the model when non-matching pairs are close. They force the model to SPREAD different concepts apart, creating an informative embedding space rather than a trivial one. The loss requires both attraction (positives) and repulsion (negatives) to find a meaningful equilibrium.
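Collapse can be checked numerically. In the sketch below, every input is mapped to the same vector; an alignment-only objective (a hypothetical `alignment_only_loss` that ignores negatives) scores this as perfect, while InfoNCE scores it as pure chance, log(N).

```python
import numpy as np

def alignment_only_loss(a, b):
    """Mean (1 - cosine similarity) over positive pairs; ignores negatives entirely."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

def info_nce_batch(a, b, tau=0.1):
    """InfoNCE where row i of a should match row i of b among all rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

n = 8
collapsed = np.tile(np.array([1.0, 0.0, 0.0]), (n, 1))    # every input mapped to z* = [1, 0, 0]

print("alignment-only loss:", alignment_only_loss(collapsed, collapsed))
print("InfoNCE loss:       ", info_nce_batch(collapsed, collapsed), "vs log(N) =", np.log(n))
```

With negatives in the loss, the collapsed solution is no better than guessing, so gradient descent has a reason to move away from it.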
CLIP (Contrastive Language-Image Pre-training) applies contrastive learning across modalities: images and text. Instead of two augmented views of the same image, the positive pair is an image and its caption. CLIP was trained on 400 million image-text pairs collected from the internet.
Two separate encoders: a vision encoder (ViT or ResNet) maps images to embeddings, and a text encoder (Transformer) maps captions to embeddings. The contrastive loss pulls matching (image, caption) pairs together and pushes non-matching pairs apart.
Let's trace the exact data flow. The image path: a batch of images [batch, 3, 224, 224] enters ViT-L/14. Each image is split into non-overlapping patches of 14×14 pixels (a 16×16 grid), each flattened and projected to 1024 dims, giving 256 patch tokens plus one class token: 257 tokens total. After 24 transformer layers, the class token (a single 1024-dim vector) is extracted and passed through a linear projection head that maps it to 768 dimensions. The result is L2-normalized: [batch, 768].
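The token count follows from the image size and patch size alone. In a ViT name such as ViT-L/14, the number after the slash is the patch side length in pixels. A minimal sketch (the helper name `vit_token_count` is hypothetical) for two common configurations:

```python
def vit_token_count(image_size, patch_size, class_token=True):
    """Tokens a ViT processes: one per non-overlapping patch, plus an optional class token."""
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    grid = image_size // patch_size
    return grid * grid + (1 if class_token else 0)

print("ViT-B/16 at 224px:", vit_token_count(224, 16))   # 14 x 14 grid of patches
print("ViT-L/14 at 224px:", vit_token_count(224, 14))   # 16 x 16 grid of patches
```

A smaller patch size means more tokens per image, which is part of why the /14 models cost more compute than the /16 ones at the same resolution.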
The text path: a caption is tokenized with BPE (byte-pair encoding, 49,152-token vocabulary), padded/truncated to 77 tokens. A 12-layer transformer processes these tokens. The embedding at the [EOS] token position (analogous to the class token) is extracted and projected to 768 dimensions. Also L2-normalized: [batch, 768].
Images (squares) and text (circles) in a shared space. Matching pairs are connected. CLIP learns to align them.
CLIP's training is elegant. Take a batch of N (image, text) pairs. Compute all N×N cosine similarities. The diagonal contains matching pairs. Apply cross-entropy loss to make each row and column peak at the diagonal. That's it.
In code, this is shockingly simple. You have image_embeds of shape [N, 768] and text_embeds of shape [N, 768], both L2-normalized. The similarity matrix is one matrix multiply: image_embeds @ text_embeds.T.
This gives an [N, N] matrix. The diagonal entries are the N positive pairs (image_i matched with text_i). The off-diagonal entries are the N²−N negative pairs. For a concrete example with N=4: you get 4 positive pairs and 12 negatives. The loss is symmetric cross-entropy: cross-entropy on the rows (which text matches each image?) AND on the columns (which image matches each text?), averaged.
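To make the N = 4 case concrete, here is a small NumPy sketch. The embeddings are random stand-ins (an assumption; real ones would come from the two encoders), so only the shapes and counts are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 768

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Random placeholders for encoder outputs.
image_embeds = l2_normalize(rng.normal(size=(N, D)))
text_embeds = l2_normalize(rng.normal(size=(N, D)))

similarity = image_embeds @ text_embeds.T        # shape (N, N), entries are cosines
positive_mask = np.eye(N, dtype=bool)            # diagonal: image_i paired with text_i
n_positives = int(positive_mask.sum())
n_negatives = int((~positive_mask).sum())

print("similarity shape:", similarity.shape)
print("positives:", n_positives, "negatives:", n_negatives)
```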
The scale: 400 million image-text pairs scraped from the internet (called WIT — WebImageText). Training on 256 V100 GPUs for ~12 days. Batch size 32,768 — meaning each image competes against 32,767 negative text captions per step. Why so large? More negatives make the contrastive task harder, forcing the model to learn finer-grained distinctions.
Each cell is the cosine similarity between image i and text j. The loss tries to make the diagonal bright (high similarity) and everything else dark (low similarity).
| Hyperparameter | Value | Why |
|---|---|---|
| Batch size | 32,768 | More negatives = harder contrastive task |
| Temperature τ | 0.07 at initialization (learned) | Sharpens softmax to focus on hard negatives |
| Image encoder | ViT-L/14 | Large Vision Transformer for best quality |
| Text encoder | 12-layer Transformer | Standard text encoding |
| Embedding dim | 512 or 768 | Shared dimension for image-text space |
```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, log_logit_scale):
    """Symmetric contrastive loss over L2-normalized [N, D] embeddings."""
    # Learned logit scale = 1/tau (initialized to 1/0.07 ≈ 14.3).
    # It multiplies the similarities, so it acts as an inverse temperature.
    logit_scale = torch.exp(log_logit_scale)

    # Similarity matrix: (N, N)
    logits = image_embeds @ text_embeds.T * logit_scale

    # Labels: the diagonal is the correct match
    N = logits.shape[0]
    labels = torch.arange(N, device=logits.device)

    # Symmetric cross-entropy
    loss_i2t = F.cross_entropy(logits, labels)    # rows: which text matches each image?
    loss_t2i = F.cross_entropy(logits.T, labels)  # cols: which image matches each text?
    return (loss_i2t + loss_t2i) / 2
```
How do you fit a 32,768-pair batch across 256 GPUs? OpenAI's approach (now standard practice):
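1. All-gather embeddings, not images. Each GPU processes its local micro-batch (128 pairs) through both encoders to get embeddings of shape [128, 768]. Then an all-gather collects ALL embeddings from ALL GPUs: [32768, 768]. This is only 32768 × 768 × 2 bytes = 48MB per modality, which is trivial over InfiniBand. Now each GPU can score its local pairs against the full set of 32K candidates. The full 32K×32K matrix never has to live on one device: the CLIP paper shards the similarity computation so that each GPU computes only the entries needed for its local batch.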
2. All-gather, not gradient accumulation. Gradient accumulation would mean each micro-batch only sees 128 negatives, not 32K. The entire point of large batches is the full negative set. All-gather is essential. The communication cost (48MB × 2 modalities = 96MB per step) is negligible compared to the forward/backward pass.
3. fp16 embeddings, fp32 loss. Embeddings are computed in fp16 (saves memory). But the similarity matrix and cross-entropy are computed in fp32 to avoid numerical instability in the softmax (exp of large values overflows fp16). Gradients then flow back through the cast into the fp16 encoder computation.
Key insight: The brilliant trick is that embeddings are tiny (768 dims) compared to images (150K dims). All-gathering embeddings is 200x cheaper than all-gathering images. This is why the two-encoder architecture works so well for distributed training.
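The arithmetic behind these numbers can be verified in a few lines. This sketch only reproduces the figures quoted above; the helper name `all_gather_mebibytes` is hypothetical.

```python
def all_gather_mebibytes(global_batch, dim, bytes_per_value=2):
    """Size in MiB of one modality's gathered embeddings (fp16 = 2 bytes per value)."""
    return global_batch * dim * bytes_per_value / 2**20

n_gpus = 256
global_batch = 32768
per_gpu_batch = global_batch // n_gpus

embed_mib = all_gather_mebibytes(global_batch, 768)
image_values = 3 * 224 * 224          # values in one raw image
ratio = image_values / 768            # raw image size relative to one embedding

print(f"pairs per GPU: {per_gpu_batch}")
print(f"gathered embeddings per modality: {embed_mib:.0f} MiB")
print(f"image / embedding size ratio: {ratio:.0f}x")
```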
CLIP's superpower: zero-shot classification. To classify an image, create text prompts for each class: "a photo of a cat", "a photo of a dog", "a photo of a car." Embed all prompts. Embed the image. Pick the text with highest cosine similarity. No training on the target task.
This works because CLIP's shared embedding space aligns concepts across modalities. "A photo of a dog" is near actual dog photos, even if CLIP never saw this specific classification task during training.
Here's exactly how zero-shot ImageNet classification works. Take all 1,000 ImageNet class names. Create a text template for each: "a photo of a {class}". Run all 1,000 through the text encoder to get 1,000 text embeddings of shape [1000, 768]. These are your "classifier weights" — computed once, cached forever. To classify an image: encode it to get [1, 768], dot product with all 1,000 text embeddings, argmax. That's the prediction. No fine-tuning, no training on ImageNet. The "classifier" is literally text embeddings.
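The classification step itself is a normalized dot product followed by an argmax. The sketch below uses random vectors as stand-ins for text-encoder outputs and constructs a fake "image" embedding near one of them; all names and values are assumptions for illustration, and no real CLIP model is involved.

```python
import numpy as np

def zero_shot_classify(image_embed, class_text_embeds, class_names):
    """Pick the class whose text embedding has the highest cosine similarity to the image."""
    image_embed = image_embed / np.linalg.norm(image_embed)
    class_text_embeds = class_text_embeds / np.linalg.norm(
        class_text_embeds, axis=1, keepdims=True
    )
    scores = class_text_embeds @ image_embed          # one cosine similarity per class
    return class_names[int(np.argmax(scores))], scores

# Placeholders: in practice each row would be the text encoder's output for a prompt
# such as "a photo of a cat", computed once and cached.
rng = np.random.default_rng(0)
class_names = ["cat", "dog", "car"]
text_embeds = rng.normal(size=(3, 768))
image_embed = text_embeds[1] + 0.1 * rng.normal(size=768)   # an "image" near the dog prompt

prediction, scores = zero_shot_classify(image_embed, text_embeds, class_names)
print("prediction:", prediction)
```

Swapping in a different list of class names changes the classifier without touching the image encoder, which is the whole appeal of the approach.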
An image is compared against text prompts for 5 classes. The bar chart shows similarity scores. The highest score wins.
CLIP uses a softmax-based loss: each positive competes with all negatives in the batch. SigLIP replaces this with independent sigmoid losses per pair. Each (image, text) pair gets a binary "match or not" prediction — no need to normalize across the batch.
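Why does this matter? Softmax needs a normalizer computed over the entire batch, so every device must see all the similarities in its rows and columns. The sigmoid loss has no global normalization: each cell of the similarity matrix contributes independently, so the matrix can be processed in chunks as devices exchange blocks of text embeddings. The SigLIP paper uses this to train with batch sizes up to 1M, while reporting that the gains saturate around 32K, and it finds that the sigmoid loss outperforms softmax most clearly at smaller batch sizes.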
Left: softmax normalizes across the row (probabilities sum to 1). Right: sigmoid treats each cell independently. Both want the diagonal to be bright.
| Property | CLIP (softmax) | SigLIP (sigmoid) |
|---|---|---|
| Loss type | Cross-entropy over row/col | Binary cross-entropy per pair |
| Batch scaling | Needs batch-wide softmax normalization (all-gather) | No global normalization; computed chunk by chunk |
| Max batch size | 32K in the original CLIP | Up to 1M demonstrated |
| Performance | Strong baseline | Better at smaller batch sizes (per the SigLIP paper) |
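A NumPy sketch of the pairwise sigmoid loss makes the contrast with softmax concrete. The scale `t` and bias `b` are learnable in SigLIP; the defaults below follow the initial values I recall from the paper (t = 10, b = −10) and should be treated as assumptions, as should the function name.

```python
import numpy as np

def sigmoid_pair_loss(image_embeds, text_embeds, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: each (i, j) cell is an independent match / no-match decision."""
    img = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = t * (img @ txt.T) + b
    labels = 2.0 * np.eye(len(img)) - 1.0             # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably with logaddexp.
    per_pair = np.logaddexp(0.0, -labels * logits)
    return float(per_pair.sum() / len(img))           # normalize by batch size

eye4 = np.eye(4)
loss_matched = sigmoid_pair_loss(eye4, eye4)                         # captions line up
loss_shuffled = sigmoid_pair_loss(eye4, np.roll(eye4, 1, axis=0))    # captions misaligned
print(f"matched: {loss_matched:.4f}, shuffled: {loss_shuffled:.4f}")
```

No row or column sum appears anywhere in the loss, which is why any block of the matrix can be evaluated without knowing the rest.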
DINO (self-DIstillation with NO labels) takes contrastive ideas in a different direction: instead of text-image pairs, it uses self-distillation. A student network and a teacher network (exponential moving average of the student) both see augmented views of the same image. The student learns to match the teacher's output distribution.
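The two moving parts, the EMA teacher and the distribution-matching loss, can be sketched as follows. This is a simplified illustration: it omits the centering step DINO applies to teacher outputs, and the momentum and temperature defaults are commonly cited DINO settings that should be treated as assumptions. All function names are hypothetical.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """Move each teacher parameter a small step toward the matching student parameter."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature along the last axis."""
    x = logits / temperature
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, tau_student=0.1, tau_teacher=0.04):
    """Cross-entropy from the sharper teacher distribution to the student distribution."""
    p_teacher = np.exp(log_softmax(teacher_logits, tau_teacher))   # fixed target, no gradient
    log_p_student = log_softmax(student_logits, tau_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

teacher_logits = np.array([[2.0, 0.0, 0.0]])
agree = distillation_loss(np.array([[2.0, 0.0, 0.0]]), teacher_logits)
disagree = distillation_loss(np.array([[0.0, 2.0, 0.0]]), teacher_logits)
print(f"student agrees: {agree:.4f}, student disagrees: {disagree:.4f}")
```

Only the student receives gradients; the teacher changes solely through `ema_update`, which is what makes it a slowly moving, more stable target.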
The magic result: DINO learns features with remarkable spatial awareness. Its attention maps naturally segment objects without ever seeing segmentation labels. This makes DINO features excellent for robotics, dense prediction, and spatial reasoning.
DINO's self-attention naturally segments objects. Each color represents a different attention head focusing on different parts. The model discovers object boundaries without labels.
CLIP and contrastive learning have become foundational infrastructure. CLIP embeddings power vision-language models (VLMs) like LLaVA. DINO features power robotics (spatial reasoning) and video understanding. Together, they are a large part of how AI systems see the world.
The dependency chain is remarkable. CLIP's vision encoder is the visual backbone of LLaVA and other VLMs, which connect it to a language model (often keeping it frozen). CLIP's text encoder guides image generation in Stable Diffusion, where the text embedding steers the denoising process, and DALL-E 2 generates images from CLIP embeddings. Open-vocabulary detection models like OWLv2 and Detic use CLIP-style text embeddings to detect objects from text descriptions instead of fixed class lists. In robotics, CLIPort uses CLIP to ground language commands ("pick up the red cup") in what the robot sees. Each of these systems builds directly on a shared vision-language space of the kind CLIP introduced.
How contrastive representations flow through modern AI systems. Each arrow shows a dependency.
| System | Uses | For |
|---|---|---|
| LLaVA and similar VLMs | CLIP vision encoder | Image understanding for VLMs |
| Stable Diffusion | CLIP text encoder | Text-guided image generation |
| Segment Anything (SAM) | CLIP text embeddings (experimental text prompts) | Text-prompted segmentation |
| Robotics (OpenVLA, etc.) | DINOv2 and SigLIP features | Spatial perception and manipulation |
| Image search | CLIP embeddings | Semantic search across billions of images |
| Video understanding | DINO + CLIP | Action recognition, tracking |
The VLM doesn't train its own vision encoder from scratch. It takes CLIP's ViT, typically kept frozen, and connects it to a language model through a small learned projection. CLIP's contrastive pre-training already aligned visual concepts with language, and the VLM exploits this by feeding the projected patch tokens as "visual words" into the LLM. The bridge between modalities was already built by contrastive learning; the VLM just walks across it.
Why do VLMs freeze the CLIP encoder instead of fine-tuning it? Think about what happens to the representation space if you update the vision encoder with a language modeling loss.
You now understand how machines learn to represent the world: by comparing, contrasting, and aligning. Every VLM, every image search, every AI that "sees" — it starts here.