How neural networks learn what "similar" means, and how CLIP taught machines to see through the lens of language.
Raw data — pixels, audio samples, characters — is not directly useful for reasoning. A 224×224 RGB image is 150,528 numbers (224 × 224 × 3 color channels), but those numbers don't tell you "this is a dog" or "this looks like a Van Gogh painting." A representation is a transformation that turns raw data into useful numbers.
A good representation puts similar things close together and different things far apart in an embedding space. A photo of a golden retriever and a labrador should have nearby embeddings. A golden retriever and a car should be far apart.
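"Close" and "far" here are measured with cosine similarity. A toy sketch with hypothetical 4-dimensional embeddings (real models use hundreds of dimensions, and these particular values are made up purely to illustrate the geometry):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: how aligned two embedding vectors are."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings, chosen so the two dogs point the same way.
golden_retriever = np.array([0.9, 0.8, 0.1, 0.0])
labrador         = np.array([0.8, 0.9, 0.2, 0.1])
car              = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_sim(golden_retriever, labrador))  # high: similar concepts
print(cosine_sim(golden_retriever, car))       # low: different concepts
```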
Points are data samples. Colors are categories. A good representation clusters similar items together. Click "Shuffle" to see a bad (random) embedding; click "Organize" to see a learned one.
Contrastive learning is the art of learning embeddings by comparing pairs. For each sample (the anchor), you need a positive (something similar) and negatives (things that are different). The loss pulls anchor and positive together and pushes anchor and negatives apart.
The beauty: you don't need class labels. The "positive" can be a different augmentation of the same image: flip it, crop it, change the color. Any two views of the same image are a positive pair. Everything else in the batch is a negative.
The anchor is pulled toward the positive and pushed away from negatives. Click "Step" to see the forces applied. Watch embeddings organize over steps.
SimCLR (a Simple framework for Contrastive Learning of visual Representations) showed how effective contrastive learning can be with a clean, minimal design. Take a batch of N images. Augment each twice to get 2N views. For each image, its two augmented views form the positive pair. The other 2(N-1) views are negatives.
The architecture: a ResNet encoder maps each view to an embedding, then a small projection head (MLP) maps it to the space where the contrastive loss is applied. After training, the projection head is thrown away — the encoder's output is the useful representation.
| Augmentation | What it does | Why it helps |
|---|---|---|
| Random crop + resize | Different spatial views | Forces spatial invariance |
| Color jitter | Adjust brightness, contrast, saturation | Prevents color-based shortcuts |
| Horizontal flip | Mirror the image | Orientation invariance |
| Gaussian blur | Smooth the image | Prevents texture-based shortcuts |
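The 2N-view batch construction can be sketched with toy stand-ins for these augmentations. Everything here (image shapes, crop sizes, the jitter range) is illustrative, not SimCLR's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Toy stand-ins for SimCLR's augmentations on an HxWx3 float array."""
    h, w, _ = img.shape
    # random crop + resize (crop a 3/4-size window, nearest-neighbor upsample)
    top, left = rng.integers(0, h // 4), rng.integers(0, w // 4)
    crop = img[top:top + 3 * h // 4, left:left + 3 * w // 4]
    crop = crop[np.linspace(0, crop.shape[0] - 1, h).astype(int)][
        :, np.linspace(0, crop.shape[1] - 1, w).astype(int)]
    if rng.random() < 0.5:                                # horizontal flip
        crop = crop[:, ::-1]
    return np.clip(crop * rng.uniform(0.6, 1.4), 0, 1)    # brightness jitter

images = rng.random((4, 32, 32, 3))          # a batch of N=4 fake images
views = np.stack([augment(im) for im in images for _ in range(2)])
# 2N = 8 views; views[2k] and views[2k+1] are the positive pair for image k
```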
The workhorse of contrastive learning: the InfoNCE loss (called NT-Xent in SimCLR). For a positive pair (i, j), it's the negative log probability of picking j as the match for i from among all other views in the batch: ℓ(i, j) = −log[ exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) ]. It's essentially a softmax over similarity scores.
The temperature τ is crucial. Low τ makes the distribution sharper (hard negatives matter more). High τ makes it smoother (all negatives contribute equally). The similarity function is cosine similarity: sim(a,b) = a·b / (||a|| ||b||).
A batch of B images creates a B×B similarity matrix. Diagonal = positive pairs (should be high). Off-diagonal = negatives (should be low). Adjust temperature to see the effect.
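A minimal NT-Xent sketch over such a similarity matrix. It assumes a batch of 2N row-stacked embeddings where rows 2k and 2k+1 are the two views of image k; the layout and the τ value are illustrative choices, not SimCLR's reference implementation:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent loss for a (2N, d) batch of embeddings, where z[2k] and
    z[2k+1] are the two augmented views of image k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-length rows
    sim = z @ z.T / tau                  # all-pairs cosine similarities
    np.fill_diagonal(sim, -np.inf)       # a view is never its own candidate
    pos = np.arange(len(z)) ^ 1          # positive partner: 0<->1, 2<->3, ...
    # log softmax over each row, then pick out the positive's log-probability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(z)), pos].mean())
```

When the two views of each image embed close together, the positive sits at the top of its row's softmax and the loss is small; for random embeddings it is near log(2N−1).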
CLIP (Contrastive Language-Image Pre-training) applies contrastive learning across modalities: images and text. Instead of two augmented views of the same image, the positive pair is an image and its caption. Train on 400 million image-text pairs from the internet.
Two separate encoders: a vision encoder (ViT or ResNet) maps images to embeddings, and a text encoder (Transformer) maps captions to embeddings. The contrastive loss pulls matching (image, caption) pairs together and pushes non-matching pairs apart.
Images (squares) and text (circles) in a shared space. Matching pairs are connected. CLIP learns to align them.
CLIP's training is elegant. Take a batch of N (image, text) pairs. Compute all N×N cosine similarities. The diagonal contains matching pairs. Apply cross-entropy loss to make each row and column peak at the diagonal. That's it.
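That recipe fits in a few lines, in the spirit of the pseudocode in the CLIP paper. This is a sketch, not the released implementation: the embeddings are assumed to be pre-computed encoder outputs, and τ is fixed here rather than learned:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric contrastive loss over a batch of N matching
    (image, text) pairs, row i of each matrix belonging together."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau            # N x N similarity matrix
    labels = np.arange(len(logits))       # matching pairs on the diagonal

    def xent(l):
        """Cross-entropy with the diagonal as the target class per row."""
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text (rows) and text->image (columns) directions
    return float((xent(logits) + xent(logits.T)) / 2)
```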
The scale: 400 million image-text pairs, 32 epochs, batch size 32,768. The enormous batch size is critical — more negatives per batch means harder contrastive learning. Training takes weeks on hundreds of GPUs.
Each cell is the cosine similarity between image i and text j. The loss tries to make the diagonal bright (high similarity) and everything else dark (low similarity).
| Hyperparameter | Value | Why |
|---|---|---|
| Batch size | 32,768 | More negatives = harder contrastive task |
| Temperature τ | 0.07 (learned) | Sharpens softmax to focus on hard negatives |
| Image encoder | ViT-L/14 | Large Vision Transformer for best quality |
| Text encoder | 12-layer Transformer | Standard text encoding |
| Embedding dim | 512 or 768 | Shared dimension for image-text space |
CLIP's superpower: zero-shot classification. To classify an image, create text prompts for each class: "a photo of a cat", "a photo of a dog", "a photo of a car." Embed all prompts. Embed the image. Pick the text with highest cosine similarity. No training on the target task.
This works because CLIP's shared embedding space aligns concepts across modalities. "A photo of a dog" is near actual dog photos, even if CLIP never saw this specific classification task during training.
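The mechanics reduce to an argmax over cosine similarities. A sketch with hypothetical stand-in embeddings in place of real CLIP encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs, class_names):
    """Pick the class whose text prompt embedding is most similar
    to the image embedding. No task-specific training involved."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    scores = txt @ img                     # cosine similarity per prompt
    return class_names[int(np.argmax(scores))], scores

# Toy stand-ins: one prompt embedding per class, one image embedding.
prompts = np.eye(3)                        # "a photo of a {cat, dog, car}"
image = np.array([0.1, 0.9, 0.2])          # hypothetical image embedding
label, scores = zero_shot_classify(image, prompts, ["cat", "dog", "car"])
```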
An image is compared against text prompts for 5 classes. The bar chart shows similarity scores. The highest score wins. Click to generate a new random scenario.
CLIP uses a softmax-based loss: each positive competes with all negatives in the batch. SigLIP replaces this with independent sigmoid losses per pair. Each (image, text) pair gets a binary "match or not" prediction — no need to normalize across the batch.
Why does this matter? The softmax needs a global normalization, which requires all-to-all communication of embeddings across devices. The sigmoid doesn't. SigLIP can therefore scale to batch sizes above 1M with simple data parallelism, and it matches or beats the softmax loss at the same compute (the authors found quality saturates around a 32k batch, so the huge batches are possible rather than necessary).
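A sketch of the pairwise sigmoid loss, again assuming pre-computed embeddings; t = 10 and b = −10 follow the paper's initialization of its learned temperature and bias, fixed here for illustration:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every (image i, text j) cell is an
    independent binary 'match or not' problem, so no row or column
    normalization across the batch is needed."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * (img @ txt.T) + b
    labels = 2 * np.eye(len(logits)) - 1        # +1 on diagonal, -1 elsewhere
    # -log sigmoid(label * logit) summed over all cells, averaged per image
    return float(np.log1p(np.exp(-labels * logits)).sum() / len(logits))
```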
Left: softmax normalizes across the row (probabilities sum to 1). Right: sigmoid treats each cell independently. Both want the diagonal to be bright.
| Property | CLIP (softmax) | SigLIP (sigmoid) |
|---|---|---|
| Loss type | Cross-entropy over row/col | Binary cross-entropy per pair |
| Batch scaling | Needs all-gather | Embarrassingly parallel |
| Max batch size | ~32K (practical limit) | >1M demonstrated |
| Performance | Strong baseline | Better at same compute |
DINO (self-DIstillation with NO labels) takes contrastive ideas in a different direction: instead of text-image pairs, it uses self-distillation. A student network and a teacher network (exponential moving average of the student) both see augmented views of the same image. The student learns to match the teacher's output distribution.
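The two key ingredients, the distillation loss and the EMA teacher update, can be sketched on raw logits rather than full networks. The temperatures and momentum are in the range of the DINO paper's defaults; everything else here is illustrative:

```python
import numpy as np

def sharpen(logits, tau):
    """Softmax with temperature tau (lower tau = sharper distribution)."""
    z = (logits - logits.max()) / tau
    e = np.exp(z)
    return e / e.sum()

def dino_loss(student_logits, teacher_logits, center,
              tau_s=0.1, tau_t=0.04):
    """Cross-entropy from the student's distribution to the teacher's.
    Centering plus a lower teacher temperature is what prevents the
    trivial collapse to a constant output."""
    p_t = sharpen(teacher_logits - center, tau_t)   # teacher: center, sharpen
    p_s = sharpen(student_logits, tau_s)
    return float(-(p_t * np.log(p_s + 1e-12)).sum())

def ema_update(teacher_w, student_w, m=0.996):
    """Teacher weights: an exponential moving average of the student's."""
    return m * teacher_w + (1 - m) * student_w
```

A student that agrees with the teacher gets a near-zero loss; one that disagrees gets a large one, and only the student side receives gradients.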
The magic result: DINO learns features with remarkable spatial awareness. Its attention maps naturally segment objects without ever seeing segmentation labels. This makes DINO features excellent for robotics, dense prediction, and spatial reasoning.
DINO's self-attention naturally segments objects. Each color represents a different attention head focusing on different parts. The model discovers object boundaries without labels.
CLIP and contrastive learning have become foundational infrastructure. CLIP embeddings power vision-language models (VLMs) like LLaVA and GPT-4V. DINO features power robotics (spatial reasoning) and video understanding. Together, they're how AI systems see the world.
How contrastive representations flow through modern AI systems. Each arrow shows a dependency.
| System | Uses | For |
|---|---|---|
| LLaVA, GPT-4V | CLIP vision encoder | Image understanding for VLMs |
| Stable Diffusion | CLIP text encoder | Text-guided image generation |
| Segment Anything (SAM) | Contrastive pre-training | Universal segmentation |
| Robotics (RT-2, etc.) | DINO/DINOv2 features | Spatial perception and manipulation |
| Image search | CLIP embeddings | Semantic search across billions of images |
| Video understanding | DINO + CLIP | Action recognition, tracking |
You now understand how machines learn to represent the world: by comparing, contrasting, and aligning. Every VLM, every image search, every AI that "sees" — it starts here.