The Complete Beginner's Path

Understand Contrastive Learning & CLIP

How neural networks learn what "similar" means, and how CLIP taught machines to see through the lens of language.

Prerequisites: Neural network basics + Embedding intuition. That's it.
10 Chapters · 9+ Interactives · 0 Assumed Knowledge

Chapter 0: What Is Representation?

Raw data — pixels, audio samples, characters — is not directly useful for reasoning. A 224×224 image is 150,528 numbers, but those numbers don't tell you "this is a dog" or "this looks like a Van Gogh painting." A representation is a transformation that turns raw data into useful numbers.

A good representation puts similar things close together and different things far apart in an embedding space. A photo of a golden retriever and a labrador should have nearby embeddings. A golden retriever and a car should be far apart.

The key question: How do you learn a good embedding function without manually labeling millions of examples? Contrastive learning answers this: you learn by comparing pairs.
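To make "close together" concrete, here is a minimal sketch using cosine similarity on made-up 4-dimensional embeddings. The vectors are invented for illustration, not taken from any trained model:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: near 1.0 = same direction, near 0.0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings (hypothetical values for illustration).
golden_retriever = np.array([0.9, 0.8, 0.1, 0.0])
labrador         = np.array([0.8, 0.9, 0.2, 0.1])
car              = np.array([0.0, 0.1, 0.9, 0.8])

# A good embedding puts the two dogs close and the car far away.
assert cosine_sim(golden_retriever, labrador) > cosine_sim(golden_retriever, car)
```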
Embedding Space

Points are data samples. Colors are categories. A good representation clusters similar items together. Click "Shuffle" to see a bad (random) embedding; click "Organize" to see a learned one.

Check: What makes a good representation?

Chapter 1: Contrastive Learning

Contrastive learning is the art of learning embeddings by comparing pairs. For each sample (the anchor), you need a positive (something similar) and negatives (things that are different). The loss pulls anchor and positive together and pushes anchor and negatives apart.

The beauty: you don't need class labels. The "positive" can be a different augmentation of the same image: flip it, crop it, change the color. Any two views of the same image are a positive pair. Everything else in the batch is a negative.

Anchor
Original image
↓ augment
Positive
Augmented version (same content)
↓ contrast with
Negatives
All other images in the batch

Pull and Push

The anchor is pulled toward the positive and pushed away from negatives. Click "Step" to see the forces applied. Watch embeddings organize over steps.

Self-supervised: Contrastive learning is self-supervised — the data provides its own supervision through augmentations. No human labels needed. This is why it scales to billions of images.
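The anchor/positive/negatives setup above can be sketched with a toy stand-in for augmentation (real pipelines use crops, color jitter, and blur; the noise-plus-flip here is only a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, rng):
    """Toy augmentation: small noise plus an occasional horizontal flip."""
    view = image + 0.05 * rng.standard_normal(image.shape)
    if rng.random() < 0.5:
        view = view[:, ::-1]              # horizontal flip
    return view

batch = rng.standard_normal((8, 16, 16))  # 8 toy "images"

anchor   = augment(batch[0], rng)         # one view of image 0
positive = augment(batch[0], rng)         # a second view of the SAME image
negatives = [augment(batch[i], rng) for i in range(1, len(batch))]

# Every other image in the batch serves as a negative; no labels involved.
assert len(negatives) == 7
```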
Check: In contrastive learning, what is a "positive pair"?

Chapter 2: SimCLR — Simple Framework

SimCLR (a Simple framework for Contrastive Learning of visual Representations) showed how effective contrastive learning can be with a clean, minimal design. Take a batch of N images. Augment each twice to get 2N views. For each image, its two augmented views form the positive pair. The other 2(N-1) views are negatives.

The architecture: a ResNet encoder maps each view to an embedding, then a small projection head (MLP) maps it to the space where the contrastive loss is applied. After training, the projection head is thrown away — the encoder's output is the useful representation.

Image x
Original image from batch
↓ two random augmentations
x_i, x_j
Two augmented views
↓ encoder f (ResNet)
h_i, h_j
Representations (keep these)
↓ projection head g (MLP)
z_i, z_j
Projected embeddings (contrastive loss here)
Why throw away the projection head? The projection head learns to discard information that's irrelevant to the contrastive task (like color jitter). The encoder retains richer features useful for downstream tasks.
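The encoder-plus-projection-head pipeline can be sketched with numpy stand-ins for f and g. The layer shapes here are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for SimCLR's components (shapes are toy values):
W_f  = rng.standard_normal((128, 512)) * 0.02  # "encoder" f: 512-d view -> 128-d h
W_g1 = rng.standard_normal((128, 128)) * 0.02  # projection head g: 2-layer MLP
W_g2 = rng.standard_normal((64, 128)) * 0.02

def encoder_f(x):
    return np.maximum(W_f @ x, 0.0)            # h = f(x): kept for downstream tasks

def projection_g(h):
    return W_g2 @ np.maximum(W_g1 @ h, 0.0)    # z = g(h): used only for the loss

x_i, x_j = rng.standard_normal(512), rng.standard_normal(512)  # two augmented views
h_i, h_j = encoder_f(x_i), encoder_f(x_j)      # representations (keep these)
z_i, z_j = projection_g(h_i), projection_g(h_j)  # contrastive loss applied here

assert h_i.shape == (128,) and z_i.shape == (64,)
```

After training, only encoder_f is kept; projection_g is discarded.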
Augmentation | What it does | Why it helps
Random crop + resize | Different spatial views | Forces spatial invariance
Color jitter | Adjust brightness, contrast, saturation | Prevents color-based shortcuts
Horizontal flip | Mirror the image | Orientation invariance
Gaussian blur | Smooth the image | Prevents texture-based shortcuts
Check: In SimCLR, for a batch of N images, how many positive pairs does each image have?

Chapter 3: InfoNCE Loss

The workhorse of contrastive learning: the InfoNCE loss (also called NT-Xent in SimCLR). For a positive pair (i, j), it is the negative log-probability of picking j, the true positive, out of all candidate samples for anchor i. It's essentially a softmax over similarity scores.

L_i = −log( exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) )

The temperature τ is crucial. Low τ makes the distribution sharper (hard negatives matter more). High τ makes it smoother (all negatives contribute equally). The similarity function is cosine similarity: sim(a,b) = a·b / (||a|| ||b||).

B×B Similarity Matrix

A batch of B images creates a B×B similarity matrix. Diagonal = positive pairs (should be high). Off-diagonal = negatives (should be low). Adjust temperature to see the effect.

Temperature matters: τ=0.07 (CLIP's default) is quite sharp. The model focuses intensely on the hardest negatives. τ=1.0 is flat — all negatives contribute equally. Too low = training instability. Too high = weak signal.
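A sketch of InfoNCE for a single anchor, assuming cosine similarity and using log-sum-exp for numerical stability. The final assertion illustrates the temperature effect with an easy positive and random negatives (all values are toy data):

```python
import numpy as np

def info_nce(z_anchor, z_pos, z_negs, tau=0.1):
    """InfoNCE for one anchor: negative log-softmax of the positive
    among the positive and all negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(z_anchor, z_pos)] +
                      [cos(z_anchor, n) for n in z_negs]) / tau
    m = logits.max()                                   # log-sum-exp stability
    log_denom = m + np.log(np.exp(logits - m).sum())
    return float(log_denom - logits[0])                # = -log p(positive)

rng = np.random.default_rng(0)
anchor = rng.standard_normal(64)
pos  = anchor + 0.1 * rng.standard_normal(64)          # nearby positive
negs = [rng.standard_normal(64) for _ in range(6)]     # unrelated negatives

# A lower tau sharpens the softmax; with an easy positive the loss shrinks.
assert info_nce(anchor, pos, negs, tau=0.07) < info_nce(anchor, pos, negs, tau=1.0)
```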
Check: What does a lower temperature τ do to the contrastive loss?

Chapter 4: CLIP — Connecting Vision and Language

CLIP (Contrastive Language-Image Pre-training) applies contrastive learning across modalities: images and text. Instead of two augmented views of the same image, the positive pair is an image and its caption. Train on 400 million image-text pairs from the internet.

Two separate encoders: a vision encoder (ViT or ResNet) maps images to embeddings, and a text encoder (Transformer) maps captions to embeddings. The contrastive loss pulls matching (image, caption) pairs together and pushes non-matching pairs apart.

Image Encoder
ViT / ResNet → image embedding
 
Shared Embedding Space
cos(image, text) = similarity
 
Text Encoder
Transformer → text embedding
CLIP Embedding Space

Images (squares) and text (circles) in a shared space. Matching pairs are connected. CLIP learns to align them.

The breakthrough: CLIP learns a shared space where "a photo of a dog" and an actual photo of a dog are neighbors. This means you can classify images using text descriptions alone — no task-specific training required.
Check: What are CLIP's positive pairs?

Chapter 5: Training CLIP

CLIP's training is elegant. Take a batch of N (image, text) pairs. Compute all N×N cosine similarities. The diagonal contains matching pairs. Apply cross-entropy loss to make each row and column peak at the diagonal. That's it.

The scale: 400 million image-text pairs, 32 epochs, batch size 32,768. The enormous batch size is critical — more negatives per batch means harder contrastive learning. Training takes weeks on hundreds of GPUs.

L = ½ (CE_rows + CE_cols) over the N×N similarity matrix
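The symmetric loss can be sketched in a few lines of numpy. The batch size and embedding dimension here are toy values, and the embeddings are random stand-ins for encoder outputs:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric contrastive loss over the N x N similarity matrix.
    Row i (and column i) should peak at the diagonal entry (i, i)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau                   # N x N cosine similarities / tau
    n = len(logits)

    def ce_diag(l):
        # Cross-entropy per row, with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))  # rows + columns

rng = np.random.default_rng(0)
txt = rng.standard_normal((4, 32))
aligned = txt + 0.01 * rng.standard_normal((4, 32))  # matching pairs nearby
mismatched = rng.standard_normal((4, 32))            # unrelated embeddings

assert clip_loss(aligned, txt) < clip_loss(mismatched, txt)
```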
The N×N Training Matrix

Each cell is the cosine similarity between image i and text j. The loss tries to make the diagonal bright (high similarity) and everything else dark (low similarity).

Hyperparameter | Value | Why
Batch size | 32,768 | More negatives = harder contrastive task
Temperature τ | 0.07 (learned) | Sharpens softmax to focus on hard negatives
Image encoder | ViT-L/14 | Large Vision Transformer for best quality
Text encoder | 12-layer Transformer | Standard text encoding
Embedding dim | 512 or 768 | Shared dimension for image-text space
Why so large a batch? With batch size 32,768, each positive pair competes against 32,767 negatives. This makes the task extremely hard, forcing the model to learn fine-grained distinctions.
Check: In CLIP training, what does each row of the N×N matrix represent?

Chapter 6: Zero-Shot Transfer

CLIP's superpower: zero-shot classification. To classify an image, create text prompts for each class: "a photo of a cat", "a photo of a dog", "a photo of a car." Embed all prompts. Embed the image. Pick the text with highest cosine similarity. No training on the target task.

This works because CLIP's shared embedding space aligns concepts across modalities. "A photo of a dog" is near actual dog photos, even if CLIP never saw this specific classification task during training.

Image
Encode with CLIP vision encoder
↓ cosine similarity
Text Prompts
"a photo of a {class}" for each class
↓ argmax
Prediction
Class with highest similarity
Zero-Shot Classification

An image is compared against text prompts for 5 classes. The bar chart shows similarity scores. The highest score wins. Click to generate a new random scenario.

Prompt engineering matters: "a photo of a dog" works better than just "dog" because CLIP was trained on natural captions. Templates like "a photo of a {class}" or "a centered satellite image of {class}" can boost accuracy by 5-10%.
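Zero-shot classification reduces to an argmax over cosine similarities. In this sketch the embeddings are random stand-ins for the outputs of CLIP's two encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs, class_names):
    """Pick the class whose prompt embedding is most similar to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    prompts = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = prompts @ img                       # cosine similarity per class
    return class_names[int(np.argmax(sims))]

# Hypothetical embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
classes = ["cat", "dog", "car"]
prompt_embs = rng.standard_normal((3, 16))    # "a photo of a {class}" embeddings
image_emb = prompt_embs[1] + 0.1 * rng.standard_normal(16)  # near the "dog" prompt

assert zero_shot_classify(image_emb, prompt_embs, classes) == "dog"
```

No gradient step touches the target task: only text prompts change.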
Check: Why does zero-shot classification work with CLIP?

Chapter 7: SigLIP — Sigmoid Beats Softmax

CLIP uses a softmax-based loss: each positive competes with all negatives in the batch. SigLIP replaces this with independent sigmoid losses per pair. Each (image, text) pair gets a binary "match or not" prediction — no need to normalize across the batch.

Why does this matter? Softmax requires all-to-all communication across the batch. Sigmoid doesn't. This means SigLIP can scale to much larger batch sizes (up to 1M) using simple data parallelism, and achieves better performance.

L = −Σ_{i,j} [ y_ij log σ(z_ij) + (1 − y_ij) log(1 − σ(z_ij)) ]
CLIP (softmax): "Which text in the batch matches this image?" Competition across all pairs. Requires global batch communication.
SigLIP (sigmoid): "Does this specific (image, text) pair match?" Independent per-pair decision. Trivially parallelizable.
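The per-pair sigmoid loss can be sketched directly from the formula above. Note that the real SigLIP also learns a temperature and bias on the logits, which this toy version omits:

```python
import numpy as np

def siglip_loss(logits):
    """Independent binary cross-entropy per (image, text) cell.
    Diagonal cells are matches (y=1); everything else is a non-match (y=0)."""
    y = np.eye(len(logits))
    p = 1.0 / (1.0 + np.exp(-logits))           # sigmoid: no batch-wide softmax
    eps = 1e-12                                  # guard against log(0)
    return float(-(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean())

good = np.array([[ 5.0, -5.0], [-5.0,  5.0]])   # bright diagonal, dark elsewhere
bad  = np.array([[-5.0,  5.0], [ 5.0, -5.0]])   # the opposite

assert siglip_loss(good) < siglip_loss(bad)
```

Each cell's loss depends only on its own logit, which is why no all-to-all communication across the batch is needed.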
Softmax vs Sigmoid

Left: softmax normalizes across the row (probabilities sum to 1). Right: sigmoid treats each cell independently. Both want the diagonal to be bright.

Property | CLIP (softmax) | SigLIP (sigmoid)
Loss type | Cross-entropy over row/col | Binary cross-entropy per pair
Batch scaling | Needs all-gather | Embarrassingly parallel
Max batch size | ~32K (practical limit) | >1M demonstrated
Performance | Strong baseline | Better at same compute
Check: What advantage does SigLIP's sigmoid loss have over CLIP's softmax?

Chapter 8: DINO & Self-Supervised Vision

DINO (self-DIstillation with NO labels) takes contrastive ideas in a different direction: instead of text-image pairs, it uses self-distillation. A student network and a teacher network (exponential moving average of the student) both see augmented views of the same image. The student learns to match the teacher's output distribution.

The magic result: DINO learns features with remarkable spatial awareness. Its attention maps naturally segment objects without ever seeing segmentation labels. This makes DINO features excellent for robotics, dense prediction, and spatial reasoning.

Student
Sees local crops + global crop
↓ match distributions
Teacher (EMA)
Sees only global crops
↓ centering + sharpening
Cross-Entropy Loss
Student matches teacher's output distribution
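Two of DINO's moving parts, the EMA teacher update and the centering-plus-sharpening of teacher outputs, can be sketched as follows (the momentum and temperature values are typical choices, not canonical):

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher weights track an exponential moving average of the student's."""
    return momentum * teacher_w + (1 - momentum) * student_w

def teacher_targets(logits, center, tau_t=0.04):
    """Centering (subtract a running mean) then sharpening (low temperature),
    producing the distribution the student must match."""
    l = (logits - center) / tau_t
    l = l - l.max()                              # numerical stability
    p = np.exp(l)
    return p / p.sum()

student_w = np.array([1.0, 2.0, 3.0])
teacher_w = ema_update(np.zeros(3), student_w)   # teacher moves slowly
assert np.allclose(teacher_w, 0.004 * student_w)

probs = teacher_targets(np.array([1.0, 2.0, 3.0]), center=np.full(3, 2.0))
assert abs(probs.sum() - 1.0) < 1e-9             # a valid target distribution
```

Centering prevents one output dimension from dominating; sharpening prevents the opposite collapse into a uniform distribution.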
DINO Attention Maps

DINO's self-attention naturally segments objects. Each color represents a different attention head focusing on different parts. The model discovers object boundaries without labels.

DINO v1: ViT + self-distillation. Excellent attention maps. Used a lot in research.
DINOv2: Scaled up, combined with iBOT. State-of-the-art visual features for dense tasks. Powers many robotics pipelines.
DINO vs CLIP: CLIP learns image-text alignment (good for classification, retrieval). DINO learns spatial features (good for segmentation, depth, robotics). Many VLMs use both.
Check: What makes DINO's learned features special?

Chapter 9: Representations Everywhere

CLIP and contrastive learning have become foundational infrastructure. CLIP embeddings power vision-language models (VLMs) like LLaVA and GPT-4V. DINO features power robotics (spatial reasoning) and video understanding. Together, they're how AI systems see the world.

The Representation Ecosystem

How contrastive representations flow through modern AI systems. Each arrow shows a dependency.

System | Uses | For
LLaVA, GPT-4V | CLIP vision encoder | Image understanding for VLMs
Stable Diffusion | CLIP text encoder | Text-guided image generation
Segment Anything (SAM) | Contrastive pre-training | Universal segmentation
Robotics (RT-2, etc.) | DINO/DINOv2 features | Spatial perception and manipulation
Image search | CLIP embeddings | Semantic search across billions of images
Video understanding | DINO + CLIP | Action recognition, tracking
The theme: Learn representations once, use them everywhere. Contrastive pre-training is expensive (hundreds of GPU-days), but the resulting encoders are reused across dozens of downstream applications.
"The features are the foundation. Everything else is built on top."
— common wisdom in representation learning

You now understand how machines learn to represent the world: by comparing, contrasting, and aligning. Every VLM, every image search, every AI that "sees" — it starts here.