What Is It?
Contrastive learning trains neural networks to build useful representations without requiring explicit labels for every sample. The core idea is deceptively simple: learn an embedding space where similar things are close together and different things are far apart.
In supervised variants like CLIP, "similar" means an image and its caption. In self-supervised variants like SimCLR or DINO, "similar" means two augmented views of the same image — no human labels needed at all.
This family of methods has become a backbone of modern multimodal AI. Major vision-language models (GPT-4V, LLaVA, Gemini) build on contrastively pretrained vision encoders, and leading text-to-image models (Stable Diffusion, DALL-E) rely on CLIP embeddings for conditioning and guidance. Contrastive pretraining is the invisible infrastructure of the AI stack.
Architecture
CLIP: Dual Encoder
CLIP (Contrastive Language-Image Pretraining) uses two separate encoders — one for images (a ViT or ResNet) and one for text (a Transformer) — that map their inputs into a shared embedding space. Training maximizes the cosine similarity of correct (image, text) pairs and minimizes it for incorrect ones.
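The shared-space scoring step can be sketched in a few lines of NumPy. The embeddings below are random stand-ins for real ViT/text-Transformer outputs; the point is only the normalize-then-dot-product structure:

```python
import numpy as np

def cosine_similarity_matrix(img_emb, txt_emb):
    """Pairwise cosine similarities between image and text embeddings.

    img_emb, txt_emb: (B, D) arrays from the two encoders.
    Returns a (B, B) matrix; entry [i, j] scores image i against text j.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T

rng = np.random.default_rng(0)
B, D = 4, 8
sims = cosine_similarity_matrix(rng.normal(size=(B, D)),
                                rng.normal(size=(B, D)))
print(sims.shape)  # (4, 4)
```

Training pushes the diagonal entries of this matrix up and everything else down.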
SimCLR: Self-Supervised Pipeline
SimCLR takes a single image, creates two random augmentations (crop, flip, color jitter, blur), passes both through the same encoder and a small projection MLP, then applies contrastive loss. The two views of the same image are positives; everything else in the batch is a negative.
Two views × same encoder × contrastive loss = learned representations without labels.
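That pipeline can be sketched in NumPy. The augmentation, encoder, and projection head here are toy stand-ins (additive noise, a linear+ReLU map, and a linear projection) rather than SimCLR's real crop/jitter pipeline and ResNet:

```python
import numpy as np

rng = np.random.default_rng(1)

def augment(x):
    """Stand-in for SimCLR's random crop/flip/color jitter: small noise."""
    return x + 0.1 * rng.normal(size=x.shape)

def encoder(x, W):
    """Stand-in encoder: linear map + ReLU (a real one is a ResNet/ViT)."""
    return np.maximum(x @ W, 0.0)

def projection_head(h, P):
    """Small projection mapping representations into the contrastive space."""
    return h @ P

B, D, H, Z = 8, 32, 16, 4
x = rng.normal(size=(B, D))        # a batch of "images"
W = rng.normal(size=(D, H))
P = rng.normal(size=(H, Z))

z1 = projection_head(encoder(augment(x), W), P)  # view 1
z2 = projection_head(encoder(augment(x), W), P)  # view 2, same weights
# z1[i] and z2[i] are a positive pair; every other row in the batch
# acts as a negative for the contrastive loss.
```

Note that both views pass through the *same* weights W and P; only the augmentation noise differs.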
Core Methods
The contrastive / representation learning family has diversified into several distinct approaches:
- CLIP (Contrastive Language-Image Pretraining): dual encoder aligns images and text in a shared space, trained on 400M image-text pairs from the web.
- SimCLR (Simple Contrastive Learning of Representations): two augmented views, shared encoder, InfoNCE (NT-Xent) loss. Needs large batch sizes (4096+).
- BYOL (Bootstrap Your Own Latent): uses a momentum-updated target network instead of negative pairs, avoiding representation collapse without negatives.
- DINO (Self-Distillation with No Labels): student-teacher ViT trained with cross-entropy on soft targets; object segmentation emerges without supervision. DINOv2 scales the recipe massively.
- MAE (Masked Autoencoders): masks 75% of image patches and reconstructs the pixels. Not strictly contrastive, but a key self-supervised representation learner.
- SigLIP (Sigmoid Loss for Language-Image Pretraining): replaces the softmax with a per-pair sigmoid, so no global normalization across the batch is needed and larger batches scale more efficiently.
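To make the "no global normalization" point concrete, here is a hedged NumPy sketch of a SigLIP-style per-pair sigmoid loss. Each entry of the similarity matrix is treated as an independent binary classification (is this pair a match?), so no softmax over rows or columns is needed; the temperature `t` and bias `b` values below are illustrative, not the paper's learned ones:

```python
import numpy as np

def siglip_loss(sims, t=10.0, b=-10.0):
    """Per-pair sigmoid loss over a (B, B) similarity matrix (sketch).

    Labels are +1 on the diagonal (matching pairs), -1 elsewhere.
    -log sigmoid(label * logit) = logaddexp(0, -label * logit),
    computed here in a numerically stable way.
    """
    B = sims.shape[0]
    labels = 2 * np.eye(B) - 1        # +1 diagonal, -1 off-diagonal
    logits = t * sims + b
    return np.mean(np.logaddexp(0.0, -labels * logits))
```

A perfectly aligned batch (identity similarity matrix) yields a much lower loss than a maximally misaligned one, and the loss decomposes over pairs, which is what makes sharded, large-batch training cheap.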
InfoNCE Loss
The InfoNCE loss (Noise-Contrastive Estimation) is the mathematical engine behind most contrastive methods. Given a batch of B image-text pairs, we compute a B × B similarity matrix and train the model to put high values on the diagonal (matching pairs) and low values everywhere else.
The temperature τ controls the sharpness of the distribution: a low τ sharpens the softmax, making the model more confident, while a high τ softens it. Typical values: 0.07 for CLIP, 0.1–0.5 for SimCLR.
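A minimal NumPy implementation of the symmetric InfoNCE loss described above, assuming a precomputed B × B similarity matrix:

```python
import numpy as np

def info_nce(sims, tau=0.07):
    """Symmetric InfoNCE over a (B, B) image-text similarity matrix.

    Cross-entropy with the diagonal as the target, averaged over the
    image->text (rows) and text->image (columns) directions.
    """
    logits = sims / tau

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))       # diagonal = matches

    return 0.5 * (xent(logits) + xent(logits.T))

print(info_nce(np.eye(4)))       # near zero: diagonal already dominates
print(info_nce(np.ones((4, 4)))) # log(4): model can't tell pairs apart
```

The second call shows the failure mode the loss penalizes: if every similarity is equal, each row's softmax is uniform over B candidates and the loss is log B.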
Similarity Matrix
Diagonal entries correspond to matching pairs and should be bright (high similarity); off-diagonal entries are non-matching and should be dark (low similarity). Lowering τ sharpens the resulting softmax distribution; raising it softens it.
Why It Matters
Contrastive pretraining has become the invisible infrastructure of modern AI. It is not an end product — it is the foundation upon which multimodal systems are built.
CLIP and SigLIP are the default vision encoders in GPT-4V, LLaVA, Gemini, and most VLMs. They translate pixels into tokens the LLM can understand.
DINO and DINOv2 features power robot perception. Their dense, semantic features work for grasping, navigation, and manipulation without task-specific training.
CLIP's shared embedding space is the bridge between text and images in Stable Diffusion, DALL-E, image search, and zero-shot classification.
Training / Inference
- Data: 400M–2B image-text pairs (CLIP), or unlabeled images (SimCLR, DINO)
- Batch size: Very large — 32K (CLIP), 4096+ (SimCLR). More negatives = better signal.
- Compute: 256–1024 GPUs for days to weeks. CLIP: ~12 days on 256 V100s.
- Key tricks: Mixed precision, gradient checkpointing, learned temperature τ.
- Augmentations: Critical for self-supervised methods. Random crop + color jitter are the most important.
- Collapse prevention: BYOL uses momentum encoder. DINO uses centering + sharpening. SimCLR needs large batches.
- Zero-shot classification: Encode class names as text, compute cosine sim with image embedding. No fine-tuning.
- Image search: Pre-compute image embeddings, nearest-neighbor lookup at query time.
- VLM backbone: Extract patch-level features from ViT, project into LLM token space.
- Guidance signal: CLIP score guides diffusion model sampling toward text prompts.
- Speed: Single forward pass per encoder. ViT-L/14: ~5ms on A100. Very fast at inference.
- Linear probing: Freeze encoder, train a linear layer on top. Standard evaluation protocol.
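The zero-shot classification recipe from the list above can be sketched as follows. The toy vectors stand in for real CLIP encoder outputs, and the prompt-engineering step (e.g. encoding "a photo of a {class}") is assumed to have happened upstream:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """CLIP-style zero-shot classification (sketch).

    image_emb: (D,) image embedding; class_text_embs: (C, D) embeddings
    of one text prompt per class. Returns the class whose text embedding
    is most cosine-similar to the image. No fine-tuning involved.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs,
                                           axis=1, keepdims=True)
    return class_names[int(np.argmax(txt @ img))]

# Toy check: the image embedding lies closest to the "cat" prompt
cat = np.array([1.0, 0.2, 0.0])
dog = np.array([0.0, 1.0, 0.3])
img = np.array([0.9, 0.3, 0.1])
print(zero_shot_classify(img, np.stack([cat, dog]), ["cat", "dog"]))  # cat
```

The same normalize-and-argmax pattern, with precomputed image embeddings instead of text ones, gives the image-search use case from the list.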
Model Comparison
| Model | Supervision | Loss | Negatives? | Encoder | ImageNet 0-shot / lin. | Batch Size |
|---|---|---|---|---|---|---|
| CLIP | Language supervision | InfoNCE (symmetric) | Yes (in-batch) | ViT-L/14, RN50 | 76.2% / 85.4% | 32,768 |
| SigLIP | Language supervision | Sigmoid (per-pair) | Yes (in-batch) | ViT-SO400M | 83.1% / — | 32,768 |
| SimCLR | Self-supervised | NT-Xent (InfoNCE) | Yes (in-batch) | ResNet-50 | — / 76.5% | 4,096–8,192 |
| BYOL | Self-supervised | MSE (regression) | No | ResNet-50 | — / 78.6% | 4,096 |
| DINO | Self-supervised | Cross-entropy (soft) | No (teacher-student) | ViT-S/B | — / 78.2% | 1,024 |
| DINOv2 | Self-supervised | DINO + iBOT + KoLeo | No (teacher-student) | ViT-g/14 | 83.5% / 86.5% | ~3,072 |
| MAE | Self-supervised | MSE (pixel reconst.) | No | ViT-H/14 | — / 87.8% | 4,096 |
CLIP vs DINO Feature Comparison
CLIP and DINO learn fundamentally different kinds of features: CLIP's are semantic and language-aligned (good for classification, retrieval, and VLM backbones), while DINO's are dense and spatially aware (good for segmentation, detection, and robotics).