Stanford CS 231n · Lecture 16 · Multi-Modal Foundation Models

Multi-Modal Foundation Models

One model that sees images, reads text, and handles tasks it was never trained for — how we went from one-model-per-task to foundation models that generalize everywhere.

Topics: CLIP & Contrastive Learning · Vision-Language Models · SAM & Open-Set Segmentation · Model Chaining

Chapter 01

From Specialized to Foundation

For decades, computer vision worked like this: you want to classify images, so you build an image classifier. You want to detect objects, so you build a separate object detector. You want to generate captions, so you build a third model. Each task got its own architecture, its own dataset, its own training pipeline. The knowledge learned by one model was largely invisible to the others.

This is wasteful. A model that learns to recognize dogs for classification clearly knows something useful for detection, segmentation, and captioning. But the old paradigm — one model per task — couldn't share that knowledge.

Definition
Foundation Model

A model that is (1) trained on broad data at scale using self-supervision, (2) can be adapted to a wide variety of downstream tasks, and (3) exhibits capabilities that were not explicitly programmed or trained for. The term was coined in 2021 by Stanford's Center for Research on Foundation Models (CRFM), part of Stanford HAI.

The Scaling Hypothesis

Why did we suddenly shift? Because of a bet: that scale alone produces qualitative change. Train a model 10x bigger on 10x more data and you don't get 10x improvement — you get entirely new capabilities that didn't exist at smaller scale.

GPT-2 (1.5B parameters) could barely write coherent paragraphs. GPT-3 (175B) could do arithmetic, translate between languages, and write code — tasks it was never explicitly trained for. These are called emergent abilities: capabilities that appear abruptly at certain model sizes, unpredictable from smaller-scale experiments.

Why Self-Supervision Wins at Scale

Supervised learning requires human labels. ImageNet has 1.28M labeled images — the result of years of crowd-sourcing. But the internet has billions of images with naturally occurring text (alt-text, captions, titles). Self-supervised objectives like "predict the next word" or "match images to their captions" unlock this ocean of data. At scale, the sheer diversity of internet data teaches models concepts that no hand-labeled dataset could cover.

Four Properties of Foundation Models

Property | Old Paradigm | Foundation Model
Task scope | Single task | Many tasks, often zero-shot
Parameters | Millions (ResNet-50: 25M) | Hundreds of millions to billions
Data | Curated, labeled (1M images) | Internet-scale, noisy (400M–5B images)
Supervision | Labels for every example | Self-supervised or weakly supervised

Worked Example — Cost of the Old Paradigm

Suppose you need 5 vision capabilities: classification, detection, segmentation, captioning, and visual question answering. Old approach: train 5 separate models, each needing its own labeled dataset. COCO took 70,000+ annotator hours. ImageNet took 25,000 workers over 2 years. Multiply by 5 tasks. Foundation model approach: train ONE model on internet data (no manual labeling), adapt it to all 5 tasks with minimal supervision.

Emergent ≠ Magical

Emergent abilities are surprising but not mysterious. A model trained on enough internet text will encounter arithmetic problems, translation pairs, and code snippets in its training data. It learns these capabilities because they appear in the data distribution — not because intelligence spontaneously crystallized. The surprise is that generic self-supervised training extracts these patterns without explicit task-specific engineering.

Chapter 02

CLIP — Images Meet Text

Remember SimCLR? It pulled together different augmentations of the same image and pushed apart augmentations of different images. The result was an image encoder whose features could be fine-tuned for many tasks. But SimCLR lived in image-only space — it never saw text.

OpenAI's CLIP (Contrastive Language-Image Pre-training, Radford et al. 2021) asked: what if we replace the second image augmentation with the text caption of that image? Instead of pulling together two crops of the same photo, pull together a photo and its caption.

Architecture

CLIP has two encoders trained jointly:

CLIP Architecture
  1. Image encoder: Either a ResNet (ResNet-50 to ResNet-50x64) or a ViT (ViT-B/32 to ViT-L/14). Takes an image, outputs a single embedding vector.
  2. Text encoder: A Transformer (12 layers, 512-dim, 8 heads). Takes a text caption, outputs a single embedding vector.
  3. Projection: Both embeddings are linearly projected to a shared space of the same dimensionality (512 or 768). Then L2-normalized.

The InfoNCE Loss

Given a batch of N (image, text) pairs, CLIP constructs an N×N similarity matrix. Entry (i, j) is the cosine similarity between image i and text j. The diagonal contains matching pairs; everything off-diagonal is a negative pair.

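CLIP Contrastive Loss (InfoNCE)

$$L_{\text{image}\to\text{text}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(I_i, T_j)/\tau)}$$

$$L_{\text{text}\to\text{image}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\mathrm{sim}(T_i, I_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(T_i, I_j)/\tau)}$$

$$L_{\text{CLIP}} = \frac{L_{\text{image}\to\text{text}} + L_{\text{text}\to\text{image}}}{2}$$

Where $\mathrm{sim}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}$ is cosine similarity and $\tau$ is a learned temperature parameter.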

Each row of the similarity matrix is a softmax classification problem: "which of the N texts matches image i?" And each column is the reverse: "which of the N images matches text j?" The loss symmetrically trains both directions.

Why Temperature Matters

The temperature τ controls the sharpness of the softmax. Small τ makes the distribution peaky: similarities are divided by a small number, so even a modest gap between a match and a non-match turns into a confident prediction. Large τ makes it flat: cosine similarities are bounded in [−1, 1], so the softmax stays close to uniform no matter how well the embeddings separate matches from non-matches. CLIP learns τ as a parameter (initialized to 0.07) instead of tuning it by hand. At convergence, τ ≈ 0.01, meaning CLIP's softmax over similarity scores is very sharp.
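A small sketch in plain Python makes the effect of τ concrete. The similarity values are illustrative, and `softmax_with_temperature` is a helper name introduced here for the example, not something taken from CLIP's code.

```python
import math

def softmax_with_temperature(sims, tau):
    """Softmax over similarities after dividing each one by the temperature tau."""
    logits = [s / tau for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative cosine similarities of one image against four captions.
# Index 0 is the matching caption.
sims = [0.95, 0.30, 0.05, 0.10]

for tau in (1.0, 0.07, 0.01):
    probs = softmax_with_temperature(sims, tau)
    print(f"tau={tau}: p(match)={probs[0]:.4f}")
```

With τ = 1 the matching caption gets less than half of the probability mass even though its similarity is clearly the highest; with τ = 0.07 it gets almost all of it.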

Worked Example — Batch of 4

Batch: {(cat photo, "a cat"), (dog photo, "a dog"), (car photo, "a red car"), (sunset photo, "beautiful sunset")}. Similarity matrix after training:

         "a cat"  "a dog"  "red car"  "sunset"
cat      0.95     0.30     0.05       0.10
dog      0.25     0.92     0.08       0.12
car      0.03     0.06     0.88       0.15
sunset   0.08     0.05     0.12       0.91

The loss pushes the diagonal entries (matching pairs) toward 1.0 and the off-diagonal entries (non-matching pairs) down toward 0.0.
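As a rough check of the formulas, the symmetric loss can be evaluated directly on the 4×4 matrix above. This is a minimal sketch in plain Python; the function names are introduced here for illustration, and τ = 0.07 is simply the initialization value quoted earlier.

```python
import math

def cross_entropy_rows(sim, tau):
    """Mean over rows i of -log softmax(sim[i] / tau)[i] (the diagonal is the target)."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)
    return total / len(sim)

def clip_loss(sim, tau):
    """Average of the image-to-text (rows) and text-to-image (columns) losses."""
    sim_t = [list(col) for col in zip(*sim)]  # transpose to get the column view
    return 0.5 * (cross_entropy_rows(sim, tau) + cross_entropy_rows(sim_t, tau))

# Similarity matrix from the worked example (rows: images, columns: captions).
sim = [
    [0.95, 0.30, 0.05, 0.10],
    [0.25, 0.92, 0.08, 0.12],
    [0.03, 0.06, 0.88, 0.15],
    [0.08, 0.05, 0.12, 0.91],
]

print(f"loss at tau=0.07: {clip_loss(sim, 0.07):.6f}")
print(f"loss at tau=1.00: {clip_loss(sim, 1.0):.6f}")
```

The same matrix gives a near-zero loss at τ = 0.07 and a much larger one at τ = 1, which is the temperature effect seen from the loss side.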

CLIP Contrastive Training Interactive

Watch how a batch of image-text pairs forms a contrastive matrix. Green = matching pairs (pull together). Red = non-matching (push apart). Adjust batch size to see how larger batches enable finer concepts.


Training at Scale

CLIP was trained on WIT (WebImageText) — 400 million image-text pairs scraped from the internet. These aren't curated captions; they're alt-text, titles, and descriptions people wrote for web accessibility or SEO. Noisy, but massively diverse.

The batch size was 32,768. Why so large? Each batch creates 32,768 × 32,768 ≈ 1 billion pairwise comparisons. Larger batches mean more negatives, which forces the model to make finer distinctions. With a batch of 100, "a cat on a sofa" and "a cat on a chair" might never appear in the same batch, so the model never needs to distinguish them.

Zero-Shot Classification

Here's CLIP's killer trick. To classify an image without any labeled training data:

CLIP Zero-Shot Classification
  1. Encode the image with CLIP's image encoder → embedding I.
  2. Encode category names as text: "a photo of a cat", "a photo of a dog", "a photo of a car", ... → embeddings T1, T2, T3, ...
  3. Compute cosine similarity between I and each Tk.
  4. Predict the category with the highest similarity.

This is literally 1-nearest-neighbor classification using text embeddings as the "training data." No labels, no fine-tuning, no training loop. And the prompt template matters: "a photo of a [category]" gives +1.3% over just "[category]" on ImageNet. Ensembling 80 prompt templates ("a blurry photo of a [category]", "a sculpture of a [category]", etc.) adds another +3.5%.
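The four steps above can be sketched in a few lines. The 3-dimensional vectors are hypothetical stand-ins for real encoder outputs (which have 512 or 768 dimensions), and `zero_shot_classify` is a name made up for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def zero_shot_classify(image_emb, text_embs):
    """Return (best_prompt, scores), where scores maps prompt -> cosine similarity."""
    scores = {prompt: cosine(image_emb, emb) for prompt, emb in text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical embeddings; in practice these come from CLIP's text encoder.
text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.1],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]  # pretend output of the image encoder

label, scores = zero_shot_classify(image_emb, text_embs)
print(label)
```

Adding a new class only requires encoding one more prompt; nothing is retrained.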

Matching a Supervised Baseline with Zero Labels

CLIP ViT-L/14 achieves 76.2% top-1 accuracy on ImageNet — matching a supervised ResNet-101 (76.4%). The ResNet was trained on 1.28M labeled ImageNet images. CLIP used zero ImageNet labels. This is the headline result: internet-scale contrastive pre-training can match supervised training without any task-specific labels.

Why CLIP Generalizes Better

CLIP doesn't just match supervised accuracy — it generalizes better. On ObjectNet (everyday objects in unusual poses and contexts), ImageNet-trained models drop 40–45% accuracy. CLIP drops only 25%. On sketches, adversarial images, and domain-shifted data, CLIP consistently outperforms supervised baselines.

Scale Explains the Gap

CLIP ViT-L/14: 307M parameters, trained on 400M images. ResNet-101: 44.5M parameters, trained on 1.28M images. CLIP sees 300x more data and uses 7x more parameters. But scale alone doesn't explain generalization — training on natural language supervision forces the model to learn concepts (the meaning of "cat"), not just patterns (the pixel statistics of ImageNet cats). A model that understands "cat" transfers to any visual context; a model that memorized ImageNet cats doesn't.

Chapter 03

CLIP's Limits & Fixes

CLIP is impressive but brittle in specific ways. Understanding these failures reveals the boundaries of contrastive learning.

The Compositionality Problem

Show CLIP two captions: "a mug in some grass" and "some grass in a mug." To a human, these are obviously different — one is a mug sitting on grass, the other is a bizarre scene. CLIP assigns nearly identical similarity scores to both. It sees the same words and ignores their order.

CLIP Can't Compose

CLIP encodes text as a single vector — a "bag of concepts." "Red car on blue road" and "blue car on red road" produce almost the same embedding. The model recognizes individual concepts (red, blue, car, road) but fails to bind them correctly (which thing is which color). This is the compositionality failure.

Why does this happen? Think about the batch. Even with N = 32,768, what are the odds that one batch contains both "a mug in some grass" AND "some grass in a mug"? Essentially zero. If the model never sees both in the same batch, it never needs to distinguish them. The contrastive loss only pushes apart captions that co-occur as negatives.

Fix #1: Hard Negatives (NegCLIP)

Yuksekgonul et al. (2023) proposed NegCLIP: deliberately construct confusing negative pairs. Take "a mug in some grass" and shuffle the nouns to get "some grass in a mug." Include both in the same batch. Now the contrastive loss forces the model to pay attention to word order.

Worked Example — Hard Negatives

Original caption: "A black cat and a brown dog sitting on a bench."

Hard negative (attribute swap): "A brown cat and a black dog sitting on a bench."

Hard negative (relation swap): "A bench sitting on a black cat and a brown dog."

When both appear in the same batch, the model must learn: black goes with cat, brown goes with dog. Word order encodes meaning.
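A hard negative of this kind can be produced with a simple word swap. The sketch below is one illustrative way to do it, not the procedure from the NegCLIP paper; `swap_words` is a hypothetical helper.

```python
import re

def swap_words(caption, a, b):
    """Swap every whole-word occurrence of `a` and `b` in the caption."""
    pattern = re.compile(rf"\b({re.escape(a)}|{re.escape(b)})\b")
    return pattern.sub(lambda m: b if m.group(1) == a else a, caption)

caption = "A black cat and a brown dog sitting on a bench."
hard_negative = swap_words(caption, "black", "brown")
print(hard_negative)  # A brown cat and a black dog sitting on a bench.
```

A purely textual swap like this cannot tell whether the result still describes the image, so the swapped captions would still need filtering before being used as negatives.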

Hard Negatives Create Hard Positives

Aggressive shuffling can produce negatives that are actually valid. "A dog next to a cat" shuffled to "A cat next to a dog" — these describe the same image! The model is penalized for matching the image to a correct caption. This damages performance on normal data. Careful filtering is needed.

Fix #2: Region Captions

Image-level captions like "a living room" are vague. The image contains a couch, house plants, a lamp, and a window — none of which the caption mentions. Models trained on such data learn to match "living room" to the overall scene but can't ground specific objects.

The fix: train on region-level captions paired with bounding box coordinates. "A red couch" anchored to coordinates (120, 340, 580, 620) in the image. Now the model must learn fine-grained correspondence between text and image regions.

Fix #3: SigLIP — Remove the Batch Size Dependency

CLIP's InfoNCE loss requires a softmax over the entire batch. This creates a hard dependency on batch size: the denominator sums N terms, so gradients and difficulty scale with N. Smaller batches give a weaker learning signal.

SigLIP Loss (Zhai et al. 2023)

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\log\sigma\!\left(y_{ij}\left(\frac{\mathrm{sim}(I_i, T_j)}{\tau} - b\right)\right)$$

Where $y_{ij} = +1$ if $i = j$ (match), $-1$ otherwise. $\sigma$ is the sigmoid function, $b$ is a learned bias.

SigLIP replaces the softmax (global normalization across the batch) with a sigmoid (independent binary decision per pair). Each (image, text) pair is independently classified as matching or not. No denominator over the batch. This makes SigLIP much less sensitive to batch size than CLIP: it holds up better with small batches and does not need very large ones to do well, which removes one of CLIP's biggest engineering constraints.
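The pairwise sigmoid loss is short enough to write out. This is a minimal sketch of the formula as stated in this section; the values τ = 0.1 and b = 5.0 are arbitrary choices for illustration, not the paper's settings, and the similarity matrix is reused from the CLIP worked example.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss(sim, tau, b):
    """-(1/N) * sum over all pairs of log sigmoid(y_ij * (sim_ij / tau - b))."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            y = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total += log_sigmoid(y * (sim[i][j] / tau - b))
    return -total / n

sim = [
    [0.95, 0.30, 0.05, 0.10],
    [0.25, 0.92, 0.08, 0.12],
    [0.03, 0.06, 0.88, 0.15],
    [0.08, 0.05, 0.12, 0.91],
]

aligned = siglip_loss(sim, tau=0.1, b=5.0)
misaligned = siglip_loss(sim[::-1], tau=0.1, b=5.0)  # rows reversed: wrong pairs on the diagonal
print(f"aligned: {aligned:.4f}  misaligned: {misaligned:.4f}")
```

Each term depends only on its own pair, so the sum can be split across devices or computed in chunks without a batch-wide normalizer.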

Fix #4: CoCa — Add a Decoder

CLIP can match images to text, but it can't generate text. CoCa (Contrastive Captioner, Yu et al. 2022) pairs a CLIP-style image and text encoder with a text decoder and trains the whole model with both a contrastive loss and a captioning loss. The contrastive branch aligns representations; the captioning branch teaches the model to produce language. Two losses, one model, better at both tasks.

Data Quality Over Quantity

MetaCLIP (Xu et al. 2023) showed that 400M carefully curated pairs outperform 2B noisy pairs. The curation strategy: start with Wikipedia concepts, match internet images to these concepts, and balance the dataset so no concept dominates. Clean, balanced data beats massive, noisy data. This principle recurs throughout foundation model research.

Chapter 04

LLaVA — Visual Language Models

Large Language Models like LLaMA can answer questions, write code, and reason about text. They do this through next-token prediction: given all previous tokens, predict the next one. This simple objective enables zero-shot generalization to countless tasks. But LLMs are blind — they process only text.

LLaVA (Large Language and Vision Assistant, Liu et al. 2023) asks: what if we could feed images into an LLM as if they were text tokens?

Historical Context

ViLBERT (Lu et al. 2019) was an early attempt at combining vision and language. It used co-attention between image regions and text tokens, but required task-specific fine-tuning for each downstream task. Every new task (VQA, captioning, retrieval) needed a new fine-tuning run with labeled data. LLaVA's insight was to leverage the LLM's existing instruction-following ability, requiring only a lightweight alignment step.

The LLaVA Architecture

LLaVA Architecture (3 Components)
  1. Image encoder (frozen CLIP ViT): Takes the image, extracts features from the penultimate layer. This produces a grid of patch tokens (e.g., 256 tokens for a 16×16 patch grid), not just the CLS token.
  2. Projection layer: A simple linear layer (or small MLP) that maps each CLIP patch token (dim 1024) to the LLM's input space (dim 4096). This is the "bridge" between vision and language.
  3. LLM decoder (LLaMA): The pre-trained language model. Receives a sequence of [visual tokens | text tokens] and does standard autoregressive next-token prediction.
Why the Penultimate Layer?

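CLIP's contrastive objective reads only a single pooled CLS token from the final layer. That is great for image-level classification, but a single vector throws away spatial information. "Where is the cat?" requires knowing which part of the image contains the cat. LLaVA instead takes the full grid of patch tokens from the penultimate layer, each encoding a different spatial region. The final layer has patch tokens too, but that layer is trained to serve the global summary, so its features are thought to be less useful for localized detail. LLaVA feeds all 256 patch tokens to the LLM, preserving spatial detail.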

Training: Two Stages

LLaVA's training is remarkably lightweight:

LLaVA Training Pipeline
  1. Stage 1: Feature alignment (frozen LLM, frozen CLIP). Only the linear projection layer is trained. Data: 595K image-caption pairs. Objective: make the projected visual tokens "look like" text tokens to the LLM. Think of it as teaching the bridge to speak the LLM's language.
  2. Stage 2: Visual instruction tuning (train LLM + projection). Unfreeze the LLM and keep training the projection layer. Data: ~150K instruction-following conversations about images (e.g., "What's happening in this image?" → detailed response), generated by a text-only GPT-4 that was given image captions and bounding boxes rather than the images themselves. The CLIP encoder stays frozen.
Worked Example — LLaVA Forward Pass

Input image: A cat sleeping on a keyboard. Input text: "What is the animal doing?"

Step 1: CLIP ViT processes the image → 256 patch tokens of dimension 1024.

Step 2: Linear projection maps each token: 1024 → 4096. Now we have 256 "visual words."

Step 3: Concatenate: [visual token 1, ..., visual token 256, "What", "is", "the", "animal", "doing", "?"]

Step 4: LLaMA processes the full sequence autoregressively and generates: "The cat is sleeping on a computer keyboard."

The LLM treats visual tokens exactly like text tokens — same attention mechanism, same position embeddings, same next-token prediction.
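The projection-and-concatenation steps can be traced with toy sizes. In this sketch the dimensions (4 patches, 3 → 5) and the random weights are hypothetical, chosen so the example runs instantly in plain Python; the real sizes from the text are 256 patches and 1024 → 4096.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def build_llm_input(patch_tokens, text_embeddings, weight, bias):
    """Project each visual patch token, then place the results before the text embeddings."""
    visual_tokens = [linear(p, weight, bias) for p in patch_tokens]
    return visual_tokens + text_embeddings

random.seed(0)
num_patches, vision_dim, llm_dim, num_text = 4, 3, 5, 6  # toy sizes
weight = [[random.gauss(0, 0.1) for _ in range(vision_dim)] for _ in range(llm_dim)]
bias = [0.0] * llm_dim
patches = [[random.random() for _ in range(vision_dim)] for _ in range(num_patches)]
text = [[random.random() for _ in range(llm_dim)] for _ in range(num_text)]

sequence = build_llm_input(patches, text, weight, bias)
print(len(sequence), len(sequence[0]))  # sequence length, token dimension

# Parameter count of a real-sized linear bridge (1024 -> 4096, with bias).
print(1024 * 4096 + 4096)
```

At the real sizes the linear bridge has about 4.2M parameters, tiny next to the LLM, which is part of why Stage 1 is cheap.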

Why This Works

The key insight is that LLMs already know how to reason about visual scenes from text descriptions in their training data. They've read millions of image descriptions, scene analyses, and visual dialogues. The projection layer doesn't teach the LLM about vision — it teaches the visual tokens to speak in the format the LLM already understands. The LLM's existing knowledge does the heavy lifting.

LLaVA-1.5 Improvements

The original LLaVA used a simple linear projection. LLaVA-1.5 (Liu et al. 2024) upgraded to a 2-layer MLP with GELU activation, increased image resolution to 336×336 (from 224×224), and used a larger LLM (Vicuna-13B). These simple changes improved benchmark scores by 10–15% across the board, demonstrating that the architecture was sound — it just needed more capacity at the interface.

Chapter 05

Flamingo — Cross-Attention Fusion

LLaVA concatenates visual and text tokens into one long sequence. This is simple but expensive: 256 visual tokens + 512 text tokens = 768 tokens, and attention scales quadratically. For multiple images or high-resolution inputs, the sequence explodes.

Flamingo (Alayrac et al., DeepMind 2022) takes a different approach: instead of concatenating, it uses cross-attention layers inserted between the frozen LLM layers. The text tokens attend to visual features without the visual tokens bloating the main sequence.

Architecture: Three Components

Flamingo Architecture
  1. Vision encoder (frozen): A pre-trained NFNet (or similar) that processes each image independently into a set of feature tokens.
  2. Perceiver Resampler: A learned module that compresses the variable number of visual tokens (from different image sizes) into a fixed number (64 tokens). It uses learned query vectors that cross-attend to the image features. Think of it as learned pooling: the model decides which visual information to keep.
  3. Gated cross-attention layers: New layers inserted between the frozen LLM layers (at every layer in the smallest Flamingo model, at every few layers in the larger ones). In each, text tokens (queries) attend to visual tokens (keys/values). A learned gating scalar α controls how much visual information flows in.
Definition
Gated Cross-Attention

x' = x + tanh(α) · CrossAttention(x, visual_tokens), and x' is then passed to the frozen LLM layer as usual. The gate α is initialized to 0, so at the start of training, tanh(α) = 0 and the visual information has zero effect. The model gradually learns to "open the gate" as training progresses. This preserves the LLM's pre-trained language abilities during early training.

Why Gating Matters

Without gating, the randomly initialized cross-attention layers would inject noise into the frozen LLM from step 1. This corrupts the LLM's carefully learned language representations. With gating initialized at 0, the model starts as a pure LLM (visual signal = zero), then gradually incorporates vision. It's like adding a new instrument to an orchestra — start silent, fade in slowly, never disrupt the existing harmony.
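The gate itself is a one-line computation. In this minimal sketch the vectors are made-up numbers, and `visual_update` stands in for the output of a cross-attention layer that is not implemented here.

```python
import math

def gated_residual(x, visual_update, alpha):
    """Return x + tanh(alpha) * visual_update, elementwise."""
    gate = math.tanh(alpha)
    return [xi + gate * ui for xi, ui in zip(x, visual_update)]

x = [0.5, -1.0, 2.0]               # text-token features (hypothetical values)
visual_update = [10.0, -7.0, 3.0]  # stand-in for a cross-attention output

print(gated_residual(x, visual_update, alpha=0.0))  # gate closed: identical to x
print(gated_residual(x, visual_update, alpha=0.5))  # gate partly open
```

Because tanh is bounded in (−1, 1), the visual contribution can never be scaled up by the gate, only faded in.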

VLM Architecture Comparison Interactive

Compare how LLaVA and Flamingo fuse visual and text information. LLaVA concatenates tokens; Flamingo uses cross-attention. Click to toggle between them.


Interleaved Image-Text Training

Flamingo trains on sequences that naturally interleave images and text, like web pages: [image1] "A golden retriever playing fetch." [image2] "The same dog swimming in a lake." The text after each image can only attend to its nearest preceding image (masked attention). This teaches the model to ground language in the right visual context.
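The masking rule can be expressed as a lookup from each position in the sequence to the image it is allowed to see. This is a sketch under assumptions: the `<image>` marker and the function name are placeholders invented for the example, not Flamingo's actual tokens or code.

```python
def nearest_preceding_image(tokens, image_marker="<image>"):
    """For each position, the index of the most recent image (None before the first image)."""
    visible = []
    image_index = None
    for tok in tokens:
        if tok == image_marker:
            image_index = 0 if image_index is None else image_index + 1
        visible.append(image_index)
    return visible

tokens = ["<image>", "A", "golden", "retriever", "<image>", "The", "same", "dog"]
print(nearest_preceding_image(tokens))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

A cross-attention mask would then let position p attend only to the visual tokens of image `visible[p]`.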

In-Context Learning

Flamingo's most striking capability: show it a few (image, text) examples, then present a new image, and it generates the right text — without any gradient updates. This is few-shot in-context learning, exactly like GPT-3 does for text tasks.

Worked Example — Few-Shot Visual QA

Prompt sequence:

[photo of 3 apples] "How many items? 3"

[photo of 5 oranges] "How many items? 5"

[photo of 2 bananas] "How many items?"

Flamingo generates: "2"

No fine-tuning. No labeled counting dataset. The model inferred the task format from two examples and applied it to a new image.

LLaVA vs. Flamingo

Property | LLaVA | Flamingo
Fusion method | Concatenate visual + text tokens | Cross-attention between layers
Visual token count | All patch tokens (256+) | Fixed (64 via Perceiver)
LLM modification | None (tokens go in naturally) | New cross-attention layers inserted
Multi-image | Difficult (sequence too long) | Natural (interleaved format)
Trained params | LLM + projection | Perceiver + cross-attention + gates
Simplicity | Simpler (just concatenate) | More complex (new architecture)

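A back-of-the-envelope count shows why the fusion method matters for cost. The sketch below counts query-key pairs for a single attention layer using the token counts quoted in this chapter (256 visual, 512 text, 64 resampled). It ignores causal masking, the number of layers, and the cost of the vision modules themselves, so treat it as a rough comparison only.

```python
def llava_attention_pairs(n_visual, n_text):
    """Self-attention over the concatenated sequence: (n_visual + n_text)^2 pairs."""
    n = n_visual + n_text
    return n * n

def flamingo_attention_pairs(n_text, n_visual_resampled=64):
    """Text self-attention plus text-to-visual cross-attention."""
    return n_text * n_text + n_text * n_visual_resampled

print(llava_attention_pairs(256, 512))  # 589824
print(flamingo_attention_pairs(512))    # 294912
```

Under these assumptions the concatenated sequence costs about twice as many attention pairs, and the gap widens as more images are added.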
Chapter 06

Molmo — Open-Source Champion

Most top-performing VLMs are closed-source: GPT-4V, Gemini, Claude. You can use their APIs, but you can't inspect their training data, reproduce their results, or verify their safety claims. Molmo (Deitke et al., Allen AI, 2024) set out to build a state-of-the-art VLM that is completely open: open weights, open data, open code, open evaluations.

The Data Problem: Incidental vs. Intentional

Most internet image-text data is incidental: alt-text written for accessibility ("IMG_3847.jpg"), SEO keywords ("cheap red shoes buy now"), or tangential descriptions ("Photo by @user"). This data is abundant (billions of pairs) but low quality.

Molmo's key innovation is intentional data: the PixMo dataset of 700K image-description pairs created with a specific annotation protocol.

PixMo Data Collection Protocol
  1. Human annotators look at an image and speak aloud for 60–90 seconds, describing what they see. Speech is transcribed to text.
  2. Seven guided questions elicit dense coverage: (1) First glance, (2) Object inventory with counts, (3) Any visible text, (4) Spatial positions of key objects, (5) Subtle details others might miss, (6) Background and setting, (7) Style, lighting, and colors.
  3. This produces dense, exhaustive captions averaging 200+ words per image — far richer than typical alt-text (5–15 words).
700K Intentional > 6B Incidental

Llama 3.1-V trained on 6 billion incidental internet image-text pairs. Molmo trained on 700K intentional PixMo pairs. Molmo 72B ranks second only to GPT-4o on academic benchmarks. Molmo 7B beats Gemini 1.5 Flash and GPT-4V. Quality of data trumps quantity by orders of magnitude. This is the strongest evidence yet for the "data quality > data quantity" principle.

Pointing: Grounding in Pixels

Molmo can point at things in images. Ask "Where is the cat?" and it outputs pixel coordinates. Ask "How many people are there?" and it counts by pointing to each one. This grounds the model's reasoning in the actual image, preventing hallucination (making up objects that aren't there).

Worked Example — Counting by Pointing

Question: "How many birds are in this photo?"

Non-pointing model: "There are 5 birds in the photo." (Might be wrong — no way to verify.)

Molmo: "There are 4 birds." [points to each: (120, 45), (340, 89), (510, 132), (200, 201)] (Each point can be verified against the image. If the model misses one, you see it.)

Architecture and Training

Molmo uses a connector similar to LLaVA's projection but with multi-scale features from different layers of the vision encoder. The training recipe is deliberately simple, with two stages: (1) caption pre-training on the dense PixMo captions, and (2) supervised fine-tuning on diverse instruction data. The Molmo report describes no RLHF or preference-tuning stage. The connector design pools features from multiple ViT layers at different resolutions, capturing both fine details and global context.

The Distillation Problem

Many "open" models are trained on outputs from GPT-4 or Claude — this is distillation, not independent training. You can't verify the training data because you can't access the teacher model's training data. Molmo avoids this: PixMo was collected by human annotators describing images directly. No teacher model in the loop.

Chapter 07

SAM — Segment Anything

Mask R-CNN can segment 80 object categories from COCO. DeepLab can do 21 categories from Pascal VOC. But what if you encounter a papillon butterfly, a spectrometer, or a medieval trebuchet? Closed-set segmentation fails on anything outside its fixed vocabulary.

SAM (Segment Anything Model, Kirillov et al., Meta 2023) is a promptable segmentation model: give it a point, a box, or a text description, and it segments the corresponding object — any object, in any image, without ever having seen that object category before.

Architecture: Heavy-Light Split

SAM Architecture
  1. Image encoder (heavy): A ViT-H (632M params) processes the full image once and outputs a spatial feature map. This is the expensive step — ~150ms on a GPU.
  2. Prompt encoder (lightweight): Encodes the user's prompt: points become positional embeddings, boxes become corner embeddings, text is encoded via CLIP. The output is a small set of prompt tokens.
  3. Mask decoder (lightweight): A small Transformer (2 layers) takes the image features + prompt tokens and outputs segmentation masks. This runs in ~10ms — fast enough for interactive use.

The key design: the image encoder runs once per image. The prompt encoder and mask decoder run once per prompt. You can issue hundreds of different prompts (click different points) and only pay for the fast decoder each time. This amortization is what makes SAM interactive.
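The amortization is easy to quantify with the two timings quoted above. The "naive" baseline, which re-runs the image encoder for every prompt, is a hypothetical comparison point introduced for this sketch.

```python
ENCODER_MS = 150  # image encoder, once per image (figure from the text)
DECODER_MS = 10   # prompt encoder + mask decoder, once per prompt (figure from the text)

def amortized_ms(num_prompts):
    """Encode the image once, then decode once per prompt."""
    return ENCODER_MS + DECODER_MS * num_prompts

def naive_ms(num_prompts):
    """Hypothetical baseline that re-encodes the image for every prompt."""
    return (ENCODER_MS + DECODER_MS) * num_prompts

for n in (1, 10, 100):
    print(n, amortized_ms(n), naive_ms(n))
```

For a single prompt the two are identical; at 100 prompts the amortized design takes 1,150 ms against 16,000 ms for the baseline.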

Handling Ambiguity: Three Masks

When you click a point on a person's shirt, what should be segmented? The shirt? The person? The group of people? All three are valid interpretations. SAM handles this by outputting three masks simultaneously, each at a different granularity level, plus a confidence score for each.

Worked Example — Ambiguous Point Prompt

Click: On the wheel of a car.

Mask 1 (part): Just the wheel. Confidence: 0.85.

Mask 2 (object): The entire car. Confidence: 0.92.

Mask 3 (scene): The car and the road it's on. Confidence: 0.71.

The user or downstream system picks the granularity they need. The loss during training is computed only against the best-matching mask (minimum loss over the three), so the model learns to produce diverse, valid outputs.

SAM Multi-Mask Loss

$$L = \min_{k \in \{1,2,3\}} \left[ L_{\text{mask}}(\hat{M}_k, M_{\text{gt}}) + \lambda \cdot L_{\text{IoU}}(\hat{M}_k, M_{\text{gt}}) \right]$$

Only the mask closest to ground truth contributes to the gradient. This is a multiple-hypothesis loss — the model proposes several answers and is judged on its best one.
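Here is a minimal sketch of the minimum-over-hypotheses idea on tiny binary masks. The per-mask terms are simplified stand-ins chosen for readability: 1 − IoU plays the role of the mask loss, and a squared error between predicted and actual IoU plays the role of the IoU loss. They are assumptions of this example, not SAM's exact loss terms.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat lists of 0/1."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter / union if union else 1.0

def multimask_loss(pred_masks, pred_ious, target, lam=1.0):
    """Return (smallest loss, index of the best mask) over the candidate masks."""
    losses = []
    for mask, predicted_iou in zip(pred_masks, pred_ious):
        true_iou = iou(mask, target)
        mask_loss = 1.0 - true_iou                  # stand-in for L_mask
        iou_loss = (predicted_iou - true_iou) ** 2  # stand-in for L_IoU
        losses.append(mask_loss + lam * iou_loss)
    best = min(range(len(losses)), key=losses.__getitem__)
    return losses[best], best

target = [1, 1, 0, 0, 0, 0]  # ground truth: just the wheel
pred_masks = [
    [1, 1, 0, 0, 0, 0],      # part
    [1, 1, 1, 1, 0, 0],      # object
    [1, 1, 1, 1, 1, 1],      # scene
]
pred_ious = [0.85, 0.92, 0.71]  # confidence scores from the worked example

loss, best = multimask_loss(pred_masks, pred_ious, target)
print(best, round(loss, 4))
```

Only the selected mask would receive gradient, so the other two outputs stay free to cover coarser interpretations.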
SAM Prompting Demo Interactive

Click on objects to place point prompts. SAM produces 3 masks at different granularity levels. Click different locations to see how prompts change the output.


The SA-1B Data Engine

SAM was trained on SA-1B: 11 million images, 1.1 billion masks. How do you annotate a billion masks? You don't — you use a model-in-the-loop data engine with three stages:

SA-1B Data Engine (3 Stages)
  1. Manual stage: Professional annotators manually segment objects using an interactive tool (SAM's early version assists with auto-complete). 120K images, 4.3M masks. This bootstraps the first model.
  2. Semi-automatic stage: SAM proposes masks automatically. Annotators correct and add missing ones. 180K images, 5.9M masks. Faster because SAM does most of the work.
  3. Fully automatic stage: SAM generates masks with zero human input using a grid of point prompts (32×32 = 1024 points per image). NMS and filtering remove duplicates and low-confidence masks. 11M images, 1.1B masks. Fully scalable.
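The grid of point prompts used in the fully automatic stage can be generated as below. Placing each point at the center of its grid cell in normalized coordinates is an assumption of this sketch; the text only specifies the 32×32 count.

```python
def point_grid(points_per_side):
    """Evenly spaced (x, y) points in normalized [0, 1] coordinates, one per grid cell center."""
    step = 1.0 / points_per_side
    coords = [(i + 0.5) * step for i in range(points_per_side)]
    return [(x, y) for y in coords for x in coords]

grid = point_grid(32)
print(len(grid))          # 1024
print(grid[0], grid[-1])  # first and last prompt locations
```

Each point would be sent to the model as a separate prompt, and the resulting masks deduplicated and filtered afterwards.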
Model-in-the-Loop Flywheel

Each stage produces a better model, which makes the next stage cheaper and faster. Stage 1 is expensive (full manual annotation) but produces a mediocre model. Stage 2 uses that model to pre-fill masks, halving annotation time. Stage 3 removes humans entirely, scaling to a billion masks. The model improves its own training data — a virtuous cycle.

SAM 2: Video Segmentation

SAM processes individual frames. SAM 2 (Ravi et al. 2024) extends this to video: prompt on one frame, and the mask propagates through time. It adds a memory mechanism that tracks the segmented object across frames, handling occlusion, deformation, and camera motion. The data engine expands to 50.9K videos with 642.6K masklets (mask tracks over time).

Chapter 08

Chaining Foundation Models

Each foundation model has strengths and blind spots. CLIP knows about common objects but can't recognize rare breeds. SAM can segment anything but doesn't know what it segmented. LLMs can reason but can't see. The most powerful approach is to chain them together, using each model where it excels.

CuPL: Language Describes What CLIP Can't See

Problem: CLIP doesn't know what a "papillon" looks like — this dog breed rarely appears in alt-text. Solution: ask an LLM to describe a papillon.

Worked Example — CuPL (Pratt et al. 2023)

Prompt to GPT-3: "Describe what a papillon dog looks like."

GPT-3 response: "A small dog with large butterfly-shaped ears, long silky fur, a plumed tail carried over the back, and a white coat with patches of color."

Use as CLIP text prompt: Encode this description instead of just "papillon." CLIP has seen many images of small dogs with butterfly ears and silky fur — it just didn't know the breed name. The LLM bridges the vocabulary gap.

Visual Programming: LLMs Write Vision Code

VisProg (Gupta & Kembhavi 2023) goes further: the LLM doesn't just describe, it writes a short Python-like program whose steps call vision models as subroutines.

Worked Example — VisProg

Question: "Are there exactly 3 people in the boat?"

LLM generates code:

boat_box = LOC(image, "boat")
boat_crop = CROP(image, boat_box)
people = DETECT(boat_crop, "person")
count = COUNT(people)
answer = EVAL(count == 3)

The model-backed calls each invoke a foundation model: here LOC uses Grounding DINO and DETECT uses OWL-ViT. CROP, COUNT, and EVAL are pure Python. The LLM is the orchestrator; specialist models are the workers.

Why Chaining Beats Monolithic Models

A single model trying to do everything suffers from task interference: getting better at segmentation might hurt captioning. Chaining avoids this by keeping each model focused on its specialty. It's also modular: swap out a better segmentation model and the whole pipeline improves, without retraining anything. And it's interpretable: you can inspect each step's output to find where errors occur.

ViperGPT: Code as Inference

ViperGPT (Suris et al. 2023) takes the same idea but executes the generated code in a Python runtime, enabling loops, conditionals, and complex logic. It can answer questions like "Which object in the image is the same color as the sky?" by iterating over detected objects, extracting their colors, comparing to the sky region, and returning the match.

Agentic Vision Systems

The next frontier: models that don't just answer questions but take actions. Tool-augmented VLMs can browse the web (click buttons, fill forms), control robots (plan grasps, navigate), or generate and edit images. The LLM reasons about what to do; vision models perceive the environment; action models execute. This is the beginning of agentic AI — foundation models as autonomous actors, not just question-answerers.

Error Propagation

Chaining has a critical weakness: errors compound. If LOC misidentifies the boat, everything downstream is wrong. Suppose each model in the chain has some error rate ε and the errors are independent. With K models in sequence, the chance of a correct final answer is roughly (1 − ε)^K. With ε = 0.1 and K = 5, that is 0.9^5 ≈ 0.59, so you're correct only 59% of the time. Reducing individual error rates matters more in chains than in standalone models.
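The compounding arithmetic is easy to check. The sketch below assumes, as above, that every stage has the same error rate and that errors are independent; `chain_accuracy` and `max_error_rate` are hypothetical helper names.

```python
def chain_accuracy(error_rate: float, num_models: int) -> float:
    """Probability that all stages are correct: (1 - eps) ** K."""
    return (1.0 - error_rate) ** num_models


def max_error_rate(target_accuracy: float, num_models: int) -> float:
    """Largest per-stage error rate that still reaches the target accuracy."""
    return 1.0 - target_accuracy ** (1.0 / num_models)


for k in (1, 3, 5, 10):
    print(k, round(chain_accuracy(0.1, k), 3))

# Per-stage error budget for a 5-stage chain that should be right 90% of the time.
budget = max_error_rate(0.9, 5)
print(round(budget, 4))
```

Inverting the formula shows the cost of chaining: a 5-stage chain needs each stage to err only about 2% of the time to reach 90% end-to-end accuracy.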

Chapter 09

Showcase: Zero-Shot Classifier

Let's put CLIP's zero-shot classification trick together from scratch. In this interactive demo, you'll see how image and text embeddings are compared, how prompt templates affect accuracy, and how the cosine similarity space looks in 2D.

CLIP Zero-Shot Classification Interactive

Select an image category and watch how text prompts are compared against the image embedding. Toggle ensemble mode to see how multiple prompt templates improve accuracy.

What's Happening Under the Hood

Each text prompt ("a photo of a cat") is encoded to a vector. The image is encoded to a vector. We compute cosine similarity between the image vector and every text vector. The highest similarity wins. With ensembling, we average the text vectors from multiple prompts ("a photo of a cat", "a blurry photo of a cat", "a drawing of a cat") before comparing — this denoises the text representation and improves accuracy.
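The procedure above can be sketched in a few lines. The 3-D vectors here are made up for illustration; real CLIP embeddings come from the image and text encoders and have hundreds of dimensions. The function names are likewise hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def ensemble(prompt_vectors):
    """Average unit-normalized prompt embeddings, then renormalize."""
    unit = [normalize(v) for v in prompt_vectors]
    mean = [sum(col) / len(unit) for col in zip(*unit)]
    return normalize(mean)


def classify(image_vec, class_prompts):
    """Score each class by cosine similarity to its ensembled text vector."""
    scores = {label: cosine(image_vec, ensemble(vectors))
              for label, vectors in class_prompts.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings (assumption): one image, three prompt vectors per class.
image_vec = [0.9, 0.1, 0.2]
class_prompts = {
    "cat": [[1.0, 0.0, 0.3], [0.8, 0.3, 0.1], [0.9, -0.1, 0.4]],
    "car": [[0.1, 1.0, 0.0], [0.0, 0.9, 0.3], [0.2, 0.8, -0.1]],
}

label, scores = classify(image_vec, class_prompts)
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Passing a single prompt vector per class reduces `ensemble` to plain normalization, so the same code covers the non-ensembled case.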

Embedding Space Visualization

In the demo above, the bar chart shows cosine similarities. But what does the embedding space look like? Imagine projecting all image and text embeddings into 2D. Matching pairs cluster together: the "cat" text vectors sit near the cat image vectors, far from car vectors. Ensembling moves the text cluster closer to the image point by averaging out noise.

Worked Example — Ensemble Prompts

Single prompt: "a photo of a cat" → similarity = 0.82.

Ensemble (5 prompts):

"a photo of a cat" → 0.82

"a blurry photo of a cat" → 0.76

"a sketch of a cat" → 0.69

"a close-up of a cat" → 0.85

"a photo of a small cat" → 0.80

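Similarity of the averaged embedding: 0.84 (higher than any individual prompt except "close-up"). This is the similarity between the image and the average of the five text embeddings, not the average of the five scores. The averaged vector is a more robust representation of the concept "cat."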
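A short derivation shows why the ensemble score can exceed the plain mean of the five scores. It assumes (the example does not state this) that the image embedding v and the prompt embeddings t_i are unit-normalized and that the averaged text vector is renormalized before comparison:

```latex
\bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i,
\qquad
\cos\!\left(v, \frac{\bar{t}}{\lVert \bar{t} \rVert}\right)
  = \frac{v \cdot \bar{t}}{\lVert \bar{t} \rVert}
  = \frac{\frac{1}{n}\sum_{i=1}^{n} v \cdot t_i}{\lVert \bar{t} \rVert}
  \;\ge\; \frac{1}{n}\sum_{i=1}^{n} v \cdot t_i
\quad \text{since } \lVert \bar{t} \rVert \le 1 \text{ and the mean score is positive.}
```

Here the mean score is (0.82 + 0.76 + 0.69 + 0.85 + 0.80) / 5 = 0.784, so an ensemble score of 0.84 corresponds to ‖t̄‖ ≈ 0.784 / 0.84 ≈ 0.93: the five prompt vectors point in similar but not identical directions.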

Chapter 10

Summary & Connections

Evolution Timeline

Year | Model | Contribution
2020 | SimCLR | Image-image contrastive pre-training
2021 | CLIP | Image-text contrastive, zero-shot transfer
2022 | Flamingo | Cross-attention VLM, in-context learning
2022 | CoCa | Contrastive + captioning in one model
2023 | LLaVA | Simple projection from CLIP to LLM
2023 | SAM | Promptable segmentation, 1B masks
2023 | SigLIP | Sigmoid loss, no batch size dependency
2023 | VisProg | LLM writes code to chain vision models
2024 | LLaVA-1.5 | MLP projection, higher resolution
2024 | Molmo | Open-source VLM, intentional data, pointing
2024 | SAM 2 | Video segmentation with memory

Model Comparison

Model | Input | Output | Key Trick | Open?
CLIP | Image + text | Similarity score | Contrastive pre-training at scale | Weights
LLaVA | Image + question | Text answer | Project CLIP tokens into LLM | Weights + data
Flamingo | Images + text | Text | Gated cross-attention, Perceiver | No
Molmo | Image + question | Text + points | Intentional data, pointing | Everything
SAM | Image + prompt | Masks | Heavy/light split, multi-mask | Weights + data

Five Principles of Foundation Models

The Core Lessons

1. Scale unlocks emergence. Capabilities appear at model/data sizes that couldn't be predicted from smaller experiments.

2. Self-supervision unlocks data. Labels are expensive and limited; image-text pairs that occur naturally on the web are nearly free and far more plentiful.

3. Quality beats quantity. 700K intentional captions beat 6B incidental ones. Curation strategy matters more than crawl size.

4. Compositionality is hard. Models learn individual concepts before learning to combine them. Explicit hard negatives and region-level training help.

5. Chain, don't monolith. Specialized models composed together can outperform single models trying to do everything, provided each step's error rate stays low enough that errors don't compound.

Connections to Other Lessons

CLIP builds directly on contrastive learning ideas from CLIP & Contrastive Learning. The vision-language models here extend concepts from Vision-Language Models. SAM uses the ViT architecture covered in Transformers. The chaining approach connects to Vision-Language-Action Models.

References

Radford et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML, 2021. (CLIP)
Liu et al. "Visual Instruction Tuning." NeurIPS, 2023. (LLaVA)
Alayrac et al. "Flamingo: a Visual Language Model for Few-Shot Learning." NeurIPS, 2022.
Deitke et al. "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models." 2024.
Kirillov et al. "Segment Anything." ICCV, 2023. (SAM)
Zhai et al. "Sigmoid Loss for Language Image Pre-Training." ICCV, 2023. (SigLIP)
Yu et al. "CoCa: Contrastive Captioners are Image-Text Foundation Models." 2022.
Gupta & Kembhavi. "Visual Programming: Compositional visual reasoning without training." CVPR, 2023. (VisProg)
Pratt et al. "What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification." ICCV, 2023. (CuPL)
Xu et al. "Demystifying CLIP Data." ICLR, 2024. (MetaCLIP)
Yuksekgonul et al. "When and why vision-language models behave like bags-of-words." ICLR, 2023. (NegCLIP)
Ravi et al. "SAM 2: Segment Anything in Images and Videos." 2024.
Surís et al. "ViperGPT: Visual Inference via Python Execution for Reasoning." ICCV, 2023.
Lu et al. "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations." NeurIPS, 2019.
Mnih et al. "Human-level control through deep reinforcement learning." Nature, 2015. (DQN, referenced for foundation model analogy)

The One Sentence

Foundation models learn general representations from internet-scale self-supervised data, then transfer to tasks they were never trained for — and the magic comes from scale, data quality, and compositionality.