One model that sees images, reads text, and handles tasks it was never trained for — how we went from one-model-per-task to foundation models that generalize everywhere.
For decades, computer vision worked like this: you want to classify images, so you build an image classifier. You want to detect objects, so you build a separate object detector. You want to generate captions, so you build a third model. Each task got its own architecture, its own dataset, its own training pipeline. The knowledge learned by one model was largely invisible to the others.
This is wasteful. A model that learns to recognize dogs for classification clearly knows something useful for detection, segmentation, and captioning. But the old paradigm — one model per task — couldn't share that knowledge.
A foundation model is a model that (1) is trained on broad data at scale using self-supervision, (2) can be adapted to a wide variety of downstream tasks, and (3) exhibits capabilities that were not explicitly programmed or trained for. The term was coined in 2021 by researchers at Stanford's Center for Research on Foundation Models (CRFM), part of Stanford HAI.
Why did we suddenly shift? Because of a bet: that scale alone produces qualitative change. Train a model 10x bigger on 10x more data and you don't get 10x improvement — you get entirely new capabilities that didn't exist at smaller scale.
GPT-2 (1.5B parameters) could barely write coherent paragraphs. GPT-3 (175B) could do arithmetic, translate between languages, and write code — tasks it was never explicitly trained for. These are called emergent abilities: capabilities that appear abruptly at certain model sizes, unpredictable from smaller-scale experiments.
Supervised learning requires human labels. ImageNet has 1.28M labeled images — the result of years of crowd-sourcing. But the internet has billions of images with naturally occurring text (alt-text, captions, titles). Self-supervised objectives like "predict the next word" or "match images to their captions" unlock this ocean of data. At scale, the sheer diversity of internet data teaches models concepts that no hand-labeled dataset could cover.
| Property | Old Paradigm | Foundation Model |
|---|---|---|
| Task scope | Single task | Many tasks, often zero-shot |
| Parameters | Millions (ResNet-50: 25M) | Hundreds of millions to billions |
| Data | Curated, labeled (1M images) | Internet-scale, noisy (400M–5B images) |
| Supervision | Labels for every example | Self-supervised or weakly supervised |
Suppose you need 5 vision capabilities: classification, detection, segmentation, captioning, and visual question answering. Old approach: train 5 separate models, each needing its own labeled dataset. COCO took 70,000+ annotator hours. ImageNet took 25,000 workers over 2 years. Multiply by 5 tasks. Foundation model approach: train ONE model on internet data (no manual labeling), adapt it to all 5 tasks with minimal supervision.
Emergent abilities are surprising but not mysterious. A model trained on enough internet text will encounter arithmetic problems, translation pairs, and code snippets in its training data. It learns these capabilities because they appear in the data distribution — not because intelligence spontaneously crystallized. The surprise is that generic self-supervised training extracts these patterns without explicit task-specific engineering.
Remember SimCLR? It pulled together different augmentations of the same image and pushed apart augmentations of different images. The result was an image encoder whose features could be fine-tuned for many tasks. But SimCLR lived in image-only space — it never saw text.
OpenAI's CLIP (Contrastive Language-Image Pre-training, Radford et al. 2021) asked: what if we replace the second image augmentation with the text caption of that image? Instead of pulling together two crops of the same photo, pull together a photo and its caption.
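CLIP has two encoders trained jointly: an image encoder (a ResNet or Vision Transformer) that maps an image to a vector, and a text encoder (a Transformer) that maps a caption to a vector in the same embedding space.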
Given a batch of N (image, text) pairs, CLIP constructs an N×N similarity matrix. Entry (i, j) is the cosine similarity between image i and text j. The diagonal contains matching pairs; everything off-diagonal is a negative pair.
Each row of the similarity matrix is a softmax classification problem: "which of the N texts matches image i?" And each column is the reverse: "which of the N images matches text j?" The loss symmetrically trains both directions.
The temperature τ controls the sharpness of the softmax: similarities are divided by τ before normalization. Small τ makes the distribution peaky, because even a small similarity gap between a match and a non-match is amplified into a confident prediction. Large τ makes it flat, so the model would need very large similarity gaps to express confidence. CLIP learns τ as a parameter (initialized to 0.07), so the model sets its own sharpness. At convergence, τ ≈ 0.01, meaning CLIP produces very sharp, confident similarity distributions.
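The symmetric loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not CLIP's training code: the function names `log_softmax` and `clip_loss`, the toy 8-dimensional embeddings, and the fixed `tau` are assumptions made for the example (in CLIP the temperature is learned).

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, tau=0.07):
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # (N, N) matrix: entry (i, j) is sim(image i, text j) divided by the temperature.
    logits = image_emb @ text_emb.T / tau
    diag = np.arange(len(logits))  # matching pairs sit on the diagonal
    # Rows: "which of the N texts matches image i?"
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Columns: "which of the N images matches text j?"
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

# Toy data (made up): images that sit close to their own captions.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = text + 0.1 * rng.normal(size=(4, 8))
shuffled = aligned[::-1]  # every image is now paired with the wrong caption

print("aligned pairs:  ", clip_loss(aligned, text))
print("shuffled pairs: ", clip_loss(shuffled, text))
```

The aligned batch should give a much lower loss than the shuffled one, since only in the aligned batch do the largest similarities fall on the diagonal.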
Batch: {(cat photo, "a cat"), (dog photo, "a dog"), (car photo, "a red car"), (sunset photo, "beautiful sunset")}. Similarity matrix after training:
| Image | "a cat" | "a dog" | "red car" | "sunset" |
|---|---|---|---|---|
| cat | 0.95 | 0.30 | 0.05 | 0.10 |
| dog | 0.25 | 0.92 | 0.08 | 0.12 |
| car | 0.03 | 0.06 | 0.88 | 0.15 |
| sunset | 0.08 | 0.05 | 0.12 | 0.91 |
The loss pushes the diagonal entries (matching pairs) toward 1.0 and the off-diagonal entries (non-matching pairs) toward 0.0.
CLIP was trained on WIT (WebImageText) — 400 million image-text pairs scraped from the internet. These aren't curated captions; they're alt-text, titles, and descriptions people wrote for web accessibility or SEO. Noisy, but massively diverse.
The batch size was 32,768. Why so large? Each batch creates 32,768 × 32,768 ≈ 1 billion pairwise comparisons. Larger batches mean more negatives, which forces the model to make finer distinctions. With a batch of 100, "a cat on a sofa" and "a cat on a chair" might never appear in the same batch, so the model never needs to distinguish them.
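Here's CLIP's killer trick. To classify an image without any labeled training data: write a text prompt for each candidate class (for example, "a photo of a [category]"), encode every prompt with the text encoder, encode the image with the image encoder, and pick the class whose text embedding has the highest cosine similarity with the image embedding.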
This is literally 1-nearest-neighbor classification using text embeddings as the "training data." No labels, no fine-tuning, no training loop. And the prompt template matters: "a photo of a [category]" gives +1.3% over just "[category]" on ImageNet. Ensembling 80 prompt templates ("a blurry photo of a [category]", "a sculpture of a [category]", etc.) adds another +3.5%.
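A minimal sketch of that nearest-neighbor step, assuming the embeddings have already been computed. The 3-dimensional vectors below are made-up stand-ins for real encoder outputs, and `zero_shot_classify` is a hypothetical helper name, not part of any CLIP library.

```python
import numpy as np

def normalize(v):
    # Scale vectors to unit length so dot products equal cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, class_names, class_text_embs):
    # class_text_embs has shape (num_classes, dim): one prompt embedding per class.
    sims = normalize(class_text_embs) @ normalize(image_emb)
    return class_names[int(np.argmax(sims))], sims

# Toy embeddings (made up) standing in for text-encoder and image-encoder outputs.
class_names = ["cat", "dog", "car"]
text_embs = np.array([
    [0.9, 0.1, 0.0],   # "a photo of a cat"
    [0.1, 0.9, 0.0],   # "a photo of a dog"
    [0.0, 0.1, 0.9],   # "a photo of a car"
])
image_emb = np.array([0.8, 0.3, 0.1])  # pretend this came from a cat photo

label, sims = zero_shot_classify(image_emb, class_names, text_embs)
print(label, np.round(sims, 2))
```

Adding a new class only requires encoding one more prompt; nothing is retrained.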
CLIP ViT-L/14 achieves 76.2% top-1 accuracy on ImageNet — matching a supervised ResNet-101 (76.4%). The ResNet was trained on 1.28M labeled ImageNet images. CLIP used zero ImageNet labels. This is the headline result: internet-scale contrastive pre-training can match supervised training without any task-specific labels.
CLIP doesn't just match supervised accuracy — it generalizes better. On ObjectNet (everyday objects in unusual poses and contexts), ImageNet-trained models drop 40–45% accuracy. CLIP drops only 25%. On sketches, adversarial images, and domain-shifted data, CLIP consistently outperforms supervised baselines.
CLIP ViT-L/14: 307M parameters, trained on 400M images. ResNet-101: 44.5M parameters, trained on 1.28M images. CLIP sees 300x more data and uses 7x more parameters. But scale alone doesn't explain generalization — training on natural language supervision forces the model to learn concepts (the meaning of "cat"), not just patterns (the pixel statistics of ImageNet cats). A model that understands "cat" transfers to any visual context; a model that memorized ImageNet cats doesn't.
CLIP is impressive but brittle in specific ways. Understanding these failures reveals the boundaries of contrastive learning.
Show CLIP two captions: "a mug in some grass" and "some grass in a mug." To a human, these are obviously different — one is a mug sitting on grass, the other is a bizarre scene. CLIP assigns nearly identical similarity scores to both. It sees the same words and ignores their order.
CLIP encodes text as a single vector — a "bag of concepts." "Red car on blue road" and "blue car on red road" produce almost the same embedding. The model recognizes individual concepts (red, blue, car, road) but fails to bind them correctly (which thing is which color). This is the compositionality failure.
Why does this happen? Think about the batch. Even with N = 32,768, what are the odds that one batch contains both "a mug in some grass" AND "some grass in a mug"? Essentially zero. If the model never sees both in the same batch, it never needs to distinguish them. The contrastive loss only pushes apart captions that co-occur as negatives.
Yuksekgonul et al. (2023) proposed NegCLIP: deliberately construct confusing negative pairs. Take "a mug in some grass" and shuffle the nouns to get "some grass in a mug." Include both in the same batch. Now the contrastive loss forces the model to pay attention to word order.
Original caption: "A black cat and a brown dog sitting on a bench."
Hard negative (attribute swap): "A brown cat and a black dog sitting on a bench."
Hard negative (relation swap): "A bench sitting on a black cat and a brown dog."
When both appear in the same batch, the model must learn: black goes with cat, brown goes with dog. Word order encodes meaning.
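The hard negative in this example can be produced by a simple word swap. The sketch below is an illustration only: `swap_words` is a hypothetical helper, and it is not the procedure used in the NegCLIP paper. It shows that the negative contains exactly the same words as the original, which is why a bag-of-words model cannot tell the two apart.

```python
def swap_words(caption, word_a, word_b):
    # Exchange every occurrence of word_a and word_b.
    # Only handles words without attached punctuation, which is enough here.
    mapping = {word_a: word_b, word_b: word_a}
    return " ".join(mapping.get(token, token) for token in caption.split())

caption = "A black cat and a brown dog sitting on a bench."
negative = swap_words(caption, "black", "brown")

print(negative)  # A brown cat and a black dog sitting on a bench.

# Same multiset of words, different meaning.
same_bag = sorted(caption.split()) == sorted(negative.split())
print(same_bag)  # True
```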
Aggressive shuffling can produce negatives that are actually valid. "A dog next to a cat" shuffled to "A cat next to a dog" — these describe the same image! The model is penalized for matching the image to a correct caption. This damages performance on normal data. Careful filtering is needed.
Image-level captions like "a living room" are vague. The image contains a couch, house plants, a lamp, and a window — none of which the caption mentions. Models trained on such data learn to match "living room" to the overall scene but can't ground specific objects.
The fix: train on region-level captions paired with bounding box coordinates. "A red couch" anchored to coordinates (120, 340, 580, 620) in the image. Now the model must learn fine-grained correspondence between text and image regions.
CLIP's InfoNCE loss requires a softmax over the entire batch. This creates a hard dependency on batch size: the denominator sums N terms, so gradients and difficulty scale with N. Smaller batches give a weaker learning signal.
SigLIP replaces the softmax (global normalization across the batch) with a sigmoid (independent binary decision per pair). Each (image, text) pair is independently classified as matching or not, with no denominator over the batch. As a result, SigLIP is much less sensitive to batch size and trains well even with small batches, which removes one of CLIP's biggest engineering constraints.
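A minimal NumPy sketch of the pairwise sigmoid loss, to contrast with the softmax version. The function name `siglip_loss`, the toy one-hot embeddings, and the default scale `t` and bias `b` are illustrative assumptions; in SigLIP both the scale and the bias are learned.

```python
import numpy as np

def siglip_loss(image_emb, text_emb, t=10.0, b=-10.0):
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = t * (image_emb @ text_emb.T) + b
    # Label +1 for matching pairs (diagonal), -1 for every other pair.
    labels = 2.0 * np.eye(len(logits)) - 1.0
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed stably with logaddexp.
    per_pair = np.logaddexp(0.0, -labels * logits)
    # Each pair contributes its own term; there is no normalization across the batch.
    return per_pair.sum() / len(logits)

aligned = siglip_loss(np.eye(4), np.eye(4))
shuffled = siglip_loss(np.eye(4)[::-1], np.eye(4))
print(aligned, shuffled)
```

Because every entry of `per_pair` depends only on its own logit, adding or removing other pairs from the batch changes the number of terms but not the value of any individual term.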
CLIP can match images to text, but it can't generate text. CoCa (Contrastive Captioner, Yu et al. 2022) adds a text decoder to a CLIP-style image-text model and trains with both a contrastive loss and a captioning loss. The contrastive branch aligns representations; the captioning branch teaches the model to produce language. Two losses, one model, better at both tasks.
MetaCLIP (Xu et al. 2023) showed that 400M carefully curated pairs outperform 2B noisy pairs. The curation strategy: start with Wikipedia concepts, match internet images to these concepts, and balance the dataset so no concept dominates. Clean, balanced data beats massive, noisy data. This principle recurs throughout foundation model research.
Large Language Models like LLaMA can answer questions, write code, and reason about text. They do this through next-token prediction: given all previous tokens, predict the next one. This simple objective enables zero-shot generalization to countless tasks. But LLMs are blind — they process only text.
LLaVA (Large Language and Vision Assistant, Liu et al. 2023) asks: what if we could feed images into an LLM as if they were text tokens?
ViLBERT (Lu et al. 2019) was an early attempt at combining vision and language. It used co-attention between image regions and text tokens, but required task-specific fine-tuning for each downstream task. Every new task (VQA, captioning, retrieval) needed a new fine-tuning run with labeled data. LLaVA's insight was to leverage the LLM's existing instruction-following ability, requiring only a lightweight alignment step.
CLIP's final layer pools all patch information into a single CLS token — great for image-level classification, but it throws away spatial information. "Where is the cat?" requires knowing which part of the image contains the cat. The penultimate layer preserves the full grid of patch tokens, each encoding a different spatial region. LLaVA feeds all 256 patch tokens to the LLM, preserving spatial detail.
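LLaVA's training is remarkably lightweight. The CLIP vision encoder stays frozen throughout. A first stage trains only the projection layer so that visual tokens land in the LLM's embedding space; a second stage fine-tunes the projection and the LLM together on visual instruction-following data.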
Input image: A cat sleeping on a keyboard. Input text: "What is the animal doing?"
Step 1: CLIP ViT processes the image → 256 patch tokens of dimension 1024.
Step 2: Linear projection maps each token: 1024 → 4096. Now we have 256 "visual words."
Step 3: Concatenate: [visual token 1, ..., visual token 256, "What", "is", "the", "animal", "doing", "?"]
Step 4: LLaMA processes the full sequence autoregressively and generates: "The cat is sleeping on a computer keyboard."
The LLM treats visual tokens exactly like text tokens — same attention mechanism, same position embeddings, same next-token prediction.
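The shapes in steps 1 to 3 can be traced with a short NumPy sketch. Random arrays stand in for the CLIP patch features and the LLM's token embeddings, and `W` stands in for the learned projection; none of these are real model weights.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, vision_dim, llm_dim = 256, 1024, 4096  # sizes from the worked example

# Step 1: stand-in for the 256 patch tokens produced by the CLIP ViT.
patch_tokens = rng.normal(size=(num_patches, vision_dim))

# Step 2: a linear projection maps each patch token from 1024 to 4096 dimensions.
W = 0.02 * rng.normal(size=(vision_dim, llm_dim))
visual_tokens = patch_tokens @ W

# Step 3: concatenate visual tokens with the embedded question tokens.
question = ["What", "is", "the", "animal", "doing", "?"]
text_tokens = rng.normal(size=(len(question), llm_dim))  # stand-in for LLM embeddings
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)

print(visual_tokens.shape)  # (256, 4096)
print(sequence.shape)       # (262, 4096)
```

From the LLM's point of view the result is just a sequence of 262 vectors of width 4096; nothing marks the first 256 as visual.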
The key insight is that LLMs already know how to reason about visual scenes from text descriptions in their training data. They've read millions of image descriptions, scene analyses, and visual dialogues. The projection layer doesn't teach the LLM about vision — it teaches the visual tokens to speak in the format the LLM already understands. The LLM's existing knowledge does the heavy lifting.
The original LLaVA used a simple linear projection. LLaVA-1.5 (Liu et al. 2024) upgraded to a 2-layer MLP with GELU activation, increased image resolution to 336×336 (from 224×224), and used a larger LLM (Vicuna-13B). These simple changes improved benchmark scores by 10–15% across the board, demonstrating that the architecture was sound — it just needed more capacity at the interface.
LLaVA concatenates visual and text tokens into one long sequence. This is simple but expensive: 256 visual tokens + 512 text tokens = 768 tokens, and attention scales quadratically. For multiple images or high-resolution inputs, the sequence explodes.
Flamingo (Alayrac et al., DeepMind 2022) takes a different approach: instead of concatenating, it uses cross-attention layers inserted between the frozen LLM layers. The text tokens attend to visual features without the visual tokens bloating the main sequence.
Output = LLM_layer(x) + tanh(α) · CrossAttention(x, visual_tokens). The gate α is initialized to 0, so at the start of training, tanh(α) = 0 and the visual information has zero effect. The model gradually learns to "open the gate" as training progresses. This preserves the LLM's pre-trained language abilities during early training.
Without gating, the randomly initialized cross-attention layers would inject noise into the frozen LLM from step 1. This corrupts the LLM's carefully learned language representations. With gating initialized at 0, the model starts as a pure LLM (visual signal = zero), then gradually incorporates vision. It's like adding a new instrument to an orchestra — start silent, fade in slowly, never disrupt the existing harmony.
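A minimal sketch of the gating idea, assuming a single attention head and no layer normalization. The names `cross_attention` and `gated_block` and the random weight matrices are illustrative, and only the gated residual term is shown (the frozen LLM layer itself is omitted).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, visual, Wq, Wk, Wv):
    # Queries come from text tokens; keys and values come from visual tokens.
    q, k, v = x @ Wq, visual @ Wk, visual @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def gated_block(x, visual, alpha, Wq, Wk, Wv):
    # tanh(alpha) scales how much visual information is added to the text stream.
    return x + np.tanh(alpha) * cross_attention(x, visual, Wq, Wk, Wv)

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(5, d))         # 5 text tokens
visual = rng.normal(size=(64, d))   # 64 visual tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # randomly initialized

closed = gated_block(x, visual, 0.0, Wq, Wk, Wv)   # gate at initialization
opened = gated_block(x, visual, 0.5, Wq, Wk, Wv)   # gate partly open

print(np.allclose(closed, x))   # True: the text stream passes through unchanged
print(np.allclose(opened, x))   # False: visual information now contributes
```

With α = 0 the randomly initialized attention weights have no effect on the output, which is exactly why the frozen LLM's behavior is preserved at the start of training.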
Flamingo trains on sequences that naturally interleave images and text, like web pages: [image1] "A golden retriever playing fetch." [image2] "The same dog swimming in a lake." The text after each image can only attend to its nearest preceding image (masked attention). This teaches the model to ground language in the right visual context.
Flamingo's most striking capability: show it a few (image, text) examples, then present a new image, and it generates the right text — without any gradient updates. This is few-shot in-context learning, exactly like GPT-3 does for text tasks.
Prompt sequence:
[photo of 3 apples] "How many items? 3"
[photo of 5 oranges] "How many items? 5"
[photo of 2 bananas] "How many items?"
Flamingo generates: "2"
No fine-tuning. No labeled counting dataset. The model inferred the task format from two examples and applied it to a new image.
| Property | LLaVA | Flamingo |
|---|---|---|
| Fusion method | Concatenate visual + text tokens | Cross-attention between layers |
| Visual token count | All patch tokens (256+) | Fixed (64 via Perceiver) |
| LLM modification | None (tokens go in naturally) | New cross-attention layers inserted |
| Multi-image | Difficult (sequence too long) | Natural (interleaved format) |
| Trained params | LLM + projection | Perceiver + cross-attention + gates |
| Simplicity | Simpler (just concatenate) | More complex (new architecture) |
Most top-performing VLMs are closed-source: GPT-4V, Gemini, Claude. You can use their APIs, but you can't inspect their training data, reproduce their results, or verify their safety claims. Molmo (Deitke et al., Allen AI, 2024) set out to build a state-of-the-art VLM that is completely open: open weights, open data, open code, open evaluations.
Most internet image-text data is incidental: alt-text written for accessibility ("IMG_3847.jpg"), SEO keywords ("cheap red shoes buy now"), or tangential descriptions ("Photo by @user"). This data is abundant (billions of pairs) but low quality.
Molmo's key innovation is intentional data: the PixMo dataset of 700K image-description pairs created with a specific annotation protocol.
Llama 3.1-V trained on 6 billion incidental internet image-text pairs. Molmo trained on 700K intentional PixMo pairs. Molmo 72B ranks second only to GPT-4o on academic benchmarks. Molmo 7B beats Gemini 1.5 Flash and GPT-4V. Quality of data trumps quantity by orders of magnitude. This is the strongest evidence yet for the "data quality > data quantity" principle.
Molmo can point at things in images. Ask "Where is the cat?" and it outputs pixel coordinates. Ask "How many people are there?" and it counts by pointing to each one. This grounds the model's reasoning in the actual image, preventing hallucination (making up objects that aren't there).
Question: "How many birds are in this photo?"
Non-pointing model: "There are 5 birds in the photo." (Might be wrong — no way to verify.)
Molmo: "There are 4 birds." [points to each: (120, 45), (340, 89), (510, 132), (200, 201)] (Each point can be verified against the image. If the model misses one, you see it.)
Molmo uses a connector similar to LLaVA's projection but with multi-scale features from different layers of the vision encoder. The training recipe follows a three-stage curriculum: (1) alignment pre-training on caption data, (2) supervised fine-tuning on diverse instruction data, (3) preference tuning using DPO (Direct Preference Optimization). The connector design pools features from multiple ViT layers at different resolutions, capturing both fine details and global context.
Many "open" models are trained on outputs from GPT-4 or Claude — this is distillation, not independent training. You can't verify the training data because you can't access the teacher model's training data. Molmo avoids this: PixMo was collected by human annotators describing images directly. No teacher model in the loop.
Mask R-CNN can segment 80 object categories from COCO. DeepLab can do 21 categories from Pascal VOC. But what if you encounter a papillon butterfly, a spectrometer, or a medieval trebuchet? Closed-set segmentation fails on anything outside its fixed vocabulary.
SAM (Segment Anything Model, Kirillov et al., Meta 2023) is a promptable segmentation model: give it a point, a box, or a text description, and it segments the corresponding object — any object, in any image, without ever having seen that object category before.
The key design: the image encoder runs once per image. The prompt encoder and mask decoder run once per prompt. You can issue hundreds of different prompts (click different points) and only pay for the fast decoder each time. This amortization is what makes SAM interactive.
When you click a point on a person's shirt, what should be segmented? The shirt? The person? The group of people? All three are valid interpretations. SAM handles this by outputting three masks simultaneously, each at a different granularity level, plus a confidence score for each.
Click: On the wheel of a car.
Mask 1 (part): Just the wheel. Confidence: 0.85.
Mask 2 (object): The entire car. Confidence: 0.92.
Mask 3 (scene): The car and the road it's on. Confidence: 0.71.
The user or downstream system picks the granularity they need. The loss during training is computed only against the best-matching mask (minimum loss over the three), so the model learns to produce diverse, valid outputs.
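The "minimum loss over the three masks" rule can be sketched as follows. This is an illustration under simplifying assumptions: `1 - IoU` on binary masks is used as a stand-in loss (it is not SAM's actual training loss), and the 8×8 masks are made up.

```python
import numpy as np

def iou(pred, target):
    # Intersection over union of two boolean masks.
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union > 0 else 1.0

def multimask_loss(pred_masks, target):
    # Score every predicted mask, but keep only the best-matching one.
    losses = [1.0 - iou(mask, target) for mask in pred_masks]
    best = int(np.argmin(losses))
    return losses[best], best

# Ground truth for this training example: the whole object.
target = np.zeros((8, 8), dtype=bool)
target[2:6, 2:6] = True

part = np.zeros_like(target)
part[2:4, 2:4] = True           # a sub-part of the object
whole = target.copy()           # the object itself
scene = np.ones_like(target)    # the object plus its surroundings

loss, best = multimask_loss([part, whole, scene], target)
print(loss, best)  # 0.0 1
```

Only the winning mask receives a training signal for this example, so the other two outputs are free to specialize in other granularities instead of being averaged toward a blurry compromise.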
SAM was trained on SA-1B: 11 million images, 1.1 billion masks. How do you annotate a billion masks? You don't — you use a model-in-the-loop data engine with three stages:
Each stage produces a better model, which makes the next stage cheaper and faster. Stage 1 is expensive (full manual annotation) but produces a mediocre model. Stage 2 uses that model to pre-fill masks, halving annotation time. Stage 3 removes humans entirely, scaling to a billion masks. The model improves its own training data — a virtuous cycle.
SAM processes individual frames. SAM 2 (Ravi et al. 2024) extends this to video: prompt on one frame, and the mask propagates through time. It adds a memory mechanism that tracks the segmented object across frames, handling occlusion, deformation, and camera motion. The data engine expands to 50.9K videos with 642.6K masklets (mask tracks over time).
Each foundation model has strengths and blind spots. CLIP knows about common objects but can't recognize rare breeds. SAM can segment anything but doesn't know what it segmented. LLMs can reason but can't see. The most powerful approach is to chain them together, using each model where it excels.
Problem: CLIP doesn't know what a "papillon" looks like — this dog breed rarely appears in alt-text. Solution: ask an LLM to describe a papillon.
Prompt to GPT-3: "Describe what a papillon dog looks like."
GPT-3 response: "A small dog with large butterfly-shaped ears, long silky fur, a plumed tail carried over the back, and a white coat with patches of color."
Use as CLIP text prompt: Encode this description instead of just "papillon." CLIP has seen many images of small dogs with butterfly ears and silky fur — it just didn't know the breed name. The LLM bridges the vocabulary gap.
VisProg (Gupta & Kembhavi 2023) goes further: the LLM doesn't just describe — it writes Python code that calls vision models as subroutines.
Question: "Are there exactly 3 people in the boat?"
LLM generates code:
boat_box = LOC(image, "boat")
boat_crop = CROP(image, boat_box)
people = DETECT(boat_crop, "person")
count = COUNT(people)
answer = EVAL(count == 3)
Each function call invokes a different foundation model: LOC uses Grounding DINO, DETECT uses OWL-ViT, CROP is pure Python. The LLM is the orchestrator; specialist models are the workers.
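To make the control flow concrete, here is a toy version of that program in which every module is replaced by a stub. The stubs are assumptions for illustration only: they look things up in a hand-written scene description instead of calling vision models, and the boxes and labels are made up.

```python
# A hand-written "image": labeled boxes in (x1, y1, x2, y2) format.
scene = {
    "objects": [
        {"label": "boat",   "box": (10, 40, 200, 120)},
        {"label": "person", "box": (30, 50, 60, 100)},
        {"label": "person", "box": (80, 55, 110, 105)},
        {"label": "person", "box": (130, 50, 160, 100)},
        {"label": "person", "box": (300, 60, 330, 110)},  # on the shore, outside the boat
    ]
}

def inside(inner, outer):
    # True if the inner box lies entirely within the outer box.
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def LOC(image, name):
    # Stub localizer: return the box of the first object with this label.
    return next(o["box"] for o in image["objects"] if o["label"] == name)

def CROP(image, box):
    # Stub crop: keep only the objects that fall inside the box.
    return {"objects": [o for o in image["objects"] if inside(o["box"], box)]}

def DETECT(image, name):
    # Stub detector: return all objects with this label.
    return [o for o in image["objects"] if o["label"] == name]

def COUNT(items):
    return len(items)

def EVAL(condition):
    return "yes" if condition else "no"

# The generated program, unchanged.
image = scene
boat_box = LOC(image, "boat")
boat_crop = CROP(image, boat_box)
people = DETECT(boat_crop, "person")
count = COUNT(people)
answer = EVAL(count == 3)

print(count, answer)  # 3 yes
```

Detecting on the crop rather than the full image is what excludes the fourth person; running `DETECT(image, "person")` directly would count 4.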
A single model trying to do everything suffers from task interference: getting better at segmentation might hurt captioning. Chaining avoids this by keeping each model focused on its specialty. It's also modular: swap out a better segmentation model and the whole pipeline improves, without retraining anything. And it's interpretable: you can inspect each step's output to find where errors occur.
ViperGPT (Suris et al. 2023) takes the same idea but executes the generated code in a Python runtime, enabling loops, conditionals, and complex logic. It can answer questions like "Which object in the image is the same color as the sky?" by iterating over detected objects, extracting their colors, comparing to the sky region, and returning the match.
The next frontier: models that don't just answer questions but take actions. Tool-augmented VLMs can browse the web (click buttons, fill forms), control robots (plan grasps, navigate), or generate and edit images. The LLM reasons about what to do; vision models perceive the environment; action models execute. This is the beginning of agentic AI — foundation models as autonomous actors, not just question-answerers.
Chaining has a critical weakness: errors compound. If LOC misidentifies the boat, everything downstream is wrong. Each model in the chain has some error rate ε. With K models in sequence, the chance of a correct final answer is roughly (1 − ε)^K. With ε = 0.1 and K = 5, you're correct only 59% of the time. Reducing individual error rates matters more in chains than in standalone models.
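The compounding is easy to check numerically. The sketch below assumes every step fails independently with the same error rate, which is a simplification; `chain_accuracy` is a hypothetical helper name.

```python
def chain_accuracy(eps, k):
    # Probability that all k steps succeed, assuming independent errors
    # with the same per-step error rate eps.
    return (1 - eps) ** k

for k in (1, 3, 5, 10):
    print(f"K={k:2d}  accuracy={chain_accuracy(0.1, k):.3f}")

# Halving the per-step error rate helps a 5-step chain considerably.
print(f"eps=0.05, K=5: {chain_accuracy(0.05, 5):.3f}")
```

With ε = 0.1 the accuracy falls from 0.900 for a single step to about 0.590 for five steps and about 0.349 for ten.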
Let's put CLIP's zero-shot classification trick together from scratch: how image and text embeddings are compared, how prompt templates affect accuracy, and what the cosine similarity space looks like in 2D.
Each text prompt ("a photo of a cat") is encoded to a vector. The image is encoded to a vector. We compute cosine similarity between the image vector and every text vector. The highest similarity wins. With ensembling, we average the text vectors from multiple prompts ("a photo of a cat", "a blurry photo of a cat", "a drawing of a cat") before comparing — this denoises the text representation and improves accuracy.
Cosine similarities summarize each comparison as a single number. But what does the embedding space look like? Imagine projecting all image and text embeddings into 2D. Matching pairs cluster together: the "cat" text vectors sit near the cat image vectors, far from car vectors. Ensembling moves the text cluster closer to the image point by averaging out noise.
Single prompt: "a photo of a cat" → similarity = 0.82.
Ensemble (5 prompts):
"a photo of a cat" → 0.82
"a blurry photo of a cat" → 0.76
"a sketch of a cat" → 0.69
"a close-up of a cat" → 0.85
"a photo of a small cat" → 0.80
Average embedding similarity: 0.84 (higher than any individual prompt except "close-up"). The averaged vector is a more robust representation of the concept "cat."
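The denoising effect of ensembling can be reproduced with synthetic vectors. In this sketch the image embedding is a random unit vector and each prompt embedding is that vector plus independent noise; these are assumptions chosen to illustrate the averaging, not outputs of a real encoder, so the numbers will not match the ones above.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
dim = 64

# Stand-in for the cat image embedding.
image_emb = normalize(rng.normal(size=dim))

# Five prompt embeddings: the shared concept plus template-specific noise.
noise = 0.1 * rng.normal(size=(5, dim))
prompt_embs = normalize(image_emb + noise)

# Similarity of each individual prompt to the image.
individual = prompt_embs @ image_emb

# Ensemble: average the prompt embeddings, re-normalize, then compare.
ensemble = normalize(prompt_embs.mean(axis=0))
ensemble_sim = float(ensemble @ image_emb)

print("individual:", np.round(individual, 3))
print("mean of individual:", round(float(individual.mean()), 3))
print("ensemble:", round(ensemble_sim, 3))
```

The mean of unit vectors has norm at most 1, so after re-normalizing, the ensemble similarity can never fall below the average of the individual similarities (as long as that average is positive). It is not guaranteed to beat the single best prompt, which matches the "close-up" exception in the example above.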
| Year | Model | Contribution |
|---|---|---|
| 2020 | SimCLR | Image-image contrastive pre-training |
| 2021 | CLIP | Image-text contrastive, zero-shot transfer |
| 2022 | Flamingo | Cross-attention VLM, in-context learning |
| 2022 | CoCa | Contrastive + captioning in one model |
| 2023 | LLaVA | Simple projection from CLIP to LLM |
| 2023 | SAM | Promptable segmentation, 1B masks |
| 2023 | SigLIP | Sigmoid loss, no batch size dependency |
| 2023 | VisProg | LLM writes code to chain vision models |
| 2024 | LLaVA-1.5 | MLP projection, higher resolution |
| 2024 | Molmo | Open-source VLM, intentional data, pointing |
| 2024 | SAM 2 | Video segmentation with memory |
| Model | Input | Output | Key Trick | Open? |
|---|---|---|---|---|
| CLIP | Image + text | Similarity score | Contrastive pre-training at scale | Weights |
| LLaVA | Image + question | Text answer | Project CLIP tokens into LLM | Weights + data |
| Flamingo | Images + text | Text | Gated cross-attention, Perceiver | No |
| Molmo | Image + question | Text + points | Intentional data, pointing | Everything |
| SAM | Image + prompt | Masks | Heavy/light split, multi-mask | Weights + data |
1. Scale unlocks emergence. Capabilities appear at model/data sizes that couldn't be predicted from smaller experiments.
2. Self-supervision unlocks data. Labels are expensive and limited; natural language supervision is abundant and nearly free.
3. Quality beats quantity. 700K intentional captions beat 6B incidental ones. Curation strategy matters more than crawl size.
4. Compositionality is hard. Models learn individual concepts before learning to combine them. Explicit hard negatives and region-level training help.
5. Chain, don't monolith. Specialized models composed together outperform single models trying to do everything.
CLIP builds directly on contrastive learning ideas from CLIP & Contrastive Learning. The vision-language models here extend concepts from Vision-Language Models. SAM uses the ViT architecture covered in Transformers. The chaining approach connects to Vision-Language-Action Models.
Radford et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML, 2021. (CLIP)
Liu et al. "Visual Instruction Tuning." NeurIPS, 2023. (LLaVA)
Alayrac et al. "Flamingo: a Visual Language Model for Few-Shot Learning." NeurIPS, 2022.
Deitke et al. "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models." 2024.
Kirillov et al. "Segment Anything." ICCV, 2023. (SAM)
Zhai et al. "Sigmoid Loss for Language Image Pre-Training." ICCV, 2023. (SigLIP)
Yu et al. "CoCa: Contrastive Captioners are Image-Text Foundation Models." 2022.
Gupta & Kembhavi. "Visual Programming: Compositional visual reasoning without training." CVPR, 2023. (VisProg)
Pratt et al. "What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification." ICCV, 2023. (CuPL)
Xu et al. "Demystifying CLIP Data." ICLR, 2024. (MetaCLIP)
Yuksekgonul et al. "When and why vision-language models behave like bags-of-words." ICLR, 2023. (NegCLIP)
Ravi et al. "SAM 2: Segment Anything in Images and Videos." 2024.
Suris et al. "ViperGPT: Visual Inference via Python Execution for Reasoning." ICCV, 2023.
Lu et al. "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations." NeurIPS, 2019.
Mnih et al. "Human-level control through deep reinforcement learning." Nature, 2015. (DQN, referenced for foundation model analogy)
Foundation models learn general representations from internet-scale self-supervised data, then transfer to tasks they were never trained for — and the magic comes from scale, data quality, and compositionality.