Liu, Li, Li, Lee — 2023

Improved Baselines with Visual Instruction Tuning

A simple recipe that matches or beats far more complex VLMs: CLIP ViT-L/14 at 336px, a two-layer MLP projection, and Vicuna-13B — trained on just 1.2M public data samples in one day on 8 A100s.

Prerequisites: Vision Transformers (ViT) + CLIP + LLM basics
10 chapters, 5+ simulations

Chapter 0: The Problem

By late 2023, large multimodal models (LMMs) were racing to become general-purpose visual assistants. But the leading approaches looked wildly different in complexity.

Meanwhile, the original LLaVA (2023) took a radically simpler approach: a single linear projection layer connecting CLIP to an LLM, trained on just 600K image-text pairs. It excelled at open-ended visual conversation but struggled on benchmarks requiring short, precise answers (like VQA).

The community assumed that to match the complex models on academic benchmarks, you needed their complexity. Hundreds of millions of training samples. Specialized vision-language modules. Proprietary data.

The question LLaVA-1.5 asks: What if the original LLaVA's simplicity was actually a strength, not a weakness? What if a few targeted improvements — better resolution, a slightly more expressive projection, and smarter data mixing — could beat all of these complex systems? LLaVA-1.5 shows the answer is yes, using only 1.2M public data samples and training in a single day.
Why did the original LLaVA struggle on academic VQA benchmarks despite excelling at open-ended visual conversation?

Chapter 1: The Key Insight

LLaVA-1.5's insight is almost embarrassingly simple. The fully-connected vision-language connector in LLaVA is already surprisingly powerful and data-efficient. You don't need a Q-Former. You don't need billions of pre-training pairs. You need three targeted improvements:

Improvement 1: MLP instead of linear projection

The original LLaVA used a single linear layer to map CLIP visual features into the LLM's embedding space. LLaVA-1.5 replaces this with a two-layer MLP (with GELU activation). That's it — one extra linear layer and a nonlinearity. This gives the connector enough representational power to better align visual and language features.

Improvement 2: Higher resolution input

Swap CLIP-ViT-L/14 at 224px for CLIP-ViT-L/14 at 336px. More pixels means the model can read text in images, distinguish fine details, and handle complex scenes. The 336px model produces 576 visual tokens (24×24 patches) instead of 256 tokens (16×16).

Improvement 3: Academic VQA data with response formatting

Include VQA datasets (VQAv2, GQA, OCR-VQA, etc.) in instruction tuning, but with a critical twist: append a response format prompt that tells the model when to give short answers. "Answer the question using a single word or phrase." This prevents the model from overfitting to either long-form or short-form responses.

The recipe in one sentence: CLIP ViT-L/14@336px → 2-layer MLP → Vicuna-13B, trained on 558K alignment pairs and then 665K mixed instruction samples. That's the entire recipe. It achieves state-of-the-art results across 11 benchmarks, beating systems that use roughly 200× to 2500× more pre-training data.

LLaVA-1.5 vs. Competitors: Data Efficiency

Pre-training data used (millions of samples) versus number of benchmarks won. LLaVA-1.5 achieves more with dramatically less data.

What are the three key modifications LLaVA-1.5 makes over the original LLaVA?

Chapter 2: The Architecture

LLaVA-1.5's architecture is one of the simplest possible designs for a visual language model. Three components, connected in a straight line:

Component 1: Vision Encoder — CLIP ViT-L/14@336px

A Vision Transformer pre-trained by OpenAI's CLIP on 400M image-text pairs. It processes 336×336 images, dividing them into 14×14 pixel patches. That gives us a grid of 24×24 = 576 patch tokens, each a 1024-dimensional vector encoding local visual features.

The vision encoder stays frozen during both pre-training and instruction tuning. This is important: the visual features are already excellent from CLIP training. We just need to teach the LLM how to read them.

Component 2: Vision-Language Projector — 2-Layer MLP

Each of the 576 visual tokens passes through:

Linear: 1024 → 5120 dimensions (Vicuna-13B's hidden size; 4096 for the 7B variant)
GELU: nonlinear activation
Linear: 5120 → 5120 (matches the LLM hidden dimension)

That's it. Two linear layers with a GELU in between. No attention, no cross-attention, no query tokens, no resampling. Each visual patch becomes one token in the LLM's embedding space. For Vicuna-13B the projector has roughly 31M parameters, less than 0.3% of the full model.
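
As a rough illustration of how small this connector is, the following Python sketch applies the Linear, GELU, Linear mapping to a single token and counts the projector's parameters. The function names, the bias terms, and the exact erf-based GELU are assumptions made for this example, not details taken from the paper's code. The two widths in the usage lines are the 7B-class (4096) and 13B-class (5120) LLaMA hidden sizes.

```python
import math


def gelu(x: float) -> float:
    """Exact (erf-based) GELU activation for a single value."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Dense layer: weight is a list of output rows, each as long as x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_projector(x, w1, b1, w2, b2):
    """Linear -> GELU -> Linear, applied to one visual token."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def projector_param_count(vision_dim: int, llm_dim: int) -> int:
    """Weights plus biases of both linear layers (biases are an assumption)."""
    first = vision_dim * llm_dim + llm_dim
    second = llm_dim * llm_dim + llm_dim
    return first + second


print(projector_param_count(1024, 4096))  # about 21M for a 4096-wide LLM
print(projector_param_count(1024, 5120))  # about 31M for a 5120-wide LLM
```

The same projector is applied independently to each of the 576 patch tokens, so the token count is unchanged and only the feature width is mapped into the LLM's space.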

Component 3: Language Model — Vicuna-13B

Vicuna is a fine-tuned LLaMA model trained on ShareGPT conversations. It handles the actual reasoning: the 576 projected visual tokens are concatenated with the text tokens and fed into Vicuna as a single sequence. The LLM processes them jointly via self-attention, treating visual tokens as just another "language" it has learned to read.
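
A minimal sketch of how the single input sequence might be assembled is shown below. It assumes the prompt contains a placeholder marking where the image goes; the `<image>` placeholder string and the `build_input_sequence` helper are hypothetical names chosen for this example, and real implementations splice embedding vectors rather than strings.

```python
IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for the image position


def build_input_sequence(text_tokens, visual_tokens):
    """Splice the projected visual tokens into the text token sequence.

    Every occurrence of the placeholder is replaced by the full list of
    visual tokens; all other tokens are kept in order.
    """
    sequence = []
    for token in text_tokens:
        if token == IMAGE_PLACEHOLDER:
            sequence.extend(visual_tokens)
        else:
            sequence.append(token)
    return sequence


# Stand-ins for 576 projected patch embeddings and a short prompt.
visual = [f"v{i}" for i in range(576)]
prompt = ["Describe", IMAGE_PLACEHOLDER, "briefly"]
print(len(build_input_sequence(prompt, visual)))
```

Because the visual tokens sit in the same sequence as the text, the LLM needs no extra attention mechanism to read them.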

LLaVA-1.5 Architecture

The full architecture: image patches flow through CLIP, get projected by a 2-layer MLP, and are concatenated with text tokens before entering the LLM.

Why not use a Q-Former? Q-Formers (as in BLIP-2 and InstructBLIP) use cross-attention with learned query tokens to compress visual features into a fixed number of tokens. This adds complexity and requires pre-training on hundreds of millions of image-text pairs. LLaVA-1.5 shows that simply keeping all 576 tokens and letting the LLM's self-attention sort them out is actually more effective — the LLM already knows how to attend to the relevant tokens.
How many visual tokens does LLaVA-1.5 feed into the LLM per image?

Chapter 3: Two-Stage Training

LLaVA-1.5 follows a clean two-stage training protocol. Each stage serves a distinct purpose, and the setup is carefully designed to be data-efficient.

Stage 1: Vision-Language Alignment Pre-training

The goal here is simple: teach the MLP projector to translate CLIP's visual features into the LLM's word embedding space. Think of it as training a translator between two languages that already exist.

After this stage, the LLM can already describe images — but it hasn't learned to follow diverse visual instructions.

Stage 2: Visual Instruction Tuning

Now we teach the model to follow complex visual instructions: answer VQA questions, describe regions, reason about scenes, engage in multi-turn conversations.
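
The two stages can be summarized as a small configuration sketch. The `StageConfig` class and component names are hypothetical and exist only for this example. The sample counts are the 558K and 665K figures quoted in this article, and the sketch keeps the vision encoder frozen in both stages.

```python
from dataclasses import dataclass

COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class StageConfig:
    name: str
    num_samples: int
    trainable: frozenset  # components whose weights are updated in this stage


STAGES = (
    # Stage 1: only the MLP projector learns; everything else is frozen.
    StageConfig("alignment_pretraining", 558_000, frozenset({"projector"})),
    # Stage 2: the LLM is unfrozen and trained together with the projector.
    StageConfig("instruction_tuning", 665_000, frozenset({"projector", "llm"})),
)


def frozen_components(stage: StageConfig):
    """Return the components that receive no gradient updates in a stage."""
    return [c for c in COMPONENTS if c not in stage.trainable]


for stage in STAGES:
    print(stage.name, "frozen:", frozen_components(stage))
```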

Total training cost: ~1 day on a single 8-A100 node. Compare this with Qwen-VL, which trains on 1.4 billion image-text pairs. LLaVA-1.5 achieves competitive or better results with roughly 2500× less pre-training data (558K pairs). The key is not brute-force data volume but surgical data selection and a training protocol that doesn't waste parameters learning redundant alignments.

Two-Stage Training Pipeline

Each stage shows what is trained (highlighted) vs. frozen (dimmed). The transition from Stage 1 to Stage 2 unfreezes the LLM.

In Stage 1 of LLaVA-1.5 training, which component(s) are trained?

Chapter 4: The Data Recipe

This is where LLaVA-1.5 gets clever. The data mixture for Stage 2 instruction tuning is carefully designed to balance multiple capabilities without sacrificing any one of them.

The data sources

LLaVA-Instruct: 150K GPT-4-generated visual conversations, detailed descriptions, and complex reasoning (from original LLaVA)
Academic VQA: VQAv2, GQA, OKVQA, A-OKVQA, with response formatting prompts for short answers
OCR Data: OCR-VQA, TextCaps; teaches the model to read text in images
Region-Level: Visual Genome, RefCOCO; grounds language in spatial regions of the image
ShareGPT: 40K text-only multi-turn conversations; preserves the LLM's language quality and reasoning

The response formatting trick

This is the single most important data engineering insight in the paper. When including VQA data that expects short answers, the authors append a prompt: "Answer the question using a single word or phrase."

Without this, the model gets confused: sometimes it sees training data that expects "yellow" and other times data that expects a full paragraph. With the formatting prompt, the model learns to switch output modes based on the user's instructions. This solves the multitask balancing problem that plagued InstructBLIP, which overfitted to short answers even when detailed responses were requested.
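
The formatting trick is easy to picture as a tiny prompt-building helper. This is a sketch only: the `build_prompt` function and the newline separator between question and suffix are assumptions for illustration, while the suffix text itself is the prompt quoted above.

```python
SHORT_ANSWER_SUFFIX = "Answer the question using a single word or phrase."


def build_prompt(question: str, short_answer: bool) -> str:
    """Append the response-format suffix only for short-answer samples."""
    question = question.strip()
    if short_answer:
        # The newline separator is an assumption made for this example.
        return f"{question}\n{SHORT_ANSWER_SUFFIX}"
    return question


# A VQA-style sample asks for a terse answer...
print(build_prompt("What color is the bus?", short_answer=True))
# ...while a conversational sample leaves the response length open.
print(build_prompt("Describe the scene in detail.", short_answer=False))
```

The same suffix can be supplied at inference time, which lets the user choose the output mode explicitly.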

Why ShareGPT matters: Including 40K text-only conversations might seem irrelevant for a vision model. But it serves a critical purpose: it prevents the LLM from "forgetting" how to have natural, flowing conversations. Without it, the model's language quality degrades as it specializes on visual tasks. The text data acts as a regularizer for the LLM's conversational abilities.
What is the response formatting prompt trick, and why is it critical?

Chapter 5: Resolution Matters

One of the clearest results in the paper is the impact of input resolution. Upgrading from 224px to 336px improves most benchmarks. Why?

The math of resolution

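CLIP ViT-L/14 uses 14×14 pixel patches. At different resolutions:

224px: 224 / 14 = 16 patches per side, so 16×16 = 256 visual tokens
336px: 336 / 14 = 24 patches per side, so 24×24 = 576 visual tokens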

That's 2.25× more visual tokens. Each token still represents a 14×14 pixel region, but that region now covers a smaller portion of the original image, meaning finer details are captured.
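
The arithmetic can be checked with a few lines of Python. The helper name `visual_token_count` is made up for this example, and it assumes a square image whose side is an exact multiple of the patch size.

```python
def visual_token_count(image_size: int, patch_size: int = 14) -> int:
    """Number of ViT patch tokens for a square image (class token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


low = visual_token_count(224)
high = visual_token_count(336)
print(low, high, high / low)

# Each token covers 1/N of the image area, so more tokens means each one
# describes a smaller, more detailed slice of the scene.
print(f"area per token: 1/{low} of the image vs 1/{high}")
```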

What higher resolution buys you

LLaVA-1.5-HD: Going beyond 336px

The paper also explores scaling to even higher resolutions with LLaVA-1.5-HD. Instead of interpolating position embeddings (which requires expensive retraining), they use a grid-based approach:

  1. Split the high-res image into a grid of 336×336 patches
  2. Encode each patch independently with CLIP
  3. Concatenate all patch features into a single long sequence
  4. Also encode a downsampled global view for context

This allows scaling to any resolution without modifying the vision encoder at all.
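
The four steps above can be sketched as tile bookkeeping. This is an illustrative approximation, not the paper's implementation: it assumes the image is padded up to a whole number of 336×336 tiles, that every tile and the global view each contribute one token per ViT patch, and that no tokens are pooled or merged. The function names are hypothetical.

```python
import math


def grid_tiles(width: int, height: int, tile: int = 336):
    """Return (left, top, width, height) boxes covering the padded image."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return [(c * tile, r * tile, tile, tile)
            for r in range(rows) for c in range(cols)]


def hd_token_count(width: int, height: int, tile: int = 336,
                   patch: int = 14, global_view: bool = True) -> int:
    """Total visual tokens: one encoder pass per tile, plus the global view."""
    tokens_per_view = (tile // patch) ** 2
    views = len(grid_tiles(width, height, tile)) + (1 if global_view else 0)
    return tokens_per_view * views


print(len(grid_tiles(672, 672)))  # number of tiles for a 672x672 image
print(hd_token_count(672, 672))   # tokens from all tiles plus the global view
```

The sequence length grows linearly with the number of tiles, which is the price paid for leaving the vision encoder untouched.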

Resolution Comparison: 224px vs 336px

See how many more visual tokens the model gets at higher resolution. Each cell represents one ViT patch. More patches = finer detail for the LLM to reason about.

Why does increasing input resolution from 224px to 336px improve performance on OCR and detail-heavy tasks?

Chapter 6: Results

LLaVA-1.5 achieves state-of-the-art results across 11 benchmarks, spanning academic VQA, visual reasoning, and open-ended conversation. The results are remarkable because the model uses vastly less training data and a far simpler architecture than its competitors.

Headline numbers (LLaVA-1.5-13B)

The efficiency story: InstructBLIP pre-trains on 129M image-text pairs. Qwen-VL uses 1.4B pairs (much of it proprietary). LLaVA-1.5 uses just 558K pre-training pairs and 665K instruction tuning samples — all publicly available. It trains in ~1 day on 8 A100s. Despite using 200× to 2500× less pre-training data, it matches or beats these systems on the majority of benchmarks.
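
The "200× to 2500×" range follows directly from the sample counts quoted above, as this short check shows. The dictionary and helper names are just for this example.

```python
# Pre-training sample counts as quoted in this article.
PRETRAIN_SAMPLES = {
    "LLaVA-1.5": 558_000,
    "InstructBLIP": 129_000_000,
    "Qwen-VL": 1_400_000_000,
}


def ratio_vs_llava(model: str) -> float:
    """How many times more pre-training samples a model uses than LLaVA-1.5."""
    return PRETRAIN_SAMPLES[model] / PRETRAIN_SAMPLES["LLaVA-1.5"]


for name in ("InstructBLIP", "Qwen-VL"):
    print(f"{name}: {ratio_vs_llava(name):.0f}x the pre-training data")
```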
Benchmark Comparison

LLaVA-1.5 (teal) vs. competitors across key benchmarks. Normalized to percentage of best known score.

How does LLaVA-1.5's pre-training data volume compare to InstructBLIP and Qwen-VL?

Chapter 7: Ablation Studies

The paper's ablation studies are particularly illuminating because they isolate each design choice and measure its individual contribution. Let's walk through the key findings.

MLP vs. Linear Projection

Replacing the single linear layer with a two-layer MLP improves performance across the board. On MME, the jump is from 1323.8 to 1355.2 — a 31-point improvement from adding just one extra linear layer and a GELU activation. The MLP gives the projection enough capacity to learn non-trivial mappings between visual and language feature spaces.

Resolution: 224px vs. 336px

Scaling from 224px to 336px improves GQA from 50.3 to 51.4 and MME from 1426.5 to 1450, while MM-Vet dips slightly from 30.8 to 30.3. The gains are especially large on tasks requiring fine-grained visual understanding.

Cumulative improvements (Table 2 from the paper)

The paper shows how each modification stacks:

LLaVA baseline: MME 809.6 • MM-Vet 25.5
+ VQAv2 data: MME 1197.0 (+387) • MM-Vet 27.7
+ Format prompt: MME 1323.8 (+127) • MM-Vet 26.3
+ MLP connector: MME 1355.2 (+31) • MM-Vet 27.8
+ OKVQA/OCR + Region + 336px + GQA + ShareGPT + 13B: MME 1531.3 (+176) • MM-Vet 36.1
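
The per-step MME gains in the list above can be recomputed from the cumulative scores. The snippet below is only a convenience check on those numbers. The variable and function names are made up for this example.

```python
# Cumulative MME scores, in the order the modifications are stacked above.
MME_STEPS = [
    ("LLaVA baseline", 809.6),
    ("+ VQAv2 data", 1197.0),
    ("+ Format prompt", 1323.8),
    ("+ MLP connector", 1355.2),
    ("+ OKVQA/OCR + Region + 336px + GQA + ShareGPT + 13B", 1531.3),
]


def step_gains(steps):
    """Gain of each step over the previous row, rounded to one decimal."""
    return [(name, round(score - prev, 1))
            for (_, prev), (name, score) in zip(steps, steps[1:])]


gains = step_gains(MME_STEPS)
for name, gain in gains:
    print(f"{name}: +{gain}")

largest = max(gains, key=lambda item: item[1])
print("largest single step:", largest[0])
```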
The biggest single gain: Adding VQAv2 data gives the single largest improvement (+387 on MME), and the response format prompt adds another +127 on top. This confirms that the original LLaVA's weakness wasn't architectural. It was a data gap. The model already had the capacity; it just hadn't seen the right training signal.

Ablation: Cumulative MME Score

Each modification builds on the previous. The tallest jump comes from adding VQA data, not from architectural changes.

Which single change gave the largest improvement to LLaVA-1.5's MME score?

Chapter 8: What Makes It Work

LLaVA-1.5's success is a lesson in research taste. The paper doesn't introduce any novel architecture. It doesn't collect new data. It doesn't use tricks that are hard to reproduce. Instead, it systematically identifies what actually matters and strips away what doesn't.

Lesson 1: The connector doesn't need to be smart

Q-Formers, Perceiver Resamplers, and other complex vision-language bridges compress visual information into a fixed number of tokens. This compression is lossy — and it turns out the LLM's self-attention is already perfectly capable of selecting which visual tokens to attend to. A simple MLP that just translates the feature space (without compressing the token count) is sufficient.

Lesson 2: Data quality and formatting beat data quantity

InstructBLIP trains on 129M pairs. Qwen-VL trains on 1.4B pairs. LLaVA-1.5 uses 558K pre-training pairs. The difference? LLaVA-1.5 is extremely intentional about what goes into the instruction tuning mix, and it uses response formatting prompts to prevent task interference. Quantity cannot substitute for curation.

Lesson 3: Resolution is an easy win

Going from 224px to 336px raises the visual token count by 2.25× (256 to 576), a modest compute cost, but gives outsized returns on most benchmarks. When your model literally can't see fine details, no amount of clever architecture will help.

Lesson 4: Reproducibility is a feature

The entire recipe uses publicly available data, open-source models, and trains in one day on one 8-GPU node. This isn't just convenient — it's a statement. The paper shows that state-of-the-art multimodal capability doesn't require massive compute or proprietary data. The barrier to entry is lower than the field assumed.

The meta-lesson: Complexity in ML systems often persists not because it's necessary but because no one bothers to check. The original LLaVA showed a simple architecture could work well. LLaVA-1.5 shows that with two targeted improvements and better data, it can be state-of-the-art. The next time you reach for a complex module, ask: have I tried the simple thing with good data first?
Why does LLaVA-1.5 outperform models that use vision-language bridges like Q-Formers?

Chapter 9: Connections

LLaVA-1.5 sits at a critical junction in the evolution of visual language models. Let's trace its lineage and influence.

Ancestors

Contemporaries and successors

The lasting impact: LLaVA-1.5 didn't invent a new architecture. Its contribution was showing the community that the right answer was already in front of them — a simple connector, the right resolution, and curated data. This paper became a reference baseline for virtually every VLM paper that followed in 2024, and its training recipe (two-stage, MLP projection, mixed data with format prompts) became the standard template.
What is LLaVA-1.5's most lasting contribution to the field?