LLaVA-1.5 — Veanors

Chapter 0: The Problem

By late 2023, large multimodal models (LMMs) were racing to become general-purpose visual assistants. But the leading approaches looked wildly different in complexity:

InstructBLIP used a Q-Former module pre-trained on 129 million image-text pairs, with frozen LLMs and a complex resampling architecture
Qwen-VL trained on 1.4 billion image-text pairs using in-house data
BLIP-2 introduced a two-stage training pipeline with multiple frozen components and a specially designed query transformer

Meanwhile, the original LLaVA (2023) took a radically simpler approach: a single linear projection layer connecting CLIP to an LLM, trained on just 600K image-text pairs. It excelled at open-ended visual conversation but struggled on benchmarks requiring short, precise answers (like VQA).

The community assumed that to match the complex models on academic benchmarks, you needed their complexity. Hundreds of millions of training samples. Specialized vision-language modules. Proprietary data.

The question LLaVA-1.5 asks: What if the original LLaVA's simplicity was actually a strength, not a weakness? What if a few targeted improvements — better resolution, a slightly more expressive projection, and smarter data mixing — could beat all of these complex systems? LLaVA-1.5 shows the answer is yes, using only 1.2M public data samples and training in a single day.

Why did the original LLaVA struggle on academic VQA benchmarks despite excelling at open-ended visual conversation?

It lacked VQA-style training data with short-form answers, so it tended to give verbose responses even when a single word was expected
Its visual encoder was too weak
It used too many parameters

Chapter 1: The Key Insight

LLaVA-1.5's insight is almost embarrassingly simple. The fully-connected vision-language connector in LLaVA is already surprisingly powerful and data-efficient. You don't need a Q-Former. You don't need billions of pre-training pairs. You need three targeted improvements:

Improvement 1: MLP instead of linear projection

The original LLaVA used a single linear layer to map CLIP visual features into the LLM's embedding space. LLaVA-1.5 replaces this with a two-layer MLP (with GELU activation). That's it — one extra linear layer and a nonlinearity. This gives the connector enough representational power to better align visual and language features.

Improvement 2: Higher resolution input

Swap CLIP-ViT-L/14 at 224px for CLIP-ViT-L/14 at 336px. More pixels means the model can read text in images, distinguish fine details, and handle complex scenes. The 336px model produces 576 visual tokens (24×24 patches) instead of 256 tokens (16×16).

Improvement 3: Academic VQA data with response formatting

Include VQA datasets (VQAv2, GQA, OCR-VQA, etc.) in instruction tuning, but with a critical twist: append a response format prompt that tells the model when to give short answers. "Answer the question using a single word or phrase." This prevents the model from overfitting to either long-form or short-form responses.

The recipe in one sentence: CLIP ViT-L/14@336px → 2-layer MLP → Vicuna-13B, trained on 558K alignment pairs then 665K mixed instruction data. That's the entire architecture. It achieves state-of-the-art results across 11 benchmarks, beating systems that use roughly 200× to 2500× more pre-training data.

LLaVA-1.5 vs. Competitors: Data Efficiency

Pre-training data used (millions of samples) versus number of benchmarks won. LLaVA-1.5 achieves more with dramatically less data.

What are the three key modifications LLaVA-1.5 makes over the original LLaVA?

Two-layer MLP projection (instead of linear), 336px CLIP resolution (instead of 224px), and academic VQA data with response formatting prompts
Q-Former module, more training data, and a larger LLM
New visual encoder, contrastive pre-training, and RLHF

Chapter 2: The Architecture

LLaVA-1.5's architecture is one of the simplest possible designs for a visual language model. Three components, connected in a straight line:

Component 1: Vision Encoder — CLIP ViT-L/14@336px

A Vision Transformer pre-trained by OpenAI's CLIP on 400M image-text pairs. It processes 336×336 images, dividing them into 14×14 pixel patches. That gives us a grid of 24×24 = 576 patch tokens, each a 1024-dimensional vector encoding local visual features.

The vision encoder stays frozen during both pre-training and instruction tuning in the standard recipe. This is important: the visual features are already excellent from CLIP training. We just need to teach the LLM how to read them.

Component 2: Vision-Language Projector — 2-Layer MLP

Each of the 576 visual tokens passes through:

Linear

1024 → 5120 dimensions (the hidden size of Vicuna-13B; 4096 for the 7B variant)

↓

GELU

Nonlinear activation

↓

Linear

5120 → 5120 (match LLM hidden dim)

That's it. Two linear layers with a GELU in between. No attention, no cross-attention, no query tokens, no resampling. Each visual patch becomes one token in the LLM's embedding space. With a 5120-wide LLM the projector has roughly 31M parameters, less than 0.3% of the full model.
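To make the projector concrete, here is a minimal pure-Python sketch of a Linear → GELU → Linear block applied to one visual token, plus a parameter count. The names `gelu`, `linear`, `mlp_forward`, and `mlp_param_count` are illustrative and not taken from the LLaVA codebase. The LLM widths used at the end (4096 for a 7B LLaMA-family model, 5120 for a 13B one) are assumptions about the language model, and the toy dimensions exist only so the example runs instantly.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute W x + b for a single vector x; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_forward(x, w1, b1, w2, b2):
    """Linear -> GELU -> Linear, applied to one visual token."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def mlp_param_count(d_vision, d_llm):
    """Weights plus biases of both linear layers."""
    return (d_vision * d_llm + d_llm) + (d_llm * d_llm + d_llm)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumption: chosen only for speed, not the real 1024 -> LLM width).
rng = random.Random(0)
d_vision, d_llm = 4, 6
w1, b1 = random_matrix(d_llm, d_vision, rng), [0.0] * d_llm
w2, b2 = random_matrix(d_llm, d_llm, rng), [0.0] * d_llm

token = [1.0, -0.5, 0.25, 2.0]
projected = mlp_forward(token, w1, b1, w2, b2)
print(len(projected))  # 6: one vector in the (toy) LLM embedding width

# Parameter counts for a 1024-d vision feature and two assumed LLM widths.
print(mlp_param_count(1024, 4096))  # 20979712
print(mlp_param_count(1024, 5120))  # 31467520
```

Because the same small block is applied to every patch token independently, the token count is unchanged: 576 vectors go in and 576 come out.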

Component 3: Language Model — Vicuna-13B

Vicuna is a fine-tuned LLaMA model trained on ShareGPT conversations. It handles the actual reasoning: the 576 projected visual tokens are concatenated with the text tokens and fed into Vicuna as a single sequence. The LLM processes them jointly via self-attention, treating visual tokens as just another "language" it has learned to read.
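The concatenation step can be sketched as splicing the projected visual tokens into the text sequence at the position of an image marker. This is a simplified illustration, not the actual LLaVA implementation: the `IMAGE_PLACEHOLDER` string, the `build_input_sequence` helper, and the `toy_embed` lookup are all hypothetical, and real tokenization would produce subword tokens rather than whole words.

```python
IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where the image goes


def build_input_sequence(prompt_pieces, visual_tokens, embed_text):
    """Replace the image placeholder with the projected visual tokens.

    prompt_pieces: list of text tokens, one of which may be IMAGE_PLACEHOLDER
    visual_tokens: list of vectors already projected to the LLM width
    embed_text:    callable mapping a text token to a vector of the same width
    """
    sequence = []
    for piece in prompt_pieces:
        if piece == IMAGE_PLACEHOLDER:
            sequence.extend(visual_tokens)  # one entry per image patch
        else:
            sequence.append(embed_text(piece))
    return sequence


D_LLM = 8  # toy embedding width (assumption, for illustration only)


def toy_embed(token):
    """Deterministic stand-in for an embedding-table lookup."""
    return [float(len(token))] * D_LLM


visual_tokens = [[0.0] * D_LLM for _ in range(576)]
prompt = ["USER", ":", IMAGE_PLACEHOLDER, "What", "color", "is", "the", "bus", "?"]

sequence = build_input_sequence(prompt, visual_tokens, toy_embed)
print(len(sequence))  # 584 = 576 visual tokens + 8 text tokens
```

The LLM then sees one flat sequence of same-width vectors, which is why no architectural change to the language model is needed.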

LLaVA-1.5 Architecture

The full architecture: image patches flow through CLIP, get projected by a 2-layer MLP, and are concatenated with text tokens before entering the LLM.

Why not use a Q-Former? Q-Formers (as in BLIP-2 and InstructBLIP) use cross-attention with learned query tokens to compress visual features into a fixed number of tokens. This adds complexity and requires pre-training on hundreds of millions of image-text pairs. LLaVA-1.5 shows that simply keeping all 576 tokens and letting the LLM's self-attention sort them out is actually more effective — the LLM already knows how to attend to the relevant tokens.

How many visual tokens does LLaVA-1.5 feed into the LLM per image?

576 tokens (24×24 patches from the 336px CLIP encoder), each projected to the LLM's hidden dimension by the MLP
32 compressed query tokens from a Q-Former
1 global [CLS] token

Chapter 3: Two-Stage Training

LLaVA-1.5 follows a clean two-stage training protocol. Each stage serves a distinct purpose, and the setup is carefully designed to be data-efficient.

Stage 1: Vision-Language Alignment Pre-training

The goal here is simple: teach the MLP projector to translate CLIP's visual features into the LLM's word embedding space. Think of it as training a translator between two languages that already exist.

Data: 558K image-caption pairs (the LCS-558K subset, drawn from LAION, CC, and SBU with BLIP-generated captions)
What's trained: Only the MLP projector (roughly 31M parameters for the 13B model)
What's frozen: Both the CLIP encoder and the LLM
Duration: ~6 hours on 8×A100 GPUs

After this stage, the LLM can already describe images — but it hasn't learned to follow diverse visual instructions.

Stage 2: Visual Instruction Tuning

Now we teach the model to follow complex visual instructions: answer VQA questions, describe regions, reason about scenes, engage in multi-turn conversations.

Data: 665K mixed samples (see Chapter 4 for the full recipe)
What's trained: The MLP projector AND the full LLM
What's frozen: The CLIP encoder (optionally unfrozen for further gains)
Duration: ~20 hours on 8×A100 GPUs

Total training cost: ~1 day on a single 8-A100 node. Compare this with Qwen-VL, which trains on 1.4 billion image-text pairs. LLaVA-1.5 achieves competitive or better results with roughly 2500× less pre-training data (558K pairs versus 1.4B). The key is not brute-force data volume but surgical data selection and a training protocol that doesn't waste parameters learning redundant alignments.
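The two stages described above can be summarized as a small configuration table. The sketch below is only a way of organizing the numbers already given in this chapter; the `Stage` dataclass, the module names, and `requires_grad_plan` are hypothetical and do not correspond to any real training script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    num_samples: int
    trainable: frozenset  # module names that receive gradient updates
    approx_hours: float   # on 8 A100 GPUs, as quoted in the text


MODULES = ("vision_encoder", "projector", "llm")

STAGES = [
    Stage("alignment_pretraining", 558_000, frozenset({"projector"}), 6.0),
    Stage("instruction_tuning", 665_000, frozenset({"projector", "llm"}), 20.0),
]


def requires_grad_plan(stage):
    """Map each module to True (trained) or False (frozen) for this stage."""
    return {module: module in stage.trainable for module in MODULES}


for stage in STAGES:
    print(stage.name, requires_grad_plan(stage))

total_hours = sum(stage.approx_hours for stage in STAGES)
total_samples = sum(stage.num_samples for stage in STAGES)
print(total_hours)    # 26.0
print(total_samples)  # 1223000
```

In a real framework the plan would translate into toggling gradient tracking on each module's parameters before building the optimizer; that step is omitted here to keep the sketch dependency-free.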

Two-Stage Training Pipeline

Figure: what is trained vs. frozen in each stage. The transition from Stage 1 to Stage 2 unfreezes the LLM.

In Stage 1 of LLaVA-1.5 training, which component(s) are trained?

Only the MLP projector: the CLIP encoder and LLM are both frozen, so the projector learns to translate visual features into the LLM's embedding space
The full model end-to-end
Only the LLM

Chapter 4: The Data Recipe

This is where LLaVA-1.5 gets clever. The data mixture for Stage 2 instruction tuning is carefully designed to balance multiple capabilities without sacrificing any one of them.

The data sources

LLaVA-Instruct

150K GPT-4-generated visual conversations, detailed descriptions, and complex reasoning (from original LLaVA)

↓

Academic VQA

VQAv2, GQA, OKVQA, A-OKVQA — with response formatting prompts for short answers

↓

OCR Data

OCR-VQA, TextCaps — teaches the model to read text in images

↓

Region-Level

Visual Genome, RefCOCO — grounds language in spatial regions of the image

↓

ShareGPT

40K text-only multi-turn conversations — preserves the LLM's language quality and reasoning

The response formatting trick

The single most important data engineering insight in the paper. When including VQA data that expects short answers, the authors append a prompt: "Answer the question using a single word or phrase."

Without this, the model gets confused: sometimes it sees training data that expects "yellow" and other times data that expects a full paragraph. With the formatting prompt, the model learns to switch output modes based on the user's instructions. This solves the multitask balancing problem that plagued InstructBLIP, which overfitted to short answers even when detailed responses were requested.
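A minimal sketch of the formatting trick: short-answer datasets get the suffix, open-ended ones do not. The helper name `build_prompt` and the newline separator between question and suffix are assumptions made for illustration; only the suffix text itself comes from the description above.

```python
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."


def build_prompt(question, short_answer):
    """Append the format prompt only when a short answer is expected.

    The newline separator is an assumption; the text only says "append".
    """
    question = question.strip()
    if short_answer:
        return f"{question}\n{SHORT_ANSWER_PROMPT}"
    return question


# VQA-style sample: the target would be something like "yellow".
vqa_prompt = build_prompt("What color is the bus?", short_answer=True)
print(vqa_prompt)

# Conversation-style sample: the target is a full paragraph, so no suffix.
chat_prompt = build_prompt("Describe the scene in detail.", short_answer=False)
print(chat_prompt)
```

Because the expected answer length is now stated in the input, the two kinds of training samples no longer send conflicting signals for the same kind of question.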

Why ShareGPT matters: Including 40K text-only conversations might seem irrelevant for a vision model. But it serves a critical purpose: it prevents the LLM from "forgetting" how to have natural, flowing conversations. Without it, the model's language quality degrades as it specializes on visual tasks. The text data acts as a regularizer for the LLM's conversational abilities.

What is the response formatting prompt trick, and why is it critical?

Appending "Answer the question using a single word or phrase" to VQA questions so the model learns when to give short vs. long answers, solving the multitask balancing problem
Converting all answers to JSON format
Removing all short answers from the dataset

Chapter 5: Resolution Matters

One of the clearest results in the paper is the impact of input resolution. Upgrading from 224px to 336px improves most benchmarks, especially those that depend on fine detail. Why?

The math of resolution

CLIP ViT-L/14 uses 14×14 pixel patches. At different resolutions:

224px: 224/14 = 16 patches per side → 256 visual tokens
336px: 336/14 = 24 patches per side → 576 visual tokens

That's 2.25× as many visual tokens. Each token still covers a 14×14 pixel patch, but because the image is presented on a larger canvas, each patch corresponds to a smaller portion of the original scene, so finer details are captured.
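The arithmetic above fits in a few lines. The function name `num_visual_tokens` is illustrative; it assumes a square input whose side is an exact multiple of the patch size, which holds for both resolutions discussed here.

```python
def num_visual_tokens(image_size, patch_size=14):
    """Number of ViT patch tokens for a square image (class token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


low = num_visual_tokens(224)
high = num_visual_tokens(336)
print(low, high, high / low)  # 256 576 2.25
```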

What higher resolution buys you

Text reading: OCR tasks like TextVQA improve dramatically because individual characters are large enough to be recognized within a patch
Fine-grained detail: Small objects, facial expressions, and distant features become visible
Reduced hallucination: When the model can actually "see" the details, it hallucinates less about them

LLaVA-1.5-HD: Going beyond 336px

The paper also explores scaling to even higher resolutions with LLaVA-1.5-HD. Instead of interpolating position embeddings (which requires expensive retraining), they use a grid-based approach:

Split the high-res image into a grid of 336×336 tiles
Encode each tile independently with CLIP
Concatenate all tile features into a single long sequence
Also encode a downsampled global view for context

This allows scaling to any resolution without modifying the vision encoder at all.
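The grid idea can be sketched as plain geometry: compute the tile boxes, then count tokens for the tiles plus one global view. This is a rough illustration under stated assumptions, not the paper's exact procedure: `tile_boxes` and `hd_token_count` are hypothetical helpers, edge tiles are assumed to be padded to full size, and no token merging or compression is applied.

```python
import math


def tile_boxes(width, height, tile=336):
    """Return (left, top, right, bottom) boxes covering the image, row-major.

    Edge boxes may extend past the image; padding is assumed to fill the gap.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


def hd_token_count(width, height, tile=336, patch=14):
    """Visual tokens for all tiles plus one downsampled global view."""
    tokens_per_view = (tile // patch) ** 2
    n_tiles = len(tile_boxes(width, height, tile))
    return (n_tiles + 1) * tokens_per_view


boxes = tile_boxes(672, 672)
print(len(boxes))                 # 4 tiles in a 2x2 grid
print(boxes[0], boxes[-1])        # (0, 0, 336, 336) (336, 336, 672, 672)
print(hd_token_count(672, 672))   # 2880 = (4 tiles + 1 global view) * 576
```

The count makes the trade-off visible: the vision encoder is untouched, but the LLM's input sequence grows linearly with the number of tiles.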

Resolution Comparison: 224px vs 336px

See how many more visual tokens the model gets at higher resolution. Each cell represents one ViT patch. More patches = finer detail for the LLM to reason about.

Why does increasing input resolution from 224px to 336px improve performance on OCR and detail-heavy tasks?

The number of visual tokens increases from 256 to 576, and each patch covers a smaller image region, so fine details like text characters are large enough to be recognized
The CLIP model becomes more accurate at higher resolution
The LLM processes the image faster

Chapter 6: Results

LLaVA-1.5 achieves state-of-the-art results across 11 benchmarks, spanning academic VQA, visual reasoning, and open-ended conversation. The results are remarkable because the model uses vastly less training data and a far simpler architecture than its competitors.

Headline numbers (LLaVA-1.5-13B)

VQAv2: 80.0 (vs. Qwen-VL-Chat's 78.2 trained on 1.4B pairs)
GQA: 63.3 (vs. InstructBLIP's 49.5 trained on 129M pairs)
TextVQA: 61.3 (competitive without any specialized OCR module)
POPE: 85.9 (hallucination benchmark — LLaVA-1.5 hallucinates less)
MME: 1531.3 (comprehensive multimodal evaluation)
MM-Vet: 36.1 (integrated visual-language capabilities)
SciQA-IMG: 71.6 (science question answering with images)

The efficiency story: InstructBLIP pre-trains on 129M image-text pairs. Qwen-VL uses 1.4B pairs (much of it proprietary). LLaVA-1.5 uses just 558K pre-training pairs and 665K instruction tuning samples — all publicly available. It trains in ~1 day on 8 A100s. Despite using 200× to 2500× less pre-training data, it matches or beats these systems on the majority of benchmarks.
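The ratios quoted above follow directly from the pair counts. A quick check, using only the numbers stated in this chapter (the variable names are illustrative):

```python
LLAVA_PRETRAIN_PAIRS = 558_000

competitors = {
    "InstructBLIP": 129_000_000,
    "Qwen-VL": 1_400_000_000,
}

# How many times more pre-training pairs each competitor uses.
ratios = {name: pairs / LLAVA_PRETRAIN_PAIRS for name, pairs in competitors.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}x more pre-training pairs")
# InstructBLIP: about 231x more pre-training pairs
# Qwen-VL: about 2509x more pre-training pairs
```

So "200× to 2500×" is a rounded description of roughly 231× and 2509×.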

Benchmark Comparison

LLaVA-1.5 (teal) vs. competitors across key benchmarks. Normalized to percentage of best known score.

How does LLaVA-1.5's pre-training data volume compare to InstructBLIP and Qwen-VL?

LLaVA-1.5 uses 558K pre-training pairs, roughly 200× less than InstructBLIP (129M) and 2500× less than Qwen-VL (1.4B), yet matches or beats them
They all use roughly the same amount of data
LLaVA-1.5 uses more data but is more efficient

Chapter 7: Ablation Studies

The paper's ablation studies are particularly illuminating because they isolate each design choice and measure its individual contribution. Let's walk through the key findings.

MLP vs. Linear Projection

Replacing the single linear layer with a two-layer MLP improves performance across the board. On MME, the jump is from 1323.8 to 1355.2 — a 31-point improvement from adding just one extra linear layer and a GELU activation. The MLP gives the projection enough capacity to learn non-trivial mappings between visual and language feature spaces.

Resolution: 224px vs. 336px

Scaling from 224px to 336px improves GQA from 50.3 to 51.4 and MME from 1426.5 to 1450, while MM-Vet dips slightly from 30.8 to 30.3. The gains show up mainly on tasks requiring fine-grained visual understanding.

Cumulative improvements (Table 2 from the paper)

The paper shows how each modification stacks:

LLaVA baseline

MME: 809.6 • MM-Vet: 25.5

↓

+ VQAv2 data

MME: 1197.0 (+387) • MM-Vet: 27.7

↓

+ Format prompt

MME: 1323.8 (+127) • MM-Vet: 26.3

↓

+ MLP connector

MME: 1355.2 (+31) • MM-Vet: 27.8

↓

+ OKVQA/OCR + Region + 336px + GQA + ShareGPT + 13B

MME: 1531.3 (+176) • MM-Vet: 36.1

The biggest single gain: Adding VQAv2 data gives the single largest improvement (+387 on MME), and the response format prompt adds another +127 on top. This confirms that the original LLaVA's weakness wasn't architectural; it was a data gap. The model already had the capacity; it just hadn't seen the right training signal.
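The per-step deltas can be recomputed from the cumulative MME scores listed in the table above. The step labels below are shortened for readability, and note that the final row bundles several changes (extra data, 336px, the 13B LLM), so its delta is not attributable to any single modification.

```python
# Cumulative MME scores, copied from the table above.
steps = [
    ("LLaVA baseline", 809.6),
    ("+ VQAv2 data", 1197.0),
    ("+ Format prompt", 1323.8),
    ("+ MLP connector", 1355.2),
    ("+ remaining changes (bundled)", 1531.3),
]

# Difference between each row and the one before it.
deltas = [
    (name, round(score - prev_score, 1))
    for (_, prev_score), (name, score) in zip(steps, steps[1:])
]

for name, delta in deltas:
    print(f"{name}: +{delta}")

largest = max(deltas, key=lambda item: item[1])
print(largest[0])  # + VQAv2 data
```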

Ablation: Cumulative MME Score

Each modification builds on the previous. The tallest jump comes from adding VQA data, not from architectural changes.

Which single change gave the largest improvement to LLaVA-1.5's MME score?

Adding VQAv2 data (+387 points on MME), with the response formatting prompt adding a further +127: the data, not the architecture, was the bottleneck
Switching from linear to MLP projection
Increasing resolution from 224px to 336px

Chapter 8: What Makes It Work

LLaVA-1.5's success is a lesson in research taste. The paper doesn't introduce any novel architecture. It doesn't collect new data. It doesn't use tricks that are hard to reproduce. Instead, it systematically identifies what actually matters and strips away what doesn't.

Lesson 1: The connector doesn't need to be smart

Q-Formers, Perceiver Resamplers, and other complex vision-language bridges compress visual information into a fixed number of tokens. This compression is lossy — and it turns out the LLM's self-attention is already perfectly capable of selecting which visual tokens to attend to. A simple MLP that just translates the feature space (without compressing the token count) is sufficient.

Lesson 2: Data quality and formatting beat data quantity

InstructBLIP trains on 129M pairs. Qwen-VL trains on 1.4B pairs. LLaVA-1.5 uses 558K pre-training pairs. The difference? LLaVA-1.5 is extremely intentional about what goes into the instruction tuning mix, and it uses response formatting prompts to prevent task interference. Quantity cannot substitute for curation.

Lesson 3: Resolution is an easy win

Going from 224px to 336px means 2.25× as many visual tokens for the LLM to process, but it pays off on most benchmarks, especially detail-heavy ones. When your model literally can't see fine details, no amount of clever architecture will help.

Lesson 4: Reproducibility is a feature

The entire recipe uses publicly available data, open-source models, and trains in one day on one 8-GPU node. This isn't just convenient — it's a statement. The paper shows that state-of-the-art multimodal capability doesn't require massive compute or proprietary data. The barrier to entry is lower than the field assumed.

The meta-lesson: Complexity in ML systems often persists not because it's necessary but because no one bothers to check. The original LLaVA showed a simple architecture could work well. LLaVA-1.5 shows that with two targeted improvements and better data, it can be state-of-the-art. The next time you reach for a complex module, ask: have I tried the simple thing with good data first?

Why does LLaVA-1.5 outperform models that use vision-language bridges like Q-Formers?

Q-Formers compress visual tokens (lossy), while LLaVA-1.5 keeps all 576 tokens and lets the LLM's self-attention decide what to focus on, preserving more information with less complexity
LLaVA-1.5 uses a larger LLM
Q-Formers are poorly implemented

Chapter 9: Connections

LLaVA-1.5 sits at a critical junction in the evolution of visual language models. Let's trace its lineage and influence.

Ancestors

CLIP (Radford et al., 2021): The vision encoder. CLIP's contrastive pre-training on 400M image-text pairs gives LLaVA-1.5 its visual backbone. Without CLIP's strong zero-shot visual features, the simple MLP projection wouldn't work.
LLaVA (Liu et al., 2023): The direct predecessor. Established the linear projection + LLM architecture and visual instruction tuning. LLaVA-1.5 is an incremental refinement, not a new paradigm.
Flamingo (Alayrac et al., 2022): Pioneered the idea of feeding visual tokens into a frozen LLM, using cross-attention layers interleaved with the LLM. More complex than LLaVA but established the paradigm of "vision as tokens for the LLM."
BLIP-2 (Li et al., 2023): Introduced the Q-Former for bridging vision and language modalities. The Q-Former's complexity is exactly what LLaVA-1.5 argues against — you can match its results with a 2-layer MLP.

Contemporaries and successors

InternVL (Chen et al., 2024): Scales the vision encoder itself (InternViT-6B), taking the opposite approach to LLaVA-1.5's data efficiency. Shows that scaling the vision side also works, but at much higher cost.
Qwen-VL (Bai et al., 2023): Uses massive pre-training data (1.4B pairs, including proprietary data) and a three-stage training pipeline. LLaVA-1.5 matches it with 2500× less pre-training data.
LLaVA-NeXT / LLaVA-1.6 (Liu et al., 2024): Builds directly on LLaVA-1.5 with dynamic high-resolution encoding, stronger LLM backbones (Mistral, Yi), and improved data. The HD grid approach from this paper becomes central.
Multimodal LLMs broadly: The architectures of commercial frontier models such as GPT-4V, Gemini, and Claude are not public, so a direct comparison is not possible. Among open models, though, the field largely converged on ViT + projection + LLM, a recipe that LLaVA-1.5 helped validate by showing it is competitive.

The lasting impact: LLaVA-1.5 didn't invent a new architecture. Its contribution was showing the community that the right answer was already in front of them — a simple connector, the right resolution, and curated data. This paper became a reference baseline for virtually every VLM paper that followed in 2024, and its training recipe (two-stage, MLP projection, mixed data with format prompts) became the standard template.

What is LLaVA-1.5's most lasting contribution to the field?

Demonstrating that a simple MLP connector with curated data and proper resolution can match complex VLM architectures trained on orders of magnitude more data, establishing a reproducible baseline for the field
Inventing a new type of vision encoder
Creating the largest multimodal dataset

Improved Baselines with Visual Instruction Tuning