A simple recipe that matches or beats far more complex VLMs: CLIP ViT-L/14 at 336px, a two-layer MLP projection, and Vicuna-13B — trained on just 1.2M public data samples in one day on 8 A100s.
By late 2023, large multimodal models (LMMs) were racing to become general-purpose visual assistants. But the leading approaches looked wildly different in complexity:
Meanwhile, the original LLaVA (2023) took a radically simpler approach: a single linear projection layer connecting CLIP to an LLM, trained on just 600K image-text pairs. It excelled at open-ended visual conversation but struggled on benchmarks requiring short, precise answers (like VQA).
The community assumed that to match the complex models on academic benchmarks, you needed their complexity. Hundreds of millions of training samples. Specialized vision-language modules. Proprietary data.
LLaVA-1.5's insight is almost embarrassingly simple. The fully-connected vision-language connector in LLaVA is already surprisingly powerful and data-efficient. You don't need a Q-Former. You don't need billions of pre-training pairs. You need three targeted improvements:
The original LLaVA used a single linear layer to map CLIP visual features into the LLM's embedding space. LLaVA-1.5 replaces this with a two-layer MLP (with GELU activation). That's it — one extra linear layer and a nonlinearity. This gives the connector enough representational power to better align visual and language features.
Swap CLIP-ViT-L/14 at 224px for CLIP-ViT-L/14 at 336px. More pixels means the model can read text in images, distinguish fine details, and handle complex scenes. The 336px model produces 576 visual tokens (24×24 patches) instead of 256 tokens (16×16).
Include VQA datasets (VQAv2, GQA, OCR-VQA, etc.) in instruction tuning, but with a critical twist: append a response format prompt that tells the model when to give short answers, namely "Answer the question using a single word or phrase." This keeps the model from overfitting to either long-form or short-form responses.
Pre-training data used (millions of samples) versus number of benchmarks won. LLaVA-1.5 achieves more with dramatically less data.
LLaVA-1.5's architecture is one of the simplest possible designs for a visual language model. Three components, connected in a straight line:
A Vision Transformer pre-trained by OpenAI's CLIP on 400M image-text pairs. It processes 336×336 images, dividing them into 14×14 pixel patches. That gives us a grid of 24×24 = 576 patch tokens, each a 1024-dimensional vector encoding local visual features.
The vision encoder stays frozen during both pre-training and instruction tuning. This is important: the visual features are already excellent from CLIP training. We just need to teach the LLM how to read them.
Each of the 576 visual tokens passes through:
That's it. Two linear layers with a GELU in between. No attention, no cross-attention, no query tokens, no resampling. Each visual patch becomes one token in the LLM's embedding space. With 1024-dimensional CLIP features and Vicuna-13B's 5120-dimensional hidden size, the projector has roughly 31M parameters, less than 0.3% of the full model.
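The projector's size can be checked with quick arithmetic. The sketch below assumes 1024-dimensional CLIP features, a 5120-dimensional LLM hidden size (the usual figure for a 13B LLaMA-family model), a second layer that maps the hidden size to itself, and biases on both layers. The function names are illustrative, not taken from any codebase.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Parameters in one fully-connected layer: weight matrix plus bias."""
    return d_in * d_out + d_out


def projector_params(vision_dim: int = 1024, llm_dim: int = 5120) -> int:
    """Two-layer MLP: Linear(vision_dim, llm_dim) -> GELU -> Linear(llm_dim, llm_dim).

    GELU has no parameters, so only the two linear layers count.
    The dimensions are assumptions stated in the text above.
    """
    return linear_params(vision_dim, llm_dim) + linear_params(llm_dim, llm_dim)


total = projector_params()
share = total / 13e9  # assumed 13B-parameter language model

print(f"{total:,} parameters")      # 31,467,520 parameters
print(f"{share:.2%} of the model")  # 0.24% of the model
```

Under these assumptions the projector comes out at about 31.5M parameters, consistent with the "less than 0.3%" share of the full model.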
Vicuna is a fine-tuned LLaMA model trained on ShareGPT conversations. It handles the actual reasoning: the 576 projected visual tokens are concatenated with the text tokens and fed into Vicuna as a single sequence. The LLM processes them jointly via self-attention, treating visual tokens as just another "language" it has learned to read.
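To make the "single sequence" idea concrete, here is a minimal sketch of how visual tokens might be spliced into a prompt. It uses plain strings as stand-ins for embedding vectors, and the `<image>` placeholder and the prompt layout are illustrative assumptions rather than the exact format used in training.

```python
IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where the image goes


def build_input_sequence(prompt_tokens: list, visual_tokens: list) -> list:
    """Replace the single image placeholder with the projected visual tokens."""
    if prompt_tokens.count(IMAGE_PLACEHOLDER) != 1:
        raise ValueError("expected exactly one image placeholder")
    i = prompt_tokens.index(IMAGE_PLACEHOLDER)
    return prompt_tokens[:i] + visual_tokens + prompt_tokens[i + 1:]


# Stand-ins for the 576 projected patch embeddings.
visual = [f"<v{i}>" for i in range(576)]
prompt = ["USER:", IMAGE_PLACEHOLDER, "What", "color", "is", "the", "bus?", "ASSISTANT:"]

sequence = build_input_sequence(prompt, visual)
print(len(sequence))  # 583: 576 visual tokens plus 7 text tokens
```

Once the sequence is built, the LLM treats every position the same way: self-attention runs over visual and text positions together, with no separate cross-attention pathway.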
The full architecture: image patches flow through CLIP, get projected by a 2-layer MLP, and are concatenated with text tokens before entering the LLM.
LLaVA-1.5 follows a clean two-stage training protocol. Each stage serves a distinct purpose, and the setup is carefully designed to be data-efficient.
The goal here is simple: teach the MLP projector to translate CLIP's visual features into the LLM's word embedding space. Think of it as training a translator between two languages that already exist.
After this stage, the LLM can already describe images — but it hasn't learned to follow diverse visual instructions.
Now we teach the model to follow complex visual instructions: answer VQA questions, describe regions, reason about scenes, engage in multi-turn conversations.
Which components are trained versus frozen in each stage. The transition from Stage 1 to Stage 2 unfreezes the LLM.
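The two-stage schedule can be summarized as a small configuration sketch. It assumes the vision encoder stays frozen in both stages and that the projector keeps training in Stage 2 alongside the LLM. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    train_vision_encoder: bool
    train_projector: bool
    train_llm: bool


# Assumed schedule: only the projector learns in Stage 1,
# then the LLM is unfrozen for Stage 2.
STAGES = [
    StageConfig("stage 1: feature alignment pre-training", False, True, False),
    StageConfig("stage 2: visual instruction tuning", False, True, True),
]


def trainable_modules(stage: StageConfig) -> list:
    """List the components that receive gradient updates in a stage."""
    flags = {
        "vision_encoder": stage.train_vision_encoder,
        "projector": stage.train_projector,
        "llm": stage.train_llm,
    }
    return [name for name, enabled in flags.items() if enabled]


for stage in STAGES:
    print(stage.name, "->", trainable_modules(stage))
```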
This is where LLaVA-1.5 gets clever. The data mixture for Stage 2 instruction tuning is carefully designed to balance multiple capabilities without sacrificing any one of them.
This is arguably the most important data-engineering insight in the paper: when including VQA data that expects short answers, the authors append the prompt "Answer the question using a single word or phrase."
Without this, the model gets confused: sometimes it sees training data that expects "yellow" and other times data that expects a full paragraph. With the formatting prompt, the model learns to switch output modes based on the user's instructions. This solves the multitask balancing problem that plagued InstructBLIP, which overfitted to short answers even when detailed responses were requested.
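A minimal sketch of the idea, assuming each training sample carries a flag saying whether it expects a short answer. The function name, the flag, and the newline separator are illustrative choices; only the prompt text itself comes from the paper.

```python
FORMAT_PROMPT = "Answer the question using a single word or phrase."


def format_question(question: str, short_answer: bool) -> str:
    """Append the response format prompt only to short-answer questions."""
    question = question.strip()
    if short_answer:
        # Separator is an assumption; the key point is that the prompt follows the question.
        return f"{question}\n{FORMAT_PROMPT}"
    return question


print(format_question("What color is the bus?", short_answer=True))
print(format_question("Describe the scene in detail.", short_answer=False))
```

Because the instruction is explicit in the input, the same model can be trained on both kinds of data and still produce "yellow" in one case and a paragraph in the other.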
One of the clearest results in the paper is the impact of input resolution. Upgrading from 224px to 336px improves most benchmarks. Why?
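CLIP ViT-L/14 uses 14×14 pixel patches. At different resolutions:
224px: 224 / 14 = 16 patches per side, so 16×16 = 256 visual tokens
336px: 336 / 14 = 24 patches per side, so 24×24 = 576 visual tokens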
That's 2.25× as many visual tokens. Each token still represents a 14×14 pixel region, but that region now covers a smaller portion of the original image, so finer details are captured.
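The token arithmetic generalizes to any square input whose side is a multiple of the patch size. A small sketch, with an illustrative helper name:

```python
def num_visual_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side * side


low = num_visual_tokens(224)   # 16 * 16 = 256
high = num_visual_tokens(336)  # 24 * 24 = 576

print(low, high, high / low)   # 256 576 2.25

# Share of the image area covered by a single patch at each resolution.
print(f"{1 / low:.2%} vs {1 / high:.2%}")  # 0.39% vs 0.17%
```

Each patch at 336px covers less than half the image area it covers at 224px, which is why small text and fine textures become easier to resolve.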
The paper also explores scaling to even higher resolutions with LLaVA-1.5-HD. Instead of interpolating position embeddings (which requires expensive retraining), they use a grid-based approach:
This allows scaling to any resolution without modifying the vision encoder at all.
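A rough sketch of what a grid-based scheme could look like, assuming the image is split into tiles at the encoder's native resolution, each tile is encoded independently, and one downsampled copy of the whole image is added for global context. The tile size, the clipping of edge tiles, and the helper names are all assumptions for illustration; a real pipeline would likely pad or resize the image first.

```python
import math


def grid_tiles(width: int, height: int, tile: int = 224) -> list:
    """Return (left, top, right, bottom) boxes covering the image in a grid."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return [
        (c * tile, r * tile, min((c + 1) * tile, width), min((r + 1) * tile, height))
        for r in range(rows)
        for c in range(cols)
    ]


def hd_token_count(width: int, height: int, tile: int = 224, patch: int = 14) -> int:
    """Visual tokens if every tile plus one global view is encoded separately."""
    tokens_per_view = (tile // patch) ** 2
    views = len(grid_tiles(width, height, tile)) + 1  # +1 for the downsampled global view
    return views * tokens_per_view


print(len(grid_tiles(448, 448)))  # 4 tiles in a 2x2 grid
print(hd_token_count(448, 448))   # 1280 = (4 tiles + 1 global view) * 256 tokens
```

The vision encoder only ever sees inputs at the size it was trained on, so its position embeddings are left untouched. The cost shows up as a longer visual token sequence for the LLM.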
Visual token grids at lower and higher resolution. Each cell represents one ViT patch. More patches mean finer detail for the LLM to reason about.
LLaVA-1.5 achieves state-of-the-art results across 11 benchmarks, spanning academic VQA, visual reasoning, and open-ended conversation. The results are remarkable because the model uses vastly less training data and a far simpler architecture than its competitors.
LLaVA-1.5 (teal) vs. competitors across key benchmarks. Normalized to percentage of best known score.
The paper's ablation studies are particularly illuminating because they isolate each design choice and measure its individual contribution. Let's walk through the key findings.
Replacing the single linear layer with a two-layer MLP improves performance across the board. On MME, the jump is from 1323.8 to 1355.2 — a 31-point improvement from adding just one extra linear layer and a GELU activation. The MLP gives the projection enough capacity to learn non-trivial mappings between visual and language feature spaces.
Scaling from 224px to 336px improves GQA from 50.3 to 51.4 and MME from 1426.5 to 1450, while MM-Vet dips slightly from 30.8 to 30.3. The gains are especially large on tasks requiring fine-grained visual understanding.
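The per-benchmark changes quoted for these two ablations can be tabulated directly. The snippet below uses only the numbers given in the text; the dictionary layout and labels are illustrative.

```python
# (before, after) scores for each ablation, as quoted in the text.
ABLATIONS = {
    "linear -> two-layer MLP": {"MME": (1323.8, 1355.2)},
    "224px -> 336px": {
        "GQA": (50.3, 51.4),
        "MME": (1426.5, 1450.0),
        "MM-Vet": (30.8, 30.3),
    },
}


def deltas(ablations: dict) -> dict:
    """Score change (after minus before) per benchmark, rounded to one decimal."""
    return {
        change: {bench: round(after - before, 1) for bench, (before, after) in scores.items()}
        for change, scores in ablations.items()
    }


for change, per_bench in deltas(ABLATIONS).items():
    print(change, per_bench)
```

Note that the benchmarks use different scales (MME scores are in the thousands, GQA and MM-Vet are percentages), so the raw deltas are only comparable within a benchmark.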
The paper shows how each modification stacks:
Each modification builds on the previous. The tallest jump comes from adding VQA data, not from architectural changes.
LLaVA-1.5's success is a lesson in research taste. The paper doesn't introduce any novel architecture. It doesn't collect new data. It doesn't use tricks that are hard to reproduce. Instead, it systematically identifies what actually matters and strips away what doesn't.
Q-Formers, Perceiver Resamplers, and other complex vision-language bridges compress visual information into a fixed number of tokens. This compression is lossy — and it turns out the LLM's self-attention is already perfectly capable of selecting which visual tokens to attend to. A simple MLP that just translates the feature space (without compressing the token count) is sufficient.
InstructBLIP trains on 129M pairs. Qwen-VL trains on 1.4B pairs. LLaVA-1.5 uses 558K pre-training pairs. The difference? LLaVA-1.5 is extremely intentional about what goes into the instruction tuning mix, and it uses response formatting prompts to prevent task interference. Quantity cannot substitute for curation.
Going from 224px to 336px raises the visual token count by 2.25× (256 to 576), with a corresponding increase in compute, but gives outsized returns on most benchmarks. When your model literally can't see fine details, no amount of clever architecture will help.
The entire recipe uses publicly available data, open-source models, and trains in one day on one 8-GPU node. This isn't just convenient — it's a statement. The paper shows that state-of-the-art multimodal capability doesn't require massive compute or proprietary data. The barrier to entry is lower than the field assumed.
LLaVA-1.5 sits at a critical junction in the evolution of visual language models. Let's trace its lineage and influence.