Introduction
In March 2023, OpenAI released GPT-4 with multimodal capabilities. For the first time, a production language model could look at an image and have an intelligent conversation about it. It could read charts, explain memes, debug code from screenshots, and reason about spatial relationships. The demonstrations were staggering.
But GPT-4 was closed. Its weights were proprietary. Its training data was unknown. Its architecture was a guess. The open research community faced a question with enormous stakes: could an open model match this capability, and if so, how?
One month later, in April 2023, Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee released LLaVA (Large Language and Vision Assistant). The paper, "Visual Instruction Tuning," proposed something almost provocatively simple: take an existing vision encoder (CLIP ViT-L/14), connect it to an existing language model (LLaMA/Vicuna) with a single linear projection layer, and train the whole thing in two stages using just 595K alignment pairs and 158K instruction examples.
The cost was measured in hundreds of dollars of compute. The result was a model that could hold multi-turn visual conversations, describe images in detail, and answer complex reasoning questions — capabilities that, just weeks earlier, seemed to require billions of dollars of training infrastructure.
The secret was not a better architecture. It was better data, generated through a pipeline that used text-only GPT-4 to produce instruction-response pairs from image captions and bounding boxes. No pixel access needed. No expensive human annotation. Just clever prompt engineering applied to metadata that already existed.
LLaVA did not match GPT-4V on benchmarks. It did not need to. What it proved was that visual instruction tuning — the process of teaching a VLM to follow instructions about images — was a tractable, reproducible problem. It democratized multimodal AI research overnight.
This article covers the complete LLaVA recipe, from the two-stage training pipeline to the GPT-4 data generation trick, and follows the lineage through LLaVA-1.5 and LLaVA-NeXT to the broader ecosystem of visual instruction-tuned models. We will look at every decision that matters and explain exactly why each one works.
The Two-Stage Recipe
Visual instruction tuning is not one training run. It is two, with fundamentally different objectives, data, and parameter budgets. Getting the split right is the difference between a model that hallucinates visual details and one that actually sees.
Stage 1: Feature Alignment (Pre-training)
The vision encoder speaks one language. The LLM speaks another. Stage 1 exists to build a translator between them. Here is the setup:
- Vision encoder: CLIP ViT-L/14 — frozen. Every weight locked.
- LLM: Vicuna-7B (or 13B) — frozen. Every weight locked.
- Projection layer: A single linear layer, W, mapping from the vision encoder's output dimension (1024) to the LLM's input dimension (4096). This is the only thing that trains.
- Data: 595K image-text pairs filtered from CC3M (Conceptual Captions 3M). Each pair is reformatted as a single-turn conversation: the user asks "Describe this image briefly" and the assistant responds with the caption.
- Objective: Standard next-token prediction on the assistant’s response.
The projection layer has roughly 4 million parameters (1024 × 4096). Compare this to the 7 billion parameters in the LLM and the 300 million in the vision encoder. You are training 0.05% of the system. This converges fast — a few hours on 8 A100 GPUs.
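This trainability scheme can be sketched in a few lines of PyTorch (the helper and module names are illustrative, not from the LLaVA codebase; Stage 2, covered next, flips the LLM to trainable):

```python
import torch.nn as nn

def configure_trainability(vision_encoder: nn.Module,
                           projector: nn.Module,
                           llm: nn.Module,
                           stage: int) -> int:
    """Freeze/unfreeze modules per LLaVA stage; return trainable param count.

    Stage 1: only the projector trains.
    Stage 2: projector + LLM train; the vision encoder stays frozen.
    """
    for p in vision_encoder.parameters():
        p.requires_grad = False          # frozen in both stages
    for p in projector.parameters():
        p.requires_grad = True           # trains in both stages
    for p in llm.parameters():
        p.requires_grad = (stage == 2)   # unfrozen only in Stage 2
    modules = [vision_encoder, projector, llm]
    return sum(p.numel() for m in modules
               for p in m.parameters() if p.requires_grad)
```

With real weights, stage 1 reports only the projector's ~4M parameters; stage 2 adds the full 7B of the LLM.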
Stage 2: End-to-End Instruction Tuning
Now we teach the model to do things with visual information. The setup changes significantly:
- Vision encoder: Still frozen. The CLIP features are good as-is.
- Projection layer: Trainable. It continues to adapt.
- LLM: Trainable. The entire Vicuna model is unfrozen.
- Data: 158K instruction-response pairs generated by GPT-4 (the text-only version). Three types: conversations, detailed descriptions, and complex reasoning.
- Objective: Next-token prediction, but loss is computed only on assistant tokens. The user turns and image tokens are masked from the loss.
This stage is more expensive because you are backpropagating through 7B parameters. But the data is small (158K examples), and you train for only one epoch. The result is a model that has learned to follow visual instructions — to describe, to analyze, to reason, to converse — all grounded in real visual features.
Why Freeze? The Catastrophic Forgetting Problem
Freezing components is not laziness. It is the most important engineering decision in the pipeline.
Why freeze the vision encoder in both stages? CLIP ViT-L/14 was trained on 400 million image-text pairs using contrastive learning. Its features generalize to essentially every visual concept humans care about. If you fine-tune it on 595K captioning pairs, you will overwrite that generality. The model will get slightly better at captioning and catastrophically worse at everything else. The features will become brittle. A frozen vision encoder is a universal vision encoder.
Why freeze the LLM in Stage 1? Vicuna was fine-tuned on 70K ShareGPT conversations. It already knows how to be a helpful assistant, how to format responses, how to handle multi-turn dialogue. If you backpropagate noisy alignment gradients through it from 595K simple captioning pairs, you risk degrading these instruction-following capabilities. The LLM should learn from instruction data (Stage 2), not from alignment data (Stage 1).
Why unfreeze the LLM in Stage 2? Because instruction-following requires the LLM to integrate visual information into its reasoning process. A frozen LLM can only treat image tokens as context — it cannot learn new attention patterns between visual features and its own internal representations. Unfreezing lets the model learn how to think about what it sees.
Formally, Stage 2 solves θ* = argmin_{W, θ_LLM} Σ L_masked(x_img, x_instruct; W, θ_CLIP (frozen), θ_LLM): the masked next-token loss is minimized over the projection weights W and the LLM parameters θ_LLM, while the CLIP encoder stays fixed.
GPT-4 Data Generation Pipeline
This is the part of the LLaVA paper that changed everything. Not the architecture — the architecture is trivially simple. The breakthrough was figuring out how to generate high-quality visual instruction data without showing any images to the data generation model.
Here is the problem: you need training data that looks like multi-turn visual conversations. A user shows an image and asks a question. An assistant provides a detailed, accurate answer. Collecting this through human annotation is expensive (think $10–50 per conversation). You need at least 100K pairs. The budget does not work.
Here is the solution: use text-only GPT-4 with image metadata as a proxy for the image itself.
For each COCO image, you already have:
- Captions: 5 human-written captions describing the image.
- Bounding boxes: Object categories and their spatial coordinates (e.g., “person: [0.12, 0.34, 0.56, 0.78]”).
You feed these to GPT-4 with a carefully designed prompt that says, roughly: “Given these captions and bounding boxes, generate a multi-turn conversation between a curious user and an AI assistant about this image.” GPT-4 has never seen the actual pixels. But given detailed captions and spatial layout information, it can hallucinate plausible visual conversations that would be consistent with the image.
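A sketch of how such a prompt can be assembled from COCO metadata (the wording is paraphrased, not the paper's verbatim system prompt; the helper name is ours):

```python
def build_generation_prompt(captions, boxes):
    """Assemble a text-only data-generation prompt from image metadata.

    captions: list of human-written caption strings
    boxes: list of (label, (x1, y1, x2, y2)) with normalized coordinates
    """
    caption_text = "\n".join(f"- {c}" for c in captions)
    box_text = "\n".join(
        f"- {label}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
        for label, (x1, y1, x2, y2) in boxes
    )
    return (
        "You are an AI visual assistant. You are given several captions and "
        "object bounding boxes (normalized coordinates) describing the same "
        "image. Design a multi-turn conversation between a curious user and "
        "an assistant who can see the image. Only include content that can "
        "be confidently inferred from the metadata.\n\n"
        f"Captions:\n{caption_text}\n\nObjects:\n{box_text}"
    )
```

The returned string is what gets sent to the text-only GPT-4 endpoint; the model's reply becomes a training conversation.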
Three Types of Instruction Data
The LLaVA authors generated three distinct types of data, each targeting a different capability:
- Conversation (58K pairs): Multi-turn dialogues about an image. The user asks increasingly specific questions, and the assistant answers based on what it “sees.” These teach the model to maintain context across turns and to connect visual observations to natural language.
- Detailed Description (23K pairs): Single-turn pairs where the user asks for a comprehensive description and the assistant provides a long, thorough response covering objects, spatial relationships, colors, activities, background elements, and overall scene composition. These teach exhaustive visual grounding.
- Complex Reasoning (77K pairs): Single-turn pairs where the user asks a question requiring multi-step inference. “What might happen next in this scene?” “Why is this person doing X?” “What does this sign suggest about the location?” These teach the model to reason about visual content, not just describe it.
Cost and Scaling
The total API cost for generating all 158K pairs was approximately $500 in 2023 dollars, using the GPT-4 API at ~$0.03 per 1K tokens. For context, hiring human annotators to produce the same data at $15 per conversation would cost roughly $2.4 million. The LLaVA data pipeline achieved a roughly 4,800x cost reduction.
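The arithmetic behind those figures, using the $15-per-conversation estimate from above:

```python
pairs = 158_000
human_cost = pairs * 15              # $15 per conversation, human-annotated
api_cost = 500                       # approximate total GPT-4 API spend

print(f"${human_cost:,}")            # $2,370,000, i.e. roughly $2.4 million
print(round(human_cost / api_cost))  # 4740, quoted as a ~4,800x reduction
```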
This cost efficiency had a second-order effect that was arguably more important than the paper itself: it meant that anyone could generate visual instruction data. You did not need an annotation team. You needed a credit card and a weekend. This unlocked a Cambrian explosion of VLM research.
Here is one example of each type (user turns reconstructed for illustration):

Conversation (multi-turn dialogue; teaches context maintenance and visual grounding across turns):
User: What is the main object in this image?
Asst: The main object is a red double-decker bus parked on a city street.
User: Are there any people near the bus?
Asst: Yes, there are several pedestrians walking on the sidewalk next to the bus, and two people appear to be waiting at the bus stop.

Detailed Description (exhaustive single-turn description; teaches comprehensive visual coverage):
User: Describe this image in detail.
Asst: The image shows a bustling outdoor market on a sunny day. In the foreground, a vendor arranges colorful produce — tomatoes, peppers, and leafy greens — on a wooden table covered with a checkered cloth. Behind the stall, rows of similar vendors extend along a narrow cobblestone street...

Complex Reasoning (multi-step inference; teaches reasoning about visual content):
User: What can you infer about this person's occupation from their workspace?
Asst: Given the multiple monitors showing code editors, the mechanical keyboard, the stack of O'Reilly technical books, and the whiteboard with system architecture diagrams, this person is most likely a software engineer or systems architect...
Conversation Formats
Image Tokens in the Sequence
The vision encoder (CLIP ViT-L/14 at 224px resolution) produces 256 patch tokens: a 16×16 grid of 14×14-pixel patches, with the CLS token discarded. After the linear projection, these become 256 embeddings in the LLM's space. They are inserted directly into the token sequence, replacing a special <image> placeholder.
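The token counts follow directly from the ViT patch geometry; a quick check:

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Patch tokens for a square ViT input (CLS token excluded)."""
    side = image_size // patch_size
    return side * side

print(num_patch_tokens(224))  # 16 x 16 = 256 tokens (original LLaVA)
print(num_patch_tokens(336))  # 24 x 24 = 576 tokens (LLaVA-1.5)
```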
The conversation template looks like this, with the first user turn carrying the image placeholder:

SYSTEM: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: <image>
What is unusual about this image?
ASSISTANT: ...
From the LLM’s perspective, image tokens are just another part of the input sequence. They attend to text tokens and text tokens attend to them through standard causal self-attention. There is no special cross-attention mechanism. The projection layer’s job is to make sure the image embeddings are “compatible” enough with the text embeddings that the attention mechanism can find useful correlations.
Loss Masking
During training, not all tokens contribute to the loss. This is critical for learning efficiency:
- System prompt tokens: Masked (loss = 0). The model should not be rewarded for memorizing the system prompt.
- Image tokens: Masked (loss = 0). These are inputs, not predictions.
- User message tokens: Masked (loss = 0). The model should not be rewarded for predicting what the user says.
- Assistant response tokens: Active (loss computed). This is what the model needs to learn to generate.
In a multi-turn conversation, this means only the assistant’s turns count. If a conversation has 3 user turns and 3 assistant turns, only the 3 assistant turns produce gradients. This is the same pattern used in standard instruction tuning (e.g., Alpaca, Vicuna), extended to handle image tokens.
The training loss is the autoregressive objective over assistant tokens: L = −Σ_t log p_θ(x_t | v_1, …, v_256, x_<t), where the sum runs over assistant response positions, v_1...v_256 are the projected image patch embeddings, and x_<t includes all preceding tokens (system, user, and previous assistant turns).
LLaVA-1.5
LLaVA-1.5 (Liu et al., October 2023) was not a new idea. It was the same idea, executed with ruthless attention to three details that collectively gained +3 to +12 points on every benchmark:
1. MLP Connector Instead of Linear Projection
The original LLaVA used a single linear layer: h = W·v. LLaVA-1.5 replaced this with a two-layer MLP with GELU activation: h = W₂·GELU(W₁·v).
This sounds trivial. It is not. A linear projection can only learn affine transformations — rotations, scalings, translations. An MLP can learn nonlinear mappings. The CLIP embedding space and the Vicuna embedding space are not linearly aligned. There are concepts that CLIP encodes as curved manifolds that need to be “straightened out” before the LLM can use them effectively. The MLP handles this. Average gain: +3 points across benchmarks. For two lines of code.
2. Higher Resolution: CLIP ViT-L/14 @ 336px
The original LLaVA used 224×224 input resolution (256 tokens). LLaVA-1.5 bumped this to 336×336, which produces 576 tokens (24×24 grid). More tokens means more visual detail preserved, at the cost of longer sequences. For tasks that require reading text in images, recognizing small objects, or understanding spatial layout, this is a large win.
3. Academic Task Data Mixing
The original 158K dataset was expanded to 665K instruction pairs by mixing in academic VQA datasets reformatted as conversations:
- VQAv2, GQA, OKVQA (visual question answering)
- OCR-VQA, TextVQA (text in images)
- Region-level VQA (Visual Genome)
- The original LLaVA-Instruct 158K
This mixture was key. Pure GPT-4-generated data produces models that are articulate but imprecise. Pure academic VQA data produces models that are precise but terse. The mixture gets you both.
Results
LLaVA-1.5-13B achieved state-of-the-art performance on 11 out of 12 benchmarks among open-source models, using only publicly available data. The 7B version was competitive with models 3–5x its size. Total training: roughly one day on a single 8×A100 node.
LLaVA-NeXT (1.6)
The single biggest limitation of LLaVA-1.5 was resolution. Resized to 336×336, a 1024×768 image keeps only about 14% of its original pixel count; roughly 86% of the pixels are thrown away. Fine text becomes illegible. Small objects vanish. Dense scenes turn to mush.
LLaVA-NeXT (Liu et al., January 2024) solved this with dynamic high-resolution tiling.
The AnyRes Strategy
Instead of resizing every image to a fixed resolution, AnyRes works as follows:
- Compute the optimal grid: Given an input image of size W×H, find the best tiling configuration from a set of candidates (e.g., 1×1, 1×2, 2×1, 2×2, 1×3, 3×1). The goal is to minimize distortion while staying within a token budget.
- Split the image into tiles: Each tile is 336×336 pixels. A 672×672 image becomes a 2×2 grid of 4 tiles.
- Process each tile independently: Each tile goes through CLIP ViT-L/14@336px, producing 576 tokens per tile.
- Add a global context tile: The original image is also resized to 336×336 and processed as a single “overview” tile. This gives the model both fine-grained local detail (from tiles) and global scene context.
- Concatenate all tokens: For a 2×2 grid, you get (4 + 1) × 576 = 2,880 tokens.
The maximum configuration supports 4 local tiles plus the global tile, i.e. 5 × 576 = 2,880 image tokens: an 11x increase over the original LLaVA's 256 tokens.
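The grid-selection step can be sketched as follows: pick the candidate tiling that preserves the most effective resolution, breaking ties by least wasted grid area. This is in the spirit of LLaVA-NeXT's resolution selection, not the reference implementation:

```python
CANDIDATE_GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]
TILE = 336
TOKENS_PER_TILE = 576  # 24x24 patches per 336px tile

def select_grid(width, height, candidates=CANDIDATE_GRIDS):
    """Choose the (cols, rows) tiling for an input image."""
    best_key, best_grid = None, None
    for cols, rows in candidates:
        gw, gh = cols * TILE, rows * TILE
        # Scale the image to fit inside the candidate grid.
        scale = min(gw / width, gh / height)
        effective = min(int(width * scale) * int(height * scale),
                        width * height)          # resolution actually kept
        wasted = gw * gh - effective             # padding the grid wastes
        key = (effective, -wasted)
        if best_key is None or key > best_key:
            best_key, best_grid = key, (cols, rows)
    return best_grid

def total_image_tokens(width, height):
    cols, rows = select_grid(width, height)
    return (cols * rows + 1) * TOKENS_PER_TILE   # +1 for the global tile
```

A 672×672 input selects the 2×2 grid, giving (4 + 1) × 576 = 2,880 tokens; a wide 1000×336 input selects 3×1.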
Improved Data: 1.2M Curated Instructions
LLaVA-NeXT expanded the instruction data to 1.2 million examples with higher quality curation:
- High-quality GPT-4V-generated conversations (replacing the text-only GPT-4 pipeline)
- Document and chart understanding data (DocVQA, ChartQA)
- Multi-image and video conversation data
- Harder reasoning questions with chain-of-thought formatting
| Feature | LLaVA (Apr 2023) | LLaVA-1.5 (Oct 2023) | LLaVA-NeXT (Jan 2024) |
|---|---|---|---|
| Connector | Linear | 2-layer MLP | 2-layer MLP |
| Resolution | 224px | 336px | Dynamic (up to ~672px) |
| Image Tokens | 256 | 576 | Up to 2,880 |
| Training Data | 158K instruct | 665K instruct | 1.2M instruct |
| LLM Backbone | Vicuna-7B/13B | Vicuna-7B/13B | Mistral-7B / Yi-34B |
| VQAv2 (test) | ~76 | 80.0 | 82.3 |
| MMBench | ~36 | 67.7 | 72.1 |
| TextVQA | ~46 | 61.3 | 65.2 |
Beyond LLaVA
LLaVA established the template. Other teams took the same core idea — connect a vision encoder to an LLM, train with visual instructions — and made different choices at every decision point. The diversity of these approaches reveals which design choices actually matter.
InternVL
InternVL (Chen et al., December 2023) rejected the assumption that you should use an off-the-shelf vision encoder. Instead, they trained InternViT-6B from scratch — a 6-billion-parameter vision transformer, co-trained with a language model through progressive alignment.
Why 6B parameters for a vision encoder when CLIP uses 300M? Because InternVL was designed for a different regime: one where the vision encoder needs to handle document-level OCR, fine-grained chart parsing, and dense scene understanding. CLIP’s features were trained for image-level semantics (what is in this image?). InternViT’s features were trained for pixel-level semantics (what is at every location in this image?).
The co-training strategy matters: vision and language parameters are updated together from the start, so the representation spaces grow towards each other rather than being jammed together after the fact. The tradeoff is cost — training InternViT-6B is enormously more expensive than fine-tuning a projection layer.
Qwen-VL
Qwen-VL (Bai et al., August 2023) introduced two innovations that addressed specific weaknesses in the LLaVA design:
- Visual resampler: Instead of projecting every patch token directly into the LLM (which produces a long sequence at high resolution), Qwen-VL uses a cross-attention resampler (similar to Flamingo's Perceiver Resampler) to compress the 1,024 patch tokens from its 448px encoder down to 256 fixed query tokens. These are learnable queries that attend to the image, extracting only the relevant information. This decouples image resolution from sequence length.
- Position-aware vision-language adapter: Qwen-VL incorporates 2D absolute positional encodings into the resampler's cross-attention, preserving spatial layout information that the compression would otherwise discard. This is critical for tasks that require understanding where things are in the image, not just what they are — tasks like referring expression comprehension and visual grounding.
Qwen-VL was also notable for its training data scale: 1.4 billion image-text pairs for pre-training (vs. LLaVA’s 595K), giving the visual resampler far more signal to learn from.
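A minimal Perceiver-style resampler in the spirit of Qwen-VL's adapter (dimensions and initialization are illustrative, not the released architecture): a fixed set of learnable queries cross-attends to the patch features, so the output length is constant regardless of input resolution.

```python
import torch
import torch.nn as nn

class VisualResampler(nn.Module):
    """Compress a variable number of patch tokens to fixed query tokens."""

    def __init__(self, num_queries=256, dim=4096, num_heads=8):
        super().__init__()
        # Learnable queries: these, not the patches, set the output length.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_features):
        # patch_features: (batch, n_patches, dim); n_patches can vary.
        batch = patch_features.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(q, patch_features, patch_features)
        return out  # (batch, num_queries, dim)
```

Whether the encoder emits 256 or 1,024 patches, the LLM always receives `num_queries` visual tokens.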
Phi-3-Vision
Microsoft’s Phi-3-Vision (Abdin et al., May 2024) proved that you do not need a large model to achieve strong visual instruction following. Built on the 3.8B Phi-3-mini backbone, Phi-3-Vision was one of the first models to demonstrate that a well-curated small VLM could outperform much larger models that were trained on noisier data.
The key insight was data quality over model size. Phi-3-Vision was trained on a carefully filtered dataset where every instruction-response pair was vetted for accuracy, detail, and reasoning quality. The team found that removing the noisiest 30% of training data improved performance, even though the dataset got smaller. This confirmed something the LLaVA team had hinted at: for instruction tuning, 100K excellent examples beat 1M mediocre ones.
Data Quality & Curation
As the field matured, it became clear that data quality — not model architecture, not training tricks, not scale — was the primary determinant of VLM capability. Two datasets and one principle illustrate this.
ShareGPT4V
ShareGPT4V (Chen et al., November 2023) replaced the text-only GPT-4 pipeline with actual GPT-4V calls. Instead of generating descriptions from captions and bounding boxes, GPT-4V could see the image directly and produce conversations grounded in the actual visual content. The resulting 100K conversations were dramatically higher quality — fewer hallucinations, more precise spatial descriptions, better handling of text in images and subtle visual details.
ALLaVA
ALLaVA (Chen et al., 2024) pushed the data generation pipeline further by using GPT-4V with structured prompting templates that enforced format consistency, minimum detail thresholds, and explicit grounding requirements. Each response had to reference specific visual elements with their approximate locations. This structured approach reduced hallucination rates by roughly 40% compared to free-form GPT-4 generation.
Mixture Ratios
Through extensive ablation studies across multiple groups, a rough consensus emerged on effective data mixing for Stage 2 instruction tuning:
- ~50% academic task data (VQA, OCR, chart understanding) — provides precision and factual grounding
- ~30% conversation data (multi-turn visual chat) — provides fluency and instruction-following ability
- ~20% reasoning data (complex inference, chain-of-thought) — provides depth and analytical capability
Deviating significantly from these ratios causes measurable regressions. Too much academic data produces terse models that answer “yes” or “no” when asked for explanations. Too much conversation data produces models that are fluent but inaccurate. Too much reasoning data produces models that over-elaborate on simple questions.
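Building a mixture at these ratios is straightforward; the helper below is an illustrative sketch, not from any particular codebase:

```python
import random

MIX_RATIOS = {"academic": 0.5, "conversation": 0.3, "reasoning": 0.2}

def build_mixture(pools, ratios=MIX_RATIOS, total=100_000, seed=0):
    """Sample a Stage-2 instruction mixture at the consensus ratios.

    pools: dict mapping category name -> list of examples
    """
    rng = random.Random(seed)
    mixture = []
    for name, frac in ratios.items():
        # Sample with replacement so small pools can still fill their quota.
        mixture.extend(rng.choices(pools[name], k=int(total * frac)))
    rng.shuffle(mixture)
    return mixture
```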
Quality Filtering and Deduplication
Effective filtering removes approximately 15–30% of generated data. Common rejection criteria include:
- Hallucinated objects: The response mentions objects not present in the image (detectable by cross-referencing with source captions).
- Generic responses: Responses that could apply to any image (“This is an interesting image that shows various elements...”).
- Format violations: Responses that break the expected conversation structure or contain meta-commentary about being an AI.
- Near-duplicates: Multiple training examples from similar images that would bias the model toward specific visual patterns.
Semantic deduplication (using CLIP embeddings to identify visually similar images) is particularly important for web-scraped data, where the same image often appears hundreds of times with slightly different crops or compression artifacts.
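The greedy form of this deduplication can be sketched as follows (pure-Python cosine similarity for clarity; real pipelines batch this with matrix operations over CLIP embeddings):

```python
import math

def dedup_by_embedding(embeddings, threshold=0.95):
    """Keep an item only if its cosine similarity to every already-kept
    item is below `threshold`. Returns indices of kept items."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

Two near-identical embeddings collapse to one entry, while dissimilar ones survive.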
Evaluation Benchmarks
Visual instruction tuning needs evaluation benchmarks that test more than just classification accuracy. Here are the ones that matter, what they measure, and where saturation is occurring.
| Benchmark | What It Measures | Format | Score Range | Saturation? |
|---|---|---|---|---|
| VQAv2 | General visual question answering on natural images | Open-ended, soft accuracy | 0–100 | Yes. Top models >82. Human ~87. |
| MMMU | College-level multimodal reasoning (math, science, art, engineering) | Multiple choice | 0–100 | No. Best open ~48. GPT-4V ~56. |
| MMBench | Systematic VLM evaluation across 20 ability dimensions | Multiple choice with circular eval | 0–100 | Partially. Top open ~78. |
| POPE | Object hallucination detection (yes/no questions about object presence) | Binary (yes/no) | 0–100 (F1) | Partially. Best ~87. Random = 50. |
| HallusionBench | Advanced hallucination probing with adversarial image manipulations | Multiple choice / open | 0–100 | No. Best open ~47. |
| MMHal-Bench | Hallucination in free-form descriptions (GPT-4 judged) | Open-ended, LLM-as-judge | 0–6 (rating) | No. Best open ~3.4. |
Code Examples
Data Formatting for LLaVA Training
The training data must be formatted as a JSON list of conversations. Each item specifies an image path and a multi-turn conversation:
[
{
"id": "000000033471",
"image": "coco/train2017/000000033471.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the people doing in this image?"
},
{
"from": "gpt",
"value": "The image shows a group of people gathered around a dining table, enjoying a meal together. There are plates of food, glasses of wine, and animated conversation happening across the table."
},
{
"from": "human",
"value": "How many people are visible?"
},
{
"from": "gpt",
"value": "There are six people visible in the image — four seated at the table and two standing nearby, apparently engaged in conversation."
}
]
}
]
The <image> token appears exactly once in the first human turn.
During tokenization, it is expanded to the full sequence of image patch tokens.
Projection Layer Implementation
import torch
import torch.nn as nn
class LLaVAProjector(nn.Module):
"""
Maps vision encoder outputs to LLM embedding space.
LLaVA v1: single linear layer (W @ v)
LLaVA v1.5: two-layer MLP (W2 @ GELU(W1 @ v))
"""
def __init__(self, vision_dim=1024, llm_dim=4096, version="v1.5"):
super().__init__()
if version == "v1":
self.proj = nn.Linear(vision_dim, llm_dim)
elif version == "v1.5":
self.proj = nn.Sequential(
nn.Linear(vision_dim, llm_dim),
nn.GELU(),
nn.Linear(llm_dim, llm_dim),
)
def forward(self, image_features):
"""
Args:
image_features: (batch, num_patches, vision_dim)
e.g. (1, 576, 1024) for ViT-L/14@336px
Returns:
projected: (batch, num_patches, llm_dim)
e.g. (1, 576, 4096) — ready to insert into LLM
"""
return self.proj(image_features)
# Example: v1 projector has ~4M params, v1.5 has ~21M params
v1_proj = LLaVAProjector(version="v1")
v15_proj = LLaVAProjector(version="v1.5")
print(f"v1 params: {sum(p.numel() for p in v1_proj.parameters()):,}")
print(f"v1.5 params: {sum(p.numel() for p in v15_proj.parameters()):,}")
# v1 params: 4,198,400
# v1.5 params: 20,979,712
Conversation Template Processing
from dataclasses import dataclass
from typing import List, Dict
IMAGE_TOKEN = "<image>"
NUM_IMAGE_PATCHES = 576 # ViT-L/14@336px: 24x24 grid
SYSTEM_PROMPT = (
"A chat between a curious user and an artificial intelligence "
"assistant. The assistant gives helpful, detailed, and polite "
"answers to the user's questions."
)
@dataclass
class ConversationProcessor:
"""Processes multi-turn visual conversations for LLaVA training."""
tokenizer: object # HuggingFace tokenizer
def format_conversation(
self, conversations: List[Dict], image_features: object
) -> Dict:
"""
Format a conversation for training.
Returns dict with:
- input_ids: full token sequence
- labels: same length, with -100 for masked positions
- image_features: projected vision encoder outputs
"""
# Build the full text
text_parts = [f"SYSTEM: {SYSTEM_PROMPT}\n"]
label_mask = [False] # system prompt is masked
for turn in conversations:
role = turn["from"]
content = turn["value"]
if role == "human":
# Replace with placeholder tokens
if IMAGE_TOKEN in content:
content = content.replace(
IMAGE_TOKEN,
"<im_patch>" * NUM_IMAGE_PATCHES
)
text_parts.append(f"USER: {content}\n")
label_mask.append(False) # user turns: masked
else:
text_parts.append(f"ASSISTANT: {content}\n")
label_mask.append(True) # assistant turns: active
full_text = "".join(text_parts)
input_ids = self.tokenizer.encode(full_text)
# Build labels: -100 for masked tokens, input_ids for active
labels = self._build_labels(input_ids, text_parts, label_mask)
return {
"input_ids": input_ids,
"labels": labels,
"image_features": image_features,
}
    def _build_labels(self, input_ids, text_parts, active_flags):
        """
        Create label list. Tokens from masked parts get -100 (ignored by
        CrossEntropyLoss). Tokens from active (assistant) parts keep their
        IDs. Assumes per-part encoding matches the full-text encoding;
        production code tracks token offsets instead.
        """
        labels = []
        for part, active in zip(text_parts, active_flags):
            part_ids = self.tokenizer.encode(part, add_special_tokens=False)
            if active:
                labels.extend(part_ids)
            else:
                labels.extend([-100] * len(part_ids))
        return labels
Multi-Turn Inference
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor
class LLaVAInference:
"""Minimal multi-turn inference with a LLaVA-style model."""
def __init__(self, vision_encoder, projector, llm, tokenizer):
self.vision_encoder = vision_encoder.eval()
self.projector = projector.eval()
self.llm = llm.eval()
self.tokenizer = tokenizer
self.image_processor = CLIPImageProcessor.from_pretrained(
"openai/clip-vit-large-patch14-336"
)
self.conversation_history = []
self.image_features = None
@torch.no_grad()
def set_image(self, image_path: str):
"""Encode an image and cache the projected features."""
image = Image.open(image_path).convert("RGB")
pixel_values = self.image_processor(
image, return_tensors="pt"
).pixel_values.cuda()
# Extract CLIP features: (1, 576, 1024)
vision_output = self.vision_encoder(
pixel_values, output_hidden_states=True
)
# Use second-to-last layer (common practice)
features = vision_output.hidden_states[-2][:, 1:, :]
# Project to LLM space: (1, 576, 4096)
self.image_features = self.projector(features)
self.conversation_history = []
@torch.no_grad()
def chat(self, user_message: str, max_new_tokens: int = 512) -> str:
"""Send a message and get a response. Maintains history."""
self.conversation_history.append({
"role": "user", "content": user_message
})
# Build input sequence with image tokens + text
prompt = self._build_prompt()
input_ids = self.tokenizer.encode(prompt, return_tensors="pt").cuda()
# Insert image embeddings at the image token positions
input_embeds = self._merge_image_and_text(input_ids)
# Generate
outputs = self.llm.generate(
inputs_embeds=input_embeds,
max_new_tokens=max_new_tokens,
temperature=0.7,
do_sample=True,
)
response = self.tokenizer.decode(
outputs[0], skip_special_tokens=True
)
self.conversation_history.append({
"role": "assistant", "content": response
})
return response
def _build_prompt(self) -> str:
"""Build the full prompt including history."""
parts = [f"SYSTEM: {SYSTEM_PROMPT}\n"]
for i, turn in enumerate(self.conversation_history):
if turn["role"] == "user":
content = turn["content"]
if i == 0: # first user turn gets image tokens
content = IMAGE_TOKEN + "\n" + content
parts.append(f"USER: {content}\n")
else:
parts.append(f"ASSISTANT: {turn['content']}\n")
parts.append("ASSISTANT:")
return "".join(parts)
# Usage:
# model = LLaVAInference(vision_encoder, projector, llm, tokenizer)
# model.set_image("photo.jpg")
# print(model.chat("What do you see in this image?"))
# print(model.chat("Tell me more about the person on the left."))
Summary
Visual instruction tuning is the process that transforms a vision encoder + LLM assembly into a model that can actually converse about images. The key lessons from LLaVA and its descendants:
- Two stages are essential. Alignment first (cheap, fast, projection-only), then instruction tuning (expensive, full LLM). Skipping Stage 1 degrades everything.
- Data generation is the bottleneck, not architecture. The GPT-4 pipeline that generated 158K pairs for $500 was more important than any architectural decision.
- Freeze wisely. The vision encoder stays frozen because its features are already universal. The LLM gets unfrozen in Stage 2 because it needs to learn new reasoning patterns over visual inputs.
- Small improvements compound. MLP connector (+3 points), higher resolution (+2 points), better data mixing (+5 points). LLaVA-1.5 is just LLaVA with three careful changes.
- Resolution is destiny. LLaVA-NeXT’s AnyRes tiling was the single biggest capability jump in the lineage. More pixels, more knowledge.
- Hallucination is unsolved. Every model in this family still describes things that are not in the image. The benchmarks (POPE, HallusionBench) confirm it. Better data helps, but the problem is structural.
The next article in this series covers training pipelines and scaling — the multi-stage pre-training strategies, resolution scheduling, interleaved data formats, and scaling laws that take these ideas from lab experiments to production systems.
References
Seminal papers and key works referenced in this article.
- Liu et al. "Visual Instruction Tuning." NeurIPS, 2023. arXiv
- Liu et al. "Improved Baselines with Visual Instruction Tuning." CVPR, 2024. arXiv
- Liu et al. "LLaVA-NeXT: Improved reasoning, OCR, and world knowledge." 2024.
- Sharma et al. "Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset." ACL, 2018.
- Goyal et al. "Making the V in VQA Matter." CVPR, 2017. arXiv