Introduction
For most of their history, vision-language models existed as benchmark entries — numbers on leaderboards, claimed by papers that few practitioners could reproduce. That era ended around 2023. Today, VLMs process millions of documents per day in production, assist radiologists reading scans at 3 AM, and navigate computer interfaces on behalf of users who never asked what a "visual token" is.
This shift matters because it changes the questions we need to ask. When a model exists only on a benchmark, we care about accuracy. When it processes real patient data, we care about hallucination rates, failure modes, latency, regulatory compliance, and adversarial robustness. The gap between "works on the test set" and "works in production" is where most of the engineering lives.
This article surveys the major application domains where VLMs are deployed or actively researched, then confronts the central unsolved problem — hallucination — and closes with the frontier: unified multimodal models that blur the line between understanding and generation entirely. Throughout, we maintain the Feynman standard: if we cannot explain why a technique works mechanically, we flag it as empirical rather than dressing it up with post-hoc narratives.
We survey six application domains (documents, video, GUIs, medical imaging, OCR, and creative tools), then dedicate major sections to hallucination (the field's most urgent unsolved problem) and the path toward unified multimodal systems. Code examples show practical pipelines for document QA, video processing, GUI detection, and hallucination measurement.
Document Understanding
For decades, document processing meant OCR followed by rule-based extraction: Tesseract produces text, regular expressions find invoice numbers, layout heuristics identify tables. This pipeline is brittle. It fails on handwriting, degrades with rotation, cannot handle overlapping columns, and requires separate systems for each document type. VLMs offer a fundamentally different approach: feed the document image directly to the model and ask questions in natural language.
OCR-Free Document Analysis
The key insight behind OCR-free document understanding is that a sufficiently capable vision encoder can recognize text directly from pixels, without an explicit OCR stage. This is not merely a convenience — it eliminates an entire error-propagation pathway. When OCR misreads "l" as "1" or drops a comma, every downstream component inherits that error. When the VLM reads directly from the image, it has access to spatial context, font styling, and layout structure simultaneously.
Donut (Kim et al., 2022) was among the first to demonstrate this. The architecture is a Swin Transformer encoder feeding a BART decoder, trained end-to-end on document images without any OCR preprocessing. The model learns to read text, understand layout, and answer questions in a single forward pass. On receipt parsing, Donut matched or exceeded OCR-pipeline approaches while being simpler to deploy.
TextMonkey (Liu et al., 2024) extends this to dense text scenarios. The critical challenge is resolution: a standard 224×224 input cannot resolve small text in a full-page document. TextMonkey addresses this with a Shifted Window Attention mechanism that processes high-resolution inputs efficiently, combined with token resampling to keep the sequence length manageable. The model processes images at resolutions up to 896×896, enough to read 10-point body text in most documents.
DocOwl (Ye et al., 2023; DocOwl 1.5, Hu et al., 2024) takes a different approach: rather than building a specialized architecture, it fine-tunes a general-purpose VLM (based on mPLUG-Owl) on document-specific data. DocOwl 1.5 introduces a "Unified Structure Learning" scheme where the model first learns to parse document structure (headings, paragraphs, tables, captions) before learning to answer questions. This curriculum matters — without structural understanding, the model treats documents as bags of text fragments rather than spatially organized information.
Document Benchmarks
The standard benchmarks for document VLMs each stress different capabilities:
| Benchmark | Task | Key Challenge | Metric |
|---|---|---|---|
| DocVQA | Question answering on document images | Text recognition + spatial reasoning | ANLS (Avg. Normalized Levenshtein Similarity) |
| ChartQA | Question answering on charts and plots | Visual data extraction + numerical reasoning | Relaxed accuracy (5% tolerance) |
| InfoVQA | Infographic understanding | Complex layouts, multi-element reasoning | ANLS |
| TableVQA | Table structure recognition | Row/column alignment, spanning cells | TEDS (Tree Edit Distance Similarity) |
| MP-DocVQA | Multi-page document QA | Cross-page reasoning, long context | ANLS |
A critical detail: ANLS (Average Normalized Levenshtein Similarity) is used instead of exact match because document answers often involve OCR-like text where minor character differences should not count as complete failures. An answer of "$ 12,345.67" versus "$12,345.67" should score high even though exact string match gives 0.
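The metric is simple to compute. Below is a minimal sketch of ANLS for a single question, using a plain dynamic-programming Levenshtein distance; the 0.5 threshold and lowercasing follow the standard benchmark definition, and per-question scores are averaged over the dataset in the full protocol.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, ground_truths: list, threshold: float = 0.5) -> float:
    """Best normalized Levenshtein similarity over the reference answers,
    zeroed below the threshold so unrelated strings score 0."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.lower().strip(), gt.lower().strip()
        nl_dist = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl_dist)
    return best if best >= threshold else 0.0

print(anls("$ 12,345.67", ["$12,345.67"]))  # ~0.91: one extra space
print(anls("unrelated", ["$12,345.67"]))    # 0.0: below the 0.5 threshold
```

The extra space costs one edit out of eleven characters, so the answer scores roughly 0.91 instead of the 0 that exact match would give.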
High-Resolution Tiling
The fundamental tension in document VLMs is resolution versus compute. A standard A4 document scanned at 300 DPI produces a 2480×3508 image. A ViT-L with 14×14 patches processes 224×224 inputs, giving 16×16 = 256 visual tokens. At this resolution, 10-point text is unreadable — each character occupies fewer pixels than a single patch.
The solution is tiling: divide the high-resolution image into overlapping crops, encode each crop independently, then either (a) concatenate all crop tokens into a single sequence or (b) use a global thumbnail alongside the crops. InternVL, Qwen-VL, and LLaVA-NeXT all use variants of this approach. The trade-off is clear: a 4×4 tiling grid produces 16×256 = 4,096 visual tokens, making the LLM's context window the new bottleneck.
The math for optimal tiling is straightforward. Given an input image of size H×W and a vision encoder with input resolution R, the minimum number of tiles needed is ceil(H/R) × ceil(W/R):
# Minimum tiles to resolve text at font_size pixels tall
# Each tile has R pixels, each character needs ~font_size pixels
# So we need ceil(H/R) x ceil(W/R) tiles at minimum
import math
def compute_tiles(H, W, R=448, max_tiles=12):
"""Compute optimal tiling for document understanding."""
n_h = math.ceil(H / R)
n_w = math.ceil(W / R)
total = n_h * n_w
if total > max_tiles:
        # Scale the grid down proportionally to fit the tile budget
scale = math.sqrt(max_tiles / total)
n_h = max(1, round(n_h * scale))
n_w = max(1, round(n_w * scale))
return n_h, n_w
# A4 at 300 DPI
tiles_h, tiles_w = compute_tiles(3508, 2480)
print(f"Tiling: {tiles_h}x{tiles_w} = {tiles_h * tiles_w} tiles")
# Tiling: 4x3 = 12 tiles (scaled down from the 8x6 = 48 needed at full resolution)
Resolution requirements also vary by document type. Forms have large text fields with clear spatial separation — ideal for patch-level text recognition. Dense tables with 8-point text and thin cell borders require sub-patch resolution. This is why the best document VLMs still occasionally fall back to explicit OCR for table-heavy documents, using the VLM for semantic understanding and OCR for character-level accuracy.
Video VLMs
Extending VLMs from images to video introduces a new dimension — time — and with it, a combinatorial explosion in the number of visual tokens. A 30-second video at 30 FPS contains 900 frames. If each frame produces 256 tokens, that is 230,400 visual tokens — far exceeding any current LLM context window. Every video VLM is, at its core, a strategy for managing this explosion.
Temporal Modeling
Video-LLaVA (Lin et al., 2023) takes the simplest approach: uniformly sample N frames (typically 8), encode each independently with the image encoder, concatenate the tokens, and feed them to the LLM. The temporal reasoning happens entirely in the LLM's self-attention, which attends over interleaved frame tokens. This works surprisingly well for simple temporal questions ("What happened after the person picked up the cup?") because the LLM can compare frame representations directly.
LLaVA-NeXT-Video (Liu et al., 2024) improves on this with a crucial insight: not all frames are equally important. It uses AnyRes dynamic resolution to encode keyframes at high resolution while compressing non-keyframe intervals into fewer tokens. The resulting token sequence is both more informative and shorter.
More sophisticated approaches add explicit temporal attention. Video-ChatGPT (Maaz et al., 2023) pools frame features both spatially and temporally before projection, producing a compact representation that captures both what appears in each frame and how content changes across frames. The pooling happens in two stages:
# Video-ChatGPT temporal + spatial pooling
# Given frame features: (T, N, D) where T=frames, N=spatial tokens, D=dim
import torch

def video_chatgpt_pool(frame_features):
"""
frame_features: (T, N, D) - T frames, N spatial tokens, D dimensions
Returns: (N + T, D) - spatial-pooled + temporal-pooled
"""
T, N, D = frame_features.shape
# Temporal pooling: average across frames for each spatial position
# Captures "what is consistently present at each location"
temporal_pooled = frame_features.mean(dim=0) # (N, D)
# Spatial pooling: average across spatial positions for each frame
# Captures "what is the overall content of each frame"
spatial_pooled = frame_features.mean(dim=1) # (T, D)
# Concatenate: LLM gets both views
video_tokens = torch.cat([temporal_pooled, spatial_pooled], dim=0) # (N+T, D)
return video_tokens
Frame Sampling Strategies
The choice of which frames to process is not a detail — it is a fundamental design decision that determines what temporal relationships the model can even see. The main strategies are:
Uniform sampling selects frames at regular intervals. For a 900-frame video sampled to 8 frames, this takes every 112th frame. It guarantees coverage but can miss brief events entirely. A hand wave that lasts 10 frames (0.3 seconds) has a 90% chance of being skipped.
Keyframe extraction uses scene-change detection (typically frame-to-frame pixel difference or histogram divergence) to find transition points. This captures major events but produces redundant samples during slow-moving scenes.
Content-aware sampling uses CLIP or a lightweight classifier to score each frame's relevance to the query, then samples high-scoring frames. This is query-dependent, meaning the same video produces different frame sets for different questions. The cost is an additional forward pass through a scoring model for every frame.
Hierarchical sampling operates at multiple temporal scales: a sparse global sample (one frame per 5 seconds) captures the narrative, while a dense local sample around detected events captures details. This mirrors how humans watch video — scanning broadly, then focusing on interesting moments.
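The uniform and hierarchical strategies can be sketched in a few lines. In the hierarchical variant, the event indices are assumed to come from an upstream scene-change or relevance detector; here they are simply passed in.

```python
import numpy as np

def uniform_sample(n_frames, k):
    """k frame indices at a fixed stride (900 frames, k=8 -> every 112th)."""
    stride = n_frames // k
    return np.arange(k) * stride

def hierarchical_sample(n_frames, fps, events, global_stride_s=5.0,
                        local_radius=15, local_k=4):
    """Sparse global pass (one frame per global_stride_s seconds) plus a
    dense local pass of local_k frames around each detected event index."""
    global_idx = np.arange(0, n_frames, int(fps * global_stride_s))
    local_idx = np.concatenate([
        np.linspace(max(0, e - local_radius),
                    min(n_frames - 1, e + local_radius), local_k).round()
        for e in events
    ]) if events else np.array([])
    return np.unique(np.concatenate([global_idx, local_idx]).astype(int))

# 30-second clip at 30 FPS = 900 frames
print(uniform_sample(900, 8))                 # stride 112: frames 0, 112, ..., 784
print(hierarchical_sample(900, 30.0, [450]))  # global every 150 frames + dense around 450
```

Note how the hierarchical sample spends the same token budget unevenly: most indices cover the full timeline sparsely, while a handful cluster around the detected event.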
Video Benchmarks
Video understanding benchmarks test increasingly sophisticated temporal reasoning:
| Benchmark | Focus | Duration | Key Metric |
|---|---|---|---|
| ActivityNet-QA | Activity recognition, temporal localization | ~3 min avg | Accuracy + Score |
| Video-MME | Comprehensive multimodal evaluation | Short / Medium / Long | Multi-task accuracy |
| EgoSchema | Egocentric video understanding | 3 min clips | 5-way multiple choice |
| PerceptionTest | Temporal reasoning, tracking, physics | ~23 sec avg | Multi-task accuracy |
| LVBench | Long-form video comprehension | Up to 1 hour | QA accuracy |
The gap between short-clip and long-video performance is enormous. Models that score 70%+ on ActivityNet-QA (3-minute clips) may drop below 40% on hour-long videos in LVBench. The bottleneck is not temporal reasoning per se but information retention — the model simply cannot hold enough visual context to answer questions about events separated by 30 minutes of footage.
GUI Agents
Perhaps the most immediately practical application of VLMs is using computer interfaces on behalf of humans. A GUI agent must understand screenshots at pixel level — identifying buttons, text fields, menus, icons — then predict the sequence of actions (click, type, scroll) needed to accomplish a goal. This requires a combination of visual grounding, spatial reasoning, and planning that pushes VLMs to their limits.
Screen Understanding
CogAgent (Hong et al., 2023) is a dedicated GUI-understanding VLM built on CogVLM with a critical architectural addition: a high-resolution cross-attention module that processes screenshots at 1120×1120. The standard resolution (224×224 or 448×448) cannot distinguish a clickable button from a text label — both are blobs of similar color at low resolution. At 1120×1120, individual UI elements become clearly resolvable, and the model can read button text, interpret icons, and detect the fine borders that separate interactive from non-interactive elements.
Ferret-UI (You et al., 2024) addresses a subtler problem: in mobile interfaces, the same visual pattern (a rounded rectangle with text) can be a button, a card, a text input, or a navigation tab depending on context. Ferret-UI adds "referring and grounding" capabilities — the model can point to specific UI elements in response to descriptions and describe specific elements given their coordinates. This bidirectional grounding is essential for reliable GUI interaction.
SeeClick (Cheng et al., 2024) tackles the core action prediction problem: given a screenshot and an instruction, predict where to click. The model outputs normalized (x, y) coordinates, trained on millions of web browsing demonstrations. The key finding is that click accuracy is highly sensitive to resolution — doubling the input resolution from 448 to 896 improves click accuracy by 15-20% on average, because the model can more precisely locate small interactive targets like checkboxes and dropdown arrows.
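Scoring such coordinate outputs is straightforward: a predicted click counts as correct if it lands inside the target element's bounding box, which is the usual convention in GUI grounding evaluation. The numbers below are illustrative.

```python
def click_hit(pred_xy, bbox, img_w, img_h):
    """True if a normalized (x, y) prediction lands inside the target
    element's pixel bounding box (left, top, right, bottom)."""
    px, py = pred_xy[0] * img_w, pred_xy[1] * img_h
    left, top, right, bottom = bbox
    return left <= px <= right and top <= py <= bottom

# A 16x16-pixel checkbox at (100, 200) on a 1920x1080 screenshot:
# the target spans less than 1% of each axis, so small coordinate
# errors flip a hit into a miss -- hence the sensitivity to resolution.
print(click_hit((0.056, 0.192), (100, 200, 116, 216), 1920, 1080))  # True
print(click_hit((0.5, 0.5), (100, 200, 116, 216), 1920, 1080))      # False
```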
Action Prediction
GUI agents must predict not just single actions but action sequences. The typical formulation is as a conversation: the model receives a goal ("Book a flight from SFO to JFK on March 15"), observes the current screen state, predicts the next action, then receives the updated screen and continues. This is fundamentally an MDP (Markov Decision Process) where:
- State = current screenshot + action history + goal
- Action space = {click(x,y), type(text), scroll(direction), press(key), wait}
- Transition = execute action, capture new screenshot
- Reward = task completion (sparse, delayed)
The challenge is that this MDP has enormous action spaces (any pixel could be a click target, any string could be typed) and long horizons (booking a flight may take 20+ actions). Current GUI agents are trained primarily via behavioral cloning on human demonstrations, with some work on online RL for refinement. The error-compounding problem is severe: a single wrong click in a 20-step sequence can leave the agent on a completely wrong page with no way to recover except starting over.
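A minimal rollout loop for this MDP might look as follows. `policy` stands in for the VLM and `env_step` for the screenshot-capturing executor; both are hypothetical hooks, and the action types mirror the action space listed above.

```python
from dataclasses import dataclass, field
from typing import Union

# The action space sketched above: any pixel is a click target,
# any string can be typed.
@dataclass
class Click:
    x: float  # normalized [0, 1] screen coordinates
    y: float

@dataclass
class Type:
    text: str

@dataclass
class Scroll:
    direction: str  # "up" | "down"

@dataclass
class Done:
    pass

Action = Union[Click, Type, Scroll, Done]

@dataclass
class State:
    goal: str
    screenshot: bytes  # current screen pixels
    history: list = field(default_factory=list)

def run_episode(policy, env_step, state, max_steps=20):
    """Observe, predict, execute, repeat until Done or the step budget
    runs out. In behavioral cloning, `policy` is the trained VLM."""
    for _ in range(max_steps):
        action = policy(state)
        state.history.append(action)
        if isinstance(action, Done):
            break
        state.screenshot = env_step(action)  # transition: new screenshot
    return state

# A stub policy that clicks once, then declares the task done
final = run_episode(
    policy=lambda s: Done() if s.history else Click(0.47, 0.88),
    env_step=lambda a: b"<next screenshot>",
    state=State(goal="Book a flight from SFO to JFK", screenshot=b"<pixels>"),
)
print(len(final.history))  # 2
```

The sparse, delayed reward never appears in this loop; at training time it only enters through which demonstrations are collected, which is precisely why error compounding is so hard to correct.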
Evaluating GUI agents is uniquely difficult because the environment is non-stationary. A web page may change between evaluation runs due to A/B tests, content updates, or personalization. The standard approach — WebArena (Zhou et al., 2023) — uses self-hosted web applications with fixed content, but this sacrifices ecological validity. Production web pages are messier, slower, and more adversarial than any benchmark environment.
Medical Imaging
Medical imaging is simultaneously the highest-stakes and most heavily regulated application domain for VLMs. A model that misreads a chest X-ray is not just wrong — it may delay treatment for a patient with lung cancer. The technical and regulatory challenges are distinct from general VLM research, and solutions that work on natural images frequently fail on medical data.
Clinical VLMs
BiomedCLIP (Zhang et al., 2023) adapts the CLIP contrastive learning framework to biomedical data, training on PMC-15M, a dataset of 15 million image-text pairs extracted from PubMed Central articles. The critical difference from standard CLIP is domain-specific: medical images have fundamentally different statistics than natural images. A chest X-ray is grayscale, low-contrast, with pathology visible as subtle density changes. ImageNet pretraining, which learns to distinguish dogs from cats using color and texture, provides weak initialization for distinguishing pneumonia from normal lung parenchyma.
LLaVA-Med (Li et al., 2023) follows the LLaVA architecture but with a biomedical training curriculum. Stage 1: align a biomedical vision encoder to a language model using 600K biomedical image-text pairs from PubMed. Stage 2: instruction-tune on 60K biomedical conversations generated by GPT-4 from image captions. The two-stage approach is necessary because medical visual features and medical language are both specialized — you need to align vision features to medical concepts (Stage 1) and then learn to converse about them (Stage 2).
Med-PaLM M (Tu et al., 2024) takes the most ambitious approach: a single model that handles multiple medical tasks across multiple modalities. Built on PaLM-E, it processes chest X-rays, dermatology photographs, pathology slides, and genomics data using task-specific encoders feeding a unified language model. The key result is that a single generalist model can approach (though not yet match) specialist models across all tasks simultaneously, suggesting that medical knowledge transfers across imaging modalities.
The performance gap between medical VLMs and radiologists is instructive. On chest X-ray report generation (MIMIC-CXR benchmark), the best models achieve clinical accuracy rates of 60-70% for finding-level descriptions. Expert radiologists score 85-95%. The gap is largest for rare findings and subtle abnormalities — exactly the cases where automated assistance would be most valuable.
Regulatory Landscape
The FDA regulates AI/ML medical devices through three pathways, and VLMs face unique challenges in each:
510(k) clearance requires demonstrating "substantial equivalence" to a legally marketed predicate device. Most current AI radiology tools use this pathway. For VLMs, the challenge is that there may be no predicate — a model that generates free-text radiology reports has different risks than a model that outputs a binary classification.
De Novo classification is used for novel devices without predicates. This requires more extensive validation, including pre-specified clinical studies. The VLM's generative nature creates a problem: the output space is effectively infinite, making exhaustive testing impossible. How do you validate that a model will never generate a dangerously wrong report across all possible input images?
Predetermined change control plans (PCCPs) allow manufacturers to specify in advance how a model will be updated post-market. For VLMs that are continuously updated (as most LLMs are), this is essential. The FDA's 2023 guidance explicitly addresses "locked" versus "adaptive" algorithms, with adaptive algorithms requiring more rigorous monitoring.
General VLMs are trained on internet data where "lung opacity" might appear in a photography forum discussing overexposed images. Medical VLMs must learn that "lung opacity" in a radiology context means potential pathology requiring clinical correlation. This semantic collision means that simply fine-tuning GPT-4V on medical data can produce models that are confidently wrong in domain-specific ways — using the right vocabulary with the wrong clinical meaning.
OCR and Structured Extraction
Optical character recognition is one of the oldest problems in computer vision, but VLMs are redefining what "OCR" means. Traditional OCR outputs characters. VLM-based OCR outputs structured understanding — not just the text "Total: $42.50" but the semantic role of that text within the document's information hierarchy.
Text in the wild presents challenges that document OCR does not: perspective distortion, variable lighting, artistic fonts, partial occlusion, and backgrounds that interfere with text segmentation. Street signs at oblique angles, handwritten notes on whiteboards, and text printed on curved surfaces all require the model to simultaneously perform geometric reasoning and character recognition.
GOT (General OCR Theory) (Wei et al., 2024) proposes a unified framework that reframes OCR as a vision-language generation task. Instead of building separate pipelines for text detection, recognition, and layout analysis, GOT encodes the entire image and generates structured output (including text content, bounding boxes, and reading order) autoregressively. The model is trained to output formatted text in various structured formats — markdown, LaTeX, HTML — depending on the input document type.
The practical applications span:
- Receipt parsing — extracting line items, totals, vendor names, dates. The challenge is not reading the text (modern VLMs do this well) but understanding the layout: which number is the total, which is the tax, which is a line item price.
- License plate reading — requires reading text from moving, partially occluded, variably-lit sources at angles. Specialized models still dominate here due to latency requirements (sub-100ms), but VLMs are competitive on accuracy when latency is not a constraint.
- Scene text understanding — reading and interpreting text in natural scenes. "What does the sign say?" is easy. "Is this restaurant open?" (requiring reading hours on a door sign and comparing to current time) is where VLMs add value over raw OCR.
- Handwriting recognition — historically a separate subfield, now subsumed by VLMs. Models like GPT-4V can read handwritten notes with accuracy approaching dedicated HTR (Handwritten Text Recognition) systems, especially for common scripts.
End-to-end versus pipeline approaches remain a live debate. End-to-end models (a single VLM doing everything) are simpler to deploy and can use global context. Pipeline approaches (OCR → layout analysis → extraction → LLM reasoning) allow each stage to be optimized independently and provide interpretable intermediate outputs. In production, hybrid approaches are common: use a fast OCR engine for text extraction, then feed the extracted text plus the original image to a VLM for semantic understanding.
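The hybrid pattern can be sketched with the OCR engine and VLM injected as callables. The stubs below stand in for real components (e.g., Tesseract for OCR and any chat-style VLM client); the prompt format and JSON schema are illustrative assumptions, not a specific product's API.

```python
import json

def hybrid_extract(image, ocr_engine, vlm):
    """Hybrid receipt pipeline: a fast OCR engine supplies character-accurate
    text, and the VLM resolves semantic roles (which number is the total
    vs. the tax). Both components are injected callables."""
    raw_text = ocr_engine(image)
    prompt = (
        "You are given a receipt image and its OCR transcript.\n"
        f"OCR transcript:\n{raw_text}\n"
        "Return JSON with keys: vendor, line_items, tax, total. "
        "Prefer the transcript for exact digits; use the image for layout."
    )
    return json.loads(vlm(image, prompt))

# Stubs standing in for a real OCR engine and VLM client
fake_ocr = lambda img: "ACME STORE\n2x Widget $10.00\nTAX $0.80\nTOTAL $10.80"
fake_vlm = lambda img, prompt: json.dumps({
    "vendor": "ACME STORE",
    "line_items": [{"name": "Widget", "qty": 2, "price": 10.00}],
    "tax": 0.80,
    "total": 10.80,
})

result = hybrid_extract(None, fake_ocr, fake_vlm)
print(result["total"])  # 10.8
```

The division of labor is the point: the OCR engine guarantees the digits are right, and the VLM only has to decide what each string means.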
Creative and Design Applications
VLMs are finding unexpected applications in creative domains, where "understanding" an image means something different than in document analysis or medical imaging. Here, the model must grasp aesthetic properties, cultural context, humor, and intent.
Image editing guided by language uses VLMs as the "understanding" component in editing pipelines. InstructPix2Pix (Brooks et al., 2023) showed that language-guided editing is possible, but VLMs add a crucial capability: they can critique and refine edits. A pipeline might generate an edit, then ask the VLM "Does this look natural?" or "Did the edit change only what was requested?" This VLM-in-the-loop approach catches artifacts that the editing model alone would miss.
Visual design critique applies VLMs to user interface evaluation: identifying alignment issues, color contrast violations (WCAG accessibility standards), inconsistent spacing, and typographic hierarchy problems. The model receives a screenshot and produces structured feedback. This is an active area where VLMs are genuinely useful because they combine pixel-level visual analysis with knowledge of design principles absorbed from training data.
Meme understanding is a surprisingly difficult benchmark for VLMs because memes require understanding visual-textual incongruity, cultural references, sarcasm, and in-group signaling simultaneously. The "Hateful Memes" benchmark (Kiela et al., 2020) showed that even strong VLMs struggle to detect when an image-text combination is hateful versus benign, because the hateful meaning emerges from the combination rather than either modality alone. A picture of a person is benign. A caption is benign. Together, they may be hateful. This compositionality challenge remains largely unsolved.
Fashion analysis uses VLMs for outfit recommendation, style classification, trend detection, and virtual try-on assistance. The commercial applications are substantial: e-commerce platforms use VLMs to generate product descriptions, match customer preferences to items, and provide styling advice. The technical challenge is attribute-level precision — the model must distinguish "silk" from "satin," "A-line" from "fit-and-flare," and "burgundy" from "maroon" at a level of detail that most general VLMs lack without domain-specific fine-tuning.
Hallucination
Hallucination is the central unsolved problem in VLMs. Not "one of" the central problems — the central problem. A model that hallucinates cannot be trusted in any application where correctness matters, which includes every application discussed in this article. Until hallucination is solved (or at least controlled to known, bounded rates), VLMs remain tools that require human oversight, not autonomous systems.
The term "hallucination" in the VLM context means generating content that is not grounded in the visual input. The model claims to see objects that are not present, attributes properties that are incorrect, or describes relationships that do not exist. This is distinct from (but related to) language model hallucination, because the VLM has access to ground truth — the image — and still gets it wrong.
Types of Hallucination
The taxonomy of VLM hallucinations has three major categories, each with different causes and different implications:
Object hallucination is the most studied type. The model invents objects that are not present. This is strongly correlated with language prior bias: dogs frequently co-occur with parks in the training data, so the model's language prior overwhelms the visual evidence (or lack thereof). CHAIR (Caption Hallucination Assessment with Image Relevance) measures this directly by comparing mentioned objects to ground-truth object annotations.
Attribute hallucination involves correct object identification with incorrect properties. The model sees the car but gets the color, shape, or type wrong. This is mechanistically interesting because it suggests the visual features successfully activated the "car" concept but failed to bind the correct attributes to it. This is related to the "binding problem" in cognitive science — how does a system associate properties with specific objects when multiple objects are present?
Relation hallucination involves incorrect spatial, temporal, or functional relationships between correctly identified objects. The model sees both the cat and the bowl but invents a relationship (eating from, on the table) that does not exist. This is the hardest type to measure and mitigate because relationships are inherently more ambiguous than objects or attributes.
Measurement
You cannot fix what you cannot measure, and hallucination measurement is harder than it appears. The two most widely used metrics approach the problem differently:
POPE (Polling-based Object Probing Evaluation) (Li et al., 2023) converts hallucination detection into a binary classification problem. The model is asked "Is there a [object] in the image?" for both present and absent objects. The key insight is the sampling strategy for negative objects: random sampling (randomly chosen absent objects) is easy, popular sampling (frequently co-occurring but absent objects) is hard, and adversarial sampling (objects similar to present ones) is hardest. A model scoring 90% on random POPE may score only 65% on adversarial POPE.
# POPE evaluation: three difficulty levels
# For an image containing {cat, chair, table}
# Random negative sampling: choose random absent objects
random_negatives = ["airplane", "mountain", "guitar"]
# Easy to reject - no semantic connection to image content
# Popular negative sampling: choose frequently co-occurring absent objects
popular_negatives = ["dog", "couch", "television"]
# Harder - these objects plausibly appear in indoor scenes
# Adversarial negative sampling: choose visually/semantically similar absent objects
adversarial_negatives = ["kitten", "stool", "desk"]
# Hardest - these are near-synonyms or visually similar to present objects
CHAIR (Caption Hallucination Assessment with Image Relevance) (Rohrbach et al., 2018) measures hallucination in free-form captions. CHAIRs, the sentence-level variant, counts the fraction of generated captions that mention at least one hallucinated object. CHAIRi, the instance-level variant, counts the fraction of mentioned objects that are hallucinated. The ground truth comes from COCO object annotations. CHAIR's limitation is that it only measures object-level hallucination using a fixed object vocabulary; attribute and relation hallucinations are invisible to it.

# CHAIR metric computation for a single caption
def compute_chair(generated_caption, ground_truth_objects, coco_synonyms):
    """
    CHAIR_i: fraction of mentioned object instances that are hallucinated
    CHAIR_s: 1 if the caption mentions any hallucinated object, else 0
    (averaging CHAIR_s over a dataset gives the fraction of captions
    that contain at least one hallucination)
    """
    # Extract mentioned objects using the COCO vocabulary; a simple
    # substring match stands in for the paper's synonym lookup
    caption = generated_caption.lower()
    mentioned = [obj for obj, synonyms in coco_synonyms.items()
                 if any(s in caption for s in synonyms)]
    # Count hallucinations: mentioned but not annotated in the image
    hallucinated = [obj for obj in mentioned if obj not in ground_truth_objects]
    chair_i = len(hallucinated) / max(len(mentioned), 1)
    chair_s = 1 if hallucinated else 0
    return chair_i, chair_s

# Typical dataset-level results (lower is better):
# LLaVA-1.5:    CHAIR_s = 0.46, CHAIR_i = 0.15
# LLaVA + RLHF: CHAIR_s = 0.28, CHAIR_i = 0.08
# GPT-4V:       CHAIR_s = 0.18, CHAIR_i = 0.05
Mitigation
Mitigating hallucination is an active research area with no silver bullet. The main approaches attack different root causes:
RLHF with hallucination-specific feedback (Sun et al., 2023) trains a reward model that penalizes hallucinated content. The reward model is trained on human annotations where annotators flag hallucinated objects, attributes, and relations. During PPO training, the VLM learns to generate descriptions that the reward model scores highly. This reduces CHAIR scores by 30-50% but does not eliminate hallucination entirely, and can cause the model to become overly conservative — refusing to describe ambiguous content rather than risking hallucination.
Contrastive decoding (Leng et al., 2024; VCD) exploits a clever observation: hallucinations come from the language prior, while accurate descriptions come from the visual input. VCD runs two forward passes — one with the image (the "faithful" distribution) and one without (the "hallucination" distribution) — then subtracts the hallucination distribution from the faithful one at each token generation step:
# Visual Contrastive Decoding (VCD)
# p_faithful = P(token | image, text_so_far) -- with image
# p_hallucinate = P(token | noise, text_so_far) -- without image (or with noise)
import torch

def vcd_logits(logits_with_image, logits_without_image, alpha=1.0, beta=0.1):
"""
Subtract language prior to amplify visually-grounded tokens.
alpha: contrastive strength (higher = more suppression of hallucination)
beta: adaptive plausibility threshold
"""
# Adaptive plausibility constraint: only consider tokens that are
# at least somewhat likely given the image
cutoff = torch.log(torch.tensor(beta)) + logits_with_image.max()
mask = logits_with_image < cutoff
# Contrastive logits: amplify tokens that are MORE likely with the image
# than without it
cd_logits = (1 + alpha) * logits_with_image - alpha * logits_without_image
cd_logits[mask] = float('-inf')
return cd_logits
Visual grounding approaches force the model to point to image regions that support its claims. If the model says "there is a dog," it must also indicate where in the image the dog appears. This makes hallucination costly: the model cannot invent a dog without also inventing a plausible location, which is harder. Grounding also enables post-hoc verification — if the indicated region does not contain a dog-like object, the claim can be automatically flagged.
Inference-time interventions include chain-of-thought prompting ("First describe what you see in detail, then answer the question"), self-consistency checking (generate multiple answers and flag disagreements), and attention visualization (check whether the model's attention aligns with its claims). These are practical because they require no retraining, but they increase latency and are not foolproof.
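Self-consistency checking is easy to sketch: sample the same question several times and flag low agreement. The `ask_vlm` client below is a hypothetical hook assumed to sample at non-zero temperature; the agreement threshold is an illustrative choice.

```python
from collections import Counter
from itertools import cycle

def self_consistency_flags(ask_vlm, image, question,
                           n_samples=5, min_agreement=0.6):
    """Ask the same question n_samples times and flag answers whose
    agreement falls below min_agreement -- a cheap hallucination tripwire
    that trades latency (n extra calls) for a confidence signal."""
    answers = [ask_vlm(image, question).strip().lower()
               for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / n_samples
    return {"answer": top_answer, "agreement": agreement,
            "flagged": agreement < min_agreement}

# Stub client that answers "a dog" on 3 of 5 samples
stub_answers = cycle(["a dog", "a dog", "a cat", "a dog", "a fox"])
result = self_consistency_flags(lambda img, q: next(stub_answers), None,
                                "What animal is in the image?")
print(result)  # {'answer': 'a dog', 'agreement': 0.6, 'flagged': False}
```

Disagreement does not prove hallucination (the question may be genuinely ambiguous), but high agreement on a hallucinated answer is the failure mode this check cannot catch — which is why it complements rather than replaces grounding.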
The Path to Unified Multimodal
The VLMs discussed throughout this series are fundamentally understanding models: they take images in and produce text out. The next frontier is models that both understand and generate across modalities — reading an image, writing about it, producing a new image in response, narrating a video, and generating speech, all within a single architecture and a single set of weights.
Gemini (Google DeepMind, 2023) was trained natively on interleaved multimodal data from the start — text, images, audio, and video were all present during pretraining, not bolted on afterward. This "native multimodal" approach avoids the modality gap that plagues bridge-based architectures (like LLaVA's linear projection from CLIP to LLM). In a native multimodal model, the shared representation space is learned jointly across all modalities from the beginning, so visual and textual representations are naturally aligned rather than forcibly mapped.
GPT-4o (OpenAI, 2024) pushes further into any-to-any generation: the model can accept text, images, and audio as input and produce text, images, and audio as output. Mechanistically, this requires the model to operate simultaneously as an encoder and decoder for multiple modalities. The architecture details are not public, but the likely approach involves modality-specific tokenizers (converting images to discrete tokens via VQ-VAE or similar, audio to speech tokens via a codec) feeding a single transformer that generates tokens in all modality vocabularies.
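One way to make "generates tokens in all modality vocabularies" mechanically concrete is to concatenate the modality-local vocabularies with fixed offsets, so a single softmax spans all of them and each generated id routes to the right decoder. This is an illustrative sketch with made-up sizes, not GPT-4o's actual (unpublished) configuration:

```python
TEXT_VOCAB = 32_000   # e.g. a BPE text vocabulary
IMAGE_VOCAB = 8_192   # e.g. a VQ-VAE codebook
AUDIO_VOCAB = 4_096   # e.g. neural codec tokens

IMAGE_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_VOCAB
TOTAL_VOCAB = TEXT_VOCAB + IMAGE_VOCAB + AUDIO_VOCAB

def to_unified(token_id, modality):
    """Map a modality-local token id into the shared vocabulary."""
    offset = {"text": 0, "image": IMAGE_OFFSET, "audio": AUDIO_OFFSET}[modality]
    return offset + token_id

def modality_of(unified_id):
    """Decide which decoder should consume a generated token."""
    if unified_id < IMAGE_OFFSET:
        return "text"
    if unified_id < AUDIO_OFFSET:
        return "image"
    return "audio"

print(modality_of(to_unified(17, "image")))  # image
```

A single transformer trained over this unified id space can, in principle, emit a run of image tokens mid-conversation and hand them to the VQ decoder for rendering.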
Interleaved understanding and generation is the key capability that distinguishes these models from earlier work. A conversation might flow: User sends image → model describes it in text → user asks for a modification → model generates a new image → user asks about differences → model explains in text. This requires the model to maintain coherent state across modality transitions, which is architecturally non-trivial: the representation of "the image I just generated" must be compatible with the representation of "the image the user just sent" for the model to reason about both.
The technical challenges on the path to truly unified multimodal models include:
- Video generation — generating coherent video requires temporal consistency over hundreds of frames. Current video generation models (Sora, Runway Gen-3) are separate systems. Integrating video generation into a conversational VLM requires solving the token budget problem: even a 10-second video at modest resolution may require generating millions of tokens.
- 3D understanding — reasoning about 3D structure from 2D images. Current VLMs have surprisingly poor 3D understanding: they cannot reliably judge which object is in front of another, estimate relative sizes, or reason about occlusion. This matters for robotics, AR/VR, and spatial reasoning tasks.
- Audio-visual reasoning — jointly understanding what is seen and heard. "What instrument is making that sound?" requires the model to associate a visual object (a violin) with an auditory signal (a sustained note). Current multimodal models process audio and video largely independently.
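The token-budget problem in the video bullet above is worth quantifying. The per-frame figures below are illustrative ViT-style token counts (assumptions, not measurements from any particular model):

```python
def video_token_budget(seconds, fps=30, tokens_per_frame=256):
    """Naive token count for a video with no temporal compression."""
    return int(seconds * fps * tokens_per_frame)

# 10 s of 30 FPS video at a modest 256 tokens/frame:
print(video_token_budget(10))                          # 76,800
# At high resolution (e.g. 4096 tokens/frame) the count passes a million:
print(video_token_budget(10, tokens_per_frame=4096))   # 1,228,800
```

Generating (not just ingesting) that many tokens autoregressively is why current video generators use heavy spatio-temporal compression and remain separate systems.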
Open Problems
The field of VLMs is mature enough to have a clear map of what remains unsolved. These are not incremental improvements but fundamental capabilities that current models lack:
Long-context visual reasoning. Current models handle 1-10 images effectively. Tasks requiring reasoning over 100+ images — analyzing a photo album chronologically, reviewing a full building blueprint set, or tracking changes across monthly satellite images — exceed current capabilities. The bottleneck is not just context length but the ability to maintain and query visual memories over extended sequences.
Compositional understanding. Given "a blue cube to the left of a red sphere, with a green cylinder behind both," humans form a precise 3D scene representation. VLMs fail systematically on compositional spatial reasoning (Thrush et al., 2022; Winoground), particularly when multiple objects share attributes or when spatial relations are non-canonical. The Winoground benchmark — where a model must match two captions to two images that differ only in object-attribute binding — remains below 20% for most open-source VLMs on the group score, the hardest formulation (random chance: 16.7%).
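Winoground's three scores have precise definitions worth writing out. With captions C0, C1 and images I0, I1 (the correct pairing is Ck with Ik), and `s[c][i]` a model's image-text matching score:

```python
def winoground_scores(s):
    """s[c][i] = score for caption c with image i; correct pairing is c == i."""
    # Text score: for each image, the matching caption must score higher
    text = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # Image score: for each caption, the matching image must score higher
    image = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # Group score: both must hold simultaneously (the hardest formulation)
    group = text and image
    return {"text": text, "image": image, "group": group}

# A model that binds attributes correctly:
print(winoground_scores([[0.9, 0.2], [0.1, 0.8]]))
# → {'text': True, 'image': True, 'group': True}
```

Random scores pass the text and image criteria 25% of the time each, but the group criterion only 1/6 of the time, which is why group score is the headline number.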
Physical intuition. "Will the stack of books fall?" "Which cup will overflow first?" These questions require understanding of gravity, liquid dynamics, support relationships, and material properties. Current VLMs have weak physical reasoning because their training data is overwhelmingly static images with text descriptions, not interactive physics simulations. Benchmarks like IntPhys and CLEVRER test this, and VLMs score only slightly above chance on physical prediction tasks.
Causal reasoning from images. "Why is the road wet?" (it rained). "Why is the person running?" (they are late, or exercising, or fleeing). Causal reasoning requires the model to infer processes that produced the observed state, which goes beyond pattern recognition into counterfactual reasoning. Visual Commonsense Reasoning (VCR) tests this, but the questions are limited to multiple-choice format, which allows models to exploit linguistic shortcuts.
Robustness to adversarial inputs. Simple perturbations — adding imperceptible noise, rotating the image slightly, changing the JPEG compression level — can cause dramatic changes in VLM outputs. More concerning are semantic adversarial attacks: adding misleading text overlay to an image ("This image shows a dog" on a picture of a cat) can cause VLMs to override their visual understanding in favor of the text, because the language prior treats image-embedded text as authoritative.
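A minimal robustness probe along these lines: re-encode the image at several JPEG quality levels and check whether the answer survives. The `vlm_fn(image, question) -> str` callable is a hypothetical stand-in for any VLM interface:

```python
import io
from PIL import Image

def jpeg_perturb(image, quality=30):
    """Re-encode at a given JPEG quality — semantics-preserving for humans."""
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def robustness_check(vlm_fn, image, question, qualities=(95, 50, 10)):
    """Flag answers that flip under compression-level changes."""
    answers = {q: vlm_fn(jpeg_perturb(image, q), question) for q in qualities}
    stable = len(set(answers.values())) == 1
    return {"answers": answers, "stable": stable}

img = Image.new("RGB", (64, 64), "red")
result = robustness_check(lambda im, q: "a red square", img, "What is shown?")
print(result["stable"])  # True for this constant stub; real VLMs often flip
```

The same harness extends to rotations or text overlays by swapping the perturbation function; an answer that changes under any of them should not be trusted.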
Efficient deployment. A production VLM system processing 1000 document pages per second requires hardware costing hundreds of thousands of dollars. The visual encoder, LLM, and projection layers all require GPU memory and compute. Techniques like visual token pruning, early exit, speculative decoding, and quantization can reduce costs, but the field lacks systematic frameworks for trading off accuracy against latency and cost at deployment time.
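Of these levers, visual token pruning is the simplest to sketch: drop the least informative visual tokens before they enter the LLM, cutting prefill cost roughly in proportion. Published methods typically score tokens by attention received from a summary token; the sketch below uses token norm as a crude stand-in for saliency:

```python
import torch

def prune_visual_tokens(tokens, keep_ratio=0.25):
    """Keep the top-k visual tokens by L2 norm (a rough saliency proxy).
    tokens: (batch, n_tokens, dim) output of the visual encoder.
    """
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    scores = tokens.norm(dim=-1)                           # (b, n)
    # Sort kept indices so the tokens stay in spatial order
    topk = scores.topk(k, dim=-1).indices.sort(-1).values  # (b, k)
    return torch.gather(tokens, 1, topk.unsqueeze(-1).expand(b, k, d))

# e.g. CLIP ViT-L/14 at 336px produces 576 tokens; keep a quarter of them
pruned = prune_visual_tokens(torch.randn(1, 576, 1024))
print(pruned.shape)  # torch.Size([1, 144, 1024])
```

The accuracy/cost trade-off lives in `keep_ratio`, which is exactly the kind of deployment-time knob the field lacks systematic frameworks for tuning.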
Code Examples
These examples demonstrate practical pipelines for the application domains discussed above. Each is self-contained and uses widely available libraries.
Document QA Pipeline
A pipeline for answering questions about document images. A tiling helper is included for high-resolution documents; the simple `ask` path below sends the full image directly:
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
import math
class DocumentQA:
"""OCR-free document question answering with high-resolution tiling."""
def __init__(self, model_name="microsoft/Florence-2-large", device="cuda"):
self.processor = AutoProcessor.from_pretrained(
model_name, trust_remote_code=True
)
self.model = AutoModelForCausalLM.from_pretrained(
model_name, trust_remote_code=True,
torch_dtype=torch.float16
).to(device)
self.device = device
def tile_image(self, image, tile_size=448, max_tiles=12):
"""Split high-res image into overlapping tiles + global thumbnail."""
W, H = image.size
n_w = math.ceil(W / tile_size)
n_h = math.ceil(H / tile_size)
# Cap tile count
if n_w * n_h > max_tiles:
scale = math.sqrt(max_tiles / (n_w * n_h))
n_w = max(1, round(n_w * scale))
n_h = max(1, round(n_h * scale))
tiles = []
tile_w = W // n_w
tile_h = H // n_h
for i in range(n_h):
for j in range(n_w):
x0 = j * tile_w
y0 = i * tile_h
x1 = min(x0 + tile_w + tile_w // 4, W) # 25% overlap
y1 = min(y0 + tile_h + tile_h // 4, H)
tile = image.crop((x0, y0, x1, y1))
tiles.append(tile.resize((tile_size, tile_size)))
# Global thumbnail
thumbnail = image.resize((tile_size, tile_size))
tiles.insert(0, thumbnail)
return tiles, (n_h, n_w)
def ask(self, image_path, question):
"""Answer a question about a document image."""
image = Image.open(image_path).convert("RGB")
        # Florence-2 expects an angle-bracketed task token before the text;
        # <DocVQA> assumes a DocVQA-fine-tuned Florence-2 checkpoint
        prompt = f"<DocVQA>{question}"
inputs = self.processor(
text=prompt, images=image, return_tensors="pt"
).to(self.device, torch.float16)
with torch.no_grad():
generated_ids = self.model.generate(
**inputs, max_new_tokens=256,
num_beams=3, early_stopping=True
)
result = self.processor.batch_decode(
generated_ids, skip_special_tokens=True
)[0]
return result
# Usage
doc_qa = DocumentQA()
answer = doc_qa.ask("invoice.png", "What is the total amount due?")
print(f"Answer: {answer}")
# For table extraction
table_result = doc_qa.ask("financial_report.png",
"Extract all values from the revenue table as structured data"
)
print(f"Table: {table_result}")
Video Frame Extraction and Processing
A practical video processing pipeline with multiple sampling strategies:
import cv2
import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
class VideoVLMProcessor:
"""Video processing with multiple frame sampling strategies."""
def __init__(self, clip_model="openai/clip-vit-base-patch32"):
self.clip_processor = CLIPProcessor.from_pretrained(clip_model)
self.clip_model = CLIPModel.from_pretrained(clip_model)
def extract_frames(self, video_path):
"""Extract all frames from a video file."""
cap = cv2.VideoCapture(video_path)
frames = []
fps = cap.get(cv2.CAP_PROP_FPS)
while cap.isOpened():
ret, frame = cap.read()
if not ret:
break
# Convert BGR -> RGB
frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
cap.release()
return frames, fps
def uniform_sample(self, frames, n_frames=8):
"""Uniform sampling: evenly spaced frames."""
indices = np.linspace(0, len(frames) - 1, n_frames, dtype=int)
return [frames[i] for i in indices], indices
def keyframe_sample(self, frames, threshold=30.0, max_frames=16):
"""Keyframe sampling: detect scene changes via histogram diff."""
keyframe_indices = [0] # Always include first frame
for i in range(1, len(frames)):
# Compare histograms between consecutive frames
hist_prev = cv2.calcHist([frames[i-1]], [0,1,2], None,
[8,8,8], [0,256,0,256,0,256])
hist_curr = cv2.calcHist([frames[i]], [0,1,2], None,
[8,8,8], [0,256,0,256,0,256])
diff = cv2.compareHist(hist_prev, hist_curr, cv2.HISTCMP_BHATTACHARYYA)
if diff > threshold / 100.0:
keyframe_indices.append(i)
        # Cap to max_frames by thinning the detected keyframes uniformly
        if len(keyframe_indices) > max_frames:
            positions = np.linspace(
                0, len(keyframe_indices) - 1, max_frames, dtype=int
            )
            keyframe_indices = [keyframe_indices[p] for p in positions]
        return [frames[i] for i in keyframe_indices], keyframe_indices
def content_aware_sample(self, frames, query, n_frames=8):
"""Content-aware sampling: select frames most relevant to query."""
# Score every Kth frame (skip frames for efficiency)
skip = max(1, len(frames) // 100)
candidate_indices = list(range(0, len(frames), skip))
# CLIP-score each candidate frame against the query
scores = []
for idx in candidate_indices:
pil_image = Image.fromarray(frames[idx])
inputs = self.clip_processor(
text=[query], images=pil_image, return_tensors="pt"
)
with torch.no_grad():
outputs = self.clip_model(**inputs)
score = outputs.logits_per_image.item()
scores.append((idx, score))
# Select top-n frames by relevance, maintaining temporal order
scores.sort(key=lambda x: x[1], reverse=True)
selected = sorted([s[0] for s in scores[:n_frames]])
return [frames[i] for i in selected], selected
def process_video_qa(self, video_path, question, strategy="uniform"):
"""Full pipeline: extract frames, sample, and prepare for VLM."""
frames, fps = self.extract_frames(video_path)
print(f"Video: {len(frames)} frames at {fps:.1f} FPS "
f"({len(frames)/fps:.1f}s)")
if strategy == "uniform":
sampled, indices = self.uniform_sample(frames)
elif strategy == "keyframe":
sampled, indices = self.keyframe_sample(frames)
        elif strategy == "content_aware":
            sampled, indices = self.content_aware_sample(frames, question)
        else:
            raise ValueError(f"Unknown sampling strategy: {strategy}")
# Convert to PIL for VLM input
pil_frames = [Image.fromarray(f) for f in sampled]
timestamps = [idx / fps for idx in indices]
print(f"Sampled {len(sampled)} frames at t={timestamps}")
return pil_frames, timestamps
# Usage
processor = VideoVLMProcessor()
frames, times = processor.process_video_qa(
"lecture.mp4",
"What equation is written on the whiteboard?",
strategy="content_aware"
)
GUI Element Detection
Detecting and classifying interactive elements in screenshots:
import torch
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModelForCausalLM
class GUIElementDetector:
"""Detect and classify interactive UI elements in screenshots."""
UI_ELEMENTS = [
"button", "text input", "checkbox", "dropdown",
"link", "tab", "slider", "toggle", "menu item",
"search bar", "icon", "navigation bar"
]
def __init__(self, model_name="microsoft/Florence-2-large", device="cuda"):
self.processor = AutoProcessor.from_pretrained(
model_name, trust_remote_code=True
)
self.model = AutoModelForCausalLM.from_pretrained(
model_name, trust_remote_code=True,
torch_dtype=torch.float16
).to(device)
self.device = device
def detect_elements(self, screenshot_path):
"""Detect all UI elements with bounding boxes and labels."""
image = Image.open(screenshot_path).convert("RGB")
W, H = image.size
        # Florence-2 grounding: angle-bracketed task token + free-text phrase
        task = "<CAPTION_TO_PHRASE_GROUNDING>"
        prompt = task + "interactive UI elements"
        inputs = self.processor(
            text=prompt, images=image, return_tensors="pt"
        ).to(self.device, torch.float16)
        with torch.no_grad():
            generated_ids = self.model.generate(
                **inputs, max_new_tokens=1024,
                num_beams=3
            )
        result = self.processor.batch_decode(
            generated_ids, skip_special_tokens=False
        )[0]
        # Parse structured output into {'bboxes': [...], 'labels': [...]}
        parsed = self.processor.post_process_generation(
            result, task=task,
            image_size=(W, H)
        )
        return parsed.get(task, {})
def find_clickable(self, screenshot_path, target_description):
"""Find the element matching a description and return click coords."""
image = Image.open(screenshot_path).convert("RGB")
W, H = image.size
        task = "<CAPTION_TO_PHRASE_GROUNDING>"
        prompt = task + target_description
        inputs = self.processor(
            text=prompt, images=image, return_tensors="pt"
        ).to(self.device, torch.float16)
        with torch.no_grad():
            generated_ids = self.model.generate(
                **inputs, max_new_tokens=256
            )
        result = self.processor.batch_decode(
            generated_ids, skip_special_tokens=False
        )[0]
        parsed = self.processor.post_process_generation(
            result, task=task,
            image_size=(W, H)
        )
        parsed = parsed.get(task, {})
# Return center of first matching bounding box
if parsed and 'bboxes' in parsed and len(parsed['bboxes']) > 0:
bbox = parsed['bboxes'][0]
center_x = (bbox[0] + bbox[2]) / 2
center_y = (bbox[1] + bbox[3]) / 2
return (center_x / W, center_y / H) # Normalized coords
return None
def visualize(self, screenshot_path, output_path):
"""Draw detected elements on screenshot."""
image = Image.open(screenshot_path).convert("RGB")
elements = self.detect_elements(screenshot_path)
draw = ImageDraw.Draw(image)
colors = ["#ef4444", "#f59e0b", "#22c55e", "#3b82f6",
"#8b5cf6", "#ec4899"]
if elements and 'bboxes' in elements:
for i, (bbox, label) in enumerate(
zip(elements['bboxes'], elements.get('labels', []))
):
color = colors[i % len(colors)]
draw.rectangle(bbox, outline=color, width=2)
draw.text((bbox[0], bbox[1] - 12), label,
fill=color)
image.save(output_path)
return output_path
# Usage
detector = GUIElementDetector()
# Detect all UI elements
elements = detector.detect_elements("app_screenshot.png")
print(f"Found {len(elements.get('bboxes', []))} UI elements")
# Find specific element for clicking
click_pos = detector.find_clickable(
"app_screenshot.png", "Submit button"
)
if click_pos:
print(f"Click at normalized coords: ({click_pos[0]:.3f}, {click_pos[1]:.3f})")
Hallucination Detection
Implementing POPE-style probing and CHAIR measurement:
import re
class HallucinationDetector:
"""Detect and measure VLM hallucinations using POPE and CHAIR metrics."""
# COCO object categories (80 classes)
COCO_OBJECTS = [
"person", "bicycle", "car", "motorcycle", "airplane", "bus", "train",
"truck", "boat", "traffic light", "fire hydrant", "stop sign",
"parking meter", "bench", "bird", "cat", "dog", "horse", "sheep",
"cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella",
"handbag", "tie", "suitcase", "frisbee", "skis", "snowboard",
"sports ball", "kite", "baseball bat", "baseball glove", "skateboard",
"surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork",
"knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange",
"broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair",
"couch", "potted plant", "bed", "dining table", "toilet", "tv",
"laptop", "mouse", "remote", "keyboard", "cell phone", "microwave",
"oven", "toaster", "sink", "refrigerator", "book", "clock", "vase",
"scissors", "teddy bear", "hair drier", "toothbrush"
]
# Common co-occurrence pairs for adversarial POPE
COOCCURRENCE = {
"cat": ["dog", "kitten", "mouse"],
"car": ["truck", "bus", "motorcycle"],
"person": ["dog", "bicycle", "backpack"],
"chair": ["couch", "dining table", "desk"],
"cup": ["bottle", "wine glass", "bowl"],
}
def pope_evaluate(self, vlm_fn, image_path, gt_objects, mode="random"):
"""
POPE evaluation: probe model with yes/no questions.
vlm_fn: callable(image_path, question) -> str
gt_objects: set of objects actually in the image
mode: 'random', 'popular', or 'adversarial'
"""
gt_set = set(gt_objects)
absent = [o for o in self.COCO_OBJECTS if o not in gt_set]
# Select negative samples based on mode
if mode == "random":
import random
negatives = random.sample(absent, min(len(gt_set), len(absent)))
        elif mode == "popular":
            # POPE "popular": the most frequent dataset objects that are absent.
            # Approximate frequency ranking for COCO (most common first).
            popular_order = ["person", "chair", "car", "dining table", "cup",
                             "bottle", "bowl", "handbag", "truck", "bench"]
            negatives = [o for o in popular_order if o in absent]
            negatives = negatives[:len(gt_set)]
        elif mode == "adversarial":
            # Absent objects that co-occur with present ones — the language
            # prior makes these the most tempting hallucinations
            negatives = []
            for obj in gt_set:
                for co in self.COOCCURRENCE.get(obj, []):
                    if co not in gt_set:
                        negatives.append(co)
            negatives = negatives[:len(gt_set)]
        else:
            raise ValueError(f"Unknown POPE mode: {mode}")
results = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
# Positive probes (objects that ARE present)
for obj in gt_set:
response = vlm_fn(image_path, f"Is there a {obj} in the image?")
if "yes" in response.lower():
results["tp"] += 1
else:
results["fn"] += 1
# Negative probes (objects that are NOT present)
for obj in negatives:
response = vlm_fn(image_path, f"Is there a {obj} in the image?")
if "yes" in response.lower():
results["fp"] += 1 # Hallucination!
else:
results["tn"] += 1
total = sum(results.values())
accuracy = (results["tp"] + results["tn"]) / max(total, 1)
precision = results["tp"] / max(results["tp"] + results["fp"], 1)
hallucination_rate = results["fp"] / max(results["fp"] + results["tn"], 1)
return {
"accuracy": accuracy,
"precision": precision,
"hallucination_rate": hallucination_rate,
"details": results,
"mode": mode
}
    def chair_evaluate(self, caption, gt_objects):
        """
        CHAIR metric: measure object hallucination in free-form captions.
        CHAIR_i (instance-level): hallucinated mentions / all object mentions.
        CHAIR_s (sentence-level): 1 if the caption hallucinates at all, else 0.
        """
        # Extract mentioned objects from caption
        caption_lower = caption.lower()
        mentioned = set()
        for obj in self.COCO_OBJECTS:
            # Check for object mention (word-boundary aware, plural tolerant)
            pattern = r'\b' + re.escape(obj) + r'(?:s|es)?\b'
            if re.search(pattern, caption_lower):
                mentioned.add(obj)
        gt_set = set(gt_objects)
        hallucinated = mentioned - gt_set
        chair_i = len(hallucinated) / max(len(mentioned), 1)
        chair_s = 1 if len(hallucinated) > 0 else 0
        return {
            "chair_s": chair_s,
            "chair_i": chair_i,
            "mentioned_objects": mentioned,
            "hallucinated_objects": hallucinated,
            "ground_truth_objects": gt_set
        }
# Usage
detector = HallucinationDetector()
# CHAIR evaluation on a caption
result = detector.chair_evaluate(
caption="A person sitting on a bench in a park with their dog "
"and a red frisbee on the grass.",
gt_objects=["person", "bench", "frisbee"]
    # "dog" is hallucinated; "grass" is not a COCO category, so it is ignored
)
print(f"CHAIR_i: {result['chair_i']:.2f}")
print(f"CHAIR_s: {result['chair_s']}")
print(f"Hallucinated: {result['hallucinated_objects']}")
References
Seminal papers and key works referenced in this article.
- Lin et al. "Video-LLaVA: Learning United Visual Representation by Alignment Before Projection." EMNLP, 2024. arXiv
- Hong et al. "CogAgent: A Visual Language Model for GUI Agents." CVPR, 2024. arXiv
- Li et al. "LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine." NeurIPS, 2023. arXiv
- Gemini Team. "Gemini: A Family of Highly Capable Multimodal Models." 2023. arXiv
- OpenAI. "GPT-4 Technical Report." 2023. arXiv
- Leng et al. "Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding." CVPR, 2024. arXiv
- Thrush et al. "Winoground: Probing Vision and Linguistic Compositionality of Vision and Language Models." CVPR, 2022. arXiv
- Li et al. "Evaluating Object Hallucination in Large Vision-Language Models." EMNLP, 2023. arXiv
- Rohrbach et al. "Object Hallucination in Image Captioning." EMNLP, 2018. arXiv