Qwen2-VL — Veanors

Chapter 0: The Problem

Imagine you have a beautiful 4K photograph of a document, and you need an AI model to read the tiny text. What happens? The model resizes it down to 224x224 pixels. All that detail is gone. The fine print is now a blur of pixels.

Or consider a panoramic landscape photo, wide and short. The model squishes it into a square, distorting every proportion. Trees become unnaturally tall and thin. Horizons warp.

This is the fundamental problem with most vision-language models circa 2023-2024. They encode every image at a fixed resolution -- typically 224x224 or 336x336 pixels -- regardless of the original image's size or shape. This approach has three serious consequences:

Information loss on high-resolution images. A 2000x1500 document image downsampled to 224x224 loses over 98% of its pixels (3,000,000 down to 50,176). Fine text, small objects, and subtle details simply vanish.
Aspect ratio distortion. A 16:9 video frame crammed into a 1:1 square stretches or crops content, changing spatial relationships the model should understand.
Wasted computation on low-resolution images. A tiny 100x100 thumbnail gets upsampled to 224x224, filling the frame with interpolated pixels that carry no new information and burning tokens the LLM has to process.
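The first two consequences are easy to check with a few lines of arithmetic. The sketch below is illustrative only: the helper names are made up here, and it assumes a plain stretch-to-224x224 resize with no cropping or letterboxing.

```python
def fraction_of_pixels_lost(src_w, src_h, dst_w=224, dst_h=224):
    """Share of the source pixel count that disappears after downsampling."""
    return 1.0 - (dst_w * dst_h) / (src_w * src_h)


def aspect_distortion(src_w, src_h, dst_w=224, dst_h=224):
    """Horizontal scale factor divided by vertical scale factor (1.0 = undistorted)."""
    return (dst_w / src_w) / (dst_h / src_h)


loss = fraction_of_pixels_lost(2000, 1500)
squeeze = aspect_distortion(1920, 1080)

print(f"{loss:.1%} of pixels lost")                               # 98.3% of pixels lost
print(f"horizontal scale is {squeeze:.2f}x the vertical scale")   # 0.56x
```

A 16:9 frame stretched into a square is compressed almost twice as hard horizontally as vertically, which is the distortion described above.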

The core tension: Vision Transformers use fixed position embeddings that expect a specific grid size -- typically a 14x14 grid of 16x16-pixel patches for a 224x224 image = 196 patches. If you change the image size, the grid changes and the learned position embeddings no longer line up with the patches. Previous models like Qwen-VL, LLaVA, and InternVL all hit this wall, either forcing a fixed resolution or relying on slice-and-dice tiling workarounds.

Why do most VLMs circa 2023 process all images at a fixed resolution like 224x224?

Because the Vision Transformer's absolute position embeddings expect a fixed grid size -- changing the resolution breaks the position encoding
Because higher resolution images are too large to fit in GPU memory
Because all images on the internet are 224x224

Chapter 1: The Key Insight

Qwen2-VL's central idea is deceptively simple: don't resize the image. Process it at its native resolution.

This sounds obvious -- of course we should preserve the original pixels. But it requires solving two hard problems that previous models dodged by fixing the resolution:

Variable-length token sequences. With 14x14-pixel patches, a 224x224 image produces 256 patches, while a 1344x896 image produces 6,144 patches. The model must handle both without breaking. Qwen2-VL calls this Naive Dynamic Resolution.
Position encoding for arbitrary grids. Standard ViTs use learned absolute position embeddings -- one vector per patch position. These are fixed at training time. Qwen2-VL replaces them with 2D Rotary Position Embeddings (2D-RoPE), which generalize to any height and width.

On top of these two innovations, Qwen2-VL adds a third design choice: treating images and videos with a unified architecture. An image is treated as a two-frame video whose frames are identical (the duplication keeps it consistent with the 3D convolution). A video is a sequence of frames, each encoded the same way.

The insight in one sentence: Replace fixed position embeddings with 2D-RoPE so the ViT can handle any resolution, then let each image produce as many (or as few) visual tokens as its resolution demands. No resizing, no padding, no cropping -- just feed the pixels as they are.

The result: Qwen2-VL-72B matches or exceeds GPT-4o and Claude 3.5 Sonnet on benchmarks spanning document understanding, real-world QA, mathematical reasoning, and video comprehension -- all with a single, unified architecture.

What are the two key technical innovations that enable Qwen2-VL to process images at arbitrary resolutions?

Naive Dynamic Resolution (variable-length token sequences) and 2D-RoPE (position encoding for any grid size)
Higher GPU memory and bigger batch sizes
A larger ViT encoder and more training data

Chapter 2: Naive Dynamic Resolution

The name "Naive Dynamic Resolution" is the paper's own term, and the authors chose "naive" deliberately. There's nothing clever about it -- which is the point. The approach is: take the image at whatever resolution it comes in, divide it into patches, and produce one token per patch. No resizing, no special tiling schemes, no padding.

Here's how it works step by step:

Input: An image arrives at its original resolution, say 896x672 pixels.
Patch extraction: The ViT uses a patch size of 14x14 pixels. So this image becomes a 64x48 grid of patches = 3,072 patch tokens.
2D-RoPE encoding: Each patch gets a 2D position embedding based on its (row, column) position in the grid. No fixed grid required.
ViT processing: The 675M parameter Vision Transformer processes all patches through self-attention.
Token compression: An MLP merges every 2x2 block of adjacent tokens into a single token, reducing the count by 4x. Our 3,072 patches become 768 visual tokens.
Delimiting: Special tokens <|vision_start|> and <|vision_end|> wrap the compressed tokens before they enter the LLM.

So a 224x224 image produces (224/14)^2 / 4 = 64 tokens, while a 1344x896 image produces (96x64)/4 = 1,536 tokens. The LLM sees proportionally more detail for higher-resolution inputs.
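The arithmetic above can be wrapped in a small helper. This is a minimal sketch, not the model's preprocessing code: the function name visual_token_count is made up for illustration, and it assumes both sides are already multiples of 28 pixels (patch size 14 times the 2x2 merge), raising an error otherwise rather than guessing how a real pipeline would round.

```python
def visual_token_count(width, height, patch_size=14, merge=2):
    """Return (rows, cols, patches, tokens) for an image of the given pixel size.

    Assumption: width and height are multiples of patch_size * merge.
    """
    unit = patch_size * merge
    if width % unit or height % unit:
        raise ValueError(f"sides must be multiples of {unit} pixels")
    cols = width // patch_size          # patches per row
    rows = height // patch_size         # patches per column
    patches = rows * cols               # tokens inside the ViT
    tokens = patches // (merge * merge) # tokens after 2x2 compression
    return rows, cols, patches, tokens


for size in [(224, 224), (896, 672), (1344, 896)]:
    print(size, visual_token_count(*size))
# (224, 224)  -> (16, 16, 256, 64)
# (896, 672)  -> (48, 64, 3072, 768)
# (1344, 896) -> (64, 96, 6144, 1536)
```

The two start/end delimiter tokens are not counted here; add 2 for the total that enters the LLM.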

Packing for efficiency: During inference, images of different resolutions are packed into a single sequence (like packing variable-length sentences in NLP). The total packed length is capped to control GPU memory. This means you can process a batch of mixed-resolution images without padding waste.
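To make the packing idea concrete, here is a hedged sketch. The text only says that images are packed into one sequence with a capped length; the greedy, order-preserving strategy, the function names, and the 8,192 cap below are all assumptions chosen for illustration. The cumulative boundaries are the kind of bookkeeping a packed sequence typically needs so that each image's patches can be told apart.

```python
def pack_images(patch_counts, max_packed_len):
    """Greedily group image indices so each group's total length stays under the cap."""
    packs, current, used = [], [], 0
    for idx, n in enumerate(patch_counts):
        if n > max_packed_len:
            raise ValueError(f"image {idx} alone exceeds the cap")
        if used + n > max_packed_len:   # current pack is full: start a new one
            packs.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        packs.append(current)
    return packs


def boundaries(patch_counts, pack):
    """Cumulative start/end offsets of each image inside one packed sequence."""
    offsets = [0]
    for idx in pack:
        offsets.append(offsets[-1] + patch_counts[idx])
    return offsets


# ViT patch counts for 224x224, 896x672, 1344x896 and 672x448 images
counts = [256, 3072, 6144, 1536]
packs = pack_images(counts, max_packed_len=8192)
print(packs)                                   # [[0, 1], [2, 3]]
print([boundaries(counts, p) for p in packs])  # [[0, 256, 3328], [0, 6144, 7680]]
```

No padding tokens appear anywhere: each packed sequence is exactly as long as the images it holds.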

Why "Naive" works

Other models tried more complex approaches: slicing images into fixed-size tiles (LLaVA-1.6), using multi-scale feature pyramids, or applying adaptive pooling. These work, but they add complexity and often lose spatial relationships between tiles. Qwen2-VL shows that the simple approach -- just process the whole image as one grid with 2D-RoPE -- is both simpler and more effective.

How many visual tokens does a 672x448 image produce in Qwen2-VL (patch_size=14, 2x2 compression)?

(672/14) x (448/14) / 4 = 48 x 32 / 4 = 384 tokens
224 tokens (always fixed)
672 x 448 = 301,056 tokens

Chapter 3: 2D-RoPE

Standard Rotary Position Embeddings (RoPE) encode position as a rotation in the complex plane. For a 1D sequence of text tokens at position p, RoPE rotates each pair of embedding dimensions by an angle proportional to p:

RoPE(x, p) = x · e^{i·p·θ}

Where θ varies across dimensions, creating a multi-frequency encoding. The beauty of RoPE is that the dot product between two rotated vectors depends only on their relative position difference, which is exactly what attention needs.
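The relative-position property can be checked numerically. The sketch below is a generic 1D RoPE in NumPy, not Qwen2-VL's code: the pairing of adjacent dimensions and the frequency base of 10000 are common conventions assumed here, and rope_1d is a name chosen for illustration.

```python
import numpy as np


def rope_1d(x, pos, base=10000.0):
    """Rotate each pair (x[0], x[1]), (x[2], x[3]), ... by the angle pos * theta_k."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    z = x[0::2] + 1j * x[1::2]                  # treat each pair as a complex number
    z = z * np.exp(1j * pos * theta)            # rotation in the complex plane
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_near = rope_1d(q, 5) @ rope_1d(k, 2)      # positions 5 and 2, offset 3
score_far = rope_1d(q, 105) @ rope_1d(k, 102)   # positions 105 and 102, offset 3
print(np.isclose(score_near, score_far))        # True: only the offset matters
```

Shifting both positions by 100 leaves the attention score unchanged because the two rotations cancel except for their difference.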

But images are 2D. A patch at row 3, column 7 is not the same as a patch at row 7, column 3. Standard 1D-RoPE would flatten the 2D grid into a 1D sequence, losing the distinction between horizontal and vertical neighbors.

Extending RoPE to 2D

Qwen2-VL's solution: split the embedding dimensions into two halves and apply separate RoPE rotations for height and width:

2D-RoPE(x, h, w) = [x_{first half} · e^{i·h·θ} , x_{second half} · e^{i·w·θ}]

The first half of dimensions encodes vertical position h, the second half encodes horizontal position w. Now the attention pattern naturally captures both horizontal and vertical spatial relationships.
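A small NumPy sketch of the split-in-half scheme follows. It mirrors the formula above rather than any released implementation: the helper names, the adjacent-pair rotation convention, and the frequency base are assumptions.

```python
import numpy as np


def rotate_pairs(x, pos, base=10000.0):
    """1D RoPE: rotate adjacent pairs of x by pos * theta_k."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    z = (x[0::2] + 1j * x[1::2]) * np.exp(1j * pos * theta)
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out


def rope_2d(x, row, col):
    """First half of the dimensions encodes the row, second half the column."""
    half = x.shape[-1] // 2
    return np.concatenate([rotate_pairs(x[:half], row), rotate_pairs(x[half:], col)])


rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# (row 3, col 7) and (row 7, col 3) now receive different encodings
print(np.allclose(rope_2d(q, 3, 7), rope_2d(q, 7, 3)))   # False

# Same (row, col) offset of (2, 5) at two different places in the grid
a = rope_2d(q, 3, 7) @ rope_2d(k, 1, 2)
b = rope_2d(q, 13, 27) @ rope_2d(k, 11, 22)
print(np.isclose(a, b))                                   # True
```

The second check shows that the score depends on the row offset and column offset separately, which a flattened 1D index cannot express.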

Why this matters for dynamic resolution

Learned absolute position embeddings are vectors stored in a lookup table: position 0 gets vector e₀, position 1 gets e₁, and so on. If you trained with a 16x16 grid (256 positions), you have exactly 256 vectors. A 32x48 grid needs 1,536 positions -- there are no learned vectors for those.

2D-RoPE has no lookup table. It computes positions on the fly from the (h, w) coordinates using sinusoidal rotations. Any height, any width, any aspect ratio -- the position encoding is always well-defined. This is what makes Naive Dynamic Resolution possible.
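The contrast between a stored table and a computed encoding can be shown in a few lines. This is a toy illustration with made-up sizes (an 8-dimensional embedding, a 16x16 training grid) and hypothetical function names; it is not how either kind of encoding is implemented in a real ViT.

```python
import numpy as np

TRAIN_ROWS, TRAIN_COLS, DIM = 16, 16, 8
rng = np.random.default_rng(0)
learned_table = rng.normal(size=(TRAIN_ROWS * TRAIN_COLS, DIM))  # 256 stored vectors


def learned_position(row, col, grid_cols):
    """Lookup-table encoding: fails once the flat index passes the table size."""
    return learned_table[row * grid_cols + col]


def rope_angles(row, col, dim=DIM, base=10000.0):
    """Computed encoding: rotation angles for any (row, col), no table needed."""
    pairs_per_axis = dim // 4   # half the dims per axis, two dims per rotated pair
    theta = base ** (-np.arange(pairs_per_axis) / pairs_per_axis)
    return np.concatenate([row * theta, col * theta])


print(learned_position(15, 15, grid_cols=16).shape)   # (8,) last trained position
try:
    learned_position(31, 47, grid_cols=48)            # 32x48 grid, flat index 1535
except IndexError as err:
    print("lookup failed:", err)
print(rope_angles(31, 47))                            # defined for any coordinates
```

In practice fixed-table models work around this by interpolating the table, but that is a patch on top of the lookup rather than a property of it.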

The key property: RoPE encodes relative positions, not absolute ones. A patch 3 rows above and 2 columns to the left always gets the same relative encoding, regardless of the total image size. This means the ViT can generalize to resolutions never seen during training.

Why can't learned absolute position embeddings handle variable-resolution images?

They are stored as a fixed-size lookup table -- positions beyond the training grid have no learned vectors
They require too much GPU memory
They are only defined for 1D sequences

Chapter 4: Unified Image-Video

Most VLMs treat images and videos as fundamentally different modalities. Images go through a 2D ViT; videos through a separate 3D video encoder or a frame-by-frame pipeline with temporal aggregation bolted on. Qwen2-VL uses a single pathway for both.

The trick: 3D convolutions with depth 2

Instead of processing 2D patches, the ViT's patch embedding layer uses 3D convolutions that consume 2 frames at a time. Each "patch" is actually a 3D tube: 14 pixels wide x 14 pixels tall x 2 frames deep.

For videos sampled at 2 fps, this means every pair of consecutive frames is merged into one set of spatial tokens. A 10-second video (20 frames) produces only 10 temporal positions worth of tokens, not 20.

For images, the solution is elegant: treat each image as two identical frames. The 3D convolution processes the duplicated pair just like a video pair, producing one set of spatial tokens. The model doesn't need to distinguish "image mode" from "video mode" -- everything is tubes.
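The tube extraction can be sketched with plain array reshapes. A 3D convolution whose stride equals its kernel size is equivalent to cutting the input into non-overlapping tubes and applying one linear projection to each, so the sketch below only does the cutting. The function names and the (frames, height, width, channels) layout are assumptions for illustration.

```python
import numpy as np


def to_tubes(frames, patch=14, depth=2):
    """(T, H, W, C) -> (T/depth, H/patch, W/patch, depth*patch*patch*C) flattened tubes."""
    t, h, w, c = frames.shape
    if t % depth or h % patch or w % patch:
        raise ValueError("frame count and sides must divide evenly into tubes")
    x = frames.reshape(t // depth, depth, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group each tube's pixels together
    return x.reshape(t // depth, h // patch, w // patch, -1)


def image_as_video(image):
    """Duplicate a single (H, W, C) image into two identical frames."""
    return np.stack([image, image], axis=0)


video = np.zeros((20, 224, 224, 3), dtype=np.uint8)   # 10 seconds at 2 fps
image = np.zeros((224, 224, 3), dtype=np.uint8)

print(to_tubes(video).shape)                  # (10, 16, 16, 1176)
print(to_tubes(image_as_video(image)).shape)  # (1, 16, 16, 1176)
```

Both inputs go through the same function: the video yields 10 temporal positions and the image yields 1, each with a 16x16 spatial grid.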

Multimodal RoPE (M-RoPE)

To handle the combined text + image + video sequences in the LLM, Qwen2-VL extends the 2D-RoPE to three dimensions by decomposing the RoPE embedding into three components:

Temporal component: For text tokens, all three IDs are the same (collapsing to 1D-RoPE). For image tokens, the temporal ID stays constant (one "time step"). For video tokens, the temporal ID increments per frame.
Height component: Encodes the row position of each visual token within its frame.
Width component: Encodes the column position of each visual token within its frame.

When the model sees a sequence like [text, image, text, video, text], position numbering for each segment starts at one plus the largest position ID used by the preceding segment. This keeps the segments well ordered and prevents their position IDs from colliding.
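The ID assignment can be sketched as follows. This follows the rules stated above; the segment format, the function name, and the choice to offset the height and width IDs by the segment's starting position are assumptions made for illustration, not a description of the released code.

```python
def build_position_ids(segments):
    """Assign (temporal, height, width) IDs to a mixed sequence.

    segments: ("text", n_tokens) or ("vision", n_frames, rows, cols) tuples.
    """
    ids, start = [], 0
    for seg in segments:
        if seg[0] == "text":
            # Text: all three IDs equal, so M-RoPE collapses to 1D RoPE
            new = [(start + i,) * 3 for i in range(seg[1])]
        else:
            _, frames, rows, cols = seg
            # Vision: temporal ID per frame, row/col IDs within the frame
            new = [(start + t, start + r, start + c)
                   for t in range(frames)
                   for r in range(rows)
                   for c in range(cols)]
        if not new:
            continue
        ids.extend(new)
        start = max(max(p) for p in new) + 1   # next segment starts past the maximum
    return ids


ids = build_position_ids([("text", 3), ("vision", 1, 2, 3), ("text", 2)])
for triple in ids:
    print(triple)
# text:  (0,0,0) (1,1,1) (2,2,2)
# image: (3,3,3) (3,3,4) (3,3,5) (3,4,3) (3,4,4) (3,4,5)
# text:  (6,6,6) (7,7,7)
```

The six image tokens consume only three position values (3 to 5), so the text after the image resumes at 6 rather than 9.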

Bonus: reduced position IDs for extrapolation. Because height and width are indexed on separate axes instead of being flattened into one sequence, a 1024x1024 image only needs position IDs up to about 73 (1024/14 ≈ 73) instead of 5,329 (73x73 flattened). Smaller position IDs mean easier extrapolation to longer sequences at inference time.

How does Qwen2-VL process a single image through its video-native ViT?

It duplicates the image into two identical frames and processes them as a 3D tube through the 3D convolution layer
It switches to a separate 2D processing mode
It pads the temporal dimension with zeros

Chapter 5: Architecture

Qwen2-VL follows the standard VLM pipeline: vision encoder → cross-modal connector → LLM. But each component has specific design choices that make the system work together.

Vision Transformer (675M parameters)

A standard ViT, but with two critical modifications:

Absolute position embeddings removed -- replaced with 2D-RoPE
3D convolution patch embedding -- patch_size = 14x14x2 (height x width x temporal)

Crucially, the same 675M parameter ViT is used across all model sizes (2B, 7B, 72B). This keeps the vision computation constant -- only the LLM scales.

Vision-Language Merger (MLP)

After the ViT, a simple MLP layer compresses every 2x2 block of adjacent visual tokens into a single token. This 4x reduction is essential: without it, a high-resolution image would flood the LLM with thousands of tokens. After compression, special delimiter tokens <|vision_start|> and <|vision_end|> are added.
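A hedged sketch of the 2x2 merge: neighbouring features are concatenated and then projected by a small MLP. The text does not give the merger's layer sizes or activation, so the two-layer shape, the ReLU, the tiny dimensions, and the function names below are all illustrative assumptions.

```python
import numpy as np


def merge_2x2(features):
    """(rows, cols, dim) -> (rows/2, cols/2, 4*dim) by concatenating each 2x2 block."""
    rows, cols, dim = features.shape
    x = features.reshape(rows // 2, 2, cols // 2, 2, dim)
    x = x.transpose(0, 2, 1, 3, 4)          # bring the 2x2 block next to dim
    return x.reshape(rows // 2, cols // 2, 4 * dim)


def mlp(x, w1, b1, w2, b2):
    """Two-layer projection into the LLM's embedding width (ReLU is a stand-in)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


rng = np.random.default_rng(0)
vit_dim, llm_dim = 8, 16                        # toy sizes, not the real widths
feats = rng.normal(size=(48, 64, vit_dim))      # 896x672 image -> 48x64 patch grid

merged = merge_2x2(feats)
w1, b1 = rng.normal(size=(4 * vit_dim, 64)), np.zeros(64)
w2, b2 = rng.normal(size=(64, llm_dim)), np.zeros(llm_dim)
tokens = mlp(merged.reshape(-1, 4 * vit_dim), w1, b1, w2, b2)

print(merged.shape)   # (24, 32, 32)
print(tokens.shape)   # (768, 16)
```

The 3,072 patch features become 768 tokens, matching the 4x reduction described above.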

Language Model (Qwen2 series)

Three sizes, all initialized from the pretrained Qwen2 LLM:

Qwen2-VL-2B: 1.5B LLM -- designed for on-device use
Qwen2-VL-7B: 7.6B LLM -- balanced performance/cost
Qwen2-VL-72B: 72B LLM -- maximum capability

Token budget example: A 224x224 image has (224/14)^2 = 256 ViT patches. After 2x2 compression: 64 tokens. With the start/end delimiters: 66 tokens total entering the LLM. A 1344x896 image: (96x64)/4 = 1,536 tokens + 2 = 1,538 tokens. The token budget scales with the image's resolution rather than being fixed in advance.

Why does Qwen2-VL use the same 675M parameter ViT across all model sizes (2B, 7B, 72B)?

To keep the visual computation constant regardless of LLM scale -- only the language model grows, while the vision encoder stays fixed
Because 675M is the minimum size for a ViT
To save disk space when distributing the models

Chapter 6: Training

Qwen2-VL follows a three-stage training pipeline, progressively unlocking more parameters and adding more diverse data at each stage.

Stage 1: Pre-training the ViT (600B tokens)

Only the Vision Transformer is trained. The LLM is frozen (initialized from pretrained Qwen2). The ViT is initialized from DFN (Data Filtering Networks), but with the original absolute position embeddings replaced by 2D-RoPE.

Training data: image-text pairs, OCR data, image classification tasks. The goal is to align the ViT's visual representations with the LLM's text space -- teaching the vision encoder to produce features the language model can understand.

Stage 2: Multi-task pre-training (800B tokens)

All parameters are unfrozen -- ViT, merger, and LLM all train together. The data mix expands dramatically:

Mixed image-text content (interleaved articles)
Visual question answering datasets
Multi-task datasets
Pure text data (to maintain language capability)

Total across both pre-training stages: 1.4 trillion tokens (both text and image tokens). Supervision is only on text tokens -- image tokens are inputs, not prediction targets.

Stage 3: Instruction fine-tuning

The ViT is now frozen. Only the LLM is fine-tuned on instruction-following data in ChatML format. The data includes:

Image QA and document parsing
Multi-image comparison
Video comprehension and dialogue
Agent-based interactions (GUI control, tool use)
Pure text dialogue (to maintain chat ability)

The freeze-unfreeze-freeze pattern: Stage 1 freezes the LLM to train alignment. Stage 2 unfreezes everything for deep co-adaptation. Stage 3 freezes the ViT to fine-tune behavior without corrupting learned visual representations. The ordering is deliberate: visual alignment, joint co-adaptation, and behavioral tuning each get their own stage instead of competing within one.
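The schedule can be written down as a small table of which components train in each stage. This is a summary of the description above, not a training configuration from the paper: the Stage class and component names are made up, and treating the merger as frozen in Stages 1 and 3 is a literal reading of "only the ViT" and "only the LLM" that may not match the actual setup.

```python
from dataclasses import dataclass
from typing import Optional

ALL_COMPONENTS = frozenset({"vit", "merger", "llm"})


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset
    tokens_billion: Optional[int]   # None where the text gives no token count


STAGES = [
    Stage("vit_pretraining", frozenset({"vit"}), 600),
    Stage("multitask_pretraining", ALL_COMPONENTS, 800),
    Stage("instruction_finetuning", frozenset({"llm"}), None),
]


def frozen_components(stage):
    """Everything that is not listed as trainable stays frozen."""
    return ALL_COMPONENTS - stage.trainable


for stage in STAGES:
    print(stage.name, "| train:", sorted(stage.trainable),
          "| frozen:", sorted(frozen_components(stage)))

pretraining_tokens = sum(s.tokens_billion or 0 for s in STAGES)
print(pretraining_tokens)   # 1400 (billion), i.e. 1.4 trillion
```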

In the three-stage training pipeline, which components are trained in Stage 2?

All parameters: ViT, merger, and LLM are all unfrozen and trained jointly
Only the LLM is trained
Only the ViT is trained

Chapter 7: Results

Qwen2-VL-72B achieves results competitive with GPT-4o and Claude 3.5 Sonnet across a broad range of benchmarks, while the 7B model significantly outperforms most open-source alternatives.

Headline numbers (Qwen2-VL-72B)

DocVQA: 96.5 (previous SOTA: 94.1) -- reading documents at any resolution pays off
OCRBench: 877 (previous SOTA: 852) -- fine-grained text recognition
MTVQA: 30.9 (GPT-4o: 27.8) -- multilingual visual text understanding
RealWorldQA: 77.8 (previous SOTA: 72.2) -- real-world spatial reasoning
MathVista: 70.5 (previous SOTA: 69.0) -- mathematical visual reasoning
MMBench-EN: 86.5 -- matching previous SOTA exactly

Where dynamic resolution helps most

The biggest wins come on benchmarks that demand fine-grained visual detail: document understanding (DocVQA, InfoVQA), OCR tasks, and multilingual text recognition. These are exactly the tasks where fixed-resolution models lose information by downsampling.

The ablation study confirms this: removing dynamic resolution degrades DocVQA performance significantly, while tasks that don't require fine detail (general VQA) are less affected.

Scaling law finding: Performance scales log-linearly with both model size and data size. Going from 2B to 7B to 72B parameters shows consistent gains, and each doubling of training data yields a roughly constant improvement. This suggests the Qwen2-VL architecture hasn't yet hit a scaling ceiling.

On which category of benchmarks does Qwen2-VL show the largest improvements over fixed-resolution models?

Document understanding and OCR benchmarks -- tasks requiring fine-grained visual detail that fixed-resolution models lose by downsampling
General knowledge QA benchmarks
Audio understanding benchmarks

Chapter 8: Agent Capabilities

Beyond standard VQA, Qwen2-VL is designed as a visual agent -- a model that can look at a screen, understand the UI, and take actions to accomplish tasks. This is where dynamic resolution becomes essential in practice: real phone and desktop screenshots are high-resolution, with small text and tiny buttons that a 224x224 model could never read.

GUI grounding

The model can identify and localize UI elements on screen. Given a screenshot and an instruction like "Find the search bar," it outputs bounding box coordinates normalized to [0, 1000):

<|object_ref_start|>search bar<|object_ref_end|>
<|box_start|>(245,89),(756,134)<|box_end|>
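Turning that output back into pixel coordinates takes a regular expression and a rescale. The sketch below assumes the exact token layout shown above, that each pair is (x, y) with the first pair as the top-left corner, and a hypothetical 1080x2400 screenshot; parse_boxes is a name chosen for illustration.

```python
import re

BOX_PATTERN = re.compile(
    r"<\|object_ref_start\|>(?P<label>.*?)<\|object_ref_end\|>\s*"
    r"<\|box_start\|>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)<\|box_end\|>",
    re.DOTALL,
)


def parse_boxes(text, image_width, image_height, scale=1000):
    """Extract (label, (x1, y1, x2, y2)) pairs, converted from [0, scale) to pixels."""
    results = []
    for m in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        box = (x1 * image_width / scale, y1 * image_height / scale,
               x2 * image_width / scale, y2 * image_height / scale)
        results.append((m.group("label"), box))
    return results


sample = (
    "<|object_ref_start|>search bar<|object_ref_end|>\n"
    "<|box_start|>(245,89),(756,134)<|box_end|>"
)
boxes = parse_boxes(sample, image_width=1080, image_height=2400)
print(boxes)   # one entry labelled 'search bar', roughly (264.6, 213.6, 816.5, 321.6)
```

Normalizing to a fixed range means the same output format works for any screenshot size; only the final rescale depends on the image.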

Sequential decision making

Complex tasks are decomposed into multi-step action sequences. The model observes a screenshot, reasons about the next action, executes it (tap, scroll, type), receives a new screenshot, and repeats. This is formulated as a function-calling loop:

Observe the current screenshot
Reason about what action to take next
Output a function call (Tap, Scroll, Type, Home, etc.) with parameters
Receive the result (new screenshot)
Repeat until the task is complete
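The loop above can be sketched as a short driver function. Everything here is a stand-in: the observe, decide, and execute callables represent the screenshot capture, the model call, and the device controller, and the "Done" action used to signal completion is a hypothetical convention rather than part of the described action set.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)


def run_agent(task, observe, decide, execute, max_steps=20):
    """Observe -> reason -> act until the model signals completion or steps run out."""
    history = []
    for _ in range(max_steps):
        screenshot = observe()                        # current screen state
        action = decide(task, screenshot, history)    # model picks a function call
        history.append(action)
        if action.name == "Done":                     # hypothetical stop signal
            break
        execute(action)                               # changes the screen for the next step
    return history


# Scripted stubs in place of a real model and device
script = iter([
    Action("Tap", {"x": 500, "y": 90}),
    Action("Type", {"text": "weather"}),
    Action("Done"),
])
executed = []
history = run_agent(
    "search for weather",
    observe=lambda: b"fake-screenshot",
    decide=lambda task, shot, hist: next(script),
    execute=executed.append,
)
print([a.name for a in history])   # ['Tap', 'Type', 'Done']
```

Passing the history back into decide is what lets the model condition each step on the actions it has already taken.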

Why native resolution matters for agents: Mobile phone screenshots are typically 1080x2400 or higher. A "Settings" icon might be 48x48 pixels in a 1080-wide screen -- less than 5% of the width. Squeezed to 224x224, that icon shrinks to roughly 10 pixels wide and under 5 pixels tall, completely unreadable. Qwen2-VL processes these screenshots at full resolution, making small UI elements legible and groundable.

The model supports diverse agent tasks: phone operation, web browsing, robotic control, game playing, and navigation. Each task defines a set of permissible actions, and Qwen2-VL chains them through reasoning.

Why is dynamic resolution particularly important for GUI agent tasks?

Real screenshots are high-resolution with small UI elements -- fixed 224x224 processing would make tiny buttons and text completely unreadable
GUI tasks require faster inference speed
The model needs to generate images of the GUI

Chapter 9: Connections

Qwen2-VL sits at a particular point in the evolution of vision-language models. Here's how it relates to the broader landscape:

Direct predecessor: Qwen-VL (2023)

The original Qwen-VL used a fixed 448x448 resolution with learned absolute position embeddings. Qwen2-VL's three main upgrades -- dynamic resolution, 2D-RoPE, and unified image-video processing -- directly address Qwen-VL's limitations.

Competing approaches to dynamic resolution

LLaVA-1.6 / LLaVA-NeXT: Splits high-res images into fixed tiles (e.g., 2x2 or 1x3 grids of 336x336 tiles). Each tile is processed independently, then concatenated. This handles higher resolution but loses cross-tile spatial relationships and introduces grid artifacts.
InternVL 1.5/2.0: Similar tiling approach with a dynamic number of tiles. More flexible than LLaVA but still fundamentally tile-based.
Qwen2-VL: No tiling at all. The entire image is processed as one grid with 2D-RoPE. Simpler and preserves global spatial relationships.

Position encoding evolution

ViT (2020): Learned absolute position embeddings, fixed to the training grid (14x14 = 196 positions for 224x224 inputs with 16x16-pixel patches)
SwinV2 (2022): Relative position bias in attention, some resolution flexibility
NaViT (2023, Google): Introduced the concept of "native resolution" ViTs with flexible packing -- a key inspiration for Qwen2-VL
Qwen2-VL (2024): 2D-RoPE for fully flexible position encoding

Closed-source competitors

GPT-4o (OpenAI): Details undisclosed, but likely uses some form of dynamic resolution. Qwen2-VL-72B matches or exceeds it on most benchmarks.
Claude 3.5 Sonnet (Anthropic): Also handles variable resolutions. Qwen2-VL-72B is competitive across the board.
Gemini 1.5 Pro (Google): Native multimodal with long-context video. Different architectural philosophy (early fusion vs. Qwen2-VL's separate vision encoder feeding an LLM).

The bigger picture: Qwen2-VL represents a trend in VLMs toward removing artificial constraints. Fixed resolution was a simplification that limited performance. As the field matures, models are becoming more "native" -- processing inputs in their natural form rather than forcing them into predetermined formats. The same philosophy drives native-resolution processing in NaViT, flexible context lengths in LLMs, and end-to-end multimodal models like Gemini.

How does Qwen2-VL's approach to high-resolution images differ from LLaVA-NeXT's tile-based approach?

Qwen2-VL processes the entire image as one continuous grid with 2D-RoPE, preserving global spatial relationships, while LLaVA-NeXT splits images into independent fixed-size tiles
Qwen2-VL uses smaller patches than LLaVA-NeXT
There is no difference -- both use tiling

Qwen2-VL: Perception at Any Resolution