Wang, Bai, Tan, Wang, Fan, Bai et al. (Alibaba Qwen Team) — 2024

Qwen2-VL: Perception at Any Resolution

A vision-language model that processes images at their native resolution and aspect ratio, using 2D rotary position embeddings and a unified image-video pipeline to achieve state-of-the-art multimodal understanding.

Prerequisites: Vision Transformers + Rotary Position Embeddings + Multimodal LLMs

Chapter 0: The Problem

Imagine you have a beautiful 4K photograph of a document, and you need an AI model to read the tiny text. What happens? The model resizes it down to 224x224 pixels. All that detail is gone. The fine print is now a blur of pixels.

Or consider a panoramic landscape photo, wide and narrow. The model squishes it into a square, distorting every proportion. Trees become stumpy. Horizons warp.

This is the fundamental problem with most vision-language models circa 2023-2024. They encode every image at a fixed resolution -- typically 224x224 or 336x336 pixels -- regardless of the original image's size or shape. This approach has three serious consequences:

The core tension: Vision Transformers use fixed position embeddings that expect a specific grid size -- typically a 14x14 grid of 16x16-pixel patches for a 224x224 image = 196 patches. If you change the image size, the position embeddings don't know where anything is. Previous models like Qwen-VL, LLaVA, and InternVL all hit this wall, either by forcing a fixed resolution or by using clumsy slice-and-dice workarounds.

Why do most VLMs circa 2023 process all images at a fixed resolution like 224x224?

Chapter 1: The Key Insight

Qwen2-VL's central idea is deceptively simple: don't resize the image. Process it at its native resolution.

This sounds obvious -- of course we should preserve the original pixels. But it requires solving two hard problems that previous models dodged by fixing the resolution:

  1. Variable-length token sequences. With 14x14-pixel patches, a 224x224 image produces 256 patches (16x16), while a 1344x896 image produces 6,144 patches (96x64). The model must handle both without breaking. Qwen2-VL calls this Naive Dynamic Resolution.
  2. Position encoding for arbitrary grids. Standard ViTs use learned absolute position embeddings -- one vector per patch position. These are fixed at training time. Qwen2-VL replaces them with 2D Rotary Position Embeddings (2D-RoPE), which generalize to any height and width.

On top of these two innovations, Qwen2-VL adds a third design choice: treating images and videos with a unified architecture. An image is just a video with one frame (actually two identical frames, for consistency with the 3D convolution). A video is a sequence of frames, each encoded the same way.

The insight in one sentence: Replace fixed position embeddings with 2D-RoPE so the ViT can handle any resolution, then let each image produce as many (or as few) visual tokens as its resolution demands. No resizing, no padding, no cropping -- just feed the pixels as they are.

The result: Qwen2-VL-72B matches or exceeds GPT-4o and Claude 3.5 Sonnet on benchmarks spanning document understanding, real-world QA, mathematical reasoning, and video comprehension -- all with a single, unified architecture.

What are the two key technical innovations that enable Qwen2-VL to process images at arbitrary resolutions?

Chapter 2: Naive Dynamic Resolution

The name "Naive Dynamic Resolution" is the paper's own term, and they chose "naive" deliberately. There's nothing clever about it -- which is the point. The approach is: take the image at whatever resolution it comes in, divide it into patches, and produce one token per patch. No resizing, no special tiling schemes, no padding.

Here's how it works step by step:

  1. Input: An image arrives at its original resolution, say 896x672 pixels.
  2. Patch extraction: The ViT uses a patch size of 14x14 pixels. So this image becomes a 64x48 grid of patches = 3,072 patch tokens.
  3. 2D-RoPE encoding: Each patch gets a 2D position embedding based on its (row, column) position in the grid. No fixed grid required.
  4. ViT processing: The 675M parameter Vision Transformer processes all patches through self-attention.
  5. Token compression: An MLP merges every 2x2 block of adjacent tokens into a single token, reducing the count by 4x. Our 3,072 patches become 768 visual tokens.
  6. Delimiting: Special tokens <|vision_start|> and <|vision_end|> wrap the compressed tokens before they enter the LLM.

So a 224x224 image produces (224/14)^2 / 4 = 64 tokens, while a 1344x896 image produces (96x64)/4 = 1,536 tokens. The LLM sees proportionally more detail for higher-resolution inputs.
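The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not the model's preprocessing code: the function name is hypothetical, and it assumes both image sides are already multiples of 28 pixels (patch size times merge size) so the patch grid splits evenly into 2x2 blocks.

```python
PATCH_SIZE = 14   # ViT patch edge in pixels (from the text)
MERGE_SIZE = 2    # the MLP merges each 2x2 block of patch tokens (from the text)


def visual_token_count(width: int, height: int) -> int:
    """Return the number of visual tokens left after 2x2 compression.

    Assumption: width and height are multiples of PATCH_SIZE * MERGE_SIZE.
    """
    unit = PATCH_SIZE * MERGE_SIZE
    if width % unit or height % unit:
        raise ValueError(f"width and height must be multiples of {unit}")
    grid_w = width // PATCH_SIZE    # patches per row
    grid_h = height // PATCH_SIZE   # patches per column
    return (grid_w * grid_h) // (MERGE_SIZE ** 2)


for w, h in [(224, 224), (896, 672), (1344, 896)]:
    print(f"{w}x{h}: {visual_token_count(w, h)} tokens")
# 224x224: 64 tokens
# 896x672: 768 tokens
# 1344x896: 1536 tokens
```

The token count grows linearly with pixel area, which is why a cap on the total packed length becomes necessary for very large inputs.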

Packing for efficiency: During inference, images of different resolutions are packed into a single sequence (like packing variable-length sentences in NLP). The total packed length is capped to control GPU memory. This means you can process a batch of mixed-resolution images without padding waste.
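To make the packing idea concrete, here is a small sketch under stated assumptions. The first-fit policy, the function names, and the 2,048-token cap are all hypothetical choices for illustration; the text only says that sequences are packed and that the packed length is capped. The offsets returned by `boundaries` mark where each image starts and ends inside a pack, which an attention implementation would typically need to keep images from attending to each other.

```python
from typing import List


def pack_first_fit(token_counts: List[int], max_packed_len: int) -> List[List[int]]:
    """Group image indices into packs whose total length stays within a cap.

    First-fit in arrival order: a hypothetical policy chosen for illustration.
    """
    packs: List[List[int]] = []
    totals: List[int] = []
    for idx, n in enumerate(token_counts):
        if n > max_packed_len:
            raise ValueError(f"image {idx} alone exceeds the cap ({n} > {max_packed_len})")
        for p, total in enumerate(totals):
            if total + n <= max_packed_len:
                packs[p].append(idx)
                totals[p] += n
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([idx])
            totals.append(n)
    return packs


def boundaries(token_counts: List[int], pack: List[int]) -> List[int]:
    """Cumulative offsets of each image inside one pack."""
    offsets = [0]
    for idx in pack:
        offsets.append(offsets[-1] + token_counts[idx])
    return offsets


counts = [64, 768, 1536, 64]   # tokens per image, using the sizes worked out above
packs = pack_first_fit(counts, max_packed_len=2048)
print(packs)                          # [[0, 1, 3], [2]]
print(boundaries(counts, packs[0]))   # [0, 64, 832, 896]
```

No pack contains padding tokens: every position in a packed sequence belongs to a real image.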

Why "Naive" works

Other models tried more complex approaches: slicing images into fixed-size tiles (LLaVA-1.6), using multi-scale feature pyramids, or applying adaptive pooling. These work, but they add complexity and often lose spatial relationships between tiles. Qwen2-VL shows that the simple approach -- just process the whole image as one grid with 2D-RoPE -- is both simpler and more effective.

How many visual tokens does a 672x448 image produce in Qwen2-VL (patch_size=14, 2x2 compression)?

Chapter 3: 2D-RoPE

Standard Rotary Position Embeddings (RoPE) encode position as a rotation in the complex plane. For a 1D sequence of text tokens at position p, RoPE rotates each pair of embedding dimensions by an angle proportional to p:

RoPE(x, p) = x · e^(i·p·θ)

Where θ varies across dimensions, creating a multi-frequency encoding. The beauty of RoPE is that the dot product between two rotated vectors depends only on their relative position difference, which is exactly what attention needs.
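The relative-position property is easy to verify numerically. The sketch below is a NumPy illustration, not Qwen2-VL's implementation: `rope_1d` is a hypothetical helper, and the frequency schedule with base 10000 is the conventional RoPE default, assumed here rather than taken from the text.

```python
import numpy as np


def rope_1d(x: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs of dimensions of a 1D vector x by pos * theta_k."""
    d = x.shape[-1]
    assert d % 2 == 0, "embedding dimension must be even"
    # One frequency per pair of dimensions: theta_k = base^(-2k/d).
    theta = base ** (-np.arange(0, d, 2) / d)
    # Treat each pair (x[2k], x[2k+1]) as one complex number and rotate it.
    z = (x[0::2] + 1j * x[1::2]) * np.exp(1j * pos * theta)
    out = np.empty(d, dtype=float)
    out[0::2] = z.real
    out[1::2] = z.imag
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (2) at different absolute positions gives the same attention score.
score_a = rope_1d(q, 5) @ rope_1d(k, 3)
score_b = rope_1d(q, 12) @ rope_1d(k, 10)
print(np.isclose(score_a, score_b))   # True
```

Because each pair is only rotated, vector norms are unchanged; position information lives entirely in the angles.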

But images are 2D. A patch at row 3, column 7 is not the same as a patch at row 7, column 3. Standard 1D-RoPE would flatten the 2D grid into a 1D sequence, losing the distinction between horizontal and vertical neighbors.

Extending RoPE to 2D

Qwen2-VL's solution: split the embedding dimensions into two halves and apply separate RoPE rotations for height and width:

2D-RoPE(x, h, w) = [x_first_half · e^(i·h·θ), x_second_half · e^(i·w·θ)]

The first half of dimensions encodes vertical position h, the second half encodes horizontal position w. Now the attention pattern naturally captures both horizontal and vertical spatial relationships.
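The split-and-rotate scheme can be sketched in NumPy as follows. This is an illustration of the formula above, not the model's code: the helper names are hypothetical, and the base-10000 frequency schedule is an assumed default.

```python
import numpy as np


def rotate_pairs(x: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """1D RoPE: rotate consecutive dimension pairs of x by pos * theta_k."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    z = (x[0::2] + 1j * x[1::2]) * np.exp(1j * pos * theta)
    out = np.empty(d, dtype=float)
    out[0::2], out[1::2] = z.real, z.imag
    return out


def rope_2d(x: np.ndarray, h: int, w: int) -> np.ndarray:
    """Rotate the first half of x by the row index h, the second half by the column index w."""
    half = x.shape[-1] // 2
    assert half % 2 == 0, "each half needs an even number of dimensions"
    return np.concatenate([rotate_pairs(x[:half], h), rotate_pairs(x[half:], w)])


rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# Same (row, column) offset of (2, 3) at two different places in the grid.
near = rope_2d(q, 3, 7) @ rope_2d(k, 1, 4)
far = rope_2d(q, 10, 20) @ rope_2d(k, 8, 17)
print(np.isclose(near, far))                             # True

# Row 3, column 7 is encoded differently from row 7, column 3.
print(np.allclose(rope_2d(q, 3, 7), rope_2d(q, 7, 3)))   # False
```

Nothing in `rope_2d` depends on the grid's total height or width, which is the property the next section relies on.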

Why this matters for dynamic resolution

Learned absolute position embeddings are vectors stored in a lookup table: position 0 gets vector e0, position 1 gets e1, and so on. If you trained with a 16x16 grid (256 positions), you have exactly 256 vectors. A 32x48 grid needs 1,536 positions -- there are no learned vectors for those.

2D-RoPE has no lookup table. It computes positions on the fly from the (h, w) coordinates using sinusoidal rotations. Any height, any width, any aspect ratio -- the position encoding is always well-defined. This is what makes Naive Dynamic Resolution possible.

The key property: RoPE encodes relative positions, not absolute ones. A patch 3 rows above and 2 columns to the left always gets the same relative encoding, regardless of the total image size. This means the ViT can generalize to resolutions never seen during training.

Why can't learned absolute position embeddings handle variable-resolution images?

Chapter 4: Unified Image-Video

Most VLMs treat images and videos as fundamentally different modalities. Images go through a 2D ViT; videos through a separate 3D video encoder or a frame-by-frame pipeline with temporal aggregation bolted on. Qwen2-VL uses a single pathway for both.

The trick: 3D convolutions with depth 2

Instead of processing 2D patches, the ViT's patch embedding layer uses 3D convolutions that consume 2 frames at a time. Each "patch" is actually a 3D tube: 14 pixels wide x 14 pixels tall x 2 frames deep.

For videos sampled at 2 fps, this means every pair of consecutive frames is merged into one set of spatial tokens. A 10-second video (20 frames) produces only 10 temporal positions worth of tokens, not 20.

For images, the solution is elegant: treat each image as two identical frames. The 3D convolution processes the duplicated pair just like a video pair, producing one set of spatial tokens. The model doesn't need to distinguish "image mode" from "video mode" -- everything is tubes.
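The tube idea can be shown with plain array reshaping. The sketch below only cuts a clip into flattened 2 x 14 x 14 tubes; the real patch embedding also applies a learned projection to each tube (a 3D convolution whose kernel and stride both equal the tube size is equivalent to a linear layer applied per tube). Function names and the toy image size are assumptions for illustration.

```python
import numpy as np

PATCH = 14   # spatial patch edge in pixels (from the text)
DEPTH = 2    # frames consumed per tube (from the text)


def to_tubes(frames: np.ndarray) -> np.ndarray:
    """Cut a (T, H, W, C) clip into flattened DEPTH x PATCH x PATCH tubes.

    Returns shape (T // DEPTH, H // PATCH, W // PATCH, DEPTH * PATCH * PATCH * C).
    Assumes T is a multiple of DEPTH and H, W are multiples of PATCH.
    """
    t, h, w, c = frames.shape
    assert t % DEPTH == 0 and h % PATCH == 0 and w % PATCH == 0
    x = frames.reshape(t // DEPTH, DEPTH, h // PATCH, PATCH, w // PATCH, PATCH, c)
    # Grid axes (time, row, column) first, tube contents last.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(t // DEPTH, h // PATCH, w // PATCH, -1)


def image_to_clip(image: np.ndarray) -> np.ndarray:
    """Duplicate one (H, W, C) image into a two-frame clip."""
    return np.stack([image, image], axis=0)


image = np.zeros((28, 42, 3))
video = np.zeros((20, 28, 42, 3))             # e.g. 10 seconds sampled at 2 fps
print(to_tubes(image_to_clip(image)).shape)   # (1, 2, 3, 1176)
print(to_tubes(video).shape)                  # (10, 2, 3, 1176)
```

The image and the video go through the same function; the only difference is the size of the first (temporal) axis.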

Multimodal RoPE (M-RoPE)

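To handle the combined text + image + video sequences in the LLM, Qwen2-VL extends the 2D-RoPE to three dimensions by decomposing the RoPE embedding into three components: temporal, height, and width. Text tokens use the same ID for all three components, which makes M-RoPE equivalent to ordinary 1D-RoPE on text. Image tokens keep the temporal ID constant and vary the height and width IDs by grid position. Video tokens additionally advance the temporal ID with each frame.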

When the model sees a sequence like [text, image, text, video, text], the position IDs for each modality start by incrementing the maximum position of the preceding modality by one. This ensures all tokens have unique, well-ordered positions.
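The numbering rule can be sketched as a small function that assigns a (temporal, height, width) ID triple to every token. This is an illustrative reading of the description, not the released implementation: the segment format and function name are hypothetical, and the vision grid here stands for whatever token grid reaches the LLM.

```python
from typing import List, Tuple


def mrope_position_ids(segments: list) -> List[Tuple[int, int, int]]:
    """Return one (temporal, height, width) ID triple per token.

    segments: list of ("text", n_tokens) or ("vision", (frames, rows, cols)).
    """
    ids: List[Tuple[int, int, int]] = []
    start = 0
    for kind, spec in segments:
        if kind == "text":
            # Text: all three components share one 1D position.
            seg = [(start + i, start + i, start + i) for i in range(spec)]
        elif kind == "vision":
            frames, rows, cols = spec
            # Vision: temporal ID per frame, height/width IDs per grid cell.
            seg = [
                (start + t, start + r, start + c)
                for t in range(frames)
                for r in range(rows)
                for c in range(cols)
            ]
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
        ids.extend(seg)
        if seg:
            # The next modality starts one past the largest ID used so far.
            start = max(max(triple) for triple in seg) + 1
    return ids


for triple in mrope_position_ids([("text", 3), ("vision", (1, 2, 2)), ("text", 2)]):
    print(triple)
# (0, 0, 0) (1, 1, 1) (2, 2, 2)             <- text
# (3, 3, 3) (3, 3, 4) (3, 4, 3) (3, 4, 4)   <- 2x2 image, one frame
# (5, 5, 5) (6, 6, 6)                       <- text resumes at max + 1
```

Note that the four image tokens advance the position counter by only two, since IDs grow with the grid's side length rather than its area.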

Bonus: reduced position IDs for extrapolation. Because M-RoPE indexes height and width on separate axes instead of flattening the grid into one long sequence, a 1024x1024 image only needs position IDs up to about 73 (1024/14 ≈ 73) instead of 5,329 (73x73 flattened). Smaller position IDs mean easier extrapolation to longer sequences at inference time.

How does Qwen2-VL process a single image through its video-native ViT?

Chapter 5: Architecture

Qwen2-VL follows the standard VLM pipeline: vision encoder → cross-modal connector → LLM. But each component has specific design choices that make the system work together.

Vision Transformer (675M parameters)

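A standard ViT, but with two critical modifications: 2D-RoPE in place of learned absolute position embeddings (Chapter 3), and a 3D-convolution patch embedding that consumes two frames at a time (Chapter 4).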

Crucially, the same 675M parameter ViT is used across all model sizes (2B, 7B, 72B). This keeps the vision computation constant -- only the LLM scales.

Vision-Language Merger (MLP)

After the ViT, a simple MLP layer compresses every 2x2 block of adjacent visual tokens into a single token. This 4x reduction is essential: without it, a high-resolution image would flood the LLM with thousands of tokens. After compression, special delimiter tokens <|vision_start|> and <|vision_end|> are added.
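The 2x2 merge can be sketched in NumPy: gather each block of four neighbouring token vectors, concatenate them, and pass the result through a small MLP. The two-layer shape, the ReLU activation, the random weights, and the dimensions below are all assumptions for illustration; the text only specifies that an MLP merges 2x2 blocks.

```python
import numpy as np


def merge_2x2(tokens: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Merge a (rows, cols, dim) token grid into (rows/2 * cols/2, out_dim) tokens."""
    rows, cols, dim = tokens.shape
    assert rows % 2 == 0 and cols % 2 == 0, "grid sides must be even"
    # Group each 2x2 neighbourhood, then concatenate its four vectors.
    x = tokens.reshape(rows // 2, 2, cols // 2, 2, dim).transpose(0, 2, 1, 3, 4)
    x = x.reshape(-1, 4 * dim)
    # Two-layer MLP with ReLU (layer count and activation are assumptions).
    hidden = np.maximum(x @ w1, 0.0)
    return hidden @ w2


rng = np.random.default_rng(0)
dim, out_dim = 8, 16                          # toy sizes, not the model's
w1 = rng.normal(size=(4 * dim, 4 * dim))
w2 = rng.normal(size=(4 * dim, out_dim))

grid = rng.normal(size=(4, 6, dim))           # 24 patch tokens
print(merge_2x2(grid, w1, w2).shape)          # (6, 16)
```

Concatenating before projecting lets the MLP see all four neighbours at once, unlike average pooling, which would discard their arrangement.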

Language Model (Qwen2 series)

Three sizes (2B, 7B, and 72B), each initialized from a pretrained Qwen2 LLM.

Token budget example: A 224x224 image has (224/14)^2 = 256 ViT patches. After 2x2 compression: 64 tokens. With the start/end delimiters: 66 tokens total entering the LLM. A 1344x896 image: (96x64)/4 = 1,536 tokens + 2 = 1,538 tokens. The model dynamically allocates its token budget based on the image's resolution.

Why does Qwen2-VL use the same 675M parameter ViT across all model sizes (2B, 7B, 72B)?

Chapter 6: Training

Qwen2-VL follows a three-stage training pipeline, progressively unlocking more parameters and adding more diverse data at each stage.

Stage 1: Pre-training the ViT (600B tokens)

Only the Vision Transformer is trained. The LLM is frozen (initialized from pretrained Qwen2). The ViT is initialized from DFN (Data Filtering Networks), but with the original absolute position embeddings replaced by 2D-RoPE.

Training data: image-text pairs, OCR data, image classification tasks. The goal is to align the ViT's visual representations with the LLM's text space -- teaching the vision encoder to produce features the language model can understand.

Stage 2: Multi-task pre-training (800B tokens)

All parameters are unfrozen -- ViT, merger, and LLM all train together. The data mix expands dramatically:

Total across both pre-training stages: 1.4 trillion tokens (both text and image tokens). Supervision is only on text tokens -- image tokens are inputs, not prediction targets.

Stage 3: Instruction fine-tuning

The ViT is now frozen. Only the LLM is fine-tuned on instruction-following data in ChatML format. The data includes:

The freeze-unfreeze-freeze pattern: Stage 1 freezes the LLM to train alignment. Stage 2 unfreezes everything for deep co-adaptation. Stage 3 freezes the ViT to fine-tune behavior without corrupting learned visual representations. The ordering is deliberate: the intuition is that unfreezing everything too early, or keeping components frozen too long, would hurt performance.
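The schedule can be written down as a small lookup table. This is only a restatement of the three stages as described here, with hypothetical names; where the text does not explicitly address a component (the merger in Stages 1 and 3), the entry is left as `None` rather than guessed.

```python
# True = trainable, False = frozen, None = not addressed explicitly in this text.
TRAINING_STAGES = {
    1: {"vit": True, "merger": None, "llm": False},   # ViT pre-training
    2: {"vit": True, "merger": True, "llm": True},    # multi-task pre-training
    3: {"vit": False, "merger": None, "llm": True},   # instruction fine-tuning
}


def trainable(stage: int) -> list:
    """Names of the components the text explicitly marks as trainable in a stage."""
    return [name for name, flag in TRAINING_STAGES[stage].items() if flag]


for stage in (1, 2, 3):
    print(stage, trainable(stage))
# 1 ['vit']
# 2 ['vit', 'merger', 'llm']
# 3 ['llm']
```

In a training framework, a table like this would typically drive which parameter groups have gradients enabled at the start of each stage.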
In the three-stage training pipeline, which components are trained in Stage 2?

Chapter 7: Results

Qwen2-VL-72B achieves results competitive with GPT-4o and Claude 3.5 Sonnet across a broad range of benchmarks, while the 7B model significantly outperforms most open-source alternatives.

Headline numbers (Qwen2-VL-72B)

Where dynamic resolution helps most

The biggest wins come on benchmarks that demand fine-grained visual detail: document understanding (DocVQA, InfoVQA), OCR tasks, and multilingual text recognition. These are exactly the tasks where fixed-resolution models lose information by downsampling.

The ablation study confirms this: removing dynamic resolution degrades DocVQA performance significantly, while tasks that don't require fine detail (general VQA) are less affected.

Scaling law finding: Performance scales log-linearly with both model size and data size. Going from 2B to 7B to 72B parameters shows consistent gains, and each doubling of training data yields a roughly constant improvement. This suggests the Qwen2-VL architecture hasn't yet hit a scaling ceiling.

On which category of benchmarks does Qwen2-VL show the largest improvements over fixed-resolution models?

Chapter 8: Agent Capabilities

Beyond standard VQA, Qwen2-VL is designed as a visual agent -- a model that can look at a screen, understand the UI, and take actions to accomplish tasks. This is where dynamic resolution becomes essential in practice: real phone and desktop screenshots are high-resolution, with small text and tiny buttons that a 224x224 model could never read.

GUI grounding

The model can identify and localize UI elements on screen. Given a screenshot and an instruction like "Find the search bar," it outputs bounding box coordinates normalized to [0, 1000):

<|object_ref_start|>search bar<|object_ref_end|>
<|box_start|>(245,89),(756,134)<|box_end|>
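Turning this output into pixel coordinates takes a regular expression and a rescale. The sketch below is a hypothetical client-side helper, not part of any Qwen2-VL library. It assumes each box is written as (x1,y1),(x2,y2) with the top-left corner first, and that x is scaled by the screenshot width and y by its height.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]

# Matches one object reference followed by one box, allowing whitespace between them.
GROUNDING_PATTERN = re.compile(
    r"<\|object_ref_start\|>(.*?)<\|object_ref_end\|>\s*"
    r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>"
)


def parse_boxes(text: str, width: int, height: int) -> List[Tuple[str, Box]]:
    """Return (label, (x1, y1, x2, y2)) in pixels for each grounded object."""
    results = []
    for label, x1, y1, x2, y2 in GROUNDING_PATTERN.findall(text):
        box = (
            int(x1) * width / 1000,    # coordinates are normalized to [0, 1000)
            int(y1) * height / 1000,
            int(x2) * width / 1000,
            int(y2) * height / 1000,
        )
        results.append((label, box))
    return results


output = (
    "<|object_ref_start|>search bar<|object_ref_end|>\n"
    "<|box_start|>(245,89),(756,134)<|box_end|>"
)
for label, box in parse_boxes(output, width=1080, height=2400):
    print(label, [round(v, 1) for v in box])
# search bar [264.6, 213.6, 816.5, 321.6]
```

Normalized coordinates keep the output format independent of the input resolution, which matters when every screenshot can arrive at a different size.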

Sequential decision making

Complex tasks are decomposed into multi-step action sequences. The model observes a screenshot, reasons about the next action, executes it (tap, scroll, type), receives a new screenshot, and repeats. This is formulated as a function-calling loop:

  1. Observe the current screenshot
  2. Reason about what action to take next
  3. Output a function call (Tap, Scroll, Type, Home, etc.) with parameters
  4. Receive the result (new screenshot)
  5. Repeat until the task is complete

Why native resolution matters for agents: Mobile phone screenshots are typically 1080x2400 or higher. A "Settings" icon might be 48x48 pixels on a 1080-wide screen -- less than 5% of the width. Downscaled to 224 pixels wide, that icon is about 10 pixels across (and under 5 pixels tall if the 1080x2400 screenshot is also squashed into a 224x224 square), far too small to read. Qwen2-VL processes these screenshots at full resolution, making small UI elements legible and groundable.

The model supports diverse agent tasks: phone operation, web browsing, robotic control, game playing, and navigation. Each task defines a set of permissible actions, and Qwen2-VL chains them through reasoning.
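The observe-reason-act loop with a per-task action set can be sketched as follows. Everything here is a hypothetical harness for illustration: the `Action` type, the callables passed in, the `"Done"` sentinel, and the step limit are assumptions, and `choose_action` stands in for a call to the model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    name: str                 # e.g. "Tap", "Scroll", "Type", "Home", or "Done"
    args: Dict[str, object]


def run_agent(
    task: str,
    get_screenshot: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    allowed: List[str],
    max_steps: int = 20,
) -> List[Action]:
    """Run the loop until the model signals completion or the step limit is hit."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = get_screenshot()                       # observe
        action = choose_action(task, screenshot, history)   # reason, emit a function call
        if action.name == "Done":                           # hypothetical stop signal
            break
        if action.name not in allowed:
            raise ValueError(f"action {action.name!r} is not permitted for this task")
        execute(action)                                     # act; the next pass observes the result
        history.append(action)
    return history


# Usage with a scripted stand-in for the model.
script = iter([
    Action("Tap", {"x": 500, "y": 60}),
    Action("Type", {"text": "weather"}),
    Action("Done", {}),
])
steps = run_agent(
    task="search for the weather",
    get_screenshot=lambda: b"",
    choose_action=lambda task, shot, hist: next(script),
    execute=lambda action: None,
    allowed=["Tap", "Type", "Scroll"],
)
print([a.name for a in steps])   # ['Tap', 'Type']
```

Checking each call against the task's permitted actions before executing it keeps a malformed model output from reaching the device.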

Why is dynamic resolution particularly important for GUI agent tasks?

Chapter 9: Connections

Qwen2-VL sits at a particular point in the evolution of vision-language models. Here's how it relates to the broader landscape:

Direct predecessor: Qwen-VL (2023)

The original Qwen-VL used a fixed 448x448 resolution with learned absolute position embeddings. Qwen2-VL's three main upgrades -- dynamic resolution, 2D-RoPE, and unified image-video processing -- directly address Qwen-VL's limitations.

Competing approaches to dynamic resolution

Position encoding evolution

Closed-source competitors

The bigger picture: Qwen2-VL represents a trend in VLMs toward removing artificial constraints. Fixed resolution was a simplification that limited performance. As the field matures, models are becoming more "native" -- processing inputs in their natural form rather than forcing them into predetermined formats. The same philosophy drives native-resolution processing in NaViT, flexible context lengths in LLMs, and end-to-end multimodal models like Gemini.

How does Qwen2-VL's approach to high-resolution images differ from LLaVA-NeXT's tile-based approach?