Introduction

Consider what happens when you look at a cluttered kitchen counter and someone says "pass me the red mug near the toaster." You do several things simultaneously: you parse the referring expression, identify candidate objects, evaluate spatial relationships ("near the toaster"), disambiguate among red objects, and finally fixate on the correct region of your visual field. This entire process — visual grounding — takes you about 200 milliseconds.

For most of the history of vision-language models, this capability was entirely absent. CLIP (Radford et al., 2021) can tell you whether an image matches a caption, but it cannot tell you where the described object is. GPT-4V can write eloquent descriptions, but until recently could not output a bounding box. The VLM understood the scene globally but was blind locally — it saw the forest, not the trees.

This article covers the technical machinery that gives VLMs spatial competence. We start with the formal grounding task and its benchmarks. We derive the three main approaches to bounding box prediction: discrete tokenization, regression heads, and natural-language coordinates. We examine region-level feature extraction (RoI Align, Grounding DINO, SAM). We analyze why spatial relationship reasoning remains so difficult. And we study the architectures that are beginning to crack these problems: Kosmos-2, Shikra, and Ferret.

ℹ What this article covers

Visual grounding and referring expression comprehension. Three approaches to bounding box prediction in VLMs. Region-level feature extraction with RoI Align, Grounding DINO, and SAM. Spatial relationship understanding and why it is hard. The Kosmos-2 hyperlink format for grounded language. Shikra and Ferret for referential dialogue. Grounded conversation generation. The persistent counting problem. Complete code examples for each technique.

Visual Grounding

Visual grounding is the task of localizing an image region given a natural language description. This is distinct from object detection, which localizes instances of predefined categories ("dog", "car"). Grounding is open-vocabulary and relational: "the man in the blue shirt standing to the left of the woman" requires parsing attributes, spatial relations, and resolving reference.

Referring Expression Comprehension (REC)

The canonical formulation is Referring Expression Comprehension: given an image I and a referring expression e (a natural language phrase), predict the bounding box [x1, y1, x2, y2] that tightly encloses the referred object. The input is the (image, text) pair; the output is four numbers.

The inverse task is Referring Expression Generation (REG): given an image and a bounding box, produce a natural language expression that uniquely identifies the boxed region. REG is harder to evaluate because there are many valid descriptions for any given region, but it is equally important for grounded dialogue systems.

A key subtlety: REC is not just about recognizing objects. It requires disambiguation. If the expression is "the dog," there might be three dogs in the scene. The expression implicitly or explicitly provides enough context to select one. This means the model must understand:

  • Attributes: "the large dog," "the brown dog"
  • Spatial relations: "the dog on the left," "the dog next to the cat"
  • Actions: "the dog catching the frisbee"
  • Ordinal reference: "the second dog from the right"

RefCOCO Benchmarks

The standard benchmarks for REC are the RefCOCO family, all built on top of MS COCO images:

Benchmark | Expressions | Characteristics | Key Challenge
RefCOCO (Yu et al., 2016) | 142,209 | Short, often 2–4 words. Collected in a timed game (ReferItGame). | Spatial relations ("left one," "right side")
RefCOCO+ | 141,564 | Spatial words forbidden during collection. | Appearance-only disambiguation
RefCOCOg (Mao et al., 2016) | 95,010 | Longer expressions, 8+ words average. Written, not timed. | Complex multi-attribute descriptions

The evaluation metric is Accuracy@0.5: the predicted box is correct if its IoU (Intersection over Union) with the ground truth box exceeds 0.5. This is strict — the model must get both the location and the size roughly right. Some recent work also reports Accuracy@0.75 and mIoU (mean IoU).
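The metric is straightforward to compute from box coordinates. A minimal sketch, assuming boxes in [x1, y1, x2, y2] format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def accuracy_at_iou(preds, gts, thresh=0.5):
    """Accuracy@thresh: fraction of predictions with IoU above thresh."""
    hits = sum(iou(p, g) > thresh for p, g in zip(preds, gts))
    return hits / len(preds)

# A prediction shifted by 20 px against a 100x100 ground-truth box
# already fails the 0.5 threshold:
gt = [100, 100, 200, 200]
pred = [120, 120, 220, 220]
print(f"IoU = {iou(pred, gt):.3f}")  # -> IoU = 0.471
```

Note how strict the threshold is in practice: a 20-pixel shift on a 100-pixel box is already a miss.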

⚠ RefCOCO is saturating

State-of-the-art models now exceed 90% accuracy on RefCOCO val. The benchmark's weakness is that expressions are short and mostly distinguish objects by a single attribute or spatial cue. Real-world grounding requires handling long, compositional, and potentially ambiguous descriptions. Newer benchmarks such as GRIT (Gupta et al., 2022) are gaining traction for evaluating more realistic grounding, and phrase-grounding datasets like Flickr30k Entities test grounding of many entities per caption.

[Interactive demo: Visual Grounding — Text to Bounding Box. Select a referring expression to see the model ground it to a bounding box in the scene; the scene contains multiple objects, and the model must disambiguate using attributes and spatial cues.]

Bounding Box Prediction

The core technical question: how does a VLM output four numbers (a bounding box)? Language models produce tokens. Vision models produce feature maps. Neither naturally produces coordinates. Three fundamentally different approaches have emerged, each with distinct tradeoffs.

Approach (a): Discrete token prediction

The most elegant approach treats coordinates as tokens in a vocabulary. Take the continuous bounding box [x1, y1, x2, y2], normalize each coordinate to [0, 1] (divide by image width or height), then quantize to integers in [0, N−1] where N is the number of spatial bins (typically N = 1000).

This transforms a coordinate regression problem into a classification problem. Each coordinate becomes a token from a vocabulary of 1000 location tokens, which are appended to the language model's existing vocabulary. A bounding box becomes a sequence of four tokens: <loc_342> <loc_156> <loc_587> <loc_423>.

Pix2Seq (Chen et al., 2022) pioneered this approach for object detection, casting it entirely as sequence generation. The model takes an image and autoregressively generates the detected objects as a token sequence: [ymin, xmin, ymax, xmax, class] for each object. The training objective is standard cross-entropy on the token sequence.

Kosmos-2 (Peng et al., 2023) extended this to grounded VLMs. Bounding boxes are embedded directly in the text output using special tokens: <box><loc_x1><loc_y1><loc_x2><loc_y2></box>. The model can interleave natural language with spatial references seamlessly.

The key insight: quantization to N = 1000 bins gives sub-pixel precision for typical image sizes. A 1000×1000 image has 1-pixel resolution; a 224×224 image has ~0.224-pixel resolution per bin. The quantization error is negligible compared to annotation noise in the training data.

ℹ Why 1000 bins?

The choice of N = 1000 is a practical sweet spot. Fewer bins (e.g., 100) cause visible quantization artifacts — boxes snap to a coarse grid. More bins (e.g., 10000) bloat the vocabulary without improving accuracy, since annotation noise in datasets like COCO is already ~2–5 pixels. At 1000 bins, the added vocabulary is manageable (1000 special tokens vs. 32K+ language tokens), the quantization error is below annotation noise, and the model can leverage standard categorical cross-entropy training — no need for regression losses.

Approach (b): Regression heads

The classical approach: add an MLP head on top of the model's hidden representation that directly regresses four continuous values. This is how Faster R-CNN (Ren et al., 2015) works, and it carries over to VLMs that bolt a detection head onto a frozen language model.

Concretely, given the final hidden state h of the [CLS] token or a pooled representation, the regression head computes:

# Regression head for bounding box
bbox = MLP(h)  # h: (batch, dim) -> bbox: (batch, 4)
# bbox = [x1, y1, x2, y2] normalized to [0, 1]

The training loss is typically smooth-L1 (Huber loss) or L1 loss on the predicted coordinates, often combined with a GIoU (Generalized Intersection over Union) loss:

L = Lsmooth-L1(b̂, b) + λ · LGIoU(b̂, b)

where b̂ is the predicted box and b the ground truth.

GIoU loss is critical because L1 loss alone does not capture the geometric quality of the prediction. Two predictions can have the same L1 error but vastly different overlap with the ground truth.
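A minimal GIoU loss in PyTorch, assuming batched boxes in [x1, y1, x2, y2] format — a sketch of the standard formulation, not any particular model's exact loss weighting:

```python
import torch

def giou_loss(pred, target):
    """GIoU loss for boxes of shape (N, 4) in [x1, y1, x2, y2] format.

    GIoU = IoU - (|C| - |A union B|) / |C|, where C is the smallest
    box enclosing both A and B. Loss = 1 - GIoU, ranging over [0, 2].
    """
    # Intersection
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union

    # Smallest enclosing box C
    cx1 = torch.min(pred[:, 0], target[:, 0])
    cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2])
    cy2 = torch.max(pred[:, 3], target[:, 3])
    area_c = (cx2 - cx1) * (cy2 - cy1)

    giou = iou - (area_c - union) / area_c
    return (1 - giou).mean()

# Two disjoint predictions with identical size and zero IoU get
# different losses depending on how far from the target they are:
target = torch.tensor([[0.2, 0.2, 0.4, 0.4]])
near = torch.tensor([[0.45, 0.2, 0.65, 0.4]])  # disjoint, adjacent
far = torch.tensor([[0.80, 0.2, 1.00, 0.4]])   # disjoint, distant
print(giou_loss(near, target), giou_loss(far, target))  # far > near
```

This is exactly the property L1 alone lacks: once boxes stop overlapping, plain IoU saturates at zero, while the enclosing-box term keeps providing a distance-sensitive gradient.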

The disadvantage of regression heads: they add architectural complexity, cannot be trained with standard language modeling objectives, and break the unified "everything is a token" paradigm that modern VLMs are moving toward.

Approach (c): Natural language coordinates

The simplest approach: just output coordinates as plain text. The model generates a string like <box>342, 156, 587, 423</box> using standard digit tokens. No special vocabulary, no regression head — just language.

This works surprisingly well. Qwen-VL uses this approach with integer coordinates in the [0, 1000] range; Shikra uses decimal coordinates in [0, 1]. The model simply needs to learn that digits in this context represent spatial locations rather than counts or dates.

The subtle cost: digits are multi-token in most tokenizers. The number "587" might be tokenized as ["5", "87"] or ["587"], depending on the tokenizer. This means the model must learn to compose multi-digit numbers correctly, which requires implicit arithmetic. In practice, models handle this well with sufficient training data, but it can cause occasional off-by-one or digit-transposition errors at inference time.
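On the consumption side, plain-text coordinates must be parsed back out of generated text. A minimal sketch, assuming a Qwen-VL-style <box>x1, y1, x2, y2</box> string with integers in [0, 1000]; the clamping and re-ordering guard against exactly the digit errors described above:

```python
import re

def parse_text_box(text, img_w, img_h, scale=1000):
    """Parse the first <box>x1, y1, x2, y2</box> span from model output.

    Coordinates are assumed to be integers in [0, scale]. Returns pixel
    coordinates, or None if no well-formed box is found.
    """
    m = re.search(r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>",
                  text)
    if m is None:
        return None
    x1, y1, x2, y2 = (min(int(g), scale) for g in m.groups())
    # Re-order if the model emitted a flipped box (a common failure mode)
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))
    return [x1 * img_w / scale, y1 * img_h / scale,
            x2 * img_w / scale, y2 * img_h / scale]

out = "The red mug is at <box>342, 156, 587, 423</box>, near the toaster."
print(parse_text_box(out, img_w=640, img_h=480))
# -> [218.88, 74.88, 375.68, 203.04]
```

Defensive parsing like this matters precisely because digit-level generation occasionally produces malformed or transposed coordinates.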

Approach A

Discrete Tokens

Special <loc_k> tokens in vocabulary. One token per coordinate. Exact quantization. Used by: Pix2Seq, Kosmos-2, OFA.

Approach B

Regression Head

MLP on pooled features. Continuous output. Requires extra loss. Used by: MDETR, OWL-ViT, Grounding DINO.

Approach C

Natural Language

Digit tokens in text. No special vocabulary. Multi-token numbers. Used by: Shikra, Qwen-VL, Ferret.

[Interactive demo: Coordinate Tokenization — Continuous Box to Discrete Tokens. Drag the bounding box corners to see how continuous coordinates are quantized to discrete tokens in the [0, 1000] range, then decoded back; the reconstruction error is shown.]

Region-Level Features

Grounding requires more than global image features — the model must extract representations for specific regions of the image. This is the domain of region-level feature extraction, a line of work stretching from R-CNN (Girshick et al., 2014) to modern open-vocabulary detectors.

RoI Pooling and RoI Align

Region of Interest (RoI) Pooling (Girshick, 2015) solves a fundamental problem: given a feature map of fixed spatial resolution and a bounding box of arbitrary size and position, extract a fixed-size feature vector for that box. The mechanism:

  1. Project the bounding box onto the feature map (divide coordinates by the spatial stride of the CNN backbone, typically 16 or 32).
  2. Divide the projected box into a fixed grid (e.g., 7×7).
  3. Max-pool within each grid cell to produce a fixed-size output (e.g., 7×7×C).

The problem: step 1 involves rounding fractional coordinates to integer positions on the feature map. This quantization causes misalignment between the box and the actual features. For small objects, being off by one feature map cell means being off by 16–32 pixels in the original image.

RoI Align (He et al., 2017, from Mask R-CNN) fixes this with bilinear interpolation. Instead of snapping to integer coordinates, it samples feature values at exact fractional positions using bilinear interpolation of the four nearest feature map cells. This eliminates quantization error entirely and improved mask prediction accuracy by ~10–15% relative.

Mathematically, for a sampling point at fractional position (x, y) on the feature map with values f:

f(x, y) = ∑ij fij · max(0, 1 − |x − xi|) · max(0, 1 − |y − yj|)

where (xi, yj) ranges over the integer grid positions of the feature map — the weights are nonzero only for the four cells surrounding (x, y).

This is differentiable everywhere, so gradients flow cleanly through the spatial sampling operation. RoI Align remains the standard feature extraction mechanism in modern detection systems.
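The sampling operation can be sketched in a few lines of NumPy. This is an illustrative single-channel version; real implementations sample several points per bin and average, and operate on batched multi-channel tensors:

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Sample feature map fmap (H, W) at fractional position (x, y).

    Weighted average of the four nearest cells:
    f(x, y) = sum_ij f_ij * max(0, 1-|x-x_i|) * max(0, 1-|y-y_j|)
    """
    H, W = fmap.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    val = 0.0
    for j in (y0, y0 + 1):
        for i in (x0, x0 + 1):
            if 0 <= i < W and 0 <= j < H:
                wx = max(0.0, 1.0 - abs(x - i))
                wy = max(0.0, 1.0 - abs(y - j))
                val += fmap[j, i] * wx * wy
    return val

def roi_align(fmap, box, out_size=2):
    """Minimal RoI Align: one sample at the center of each output bin."""
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = np.zeros((out_size, out_size))
    for r in range(out_size):
        for c in range(out_size):
            # Sample at the bin center -- no coordinate rounding anywhere
            out[r, c] = bilinear_sample(fmap,
                                        x1 + (c + 0.5) * bin_w,
                                        y1 + (r + 0.5) * bin_h)
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(fmap, 1.5, 1.5))  # -> 7.5 (average of 5, 6, 9, 10)
print(roi_align(fmap, [0.7, 0.7, 2.9, 2.9]))
```

The key point is visible in the code: box coordinates enter the computation as floats and are never rounded, which is what RoI Pooling gets wrong.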

Grounding DINO

Grounding DINO (Liu et al., 2023) represents the current state of the art in open-set object detection and phrase grounding. It combines three powerful components:

  1. DINO detector backbone: The DETR-family detector (DEtection TRansformer) with deformable attention, contrastive denoising, and mixed query selection. DINO achieves state-of-the-art closed-set detection.
  2. Language encoder: A BERT-based text encoder that processes the input text (category names, referring expressions, or arbitrary phrases).
  3. Cross-modality fusion: A feature enhancer that performs bidirectional cross-attention between visual and language features at multiple scales, enabling the detector's queries to attend to the relevant language tokens.

The architecture's key innovation is the language-guided query selection: instead of using learnable static queries (as in vanilla DETR), Grounding DINO selects query positions based on the alignment between visual features and the input text. This means the detector actively searches for regions that match the text description.

Grounding DINO achieves 52.5 AP on the COCO detection benchmark in a zero-shot setting (trained on O365, GoldG, Cap4M, without seeing COCO training images) and 63.0 AP on Flickr30k Entities for phrase grounding. It is the backbone of many grounded VLM pipelines.

Segment Anything (SAM)

SAM (Kirillov et al., 2023) takes grounding one step further: instead of bounding boxes, it produces pixel-precise segmentation masks. SAM was trained on SA-1B, a dataset of over 1 billion masks on 11 million images, making it the largest segmentation dataset ever created.

SAM's architecture has three components: (1) a ViT-H image encoder that produces image embeddings, (2) a prompt encoder that accepts points, boxes, or text as prompts, and (3) a lightweight mask decoder that combines the image and prompt embeddings to predict masks.

The combination of Grounding DINO + SAM creates a powerful grounding pipeline: text → Grounding DINO → bounding boxes → SAM → pixel-precise masks. This pipeline, sometimes called "Grounded SAM," enables open-vocabulary segmentation from natural language descriptions — you describe what you want, and the system segments it at pixel precision.

ℹ Why not just predict masks directly?

Predicting masks from VLMs is fundamentally harder than predicting bounding boxes. A mask requires per-pixel binary classification — for a 224×224 image, that's 50,176 binary decisions. A bounding box is just 4 numbers. The two-stage pipeline (VLM → boxes → SAM → masks) decomposes the problem: the VLM handles semantic understanding and localization; SAM handles geometric precision. This division of labor works better than end-to-end mask generation in practice.

Spatial Relationship Understanding

Spatial reasoning — understanding the geometric relationships between objects in a scene — is one of the most persistent weaknesses of current VLMs. Humans effortlessly perceive that the coffee is on the table, the lamp is next to the bed, and the bird is above the tree. VLMs struggle with all of these.

Spatial predicates

Spatial relationships can be categorized by type:

Category | Predicates | Geometric Basis
Projective | above, below, left of, right of | Relative position of bounding box centers along x/y axes in the image plane
Topological | inside, outside, overlapping, touching | Containment and intersection of bounding boxes or masks
Proximity | near, far from, next to, between | Euclidean distance between object centers or boundaries
Orientation | facing, behind, in front of | 3D reasoning required — cannot be determined from 2D coordinates alone

The projective and topological predicates are the easiest because they can be verified from 2D bounding boxes alone. "A is above B" corresponds to the center y-coordinate of A being less than that of B (in image coordinates, where y=0 is the top). "A is inside B" corresponds to A's box being contained within B's box.

The proximity and orientation predicates are harder. "Near" is ambiguous — near relative to what scale? "Behind" requires understanding depth and 3D layout from a 2D image, which is fundamentally ill-posed without additional assumptions.
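The 2D-verifiable predicates can be computed directly from box coordinates. A minimal sketch — the frac threshold for "near" is an arbitrary choice, which is precisely the ambiguity noted above:

```python
def center(box):
    """Center point of a [x1, y1, x2, y2] box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def left_of(a, b):
    """Projective: center of a is left of center of b."""
    return center(a)[0] < center(b)[0]

def above(a, b):
    """Projective: y grows downward in images, so 'above' = smaller y."""
    return center(a)[1] < center(b)[1]

def inside(a, b):
    """Topological: box a fully contained in box b."""
    return a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]

def near(a, b, img_diag, frac=0.2):
    """Proximity: center distance under a fraction of the image diagonal.

    The threshold is the hard part -- 'near' has no canonical scale,
    so frac=0.2 is an arbitrary choice.
    """
    (ax, ay), (bx, by) = center(a), center(b)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 < frac * img_diag

dog = [50, 200, 150, 300]
cat = [400, 210, 500, 310]
print(left_of(dog, cat))                  # True
print(above(dog, cat))                    # True (center y: 250 < 260)
print(inside([60, 210, 140, 290], dog))   # True
print(near(dog, cat, img_diag=800))       # False: centers ~350 px apart
```

Note there is no behind() or in_front_of() here — those predicates cannot be written as functions of 2D boxes at all.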

The VSR Benchmark

Visual Spatial Reasoning (VSR) (Liu et al., 2023) is a benchmark designed specifically to test VLMs on spatial understanding. It presents image-text pairs where the text describes a spatial relationship, and the model must judge whether the relationship holds (true/false classification).

Examples from VSR:

  • "The dog is to the left of the cat." — requires comparing x-coordinates
  • "The person is between the two trees." — requires comparing relative positions of three objects
  • "The car is in front of the building." — requires depth reasoning

Human accuracy on VSR is ~95%. CLIP-based models score ~55–60%, barely above chance. Even large VLMs like LLaVA-1.5 and InstructBLIP score only 60–65%. This is one of the clearest gaps between human and machine visual understanding.

Why CLIP fails at spatial relations

The failure mode is well-understood and deeply rooted in how CLIP was trained. CLIP's contrastive objective matches whole images to whole captions. It learns a joint embedding space where images and their captions are close, and non-matching pairs are far apart. But this objective is invariant to word order and spatial structure within the caption.

The classic demonstration: CLIP gives nearly identical similarity scores to "a dog on top of a horse" and "a horse on top of a dog." Both captions contain the same words, and CLIP's bag-of-words tendency means the order barely matters to the learned embedding. The text encoder has positional encodings and self-attention — in principle, it can distinguish word order. In practice, the contrastive objective does not provide enough gradient signal to learn this distinction, because alt-text captions on the internet rarely hinge on spatial word order for image-text matching.

This is not a model size problem — scaling CLIP to larger architectures does not fix spatial reasoning. It is a training data and objective problem. The solution requires either:

  • Hard negative mining: Explicitly creating negative pairs that differ only in spatial relations (Yuksekgonul et al., 2023, ARO benchmark).
  • Layout-aware pretraining: Including bounding box annotations during pretraining so the model must learn to associate text tokens with spatial regions.
  • Grounded instruction tuning: Fine-tuning with instructions that explicitly require spatial reasoning ("Which object is to the left?").

[Interactive demo: Spatial Relationships — Predicates on Object Positions. Click to place objects on the grid, then see which spatial predicates hold between them, computed from bounding box positions; notice how ambiguous some relationships become.]

Kosmos-2

Kosmos-2 (Peng et al., 2023) from Microsoft Research introduced one of the most elegant solutions to grounded language generation: the hyperlink format. Instead of treating grounding as a separate task, Kosmos-2 embeds spatial references directly into the text stream, much like hyperlinks in HTML.

The format uses special tokens to wrap entity mentions with their locations:

# Kosmos-2 hyperlink format
# Entity mention wrapped with phrase tags, linked to box coordinates
"<p>a red cup</p><box><loc_342><loc_156><loc_587><loc_423></box> is on the table"

# Multiple grounded entities in one sentence
"<p>The woman</p><box><loc_100><loc_50><loc_300><loc_450></box> is holding
 <p>a book</p><box><loc_250><loc_200><loc_380><loc_350></box>"

The <p>...</p> tags delimit the referring phrase. The <box>...</box> tags contain four location tokens from a vocabulary of 1000 (so <loc_0> through <loc_999>). Coordinates are normalized: (x1, y1) is the top-left corner and (x2, y2) is the bottom-right, each in [0, 999].

Training data: GrIT. Kosmos-2 was trained on the Grounded Image-Text (GrIT) dataset, constructed by mining entity-bounding box pairs from existing datasets and web-crawled image-text pairs. The pipeline:

  1. Start with image-caption pairs from web crawl data.
  2. Run a noun phrase extractor on each caption to identify entity mentions.
  3. Run an open-vocabulary detector (e.g., GLIP) to locate each noun phrase in the image.
  4. Filter by detection confidence to remove noisy annotations.
  5. Format the caption in the hyperlink format with the detected boxes.

This produced approximately 91 million grounded image-text pairs from COYO-700M and LAION-2B. The key insight is that grounding data does not need manual annotation — it can be bootstrapped from existing detectors, creating a virtuous cycle where better detectors enable better grounding data, which trains better grounded models.

Grounded VQA. Kosmos-2 can answer questions while pointing to the relevant regions:

# Grounded VQA: question -> answer with spatial references
Question: "What is the person on the right doing?"
Answer: "<p>The person on the right</p><box><loc_600><loc_50><loc_900><loc_500></box>
         is reading <p>a newspaper</p><box><loc_650><loc_200><loc_850><loc_400></box>"

This is a qualitatively different capability from standard VQA. The model does not just answer — it shows its work by pointing to the evidence in the image. This makes the model's reasoning auditable and its errors diagnosable.

Shikra and Ferret

Beijing Institute of Technology, 2023

Shikra

Referential dialogue model. Shikra extends a standard VLM (Vicuna backbone + CLIP-ViT encoder) with the ability to both input and output spatial coordinates as natural-language numbers. The user can point to a region (by giving coordinates) and ask about it, or ask the model to locate something.

Coordinates are expressed as decimal numbers in [0, 1], represented as text tokens: [0.342, 0.156, 0.587, 0.423]. This avoids adding any special tokens to the vocabulary — the model uses its existing digit tokens. Shikra supports:

  • REC: "Where is the red cup?" → [x1, y1, x2, y2]
  • REG: Given [x1, y1, x2, y2], "What is this?" → "a red cup"
  • Pointwise QA: "What is at [0.5, 0.3]?" → "a clock"
  • Grounded captioning: Describe with box references
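Shikra's plain-text coordinate convention is simple enough to sketch directly. This assumes the 3-decimal, [0, 1] format described above; the parsing regex is an illustrative reconstruction, not Shikra's actual code:

```python
import re

def box_to_shikra(bbox, img_w, img_h):
    """Format a pixel box as Shikra-style text: decimals in [0, 1]."""
    x1, y1, x2, y2 = bbox
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return "[" + ", ".join(f"{c:.3f}" for c in norm) + "]"

def shikra_to_box(text, img_w, img_h):
    """Parse the first [x1, y1, x2, y2] decimal list back to pixels."""
    m = re.search(r"\[\s*([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\s*\]",
                  text)
    if m is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in m.groups())
    return [x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h]

s = box_to_shikra([219, 75, 368, 203], 640, 480)
print(s)  # -> [0.342, 0.156, 0.575, 0.423]
print(shikra_to_box(f"The mug is at {s}.", 640, 480))
```

The round trip loses up to half a thousandth of the image dimension per coordinate — the same sub-pixel quantization budget as the discrete-token approach.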
Apple, 2023

Ferret

Refer and ground anything, any shape. Ferret's key innovation is any-shape referring: instead of limiting spatial input to rectangular bounding boxes, Ferret accepts points, boxes, scribbles, and free-form regions as input. The user can circle an object, draw an arrow, or click a point.

The architecture uses a spatial-aware visual sampler that extracts features from arbitrary-shaped regions. Given a set of points defining a region (a polygon, scribble, or bounding box), Ferret:

  • Samples visual features at the specified points using bilinear interpolation
  • Aggregates them via a learned pooling operation
  • Injects the pooled region feature into the language model as a special token
  • This gives region-level understanding beyond rectangular boxes

Both Shikra and Ferret demonstrate that grounding can be achieved without special architectural changes — the key ingredients are (1) training data with spatial annotations and (2) a consistent format for encoding coordinates. But they also reveal an important asymmetry: inputting spatial references (pointing to regions and asking about them) is easier than outputting them (describing a scene with grounded references). The input case has a clear target — there is a specific region to attend to. The output case requires the model to decide autonomously which entities are worth grounding and where they are.

Ferret-v2 (Zhang et al., 2024) extended the original Ferret with higher-resolution processing: it uses a DINOv2 encoder at higher resolution in addition to the CLIP encoder, and fuses features at multiple scales. This improved fine-grained grounding, especially for small objects and objects with complex shapes.

Grounded Conversation Generation

Standard VLM conversation is "flat" — the model describes what it sees in text, but the text floats free of the image. Grounded conversation anchors every entity mention to a specific image region, creating a structured link between language and vision.

The challenge is generating training data. Human annotators writing grounded conversations are expensive and slow. The practical solution is GPT-4V-assisted annotation:

  1. Start with an image and its existing detection annotations (bounding boxes + labels).
  2. Feed the image and annotations to GPT-4V with a prompt: "Generate a detailed conversation about this image. When you mention an object, include its bounding box coordinates in the format [x1, y1, x2, y2]."
  3. GPT-4V generates rich, natural conversations with spatial references.
  4. Post-process to verify that the referenced coordinates actually correspond to the correct objects (filter hallucinated coordinates).

This bootstrap-and-verify pipeline is how most grounded instruction-tuning datasets are created. Examples include LLaVA-Grounding (Zhang et al., 2023) and Shikra's GPT-4-generated referential dialogue data.

Grounded captioning vs. standard captioning. The difference is not just the presence of coordinates — it changes the nature of what the model describes. Standard captioning optimizes for fluency and global accuracy: "A man walks his dog in a park." Grounded captioning forces specificity: "A man [120, 80, 340, 450] walks a golden retriever [360, 200, 500, 420] along a gravel path [0, 350, 700, 500] in a park with oak trees [500, 0, 700, 300]." The model must commit to precise locations for every entity it mentions, which makes hallucination much more detectable — if the model says there is a dog at [360, 200, 500, 420] but that region contains a bench, the error is immediately verifiable.

ℹ Grounding as hallucination detection

Grounding turns hallucination from an open problem into a verification problem. When a VLM says "there is a cat on the table," you cannot easily verify this claim against the image (you would need another VLM or a human). When a grounded VLM says "there is a cat [200, 150, 400, 350] on the table [50, 300, 650, 500]," you can check: (1) does the region [200, 150, 400, 350] contain a cat? (2) is that region geometrically above the table region? Both checks are mechanistic. This is why grounding is not just a feature — it is an accountability mechanism.
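The verification step can be sketched mechanically, assuming access to a trusted detector's output (the detections list here is hypothetical):

```python
def iou(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def verify_claim(label, claimed_box, detections, iou_thresh=0.5):
    """Check a grounded claim against detector output.

    detections: list of (label, box) pairs from a trusted detector.
    True if some same-label detection overlaps the claimed box.
    """
    return any(det_label == label and iou(claimed_box, det_box) > iou_thresh
               for det_label, det_box in detections)

# Hypothetical detector output for the cat-on-table example:
detections = [("cat", [210, 160, 390, 340]), ("table", [50, 300, 650, 500])]
print(verify_claim("cat", [200, 150, 400, 350], detections))  # True
print(verify_claim("dog", [360, 200, 500, 420], detections))  # False

# The geometric half of the check: is the cat region above the table?
cat_box, table_box = detections[0][1], detections[1][1]
print((cat_box[1] + cat_box[3]) / 2 < (table_box[1] + table_box[3]) / 2)  # True
```

Both halves of the check are pure arithmetic on coordinates — no second model in the loop.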

Counting and Numerical Reasoning

Ask a VLM "how many birds are in this image?" and you will get an answer. It will often be wrong. Counting is one of the most embarrassingly persistent failures of modern VLMs. Models that can write poetry about a scene, identify rare species, and answer complex reasoning questions will confidently state there are 4 birds when there are 7.

Why VLMs cannot count reliably. The root cause is architectural. The standard VLM pipeline (image encoder → projection → language model) processes images at relatively low resolution (224×224 or 336×336) and represents the entire image as a fixed-size set of tokens (e.g., 576 tokens for a 24×24 patch grid). This representation is excellent for what and where questions but terrible for how many questions. Here is why:

  • Fixed resolution squashes count information. Ten birds in a 224×224 image might each occupy only ~20 pixels. After patch embedding (a 14×14 grid of 16-pixel patches), each bird might overlap with just 1–2 patches. The representation conflates multiple birds into the same patch tokens.
  • Global pooling destroys count. If the model uses a [CLS] token or global average pooling, count information is explicitly destroyed. The average of features from 5 birds is indistinguishable from the average of features from 8 birds.
  • Language model priors. Language models have priors about typical counts. "Birds on a wire" triggers a prior for "a few" (3–5). The model might output a plausible count rather than actually counting.
  • Training data rarely requires exact counting. Captioning datasets use vague quantifiers ("several," "many," "a group of"). Exact counts are rare in VQA training data. The model is not trained to count precisely because the data does not demand it.
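The pooling argument can be made concrete with a toy example. Assuming (unrealistically) that every bird patch produces the identical feature vector, global average pooling over 5 birds and over 8 birds yields vectors that differ only in magnitude, not direction — and downstream similarity is typically cosine, which ignores magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
bird_feature = rng.normal(size=16)   # feature of a single "bird" patch
background = np.zeros(16)            # empty patches

def pooled_image(num_birds, num_patches=576):
    """Global average pool over a patch grid with num_birds bird patches."""
    patches = [bird_feature] * num_birds \
        + [background] * (num_patches - num_birds)
    return np.mean(patches, axis=0)

five = pooled_image(5)
eight = pooled_image(8)
# The pooled vectors differ only in scale, not direction:
cos = five @ eight / (np.linalg.norm(five) * np.linalg.norm(eight))
print(f"cosine(5 birds, 8 birds) = {cos:.4f}")  # -> 1.0000
```

Real patch features are not identical, so the collapse is not this total — but the directional information that survives pooling carries far more "what" than "how many."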

Approaches to fixing counting:

Approach 1

Explicit Counting Modules

Add a dedicated counting head that detects and counts object instances separately from the language pipeline. CountGD (Jiang et al., 2024) combines Grounding DINO with a counting-specific loss. The detector identifies all instances; the count is the number of detections.

Approach 2

Density Maps

Predict a density map where the integral over any region gives the count. CounTR (Liu et al., 2022) uses a transformer to predict density maps from few-shot exemplars. The model learns a continuous density field rather than detecting discrete instances.

Approach 3

Set Prediction

Predict a set of bounding boxes (one per instance) and count the set. This is what DETR-family models do naturally. The count is a byproduct of detection. Works well for countable objects but requires the full detection pipeline.
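The density-map idea (Approach 2) is easy to illustrate: place a unit-mass blob at each object location, and the integral over any region estimates the count in that region. A toy sketch with hand-picked object positions:

```python
import numpy as np

def density_map(points, shape=(64, 64), sigma=2.0):
    """Place a unit-mass Gaussian blob at each (x, y) object location.

    The sum over any region of the map then estimates the object count
    in that region -- the core idea behind density-based counting.
    """
    H, W = shape
    yy, xx = np.mgrid[0:H, 0:W]
    dmap = np.zeros(shape)
    for (x, y) in points:
        blob = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
        dmap += blob / blob.sum()   # each object contributes exactly 1
    return dmap

birds = [(10, 10), (12, 30), (40, 25), (50, 50), (55, 12), (30, 48), (20, 20)]
dmap = density_map(birds)
print(f"Total count: {dmap.sum():.2f}")          # -> Total count: 7.00
print(f"Count in left half: {dmap[:, :32].sum():.2f}")
```

The attraction is that the model never has to individuate instances — it regresses a continuous field, which degrades gracefully under occlusion and crowding where discrete detection fails.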

The honest assessment: counting remains an unsolved problem for general-purpose VLMs. Specialized counting models (CounTR, CountGD) work well for specific domains (crowd counting, cell counting) but do not generalize. The most promising direction is higher-resolution processing: models like Monkey (Li et al., 2024) and InternVL-1.5 process images at up to 1344×1344, which gives the model enough spatial resolution to distinguish individual instances. But even these models degrade beyond ~15–20 objects.

Code Examples

Let's implement the key techniques from this article. These examples are designed to be runnable and modifiable — change the inputs, inspect the intermediates, build intuition.

Bounding box tokenization

python
import torch
import numpy as np

def normalize_bbox(bbox, img_w, img_h):
    """Normalize bounding box to [0, 1] range.

    Args:
        bbox: [x1, y1, x2, y2] in pixel coordinates
        img_w, img_h: image dimensions
    Returns:
        normalized bbox in [0, 1]
    """
    x1, y1, x2, y2 = bbox
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]

def quantize_bbox(bbox_norm, num_bins=1000):
    """Quantize normalized bbox to discrete tokens.

    Maps [0, 1] -> [0, num_bins - 1] integers.
    Each coordinate becomes a single token ID.
    """
    tokens = [int(round(c * (num_bins - 1))) for c in bbox_norm]
    tokens = [max(0, min(num_bins - 1, t)) for t in tokens]
    return tokens

def dequantize_bbox(tokens, num_bins=1000):
    """Decode discrete tokens back to normalized coordinates."""
    return [t / (num_bins - 1) for t in tokens]

def bbox_to_token_string(bbox, img_w, img_h, num_bins=1000):
    """Full pipeline: pixel coords -> Kosmos-2 style token string."""
    norm = normalize_bbox(bbox, img_w, img_h)
    tokens = quantize_bbox(norm, num_bins)
    return f"<box><loc_{tokens[0]}><loc_{tokens[1]}><loc_{tokens[2]}><loc_{tokens[3]}></box>"

# Example: image is 640x480, object at pixel coords [150, 100, 400, 350]
bbox = [150, 100, 400, 350]
img_w, img_h = 640, 480

# Step 1: Normalize
norm = normalize_bbox(bbox, img_w, img_h)
print(f"Normalized: {[f'{c:.4f}' for c in norm]}")
# -> ['0.2344', '0.2083', '0.6250', '0.7292']

# Step 2: Quantize to 1000 bins
tokens = quantize_bbox(norm, num_bins=1000)
print(f"Tokens: {tokens}")
# -> [234, 208, 624, 728]

# Step 3: Decode back
decoded = dequantize_bbox(tokens, num_bins=1000)
print(f"Decoded: {[f'{c:.4f}' for c in decoded]}")
# -> ['0.2342', '0.2082', '0.6246', '0.7287']

# Quantization error
error = [abs(n - d) * max(img_w, img_h) for n, d in zip(norm, decoded)]
print(f"Max pixel error: {max(error):.2f} px")
# -> Max pixel error: ~0.4 px (sub-pixel precision!)

# Compare with fewer bins
for bins in [32, 100, 500, 1000]:
    tok = quantize_bbox(norm, bins)
    dec = dequantize_bbox(tok, bins)
    err = max(abs(n - d) * s
              for n, d, s in zip(norm, dec, [img_w, img_h, img_w, img_h]))
    print(f"  {bins:4d} bins: tokens={tok}, max error={err:.2f} px")
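The bin sweep above has a simple closed form: with round-to-nearest quantization, a coordinate lands within half a bin width of its true value, so the worst-case pixel error is `dim / (2 * (num_bins - 1))`. A minimal sketch (the helper name `max_quantization_error_px` is illustrative, not from any library):

```python
def max_quantization_error_px(dim, num_bins):
    """Worst-case pixel error for round-to-nearest coordinate quantization.

    Grid spacing in normalized coords is 1 / (num_bins - 1); rounding
    lands within half a spacing of the true value, scaled back by dim.
    """
    return dim / (2 * (num_bins - 1))

for bins in [32, 100, 500, 1000]:
    # Bound for the longer image side (640 px in the example above)
    print(f"{bins:4d} bins: worst case {max_quantization_error_px(640, bins):.2f} px")
# -> 32 bins: 10.32 px ... 1000 bins: 0.32 px
```

The empirical errors measured above (0.24 px at 1000 bins) sit just under this 0.32 px bound, as expected.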

Grounding inference with Kosmos-2

python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch
import re

# Load Kosmos-2
model_id = "microsoft/kosmos-2-patch14-224"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("scene.jpg")

# Task 1: Grounded captioning
# The model generates a caption with hyperlinked entity references
prompt = "<grounding> An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        num_beams=3,
    )

# Decode the output
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(f"Raw output:\n{text}")
# Example: "<grounding> An image of <p>a woman</p><box><loc_102>...</box>
#           sitting on <p>a bench</p><box><loc_200>...</box>"

# Task 2: Referring Expression Comprehension
# Input a phrase, get back coordinates
prompt = "<grounding><p> the red cup on the left </p>"
inputs = processor(text=prompt, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

text = processor.batch_decode(outputs, skip_special_tokens=False)[0]

# Parse the bounding box from the output
# Extract location tokens: <loc_XXX>
loc_pattern = r"<loc_(\d+)>"
locations = [int(x) for x in re.findall(loc_pattern, text)]

if len(locations) >= 4:
    x1, y1, x2, y2 = locations[:4]
    # Convert from [0, 999] back to pixel coordinates
    w, h = image.size
    bbox_pixels = [x1 * w / 999, y1 * h / 999, x2 * w / 999, y2 * h / 999]
    print(f"Grounded box: {bbox_pixels}")
else:
    print("No bounding box found in output")
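Kosmos-2 interleaves `<p>phrase</p>` spans with their location tokens, so a full grounded caption can be parsed into (phrase, pixel box) pairs with one more regex. Recent `transformers` releases also ship a post-processing helper on the Kosmos-2 processor for this; the manual sketch below makes the token format explicit (the `sample` string is a hypothetical model output):

```python
import re

def parse_grounded_caption(text, img_w, img_h, num_bins=1000):
    """Extract (phrase, pixel_box) pairs from Kosmos-2 style output."""
    pattern = r"<p>(.*?)</p><box>((?:<loc_\d+>)+)</box>"
    entities = []
    for phrase, locs in re.findall(pattern, text):
        tokens = [int(t) for t in re.findall(r"<loc_(\d+)>", locs)]
        scale = num_bins - 1
        # Each group of 4 tokens is one box (a phrase may carry several)
        for i in range(0, len(tokens) - 3, 4):
            x1, y1, x2, y2 = tokens[i:i + 4]
            entities.append((phrase.strip(),
                             [x1 * img_w / scale, y1 * img_h / scale,
                              x2 * img_w / scale, y2 * img_h / scale]))
    return entities

sample = ("<grounding> An image of <p>a woman</p><box><loc_102><loc_50>"
          "<loc_400><loc_900></box> sitting on <p>a bench</p><box>"
          "<loc_200><loc_600><loc_850><loc_980></box>")
for phrase, box in parse_grounded_caption(sample, 640, 480):
    print(phrase, [f"{c:.1f}" for c in box])
```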

Spatial relationship evaluation

python
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BBox:
    """Bounding box in normalized [0, 1] coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float
    label: str = ""

    @property
    def cx(self): return (self.x1 + self.x2) / 2

    @property
    def cy(self): return (self.y1 + self.y2) / 2

    @property
    def width(self): return self.x2 - self.x1

    @property
    def height(self): return self.y2 - self.y1

    @property
    def area(self): return self.width * self.height

def compute_iou(a: BBox, b: BBox) -> float:
    """Intersection over Union between two boxes."""
    inter_x1 = max(a.x1, b.x1)
    inter_y1 = max(a.y1, b.y1)
    inter_x2 = min(a.x2, b.x2)
    inter_y2 = min(a.y2, b.y2)

    inter_area = max(0, inter_x2 - inter_x1) * max(0, inter_y2 - inter_y1)
    union_area = a.area + b.area - inter_area
    return inter_area / union_area if union_area > 0 else 0

def spatial_predicates(a: BBox, b: BBox, threshold=0.1) -> dict:
    """Compute all spatial predicates between box A and box B.

    Returns dict mapping predicate name -> bool.
    Threshold controls margin for directional predicates.
    """
    predicates = {}

    # Projective relations (in image coords: y increases downward)
    predicates["A is left of B"] = a.cx < b.cx - threshold
    predicates["A is right of B"] = a.cx > b.cx + threshold
    predicates["A is above B"] = a.cy < b.cy - threshold  # lower y = higher in image
    predicates["A is below B"] = a.cy > b.cy + threshold

    # Topological relations
    predicates["A contains B"] = (a.x1 <= b.x1 and a.y1 <= b.y1 and
                                   a.x2 >= b.x2 and a.y2 >= b.y2)
    predicates["A inside B"] = (b.x1 <= a.x1 and b.y1 <= a.y1 and
                                 b.x2 >= a.x2 and b.y2 >= a.y2)
    predicates["A overlaps B"] = compute_iou(a, b) > 0

    # Proximity relations
    dist = np.sqrt((a.cx - b.cx)**2 + (a.cy - b.cy)**2)
    avg_size = (a.width + a.height + b.width + b.height) / 4
    predicates["A near B"] = dist < avg_size * 2
    predicates["A far from B"] = dist > avg_size * 4

    return predicates

# Example: evaluate spatial relations
cup = BBox(0.1, 0.3, 0.25, 0.5, "cup")
table = BBox(0.05, 0.5, 0.9, 0.8, "table")
plate = BBox(0.4, 0.35, 0.6, 0.5, "plate")

print(f"=== {cup.label} vs {table.label} ===")
for pred, val in spatial_predicates(cup, table).items():
    if val:
        pred_str = pred.replace("A", cup.label).replace("B", table.label)
        print(f"  TRUE: {pred_str}")

print(f"\n=== {cup.label} vs {plate.label} ===")
for pred, val in spatial_predicates(cup, plate).items():
    if val:
        pred_str = pred.replace("A", cup.label).replace("B", plate.label)
        print(f"  TRUE: {pred_str}")

# VSR-style evaluation: given a statement, verify against boxes
def verify_spatial_statement(statement: str, objects: dict) -> bool:
    """Verify a spatial statement against known object positions.

    Simplified parser for demonstration.
    """
    statement = statement.lower()

    # Parse "X is [relation] Y"
    relations = {
        "left of": lambda a, b: a.cx < b.cx,
        "right of": lambda a, b: a.cx > b.cx,
        "above": lambda a, b: a.cy < b.cy,
        "below": lambda a, b: a.cy > b.cy,
        "on": lambda a, b: a.cy < b.cy and compute_iou(a, b) > 0,
    }

    for rel_name, rel_fn in relations.items():
        if rel_name in statement:
            # Find subject and object
            for name_a, box_a in objects.items():
                for name_b, box_b in objects.items():
                    if name_a != name_b:
                        if name_a in statement and name_b in statement:
                            idx_a = statement.index(name_a)
                            idx_rel = statement.index(rel_name)
                            if idx_a < idx_rel:
                                return rel_fn(box_a, box_b)
    return False

objects = {"cup": cup, "table": table, "plate": plate}
statements = [
    "The cup is above the table",
    "The cup is left of the plate",
    "The plate is below the cup",
]

for s in statements:
    result = verify_spatial_statement(s, objects)
    print(f"  '{s}' -> {result}")
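Beyond per-statement verification, referring expression comprehension is conventionally scored as Acc@0.5: a prediction counts as correct when its IoU with the ground-truth box is at least 0.5. A self-contained sketch on plain coordinate lists (IoU is reimplemented here so the snippet stands alone; the box values are made up for illustration):

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(preds, gts, thresh=0.5):
    """Fraction of predicted boxes with IoU >= thresh against ground truth."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

preds = [[0.10, 0.10, 0.50, 0.50], [0.00, 0.00, 0.20, 0.20]]
gts   = [[0.12, 0.08, 0.52, 0.48], [0.50, 0.50, 0.90, 0.90]]
print(f"Acc@0.5: {rec_accuracy(preds, gts):.2f}")
# -> Acc@0.5: 0.50  (first box overlaps well; second misses entirely)
```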

Grounded captioning pipeline

python
"""
Grounded captioning pipeline: Grounding DINO + SAM + LLM
Detects objects, segments them, then generates a spatially grounded caption.
"""
import torch
import numpy as np
from PIL import Image
from transformers import (
    AutoProcessor,
    AutoModelForZeroShotObjectDetection,
    SamModel,
    SamProcessor,
)

# Step 1: Detect objects with Grounding DINO
def detect_objects(image, text_prompt, det_threshold=0.3):
    """Open-vocabulary detection: text -> bounding boxes."""
    det_processor = AutoProcessor.from_pretrained(
        "IDEA-Research/grounding-dino-base"
    )
    det_model = AutoModelForZeroShotObjectDetection.from_pretrained(
        "IDEA-Research/grounding-dino-base"
    )

    inputs = det_processor(
        images=image, text=text_prompt, return_tensors="pt"
    )

    with torch.no_grad():
        outputs = det_model(**inputs)

    results = det_processor.post_process_grounded_object_detection(
        outputs,
        inputs["input_ids"],
        threshold=det_threshold,
        target_sizes=[image.size[::-1]],  # (H, W)
    )[0]

    return results["boxes"], results["labels"], results["scores"]

# Step 2: Segment detected objects with SAM
def segment_objects(image, boxes):
    """Given bounding boxes, produce pixel-precise masks via SAM."""
    sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
    sam_model = SamModel.from_pretrained("facebook/sam-vit-huge")

    inputs = sam_processor(
        images=image,
        input_boxes=[boxes.tolist()],
        return_tensors="pt",
    )

    with torch.no_grad():
        outputs = sam_model(**inputs)

    masks = sam_processor.image_processor.post_process_masks(
        outputs.pred_masks,
        inputs["original_sizes"],
        inputs["reshaped_input_sizes"],
    )[0]

    return masks  # (N, 1, H, W) binary masks

# Step 3: Build grounded caption
def grounded_caption(image, text_prompt):
    """Full pipeline: detect + segment + format grounded text."""
    boxes, labels, scores = detect_objects(image, text_prompt)

    img_w, img_h = image.size
    grounded_entities = []

    for box, label, score in zip(boxes, labels, scores):
        x1, y1, x2, y2 = box.tolist()
        # Normalize to [0, 999] for Kosmos-2 style output
        norm_box = [
            int(x1 / img_w * 999),
            int(y1 / img_h * 999),
            int(x2 / img_w * 999),
            int(y2 / img_h * 999),
        ]
        grounded_entities.append({
            "label": label,
            "box_pixels": [x1, y1, x2, y2],
            "box_normalized": norm_box,
            "confidence": score.item(),
            "token_str": (f"<p>{label}</p><box>"
                         f"<loc_{norm_box[0]}><loc_{norm_box[1]}>"
                         f"<loc_{norm_box[2]}><loc_{norm_box[3]}>"
                         f"</box>"),
        })

    # Generate a structured grounded caption
    caption_parts = []
    for e in grounded_entities:
        caption_parts.append(
            f"{e['label']} [{e['box_pixels'][0]:.0f}, {e['box_pixels'][1]:.0f}, "
            f"{e['box_pixels'][2]:.0f}, {e['box_pixels'][3]:.0f}]"
        )

    return {
        "entities": grounded_entities,
        "grounded_caption": "Scene contains: " + ", ".join(caption_parts),
    }

# Example usage
image = Image.open("kitchen.jpg")
result = grounded_caption(image, "cup . plate . person . table")
print(result["grounded_caption"])
for e in result["entities"]:
    print(f"  {e['label']}: conf={e['confidence']:.2f}, "
          f"box={e['box_pixels']}, token={e['token_str']}")

References

Seminal papers and key works referenced in this article.

  1. Peng et al. "Kosmos-2: Grounding Multimodal Large Language Models to the World." ICLR, 2024.
  2. Liu et al. "Grounding DINO: Marrying DINO with Grounded Pre-Training." ECCV, 2024.
  3. Kirillov et al. "Segment Anything." ICCV, 2023.
  4. You et al. "Ferret: Refer and Ground Anything Anywhere at Any Granularity." ICLR, 2024.
  5. Liu et al. "Visual Spatial Reasoning." TACL, 2023.