SpatialVLM — Veanors

Chapter 0: The Problem

Ask GPT-4V "how far is the table from the sofa?" and you'll get a hedge: "I'm unable to physically interact with environments, but I can provide some insights...". Ask "can a 1-meter-wide robot fit between the sofa and the chairs?" and it guesses, vaguely, that there "may not be enough space."

This is not a rare failure mode. It is the norm for today's vision-language models. VLMs can describe scenes beautifully — they name objects, read text, infer moods. But ask a quantitative spatial question and they collapse. Distances, sizes, spatial affordances: all beyond reach.

Why? Because VLMs are trained on image-caption pairs scraped from the internet. Captions say "a cat on a table" but never "the cat is 0.6 meters from the edge of the table." The training data is spatially impoverished. There is no 3D grounding. No metric scale. No depth.

The gap: VLMs handle qualitative spatial reasoning (left/right, above/below) poorly, and quantitative spatial reasoning (distances in meters, size comparisons) hardly at all. On the paper's benchmark, GPT-4V produces a valid distance number only 1% of the time; SpatialVLM does so 99% of the time.

The authors' hypothesis is bold: the bottleneck is data, not architecture. VLMs are perfectly capable of spatial reasoning — they've just never seen spatial reasoning training data at scale. Give them 2 billion spatial QA pairs and they learn.

Why do current VLMs struggle with quantitative spatial questions like "how far is A from B?"

Their training data (internet image-caption pairs) contains almost no metric spatial information; captions describe what objects are, not where they are in 3D space
Their architectures cannot process spatial information
They lack sufficient parameters

Chapter 1: The Key Insight

If the problem is lack of spatial data, the solution is obvious: generate it. But where do you get millions of images with accurate 3D spatial annotations? Human labeling is expensive and doesn't scale. Synthetic renderers have limited visual diversity.

SpatialVLM's key insight: use off-the-shelf vision models to automatically extract 3D spatial information from existing 2D internet images. By combining metric depth estimation, open-vocabulary object detection, and semantic segmentation, you can reconstruct approximate 3D scenes from flat photos — and then ask spatial questions about the reconstructed 3D world.

The recipe in one sentence: Take 10 million internet images, run depth estimation to get 3D point clouds, detect and segment objects, compute 3D bounding boxes, then generate billions of spatial question-answer pairs from the extracted 3D geometry — all automatically, no human labeling required.

The resulting dataset is noisy — monocular depth estimation isn't perfect, detection misses objects, captions can be ambiguous. But it turns out VLMs are remarkably robust to noise. Train on enough noisy data and the model learns genuine spatial common sense.

This is the first internet-scale 3D spatial reasoning dataset in metric space. 10 million images. 2 billion QA pairs. 50% qualitative questions (left/right, behind/in front), 50% quantitative (distances in meters, sizes in centimeters).

What is SpatialVLM's key insight for obtaining spatial reasoning training data?

Use off-the-shelf vision models (depth estimation, detection, segmentation) to automatically extract 3D spatial annotations from existing internet images, then generate QA pairs from the reconstructed 3D geometry
Render synthetic 3D scenes with known geometry
Hire human annotators to label spatial relationships

Chapter 2: Data Generation Pipeline

The pipeline transforms a flat 2D image into a rich set of spatial question-answer pairs through five stages. Let's walk through each one.

Stage 1: Semantic Filtering

Not every internet image is useful. Product photos, screenshots, memes — these don't contain meaningful spatial scenes. SpatialVLM uses a CLIP-based classifier to filter the dataset, keeping only scene-level photos with multiple objects and spatial context. Starting from a massive web-scraped corpus, this narrows down to ~10 million usable images.

Stage 2: 2D Context Extraction

For each surviving image, a suite of expert models extracts object-centric information:

Region proposals — find candidate object regions in the image
FlexCap (region captioning) — generate fine-grained captions like "cake shaped like a house" rather than just "cake"
Semantic segmentation — get pixel-level masks for each detected object

Stage 3: 2D to 3D Lifting

This is the critical step. A metric depth estimator (ZoeDepth) predicts per-pixel depth, converting flat pixels into a 3D point cloud in metric space (real meters). The camera coordinate system is canonicalized by detecting horizontal surfaces (floors, tabletops) and aligning the ground plane. Each segmented object now has a 3D bounding box with real-world dimensions.
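To make the lifting step concrete, here is a minimal sketch of back-projecting a metric depth map into camera-frame 3D points and taking an axis-aligned box around one object's points. It assumes a simple pinhole camera; the intrinsics `fx`, `fy`, `cx`, `cy`, the `mask` argument, and both function names are illustrative and not taken from the paper. Ground-plane canonicalization is not shown.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    mask: Optional[Sequence[Sequence[bool]]] = None,
) -> List[Point]:
    """Lift pixels (u, v) with metric depth z into camera-frame (x, y, z) points.

    Assumes a pinhole camera (hypothetical intrinsics). If a segmentation mask
    is given, only pixels where mask[v][u] is True are kept.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # skip invalid depth
            if mask is not None and not mask[v][u]:
                continue  # pixel does not belong to the object
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def axis_aligned_box(points: Sequence[Point]) -> Tuple[Point, Point]:
    """Return (min_corner, max_corner) of the points, in meters."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))
```

Calling `backproject` once per segmentation mask and then `axis_aligned_box` on the result gives each object a box with real-world dimensions, which is all the later stages need.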

Stage 4: Ambiguity Resolution

When multiple similar objects appear ("cake" and another "cake"), questions become ambiguous. Two solutions: (1) FlexCap generates rich, distinctive captions (e.g., "cake shaped like a house" vs. "cupcake in plastic container"), and (2) a CLIP-based clustering algorithm further disambiguates or rejects still-ambiguous captions.
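As a rough illustration of the rejection idea, the sketch below drops any caption whose embedding is too similar to another caption's in the same image. This is a simple pairwise-similarity stand-in, not the paper's clustering algorithm; the embeddings are assumed to come from some text encoder (such as CLIP), and the `0.9` threshold is a made-up value.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def drop_ambiguous(
    captions: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical cutoff
) -> List[str]:
    """Keep only captions that are not near-duplicates of another caption."""
    keep: List[str] = []
    for i, caption in enumerate(captions):
        clashes = any(
            i != j and cosine(embeddings[i], embeddings[j]) > threshold
            for j in range(len(captions))
        )
        if not clashes:
            keep.append(caption)
    return keep
```

Both members of a confusable pair are rejected here, since a question about either one would be ambiguous.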

Stage 5: QA Synthesis

With 3D bounding boxes and unambiguous object captions, spatial QA pairs are generated from 38 question types, each with ~20 question templates and ~10 answer templates. Answers include human-aligned rounding (e.g., "about half a meter" instead of "0.487 meters").
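A minimal sketch of template-based synthesis for one question type (distance between two objects) might look like the following. The templates, the function name `make_distance_qa`, and the one-decimal formatting are illustrative assumptions; the real pipeline uses many more templates per type and applies human-aligned rounding.

```python
import math
import random
from typing import Tuple

Point = Tuple[float, float, float]

# Hypothetical templates; the paper uses ~20 question and ~10 answer templates per type.
QUESTION_TEMPLATES = [
    "How far is the {a} from the {b}?",
    "What is the distance between the {a} and the {b}?",
]
ANSWER_TEMPLATES = [
    "About {d}.",
    "The {a} is roughly {d} from the {b}.",
]


def make_distance_qa(
    a: str, center_a: Point, b: str, center_b: Point, rng: random.Random
) -> Tuple[str, str]:
    """Build one (question, answer) pair from two object captions and 3D centers."""
    meters = math.dist(center_a, center_b)  # Euclidean distance between centers
    distance_text = f"{meters:.1f} meters"
    question = rng.choice(QUESTION_TEMPLATES).format(a=a, b=b)
    answer = rng.choice(ANSWER_TEMPLATES).format(a=a, b=b, d=distance_text)
    return question, answer
```

Because question and answer templates are sampled independently, even a small pool yields many distinct phrasings per object pair.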

Scale: This pipeline produces 2 billion direct spatial reasoning QA pairs across 10 million images. 50% qualitative, 50% quantitative. The diversity of object captions, question phrasing, and distance units helps keep the model from simply memorizing templates.

What is the critical step that enables metric-scale spatial reasoning from 2D images?

Metric depth estimation lifts 2D pixels into 3D point clouds with real-world scale, enabling computation of actual distances between objects
Object detection finds all objects in the scene
Template-based QA generation creates diverse questions

Chapter 3: Direct vs Chain-of-Thought

Direct spatial reasoning is the base capability: the VLM sees an image and a question ("How far is the sofa from the table?") and outputs an answer ("About 1.5 meters") in a single forward pass. This is what the synthetic data trains.

But real-world spatial tasks are often multi-step. Consider: "Can a 1-meter-wide robot fit between the sofa and the chairs?" This requires:

Estimate the width of the path between the sofa and the chairs
Compare that width to the robot's width (1 meter)
Make the yes/no judgment

SpatialVLM enables chain-of-thought (CoT) spatial reasoning by combining the VLM's direct spatial answers with an LLM orchestrator. The LLM (e.g., GPT-4) breaks complex spatial questions into simple sub-queries, sends them to SpatialVLM, collects the metric answers, and reasons over them.

Example from the paper: "Do the blue can, orange can, and silver can roughly form an isosceles triangle?" The LLM asks SpatialVLM three distance questions (blue-to-orange: 0.4m, orange-to-silver: 0.48m, blue-to-silver: 0.41m), then computes: longest − shortest = 0.08m < 0.1m threshold. Answer: yes, roughly isosceles.
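The arithmetic in that example can be sketched as below. The `ask_distance` callable stands in for a query to the spatial VLM and is stubbed with the three distances quoted above; the function names are hypothetical. Note that the check as described (longest minus shortest under a threshold) asks for all three sides to be roughly equal, which is a stricter condition than two equal sides and therefore also implies "roughly isosceles".

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def sides_roughly_equal(
    objects: Sequence[str],
    ask_distance: Callable[[str, str], float],
    threshold_m: float = 0.1,
) -> Tuple[bool, List[float]]:
    """Query every pairwise distance, then compare longest and shortest side."""
    sides = [ask_distance(a, b) for a, b in combinations(objects, 2)]
    return max(sides) - min(sides) < threshold_m, sides


# Stubbed VLM answers (meters), copied from the example in the text.
_ANSWERS = {
    frozenset(("blue can", "orange can")): 0.40,
    frozenset(("orange can", "silver can")): 0.48,
    frozenset(("blue can", "silver can")): 0.41,
}


def stub_vlm(a: str, b: str) -> float:
    """Placeholder for a real 'how far is A from B?' query."""
    return _ANSWERS[frozenset((a, b))]
```

In a real system the orchestrating LLM would choose the sub-queries and the threshold itself; hard-coding them here just isolates the arithmetic.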

This is the power of quantitative output. Because SpatialVLM returns actual numbers (not just "close" or "far"), the LLM can do arithmetic over spatial relationships. No other VLM at the time could support this workflow — they either refused to give numbers or gave numbers in meaningless "pixel units."

Why does chain-of-thought spatial reasoning require quantitative (numeric) outputs from the VLM rather than qualitative (left/right/near/far)?

Because the LLM orchestrator needs actual metric values to perform arithmetic comparisons and multi-step reasoning (e.g., "is 1.56m > 1m?")
Because qualitative answers are always wrong
Because LLMs cannot process text descriptions

Chapter 4: Spatial VQA Tasks

SpatialVLM handles three categories of spatial questions, spanning a spectrum from simple binary judgments to complex affordance reasoning.

Qualitative spatial questions

Binary predicates about relative position: "Is the chair to the left or right of the table?" "Is the lamp above or below the shelf?" "Which object is closer to the camera?" These require only ordinal judgments — no numbers, just comparisons.

Quantitative spatial questions

Metric measurements: "How far is the sofa from the wall?" "How wide is the doorway?" "What is the height of the counter?" These demand actual numeric answers in real-world units (meters, centimeters). This is where SpatialVLM breaks new ground — no prior VLM could reliably produce metric distances.

Spatial affordance questions

Applied reasoning that combines spatial measurement with world knowledge: "Can a 1-meter-wide robot fit through that gap?" "Can a toddler reach the counter?" These chain quantitative estimation with common-sense constraints. SpatialVLM answers them via CoT: first measure the gap, then compare to the constraint.

The 38 question types span horizontal position (left/right), depth (closer/farther from camera), vertical position (above/below), 3D distance (Euclidean between object centers), object dimensions (width/height/depth of bounding boxes), and relative sizes. Each type has ~20 question templates and ~10 answer templates, yielding enormous phrasing diversity.
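Once every object has a 3D center, the qualitative question types reduce to comparing one coordinate at a time. The sketch below assumes a camera frame with x pointing right, y pointing down, and z pointing away from the camera; that convention, the `Obj` class, and the function names are assumptions for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Obj:
    name: str
    center: Tuple[float, float, float]  # (x right, y down, z forward), meters


def left_or_right(a: Obj, b: Obj) -> str:
    """Horizontal position of a relative to b."""
    return "left" if a.center[0] < b.center[0] else "right"


def above_or_below(a: Obj, b: Obj) -> str:
    """Vertical position of a relative to b (y grows downward)."""
    return "above" if a.center[1] < b.center[1] else "below"


def closer_to_camera(a: Obj, b: Obj) -> str:
    """Name of the object with the smaller depth."""
    return a.name if a.center[2] < b.center[2] else b.name


def center_distance(a: Obj, b: Obj) -> float:
    """3D Euclidean distance between object centers, in meters."""
    return math.dist(a.center, b.center)
```

A production pipeline would likely add a margin before answering, so that near-ties are skipped instead of labeled.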

What distinguishes spatial affordance questions from simple quantitative questions?

Affordance questions combine metric spatial measurement with real-world constraints (e.g., "can a 1m robot fit?"), requiring both estimation and comparison reasoning
Affordance questions are easier because they only need yes/no answers
Affordance questions use different units of measurement

Chapter 5: Training

SpatialVLM uses the PaLM-E architecture — a vision-language model that encodes images through a Vision Transformer (ViT) and feeds visual tokens alongside text tokens into a large language model backbone. The specific backbone is PaLM 2-S, a smaller variant of PaLM 2.

Training mix

The model is trained on a mixture of the original PaLM-E dataset (captioning, VQA, embodied planning) and the new spatial VQA dataset. Only 5% of training tokens are dedicated to spatial reasoning tasks. This is enough — and crucially, it doesn't hurt performance on other VQA benchmarks. In fact, VQA v2 performance improves by 2.4% with the spatial data, suggesting VLMs are generally underfitting on spatial-adjacent tasks.
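One simple way to realize such a mixture is to pick the data source for each training example at random with a fixed weight. The sketch below samples per example, which only approximates a 5% token share if examples from both sources are of similar length; the source names and function name are hypothetical.

```python
import random


def sample_source(rng: random.Random, spatial_fraction: float = 0.05) -> str:
    """Choose which dataset the next training example is drawn from."""
    return "spatial_vqa" if rng.random() < spatial_fraction else "palm_e_mix"


def observed_fraction(num_draws: int, seed: int = 0) -> float:
    """Empirical share of spatial examples over many draws."""
    rng = random.Random(seed)
    hits = sum(sample_source(rng) == "spatial_vqa" for _ in range(num_draws))
    return hits / num_draws
```

Over many draws the observed share settles near the configured 5%, which is the property the training mix relies on.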

Unfreezing the ViT

A critical training decision: should you freeze or unfreeze the pretrained ViT encoder? The paper finds that unfreezing the ViT improves fine-grained distance estimation significantly. A frozen ViT (trained with contrastive/classification loss) is lossy in fine-grained spatial information. With an unfrozen ViT, 8.4% of the model's distance estimates fall within the strict [90%, 110%] band around ground truth, a notable result given that the human annotations themselves are noisy.

Robustness to noise

The training data is inherently noisy — monocular depth is approximate, detection can miss objects, and the 3D lifting introduces errors. The paper studies this by deliberately adding Gaussian noise to accurate robotic manipulation data. Result: VLMs learn generalizable spatial common sense even from moderately noisy data. Performance barely degrades up to 0.3m standard deviation noise.
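The noise-injection experiment can be mimicked with a few lines: perturb each clean distance label with zero-mean Gaussian noise of a chosen standard deviation. Clamping at zero is an assumption added here so that labels stay physically meaningful; the function name is illustrative.

```python
import random
from typing import List, Sequence


def add_label_noise(
    distances_m: Sequence[float], sigma_m: float, rng: random.Random
) -> List[float]:
    """Return distance labels corrupted by Gaussian noise (std sigma_m, meters)."""
    return [max(0.0, d + rng.gauss(0.0, sigma_m)) for d in distances_m]
```

Sweeping `sigma_m` from 0 up to around 0.3 and retraining at each level is the kind of study the paragraph above describes.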

Human-aligned rounding: Raw depth measurements give answers like "0.487 meters" which no human would say. The pipeline rounds to human-like expressions: "about half a meter", "roughly 2 meters", "20 centimeters". This trains the model to output natural, useful answers rather than false-precision numbers.
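A rounding rule in this spirit could be sketched as follows. The cutoffs (a 5 cm window around half a meter, 5 cm steps below one meter, whole meters above) are invented for illustration and are not the paper's actual rules.

```python
def humanize_distance(meters: float) -> str:
    """Turn a raw metric distance into a phrase a person might say."""
    if meters < 1.0:
        if abs(meters - 0.5) <= 0.05:
            return "about half a meter"
        centimeters = round(meters * 100 / 5) * 5  # snap to 5 cm steps
        return f"{centimeters} centimeters"
    whole = round(meters)
    unit = "meter" if whole == 1 else "meters"
    return f"roughly {whole} {unit}"
```

For example, `humanize_distance(0.487)` gives "about half a meter" and `humanize_distance(2.1)` gives "roughly 2 meters".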

Why does unfreezing the ViT encoder improve spatial reasoning performance?

The pretrained ViT (trained with contrastive loss) is lossy in fine-grained spatial information; unfreezing lets it adapt its representations to encode depth and distance cues
Unfreezing makes the model train faster
Frozen ViTs cannot process images

Chapter 6: Results

The paper evaluates SpatialVLM against GPT-4V, LLaVA-1.5, InstructBLIP, PaLI, and the base PaLM-E / PaLM 2-E models. Evaluation uses a human-annotated benchmark with 331 qualitative and 215 quantitative spatial reasoning QA pairs.

Qualitative spatial VQA

SpatialVLM achieves 75.2% accuracy on binary predicate tasks (left/right, above/below, closer/farther), compared to 71.3% for LLaVA-1.5, 68.0% for GPT-4V, and 50.4% for the base PaLM 2-E. That is a gain of 7.2 percentage points over GPT-4V, a model generally assumed to be far larger.

Quantitative spatial VQA

This is where the gap is dramatic. SpatialVLM outputs valid numeric distances 99.0% of the time. GPT-4V manages only 1.0%; it almost always refuses to estimate distance. Of the valid outputs, 37.2% of SpatialVLM's estimates fall within a factor of two (between half and twice) of the human-annotated ground truth, versus 0% for GPT-4V (since it almost never gives a number).
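Both range-based metrics used in this lesson (the half-to-twice band here and the strict [90%, 110%] band from the training chapter) are the same computation with different bounds. A minimal sketch, with an illustrative function name and toy numbers:

```python
from typing import Sequence


def fraction_in_range(
    predictions_m: Sequence[float],
    truths_m: Sequence[float],
    low: float = 0.5,
    high: float = 2.0,
) -> float:
    """Share of predictions p with low * truth <= p <= high * truth."""
    hits = sum(
        1 for p, t in zip(predictions_m, truths_m) if low * t <= p <= high * t
    )
    return hits / len(predictions_m)
```

With predictions `[1.0, 3.0, 0.6, 1.05]` against a ground truth of 1.0 m each, the half-to-twice score is 0.75 while the strict band (`low=0.9, high=1.1`) gives 0.5, which shows how much harder the strict criterion is.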

No degradation on general VQA

Despite devoting 5% of training tokens to spatial tasks, general VQA performance is preserved: OKVQA drops only 0.4%, and VQA v2 increases 2.4%. Spatial training is complementary, not competitive.

The punchline: The bottleneck really was data, not architecture. The same model (PaLM 2-E), trained with vs. without spatial data, jumps from 50.4% to 75.2% on qualitative accuracy and rises from 33.9% to 37.2% on the quantitative in-range metric. Architecture hasn't changed. Only the training data has.

What is GPT-4V's success rate at producing valid numeric distance estimates, and what does this reveal?

Only 1.0%: GPT-4V almost always refuses to estimate distances, revealing that without spatial training data, even the most capable VLMs cannot do quantitative spatial reasoning
50%: GPT-4V is mediocre at spatial tasks
99%: GPT-4V is excellent at distance estimation

Chapter 7: Robotics Application

The killer application of quantitative spatial reasoning: robotics. A robot that can see a scene and estimate real-world distances can plan actions that require spatial awareness — reaching, navigating, placing objects.

Dense reward annotation

Traditional robot learning requires hand-crafted reward functions. With SpatialVLM, you can define a task in natural language ("pick up the orange tea bottle") and use the VLM as a reward annotator. For each frame in a trajectory, ask: "What is the distance between the gripper and the orange tea bottle?" As the gripper approaches, SpatialVLM reports monotonically decreasing distances, providing a smooth, dense reward signal.

The paper demonstrates this across three manipulation tasks: picking a tea bottle, putting an apple in a bowl, and picking up an apple. In each case, the VLM-estimated distance to the target correctly decreases as the robot makes progress toward the goal, so a reward defined on that distance improves accordingly.
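Turning per-frame distance estimates into a reward takes one line. The sketch below shows two common choices, negative distance and per-step progress; which shaping the paper uses is not stated here, so treat both function names and definitions as assumptions.

```python
from typing import List, Sequence


def negative_distance_rewards(distances_m: Sequence[float]) -> List[float]:
    """Reward per frame: closer to the target means higher (less negative) reward."""
    return [-d for d in distances_m]


def progress_rewards(distances_m: Sequence[float]) -> List[float]:
    """Reward per step: how much the estimated distance shrank since the last frame."""
    return [prev - cur for prev, cur in zip(distances_m, distances_m[1:])]
```

For a trajectory whose estimated gripper-to-bottle distance goes 0.5, 0.3, 0.1 meters, the first function yields steadily rising rewards and the second yields a positive reward at every step.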

Navigation affordances

For navigation, SpatialVLM enables spatial affordance reasoning via CoT. "Can the robot fit through the gap between the sofa and the table?" The VLM estimates the gap width (1.56m), the LLM compares to the robot width (1m), and answers yes. This works from a single monocular image — no depth sensor, no LIDAR, no 3D reconstruction at inference time.
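The comparison step is simple once the VLM's text answer has been turned into a number. The sketch below parses a distance from a free-text answer and checks it against the robot width; the regular expression, the optional `margin_m` safety margin, and the function names are all illustrative assumptions.

```python
import re
from typing import Optional

_DISTANCE = re.compile(r"(\d+(?:\.\d+)?)\s*(meters?|m|centimeters?|cm)\b")


def parse_meters(answer: str) -> Optional[float]:
    """Extract the first distance in the answer and convert it to meters."""
    match = _DISTANCE.search(answer.lower())
    if match is None:
        return None  # the model gave no usable number
    value = float(match.group(1))
    is_centimeters = match.group(2).startswith("c")
    return value / 100.0 if is_centimeters else value


def robot_fits(gap_m: float, robot_width_m: float, margin_m: float = 0.0) -> bool:
    """True if the gap is wider than the robot plus an optional safety margin."""
    return gap_m > robot_width_m + margin_m
```

With the numbers from the text, `robot_fits(parse_meters("The gap is about 1.56 meters wide"), 1.0)` returns `True`.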

Why this matters for robotics: Previous VLM-based reward annotators (like CLIP-based rewards) were limited to semantic similarity — "does this look like the goal?" SpatialVLM provides metric rewards — "how many centimeters is the gripper from the object?" This enables much denser and more informative feedback for policy learning.

How does SpatialVLM serve as a dense reward annotator for robot manipulation tasks?

By estimating the metric distance between the gripper and target object at each frame; as the robot approaches the goal, the VLM reports decreasing distances, providing a smooth reward signal
By classifying each frame as success or failure
By generating captions of the scene

Chapter 8: Limitations

SpatialVLM pushes VLMs into new territory, but several limitations constrain its applicability.

Depth estimation errors

The entire pipeline depends on monocular depth estimation (ZoeDepth), which is imperfect. It works best at medium range (1–10 meters) and degrades at very close range or long distances. The model inherits these biases: it is most accurate for indoor scenes with objects at moderate distances, and less reliable for outdoor scenes with large depth ranges.

Single-view limitations

The system works from single images only. Occluded objects, objects partially visible at frame edges, and complex multi-room layouts are poorly handled. Multi-view consistency — where the same scene is observed from different angles — is not enforced.

Scale ambiguity

Monocular depth estimation can confuse scale. A photo of a miniature model room and a real room can produce similar depth maps. The pipeline mitigates this through the diversity of training images, but fundamental scale-depth ambiguity persists for unusual scenes.

Template-based QA generation

The 38 question types with templates don't cover all possible spatial queries. Novel question formulations, compositional multi-hop reasoning, and spatial questions about unseen object categories may not generalize from the template distribution. The CoT approach partially addresses this by decomposing complex questions into template-like sub-queries.

Evaluation difficulty

Ground truth for spatial reasoning is hard to obtain. Human annotators disagree on distances ("Is that 2 meters or 3?"). The 8.4% accuracy at [90%, 110%] of ground truth is notable partly because human annotations are themselves noisy. Better benchmarks are needed.

The core tension: The training data is noisy (imperfect depth, approximate detections), and the evaluation data is noisy (imperfect human labels). Yet the system still works. This suggests that spatial common sense (knowing a table is roughly 1 meter wide, a room is roughly 4 meters across) is learnable even from imperfect supervision, plausibly because the noisy estimates are centered on the truth and their errors average out over many examples.

Why does SpatialVLM perform best at medium-range distances (1-10 meters)?

Because the monocular depth estimator (ZoeDepth) is most accurate at medium range; the model inherits the biases of its data generation pipeline
Because internet images only show medium-range scenes
Because the VLM architecture has a resolution limit

Chapter 9: Connections

PaLM-E (Driess et al., 2023)

SpatialVLM builds directly on PaLM-E's architecture and training recipe. PaLM-E showed that VLMs can be grounded in embodied data for robot planning. SpatialVLM extends this with spatial grounding — not just "plan a path to the kitchen" but "the kitchen door is 3 meters away and 0.8 meters wide."

SayCan (Ahn et al., 2022)

SayCan combines LLM planning with grounded affordance functions. SpatialVLM provides a much richer affordance interface: instead of binary "can/cannot" from pretrained skills, it offers continuous metric distance estimates that enable dense reward shaping.

GPT-4V (OpenAI, 2023)

The paper positions GPT-4V as the primary foil. GPT-4V is vastly larger and more capable at general vision-language tasks, but its near-total failure at quantitative spatial reasoning (1% valid output rate) demonstrates that scale alone doesn't solve spatial grounding — you need the right data.

3D-LLM and Point Cloud VLMs

An alternative approach: feed 3D point clouds directly to the VLM as input. SpatialVLM takes the opposite path — it receives only 2D images but has been trained on data derived from 3D reconstructions. At inference, no 3D input is needed. This makes deployment much simpler (just a camera, no depth sensor).

ZoeDepth (Bhat et al., 2023)

The metric depth estimator that powers the data pipeline. ZoeDepth achieves zero-shot metric depth from monocular images by combining relative and metric depth. Its accuracy directly bounds SpatialVLM's data quality.

Theory of Space

SpatialVLM connects to broader questions about spatial cognition. Humans develop spatial reasoning through embodied experience — we learn distances by walking, sizes by grasping. SpatialVLM takes a shortcut: instead of embodied experience, it uses reconstructed 3D geometry from images. Whether this produces the same kind of spatial understanding as embodied learning remains an open question.

What advantage does SpatialVLM's approach (2D input, 3D-informed training) have over directly feeding 3D point clouds to a VLM at inference time?

At inference, SpatialVLM only needs a standard 2D image (just a camera); no depth sensor, LIDAR, or 3D reconstruction is required, making deployment far simpler
Point cloud VLMs are always more accurate
3D point clouds are too large for any model