Endowing vision-language models with quantitative spatial reasoning — by generating 2 billion synthetic spatial QA pairs from 3D scene reconstructions of internet images.
Ask GPT-4V "how far is the table from the sofa?" and you'll get a hedge: "I'm unable to physically interact with environments, but I can provide some insights...". Ask "can a 1-meter-wide robot fit between the sofa and the chairs?" and it guesses, vaguely, that there "may not be enough space."
This is not a rare failure mode. It is the norm for today's vision-language models. VLMs can describe scenes beautifully — they name objects, read text, infer moods. But ask a quantitative spatial question and they collapse. Distances, sizes, spatial affordances: all beyond reach.
Why? Because VLMs are trained largely on image-caption pairs scraped from the internet. Captions say "a cat on a table" but almost never "the cat is 0.6 meters from the edge of the table." The training data is spatially impoverished: no 3D grounding, no metric scale, no depth.
The authors' hypothesis is bold: the bottleneck is data, not architecture. VLMs are perfectly capable of spatial reasoning — they've just never seen spatial reasoning training data at scale. Give them 2 billion spatial QA pairs and they learn.
If the problem is lack of spatial data, the solution is obvious: generate it. But where do you get millions of images with accurate 3D spatial annotations? Human labeling is expensive and doesn't scale. Synthetic renderers have limited visual diversity.
SpatialVLM's key insight: use off-the-shelf vision models to automatically extract 3D spatial information from existing 2D internet images. By combining metric depth estimation, open-vocabulary object detection, and semantic segmentation, you can reconstruct approximate 3D scenes from flat photos — and then ask spatial questions about the reconstructed 3D world.
The resulting dataset is noisy: monocular depth estimation isn't perfect, detection misses objects, and captions can be ambiguous. But the paper's experiments suggest that VLMs tolerate this noise well. Trained on enough noisy data, the model still learns useful spatial common sense.
The authors describe the result as the first internet-scale 3D spatial reasoning dataset in metric space: 10 million images and 2 billion QA pairs, split roughly evenly between qualitative questions (left/right, behind/in front) and quantitative ones (distances in meters, sizes in centimeters).
The pipeline transforms a flat 2D image into a rich set of spatial question-answer pairs through five stages. Let's walk through each one.
Not every internet image is useful. Product photos, screenshots, memes — these don't contain meaningful spatial scenes. SpatialVLM uses a CLIP-based classifier to filter the dataset, keeping only scene-level photos with multiple objects and spatial context. Starting from a massive web-scraped corpus, this narrows down to ~10 million usable images.
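For each surviving image, a suite of expert models extracts object-centric information: open-vocabulary detection to locate objects, segmentation masks to outline them, and region captions to describe them.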
This is the critical step. A metric depth estimator (ZoeDepth) predicts per-pixel depth, converting flat pixels into a 3D point cloud in metric space (real meters). The camera coordinate system is canonicalized by detecting horizontal surfaces (floors, tabletops) and aligning the ground plane. Each segmented object now has a 3D bounding box with real-world dimensions.
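The lifting step can be illustrated with standard pinhole back-projection. This is a minimal sketch, not the paper's implementation: the function names, the toy depth map, and the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are all assumptions made for illustration, and the ground-plane canonicalization described above is omitted.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a metric depth map (rows of depths in meters) to camera-frame 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # Pinhole model: pixel (u, v) at depth z maps to (x, y, z) in meters.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def bounding_box(points, mask_flat):
    """Axis-aligned 3D box (min corner, max corner) over the masked points."""
    selected = [p for p, keep in zip(points, mask_flat) if keep]
    mins = tuple(min(p[i] for p in selected) for i in range(3))
    maxs = tuple(max(p[i] for p in selected) for i in range(3))
    return mins, maxs


# Toy 2x2 depth map with every pixel 2 m away; the intrinsics are made-up values.
depth = [[2.0, 2.0], [2.0, 2.0]]
points = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
box_min, box_max = bounding_box(points, [True, True, True, True])
size = tuple(hi - lo for lo, hi in zip(box_min, box_max))
```

Here `size` holds the box extent in meters along each camera axis. A real pipeline would apply this per segmented object and filter outlier points before fitting the box.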
When multiple similar objects appear ("cake" and another "cake"), questions become ambiguous. Two solutions: (1) FlexCap generates rich, distinctive captions (e.g., "cake shaped like a house" vs. "cupcake in plastic container"), and (2) a CLIP-based clustering algorithm further disambiguates or rejects still-ambiguous captions.
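One way to picture the rejection step is a pairwise similarity check over caption embeddings. The sketch below is an assumption about how such a filter could look, not the paper's clustering algorithm: the hand-written 3D vectors stand in for real CLIP text embeddings, and the `0.9` threshold and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def reject_ambiguous(captions, embeddings, threshold=0.9):
    """Keep only captions whose embedding is not too close to any other caption's."""
    kept = []
    for i, caption in enumerate(captions):
        clash = any(
            cosine(embeddings[i], embeddings[j]) >= threshold
            for j in range(len(captions))
            if j != i
        )
        if not clash:
            kept.append(caption)
    return kept


captions = ["cake shaped like a house", "cupcake in plastic container", "cake", "a cake"]
# Toy vectors standing in for CLIP embeddings; the last two are nearly identical.
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 1.0]]
kept = reject_ambiguous(captions, embeddings)
```

The two distinctive captions survive, while "cake" and "a cake" reject each other because a question about either could refer to both objects.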
With 3D bounding boxes and unambiguous object captions, spatial QA pairs are generated from 38 question types, each with ~20 question templates and ~10 answer templates. Answers include human-aligned rounding (e.g., "about half a meter" instead of "0.487 meters").
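Template-based generation with human-aligned rounding might look like the following. This is a sketch under stated assumptions: the two question templates, two answer templates, and the rounding rule (nearest half meter, centimeters below 0.25 m) are invented for illustration and are not taken from the paper's template set.

```python
import math
import random

QUESTION_TEMPLATES = [
    "How far is the {a} from the {b}?",
    "What is the distance between the {a} and the {b}?",
]
ANSWER_TEMPLATES = [
    "The {a} is {d} from the {b}.",
    "They are {d} apart.",
]


def humanize_distance(meters):
    """Round a metric distance to a phrase a person might say (assumed rule)."""
    if meters < 0.25:
        return f"about {round(meters * 100)} centimeters"
    halves = math.floor(meters * 2 + 0.5) / 2  # nearest 0.5 m, ties round up
    if halves == 0.5:
        return "about half a meter"
    if halves == 1.0:
        return "about 1 meter"
    return f"about {halves:g} meters"


def make_qa(a, b, meters, rng):
    """Sample one question template and one answer template for an object pair."""
    question = rng.choice(QUESTION_TEMPLATES).format(a=a, b=b)
    answer = rng.choice(ANSWER_TEMPLATES).format(a=a, b=b, d=humanize_distance(meters))
    return question, answer


question, answer = make_qa("sofa", "table", 0.487, random.Random(0))
```

Sampling templates at random gives linguistic variety for the same underlying measurement, which is the point of having many templates per question type.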
Direct spatial reasoning is the base capability: the VLM sees an image and a question ("How far is the sofa from the table?") and outputs an answer ("About 1.5 meters") in a single forward pass. This is what the synthetic data trains.
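But real-world spatial tasks are often multi-step. Consider: "Can a 1-meter-wide robot fit between the sofa and the chairs?" This requires two steps: estimating the gap between the sofa and the chairs, then comparing that gap with the robot's width.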
SpatialVLM enables chain-of-thought (CoT) spatial reasoning by combining the VLM's direct spatial answers with an LLM orchestrator. The LLM (e.g., GPT-4) breaks complex spatial questions into simple sub-queries, sends them to SpatialVLM, collects the metric answers, and reasons over them.
This is the power of quantitative output. Because SpatialVLM returns actual numbers (not just "close" or "far"), the LLM can do arithmetic over spatial relationships. The baseline VLMs evaluated in the paper could not support this workflow: they mostly refused to give numbers, or gave them in meaningless "pixel units."
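The orchestration loop can be sketched in a few lines. In this illustration the decomposition is hard-coded rather than produced by an LLM, and `fake_vlm` is a stand-in that returns a canned answer instead of calling a real model. All names and the 1.56 m value are assumptions for the example.

```python
import re


def parse_meters(answer):
    """Extract the first number from an answer such as 'About 1.5 meters'."""
    match = re.search(r"\d+(?:\.\d+)?", answer)
    return float(match.group()) if match else None


def can_fit(ask_vlm, object_a, object_b, robot_width_m):
    """Decompose a fit question into one metric sub-query plus a comparison."""
    sub_query = f"What is the distance between the {object_a} and the {object_b}?"
    gap = parse_meters(ask_vlm(sub_query))
    if gap is None:
        return None  # the VLM gave no usable number
    return gap > robot_width_m


def fake_vlm(question):
    """Stand-in for the spatial VLM: always returns the same canned answer."""
    return "About 1.56 meters"


result = can_fit(fake_vlm, "sofa", "chairs", robot_width_m=1.0)
```

The comparison step only works because the sub-query returns a number in meters. An answer like "fairly close" would leave `parse_meters` with nothing to compare.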
SpatialVLM handles three categories of spatial questions, spanning a spectrum from simple binary judgments to complex affordance reasoning.
Binary predicates about relative position: "Is the chair to the left or right of the table?" "Is the lamp above or below the shelf?" "Which object is closer to the camera?" These require only ordinal judgments — no numbers, just comparisons.
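Given 3D object centers, ordinal labels like these reduce to coordinate comparisons. The sketch below assumes a camera frame with x pointing right, y pointing down, and z pointing away from the camera. That convention, the function name, and the example coordinates are illustrative assumptions rather than details from the paper.

```python
def relative_position(center_a, center_b):
    """Ordinal relations of object A with respect to object B (camera frame)."""
    ax, ay, az = center_a
    bx, by, bz = center_b
    return {
        "horizontal": "left" if ax < bx else "right",
        "vertical": "above" if ay < by else "below",  # y grows downward
        "closer_to_camera": "a" if az < bz else "b",
    }


# Made-up centers in meters: (x, y, z)
chair = (-0.8, 0.2, 2.5)
table = (0.4, 0.1, 3.0)
relation = relative_position(chair, table)
```

A production version would likely skip pairs whose coordinates are nearly equal, since a "left or right" label is unreliable when the difference is within the depth estimator's error.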
Metric measurements: "How far is the sofa from the wall?" "How wide is the doorway?" "What is the height of the counter?" These demand actual numeric answers in real-world units (meters, centimeters). This is where SpatialVLM breaks new ground: none of the baseline VLMs in the paper's evaluation reliably produced metric distances.
Applied reasoning that combines spatial measurement with world knowledge: "Can a 1-meter-wide robot fit through that gap?" "Can a toddler reach the counter?" These chain quantitative estimation with common-sense constraints. SpatialVLM answers them via CoT: first measure the gap, then compare to the constraint.
SpatialVLM uses the PaLM-E architecture — a vision-language model that encodes images through a Vision Transformer (ViT) and feeds visual tokens alongside text tokens into a large language model backbone. The specific backbone is PaLM 2-S, a smaller variant of PaLM 2.
The model is trained on a mixture of the original PaLM-E dataset (captioning, VQA, embodied planning) and the new spatial VQA dataset. Only 5% of training tokens are dedicated to spatial reasoning tasks. This is enough — and crucially, it doesn't hurt performance on other VQA benchmarks. In fact, VQA v2 performance improves by 2.4% with the spatial data, suggesting VLMs are generally underfitting on spatial-adjacent tasks.
A critical training decision: should you freeze or unfreeze the pretrained ViT encoder? The paper finds that unfreezing the ViT improves fine-grained distance estimation. A ViT pretrained with a contrastive or classification loss tends to discard fine-grained spatial information, so keeping it frozen caps what the language model can recover. With an unfrozen ViT, the model achieves 8.4% accuracy in the strict [90%, 110%] range of ground truth. That number is low in absolute terms, but the band is narrow and the human annotations it is scored against are themselves noisy.
The training data is inherently noisy — monocular depth is approximate, detection can miss objects, and the 3D lifting introduces errors. The paper studies this by deliberately adding Gaussian noise to accurate robotic manipulation data. Result: VLMs learn generalizable spatial common sense even from moderately noisy data. Performance barely degrades up to 0.3m standard deviation noise.
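The noise-injection setup can be mimicked by perturbing clean distance labels with zero-mean Gaussian noise. This is a minimal sketch: the function name, the seed handling, and the clipping of negative distances to zero are assumptions, not details reported in the paper.

```python
import random


def add_label_noise(distances_m, std_m, seed=0):
    """Perturb distance labels with zero-mean Gaussian noise, clipped at zero."""
    rng = random.Random(seed)
    return [max(0.0, d + rng.gauss(0.0, std_m)) for d in distances_m]


clean = [0.5, 1.0, 1.5, 2.0]
noisy = add_label_noise(clean, std_m=0.3)   # strongest noise level mentioned above
unchanged = add_label_noise(clean, std_m=0.0)
```

Sweeping `std_m` over a few values and retraining at each level is the shape of experiment needed to see where accuracy starts to fall off.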
The paper evaluates SpatialVLM against GPT-4V, LLaVA-1.5, InstructBLIP, PaLI, and the base PaLM-E and PaLM 2-E models. Evaluation uses a human-annotated benchmark with 331 qualitative and 215 quantitative spatial reasoning QA pairs.
SpatialVLM achieves 75.2% accuracy on binary predicate tasks (left/right, above/below, closer/farther), compared to 71.3% for LLaVA-1.5, 68.0% for GPT-4V, and 50.4% for the base PaLM 2-E. That is a 7.2-point gain over GPT-4V, a model generally assumed to be far larger.
This is where the gap is dramatic. SpatialVLM outputs valid numeric distances 99.0% of the time. GPT-4V manages only 1.0% — it almost always refuses to estimate distance. Of the valid outputs, 37.2% of SpatialVLM's estimates fall within half-to-twice of the human-annotated ground truth, versus 0% for GPT-4V (since it almost never gives a number).
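Both metrics are easy to compute from raw model answers. The sketch below follows the description above by measuring the in-range rate among answers that contain a number. The regex-based parsing, the function names, and the toy answers are assumptions for illustration, and the band is a parameter so the same code covers a stricter range such as [90%, 110%].

```python
import re


def parse_number(answer):
    """Return the first number in a free-text answer, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", answer)
    return float(match.group()) if match else None


def evaluate(answers, ground_truth_m, low=0.5, high=2.0):
    """Return (valid-output rate, in-range rate among valid outputs)."""
    parsed = [parse_number(a) for a in answers]
    valid = [(p, gt) for p, gt in zip(parsed, ground_truth_m) if p is not None]
    in_range = sum(low * gt <= p <= high * gt for p, gt in valid)
    valid_rate = len(valid) / len(answers)
    in_range_rate = in_range / len(valid) if valid else 0.0
    return valid_rate, in_range_rate


# Toy answers and made-up reference distances in meters.
answers = ["About 1.5 meters", "I cannot estimate distances.", "3 m", "roughly 0.4 meters"]
truth = [1.2, 2.0, 1.0, 0.5]
valid_rate, loose_rate = evaluate(answers, truth)
_, strict_rate = evaluate(answers, truth, low=0.9, high=1.1)
```

Dividing by all questions instead of valid outputs is a one-line change. Which denominator is used matters a great deal for a model that rarely outputs a number.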
Despite devoting 5% of training tokens to spatial tasks, general VQA performance is preserved: OKVQA drops only 0.4%, and VQA v2 increases 2.4%. Spatial training is complementary, not competitive.
The killer application of quantitative spatial reasoning: robotics. A robot that can see a scene and estimate real-world distances can plan actions that require spatial awareness — reaching, navigating, placing objects.
Traditional robot learning requires hand-crafted reward functions. With SpatialVLM, you can define a task in natural language ("pick up the orange tea bottle") and use the VLM as a reward annotator. For each frame in a trajectory, ask: "What is the distance between the gripper and the orange tea bottle?" As the gripper approaches, the distances SpatialVLM reports decrease, providing a dense reward signal.
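Turning per-frame distance estimates into a reward is then a small transformation. In this sketch the reward is simply the negative distance, which is one common choice and an assumption here. The per-frame distances are made-up numbers standing in for VLM outputs.

```python
def dense_rewards(distances_m):
    """Negative distance as reward: closer to the target means higher reward."""
    return [-d for d in distances_m]


def is_progressing(distances_m):
    """True if the estimated distance never increases from one frame to the next."""
    return all(b <= a for a, b in zip(distances_m, distances_m[1:]))


# Hypothetical gripper-to-bottle estimates (meters) over five frames.
frame_distances = [0.62, 0.45, 0.31, 0.12, 0.05]
rewards = dense_rewards(frame_distances)
progressing = is_progressing(frame_distances)
```

Because real estimates are noisy, a practical annotator might smooth the distances over a short window before converting them to rewards.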
The paper demonstrates this across three manipulation tasks: picking a tea bottle, putting an apple in a bowl, and picking up an apple. In each case, the VLM-generated reward correctly decreases as the robot makes progress toward the goal.
For navigation, SpatialVLM enables spatial affordance reasoning via CoT. "Can the robot fit through the gap between the sofa and the table?" The VLM estimates the gap width (1.56m), the LLM compares to the robot width (1m), and answers yes. This works from a single monocular image — no depth sensor, no LIDAR, no 3D reconstruction at inference time.
SpatialVLM pushes VLMs into new territory, but several limitations constrain its applicability.
The entire pipeline depends on monocular depth estimation (ZoeDepth), which is imperfect. It works best at medium range (1–10 meters) and degrades at very close range or long distances. The model inherits these biases: it is most accurate for indoor scenes with objects at moderate distances, and less reliable for outdoor scenes with large depth ranges.
The system works from single images only. Occluded objects, objects partially visible at frame edges, and complex multi-room layouts are poorly handled. Multi-view consistency — where the same scene is observed from different angles — is not enforced.
Monocular depth estimation can confuse scale. A photo of a miniature model room and a real room can produce similar depth maps. The pipeline mitigates this through the diversity of training images, but fundamental scale-depth ambiguity persists for unusual scenes.
The 38 question types with templates don't cover all possible spatial queries. Novel question formulations, compositional multi-hop reasoning, and spatial questions about unseen object categories may not generalize from the template distribution. The CoT approach partially addresses this by decomposing complex questions into template-like sub-queries.
Ground truth for spatial reasoning is hard. Human annotators disagree on distances ("Is that 2 meters or 3?"), so the 8.4% accuracy within [90%, 110%] of ground truth partly reflects label noise rather than model error alone. Better benchmarks are needed.
SpatialVLM builds directly on PaLM-E's architecture and training recipe. PaLM-E showed that VLMs can be grounded in embodied data for robot planning. SpatialVLM extends this with spatial grounding — not just "plan a path to the kitchen" but "the kitchen door is 3 meters away and 0.8 meters wide."
SayCan combines LLM planning with grounded affordance functions. SpatialVLM provides a richer affordance interface: instead of success estimates tied to a fixed set of pretrained skills, it offers continuous metric distance estimates that enable dense reward shaping.
The paper positions GPT-4V as the primary foil. GPT-4V is more capable at general vision-language tasks and presumably much larger, but its near-total failure at quantitative spatial reasoning (1% valid output rate) suggests that scale alone doesn't solve spatial grounding; you need the right data.
An alternative approach: feed 3D point clouds directly to the VLM as input. SpatialVLM takes the opposite path — it receives only 2D images but has been trained on data derived from 3D reconstructions. At inference, no 3D input is needed. This makes deployment much simpler (just a camera, no depth sensor).
ZoeDepth is the metric depth estimator that powers the data pipeline. It achieves zero-shot metric depth from monocular images by combining relative and metric depth, so its accuracy directly bounds SpatialVLM's data quality.
SpatialVLM connects to broader questions about spatial cognition. Humans develop spatial reasoning through embodied experience — we learn distances by walking, sizes by grasping. SpatialVLM takes a shortcut: instead of embodied experience, it uses reconstructed 3D geometry from images. Whether this produces the same kind of spatial understanding as embodied learning remains an open question.