Self-evolving spatial intelligence via deterministic geometric environments — replacing model consensus with exact geometric computation for zero-noise 3D spatial reasoning training.
Spatial reasoning over 3D scenes is a core capability for embodied intelligence. A robot navigating a kitchen needs to know: "Is the mug closer than the plate?" "Am I facing the door or the window?" "How far is the table?" These questions require understanding 3D geometry from visual observations.
Training vision-language models (VLMs) to answer these questions is bottlenecked by one thing: geometric annotations are expensive. Somebody has to look at a 3D scene, write spatial questions, and compute the correct answers. The process is slow and costly, and it produces a static dataset that cannot adapt to what the model is actually struggling with.
Self-evolving methods seem like the answer: let the model generate its own training questions and answers, creating an endless stream of practice problems. This works beautifully for math (you can check if 2+3=5) and code (you can run the program). But for spatial reasoning, there is a devastating problem.
Without a ground-truth oracle, self-evolving systems fall back on model consensus — asking the model itself (or multiple copies) to vote on the correct answer. The majority vote becomes the "pseudo-label." But if the model systematically gets a type of spatial question wrong, the majority vote reinforces the error. The model trains on its own mistakes, entrenching them deeper with every iteration.
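The error-reinforcement effect can be illustrated with a small simulation. This is a minimal sketch, not the paper's setup: the binary "left"/"right" question, the vote count of 9, and the per-vote accuracy values are all assumptions chosen for illustration.

```python
import random
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among the sampled model outputs."""
    return Counter(answers).most_common(1)[0][0]

def sample_answers(p_correct, n_votes, rng, truth="left", wrong="right"):
    """Sample n_votes answers; each one is correct with probability p_correct."""
    return [truth if rng.random() < p_correct else wrong for _ in range(n_votes)]

def pseudo_label_accuracy(p_correct, n_votes=9, trials=2000, seed=0):
    """Fraction of trials where the majority-vote pseudo-label equals the truth."""
    rng = random.Random(seed)
    hits = sum(
        majority_vote(sample_answers(p_correct, n_votes, rng)) == "left"
        for _ in range(trials)
    )
    return hits / trials

# Voting helps when the model is usually right (0.7 per vote) and hurts when
# it is systematically wrong (0.3 per vote): the bias is amplified, not fixed.
good = pseudo_label_accuracy(0.7)
biased = pseudo_label_accuracy(0.3)
```

When each vote is more likely wrong than right, the pseudo-label is wrong even more often than a single vote, which is exactly the signal the next training iteration would learn from.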
To ground this concretely: a VLM for spatial reasoning takes multi-view RGB images (typically 4-8 views of a scene, each 448x448 pixels) plus a natural language question as input. The ViT backbone encodes each image into ~256 visual tokens. These are concatenated with text tokens and processed through the VLM's transformer layers. The output is a natural language answer — "1.43 meters" or "left" or "yes."
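The visual token budget implied by these numbers can be worked out directly. The sketch below assumes a 14-pixel patch size and a 2x2 patch merge, values chosen because they reproduce the ~256 tokens per 448x448 image quoted above; the source does not state them.

```python
def visual_tokens_per_image(side_px=448, patch_px=14, merge=2):
    """Tokens for a square image after patchifying and merging merge x merge patches.

    patch_px and merge are assumptions, not values given in the text.
    """
    patches_per_side = side_px // patch_px        # 448 // 14 = 32
    tokens_per_side = patches_per_side // merge   # 32 // 2 = 16
    return tokens_per_side ** 2                   # 16 * 16 = 256

def visual_tokens_per_scene(num_views, **kwargs):
    """Total visual tokens when num_views images are concatenated."""
    return num_views * visual_tokens_per_image(**kwargs)
```

Under these assumptions, a 4-view scene contributes 1,024 visual tokens and an 8-view scene 2,048, before any text tokens are added.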
The problem isn't the architecture — it's the training signal. Where does the ground-truth answer come from? For static datasets: expensive human annotation. For consensus methods: the model's own (possibly wrong) majority vote. For SpatialEvo: exact geometric computation from point clouds.
How model consensus reinforces errors over self-evolution iterations: the model votes on spatial answers, but systematic biases get amplified at each step.
Here is the observation that unlocks everything: 3D spatial ground truth is deterministic.
Unlike natural language tasks ("Is this poem good?") or even general vision tasks ("What emotion does this face show?"), spatial questions have answers that are computable from pure geometry. Given a dense point cloud, calibrated camera poses, and a well-formed question, the correct answer can be computed exactly — with zero noise, zero ambiguity, and zero model involvement.
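As a concrete illustration of "computable from pure geometry", the sketch below measures the distance between two objects given their points. Both definitions shown (centroid distance and closest-pair distance) are illustrative assumptions; the source does not say which definition the oracle uses, and the toy coordinates are made up.

```python
import math

def centroid(points):
    """Mean position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def centroid_distance(points_a, points_b):
    """Euclidean distance between the centroids of two objects."""
    return math.dist(centroid(points_a), centroid(points_b))

def min_point_distance(points_a, points_b):
    """Smallest distance between any pair of points (brute force, O(|A|*|B|))."""
    return min(math.dist(p, q) for p in points_a for q in points_b)

# Toy point sets in meters (made-up values for illustration only).
mug = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)]
plate = [(1.0, 0.0, 0.0), (1.2, 0.0, 0.0)]
```

No model is consulted at any point: the same inputs always produce the same answer, which is the property the rest of the framework relies on.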
The physical world is an exact and impartial judge. Every unannotated 3D scene is an inexhaustible source of noise-free supervision, waiting to be converted into a training signal. You don't need humans. You don't need model votes. You need geometry.
This is the key difference between spatial reasoning and other VLM tasks. For "Describe this image" there is no computable ground truth. For "How far apart are these two objects?" there is. SpatialEvo exploits this property to build the first self-evolving spatial reasoning framework that doesn't depend on model consensus at all.
The Deterministic Geometric Environment (DGE) is the engine that makes SpatialEvo work. It is a system that takes unannotated 3D scene data (point clouds, camera poses, semantic labels) and converts it into a zero-noise interactive oracle — a judge that can validate any spatial question and compute its exact answer.
The DGE has two jobs:
Let's trace a single question end-to-end:
The entire pipeline runs in <50ms per question on CPU. No GPU needed for the oracle — it's pure geometry.
The beauty of this design is extensibility: supporting a new spatial task category requires only specifying the corresponding validation rules and geometric computation logic. The DGE framework handles the rest — parsing, verification, ground-truth synthesis — automatically.
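The extensibility claim can be pictured as a task registry in which each task contributes one validation rule and one computation. This is a hypothetical sketch: the names `TaskSpec`, `register`, `judge`, and `Verdict`, the scene layout, and the "object_distance" task are all invented for illustration, and entity parsing is skipped by passing object names in directly.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[float, float, float]
Scene = Dict[str, List[Point]]  # hypothetical layout: object label -> its points

@dataclass
class Verdict:
    valid: bool
    answer: Optional[float] = None
    reason: Optional[str] = None  # filled in only for invalid questions

@dataclass
class TaskSpec:
    validate: Callable[[Scene, List[str]], Optional[str]]  # reason if invalid, else None
    compute: Callable[[Scene, List[str]], float]

TASKS: Dict[str, TaskSpec] = {}

def register(name, validate, compute):
    """Adding a task means supplying only its rule and its computation."""
    TASKS[name] = TaskSpec(validate, compute)

def judge(scene, task, entities):
    """Verify legality first; compute ground truth only for valid questions."""
    spec = TASKS[task]
    reason = spec.validate(scene, entities)
    if reason is not None:
        return Verdict(valid=False, reason=reason)
    return Verdict(valid=True, answer=spec.compute(scene, entities))

def _validate_distance(scene, entities):
    if len(entities) != 2 or entities[0] == entities[1]:
        return "need two distinct objects"
    for name in entities:
        if name not in scene:
            return f"object '{name}' not in scene"
    return None

def _centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def _compute_distance(scene, entities):
    a, b = (_centroid(scene[e]) for e in entities)
    return math.dist(a, b)

register("object_distance", _validate_distance, _compute_distance)
```

A call such as `judge(scene, "object_distance", ["mug", "plate"])` returns either an exact answer or a specific reason for invalidity, mirroring the two outcomes described above.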
The DGE converts a raw 3D scene into a zero-noise oracle: a question flows through entity parsing, legality verification, and ground-truth computation, whether it turns out to be valid or invalid.
This is the central idea of SpatialEvo. A single model plays two roles simultaneously: questioner and solver. Both roles use the same parameters and the same weights, so one model learns to both ask and answer spatial questions, co-evolving under the constraints of the DGE.
Given multi-view RGB images of a 3D scene and a task type (e.g., "distance estimation"), the model generates a spatially valid question. The questioner must perceive the global 3D layout and produce a question that is physically grounded — not a hallucinated question about objects that don't exist.
Given the same images and a question (generated by the questioner), the model derives a precise answer. The solver must perform explicit geometric reasoning — step-by-step spatial derivation grounded in visual evidence — and its answer is checked against the DGE's exact ground truth.
Parameter sharing creates a virtuous cycle:
The dual-role self-evolution cycle: a single model alternates between Questioner and Solver roles, and the DGE provides deterministic feedback at each step.
Both roles share identical LoRA adapters on top of a frozen Qwen2.5-VL backbone. The only difference is the prompt prefix: the questioner sees "Generate a spatial question about this scene" while the solver sees "Answer this spatial question." GRPO updates the same LoRA parameters from both signals simultaneously. Each training step processes a batch of 8 scenes, generating n=8 candidate questions per scene and n=8 candidate answers per valid question — producing 64 question samples and up to 512 answer samples per batch.
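The per-batch sample counts follow from simple multiplication. The sketch below reproduces them; the `valid_fraction` parameter is a hypothetical knob added to show why the answer count is an upper bound ("up to 512"), and it ignores any solver samples spent on invalid questions.

```python
def samples_per_batch(scenes=8, questions_per_scene=8,
                      answers_per_question=8, valid_fraction=1.0):
    """Return (question_samples, answer_samples) for one training batch.

    valid_fraction is hypothetical: the share of generated questions the
    DGE accepts as valid. Only valid questions receive candidate answers here.
    """
    questions = scenes * questions_per_scene
    valid_questions = int(questions * valid_fraction)
    answers = valid_questions * answers_per_question
    return questions, answers
```

With every question valid this gives 64 questions and 512 answers; if only half were valid, the same batch would yield 256 answer samples.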
When the questioner generates an invalid question, the DGE returns the specific reason for invalidity. The solver is then asked to explain why the question is invalid — and receives a reward for correct explanations. This means every question, valid or not, produces learning signal. Nothing is wasted.
The DGE covers 16 spatial reasoning tasks organized into three categories by observational granularity. Each task has explicit geometric validation rules and a deterministic computation pipeline.
These require integrating global 3D layout information across multiple camera frames:
These assess understanding of single-frame perspective geometry:
These focus on geometrically consistent inference across viewpoints:
The computational complexity varies dramatically across the 16 tasks:
Even the most expensive computation is far cheaper than a single VLM inference pass (on the order of 100 ms at 7B). The DGE is never the bottleneck.
The questioner's job is deceptively hard. It must look at multi-view RGB images and generate a spatial question that is:
The questioner receives a reward that combines format compliance and substantive quality:
Where α = 0.1, and:
The f_obs score assesses whether the questioner demonstrates a natural perceptual hierarchy: from global scene layout to local target identification. A high f_obs means the questioner first describes the overall spatial arrangement ("I see a bedroom with a desk against the north wall and a bed near the window"), then zeroes in on the specific objects relevant to the question ("The desk lamp and the bookshelf are both visible in frames 3 and 7"). This encourages genuine spatial reasoning rather than blind template generation.
SpatialEvo trains entirely via online reinforcement learning using the GRPO framework, with no supervised fine-tuning stage at all. Starting from the pretrained Qwen2.5-VL backbone, the model learns purely through interaction with the DGE.
For each training scene:
Advantages are computed within each GRPO group independently:
This is critical because spatial tasks have inherent difficulty variation across scenes. A "distance estimation" question in a cluttered room is harder than in a sparse one. By normalizing within each group, the advantage reflects relative performance, not absolute difficulty.
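Group-relative normalization can be sketched in a few lines. This follows the standard GRPO form, where each reward is centered and scaled by its own group's statistics; the use of the population standard deviation and the small `eps` stabilizer are assumptions, since the source does not give those details.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one GRPO group: (r - mean) / (std + eps).

    Population std and eps are assumptions made for this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Two groups with the same relative pattern but different absolute difficulty.
easy_scene = [0.8, 0.9, 1.0, 0.9]
hard_scene = [0.1, 0.2, 0.3, 0.2]
adv_easy = group_advantages(easy_scene)
adv_hard = group_advantages(hard_scene)
```

The two groups produce essentially the same advantages even though the hard scene's raw rewards are much lower, which is the sense in which the advantage reflects relative performance rather than absolute difficulty. A group whose rewards are all equal yields all-zero advantages and therefore no gradient signal.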
The solver receives different rewards depending on whether the question was valid:
For valid questions, f_acc checks agreement with DGE ground truth. For invalid questions, f_explain scores whether the solver correctly identifies why the question is invalid. This ensures every question, valid or not, contributes learning signal.
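The branching between the two solver rewards can be sketched as follows. Only the branch itself comes from the text; the two scoring functions are hypothetical stand-ins (a relative-error tolerance for numeric answers and an exact match on the invalidity reason), not the paper's definitions of f_acc and f_explain.

```python
def f_acc_numeric(pred, truth, tol=0.1):
    """Hypothetical scorer: 1.0 if the relative error is within tol, else 0.0."""
    return 1.0 if abs(pred - truth) <= tol * abs(truth) else 0.0

def f_explain_match(pred_reason, dge_reason):
    """Hypothetical scorer: 1.0 if the solver names the DGE's invalidity reason."""
    return 1.0 if pred_reason == dge_reason else 0.0

def solver_reward(question_valid, *, pred_answer=None, truth=None,
                  pred_reason=None, dge_reason=None):
    """Valid questions are scored on accuracy, invalid ones on the explanation."""
    if question_valid:
        return f_acc_numeric(pred_answer, truth)
    return f_explain_match(pred_reason, dge_reason)
```

For example, `solver_reward(True, pred_answer=1.40, truth=1.43)` earns full reward under the assumed 10% tolerance, while an invalid question is rewarded only when the stated reason matches the one the DGE returned.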
SpatialEvo trains on 8 NVIDIA A100 GPUs (80GB) for approximately 3,000 GRPO steps. Each step involves:
Total training time: ~48 hours at the 7B scale. The DGE computation (run on CPU) takes negligible time compared to model inference. The scene database covers ~4,000 3D scans (4,230 by the per-dataset counts) from ScanNet (1,513 scenes), ScanNet++ (460 scenes), and ARKitScenes (2,257 scenes), all freely available with dense point clouds and camera poses.
Key hyperparameters: learning rate 1e-5 with cosine schedule, KL coefficient β=0.04, LoRA rank 64, α=0.1 (format weight), group size n=8, temperature 0.7 for sampling.
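These settings can be collected into a small configuration object, together with the cosine learning-rate schedule. The field names are hypothetical, and the schedule assumes no warmup, decay to zero, and exactly 3,000 total steps (the text says "approximately 3,000"); none of those details are given in the source.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 1e-5   # peak learning rate
    kl_coef: float = 0.04         # KL coefficient beta
    lora_rank: int = 64
    format_weight: float = 0.1    # alpha in the questioner reward
    group_size: int = 8           # n samples per GRPO group
    temperature: float = 0.7      # sampling temperature
    total_steps: int = 3000       # assumption: "approximately 3,000" steps

def cosine_lr(step, cfg, min_lr=0.0):
    """Cosine decay from cfg.learning_rate to min_lr (no warmup assumed)."""
    progress = min(max(step / cfg.total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (cfg.learning_rate - min_lr) * cosine

cfg = TrainConfig()
```

Under these assumptions the learning rate starts at 1e-5, reaches half that value at step 1,500, and decays to zero by step 3,000.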
The DGE can give incorrect ground truth in specific edge cases: (1) severely incomplete point clouds where an object has <50 points, (2) mislabeled semantic segments in the source dataset, (3) ambiguous object boundaries where "nearest point" depends on point cloud density. The paper reports ~3% of computed answers fall outside human-agreement tolerance. The GRPO advantage normalization partially mitigates this — a noisy reward in one sample gets averaged out within the group of 8. But systematic dataset errors (e.g., a consistently mislabeled "desk" that's actually a table) can still inject bad signal.
The scheduler maintains a running accuracy estimate for each task category and increases sampling weight for categories where the model is weakest. A minimum exploration weight δ prevents mastered categories from being entirely excluded. The result: a fully adaptive curriculum that emerges from the model's own performance without any hand-designed difficulty sequence.
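One way such a scheduler could work is sketched below. The exponential moving average for the running accuracy, the error-rate weighting, and the default values of `delta`, `momentum`, and `init_acc` are all assumptions; the text specifies only a running accuracy estimate per category and a minimum exploration weight δ.

```python
import random

class TaskScheduler:
    """Hypothetical adaptive scheduler: sample weak task categories more often."""

    def __init__(self, tasks, delta=0.05, momentum=0.9, init_acc=0.5):
        self.acc = {t: init_acc for t in tasks}  # running accuracy per category
        self.delta = delta                       # minimum exploration weight
        self.momentum = momentum

    def update(self, task, correct):
        """Exponential moving average of solver correctness for one category."""
        m = self.momentum
        self.acc[task] = m * self.acc[task] + (1.0 - m) * float(correct)

    def weights(self):
        """Weight is the error rate, floored at delta, then normalized to sum to 1."""
        raw = {t: max(1.0 - a, self.delta) for t, a in self.acc.items()}
        total = sum(raw.values())
        return {t: w / total for t, w in raw.items()}

    def sample(self, rng=random):
        """Draw one task category according to the current weights."""
        w = self.weights()
        tasks = list(w)
        return rng.choices(tasks, weights=[w[t] for t in tasks], k=1)[0]
```

The floor at `delta` is what keeps a mastered category from dropping out of the curriculum entirely: its weight shrinks but never reaches zero.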
SpatialEvo was evaluated across nine benchmarks covering spatial reasoning and general visual understanding, using Qwen2.5-VL at both 3B and 7B scales. The DGE was constructed from ~4K scenes across ScanNet, ScanNet++, and ARKitScenes.
SpatialEvo achieves the highest average score at both scales: 51.1 (3B) and 54.7 (7B), outperforming all baselines including SpatialLadder, SpaceR, ViLaSR, and Spatial-SSRL.
On VSI-Bench (the primary spatial reasoning benchmark), SpatialEvo scores 39.2 (3B) and 46.1 (7B), up from baselines of 28.1 and 31.1 respectively. That is an improvement of +11.1 points at 3B and +15.0 points at 7B from self-evolution alone.
Unlike other spatial specialization methods that collapse on general benchmarks, SpatialEvo maintains competitive performance on MMStar (55.2 at 3B, 62.5 at 7B) and RealWorldQA (66.5 at 3B, 66.7 at 7B). SpatialLadder and ViLaSR, by contrast, suffer severe drops on V-STAR (falling to ~36 from baselines of 74-78). This preservation is attributable to the LoRA-only training — the base VLM weights remain frozen, so general capabilities are never overwritten.
Average performance across 9 benchmarks at the 7B scale. SpatialEvo achieves the best overall score while maintaining general capabilities.
Not all 16 tasks benefit equally from self-evolution. The largest gains come from tasks where VLMs have systematic biases:
The ablation study reveals what matters most:
Let us compare three training paradigms to understand exactly where SpatialEvo fits and why determinism is the crucial ingredient.
Train on a fixed, human-annotated dataset. Problems: (1) the dataset is frozen at creation time, so it can never adapt to the model's weaknesses; (2) annotation is expensive and doesn't scale; (3) the training distribution is whatever the annotators happened to create, not what the model needs.
Let the model generate questions and answers, using majority voting to create pseudo-labels. Problems: (1) systematic errors become the training signal — the model trains on its own mistakes; (2) errors compound over iterations; (3) there is no external corrective force. This is like studying for an exam by only checking your answers against your own previous guesses.
Let the model generate questions, but verify answers against exact geometric computation. Advantages: (1) zero noise in the training signal; (2) errors are always corrected by physics; (3) the training distribution adapts dynamically to the model's weaknesses; (4) unlimited training data from any unannotated 3D scene.
This insight has a specific scope. Deterministic ground truth exists when:
Spatial reasoning fits perfectly. Math and code also fit (you can check correctness). Natural language generation, aesthetic judgment, and open-ended reasoning do not — these still require human feedback or model consensus. SpatialEvo works precisely because it identified a domain where the ground truth is computable.
At inference time, SpatialEvo is just a standard VLM forward pass — no DGE, no geometric computation. The LoRA adapters add negligible overhead (~2% extra parameters). Inference speed is identical to base Qwen2.5-VL: ~120ms per question at 7B on a single A100. The model processes multi-view images (4-8 views at 448x448) and generates a text answer in a single autoregressive pass.
Scaling from 3B to 7B yields +3.6 points on average (+6.9 on VSI-Bench specifically). The DGE and self-evolution framework are model-agnostic by design, so they should transfer to other VLM backbones without modification. The paper supports this by training both Qwen2.5-VL-3B and Qwen2.5-VL-7B with identical pipelines, although both belong to the same model family.
Accuracy over training iterations under deterministic geometric feedback (SpatialEvo) versus model consensus: consensus degrades while deterministic feedback improves monotonically.
SpatialVLM (Chen et al., 2024): Pioneered large-scale spatial annotation datasets for VLM fine-tuning. SpatialEvo replaces the static annotation approach with dynamic, self-evolving training — no annotations needed.
GRPO (Shao et al., 2024): Group Relative Policy Optimization, the RL algorithm SpatialEvo uses. GRPO simplifies PPO by removing the value network and using group-based advantage normalization. SpatialEvo extends GRPO to the dual questioner-solver setting with DGE-anchored rewards.
AlphaGo / Self-Play (Silver et al., 2016-2017): The idea of a model playing against itself to improve. Like Go, spatial reasoning has deterministic rules — the geometry is the "rules of the game." SpatialEvo is self-play for spatial reasoning, with the DGE as the game engine.
RLVR (RL with Verifiable Rewards): The broader class of RL methods where rewards come from a verifiable oracle (math checker, code executor, geometry engine) rather than a learned reward model. SpatialEvo is RLVR for spatial reasoning, with the DGE as the verifier.
Self-Improving LLMs (STaR, Self-Taught Reasoner): Models that improve by generating and filtering their own training data. SpatialEvo advances this by replacing self-filtering (consensus) with an external oracle (DGE), eliminating the error reinforcement problem.
3D Foundation Models: The broader effort to build models that understand 3D spatial structure. SpatialEvo contributes by showing that spatial reasoning can be trained without annotations, potentially unlocking the vast supply of unannotated 3D scan data (ScanNet, ARKitScenes, etc.) as free training fuel.