Li, Zhao, Cheng et al. (Zhejiang University & StepFun) — 2026

SpatialEvo

Self-evolving spatial intelligence via deterministic geometric environments — replacing model consensus with exact geometric computation for zero-noise 3D spatial reasoning training.

Prerequisites: VLMs + GRPO / RL basics + 3D geometry intuition
10 Chapters · 5+ Simulations

Chapter 0: The Problem

Spatial reasoning over 3D scenes is a core capability for embodied intelligence. A robot navigating a kitchen needs to know: "Is the mug closer than the plate?" "Am I facing the door or the window?" "How far is the table?" These questions require understanding 3D geometry from visual observations.

Training vision-language models (VLMs) to answer these questions is bottlenecked by one thing: geometric annotations are expensive. Somebody has to look at a 3D scene, write spatial questions, and compute the correct answers. This is slow and costly, and it produces a static dataset that can never adapt to what the model is actually struggling with.

The self-evolving hope

Self-evolving methods seem like the answer: let the model generate its own training questions and answers, creating an endless stream of practice problems. This works beautifully for math (you can check if 2+3=5) and code (you can run the program). But for spatial reasoning, there is a devastating problem.

Without a ground-truth oracle, self-evolving systems fall back on model consensus — asking the model itself (or multiple copies) to vote on the correct answer. The majority vote becomes the "pseudo-label." But if the model systematically gets a type of spatial question wrong, the majority vote reinforces the error. The model trains on its own mistakes, entrenching them deeper with every iteration.

The consensus trap: Imagine a model that consistently confuses "left" and "right" from a particular camera angle. Model consensus will label these confused answers as correct, and the model will train on them, becoming even more confident in its wrong spatial intuitions. The errors compound — this is the fundamental failure mode of consensus-based self-evolution for spatial reasoning.
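The trap can be made concrete with a little probability. The sketch below is an illustration, not something taken from the paper. It assumes each of k sampled answers is independently wrong with the same probability (real samples from one model are correlated), and that all wrong answers agree on the same wrong label, as they would for a binary left/right question.

```python
from math import comb

def majority_wrong_prob(p_wrong: float, k: int) -> float:
    """Probability that a majority of k independent votes is wrong.

    Assumes k is odd (no ties) and every vote is wrong with the same
    probability p_wrong. Both are simplifying assumptions.
    """
    need = k // 2 + 1  # votes required for a majority
    return sum(
        comb(k, j) * p_wrong**j * (1 - p_wrong) ** (k - j)
        for j in range(need, k + 1)
    )

# A model that is wrong 60% of the time on some question type:
for k in (1, 5, 15):
    print(k, round(majority_wrong_prob(0.6, k), 3))
```

Under these assumptions, once the per-sample error rate exceeds 50%, adding more voters makes the pseudo-label more likely to be wrong, not less. Voting only helps when the model is already mostly right.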

What data actually flows in

To ground this concretely: a VLM for spatial reasoning takes multi-view RGB images (typically 4-8 views of a scene, each 448x448 pixels) plus a natural language question as input. The ViT backbone encodes each image into ~256 visual tokens. These are concatenated with text tokens and processed through the VLM's transformer layers. The output is a natural language answer — "1.43 meters" or "left" or "yes."
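As a rough check on the token budget, the arithmetic can be written out. The 14-pixel patch size and the 2x2 patch merge are assumptions based on the Qwen2.5-VL vision encoder; the text above only states the resulting figure of about 256 tokens per image.

```python
def visual_tokens_per_image(side_px: int = 448, patch_px: int = 14, merge: int = 2) -> int:
    """Visual tokens for one square image.

    patch_px=14 and merge=2 are assumed encoder settings, not values
    given in the article.
    """
    patches_per_side = side_px // patch_px      # 448 // 14 = 32
    return (patches_per_side // merge) ** 2     # (32 // 2) ** 2 = 256

for views in (4, 8):
    print(views, "views ->", views * visual_tokens_per_image(), "visual tokens")
```

With 4 to 8 views this comes to roughly 1,000 to 2,000 visual tokens before any text tokens are added.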

The problem isn't the architecture — it's the training signal. Where does the ground-truth answer come from? For static datasets: expensive human annotation. For consensus methods: the model's own (possibly wrong) majority vote. For SpatialEvo: exact geometric computation from point clouds.

Consensus Error Reinforcement

Watch how model consensus reinforces errors over iterations. The model votes on spatial answers, but systematic biases get amplified. Click "Step" to advance one self-evolution iteration.


Why does model consensus fail as a training signal for 3D spatial reasoning?

Chapter 1: The Key Insight

Here is the observation that unlocks everything: 3D spatial ground truth is deterministic.

Unlike natural language tasks ("Is this poem good?") or even general vision tasks ("What emotion does this face show?"), spatial questions have answers that are computable from pure geometry. Given a dense point cloud, calibrated camera poses, and a well-formed question, the correct answer follows from geometric computation alone, with no model involvement and no labeling noise beyond the measurement error of the scan itself.
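To see what "computable from pure geometry" looks like, here is a minimal sketch of one such computation: deciding whether a point lies to the left or right of a camera. The function names and the convention that the camera x-axis points right are assumptions for illustration, not the paper's code.

```python
from typing import List, Sequence

Vec3 = Sequence[float]

def world_to_camera(p_world: Vec3, R_c2w: Sequence[Vec3], cam_pos: Vec3) -> List[float]:
    """Rigid-body transform of a world point into the camera frame.

    R_c2w is the camera-to-world rotation (row-major 3x3). Its inverse
    is its transpose, so p_cam = R_c2w^T @ (p_world - cam_pos).
    """
    d = [p_world[i] - cam_pos[i] for i in range(3)]
    return [sum(R_c2w[r][c] * d[r] for r in range(3)) for c in range(3)]

def side_of_camera(p_world: Vec3, R_c2w: Sequence[Vec3], cam_pos: Vec3) -> str:
    """'left' or 'right', assuming the camera x-axis points right."""
    x_cam = world_to_camera(p_world, R_c2w, cam_pos)[0]
    return "right" if x_cam > 0 else "left"

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(side_of_camera((1.0, 0.0, 2.0), identity, (0.0, 0.0, 0.0)))  # right
```

No model is consulted anywhere: the same inputs always yield the same answer.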

Examples of deterministic spatial computation

The physical world is an exact and impartial judge. Every unannotated 3D scene is an inexhaustible source of noise-free supervision, waiting to be converted into a training signal. You don't need humans. You don't need model votes. You need geometry.

The core insight in one sentence: Because spatial ground truth is a deterministic consequence of geometry, you can replace model consensus with programmatic computation — turning every unannotated 3D scene into a zero-noise oracle that provides perfect training signal forever.

This is the key difference between spatial reasoning and other VLM tasks. For "Describe this image" there is no computable ground truth. For "How far apart are these two objects?" there is. SpatialEvo exploits this property to build the first self-evolving spatial reasoning framework that doesn't depend on model consensus at all.

What unique property of 3D spatial reasoning does SpatialEvo exploit?

Chapter 2: The Deterministic Geometric Environment

The Deterministic Geometric Environment (DGE) is the engine that makes SpatialEvo work. It is a system that takes unannotated 3D scene data (point clouds, camera poses, semantic labels) and converts it into a zero-noise interactive oracle — a judge that can validate any spatial question and compute its exact answer.

What the DGE does

The DGE has two jobs:

  1. Validate questions: When the model generates a spatial question, the DGE checks whether the question is physically valid — do the referenced objects exist? Are the camera frames real? Is the question geometrically well-posed?
  2. Compute ground truth: For valid questions, the DGE programmatically computes the exact answer using geometric operations on the 3D scene assets.

The three-stage pipeline

Stage 1: Entity Parsing
A lightweight LLM extracts structured entities from the natural language question — frame indices, object categories, spatial relationships — and normalizes them into machine-readable form.
Stage 2: Legality Verification
The DGE validates each entity against the scene: Does this object exist? Is this frame index valid? Is the point cloud dense enough for a reliable answer? Invalid questions get a negative reward + specific invalidity reason.
Stage 3: Ground-Truth Synthesis
For valid questions, the DGE runs exact geometric computation: rigid-body transforms, bounding-box fitting, depth projection, plane estimation. Output: exact answer + intermediate geometric states for interpretability.
Core geometric operators: The DGE's toolkit includes rigid-body coordinate transformation, point cloud bounding-box fitting, topological analysis, depth-map perspective projection, and planar normal estimation. These cover the full computational requirements of all 16 task categories across metric, topological, and orientation dimensions.
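
Stage 2 is the easiest of the three to picture in code. The following is a minimal sketch, not the paper's implementation: the `Verdict` type, the dictionary layout of `scene` and `entities`, and the 50-point density threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    valid: bool
    reason: Optional[str] = None  # filled in only for invalid questions

def verify(entities: dict, scene: dict, min_points: int = 50) -> Verdict:
    """Check parsed entities against the scene assets.

    scene is assumed to look like
    {"num_frames": int, "objects": {label: [(x, y, z), ...]}}.
    """
    for frame in entities.get("frames", []):
        if not 0 <= frame < scene["num_frames"]:
            return Verdict(False, f"frame index {frame} out of range")
    for name in entities.get("objects", []):
        points = scene["objects"].get(name)
        if points is None:
            return Verdict(False, f"object not found: {name}")
        if len(points) < min_points:
            return Verdict(False, f"point cloud too sparse for: {name}")
    return Verdict(True)
```

Returning a reason string alongside the verdict is what lets an invalid question carry a specific invalidity reason rather than a bare rejection.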

Concrete data flow through the DGE

Let's trace a single question end-to-end:

  1. Input: Natural language question "How far is the chair from the desk?" + scene ID (points to a ScanNet scene)
  2. Entity Parsing: A lightweight LLM (Qwen2.5-7B) extracts: {objects: ["chair", "desk"], relation: "distance", type: "absolute_distance"}
  3. Scene Assets Retrieved: The DGE loads the scene's dense point cloud (typically 100K-500K points), semantic segmentation masks, and calibrated camera extrinsics for all captured frames
  4. Legality Check: Are "chair" and "desk" in the scene's semantic labels? (If not → reject with reason "object not found")
  5. Ground Truth Computation: Extract point cloud subsets for chair and desk using semantic masks → fit axis-aligned bounding boxes → compute nearest-point distance between bounding box surfaces → output: 1.43 meters (float32, exact)

The entire pipeline runs in <50ms per question on CPU. No GPU needed for the oracle — it's pure geometry.
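
The ground-truth step in the trace above (fit axis-aligned boxes, then measure the gap between their surfaces) fits in a few lines of plain Python. This is a sketch under simple assumptions: points are (x, y, z) tuples in meters, and the toy coordinates are invented for the example, not taken from any real scene.

```python
from math import sqrt

def fit_aabb(points):
    """Axis-aligned bounding box of an iterable of (x, y, z) points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def aabb_distance(box_a, box_b) -> float:
    """Shortest distance between two box surfaces (0.0 if they overlap)."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    # Per-axis gap: positive only when the boxes are separated on that axis.
    gaps = [
        max(0.0, b_min[i] - a_max[i], a_min[i] - b_max[i])
        for i in range(3)
    ]
    return sqrt(sum(g * g for g in gaps))

# Toy point subsets standing in for the masked chair and desk points.
chair = [(0.0, 0.0, 0.0), (0.5, 0.5, 1.0)]
desk = [(1.5, 0.0, 0.0), (2.5, 0.8, 0.75)]
print(round(aabb_distance(fit_aabb(chair), fit_aabb(desk)), 2))  # 1.0
```

Both functions are linear in the number of points and need no GPU, which is consistent with the oracle running on CPU.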

What if the scene data is noisy? ScanNet point clouds have ~2cm noise from depth sensor imprecision. For distance questions, this introduces ~4cm uncertainty (propagated through nearest-point computation). The paper handles this by rounding ground-truth distances to 0.01m precision and accepting answers within a tolerance band. For direction questions (left/right/above/below), the geometric computation is binary and noise-insensitive — a point is either left or right of a plane, regardless of 2cm jitter.
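
A tolerance band of this kind might look like the sketch below. The band itself is hypothetical: the text does not give its width, so the 4 cm absolute floor (borrowed from the uncertainty estimate above) and the 10% relative margin are placeholder assumptions.

```python
def round_gt(distance_m: float) -> float:
    """Round a computed distance to 0.01 m precision."""
    return round(distance_m, 2)

def within_tolerance(pred_m: float, gt_m: float,
                     rel_tol: float = 0.10, abs_tol: float = 0.04) -> bool:
    """Accept a prediction that falls inside the band around the ground truth.

    rel_tol and abs_tol are assumed values for illustration only.
    The band is the larger of the absolute floor and the relative margin.
    """
    gt = round_gt(gt_m)
    return abs(pred_m - gt) <= max(abs_tol, rel_tol * abs(gt))

print(within_tolerance(1.50, 1.4312))  # True
print(within_tolerance(1.80, 1.4312))  # False
```

The absolute floor matters for short distances, where a purely relative margin would be tighter than the sensor noise allows.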

The beauty of this design is extensibility: supporting a new spatial task category requires only specifying the corresponding validation rules and geometric computation logic. The DGE framework handles the rest — parsing, verification, ground-truth synthesis — automatically.

DGE Pipeline

The DGE converts a raw 3D scene into a zero-noise oracle. Watch a question flow through entity parsing, legality verification, and ground-truth computation. Toggle between valid and invalid questions.

What happens when the DGE receives an invalid question (e.g., referencing a non-existent object)?

Chapter 3: Dual-Role Self-Evolution

This is the showcase idea of SpatialEvo. A single model plays two roles simultaneously: questioner and solver. The same parameters, the same weights — one model that learns to both ask and answer spatial questions, co-evolving under the constraints of the DGE.

The questioner role

Given multi-view RGB images of a 3D scene and a task type (e.g., "distance estimation"), the model generates a spatially valid question. The questioner must perceive the global 3D layout and produce a question that is physically grounded — not a hallucinated question about objects that don't exist.

The solver role

Given the same images and a question (generated by the questioner), the model derives a precise answer. The solver must perform explicit geometric reasoning — step-by-step spatial derivation grounded in visual evidence — and its answer is checked against the DGE's exact ground truth.

Why shared parameters?

Parameter sharing creates a virtuous cycle:

The self-evolution loop: (1) Task scheduler samples a task type based on the model's weakest categories. (2) Questioner generates n candidate questions. (3) DGE validates each question and computes ground truth. (4) Solver generates n candidate answers per valid question. (5) DGE scores answers against exact ground truth. (6) GRPO computes advantages and updates shared parameters. (7) Repeat — the model gets better at both asking and answering.
Self-Evolution Loop

Watch the dual-role self-evolution cycle. A single model alternates between Questioner and Solver roles. The DGE provides deterministic feedback at each step. Click "Cycle" to advance.


What actually changes in the weights

Both roles share identical LoRA adapters on top of a frozen Qwen2.5-VL backbone. The only difference is the prompt prefix: the questioner sees "Generate a spatial question about this scene" while the solver sees "Answer this spatial question." GRPO updates the same LoRA parameters from both signals simultaneously. Each training step processes a batch of 8 scenes, generating n=8 candidate questions per scene and n=8 candidate answers per valid question — producing 64 question samples and up to 512 answer samples per batch.

Invalid questions are useful too

When the questioner generates an invalid question, the DGE returns the specific reason for invalidity. The solver is then asked to explain why the question is invalid — and receives a reward for correct explanations. This means every question, valid or not, produces learning signal. Nothing is wasted.

Why does SpatialEvo use a single shared-parameter model for both questioner and solver roles?

Chapter 4: 16 Spatial Task Categories

The DGE covers 16 spatial reasoning tasks organized into three categories by observational granularity. Each task has explicit geometric validation rules and a deterministic computation pipeline.

Multi-image scene-level tasks (6 tasks)

These require integrating global 3D layout information across multiple camera frames:

Single-image tasks (3 tasks)

These assess understanding of single-frame perspective geometry:

Dual-image tasks (7 tasks)

These focus on geometrically consistent inference across viewpoints:

How geometric computation differs by task type

The computational complexity varies dramatically across the 16 tasks:

Even the most expensive computation is far cheaper than a single VLM inference pass (~100ms at 7B). The DGE is never the bottleneck.

Three semantic dimensions: Collectively, the 16 tasks span metric measurement (distances, sizes), topological relations (containment, ordering), and camera pose reasoning (orientation, motion). Every answer is deterministically computable from the scene's geometric assets — this is what makes the DGE possible.
16 Spatial Task Categories

Hover over each task category to see the geometric validation method used by the DGE.

How does the DGE compute the answer for a "depth ordering" question?

Chapter 5: Question Generation

The questioner's job is deceptively hard. It must look at multi-view RGB images and generate a spatial question that is:

  1. Physically valid — references real objects, uses valid frame indices, is geometrically well-posed
  2. Visually grounded — demonstrates genuine perception of the 3D scene layout, not just template-filling
  3. Task-consistent — matches the assigned task category from the scheduler

Questioner reward

The questioner receives a reward that combines format compliance and substantive quality:

r_Q = α · f_fmt + (1 − α) · f_valid · f_obs

Where α = 0.1, and:

The gating semantics of f_valid · f_obs: The multiplicative coupling is critical. A question gets a positive signal only when it satisfies both geometric validity and sufficient visual observation simultaneously. This prevents the model from generating "superficially valid" questions that conform to format but lack genuine spatial understanding, such as asking about object distance without actually perceiving where the objects are in the scene.
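
The gating effect is easy to see numerically. The sketch below implements the questioner reward as written, next to a hypothetical additive variant included only for contrast. It assumes all three component scores lie in [0, 1], which the text does not state explicitly.

```python
ALPHA = 0.1  # format weight given in the text

def questioner_reward(f_fmt: float, f_valid: float, f_obs: float,
                      alpha: float = ALPHA) -> float:
    """Format term plus a gated quality term (validity times observation)."""
    return alpha * f_fmt + (1 - alpha) * f_valid * f_obs

def additive_variant(f_fmt: float, f_valid: float, f_obs: float,
                     alpha: float = ALPHA) -> float:
    """Hypothetical alternative that averages the two terms (not in the paper)."""
    return alpha * f_fmt + (1 - alpha) * 0.5 * (f_valid + f_obs)

# Well-formatted, geometrically valid, but with no real scene observation:
print(round(questioner_reward(1.0, 1.0, 0.0), 2))  # 0.1
print(round(additive_variant(1.0, 1.0, 0.0), 2))   # 0.55
```

With the product, a question that is valid but ungrounded earns only the small format term. A sum would still pay out more than half the maximum reward for it.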

What visual observation quality means

The f_obs score assesses whether the questioner demonstrates a natural perceptual hierarchy: from global scene layout to local target identification. A high f_obs means the questioner first describes the overall spatial arrangement ("I see a bedroom with a desk against the north wall and a bed near the window"), then zeroes in on the specific objects relevant to the question ("The desk lamp and the bookshelf are both visible in frames 3 and 7"). This encourages genuine spatial reasoning rather than blind template generation.

Why does the questioner reward use a multiplicative coupling f_valid · f_obs rather than a sum?

Chapter 6: Training Pipeline

SpatialEvo trains entirely via online reinforcement learning using the GRPO framework, with no supervised fine-tuning stage at all. Starting from the pretrained Qwen2.5-VL checkpoint, the model acquires its spatial skills purely through interaction with the DGE.

GRPO training procedure

For each training scene:

1. Task Selection
The scheduler infers feasible tasks for this scene, then samples a task type weighted toward the model's weakest categories.
2. Question Generation
Questioner generates n candidate questions {Q^(i)}. DGE validates each and computes ground truth G^(i) for valid ones.
3. Answer Generation
Questions deduplicated to m unique ones. Solver samples n candidate answers per question. DGE scores against exact ground truth.
4. Advantage Computation
GRPO computes advantages independently within groups to eliminate bias from scene difficulty variation.
5. Joint Update
Questioner and solver gradients applied jointly to the single shared model. Repeat.
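
Steps 2 and 3 of this procedure can be sketched as a single rollout function. Everything here is a simplification under stated assumptions: the model roles and the DGE are passed in as plain callables, the questioner reward is reduced to 1.0 for valid and 0.0 for invalid, and the invalid-question branch, advantage computation, and parameter update are left out.

```python
from typing import Callable, Dict, List, Optional

def rollout(
    scene: dict,
    task: str,
    ask: Callable[[dict, str], str],               # questioner role
    answer: Callable[[dict, str], str],            # solver role
    oracle: Callable[[dict, str], Optional[str]],  # ground truth, or None if invalid
    score: Callable[[str, str], float],            # answer vs. ground truth
    n: int = 8,
) -> Dict[str, object]:
    """Collect questioner and solver rewards for one scene."""
    questions = [ask(scene, task) for _ in range(n)]
    truths = {q: oracle(scene, q) for q in questions}
    # Simplified questioner reward: validity only.
    q_rewards: List[float] = [1.0 if truths[q] is not None else 0.0 for q in questions]
    # Deduplicate to m unique questions, keeping first-seen order.
    unique = list(dict.fromkeys(questions))
    a_rewards: Dict[str, List[float]] = {}
    for q in unique:
        if truths[q] is None:
            continue  # invalid questions go to the explanation branch (omitted)
        a_rewards[q] = [score(answer(scene, q), truths[q]) for _ in range(n)]
    return {"questioner": q_rewards, "solver": a_rewards}
```

Each list in the result is one GRPO group: the questioner's n samples form one group, and each unique valid question contributes its own group of n solver samples.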

Advantage normalization

Advantages are computed within each GRPO group independently:

Â^(i) = (r^(i) − mean{r^(j)}) / (std{r^(j)} + ε),   with j ranging over the n samples in the same group

This is critical because spatial tasks have inherent difficulty variation across scenes. A "distance estimation" question in a cluttered room is harder than in a sparse one. By normalizing within each group, the advantage reflects relative performance, not absolute difficulty.
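
A minimal sketch of the per-group normalization follows. It assumes the population standard deviation and ε = 1e-6, neither of which is specified in the text, and the reward values are made up to show the effect.

```python
from statistics import fmean, pstdev
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one GRPO group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; sample std is another common choice
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two groups with the same spread but very different absolute difficulty.
easy_scene = [0.9, 1.0, 0.8, 1.0]
hard_scene = [0.1, 0.2, 0.0, 0.2]
print([round(a, 2) for a in group_advantages(easy_scene)])
print([round(a, 2) for a in group_advantages(hard_scene)])
```

The two groups produce essentially the same advantages even though their raw rewards differ by 0.8, which is the point: the gradient signal reflects which samples were better than their siblings, not how hard the scene was.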

Solver reward design

The solver receives different rewards depending on whether the question was valid:

r_A = α · f_fmt + (1 − α) · f_acc   (valid question)
r_A = α · f_fmt + (1 − α) · f_explain   (invalid question)

For valid questions, f_acc checks agreement with DGE ground truth. For invalid questions, f_explain scores whether the solver correctly identifies why the question is invalid. This ensures every question, valid or not, contributes learning signal.
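
The two cases collapse into one function with a branch on validity. This is a sketch that takes the component scores as inputs; how the accuracy and explanation scores are computed is not covered here, and the assumption is that each lies in [0, 1].

```python
def solver_reward(f_fmt: float, question_valid: bool,
                  f_acc: float = 0.0, f_explain: float = 0.0,
                  alpha: float = 0.1) -> float:
    """Format term plus accuracy (valid) or explanation quality (invalid)."""
    content = f_acc if question_valid else f_explain
    return alpha * f_fmt + (1 - alpha) * content

# Correct answer to a valid question:
print(round(solver_reward(1.0, True, f_acc=1.0), 2))        # 1.0
# Correct explanation of why a question is invalid:
print(round(solver_reward(1.0, False, f_explain=1.0), 2))   # 1.0
```

Note that an accuracy score is ignored when the question is invalid: answering an ill-posed question confidently earns nothing beyond the format term.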

Training infrastructure and cost

SpatialEvo trains on 8 NVIDIA A100 GPUs (80GB) for approximately 3,000 GRPO steps. Each step involves:

Total training time: ~48 hours at the 7B scale. The DGE computation (run on CPU) takes negligible time compared to model inference. The scene database covers 4,230 3D scans (about 4K) from ScanNet (1,513 scenes), ScanNet++ (460 scenes), and ARKitScenes (2,257 scenes), all freely available with dense point clouds and camera poses.

Key hyperparameters: learning rate 1e-5 with cosine schedule, KL coefficient β=0.04, LoRA rank 64, α=0.1 (format weight), group size n=8, temperature 0.7 for sampling.

What happens when the oracle is wrong?

The DGE can give incorrect ground truth in specific edge cases: (1) severely incomplete point clouds where an object has <50 points, (2) mislabeled semantic segments in the source dataset, (3) ambiguous object boundaries where "nearest point" depends on point cloud density. The paper reports ~3% of computed answers fall outside human-agreement tolerance. The GRPO advantage normalization partially mitigates this — a noisy reward in one sample gets averaged out within the group of 8. But systematic dataset errors (e.g., a consistently mislabeled "desk" that's actually a table) can still inject bad signal.

Task-adaptive scheduling

The scheduler maintains a running accuracy estimate for each task category and increases sampling weight for categories where the model is weakest. A minimum exploration weight δ prevents mastered categories from being entirely excluded. The result: a fully adaptive curriculum that emerges from the model's own performance without any hand-designed difficulty sequence.
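
One way such a scheduler could work is sketched below. The specific rule (weight proportional to one minus accuracy, floored at δ), the exponential moving average, and the task names are all assumptions for illustration; the text describes the behavior but not the formula.

```python
import random
from typing import Dict

def task_weights(accuracy: Dict[str, float], delta: float = 0.05) -> Dict[str, float]:
    """Sampling weights that favor weak categories, with an exploration floor."""
    raw = {task: max(1.0 - acc, delta) for task, acc in accuracy.items()}
    total = sum(raw.values())
    return {task: w / total for task, w in raw.items()}

def update_accuracy(old: float, correct: bool, momentum: float = 0.9) -> float:
    """Running accuracy estimate as an exponential moving average (assumed)."""
    return momentum * old + (1 - momentum) * float(correct)

def sample_task(accuracy: Dict[str, float], rng: random.Random,
                delta: float = 0.05) -> str:
    weights = task_weights(accuracy, delta)
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]

# Hypothetical task names and accuracies.
acc = {"distance": 0.3, "depth_order": 0.7, "room_size": 1.0}
print(task_weights(acc))
print(sample_task(acc, random.Random(0)))
```

Even a fully mastered category keeps a nonzero weight thanks to the floor, so the model is still occasionally tested on it.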

Why does SpatialEvo compute GRPO advantages within each group independently, rather than across all groups?

Chapter 7: Results

SpatialEvo was evaluated across nine benchmarks covering spatial reasoning and general visual understanding, using Qwen2.5-VL at both 3B and 7B scales. The DGE was constructed from ~4K scenes across ScanNet, ScanNet++, and ARKitScenes.

State-of-the-art spatial reasoning

SpatialEvo achieves the highest average score at both scales: 51.1 (3B) and 54.7 (7B), outperforming all baselines including SpatialLadder, SpaceR, ViLaSR, and Spatial-SSRL.

On VSI-Bench (the primary spatial reasoning benchmark), SpatialEvo scores 39.2 (3B) and 46.1 (7B) — up from baselines of 28.1 and 31.1 respectively. That is a +11 to +15 point improvement from self-evolution alone.

No degradation on general tasks

Unlike other spatial specialization methods that collapse on general benchmarks, SpatialEvo maintains competitive performance on MMStar (55.2 at 3B, 62.5 at 7B) and RealWorldQA (66.5 at 3B, 66.7 at 7B). SpatialLadder and ViLaSR, by contrast, suffer severe drops on V-STAR (falling to ~36 from baselines of 74-78). This preservation is attributable to the LoRA-only training — the base VLM weights remain frozen, so general capabilities are never overwritten.

SpatialEvo vs Baselines (9 Benchmarks)

Average performance across 9 benchmarks at the 7B scale. SpatialEvo achieves the best overall score while maintaining general capabilities.

Per-task breakdown

Not all 16 tasks benefit equally from self-evolution. The largest gains come from tasks where VLMs have systematic biases:

Ablation highlights

The ablation study reveals what matters most:

The ablation that says it all: Replacing DGE ground truth with majority-vote pseudo-labels drops VSI-Bench from 46.1 to 18.8 — a 27-point collapse, worse than the untrained baseline (31.1). Model consensus doesn't just fail to help — it actively destroys spatial reasoning by training on systematically biased pseudo-labels. This is the strongest validation of the deterministic geometric approach.
What is the single most impactful component in SpatialEvo's ablation study?

Chapter 8: Why Determinism Matters

Let us compare three training paradigms to understand exactly where SpatialEvo fits and why determinism is the crucial ingredient.

Paradigm 1: Static Data Tuning

Train on a fixed, human-annotated dataset. Problems: (1) the dataset is frozen at creation time, so it can never adapt to the model's weaknesses; (2) annotation is expensive and doesn't scale; (3) the training distribution is whatever the annotators happened to create, not what the model needs.

Paradigm 2: Consensus Self-Evolve

Let the model generate questions and answers, using majority voting to create pseudo-labels. Problems: (1) systematic errors become the training signal — the model trains on its own mistakes; (2) errors compound over iterations; (3) there is no external corrective force. This is like studying for an exam by only checking your answers against your own previous guesses.

Paradigm 3: SpatialEvo (Deterministic Self-Evolve)

Let the model generate questions, but verify answers against exact geometric computation. Advantages: (1) zero noise in the training signal; (2) errors are always corrected by physics; (3) the training distribution adapts dynamically to the model's weaknesses; (4) unlimited training data from any unannotated 3D scene.

The error reinforcement problem, quantified: In the ablation study, replacing DGE ground truth with majority-vote pseudo-labels drops VSI-Bench from 46.1 to 18.8 — below the untrained baseline of 31.1. The model doesn't just fail to improve; it actively degrades. After several iterations of training on its own consensus, the model is worse than if it had never trained at all. Determinism is not a nice-to-have — it is the difference between self-improvement and self-destruction.

When does determinism apply?

This insight has a specific scope. Deterministic ground truth exists when:

Spatial reasoning fits perfectly. Math and code also fit (you can check correctness). Natural language generation, aesthetic judgment, and open-ended reasoning do not — these still require human feedback or model consensus. SpatialEvo works precisely because it identified a domain where the ground truth is computable.

Inference cost and scaling behavior

At inference time, SpatialEvo is just a standard VLM forward pass — no DGE, no geometric computation. The LoRA adapters add negligible overhead (~2% extra parameters). Inference speed is identical to base Qwen2.5-VL: ~120ms per question at 7B on a single A100. The model processes multi-view images (4-8 views at 448x448) and generates a text answer in a single autoregressive pass.

Scaling from 3B to 7B yields +3.6 points on average (+6.9 on VSI-Bench specifically). The DGE and self-evolution framework are model-agnostic — they should transfer to any VLM backbone without modification. The paper demonstrates this by training both Qwen2.5-VL-3B and Qwen2.5-VL-7B with identical pipelines.

Deterministic vs Consensus Training

Compare how accuracy evolves over training iterations under deterministic geometric feedback (SpatialEvo) vs model consensus. Watch consensus degrade while deterministic feedback improves monotonically.

Why does training with model consensus pseudo-labels eventually produce a model WORSE than the untrained baseline?

Chapter 9: Connections

What SpatialEvo builds on

SpatialVLM (Chen et al., 2024): Pioneered large-scale spatial annotation datasets for VLM fine-tuning. SpatialEvo replaces the static annotation approach with dynamic, self-evolving training — no annotations needed.

GRPO (Shao et al., 2024): Group Relative Policy Optimization, the RL algorithm SpatialEvo uses. GRPO simplified PPO by removing the value network and using group-based advantage normalization. SpatialEvo extends GRPO to the dual questioner-solver setting with DGE-anchored rewards.

AlphaGo / Self-Play (Silver et al., 2016-2017): The idea of a model playing against itself to improve. Like Go, spatial reasoning has deterministic rules — the geometry is the "rules of the game." SpatialEvo is self-play for spatial reasoning, with the DGE as the game engine.

Related paradigms

RLVR (RL with Verifiable Rewards): The broader class of RL methods where rewards come from a verifiable oracle (math checker, code executor, geometry engine) rather than a learned reward model. SpatialEvo is RLVR for spatial reasoning, with the DGE as the verifier.

Self-Improving LLMs (STaR, Self-Taught Reasoner): Models that improve by generating and filtering their own training data. SpatialEvo advances this by replacing self-filtering (consensus) with an external oracle (DGE), eliminating the error reinforcement problem.

3D Foundation Models: The broader effort to build models that understand 3D spatial structure. SpatialEvo contributes by showing that spatial reasoning can be trained without annotations, potentially unlocking the vast supply of unannotated 3D scan data (ScanNet, ARKitScenes, etc.) as free training fuel.

Key takeaway

SpatialEvo's legacy will likely be the insight, not just the system: Wherever ground truth is a deterministic consequence of known physics, you can build a zero-noise oracle and bypass the consensus bottleneck entirely. This applies beyond spatial reasoning — any domain with computable ground truth (physics simulation, molecular dynamics, circuit design) could use the same pattern. The self-evolving paradigm + deterministic verification = unlimited noise-free training data.

Cheat sheet

Core insight
3D spatial ground truth is deterministic — computable exactly from geometry, no model needed
DGE
Deterministic Geometric Environment: 16 task types, geometric validation rules, zero-noise oracle
Dual-role
Single model co-evolves as Questioner + Solver under DGE constraints via GRPO
Results
SOTA at 3B (51.1) and 7B (54.7) avg across 9 benchmarks, no general capability loss
Key ablation
Replacing DGE with consensus pseudo-labels: VSI-Bench collapses 46.1 → 18.8
What broader principle does SpatialEvo demonstrate about self-evolving AI systems?