Theory of Space — Veanors

Chapter 0: The Problem

Imagine you drop a robot into an unfamiliar apartment. No map. No overhead camera. Just a first-person view and the ability to move around. Could it build a mental map of the space — figure out where the couch is relative to the kitchen table, which direction the door faces, what's behind the corner it hasn't looked around yet?

This is something humans do effortlessly. You walk through a new house once and you know where things are. You can give directions, take shortcuts, and — crucially — if someone moves the furniture while you're out of the room, you update your mental model after seeing the changes.

But here's the thing about existing spatial reasoning benchmarks: they hand the agent a complete description of the environment. "There is a chair at (3,2) facing north. There is a table at (5,4) facing east." The agent just reasons over provided data. It never has to explore.

The gap in the literature: Benchmarks like VSI-Bench, SpartQA, and ScanQA test spatial reasoning — can you answer questions given spatial information? But they never test spatial intelligence — can you actively acquire, maintain, and revise spatial information through your own exploration? Theory of Space fills this gap.

This paper asks a deceptively simple question: can foundation models — GPT-5.2, Gemini-3 Pro, Claude-4.5 Sonnet, and others — actually explore a partially observable environment and build accurate internal spatial representations?

The answer, it turns out, is "much worse than you'd expect."

Why this formalization?

Previous spatial benchmarks give the model everything it needs in the prompt — a complete scene graph, a full 3D description, or a bird's-eye view image. Theory of Space deliberately withholds this. The agent starts with zero spatial knowledge and must actively choose actions to acquire it. This tests a fundamentally different capability: not "can you reason about space?" but "can you learn about space through your own actions?"

The choice to use a grid world (rather than continuous space) is deliberate: it eliminates perception noise and makes ground truth unambiguous. If a model fails here, it's not because the visual input was confusing — it's because the model cannot maintain coherent spatial beliefs over time. The text modality makes this even starker: the model receives perfectly formatted symbolic descriptions and still degrades.

What fundamental capability do existing spatial benchmarks fail to test?

Active exploration: the ability to build spatial knowledge through the agent's own actions in a partially observable environment
Language understanding
Image recognition accuracy

Chapter 1: The Key Insight

You've probably heard of Theory of Mind — the ability to understand that other agents have beliefs, desires, and intentions that may differ from your own. A classic test: Sally puts a marble in a basket, then leaves the room. Anne moves the marble to a box. Where will Sally look for the marble? Children under age 4 say "the box" (where the marble actually is). Older children say "the basket" (where Sally believes it is). That's Theory of Mind.

This paper introduces Theory of Space — the spatial analog. Instead of asking "does the agent understand what others believe about the world?", it asks: "does the agent have coherent spatial beliefs about the world, and can it update them?"

Theory of Space = An agent's ability to (1) construct spatial beliefs through active exploration, (2) revise those beliefs when the environment changes, and (3) exploit those beliefs for downstream spatial tasks. Just as Theory of Mind tests whether you can model another mind, Theory of Space tests whether you can model a physical space.

The key insight is that spatial intelligence isn't just reasoning — it's the full pipeline from perception to action to belief formation. You need to:

1. Decide where to look (exploration strategy)
2. Integrate observations over time (belief construction)
3. Know what you don't know (uncertainty awareness)
4. Update when things change (belief revision)
5. Use your map for tasks (exploitation)

Current models can do step 5 reasonably well when given a map. They struggle badly with steps 1-4.

What "spatial belief" means computationally

For a transformer-based LLM, a "spatial belief" isn't a data structure — it's an implicit pattern distributed across the model's activations during generation. When the model outputs a cognitive map JSON, it must reconstruct its spatial understanding from the conversation history on the fly. There is no explicit memory buffer, no persistent state between turns (beyond the growing context window). Every spatial fact the model "remembers" must be re-derived from the text each time it generates.

This is fundamentally different from how classical SLAM systems maintain beliefs: SLAM has an explicit pose graph and map data structure that persists and gets updated in-place. Foundation models must simulate this persistent state using only autoregressive text generation over a conversation history. The paper reveals that this simulation breaks down systematically as the history grows.

What is Theory of Space analogous to, and why?

Theory of Mind: just as ToM tests understanding of others' beliefs, Theory of Space tests whether an agent can build, revise, and exploit its own spatial beliefs
Theory of Relativity, because space is relative
Graph theory, because rooms form a graph

Chapter 2: The Three Abilities

Theory of Space decomposes spatial intelligence into three capabilities, each tested separately:

1. Construct — Build the Map

The agent is placed in an unknown multi-room grid. It must actively choose actions — move to a room, rotate, observe — to discover objects and their positions. At each step, the benchmark probes the agent's internal spatial belief by asking it to output a JSON cognitive map.

Formally, this is a POMDP (Partially Observable Markov Decision Process). The agent has a belief state b_t over the true world state s. Each observation o_t updates the belief via Bayes' rule. The goal is to construct a belief b_T that closely matches the true state s after T exploration steps.

2. Revise — Update When Things Change

This is the false belief paradigm adapted for space. After the agent builds an initial map, the environment is modified — objects are moved or rotated — and the agent is told "something has changed." It must re-explore and update its beliefs. The question: does the agent actually overwrite its old spatial beliefs, or does it exhibit belief inertia and cling to outdated information?

3. Exploit — Use the Map

Given a (possibly self-constructed) cognitive map, can the agent answer spatial questions? These range from simple ("which direction is the chair from the table?") to complex ("if you're standing at X facing north, what would you see?").

The POMDP framing: At each timestep t, the agent in state (position, orientation) takes action a_t ∈ {Goto, Rotate, Observe, Query}, receives observation o_t (what's visible in its 90° FOV), and updates its internal belief b_t. The environment is deterministic — the only uncertainty comes from partial observability. This makes it cleanly separable from stochastic dynamics.
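For reference, the belief update mentioned above can be written in the standard POMDP form. The symbols T (transition model), O (observation model), and η (normalizing constant) are not named in the text and are introduced here as assumptions. Because the environment is described as deterministic, T would put all of its mass on a single successor state, so the sum collapses to one term.

```latex
b_t(s') \;=\; \eta \; O(o_t \mid s', a_t) \sum_{s} T(s' \mid s, a_t)\, b_{t-1}(s),
\qquad
\eta^{-1} \;=\; \sum_{s'} O(o_t \mid s', a_t) \sum_{s} T(s' \mid s, a_t)\, b_{t-1}(s)
```

A foundation model never computes this expression explicitly. The benchmark only checks whether the cognitive map it reports behaves as if such an update had happened.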

What the model actually processes

Each "turn" in the evaluation, the model receives a growing conversation history:

System prompt: Environment rules, action space, output format specification (~500 tokens)
Observation history: All previous action-observation pairs. At step 20, this can reach 3,000-5,000 tokens in text mode, or 20 rendered images + text descriptions in vision mode.
Current observation: What the agent sees right now (text description or rendered image of 90-degree FOV)
Probe request: "Output your current cognitive map as JSON"

The model must output: (1) its next action, and (2) a structured JSON cognitive map listing all objects it believes exist, their positions, and orientations. The context grows with every step: by step 30, GPT-5.2 is processing ~8K tokens per turn. The benchmark uses each model's native API with temperature=0 for reproducibility.
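To make the growing input concrete, here is a minimal sketch of how one turn's prompt could be assembled from the four parts listed above. The function name `build_prompt`, the line labels, and the flat-string layout are illustrative assumptions, not the benchmark's actual harness. The word count is only a rough stand-in for tokens.

```python
PROBE = "Output your current cognitive map as JSON"

def build_prompt(system_prompt, history, current_observation):
    """Assemble one turn's input: rules, past steps, current view, probe."""
    lines = [system_prompt]
    for step, (action, observation) in enumerate(history, start=1):
        lines.append(f"[step {step}] action: {action}")
        lines.append(f"[step {step}] observation: {observation}")
    lines.append(f"[current] observation: {current_observation}")
    lines.append(PROBE)
    return "\n".join(lines)

# Each (action, observation) pair is an action and the view it produced.
steps = [
    ("Observe", "chair at front-left, near, facing away from you"),
    ("Rotate(90)", "table at front-right, far, facing toward you"),
    ("Rotate(90)", "nothing visible"),
]
history, sizes = [], []
for action, observation in steps:
    prompt = build_prompt("You explore a multi-room grid world.", history, observation)
    sizes.append(len(prompt.split()))  # rough size proxy, not a token count
    history.append((action, observation))
print(sizes)
```

The prompt is rebuilt from scratch each turn, which mirrors the point made earlier: nothing persists between turns except the text itself.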

Total API cost per scenario: approximately $2-5 for GPT-5.2 (30-50 turns x 5K-10K tokens each). The full benchmark suite (108 scenarios x 5 models) cost approximately $3,000 in API calls.

What makes the "Revise" ability test particularly revealing about foundation models?

It tests whether models can overwrite previously correct beliefs with new observations after the environment changes, exposing belief inertia
It tests if models can read JSON files
It tests whether models can rotate images

Chapter 3: Environment Design

The benchmark environment is a multi-room grid world. Think of it as a floor plan: N×M rooms arranged on a grid, connected by doorways. Each room contains objects with precise 2D coordinates and a cardinal orientation (which direction the object "faces").

Two parallel worlds

Every scenario runs in two modalities:

Text world: Observations are symbolic descriptions. "You are in Room (2,3) facing East. You see: chair at front-left, near, facing away from you. Table at front-right, far, facing toward you." This tests pure spatial reasoning without perceptual noise.
Vision world: Observations are rendered 3D images from ThreeDWorld (TDW). The agent sees actual room renders with furniture, walls, and lighting. This adds the perceptual bottleneck on top of the reasoning challenge.

Action space

The agent has four actions:

Goto(room): Move to an adjacent room
Rotate(angle): Turn 90°, 180°, or 270° within the current room
Observe: Look at the current 90° field of view and receive an observation
Query: Answer a spatial question (used during exploitation tasks)
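The movement actions can be sketched as pure functions on the agent's pose. Everything here is a hypothetical illustration: the `Pose` class, the clockwise rotation convention, and the adjacency test (Manhattan distance of 1, ignoring which walls actually have doorways) are assumptions that the text does not specify.

```python
from dataclasses import dataclass

HEADINGS = ["N", "E", "S", "W"]  # clockwise order (assumed convention)

@dataclass(frozen=True)
class Pose:
    room: tuple   # (row, col) of the current room
    heading: str  # one of HEADINGS

def rotate(pose, angle):
    """Turn clockwise by 90, 180, or 270 degrees within the current room."""
    if angle not in (90, 180, 270):
        raise ValueError("angle must be 90, 180, or 270")
    index = (HEADINGS.index(pose.heading) + angle // 90) % 4
    return Pose(pose.room, HEADINGS[index])

def goto(pose, room):
    """Move to an adjacent room, keeping the current heading (assumed)."""
    distance = abs(room[0] - pose.room[0]) + abs(room[1] - pose.room[1])
    if distance != 1:
        raise ValueError("target room is not adjacent")
    return Pose(room, pose.heading)

start = Pose((1, 1), "N")
print(rotate(start, 90))               # heading becomes E
print(goto(rotate(start, 270), (1, 2)))  # heading W, room (1, 2)
```

Observe and Query are omitted because they do not change the pose. They only produce an observation or an answer.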

Proxy agents (baselines)

To establish upper bounds on exploration quality, the paper designs two proxy agents:

Scout: Simple rotate-and-scan strategy. Enter a room, observe in each of the four 90° directions to cover all 360°, then move to the next unvisited room. Deterministic and thorough, but not information-efficient.
Strategist: Belief-driven edge coverage with AC-3 constraint propagation. Maintains an explicit uncertainty map and plans observations to maximize information gain. Uses about 9 steps per room on average.
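The Scout's rotate-and-scan behaviour can be sketched as a depth-first sweep over the room graph. This is an illustration under stated assumptions, not the paper's implementation: the name `scout_plan`, the adjacency-dictionary input, and the tuple encoding of actions are all hypothetical, and the sketch assumes four observations separated by three 90° rotations are enough to cover a room with a 90° field of view.

```python
def scout_plan(adjacency, start):
    """Return the action list for a depth-first rotate-and-scan sweep."""
    actions, visited = [], set()

    def scan(room):
        visited.add(room)
        actions.append(("Observe",))
        for _ in range(3):  # three turns plus the first view cover 360 degrees
            actions.append(("Rotate", 90))
            actions.append(("Observe",))
        for neighbour in adjacency[room]:
            if neighbour not in visited:
                actions.append(("Goto", neighbour))
                scan(neighbour)
                actions.append(("Goto", room))  # walk back before trying the next door

    scan(start)
    while actions and actions[-1][0] == "Goto":
        actions.pop()  # no need to walk back once every room has been scanned
    return actions

# A 2x2 grid whose four rooms are connected in a ring (hypothetical layout).
grid = {
    (0, 0): [(0, 1), (1, 0)],
    (0, 1): [(0, 0), (1, 1)],
    (1, 1): [(0, 1), (1, 0)],
    (1, 0): [(0, 0), (1, 1)],
}
plan = scout_plan(grid, (0, 0))
print(len(plan), "actions,", sum(a[0] == "Observe" for a in plan), "observations")
```

The sweep never skips a direction, even when a room is empty, which is why a strategy of this kind is thorough but not information-efficient.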

Environment scale and complexity

The benchmark uses grids from 2x2 (4 rooms) to 4x4 (16 rooms), with 3-8 objects per room. Each object has a position (cell coordinates), a type (chair, table, lamp, etc.), and a cardinal orientation (N/E/S/W). Total objects per scenario: 12-128. Exploration budget: 30-80 steps depending on grid size.

The 3D-rendered version uses ThreeDWorld (TDW) — a Unity-based simulation platform. Each observation renders at 512x512 resolution. Objects are selected from TDW's library (~200 unique 3D models) to ensure visual distinctness. Rendering is deterministic given the agent's position and orientation.

Why two modalities matter: The text/vision gap reveals how much of the spatial reasoning failure is due to perception vs. reasoning. If a model fails in text mode, it can't reason spatially. If it succeeds in text but fails in vision, the bottleneck is perception — extracting spatial relationships from pixels. Spoiler: the perception bottleneck is massive, especially for object orientation.

What does the Strategist proxy agent use to plan its exploration?

Belief-driven edge coverage with AC-3 constraint propagation: it maintains an uncertainty map and plans to maximize information gain
Random exploration with no planning
A pretrained neural network for navigation

Chapter 4: Cognitive Map Probing

This is the paper's core methodological innovation. At every step of exploration, the benchmark asks the model: "What do you believe the world looks like right now?"

The model must output a structured JSON object — a cognitive map — listing every object it believes exists, its position, and its orientation. This isn't a question about what the model saw on this step. It's a question about the model's accumulated belief across all steps.

How it works

Step 1: Agent enters the environment. No observations yet. Cognitive map = empty.
Step 2: Agent observes. Sees a chair at (3,2) facing north. Outputs cognitive map: {chair: {pos: [3,2], facing: "N"}}.
Step 3: Agent moves to next room, observes. Sees a table at (5,4) facing east. Cognitive map should now include both the chair and the table.
Step N: After all exploration, the final cognitive map should be a complete, accurate description of the environment.
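For contrast, here is what the accumulation in Steps 1 to 3 looks like for an agent with an explicit map data structure. This is a sketch of the ideal behaviour, not of how a foundation model works internally. The helper name `integrate` is hypothetical, and the map uses the shorthand format from Step 2.

```python
def integrate(cognitive_map, observation):
    """Return a new map in which fresh observations overwrite older entries."""
    merged = dict(cognitive_map)
    merged.update(observation)
    return merged

cognitive_map = {}  # Step 1: nothing observed yet
cognitive_map = integrate(cognitive_map, {"chair": {"pos": [3, 2], "facing": "N"}})  # Step 2
cognitive_map = integrate(cognitive_map, {"table": {"pos": [5, 4], "facing": "E"}})  # Step 3
print(cognitive_map)
```

With explicit state, the chair entry cannot change unless a new observation of the chair arrives. The benchmark tests whether a model reproduces that property from conversation history alone.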

Why probing at every step matters: By demanding a cognitive map at each step — not just at the end — the benchmark can measure stability. Does the model maintain correct beliefs it formed earlier? Or do correct early observations get corrupted by later exploration? (Spoiler: they do. Models exhibit systematic belief drift.)

The JSON format and scoring

The cognitive map output is a structured JSON with this schema:

{"objects": [{"type": "chair", "room": [2,3], "position": "front-left", "orientation": "N"}, ...]}

Scoring is component-wise: existence (is this object in the true map?), room (correct room?), position (correct relative position within the room?), orientation (correct facing direction?). Each component receives a binary score. The overall cognitive map accuracy is the average across all objects and components.
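A minimal sketch of component-wise scoring, following the description above. Several details are assumptions made for illustration: objects are matched by `type` (so each type is assumed to appear once), an object missing from the prediction scores zero on every component, and hallucinated extra objects are not penalized. The paper's actual matching and penalty rules may differ.

```python
ATTRIBUTES = ("room", "position", "orientation")

def score_map(predicted, truth):
    """Average binary component scores over all ground-truth objects."""
    predicted_by_type = {obj["type"]: obj for obj in predicted["objects"]}
    hits = {"existence": 0, "room": 0, "position": 0, "orientation": 0}
    for true_obj in truth["objects"]:
        guess = predicted_by_type.get(true_obj["type"])
        if guess is None:
            continue  # a missed object earns nothing on any component
        hits["existence"] += 1
        for key in ATTRIBUTES:
            hits[key] += int(guess.get(key) == true_obj[key])
    total = len(truth["objects"])
    scores = {name: count / total for name, count in hits.items()}
    scores["overall"] = sum(scores[name] for name in hits) / len(hits)
    return scores

truth = {"objects": [
    {"type": "chair", "room": [2, 3], "position": "front-left", "orientation": "N"},
    {"type": "table", "room": [2, 3], "position": "front-right", "orientation": "E"},
]}
predicted = {"objects": [
    {"type": "chair", "room": [2, 3], "position": "front-left", "orientation": "N"},
    {"type": "table", "room": [2, 3], "position": "front-right", "orientation": "S"},
]}
print(score_map(predicted, truth))
```

In this example only the table's orientation is wrong, so the orientation component is 0.5 while the other three are 1.0. That is the kind of attributable failure the structured format is meant to expose.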

This structured output format means that failures are precisely attributable. When a model scores 0.85 on existence but 0.35 on orientation, we know exactly what's breaking. The JSON requirement also tests whether models can maintain structured state across a long conversation — some models (particularly smaller ones) start producing malformed JSON after 20+ turns.

Evaluation dimensions

The probed cognitive maps are scored on multiple axes:

Correctness: Are the reported objects, positions, and orientations correct?
Perception: Can the model correctly extract spatial info from a single observation?
Self-tracking: Does the model know its own position and orientation?
Stability: Do previously correct beliefs remain correct in later steps?
Local ↔ Global consistency: Does the egocentric view map correctly onto allocentric coordinates?

Why does the benchmark probe the agent's cognitive map at every step, not just at the end?

To measure belief stability over time, detecting whether correct early observations degrade as exploration continues
To slow the agent down
To increase the token count

Chapter 5: Route vs Survey Knowledge

Cognitive science distinguishes two types of spatial knowledge:

Route knowledge (egocentric): "Turn left at the kitchen, walk past the bedroom, the bathroom is on your right." This is path-based, tied to your body and your movement through space.
Survey knowledge (allocentric): "The bathroom is north of the kitchen and east of the living room." This is map-like, independent of where you're standing or which way you're facing.

Humans build route knowledge first (by walking through a space) and gradually integrate it into survey knowledge (a bird's-eye mental map). The benchmark tests both.

The 8 evaluation tasks

Route-level (egocentric):

Action-to-view: "You take 2 steps forward and turn left. What do you see?"
View-to-action: "You want to see X. What actions should you take?"
Location-to-view: "You're at (3,2) facing east. Describe what you see."
View-to-location: "Given this view, where are you?"

Survey-level (allocentric):

Pairwise direction: "What direction is the chair from the table?"
Perspective taking: "From the table's perspective, where is the chair?"
Allocentric map: "Draw the top-down layout of the room."
Mental rotation: "If the room is rotated 90°, where is the chair now?"

The route → survey progression: In humans, survey knowledge requires mental integration of multiple egocentric views. Models show the same hierarchy of difficulty: route tasks are easier than survey tasks. But the gap is larger than expected — models struggle to integrate views into a coherent allocentric map, even from text descriptions.
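The integration step can be made concrete with a small sketch that converts an egocentric offset into allocentric coordinates and then answers a pairwise direction question. The coordinate convention (x grows east, y grows north), the numeric forward/right offsets, and both function names are assumptions for illustration. The benchmark's text observations use qualitative terms such as "front-left, near" instead.

```python
# Unit step (dx, dy) when moving forward with each heading; x east, y north (assumed).
FORWARD = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def ego_to_allo(agent_xy, heading, forward, right):
    """Map an offset given relative to the agent onto fixed world coordinates."""
    fx, fy = FORWARD[heading]
    rx, ry = fy, -fx  # the agent's right is its forward vector turned 90 degrees clockwise
    return (agent_xy[0] + forward * fx + right * rx,
            agent_xy[1] + forward * fy + right * ry)

def pairwise_direction(source, target):
    """Describe where target lies relative to source in allocentric terms."""
    dx, dy = target[0] - source[0], target[1] - source[1]
    north_south = "north" if dy > 0 else "south" if dy < 0 else ""
    east_west = "east" if dx > 0 else "west" if dx < 0 else ""
    return "-".join(part for part in (north_south, east_west) if part) or "same location"

# Two egocentric views taken from different poses (hypothetical numbers).
chair = ego_to_allo((3, 2), "E", forward=2, right=-1)  # two ahead, one to the left
table = ego_to_allo((5, 5), "S", forward=1, right=0)   # one ahead
print(chair, table, pairwise_direction(table, chair))
```

The pairwise answer only becomes available once both views are expressed in the same frame, which is the step that survey tasks require and route tasks do not.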

What's the difference between route knowledge and survey knowledge?

Route knowledge is egocentric path-based ("turn left at X"), while survey knowledge is allocentric map-based ("X is north of Y"), and models find survey knowledge much harder
Route knowledge is for outdoors, survey knowledge is for indoors
They are the same thing with different names

Chapter 6: The Active-Passive Gap

Here's the headline result. Take any model — say GPT-5.2. Give it a complete, correct description of the environment and ask spatial questions. It scores 0.57 (normalized). Now give it the same environment but make it actively explore to build its own map first. Performance drops to 0.46.

That's an 11-point drop just from having to explore instead of being told. And GPT-5.2 is one of the better models.

Exploration inefficiency

The numbers get worse when you look at how models explore. The Strategist proxy agent needs about 9 steps per room to achieve near-complete coverage. Foundation models take 14+ steps — and still achieve lower information gain. They explore redundantly: revisiting rooms they've already fully observed, rotating when they've already seen all four directions, observing the same view multiple times.

Concrete example from GPT-5.2 on a 3x3 grid: the model enters Room (1,1), observes (sees 2 objects), rotates 90 degrees, observes (sees 1 more object), rotates 90 degrees, observes (sees nothing new), rotates 90 degrees, observes again (repeating its initial view). That's 4 observations for a room that only needed 3 rotations. Then it goes to Room (1,2), observes twice... then returns to Room (1,1) and observes again. This pattern is pervasive — models lack a reliable "have I been here before?" signal.

More steps, less knowledge: Models use ~50% more exploration steps than the proxy baseline while constructing less accurate cognitive maps. The extra steps don't just waste time — they actively degrade previously correct beliefs (see Chapter 7). Exploration efficiency is not just about speed; it's about cognitive stability.

Premature termination

GPT-5.2 shows a distinctive failure mode: fast initial information gain followed by premature stopping. It gathers information quickly in the first few steps, then decides it's "done" before fully exploring. Gemini-3 Pro, by contrast, explores more methodically and achieves the best overall cognitive map quality — but still falls short of the Strategist proxy.

Why is the active-passive gap concerning for real-world deployment?

Real robots must actively explore and can't rely on pre-given maps. A model that reasons well from given data but explores poorly will fail in deployment.
The gap is only 11 points, which is negligible
Passive mode is always available in practice

Chapter 7: Belief Instability

This is perhaps the most surprising finding. You might expect that as a model explores more, its cognitive map gets better. That's true for the first few steps. But then something strange happens: beliefs that were correct early on start becoming incorrect later.

The stability decay pattern

At step 3, the model correctly observed a chair at (3,2) facing north. At step 7, after visiting other rooms, the probed cognitive map now says the chair is at (3,2) facing east. Nothing changed in the environment — the model simply corrupted its own memory.

The paper measures this with a stability score: for each object, track whether a correct belief at step t remains correct at step t+1, t+2, etc. Across all models, stability degrades over time. Objects observed early in exploration are most vulnerable — they spend the most steps "in memory" and accumulate the most drift.
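One way such a stability score could be computed is sketched below. The exact definition used in the paper is not given in the text, so this is an assumed variant: for each object, find the first probe at which the belief was correct, then measure what fraction of all later probes are still correct. The function name and input layout are hypothetical.

```python
def stability_score(correct_by_step):
    """Fraction of later probes that stay correct after a belief first becomes correct.

    correct_by_step maps an object id to a list of booleans, one per probe step.
    Returns None when no object was ever correct or nothing follows the first hit.
    """
    kept = total = 0
    for flags in correct_by_step.values():
        if True not in flags:
            continue  # never correct, so there is nothing to keep stable
        later = flags[flags.index(True) + 1:]
        kept += sum(later)
        total += len(later)
    return kept / total if total else None

probes = {
    "chair": [False, True, True, False, False],  # correct at step 2, drifts after step 3
    "table": [True, True, True, True, True],     # correct throughout
}
print(stability_score(probes))
```

Under this definition, an object observed early contributes more later probes to the denominator, so it has more opportunities to drift. That matches the observation that early objects are the most vulnerable.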

The perception bottleneck

Not all attributes drift equally. Object position (2D coordinates) is relatively stable — once a model correctly locates an object, it tends to remember the location. But object orientation (which direction it faces) is catastrophically unstable. Orientation perception accuracy in vision mode drops below 40% for most models.

Why orientation is hardest: Position is a "where" question — relatively easy to ground in both text ("front-left, near") and vision (pixel location). Orientation is a "which way" question — it requires understanding the object's intrinsic facing direction relative to the viewer's perspective, then mapping that to a cardinal direction. In vision mode, this means distinguishing the front vs. back of a rendered 3D object — a remarkably fragile perceptual judgment.

Quantified instability

Concrete numbers across 108 scenarios (4x4 grid, 8 objects per room, 50 exploration steps):

GPT-5.2: Stability score drops from 0.91 (step 5) to 0.72 (step 50), a 19-point decline.
Gemini-3 Pro: 0.89 → 0.78. Most stable overall, but still an 11-point decline.
Claude-4.5 Sonnet: 0.87 → 0.69, an 18-point decline and the lowest final stability of the three, with orientation hit hardest.

The degradation is not caused by context window overflow: all models have 128K+ context and the conversations fit within 15K tokens. It appears to be an inherent limitation of autoregressive generation: earlier observations get "overwritten" by later reasoning, even when no new contradictory evidence is presented. This resembles catastrophic forgetting, but within a single conversation rather than across training epochs.

Implications for long-horizon tasks

Belief instability means that longer exploration can actually hurt performance. There's an optimal exploration length beyond which additional steps degrade more beliefs than they form. This creates a cruel tradeoff: explore too little and you miss objects; explore too much and you corrupt the ones you found.

What is belief instability, and which object attribute is most affected?

Previously correct beliefs become incorrect over time without environmental changes, and orientation is most affected because it requires complex perspective reasoning
The model forgets all objects, and position is most affected
Beliefs change randomly, and color is most affected

Chapter 8: False Beliefs & Inertia

Chapter 7 showed that models corrupt their own beliefs even when nothing changes. This chapter asks: what happens when the environment actually changes?

The setup mirrors the classic Sally-Anne test from Theory of Mind research:

Agent explores and builds a cognitive map (Phase 1)
The environment is modified — objects are moved or rotated (the agent is told "something changed")
Agent must re-explore and update its cognitive map (Phase 2)

Belief inertia

The paper introduces a belief inertia metric: after re-exploration, does the agent's belief match the new state (correct update) or the old state (inertia)? A high inertia score means the model is clinging to obsolete information.
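The inertia metric described above amounts to a three-way classification of each changed attribute. The sketch below assumes a hypothetical input format (one record per changed object with `old`, `new`, and `belief` values, where `old` differs from `new`) and adds an "other" bucket for beliefs that match neither state. The paper's exact bookkeeping may differ.

```python
def inertia_rates(changed_objects):
    """Classify each post-change belief as updated, inertia, or other, and return rates."""
    counts = {"updated": 0, "inertia": 0, "other": 0}
    for record in changed_objects:
        if record["belief"] == record["new"]:
            counts["updated"] += 1   # belief matches the new state
        elif record["belief"] == record["old"]:
            counts["inertia"] += 1   # belief still matches the obsolete state
        else:
            counts["other"] += 1     # belief is wrong in some third way
    total = len(changed_objects)
    return {name: count / total for name, count in counts.items()}

# Orientation of four rotated objects after re-exploration (hypothetical data).
orientation_changes = [
    {"old": "N", "new": "E", "belief": "E"},
    {"old": "N", "new": "S", "belief": "N"},
    {"old": "W", "new": "N", "belief": "W"},
    {"old": "E", "new": "W", "belief": "S"},
]
print(inertia_rates(orientation_changes))
```

Separating "inertia" from "other" matters because only the first indicates clinging to the old state. A belief that is wrong in a new way points to a perception or reasoning error instead.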

Results are stark:

Position updates: Models are reasonably good at updating positions. If a chair moves from (3,2) to (5,4), most models update their map after re-observing. Inertia ~25%.
Orientation updates: Models are terrible at updating orientations. If a chair rotates from north-facing to east-facing, models overwhelmingly keep the old orientation. Inertia ~60% in vision mode.

Asymmetric inertia: Position inertia is moderate because position changes are perceptually salient — the object is in a visibly different location. Orientation inertia is severe because orientation changes are perceptually subtle — the object looks "almost the same" from many viewpoints. Models default to their prior belief when the new evidence is ambiguous, even when explicitly told something changed.

Text vs vision gap in revision

The text-vision modality gap is especially large for belief revision. In text mode, the observation explicitly says "chair is now facing east" — hard to miss. In vision mode, the model must perceive the orientation change from rendered pixels, which is exactly the perceptual capability that's weakest.

This creates a cascading failure: the perceptual bottleneck (Chapter 7) makes initial orientation beliefs unreliable, and belief inertia (this chapter) prevents those unreliable beliefs from being corrected even when the model gets a second chance.

Could you fix this with prompting?

The paper tests several prompting strategies:

Chain-of-thought: "Before outputting your map, reason step by step about what has changed." Helps with position updates (+5 pts) but does not improve orientation updates.
Explicit change tracking: "Maintain a separate 'changes detected' list." Marginal improvement (+2 pts overall); models often fail to populate the list correctly.
Forced re-observation: "Before answering, observe each changed room again." Improves position but makes orientation worse — repeated observations of the same object from different angles increase confusion about which way it faces.

The takeaway: belief inertia is not a prompting problem. It appears to be a fundamental limitation of how autoregressive models process and update implicit state over long conversations.

Why is belief inertia much worse for orientation than for position?

Position changes are perceptually salient (object in a different place), while orientation changes are subtle (object looks similar from many viewpoints), so models default to their prior
Orientation isn't part of the cognitive map
Models are trained on more position data

Chapter 9: Connections

Cognitive maps (Tolman, 1948; O'Keefe & Nadel, 1978)

The idea that animals build internal spatial representations goes back to Tolman's "cognitive map" hypothesis. Rats learn spatial layouts beyond simple stimulus-response associations: they form map-like representations that support shortcuts and detours. O'Keefe later discovered place cells in the hippocampus that fire at specific locations, and the Mosers' group subsequently identified grid cells in the entorhinal cortex (2005), which tile space in hexagonal patterns. Theory of Space asks whether foundation models develop anything analogous: functional cognitive maps, even without neural correlates.

Theory of Mind benchmarks

Theory of Space explicitly parallels Theory of Mind testing. Just as ToM benchmarks use false-belief tasks (Sally-Anne) to test whether models distinguish their own knowledge from others', Theory of Space uses false-belief tasks to test whether models distinguish their current observations from their prior beliefs. The finding that models exhibit belief inertia mirrors ToM findings that models struggle with false beliefs.

Spatial reasoning benchmarks

VSI-Bench tests video-based spatial understanding. SpartQA tests spatial reasoning in text. ScanQA tests 3D scene understanding. All of these provide complete spatial information and test reasoning. Theory of Space contributes the active construction dimension — the agent must acquire its own spatial information through exploration.

Embodied AI

The implications for embodied AI are direct. VLAs (Vision-Language-Action models) like RT-2, pi-0, and Octo are deployed in physical environments that are partially observable. If the underlying foundation models can't maintain stable spatial beliefs through active exploration, robot policies built on top of them inherit those limitations. The active-passive gap measured here points to where real-world deployments are likely to fail.

Model rankings

Across the full benchmark, Gemini-3 Pro achieves the best overall spatial intelligence — strongest exploration strategy, best cognitive map quality, lowest belief inertia. GPT-5.2 has the fastest initial information gain but suffers from premature termination and moderate instability. Claude-4.5 Sonnet, GLM-4.6V, and Qwen3-VL show competitive reasoning from given maps but larger active-passive gaps.

Complete scoring breakdown

Model	Passive (text)	Active (text)	Active (vision)	Belief Stability	Orientation Inertia
Gemini-3 Pro	0.54	0.48	0.39	0.78	0.48
GPT-5.2	0.57	0.46	0.37	0.72	0.55
Claude-4.5 Sonnet	0.52	0.41	0.34	0.69	0.62
GLM-4.6V	0.49	0.37	0.30	0.65	0.64
Qwen3-VL	0.47	0.36	0.28	0.63	0.67

Key observations: (1) Every model shows a text → vision drop of 7-9 points in active mode, which is the cost of perception alone. (2) Orientation inertia is at least 0.48 for every model, so no model reliably updates orientation beliefs. (3) The best model (Gemini-3 Pro) still achieves only 0.39 in active-vision mode, not far above the 0.25 chance level of a 4-option orientation task.
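The gaps can be recomputed directly from the table. The snippet below copies the three score columns and reports each difference in points (hundredths). The dictionary layout and the helper name `points` are illustrative choices.

```python
# Scores copied from the table above: (passive text, active text, active vision).
SCORES = {
    "Gemini-3 Pro": (0.54, 0.48, 0.39),
    "GPT-5.2": (0.57, 0.46, 0.37),
    "Claude-4.5 Sonnet": (0.52, 0.41, 0.34),
    "GLM-4.6V": (0.49, 0.37, 0.30),
    "Qwen3-VL": (0.47, 0.36, 0.28),
}

def points(higher, lower):
    """Difference between two scores, expressed in whole points."""
    return round((higher - lower) * 100)

gaps = {
    model: {
        "passive_to_active": points(passive, active_text),
        "text_to_vision": points(active_text, active_vision),
    }
    for model, (passive, active_text, active_vision) in SCORES.items()
}
for model, gap in gaps.items():
    print(f"{model}: passive→active {gap['passive_to_active']} pts, "
          f"text→vision {gap['text_to_vision']} pts")
```

Every text → vision difference falls between 7 and 9 points, and GPT-5.2's passive → active difference is the 11 points quoted in Chapter 6.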

The takeaway: Spatial intelligence is not a solved capability for foundation models. Active exploration, belief maintenance, and belief revision are fundamental weaknesses that no current model handles well. The gap between "reasoning from given data" and "reasoning from self-acquired data" is the next frontier for multimodal AI.

Which model achieves the best overall spatial intelligence across the Theory of Space benchmark?

Gemini-3 Pro: strongest exploration strategy, best cognitive map quality, and lowest belief inertia
GPT-5.2: fastest initial information gain
Claude-4.5 Sonnet: best text reasoning