Zhang, Huang, Wang et al. — ICLR 2026

Theory of Space

Can foundation models construct spatial beliefs through active exploration? A benchmark revealing that multimodal models reason well from given data but fail to build, revise, and exploit internal cognitive maps.

Prerequisites: Multimodal LLMs + Spatial reasoning basics
10 Chapters, 5+ Simulations

Chapter 0: The Problem

Imagine you drop a robot into an unfamiliar apartment. No map. No overhead camera. Just a first-person view and the ability to move around. Could it build a mental map of the space — figure out where the couch is relative to the kitchen table, which direction the door faces, what's behind the corner it hasn't looked around yet?

This is something humans do effortlessly. You walk through a new house once and you know where things are. You can give directions, take shortcuts, and — crucially — if someone moves the furniture while you're out of the room, you update your mental model after seeing the changes.

But here's the thing about existing spatial reasoning benchmarks: they hand the agent a complete description of the environment. "There is a chair at (3,2) facing north. There is a table at (5,4) facing east." The agent just reasons over provided data. It never has to explore.

The gap in the literature: Benchmarks like VSI-Bench, SpartQA, and ScanQA test spatial reasoning — can you answer questions given spatial information? But they never test spatial intelligence — can you actively acquire, maintain, and revise spatial information through your own exploration? Theory of Space fills this gap.

This paper asks a deceptively simple question: can foundation models — GPT-5.2, Gemini-3 Pro, Claude-4.5 Sonnet, and others — actually explore a partially observable environment and build accurate internal spatial representations?

The answer, it turns out, is "much worse than you'd expect."

Why this formalization?

Previous spatial benchmarks give the model everything it needs in the prompt — a complete scene graph, a full 3D description, or a bird's-eye view image. Theory of Space deliberately withholds this. The agent starts with zero spatial knowledge and must actively choose actions to acquire it. This tests a fundamentally different capability: not "can you reason about space?" but "can you learn about space through your own actions?"

The choice to use a grid world (rather than continuous space) is deliberate: it eliminates perception noise and makes ground truth unambiguous. If a model fails here, it's not because the visual input was confusing — it's because the model cannot maintain coherent spatial beliefs over time. The text modality makes this even starker: the model receives perfectly formatted symbolic descriptions and still degrades.

What fundamental capability do existing spatial benchmarks fail to test?

Chapter 1: The Key Insight

You've probably heard of Theory of Mind — the ability to understand that other agents have beliefs, desires, and intentions that may differ from your own. A classic test: Sally puts a marble in a basket, then leaves the room. Anne moves the marble to a box. Where will Sally look for the marble? Children under age 4 say "the box" (where the marble actually is). Older children say "the basket" (where Sally believes it is). That's Theory of Mind.

This paper introduces Theory of Space — the spatial analog. Instead of asking "does the agent understand what others believe about the world?", it asks: "does the agent have coherent spatial beliefs about the world, and can it update them?"

Theory of Space = An agent's ability to (1) construct spatial beliefs through active exploration, (2) revise those beliefs when the environment changes, and (3) exploit those beliefs for downstream spatial tasks. Just as Theory of Mind tests whether you can model another mind, Theory of Space tests whether you can model a physical space.

The key insight is that spatial intelligence isn't just reasoning — it's the full pipeline from perception to action to belief formation. You need to:

Current models can do step 5 reasonably well when given a map. They struggle badly with steps 1-4.

What "spatial belief" means computationally

For a transformer-based LLM, a "spatial belief" isn't a data structure — it's an implicit pattern distributed across the model's activations during generation. When the model outputs a cognitive map JSON, it must reconstruct its spatial understanding from the conversation history on the fly. There is no explicit memory buffer, no persistent state between turns (beyond the growing context window). Every spatial fact the model "remembers" must be re-derived from the text each time it generates.

This is fundamentally different from how classical SLAM systems maintain beliefs: SLAM has an explicit pose graph and map data structure that persists and gets updated in-place. Foundation models must simulate this persistent state using only autoregressive text generation over a conversation history. The paper reveals that this simulation breaks down systematically as the history grows.

What is Theory of Space analogous to, and why?

Chapter 2: The Three Abilities

Theory of Space decomposes spatial intelligence into three capabilities, each tested separately:

1. Construct — Build the Map

The agent is placed in an unknown multi-room grid. It must actively choose actions — move to a room, rotate, observe — to discover objects and their positions. At each step, the benchmark probes the agent's internal spatial belief by asking it to output a JSON cognitive map.

Formally, this is a POMDP (Partially Observable Markov Decision Process). The agent has a belief state b_t over the true world state s. Each observation o_t updates the belief via Bayes' rule. The goal is to construct a belief b_T that closely matches the true state s after T exploration steps.
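The Bayes-rule update in this paragraph can be sketched for a small discrete hypothesis set. This is only an illustration, not the benchmark's code: the hypothesis set (which way a chair faces), the `sees_front` sensor model, and all function names are assumptions made up for the example.

```python
from fractions import Fraction

def bayes_update(belief, likelihood, observation):
    """Return the posterior b_{t+1}(s), proportional to P(o | s) * b_t(s)."""
    unnormalized = {s: likelihood(observation, s) * p for s, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation is impossible under every hypothesis")
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical hypotheses: the chair faces N, E, S or W with equal prior belief.
prior = {h: Fraction(1, 4) for h in "NESW"}

def sees_front(observation, state):
    # Assumed deterministic sensor: an agent looking north sees the chair's
    # front only if the chair faces south (towards the agent).
    saw_front = observation == "front"
    return 1 if saw_front == (state == "S") else 0

# The agent did not see the front, so "S" is ruled out and the rest renormalize.
posterior = bayes_update(prior, sees_front, "back_or_side")
```

Because the environment is deterministic, each likelihood is 0 or 1, so the update amounts to discarding hypotheses that contradict the observation.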

2. Revise — Update When Things Change

This is the false belief paradigm adapted for space. After the agent builds an initial map, the environment is modified — objects are moved or rotated — and the agent is told "something has changed." It must re-explore and update its beliefs. The question: does the agent actually overwrite its old spatial beliefs, or does it exhibit belief inertia and cling to outdated information?

3. Exploit — Use the Map

Given a (possibly self-constructed) cognitive map, can the agent answer spatial questions? These range from simple ("which direction is the chair from the table?") to complex ("if you're standing at X facing north, what would you see?").
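A simple exploit query such as "which direction is the chair from the table?" reduces to coordinate arithmetic once a map exists. The sketch below assumes a convention the text does not specify (+x is east, +y is north), and `cardinal_direction` is a hypothetical helper name.

```python
def cardinal_direction(src, dst):
    """Compass direction of dst as seen from src (+x = east, +y = north, assumed)."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    if dx == 0 and dy == 0:
        return "same cell"
    ns = "north" if dy > 0 else "south" if dy < 0 else ""
    ew = "east" if dx > 0 else "west" if dx < 0 else ""
    return ns + ew  # e.g. "north", "southwest"

chair, table = (3, 2), (5, 4)
chair_from_table = cardinal_direction(table, chair)
table_from_chair = cardinal_direction(chair, table)
```

The arithmetic is trivial. What the benchmark tests is whether the model holds the right coordinates to begin with.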

The POMDP framing: At each timestep t, the agent in state (position, orientation) takes action a_t ∈ {Goto, Rotate, Observe, Query}, receives observation o_t (what's visible in its 90° FOV), and updates its internal belief b_t. The environment is deterministic, so the only uncertainty comes from partial observability. Failures therefore cannot be blamed on stochastic dynamics.
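To make the 90° field of view concrete, here is a minimal visibility check on a grid. It assumes the cone is centred on the agent's cardinal heading, treats the cone's edges as visible, and ignores walls and occlusion. None of these details are stated in the text, and `in_fov` is a hypothetical name.

```python
# Unit vectors for each cardinal heading (+x = east, +y = north, assumed).
HEADINGS = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def in_fov(agent, heading, target):
    """True if target lies within a 90-degree cone centred on heading.

    Integer-only test: the target is within +/-45 degrees when its forward
    component is positive and at least as large as its sideways component.
    """
    fx, fy = HEADINGS[heading]
    dx, dy = target[0] - agent[0], target[1] - agent[1]
    forward = dx * fx + dy * fy
    sideways = abs(dx * fy - dy * fx)
    return forward > 0 and sideways <= forward

ahead = in_fov((0, 0), "N", (1, 3))      # mostly in front of the agent
off_side = in_fov((0, 0), "N", (3, 1))   # mostly to the agent's right
behind = in_fov((0, 0), "N", (0, -2))    # directly behind
```

Under these assumptions the four cardinal headings together cover every direction, which is why one observation per heading is enough to see a whole room.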

What the model actually processes

Each "turn" in the evaluation, the model receives a growing conversation history:

The model must output: (1) its next action, and (2) a structured JSON cognitive map listing all objects it believes exist, their positions, and orientations. The context window grows with every step — by step 30, GPT-5.2 is processing ~8K tokens per turn. The benchmark uses the model's native API with temperature=0 for reproducibility.

Total API cost per scenario: approximately $2-5 for GPT-5.2 (30-50 turns x 5K-10K tokens each). The full benchmark suite (108 scenarios x 5 models) cost approximately $3,000 in API calls.

What makes the "Revise" ability test particularly revealing about foundation models?

Chapter 3: Environment Design

The benchmark environment is a multi-room grid world. Think of it as a floor plan: N×M rooms arranged on a grid, connected by doorways. Each room contains objects with precise 2D coordinates and a cardinal orientation (which direction the object "faces").

Two parallel worlds

Every scenario runs in two modalities:

Action space

The agent has four actions:

Proxy agents (baselines)

To establish upper bounds on exploration quality, the paper designs two proxy agents:

Environment scale and complexity

The benchmark uses grids from 2x2 (4 rooms) to 4x4 (16 rooms), with 3-8 objects per room. Each object has a position (cell coordinates), a type (chair, table, lamp, etc.), and a cardinal orientation (N/E/S/W). Total objects per scenario: 12-128. Exploration budget: 30-80 steps depending on grid size.
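The environment described here maps naturally onto a small data structure. The sketch below is one possible representation, with hypothetical class and field names. It also checks the object-count arithmetic: 4 rooms x 3 objects gives the minimum of 12, and 16 rooms x 8 objects gives the maximum of 128.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SceneObject:
    kind: str      # object type, e.g. "chair"
    cell: tuple    # (x, y) cell coordinates
    facing: str    # one of "N", "E", "S", "W"

@dataclass
class GridWorld:
    rows: int
    cols: int
    rooms: dict = field(default_factory=dict)  # (row, col) -> list of SceneObject

    def num_rooms(self):
        return self.rows * self.cols

    def total_objects(self):
        return sum(len(objs) for objs in self.rooms.values())

def object_count_range(side, per_room=(3, 8)):
    """Minimum and maximum total objects for a side x side grid."""
    rooms = side * side
    return rooms * per_room[0], rooms * per_room[1]

world = GridWorld(rows=2, cols=2)
world.rooms[(0, 0)] = [SceneObject("chair", (3, 2), "N")]
smallest = object_count_range(2)   # 2x2 grid
largest = object_count_range(4)    # 4x4 grid
```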

The 3D-rendered version uses ThreeDWorld (TDW) — a Unity-based simulation platform. Each observation renders at 512x512 resolution. Objects are selected from TDW's library (~200 unique 3D models) to ensure visual distinctness. Rendering is deterministic given the agent's position and orientation.

Why two modalities matter: The text/vision gap reveals how much of the spatial reasoning failure is due to perception vs. reasoning. If a model fails in text mode, it can't reason spatially. If it succeeds in text but fails in vision, the bottleneck is perception — extracting spatial relationships from pixels. Spoiler: the perception bottleneck is massive, especially for object orientation.

What does the Strategist proxy agent use to plan its exploration?

Chapter 4: Cognitive Map Probing

This is the paper's core methodological innovation. At every step of exploration, the benchmark asks the model: "What do you believe the world looks like right now?"

The model must output a structured JSON object — a cognitive map — listing every object it believes exists, its position, and its orientation. This isn't a question about what the model saw on this step. It's a question about the model's accumulated belief across all steps.

How it works

  1. Step 1: Agent enters the environment. No observations yet. Cognitive map = empty.
  2. Step 2: Agent observes. Sees a chair at (3,2) facing north. Outputs cognitive map: {chair: {pos: [3,2], facing: "N"}}.
  3. Step 3: Agent moves to next room, observes. Sees a table at (5,4) facing east. Cognitive map should now include both the chair and the table.
  4. Step N: After all exploration, the final cognitive map should be a complete, accurate description of the environment.

Why probing at every step matters: By demanding a cognitive map at each step, not just at the end, the benchmark can measure stability. Does the model maintain correct beliefs it formed earlier? Or do correct early observations get corrupted by later exploration? (Spoiler: they do. Models exhibit systematic belief drift.)
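For contrast, this is what the step-by-step accumulation looks like when beliefs live in an explicit store that is updated in place. It is a reference sketch of ideal bookkeeping, not how a language model works internally. `update_map` is a hypothetical helper, and the map uses the shorthand format from the walkthrough above.

```python
def update_map(cognitive_map, observation):
    """Merge one step's observation into the accumulated map (returns a new dict)."""
    merged = dict(cognitive_map)   # earlier beliefs persist unless re-observed
    merged.update(observation)     # newly seen or re-seen objects overwrite
    return merged

steps = [
    {},                                          # step 1: nothing observed yet
    {"chair": {"pos": [3, 2], "facing": "N"}},   # step 2: sees the chair
    {"table": {"pos": [5, 4], "facing": "E"}},   # step 3: sees the table
]

cognitive_map = {}
history = []   # one probed map per step, as in the benchmark's per-step probing
for obs in steps:
    cognitive_map = update_map(cognitive_map, obs)
    history.append(cognitive_map)
```

With an explicit store, the chair entry at step 3 is identical to the one written at step 2. Belief drift is the failure of this invariant.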

The JSON format and scoring

The cognitive map output is a structured JSON with this schema:

{"objects": [{"type": "chair", "room": [2,3], "position": "front-left", "orientation": "N"}, ...]}

Scoring is component-wise: existence (is this object in the true map?), room (correct room?), position (correct relative position within the room?), orientation (correct facing direction?). Each component receives a binary score. The overall cognitive map accuracy is the average across all objects and components.
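The component-wise scoring can be sketched as follows. The text does not say how objects are matched or how missing and hallucinated objects are handled, so this sketch assumes object types are unique, a missing object scores 0 on every component, and hallucinated objects are ignored. `score_map` is a hypothetical name.

```python
COMPONENTS = ("room", "position", "orientation")

def score_map(predicted, truth):
    """Average binary score over all true objects and four components.

    Both arguments map an object type to a dict with the COMPONENTS keys.
    Assumptions: unique object types, missing objects score 0 everywhere,
    hallucinated objects are ignored.
    """
    if not truth:
        raise ValueError("ground-truth map must contain at least one object")
    hits = {"existence": 0, **{c: 0 for c in COMPONENTS}}
    for name, true_obj in truth.items():
        pred_obj = predicted.get(name)
        if pred_obj is None:
            continue
        hits["existence"] += 1
        for c in COMPONENTS:
            hits[c] += int(pred_obj.get(c) == true_obj[c])
    breakdown = {c: h / len(truth) for c, h in hits.items()}
    overall = sum(breakdown.values()) / len(breakdown)
    return overall, breakdown

truth = {
    "chair": {"room": [2, 3], "position": "front-left", "orientation": "N"},
    "table": {"room": [2, 3], "position": "back-right", "orientation": "E"},
}
predicted = {
    "chair": {"room": [2, 3], "position": "front-left", "orientation": "E"},
}
overall, breakdown = score_map(predicted, truth)
```

Here the model found one of two objects and got its orientation wrong, so existence, room and position each score 0.5, orientation scores 0, and the overall accuracy is 0.375.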

This structured output format means that failures are precisely attributable. When a model scores 0.85 on existence but 0.35 on orientation, we know exactly what's breaking. The JSON requirement also tests whether models can maintain structured state across a long conversation — some models (particularly smaller ones) start producing malformed JSON after 20+ turns.
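Detecting malformed output is straightforward with the standard `json` module. The key names below follow the schema shown above. The validation rules and the `parse_cognitive_map` name are assumptions for illustration.

```python
import json

REQUIRED_KEYS = {"type", "room", "position", "orientation"}

def parse_cognitive_map(raw):
    """Return the list of objects, or None if the output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    objects = data.get("objects") if isinstance(data, dict) else None
    if not isinstance(objects, list):
        return None
    for obj in objects:
        if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
            return None
        if obj["orientation"] not in {"N", "E", "S", "W"}:
            return None
    return objects

good = ('{"objects": [{"type": "chair", "room": [2, 3], '
        '"position": "front-left", "orientation": "N"}]}')
truncated = '{"objects": [{"type": "chair", "room": [2, 3]'
parsed_good = parse_cognitive_map(good)
parsed_truncated = parse_cognitive_map(truncated)
```

A harness built this way can count malformed maps separately from wrong ones, so formatting failures are not mistaken for spatial errors.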

Evaluation dimensions

The probed cognitive maps are scored on multiple axes:

Why does the benchmark probe the agent's cognitive map at every step, not just at the end?

Chapter 5: Route vs Survey Knowledge

Cognitive science distinguishes two types of spatial knowledge:

Humans build route knowledge first (by walking through a space) and gradually integrate it into survey knowledge (a bird's-eye mental map). The benchmark tests both.

The 8 evaluation tasks

Route-level (egocentric):

Survey-level (allocentric):

The route → survey progression: In humans, survey knowledge requires mental integration of multiple egocentric views. Models show the same hierarchy of difficulty: route tasks are easier than survey tasks. But the gap is larger than expected — models struggle to integrate views into a coherent allocentric map, even from text descriptions.

What's the difference between route knowledge and survey knowledge?

Chapter 6: The Active-Passive Gap

Here's the headline result. Take any model — say GPT-5.2. Give it a complete, correct description of the environment and ask spatial questions. It scores 0.57 (normalized). Now give it the same environment but make it actively explore to build its own map first. Performance drops to 0.46.

That's an 11-point drop just from having to explore instead of being told. And GPT-5.2 is one of the better models.

Exploration inefficiency

The numbers get worse when you look at how models explore. The Strategist proxy agent needs about 9 steps per room to achieve near-complete coverage. Foundation models take 14+ steps — and still achieve lower information gain. They explore redundantly: revisiting rooms they've already fully observed, rotating when they've already seen all four directions, observing the same view multiple times.

Concrete example from GPT-5.2 on a 3x3 grid: the model enters Room (1,1), observes (sees 2 objects), rotates 90 degrees, observes (sees 1 more object), rotates 90 degrees, observes (sees nothing new), rotates 90 degrees, observes again (repeating its initial view). That's 4 observations for a room that only needed 3 rotations. Then it goes to Room (1,2), observes twice... then returns to Room (1,1) and observes again. This pattern is pervasive — models lack a reliable "have I been here before?" signal.

More steps, less knowledge: Models use ~50% more exploration steps than the proxy baseline while constructing less accurate cognitive maps. The extra steps don't just waste time — they actively degrade previously correct beliefs (see Chapter 7). Exploration efficiency is not just about speed; it's about cognitive stability.
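One way to quantify redundant exploration is to count observations of a (room, heading) view the agent has already seen. The trace format, the assumption that every Rotate is a 90° clockwise turn from an initial north heading, and the `redundant_observations` name are all hypothetical. The trace is an invented example, not data from the paper.

```python
def redundant_observations(trace):
    """Count Observe actions whose (room, heading) view was already observed.

    trace is a list of (action, argument) pairs. The agent is assumed to start
    facing north, and every Rotate turns it 90 degrees clockwise.
    """
    order = "NESW"
    room, heading = None, 0
    seen, redundant = set(), 0
    for action, arg in trace:
        if action == "Goto":
            room = arg
        elif action == "Rotate":
            heading = (heading + 1) % 4
        elif action == "Observe":
            view = (room, order[heading])
            if view in seen:
                redundant += 1
            seen.add(view)
    return redundant

trace = [
    ("Goto", (1, 1)), ("Observe", None),
    ("Rotate", None), ("Observe", None),
    ("Goto", (1, 2)), ("Observe", None), ("Observe", None),  # same view twice
    ("Goto", (1, 1)), ("Observe", None),                     # revisits a seen view
]
wasted = redundant_observations(trace)
```

In this trace two of the five observations add no information. An explicit `seen` set is exactly the "have I been here before?" signal that the models appear to lack.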

Premature termination

GPT-5.2 shows a distinctive failure mode: fast initial information gain followed by premature stopping. It gathers information quickly in the first few steps, then decides it's "done" before fully exploring. Gemini-3 Pro, by contrast, explores more methodically and achieves the best overall cognitive map quality — but still falls short of the Strategist proxy.

Why is the active-passive gap concerning for real-world deployment?

Chapter 7: Belief Instability

This is perhaps the most surprising finding. You might expect that as a model explores more, its cognitive map gets better. That's true for the first few steps. But then something strange happens: beliefs that were correct early on start becoming incorrect later.

The stability decay pattern

At step 3, the model correctly observed a chair at (3,2) facing north. At step 7, after visiting other rooms, the probed cognitive map now says the chair is at (3,2) facing east. Nothing changed in the environment — the model simply corrupted its own memory.

The paper measures this with a stability score: for each object, track whether a correct belief at step t remains correct at step t+1, t+2, etc. Across all models, stability degrades over time. Objects observed early in exploration are most vulnerable — they spend the most steps "in memory" and accumulate the most drift.
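A simplified version of such a stability score can be sketched as one-step retention: of all beliefs that are correct at step t, what fraction is still correct at step t+1? The paper's exact definition may differ. The input format and the `stability_score` name are assumptions.

```python
def stability_score(correct_by_step):
    """Fraction of correct beliefs that are still correct one step later.

    correct_by_step maps an object name to a list of booleans, one per probe
    step, saying whether the probed belief about that object was correct.
    """
    kept = total = 0
    for flags in correct_by_step.values():
        for now, later in zip(flags, flags[1:]):
            if now:
                total += 1
                kept += int(later)
    return kept / total if total else None

# Hypothetical probe results: the chair belief is corrupted at the fifth probe.
probes = {
    "chair": [True, True, True, True, False, False],
    "table": [False, True, True],
}
score = stability_score(probes)
```

In this example four of the five correct-at-t beliefs survive to the next step, giving a score of 0.8. The single loss is the chair, the object held in memory longest.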

The perception bottleneck

Not all attributes drift equally. Object position (2D coordinates) is relatively stable — once a model correctly locates an object, it tends to remember the location. But object orientation (which direction it faces) is catastrophically unstable. Orientation perception accuracy in vision mode drops below 40% for most models.

Why orientation is hardest: Position is a "where" question — relatively easy to ground in both text ("front-left, near") and vision (pixel location). Orientation is a "which way" question — it requires understanding the object's intrinsic facing direction relative to the viewer's perspective, then mapping that to a cardinal direction. In vision mode, this means distinguishing the front vs. back of a rendered 3D object — a remarkably fragile perceptual judgment.

Quantified instability

Concrete numbers across 108 scenarios (4x4 grid, 8 objects per room, 50 exploration steps):

The degradation is not caused by context window overflow: all models have 128K+ context and the conversations fit within 15K tokens. It appears to be an inherent limitation of autoregressive generation: earlier observations get "overwritten" by later reasoning, even when no new contradictory evidence is presented. This resembles catastrophic forgetting, but it happens within a single conversation rather than across training epochs.

Implications for long-horizon tasks

Belief instability means that longer exploration can actually hurt performance. There's an optimal exploration length beyond which additional steps degrade more beliefs than they form. This creates a cruel tradeoff: explore too little and you miss objects; explore too much and you corrupt the ones you found.

What is belief instability, and which object attribute is most affected?

Chapter 8: False Beliefs & Inertia

Chapter 7 showed that models corrupt their own beliefs even when nothing changes. This chapter asks: what happens when the environment actually changes?

The setup mirrors the classic Sally-Anne test from Theory of Mind research:

  1. Agent explores and builds a cognitive map (Phase 1)
  2. The environment is modified — objects are moved or rotated (the agent is told "something changed")
  3. Agent must re-explore and update its cognitive map (Phase 2)

Belief inertia

The paper introduces a belief inertia metric: after re-exploration, does the agent's belief match the new state (correct update) or the old state (inertia)? A high inertia score means the model is clinging to obsolete information.
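The inertia metric can be sketched as a three-way classification over the objects that actually changed. The input format, the "other" bucket for beliefs matching neither state, and the `belief_inertia` name are assumptions. The paper's exact formula is not given in the text.

```python
def belief_inertia(beliefs, old_state, new_state):
    """Classify post-revision beliefs about changed objects.

    All three arguments map an object name to an attribute value (for example
    an orientation). Only objects whose value actually changed are counted.
    """
    changed = [k for k in new_state if old_state.get(k) != new_state[k]]
    counts = {"updated": 0, "inertia": 0, "other": 0}
    for name in changed:
        belief = beliefs.get(name)
        if belief == new_state[name]:
            counts["updated"] += 1      # belief matches the new state
        elif belief == old_state.get(name):
            counts["inertia"] += 1      # belief still matches the old state
        else:
            counts["other"] += 1        # belief matches neither
    n = len(changed)
    return {k: v / n for k, v in counts.items()} if n else counts

old = {"chair": "N", "table": "E", "lamp": "S"}
new = {"chair": "E", "table": "E", "lamp": "W"}     # chair and lamp were rotated
after_reexploration = {"chair": "N", "table": "E", "lamp": "W"}
result = belief_inertia(after_reexploration, old, new)
```

Here the lamp was updated correctly but the chair belief still matches the old state, so the inertia score is 0.5. The unchanged table is excluded from the denominator.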

Results are stark:

Asymmetric inertia: Position inertia is moderate because position changes are perceptually salient — the object is in a visibly different location. Orientation inertia is severe because orientation changes are perceptually subtle — the object looks "almost the same" from many viewpoints. Models default to their prior belief when the new evidence is ambiguous, even when explicitly told something changed.

Text vs vision gap in revision

The text-vision modality gap is especially large for belief revision. In text mode, the observation explicitly says "chair is now facing east" — hard to miss. In vision mode, the model must perceive the orientation change from rendered pixels, which is exactly the perceptual capability that's weakest.

This creates a cascading failure: the perceptual bottleneck (Chapter 7) makes initial orientation beliefs unreliable, and belief inertia (this chapter) prevents those unreliable beliefs from being corrected even when the model gets a second chance.

Could you fix this with prompting?

The paper tests several prompting strategies:

The takeaway: belief inertia is not a prompting problem. It appears to be a fundamental limitation of how autoregressive models process and update implicit state over long conversations.

Why is belief inertia much worse for orientation than for position?

Chapter 9: Connections

Cognitive maps (Tolman, 1948; O'Keefe & Nadel, 1978)

The idea that animals build internal spatial representations goes back to Tolman's "cognitive map" hypothesis. Rats learn spatial layouts beyond simple stimulus-response associations: they form map-like representations that support shortcuts and detours. O'Keefe later discovered place cells in the hippocampus that fire at specific locations, and the Mosers' lab subsequently found grid cells (Moser & Moser, 2005) in the entorhinal cortex that tile space in hexagonal patterns. Theory of Space asks whether foundation models develop anything analogous: functional cognitive maps, even without neural correlates.

Theory of Mind benchmarks

Theory of Space explicitly parallels Theory of Mind testing. Just as ToM benchmarks use false-belief tasks (Sally-Anne) to test whether models distinguish their own knowledge from others', Theory of Space uses false-belief tasks to test whether models distinguish their current observations from their prior beliefs. The finding that models exhibit belief inertia mirrors ToM findings that models struggle with false beliefs.

Spatial reasoning benchmarks

VSI-Bench tests video-based spatial understanding. SpartQA tests spatial reasoning in text. ScanQA tests 3D scene understanding. All of these provide complete spatial information and test reasoning. Theory of Space contributes the active construction dimension — the agent must acquire its own spatial information through exploration.

Embodied AI

The implications for embodied AI are direct. VLAs (Vision-Language-Action models) like RT-2, pi-0, and Octo are deployed in physical environments that are partially observable. If the underlying foundation models can't maintain stable spatial beliefs through active exploration, robot policies built on top of them inherit those limitations. The active-passive gap measured here predicts real-world deployment failures.

Model rankings

Across the full benchmark, Gemini-3 Pro achieves the best overall spatial intelligence — strongest exploration strategy, best cognitive map quality, lowest belief inertia. GPT-5.2 has the fastest initial information gain but suffers from premature termination and moderate instability. Claude-4.5 Sonnet, GLM-4.6V, and Qwen3-VL show competitive reasoning from given maps but larger active-passive gaps.

Complete scoring breakdown

Model              Passive (text)  Active (text)  Active (vision)  Belief Stability  Orientation Inertia
Gemini-3 Pro       0.54            0.48           0.39             0.78              0.48
GPT-5.2            0.57            0.46           0.37             0.72              0.55
Claude-4.5 Sonnet  0.52            0.41           0.34             0.69              0.62
GLM-4.6V           0.49            0.37           0.30             0.65              0.64
Qwen3-VL           0.47            0.36           0.28             0.63              0.67

Key observations: (1) Every model shows a text → vision drop of 7-9 points in active mode, which is pure perception cost. (2) Orientation inertia is at least 0.48 for all models, so no model reliably updates orientation beliefs. (3) The best model (Gemini-3 Pro) still achieves only 0.39 in active-vision mode; for reference, chance on a 4-option orientation task is 0.25.
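The gaps quoted in this chapter can be recomputed directly from the scoring breakdown. The numbers below are copied from the table above. The `gaps_in_points` helper is a hypothetical name, and "points" means hundredths of the normalized score.

```python
scores = {
    # model: (passive text, active text, active vision)
    "Gemini-3 Pro":      (0.54, 0.48, 0.39),
    "GPT-5.2":           (0.57, 0.46, 0.37),
    "Claude-4.5 Sonnet": (0.52, 0.41, 0.34),
    "GLM-4.6V":          (0.49, 0.37, 0.30),
    "Qwen3-VL":          (0.47, 0.36, 0.28),
}

def gaps_in_points(passive, active_text, active_vision):
    """Return (active-passive gap, text-to-vision drop) in hundredths."""
    return (round((passive - active_text) * 100),
            round((active_text - active_vision) * 100))

gaps = {model: gaps_in_points(*row) for model, row in scores.items()}
for model, (active_gap, vision_drop) in gaps.items():
    print(f"{model}: active-passive gap {active_gap}, text-to-vision drop {vision_drop}")
```

With these numbers the text-to-vision drop is 7 to 9 points for every model, and the active-passive gap in text mode ranges from 6 points (Gemini-3 Pro) to 12 points (GLM-4.6V).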

The takeaway: Spatial intelligence is not a solved capability for foundation models. Active exploration, belief maintenance, and belief revision are fundamental weaknesses that no current model handles well. The gap between "reasoning from given data" and "reasoning from self-acquired data" is the next frontier for multimodal AI.

Which model achieves the best overall spatial intelligence across the Theory of Space benchmark?