Can foundation models construct spatial beliefs through active exploration? A benchmark revealing that multimodal models reason well from given data but fail to build, revise, and exploit internal cognitive maps.
Imagine you drop a robot into an unfamiliar apartment. No map. No overhead camera. Just a first-person view and the ability to move around. Could it build a mental map of the space — figure out where the couch is relative to the kitchen table, which direction the door faces, what's behind the corner it hasn't looked around yet?
This is something humans do effortlessly. You walk through a new house once and you know where things are. You can give directions, take shortcuts, and — crucially — if someone moves the furniture while you're out of the room, you update your mental model after seeing the changes.
But here's the thing about existing spatial reasoning benchmarks: they hand the agent a complete description of the environment. "There is a chair at (3,2) facing north. There is a table at (5,4) facing east." The agent just reasons over provided data. It never has to explore.
This paper asks a deceptively simple question: can foundation models — GPT-5.2, Gemini-3 Pro, Claude-4.5 Sonnet, and others — actually explore a partially observable environment and build accurate internal spatial representations?
The answer, it turns out, is "much worse than you'd expect."
Previous spatial benchmarks give the model everything it needs in the prompt — a complete scene graph, a full 3D description, or a bird's-eye view image. Theory of Space deliberately withholds this. The agent starts with zero spatial knowledge and must actively choose actions to acquire it. This tests a fundamentally different capability: not "can you reason about space?" but "can you learn about space through your own actions?"
The choice to use a grid world (rather than continuous space) is deliberate: it eliminates perception noise and makes ground truth unambiguous. If a model fails here, it's not because the visual input was confusing — it's because the model cannot maintain coherent spatial beliefs over time. The text modality makes this even starker: the model receives perfectly formatted symbolic descriptions and still degrades.
You've probably heard of Theory of Mind — the ability to understand that other agents have beliefs, desires, and intentions that may differ from your own. A classic test: Sally puts a marble in a basket, then leaves the room. Anne moves the marble to a box. Where will Sally look for the marble? Children under age 4 say "the box" (where the marble actually is). Older children say "the basket" (where Sally believes it is). That's Theory of Mind.
This paper introduces Theory of Space — the spatial analog. Instead of asking "does the agent understand what others believe about the world?", it asks: "does the agent have coherent spatial beliefs about the world, and can it update them?"
The key insight is that spatial intelligence isn't just reasoning — it's the full pipeline from perception to action to belief formation. You need to:
Current models can do step 5 reasonably well when given a map. They struggle badly with steps 1-4.
For a transformer-based LLM, a "spatial belief" isn't a data structure — it's an implicit pattern distributed across the model's activations during generation. When the model outputs a cognitive map JSON, it must reconstruct its spatial understanding from the conversation history on the fly. There is no explicit memory buffer, no persistent state between turns (beyond the growing context window). Every spatial fact the model "remembers" must be re-derived from the text each time it generates.
This is fundamentally different from how classical SLAM systems maintain beliefs: SLAM has an explicit pose graph and map data structure that persists and gets updated in-place. Foundation models must simulate this persistent state using only autoregressive text generation over a conversation history. The paper reveals that this simulation breaks down systematically as the history grows.
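To make the contrast concrete, here is a minimal sketch of the two styles of state keeping. The class `ExplicitMap`, the function `rederive_from_history`, and the tuple layout are illustrative assumptions, not code from the paper or from any SLAM library.

```python
from typing import Dict, List, Tuple

Pose = Tuple[Tuple[int, int], str]               # (cell, facing)
Observation = Tuple[str, Tuple[int, int], str]   # (object name, cell, facing)


class ExplicitMap:
    """SLAM-style persistent state: each observation is written once and stays put."""

    def __init__(self) -> None:
        self.objects: Dict[str, Pose] = {}

    def update(self, obs: Observation) -> None:
        name, cell, facing = obs
        self.objects[name] = (cell, facing)  # in-place update of stored state


def rederive_from_history(history: List[Observation]) -> Dict[str, Pose]:
    """Stateless alternative: rebuild the whole map from the full log on every turn."""
    rebuilt: Dict[str, Pose] = {}
    for name, cell, facing in history:
        rebuilt[name] = (cell, facing)
    return rebuilt


history = [("chair", (3, 2), "N"), ("table", (5, 4), "E")]

persistent = ExplicitMap()
for obs in history:
    persistent.update(obs)

replayed = rederive_from_history(history)
```

With a perfect replay the two approaches agree. The point of the comparison is that a language model has to perform something like the second function implicitly on every turn, and nothing guarantees that its replay is exact.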
Theory of Space decomposes spatial intelligence into three capabilities, each tested separately:
The agent is placed in an unknown multi-room grid. It must actively choose actions — move to a room, rotate, observe — to discover objects and their positions. At each step, the benchmark probes the agent's internal spatial belief by asking it to output a JSON cognitive map.
Formally, this is a POMDP (Partially Observable Markov Decision Process). The agent has a belief state b_t over the true world state s. Each observation o_t updates the belief via Bayes' rule. The goal is to construct a belief b_T that closely matches the true state s after T exploration steps.
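As a rough illustration of that belief update, the sketch below applies Bayes' rule to a discrete belief over a single object's facing direction. The sensor model (a 0.7 chance of reading the true orientation, with errors spread evenly) and the function name `bayes_update` are assumptions made for this example; the paper does not specify them.

```python
from typing import Dict

ORIENTATIONS = ["N", "E", "S", "W"]


def bayes_update(belief: Dict[str, float], observed: str,
                 p_correct: float = 0.7) -> Dict[str, float]:
    """Return the posterior over orientations after one noisy observation."""
    # Assumed sensor model: the true orientation is reported with probability
    # p_correct; the remaining mass is split evenly over the other three.
    p_wrong = (1.0 - p_correct) / (len(ORIENTATIONS) - 1)
    unnormalized = {
        o: belief[o] * (p_correct if o == observed else p_wrong)
        for o in ORIENTATIONS
    }
    total = sum(unnormalized.values())
    return {o: v / total for o, v in unnormalized.items()}


prior = {o: 0.25 for o in ORIENTATIONS}   # no knowledge before exploring
posterior = bayes_update(prior, "N")      # belief in "N" rises to about 0.7
twice = bayes_update(posterior, "N")      # a second consistent reading sharpens it further
```

An ideal Bayesian agent never loses confidence in a correct belief unless it sees contradicting evidence, which is a useful reference point for the instability results discussed later.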
This is the false belief paradigm adapted for space. After the agent builds an initial map, the environment is modified — objects are moved or rotated — and the agent is told "something has changed." It must re-explore and update its beliefs. The question: does the agent actually overwrite its old spatial beliefs, or does it exhibit belief inertia and cling to outdated information?
Given a (possibly self-constructed) cognitive map, can the agent answer spatial questions? These range from simple ("which direction is the chair from the table?") to complex ("if you're standing at X facing north, what would you see?").
Each "turn" in the evaluation, the model receives a growing conversation history:
The model must output: (1) its next action, and (2) a structured JSON cognitive map listing all objects it believes exist, their positions, and orientations. The context window grows with every step — by step 30, GPT-5.2 is processing ~8K tokens per turn. The benchmark uses the model's native API with temperature=0 for reproducibility.
Total API cost per scenario: approximately $2-5 for GPT-5.2 (30-50 turns x 5K-10K tokens each). The full benchmark suite (108 scenarios x 5 models) cost approximately $3,000 in API calls.
The benchmark environment is a multi-room grid world. Think of it as a floor plan: N×M rooms arranged on a grid, connected by doorways. Each room contains objects with precise 2D coordinates and a cardinal orientation (which direction the object "faces").
Every scenario runs in two modalities:
The agent has four actions:
To establish upper bounds on exploration quality, the paper designs two proxy agents:
The benchmark uses grids from 2x2 (4 rooms) to 4x4 (16 rooms), with 3-8 objects per room. Each object has a position (cell coordinates), a type (chair, table, lamp, etc.), and a cardinal orientation (N/E/S/W). Total objects per scenario: 12-128. Exploration budget: 30-80 steps depending on grid size.
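One way to picture a scenario is as a small set of nested records. The sketch below is a hypothetical representation (the class names, field names, and `object_count_bounds` helper are assumptions, not the benchmark's actual code) and it reproduces the 12-128 object range from the stated grid sizes and per-room counts.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneObject:
    kind: str                 # e.g. "chair", "table", "lamp"
    cell: Tuple[int, int]     # cell coordinates
    facing: str               # one of "N", "E", "S", "W"


@dataclass
class Room:
    grid_pos: Tuple[int, int]                       # room index on the N x M grid
    objects: List[SceneObject] = field(default_factory=list)


def object_count_bounds(rows: int, cols: int,
                        min_per_room: int = 3,
                        max_per_room: int = 8) -> Tuple[int, int]:
    """Smallest and largest total object count for a rows x cols grid of rooms."""
    rooms = rows * cols
    return rooms * min_per_room, rooms * max_per_room


smallest = object_count_bounds(2, 2)   # (12, 32)
largest = object_count_bounds(4, 4)    # (48, 128)

example_room = Room(grid_pos=(0, 0),
                    objects=[SceneObject("chair", (3, 2), "N")])
```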
The 3D-rendered version uses ThreeDWorld (TDW) — a Unity-based simulation platform. Each observation renders at 512x512 resolution. Objects are selected from TDW's library (~200 unique 3D models) to ensure visual distinctness. Rendering is deterministic given the agent's position and orientation.
This is the paper's core methodological innovation. At every step of exploration, the benchmark asks the model: "What do you believe the world looks like right now?"
The model must output a structured JSON object — a cognitive map — listing every object it believes exists, its position, and its orientation. This isn't a question about what the model saw on this step. It's a question about the model's accumulated belief across all steps.
The cognitive map output is a structured JSON with this schema:
Scoring is component-wise: existence (is this object in the true map?), room (correct room?), position (correct relative position within the room?), orientation (correct facing direction?). Each component receives a binary score. The overall cognitive map accuracy is the average across all objects and components.
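The component-wise scoring can be sketched as follows. The map layout (object name mapped to `room`, `position`, `orientation`) is a hypothetical stand-in for the benchmark's schema, and the handling of missed or hallucinated objects (every component scores 0) is an assumption made here for simplicity.

```python
from typing import Any, Dict

COMPONENTS = ("existence", "room", "position", "orientation")


def score_map(predicted: Dict[str, Dict[str, Any]],
              truth: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
    """Binary score per component, averaged over objects, then over components."""
    names = sorted(set(predicted) | set(truth))
    totals = {c: 0.0 for c in COMPONENTS}
    for name in names:
        p, t = predicted.get(name), truth.get(name)
        if p is None or t is None:
            continue  # assumption: missed or hallucinated objects score 0 everywhere
        totals["existence"] += 1
        for key in ("room", "position", "orientation"):
            totals[key] += p.get(key) == t.get(key)  # True counts as 1
    n = len(names)
    scores = {c: (totals[c] / n if n else 0.0) for c in COMPONENTS}
    scores["overall"] = sum(scores[c] for c in COMPONENTS) / len(COMPONENTS)
    return scores


truth = {
    "chair": {"room": [0, 0], "position": [3, 2], "orientation": "N"},
    "table": {"room": [0, 1], "position": [5, 4], "orientation": "E"},
}
predicted = {
    "chair": {"room": [0, 0], "position": [3, 2], "orientation": "E"},  # wrong facing
    "table": {"room": [0, 1], "position": [5, 4], "orientation": "E"},
}
scores = score_map(predicted, truth)   # orientation 0.5, overall 0.875
```

Keeping the components separate is what lets a single wrong facing direction show up as an orientation error rather than a blanket failure on the object.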
This structured output format means that failures are precisely attributable. When a model scores 0.85 on existence but 0.35 on orientation, we know exactly what's breaking. The JSON requirement also tests whether models can maintain structured state across a long conversation — some models (particularly smaller ones) start producing malformed JSON after 20+ turns.
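A harness that probes a map on every turn needs to notice when the output stops being valid JSON. Below is a minimal sketch of such a check using Python's standard `json` module. The required keys and the function name `parse_cognitive_map` are hypothetical, since the benchmark's exact schema is not reproduced here.

```python
import json
from typing import Optional

# Hypothetical per-object fields; the real schema may differ.
REQUIRED_KEYS = {"room", "position", "orientation"}


def parse_cognitive_map(raw: str) -> Optional[dict]:
    """Return the parsed map, or None if the output is malformed or incomplete."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # truncated or otherwise invalid JSON
    if not isinstance(data, dict):
        return None
    for entry in data.values():
        if not isinstance(entry, dict) or not REQUIRED_KEYS.issubset(entry):
            return None  # an object entry is missing a required field
    return data


good = parse_cognitive_map(
    '{"chair": {"room": [0, 0], "position": [3, 2], "orientation": "N"}}'
)
truncated = parse_cognitive_map('{"chair": {"room": [0, 0], "position": [3, ')
incomplete = parse_cognitive_map('{"chair": {"room": [0, 0]}}')
```

Logging the turn at which the parser first returns `None` would give a simple per-model measure of how long structured output survives.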
The probed cognitive maps are scored on multiple axes:
Cognitive science distinguishes two types of spatial knowledge:
Humans build route knowledge first (by walking through a space) and gradually integrate it into survey knowledge (a bird's-eye mental map). The benchmark tests both.
Route-level (egocentric):
Survey-level (allocentric):
Here's the headline result. Take any model — say GPT-5.2. Give it a complete, correct description of the environment and ask spatial questions. It scores 0.57 (normalized). Now give it the same environment but make it actively explore to build its own map first. Performance drops to 0.46.
That's an 11-point drop just from having to explore instead of being told. And GPT-5.2 is one of the better models.
The numbers get worse when you look at how models explore. The Strategist proxy agent needs about 9 steps per room to achieve near-complete coverage. Foundation models take 14+ steps — and still achieve lower information gain. They explore redundantly: revisiting rooms they've already fully observed, rotating when they've already seen all four directions, observing the same view multiple times.
Concrete example from GPT-5.2 on a 3x3 grid: the model enters Room (1,1), observes (sees 2 objects), rotates 90 degrees, observes (sees 1 more object), rotates 90 degrees, observes (sees nothing new), rotates 90 degrees, observes again (repeating its initial view). That's 4 observations for a room that only needed 3 rotations. Then it goes to Room (1,2), observes twice... then returns to Room (1,1) and observes again. This pattern is pervasive — models lack a reliable "have I been here before?" signal.
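The missing "have I been here before?" signal is cheap to compute outside the model. The sketch below flags repeated views in an action trace by tracking (room, heading) pairs. The trace is made up for illustration and is only loosely modeled on the example above; the function name `redundant_observations` is likewise an assumption.

```python
from typing import Iterable, List, Set, Tuple

View = Tuple[Tuple[int, int], str]   # (room index, heading)


def redundant_observations(views: Iterable[View]) -> List[int]:
    """Return the step indices at which an already-seen view is observed again."""
    seen: Set[View] = set()
    repeats: List[int] = []
    for step, view in enumerate(views):
        if view in seen:
            repeats.append(step)
        seen.add(view)
    return repeats


# Hypothetical trace: three fresh views in room (1,1), one repeat,
# two identical views in room (1,2), then a return to room (1,1).
trace = [
    ((1, 1), "N"), ((1, 1), "E"), ((1, 1), "S"), ((1, 1), "N"),
    ((1, 2), "N"), ((1, 2), "N"),
    ((1, 1), "N"),
]
repeats = redundant_observations(trace)   # [3, 5, 6]
wasted_fraction = len(repeats) / len(trace)
```

A scripted agent with this bookkeeping never wastes a step on a known view, which is part of why proxy agents can cover a room in fewer steps.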
GPT-5.2 shows a distinctive failure mode: fast initial information gain followed by premature stopping. It gathers information quickly in the first few steps, then decides it's "done" before fully exploring. Gemini-3 Pro, by contrast, explores more methodically and achieves the best overall cognitive map quality — but still falls short of the Strategist proxy.
This is perhaps the most surprising finding. You might expect that as a model explores more, its cognitive map gets better. That's true for the first few steps. But then something strange happens: beliefs that were correct early on start becoming incorrect later.
At step 3, the model correctly observed a chair at (3,2) facing north. At step 7, after visiting other rooms, the probed cognitive map now says the chair is at (3,2) facing east. Nothing changed in the environment — the model simply corrupted its own memory.
The paper measures this with a stability score: for each object, track whether a correct belief at step t remains correct at step t+1, t+2, etc. Across all models, stability degrades over time. Objects observed early in exploration are most vulnerable — they spend the most steps "in memory" and accumulate the most drift.
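A simplified version of such a stability score can be written in a few lines. This sketch only compares consecutive steps (t and t+1), which is an assumption made for brevity; the paper's exact definition may aggregate over longer horizons. The function name and input layout are also illustrative.

```python
from typing import Dict, List


def stability_score(correct_by_step: Dict[str, List[bool]]) -> float:
    """Fraction of correct beliefs at step t that are still correct at step t+1."""
    kept = 0
    total = 0
    for history in correct_by_step.values():
        for now, nxt in zip(history, history[1:]):
            if now:            # only beliefs that were correct can be "lost"
                total += 1
                kept += nxt    # True counts as 1
    return kept / total if total else 1.0


# Per-object correctness of the probed belief at each step (made-up data).
probes = {
    "chair": [True, True, False, False],   # correct early, then drifts
    "table": [False, True, True, True],    # found later, stays correct
}
score = stability_score(probes)   # 3 of 4 correct beliefs survive: 0.75
```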
Not all attributes drift equally. Object position (2D coordinates) is relatively stable — once a model correctly locates an object, it tends to remember the location. But object orientation (which direction it faces) is catastrophically unstable. Orientation perception accuracy in vision mode drops below 40% for most models.
Concrete numbers across 108 scenarios (4x4 grid, 8 objects per room, 50 exploration steps):
The degradation is not caused by context window overflow: all models have 128K+ context and the conversations fit within 15K tokens. It appears to be an inherent limitation of autoregressive generation: earlier observations get "overwritten" by later reasoning, even when no new contradictory evidence is presented. This resembles catastrophic forgetting, but it happens within a single conversation rather than across successive training tasks.
Belief instability means that longer exploration can actually hurt performance. There's an optimal exploration length beyond which additional steps degrade more beliefs than they form. This creates a cruel tradeoff: explore too little and you miss objects; explore too much and you corrupt the ones you found.
Chapter 7 showed that models corrupt their own beliefs even when nothing changes. This chapter asks: what happens when the environment actually changes?
The setup mirrors the classic Sally-Anne test from Theory of Mind research:
The paper introduces a belief inertia metric: after re-exploration, does the agent's belief match the new state (correct update) or the old state (inertia)? A high inertia score means the model is clinging to obsolete information.
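The metric can be sketched as a three-way classification of each changed attribute: the belief matches the new state, matches the old state, or matches neither. The function name, the dictionary layout, and the "other" bucket are assumptions for illustration rather than the paper's exact formulation.

```python
from typing import Dict


def belief_inertia(old: Dict[str, str], new: Dict[str, str],
                   believed: Dict[str, str]) -> Dict[str, float]:
    """Classify beliefs about changed objects as updated, inert, or other."""
    changed = [name for name in new if old.get(name) != new[name]]
    counts = {"updated": 0, "inertia": 0, "other": 0}
    for name in changed:
        belief = believed.get(name)
        if belief == new[name]:
            counts["updated"] += 1    # belief tracks the modified environment
        elif belief == old.get(name):
            counts["inertia"] += 1    # belief still reflects the obsolete state
        else:
            counts["other"] += 1      # wrong in some third way
    n = len(changed)
    return {k: (v / n if n else 0.0) for k, v in counts.items()}


# Orientations before the change, after it, and as believed after re-exploration.
old_state = {"chair": "N", "table": "E", "lamp": "S"}
new_state = {"chair": "E", "table": "E", "lamp": "W"}
beliefs = {"chair": "N", "table": "E", "lamp": "W"}

result = belief_inertia(old_state, new_state, beliefs)
# chair is inert, lamp is updated, table never changed and is ignored
```

Separating "inertia" from "other" matters: a model that answers randomly after a change is wrong, but it is not clinging to the old state.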
Results are stark:
The text-vision modality gap is especially large for belief revision. In text mode, the observation explicitly says "chair is now facing east" — hard to miss. In vision mode, the model must perceive the orientation change from rendered pixels, which is exactly the perceptual capability that's weakest.
This creates a cascading failure: the perceptual bottleneck (Chapter 7) makes initial orientation beliefs unreliable, and belief inertia (this chapter) prevents those unreliable beliefs from being corrected even when the model gets a second chance.
The paper tests several prompting strategies:
The takeaway: belief inertia is not a prompting problem. It appears to be a fundamental limitation of how autoregressive models process and update implicit state over long conversations.
The idea that animals build internal spatial representations goes back to Tolman's "cognitive map" hypothesis. Rats learn spatial layouts beyond simple stimulus-response associations: they form map-like representations that support shortcuts and detours. O'Keefe later discovered place cells in the hippocampus that fire at specific locations, and the Mosers' group subsequently found grid cells (Moser & Moser, 2005) in the entorhinal cortex that tile space in hexagonal patterns. Theory of Space asks whether foundation models develop anything analogous: functional cognitive maps, even without neural correlates.
Theory of Space explicitly parallels Theory of Mind testing. Just as ToM benchmarks use false-belief tasks (Sally-Anne) to test whether models distinguish their own knowledge from others', Theory of Space uses false-belief tasks to test whether models distinguish their current observations from their prior beliefs. The finding that models exhibit belief inertia mirrors ToM findings that models struggle with false beliefs.
VSI-Bench tests video-based spatial understanding. SpartQA tests spatial reasoning in text. ScanQA tests 3D scene understanding. All of these provide complete spatial information and test reasoning. Theory of Space contributes the active construction dimension — the agent must acquire its own spatial information through exploration.
The implications for embodied AI are direct. VLAs (Vision-Language-Action models) like RT-2, pi-0, and Octo are deployed in physical environments that are partially observable. If the underlying foundation models can't maintain stable spatial beliefs through active exploration, robot policies built on top of them inherit those limitations. The active-passive gap measured here predicts real-world deployment failures.
Across the full benchmark, Gemini-3 Pro achieves the best overall spatial intelligence — strongest exploration strategy, best cognitive map quality, lowest belief inertia. GPT-5.2 has the fastest initial information gain but suffers from premature termination and moderate instability. Claude-4.5 Sonnet, GLM-4.6V, and Qwen3-VL show competitive reasoning from given maps but larger active-passive gaps.
| Model | Passive (text) | Active (text) | Active (vision) | Belief Stability | Orientation Inertia |
|---|---|---|---|---|---|
| Gemini-3 Pro | 0.54 | 0.48 | 0.39 | 0.78 | 0.48 |
| GPT-5.2 | 0.57 | 0.46 | 0.37 | 0.72 | 0.55 |
| Claude-4.5 Sonnet | 0.52 | 0.41 | 0.34 | 0.69 | 0.62 |
| GLM-4.6V | 0.49 | 0.37 | 0.30 | 0.65 | 0.64 |
| Qwen3-VL | 0.47 | 0.36 | 0.28 | 0.63 | 0.67 |
Key observations: (1) Every model shows a text → vision drop of 7-9 points in active mode, which is pure perception cost. (2) Orientation inertia is 0.48 or higher for all models, so no model reliably updates orientation beliefs. (3) The best model (Gemini-3 Pro) still achieves only 0.39 in active-vision mode, not far above the 0.25 chance level of a 4-option orientation task.
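The gaps quoted in this article can be recomputed directly from the table. The short script below does that arithmetic; the scores are copied from the table above, while the variable and function names are just for this example.

```python
from typing import Dict, Tuple

# (passive text, active text, active vision), copied from the table above.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "Gemini-3 Pro": (0.54, 0.48, 0.39),
    "GPT-5.2": (0.57, 0.46, 0.37),
    "Claude-4.5 Sonnet": (0.52, 0.41, 0.34),
    "GLM-4.6V": (0.49, 0.37, 0.30),
    "Qwen3-VL": (0.47, 0.36, 0.28),
}


def gaps_in_points(passive: float, active_text: float,
                   active_vision: float) -> Dict[str, int]:
    """Express both gaps in whole points (hundredths of the normalized score)."""
    return {
        "active_passive": round((passive - active_text) * 100),
        "text_vision": round((active_text - active_vision) * 100),
    }


gaps = {model: gaps_in_points(*scores) for model, scores in RESULTS.items()}

for model, g in gaps.items():
    print(f"{model:18s} active-passive: {g['active_passive']:2d}  "
          f"text-vision: {g['text_vision']:2d}")
```

On these numbers the active-passive gap ranges from 6 points (Gemini-3 Pro) to 12 points (GLM-4.6V), and the text-to-vision drop stays between 7 and 9 points for every model.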