The computational structure of Spatial AI systems — a visionary roadmap for evolving SLAM into general-purpose, always-on spatial intelligence for every device, co-designed with specialized processors and close-to-sensor computation.
Imagine the future of augmented reality. Not a bulky headset tethered to a gaming PC, but lightweight spectacles you wear all day. They need to understand the 3D world around you in real time — where the walls are, where the table is, where your coffee mug sits — so they can overlay digital information precisely anchored to reality.
Now consider the constraints. These glasses must run on a battery smaller than your thumbnail. They must consume less than one watt of power. They must process every frame from every camera in milliseconds, not seconds. And they must do all of this while fitting in a form factor indistinguishable from regular eyewear.
In 2018, when Andrew Davison wrote this paper, the gap between what existed and what was needed was enormous. SLAM (Simultaneous Localization and Mapping) — the core technology for real-time 3D understanding — ran on desktop GPUs consuming 200+ watts. The algorithms were brilliant but power-hungry. The hardware was capable but massive.
This paper is not a typical methods paper. It doesn't propose a new algorithm or report benchmark numbers. It's a manifesto — a detailed vision of what Spatial AI systems will look like when SLAM, deep learning, and specialized hardware are co-designed from the ground up. Davison calls this vision FutureMapping.
The key question that drives everything: can we design systems where the computational graph structure of the algorithms matches the physical graph structure of the processor hardware, so that data barely has to move?
Davison's central insight is that Spatial AI is not just a software problem. You can't solve it by writing a better SLAM algorithm and running it on commodity hardware. The path forward requires co-design across three layers simultaneously: algorithms, processors, and sensors.
The reason these three must be co-designed comes down to a single physical fact: moving data costs more energy than computing on it. In modern chips, a floating-point multiply costs about 1 picojoule. Moving that same number across a chip costs about 100 picojoules. Sending it off-chip costs about 1,000 picojoules. The energy hierarchy is brutal: each step away from the arithmetic unit multiplies the cost by 10 to 100.
This means the architecture of your processor — where data physically lives relative to where it's processed — determines your power budget far more than the algorithm's computational complexity. Two algorithms with identical FLOP counts can differ by 100x in power consumption depending on how much data movement they require.
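To make the arithmetic concrete, here is a minimal sketch that applies the rough per-operation costs quoted above to two made-up workloads. The function name and the workload sizes are assumptions chosen for illustration, not figures from the paper.

```python
# Rough per-operation energy costs quoted in the text, in picojoules.
PJ_PER_FLOP = 1.0
PJ_PER_ON_CHIP_MOVE = 100.0
PJ_PER_OFF_CHIP_MOVE = 1000.0


def energy_picojoules(flops, on_chip_moves, off_chip_moves):
    """Total energy of a workload under a simple three-term cost model."""
    return (flops * PJ_PER_FLOP
            + on_chip_moves * PJ_PER_ON_CHIP_MOVE
            + off_chip_moves * PJ_PER_OFF_CHIP_MOVE)


# Two hypothetical workloads with the same FLOP count (1 million).
# "local" keeps operands next to the arithmetic units and only occasionally
# moves data across the chip; "remote" fetches one operand from off-chip
# memory for every operation.
local = energy_picojoules(flops=1_000_000, on_chip_moves=10_000, off_chip_moves=0)
remote = energy_picojoules(flops=1_000_000, on_chip_moves=0, off_chip_moves=1_000_000)

print(f"local:  {local / 1e6:.1f} microjoules")
print(f"remote: {remote / 1e6:.1f} microjoules")
print(f"remote / local energy ratio: {remote / local:.1f}")
```

Under these assumed workloads the arithmetic is identical, yet the energy differs by well over two orders of magnitude, entirely because of where the operands live.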
This is why Davison calls the paper "FutureMapping" — it's about mapping the future of the entire system stack, not just the mapping algorithms. The future is not a better SLAM running on a GPU. It's a fundamentally new kind of computing system where algorithms, processors, and sensors are designed as one integrated whole.
SLAM has evolved through three distinct levels of capability, each building on the last, and Davison's vision points toward a fourth. Understanding this progression reveals where we're headed.
Level 1 (sparse): The original SLAM systems (MonoSLAM, PTAM, ORB-SLAM) track a sparse set of point features such as corners, edges, and distinctive patches. They estimate camera pose and a cloud of 3D points. The map is geometrically useful (you can localize) but visually empty: just dots floating in space. You can't tell if there's a wall or a table; you just see points.
Level 2 (dense): Systems like DTAM and KinectFusion reconstruct every surface. Using depth sensors or dense stereo, they build volumetric or surfel-based maps where you can see walls, floors, furniture shapes. The map is geometrically rich: you can measure distances, plan paths, detect obstacles. But the map still doesn't know what anything is. A chair and a table are just differently shaped surfaces.
Level 3 (semantic): Systems like SemanticFusion and Mask-SLAM add CNN-based object recognition to dense maps. Now the system knows "that's a chair" and "that's a table." But the semantics are typically bolted on as a separate processing stage: run a CNN on the image, project labels into the 3D map, done. The geometry and semantics live in separate worlds.
The progression from Level 1 to Level 4 isn't just about adding more data to the map. It's a fundamental shift in what the map is. In Level 1, the map is a bag of coordinates. In Level 4, the map is a graph of learned concepts — each node a compressed, multi-modal representation that can be decoded into geometry, appearance, or semantic labels on demand.
Davison stakes the entire vision on two hypotheses. These aren't proven facts — they're bets about what the right design choices are for Spatial AI. And they're deliberately controversial.
There's a tempting shortcut: skip the 3D map entirely. Modern end-to-end learning can go directly from pixels to actions (for a robot) or pixels to rendered overlays (for AR). Why bother reconstructing 3D geometry at all?
Davison argues this is wrong. His first hypothesis (H1) is that a general, persistent, close-to-metric 3D scene representation is essential because:
The second hypothesis (H2): if Spatial AI systems build general 3D maps, then their quality can be measured by a small number of universal metrics, independent of downstream application:
This is analogous to how we evaluate a CPU: we don't benchmark it on every possible application; we measure clock speed, IPC, and memory bandwidth. Such universal metrics would enable standardized benchmarking of Spatial AI systems, a goal Davison pursued with the SLAMBench project.
This is the technical heart of the paper. Davison identifies three distinct graph structures that exist in every SLAM system. The key to efficient Spatial AI is recognizing these graphs and designing hardware that mirrors their topology.
Every camera image is a regular 2D grid of pixels. Nearby pixels are highly correlated — a pixel's value strongly predicts its neighbors'. This regularity is exactly why convolutional neural networks work so well: a 3x3 convolution kernel exploits the local, grid-structured correlations of the image graph. The image graph is regular, dense, and local.
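The locality of the image graph can be seen in a minimal, pure-Python sketch of a 3x3 filter: every output value depends only on a pixel and its eight grid neighbors. The function name and the tiny test image are illustrative assumptions. Like most CNN libraries, the sketch computes a cross-correlation (no kernel flip).

```python
def convolve3x3(image, kernel):
    """Apply a 3x3 kernel in 'valid' mode (no padding).

    Each output pixel reads only the 3x3 neighborhood around one input pixel,
    so the work for different pixels is independent and purely local.
    """
    height, width = len(image), len(image[0])
    output = [[0.0] * (width - 2) for _ in range(height - 2)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += image[y + dy][x + dx] * kernel[dy + 1][dx + 1]
            output[y - 1][x - 1] = acc
    return output


# A 4x4 image with a vertical edge between the second and third columns.
edge_image = [[0, 0, 1, 1] for _ in range(4)]
# A simple horizontal-gradient kernel (responds to vertical edges).
horizontal_gradient = [[-1, 0, 1]] * 3

response = convolve3x3(edge_image, horizontal_gradient)
print(response)
```

Because no output needs data from more than one pixel away, this computation maps naturally onto hardware where each processing element only talks to its immediate neighbors.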
The 3D map that SLAM builds is a very different kind of graph. It's a collection of features (points, surfaces, objects) linked by co-visibility — two features are connected if they've been seen together in the same camera frame. This graph is:
The real-time processing loop itself forms a graph: sensor data flows in, gets processed through tracking, data association, map update, and rendering stages, with feedback loops everywhere. This is a directed graph with cycles — the map affects what you predict, predictions affect how you interpret new data, interpretations update the map.
This is the paper's most important structural observation. Traditional SLAM systems force all three graphs onto a single processor type (CPU or GPU), which means at least two of the three graphs are poorly matched to the hardware. The future requires heterogeneous processors where each subsystem's hardware mirrors its algorithmic graph.
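As an illustration of the map graph described above, the following sketch builds a co-visibility graph from a record of which features were seen in which frames. The data layout, function name, and feature labels are assumptions made for this example; real systems store far richer information per node and edge.

```python
from collections import defaultdict
from itertools import combinations


def build_covisibility_graph(frames):
    """Link two features whenever they appear together in the same frame.

    `frames` maps a frame id to the set of feature ids observed in it.
    Returns an adjacency dict: feature id -> {neighbor id: shared frame count}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for features in frames.values():
        for a, b in combinations(sorted(features), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return {node: dict(neighbors) for node, neighbors in graph.items()}


# Hypothetical observations from three camera frames.
frames = {
    "frame_0": {"mug", "table", "wall"},
    "frame_1": {"table", "wall", "door"},
    "frame_2": {"door", "window"},
}
graph = build_covisibility_graph(frames)

for feature, neighbors in sorted(graph.items()):
    print(feature, "->", sorted(neighbors.items()))
```

Unlike the pixel grid, the connectivity here depends entirely on where the camera has looked: the mug and the window never share an edge because no frame saw both.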
The essential structure of all SLAM systems — and by extension, all Spatial AI — is a closed loop between a persistent world model and incoming sensor data. Understanding this loop is critical because it dictates the entire computational architecture.
KinectFusion (Newcombe et al., 2011) is Davison's paradigmatic example of the closed loop done right. It maintains a voxel grid representing the 3D scene, renders predicted depth images from any viewpoint, tracks the camera by aligning predicted and observed depth images, and fuses new depth data back into the voxel grid — all at 30 Hz on a single GPU.
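The fusion step can be sketched as a weighted running average of truncated signed distances. The version below is a deliberately simplified illustration restricted to the voxels along a single camera ray. The function name, the truncation distance, the weight cap, and the per-sample weight of 1 are all assumptions, and the real system performs this update over a full 3D grid on the GPU.

```python
def fuse_depth_along_ray(tsdf, weights, voxel_depths, measured_depth,
                         truncation=0.25, max_weight=64.0):
    """Fuse one depth measurement into the voxels along a single camera ray.

    tsdf[i] holds the truncated signed distance of voxel i (positive in front
    of the surface, negative just behind it) and weights[i] its confidence.
    Both lists are updated in place.
    """
    for i, voxel_depth in enumerate(voxel_depths):
        signed_distance = measured_depth - voxel_depth
        if signed_distance < -truncation:
            continue  # far behind the observed surface: occluded, leave untouched
        sample = min(1.0, signed_distance / truncation)
        old_weight = weights[i]
        # Weighted running average; each new sample gets weight 1 (an assumption).
        tsdf[i] = (tsdf[i] * old_weight + sample) / (old_weight + 1.0)
        weights[i] = min(old_weight + 1.0, max_weight)


voxel_depths = [0.5, 1.0, 1.25, 2.0]   # meters from the camera along the ray
tsdf = [0.0] * len(voxel_depths)
weights = [0.0] * len(voxel_depths)

# Two slightly different depth readings of the same surface.
fuse_depth_along_ray(tsdf, weights, voxel_depths, measured_depth=1.0)
fuse_depth_along_ray(tsdf, weights, voxel_depths, measured_depth=1.25)

print(tsdf)
print(weights)
```

After both readings, the sign change in the stored values falls between the voxels at 1.0 m and 1.25 m, which is where the two measurements place the surface. The voxel at 2.0 m is never touched because it lies too far behind the observed surface.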
The key insight: in a well-designed closed loop, the system predicts every pixel of the next observation. Anything that differs from prediction is either a tracking error (correct the pose) or new information (update the map). This is maximally informative — nothing is wasted.
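The predict, compare, update cycle can be illustrated with a toy one-dimensional world. Everything in this sketch is an assumption made for illustration: the camera moves along a line, landmarks are points on that line, and a measurement is simply the offset from camera to landmark. Real systems solve a much harder nonlinear alignment problem.

```python
def closed_loop_step(pose, world_map, observations):
    """One predict / compare / update cycle in a 1D toy world.

    pose: estimated camera position on the line.
    world_map: landmark id -> estimated landmark position.
    observations: landmark id -> measured offset from camera to landmark.
    """
    # Predict the measurement for every landmark the model already knows,
    # and compare it with what was actually observed.
    residuals = [observations[name] - (world_map[name] - pose)
                 for name in observations if name in world_map]

    # Tracking: a consistent prediction error is explained by a pose error.
    if residuals:
        pose -= sum(residuals) / len(residuals)

    # Mapping: anything the model could not predict is new information.
    for name, offset in observations.items():
        if name not in world_map:
            world_map[name] = pose + offset

    return pose, world_map


# The model believes the camera is at 1.5, but it is really at 2.0.
world_map = {"A": 5.0}
observations = {"A": 3.0, "B": 1.0}   # true offsets measured from position 2.0
pose, world_map = closed_loop_step(1.5, world_map, observations)

print(pose)
print(world_map)
```

The known landmark A corrects the pose estimate, and only then is the previously unseen landmark B added to the map using the corrected pose.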
For decades, software developers enjoyed a free ride: Moore's law doubled transistor counts roughly every two years, and Dennard scaling kept power density flat, so each generation delivered more transistors within the same power envelope. But Dennard scaling broke down around 2006. More transistors now mean more power, which means more heat, which means you can't actually use all those transistors at once.
This created the power wall. The only path forward is parallelism — use many simple cores instead of one fast core, and keep data movement minimal. But here's the catch: generic parallelism (like GPUs) still wastes enormous energy moving data between processing elements and memory. The solution is application-specific parallelism, where the hardware's physical structure matches the algorithm's data flow.
Graph Processors (Graphcore IPU): A massively parallel processor with ~1,200 independent tiles, each with its own local memory and compute. Tiles communicate via a configurable interconnect — the physical communication graph can be reshaped to match any algorithm's data dependencies. Perfect for the map graph, where features need to exchange information with their co-visible neighbors.
Neuromorphic Chips (SpiNNaker): Processors inspired by biological neural networks, with asynchronous, event-driven computation. Instead of processing every pixel at a fixed frame rate, they process only changes — a natural fit for event cameras and temporally sparse signals.
Custom Vision ASICs: Purpose-built chips for specific vision tasks (like Intel's Movidius VPU). Extremely power-efficient for their target workload, but inflexible.
The power numbers tell the story. A desktop GPU (NVIDIA GTX 1080) runs SLAM at 200+ watts. A mobile SoC (Snapdragon) might manage 5 watts. But the target for always-on AR glasses is under 1 watt — and that's for the entire perception pipeline, not just SLAM. Only radical co-design can close this 200x gap.
The most radical idea in the paper: the camera itself should become an active participant in the perception loop, not a passive data source. Today's cameras are "dumb" — they capture full frames at a fixed rate and throw megabytes of raw pixels at the processor. Most of those pixels are redundant (nothing changed since last frame). All of them must travel from sensor to processor, consuming precious energy on data movement.
Event cameras (Dynamic Vision Sensors) are the first step toward smarter sensors. Instead of capturing full frames, each pixel independently reports when its brightness changes. A static scene produces zero data. A moving edge produces a sparse stream of events only at the boundary. Bandwidth drops by 10-100x for typical scenes.
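A frame-based simulation gives a feel for this behavior. The sketch below uses the common simplified model in which a pixel fires when its log brightness has changed by more than a threshold since its last event. The function name and threshold value are assumptions, and real sensors operate asynchronously per pixel rather than on discrete frames.

```python
import math


def generate_events(reference_log, frame, threshold=0.2):
    """Emit (x, y, polarity) for pixels whose log brightness changed enough.

    reference_log stores, per pixel, the log brightness at its last event and
    is updated in place. Brightness values in `frame` must be positive.
    """
    events = []
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            log_value = math.log(value)
            difference = log_value - reference_log[y][x]
            if abs(difference) >= threshold:
                events.append((x, y, 1 if difference > 0 else -1))
                reference_log[y][x] = log_value
    return events


# Reference state for a uniformly lit 2x2 scene (log(1.0) == 0.0).
reference = [[0.0, 0.0], [0.0, 0.0]]

static_events = generate_events(reference, [[1.0, 1.0], [1.0, 1.0]])
moving_events = generate_events(reference, [[1.0, 2.0], [1.0, 0.5]])

print(static_events)
print(moving_events)
```

The unchanged frame produces an empty event list, while the second frame produces events only at the two pixels that became brighter or darker.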
Even more radical: the SCAMP-5 chip (developed at the University of Manchester) puts a tiny processor at every pixel. A 256x256 array of pixel-processors can perform convolutions, edge detection, and simple CNN operations directly on the image plane, at 1.2 watts. Data never leaves the sensor — computation happens where the photons land.
Davison's boldest proposal: a camera that doesn't just report brightness changes (like an event camera), but reports deviations from the world model's predictions.
Think about what this means for power. In a familiar environment where the world model is good, the camera transmits almost nothing — just the occasional surprise. The bandwidth between sensor and processor drops to near zero. All the energy that was spent moving megabytes of redundant pixels per frame is saved.
This concept — that a sensor should report only where received data differs from prediction — is a generalization of the event camera concept. A standard event camera defines "prediction" as "the previous pixel value." The generalized event camera defines "prediction" as "what the world model says this pixel should look like." It's the same principle, but with a much better predictor.
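The difference between the two predictors can be shown with a toy one-row image. This is a minimal sketch under stated assumptions: the function name, the threshold, and the pixel values are invented for illustration, and the "world model render" is simply supplied by hand rather than produced by a real renderer.

```python
def deviation_events(predicted, observed, threshold=0.5):
    """Report (x, y, observed - predicted) only where the two images disagree."""
    return [(x, y, obs - pred)
            for y, (pred_row, obs_row) in enumerate(zip(predicted, observed))
            for x, (pred, obs) in enumerate(zip(pred_row, obs_row))
            if abs(obs - pred) >= threshold]


previous_frame = [[0.0, 0.0, 1.0, 1.0]]  # what the sensor saw last time
model_render = [[0.0, 1.0, 1.0, 1.0]]    # hypothetical world-model prediction for the new pose
observed = [[0.0, 1.0, 1.0, 4.0]]        # new frame: the edge shifted and one new object appeared

# Standard event camera: the prediction is the previous pixel value.
standard_events = deviation_events(previous_frame, observed)
# Generalized event camera: the prediction comes from the world model.
generalized_events = deviation_events(model_render, observed)

print(standard_events)
print(generalized_events)
```

With the previous frame as predictor, the shifted edge and the new object both generate events. With the world model as predictor, the edge shift is already explained by camera motion, so only the genuinely new object is reported.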
Davison synthesizes everything into a single architectural vision: the Spatial AI brain. This is the paper's "Figure 4" — a conceptual diagram of what a complete, integrated Spatial AI processor would look like. Let's build it up piece by piece.
At the center sits the map — a distributed graph of learned features stored across the cores of a graph processor. Each processor core "owns" a local region of the map. The feature at each node encodes geometry, appearance, and semantics in a single learned representation (a latent code). Nearby map features live on physically nearby cores, so local map operations (smoothing, optimization, co-visibility queries) happen without long-distance data movement.
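One simple way to picture this locality is a spatial partition in which each core owns a fixed region of the map. The sketch below is an assumption-laden illustration, not the paper's scheme: it flattens the map to a 2D floor plan, uses a square grid of tiles, and counts communication cost as grid hops between tiles.

```python
def tile_for_feature(position, region_size=1.0, tiles_per_side=4):
    """Map a 2D map position (meters) to the (column, row) of the owning tile.

    Positions outside the covered area are clamped to the border tiles.
    """
    def clamp(index):
        return max(0, min(tiles_per_side - 1, index))

    x, y = position
    return (clamp(int(x // region_size)), clamp(int(y // region_size)))


def hop_distance(tile_a, tile_b):
    """Number of neighbor-to-neighbor hops between two tiles on the grid."""
    return abs(tile_a[0] - tile_b[0]) + abs(tile_a[1] - tile_b[1])


# Hypothetical feature positions on a floor plan.
mug = tile_for_feature((0.2, 0.3))
table = tile_for_feature((0.8, 0.9))
chair = tile_for_feature((1.5, 0.5))
far_wall = tile_for_feature((3.5, 3.5))

print(hop_distance(mug, table))     # same region, same tile
print(hop_distance(mug, chair))     # adjacent regions
print(hop_distance(mug, far_wall))  # opposite corner of the map
```

Features that are close in the world end up on the same or adjacent tiles, so the operations that dominate local map maintenance only ever exchange data over short distances.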
Surrounding the map store is the real-time processing loop: rendering (predict what the cameras should see), tracking (compare prediction to reality), fusion (incorporate new data), and CNN labeling (extract semantic information). This loop runs at frame rate — 30 Hz or faster. Each stage reads from and writes to the map store through short, local connections.
At the periphery, camera interfaces handle bidirectional communication with smart sensors. The processor sends rendered predictions out to the cameras. The cameras send back only the differences. Multiple cameras can be serviced simultaneously, each with its own interface.
A network interface connects to the cloud, enabling the device to download previously mapped areas (no need to re-map a known building) and upload new map data. Richard Newcombe's vision: all devices eventually share a single, global "machine perception map" of the entire world — a shared, continuously updated 3D model maintained by billions of devices.
What makes this architecture powerful is not any single component — it's the data locality. The map features that are most relevant to the current view are stored on cores that are physically closest to the camera interface processing that view. Predictions flow outward from map to camera. Differences flow inward from camera to map. Nothing travels far.
FutureMapping was written in 2018. In the years since, many of its predictions have begun to materialize — some in ways Davison anticipated, others in surprising directions.
Neural Radiance Fields and 3D Gaussian Splatting realized the "learned scene representation" that Davison predicted. A NeRF stores a scene as a neural network's weights; Gaussians store it as a set of learned 3D primitives with appearance codes. Both are exactly the kind of "compact latent representation encoding geometry + appearance" that FutureMapping called for. Gaussian SLAM systems now run the closed loop with learned representations in real time.
SLAM++ (Salas-Moreno et al., 2013) pioneered object-level SLAM — recognizing known 3D objects and inserting them as graph nodes. CodeSLAM (Bloesch et al., 2018, from Davison's own group) learned compact depth codes that could be optimized jointly with camera poses. Both are direct precursors to the "Level 4" vision.
The explosion of vision-language-action models (RT-2, pi-0, Octo) represents a different path: instead of building explicit 3D maps, these models implicitly encode spatial understanding in massive neural networks. This is precisely the "task-specific embedding" approach that Davison's H1 argues against. The debate remains open — are explicit maps or implicit representations the right answer? Perhaps both, with explicit maps for precision tasks and implicit models for generalization.
Apple's spatial computing headset is perhaps the closest realization of FutureMapping's hardware vision. It runs real-time SLAM with semantic understanding on a custom R1 chip designed specifically for sensor processing, paired with an M2 chip for general computation. The R1 processes all sensor data within 12 milliseconds, which is close-to-sensor processing in action. Power consumption: ~5 watts for the full system, still 5x above Davison's target but roughly 40x below the 200+ watts of desktop SLAM.
Self-driving cars face the same three-graph problem at a larger scale: image graphs from multiple cameras, a dynamic map graph of lanes/vehicles/pedestrians, and a real-time computation graph with hard latency requirements. Tesla's approach (custom inference chips processing raw camera data) echoes the co-design philosophy.
SLAMBench, a multi-institution project to which Davison contributed, is a standardized benchmarking framework for SLAM and an attempt to realize H2's universal metrics. SemanticFusion, from Davison's group, demonstrated real-time CNN-based semantic labeling fused into dense 3D maps, bridging Levels 2 and 3.