FutureMapping — Veanors

Chapter 0: The Problem

Imagine the future of augmented reality. Not a bulky headset tethered to a gaming PC, but lightweight spectacles you wear all day. They need to understand the 3D world around you in real time — where the walls are, where the table is, where your coffee mug sits — so they can overlay digital information precisely anchored to reality.

Now consider the constraints. These glasses must run on a battery smaller than your thumbnail. They must consume less than one watt of power. They must process every frame from every camera in milliseconds, not seconds. And they must do all of this while fitting in a form factor indistinguishable from regular eyewear.

In 2018, when Andrew Davison wrote this paper, the gap between what existed and what was needed was enormous. SLAM (Simultaneous Localization and Mapping) — the core technology for real-time 3D understanding — ran on desktop GPUs consuming 200+ watts. The algorithms were brilliant but power-hungry. The hardware was capable but massive.

The central tension: Mass-market spatial computing devices need spectacle-size form factors, all-day battery life, and sub-watt power budgets. But the algorithms that provide real-time 3D understanding were designed for desktop workstations. Something fundamental has to change — not just incremental improvement, but a complete rethinking of what is stored where, what is processed where, and what is transmitted where and when.

This paper is not a typical methods paper. It doesn't propose a new algorithm or report benchmark numbers. It's a manifesto — a detailed vision of what Spatial AI systems will look like when SLAM, deep learning, and specialized hardware are co-designed from the ground up. Davison calls this vision FutureMapping.

The key question that drives everything: can we design systems where the computational graph structure of the algorithms matches the physical graph structure of the processor hardware, so that data barely has to move?

What is the fundamental constraint that makes current SLAM approaches insufficient for mass-market AR glasses?

Power budget: always-on spatial understanding must run at sub-watt power, but current SLAM requires desktop-class hardware at 200+ watts
SLAM algorithms are too inaccurate for AR
Cameras are not good enough yet

Chapter 1: The Key Insight

Davison's central insight is that Spatial AI is not just a software problem. You can't solve it by writing a better SLAM algorithm and running it on commodity hardware. The path forward requires co-design across three layers simultaneously:

Algorithms — SLAM + semantics + deep learning, fused into a single representation
Processors — specialized graph processors whose physical topology mirrors the algorithm's data flow
Sensors — cameras that don't just capture frames but actively communicate with the processor, sending only what's new

The reason these three must be co-designed comes down to a single physical fact: moving data costs more energy than computing on it. As rough orders of magnitude for modern chips, a floating-point multiply costs about 1 picojoule, moving that same number across the chip costs about 100 picojoules, and sending it off-chip costs about 1,000 picojoules. The energy hierarchy is brutal:

E_{compute} ≪ E_{on-chip-move} ≪ E_{off-chip-move}

This means the architecture of your processor — where data physically lives relative to where it's processed — determines your power budget far more than the algorithm's computational complexity. Two algorithms with identical FLOP counts can differ by 100x in power consumption depending on how much data movement they require.
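To make the arithmetic concrete, here is a minimal sketch of that energy hierarchy as a toy cost model. The per-operation costs are the rough figures quoted above; the two workloads and their move counts are hypothetical numbers chosen only for illustration.

```python
# Illustrative energy costs in picojoules, taken from the rough figures in the text.
# Real values vary with process node, word width, and memory technology.
PJ_COMPUTE = 1.0      # one floating-point multiply
PJ_ON_CHIP = 100.0    # moving one operand across the chip
PJ_OFF_CHIP = 1000.0  # moving one operand off-chip


def energy_pj(flops: int, on_chip_moves: int, off_chip_moves: int) -> float:
    """Total energy in picojoules for a workload under the toy cost model."""
    return (flops * PJ_COMPUTE
            + on_chip_moves * PJ_ON_CHIP
            + off_chip_moves * PJ_OFF_CHIP)


# Two hypothetical workloads with the same arithmetic but different data locality.
FLOPS = 1_000_000
local = energy_pj(FLOPS, on_chip_moves=10_000, off_chip_moves=0)
remote = energy_pj(FLOPS, on_chip_moves=0, off_chip_moves=1_000_000)

# 1e6 picojoules equal 1 microjoule.
print(f"local-data workload:  {local / 1e6:.1f} microjoules")
print(f"remote-data workload: {remote / 1e6:.1f} microjoules")
print(f"ratio: {remote / local:.1f}x")
```

Under these assumed counts the FLOPs are identical, yet the workload that fetches every operand from off-chip memory spends almost all of its energy on movement rather than arithmetic.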

The design principle: The computational GRAPH structure of Spatial AI algorithms should match the physical GRAPH structure of the processor hardware. When every piece of data is processed close to where it's stored, you minimize data movement and therefore minimize power. This is the path to sub-watt spatial intelligence.

This is one way to read the title "FutureMapping": the paper maps out the future of the entire system stack, not just the mapping algorithms. The future is not a better SLAM running on a GPU. It's a fundamentally new kind of computing system where algorithms, processors, and sensors are designed as one integrated whole.

Why does the graph structure of the algorithm need to match the graph structure of the processor?

Because data movement costs 100-1000x more energy than computation, so matching structures keeps data local and minimizes power consumption
Because graph processors can only run graph algorithms
Because it makes the code easier to debug

Chapter 2: From SLAM to Spatial AI

SLAM has evolved through three distinct levels of capability, each building on the last. Understanding this progression reveals where we're headed.

Level 1: Sparse Feature SLAM

The original SLAM systems (MonoSLAM, PTAM, ORB-SLAM) track a sparse set of point features — corners, edges, distinctive patches. They estimate camera pose and a cloud of 3D points. The map is geometrically useful (you can localize) but visually empty — just dots floating in space. You can't tell if there's a wall or a table; you just see points.

Level 2: Dense Mapping

Systems like DTAM and KinectFusion reconstruct every surface. Using depth sensors or dense stereo, they build volumetric or surfel-based maps where you can see walls, floors, furniture shapes. The map is geometrically rich — you can measure distances, plan paths, detect obstacles. But the map still doesn't know what anything is. A chair and a table are just differently shaped surfaces.

Level 3: Semantic Understanding

Systems like SemanticFusion and Mask-SLAM add CNN-based object recognition to dense maps. Now the system knows "that's a chair" and "that's a table." But the semantics are typically bolted on as a separate processing stage — run a CNN on the image, project labels into the 3D map, done. The geometry and semantics live in separate worlds.

The future — Level 4: Object-oriented SLAM with learned features. The map becomes a graph of objects, each represented by a learned code that encodes geometry, appearance, and semantics in a single unified representation. Not "surface + label" but a compact latent vector that captures everything the system knows about that entity. This is what Davison calls Spatial AI.

The progression from Level 1 to Level 4 isn't just about adding more data to the map. It's a fundamental shift in what the map is. In Level 1, the map is a bag of coordinates. In Level 4, the map is a graph of learned concepts — each node a compressed, multi-modal representation that can be decoded into geometry, appearance, or semantic labels on demand.

What distinguishes the "Level 4" Spatial AI map from Level 3 semantic SLAM?

Level 4 uses learned codes that unify geometry, appearance, and semantics in a single representation per entity, rather than bolting labels onto separate geometry
Level 4 has more points in the map
Level 4 uses better cameras

Chapter 3: Two Core Hypotheses

Davison stakes the entire vision on two hypotheses. These aren't proven facts — they're bets about what the right design choices are for Spatial AI. And they're deliberately controversial.

Hypothesis 1: Build a general, persistent, metric 3D map

There's a tempting shortcut: skip the 3D map entirely. Modern end-to-end learning can go directly from pixels to actions (for a robot) or pixels to rendered overlays (for AR). Why bother reconstructing 3D geometry at all?

Davison argues this is wrong. A general, persistent, close-to-metric 3D scene representation is essential because:

Persistence — the map must outlive any single viewing session. When you come back tomorrow, the system should recognize where it is instantly, not re-map from scratch.
Generality — the same map must serve many tasks: navigation, object manipulation, AR overlay, semantic queries. A task-specific embedding is useless for tasks it wasn't trained for.
Metric accuracy — AR overlays must be anchored to within millimeters. Robots must reach to within centimeters. Approximate "it's roughly over there" is not enough.

H1 in one sentence: Spatial AI should build a general-purpose, persistent, close-to-metric 3D world model — not task-specific embeddings. This is the shared "operating system" for all spatial applications running on a device.

Hypothesis 2: Universal performance metrics

If Spatial AI systems build general 3D maps, then their quality can be measured by a small number of universal metrics, independent of downstream application:

Localization accuracy — how precisely can the system determine its own pose?
Surface prediction — how accurately can it reconstruct scene geometry?
Object identification — how reliably can it recognize and distinguish objects?

This is analogous to how we evaluate a CPU: we don't benchmark it on every possible application, we measure clock speed, IPC, and memory bandwidth. These universal metrics would enable standardized benchmarking of Spatial AI systems — something Davison pursued with the SLAMBench project.
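As an illustration of what a task-independent localization metric can look like, the sketch below computes the root-mean-square position error between an estimated and a reference trajectory. The choice of RMSE, and the assumption that both trajectories are already time-synchronized and expressed in the same coordinate frame, are ours and are not prescribed by the text above.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def trajectory_rmse(estimated: Sequence[Point], reference: Sequence[Point]) -> float:
    """Root-mean-square position error between two aligned trajectories.

    Assumes the i-th estimated position corresponds to the i-th reference
    position and that both are expressed in the same coordinate frame.
    """
    if not estimated or len(estimated) != len(reference):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_error = 0.0
    for est, ref in zip(estimated, reference):
        squared_error += sum((e - r) ** 2 for e, r in zip(est, ref))
    return math.sqrt(squared_error / len(estimated))


# Hypothetical three-pose trajectory; only the last pose has drifted.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.3, 0.0)]
print(f"position RMSE: {trajectory_rmse(estimated, reference):.3f} m")
```

The same number can be reported whether the device is driving a robot arm or anchoring an AR overlay, which is the point of a universal metric.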

Why these hypotheses matter: If H1 is wrong — if task-specific shortcuts work better — then the entire co-design vision falls apart, because you'd need different hardware for each task. If H2 is wrong — if there are no universal metrics — then you can't benchmark progress. Davison is betting that spatial understanding is fundamental enough to have a single, general solution.

Why does Davison argue against task-specific spatial embeddings in favor of a general 3D map?

A general map supports persistence across sessions, serves multiple tasks simultaneously, and enables metric-accurate AR/robotics, none of which task-specific embeddings provide
Task-specific embeddings use too much memory
General maps are easier to train

Chapter 4: Graph Structures in SLAM

This is the technical heart of the paper. Davison identifies three distinct graph structures that exist in every SLAM system. The key to efficient Spatial AI is recognizing these graphs and designing hardware that mirrors their topology.

(a) The Image Graph

Every camera image is a regular 2D grid of pixels. Nearby pixels are highly correlated — a pixel's value strongly predicts its neighbors'. This regularity is exactly why convolutional neural networks work so well: a 3x3 convolution kernel exploits the local, grid-structured correlations of the image graph. The image graph is regular, dense, and local.
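The locality of the image graph is easy to see in code. The following is a minimal pure-Python sketch of a 3x3 convolution with no padding; the kernel and the tiny test image are illustrative choices, and a real system would run this on parallel hardware rather than in nested loops.

```python
from typing import List

Grid = List[List[float]]


def convolve3x3(image: Grid, kernel: Grid) -> Grid:
    """Slide a 3x3 kernel over the image interior (no padding).

    This is cross-correlation, which is what CNN layers conventionally call
    convolution. Each output pixel reads only itself and its 8 neighbours.
    """
    rows, cols = len(image), len(image[0])
    output: Grid = []
    for r in range(1, rows - 1):
        out_row = []
        for c in range(1, cols - 1):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += image[r + dr][c + dc] * kernel[dr + 1][dc + 1]
            out_row.append(acc)
        output.append(out_row)
    return output


# Horizontal-gradient kernel: responds where brightness rises from left to right.
GRADIENT_X: Grid = [[-1.0, 0.0, 1.0]] * 3

# A 3x5 image with a vertical edge between columns 1 and 2.
image: Grid = [[0.0, 0.0, 1.0, 1.0, 1.0]] * 3

edges = convolve3x3(image, GRADIENT_X)
print(edges)
```

Every output value depends only on a fixed, small neighbourhood, so each pixel can be computed independently and in parallel, which is why grid-shaped hardware suits this graph.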

(b) The Map Graph

The 3D map that SLAM builds is a very different kind of graph. It's a collection of features (points, surfaces, objects) linked by co-visibility — two features are connected if they've been seen together in the same camera frame. This graph is:

Irregular — some areas are densely mapped, others sparse
Multi-scale — features range from fine surface details to entire rooms
Dynamic — new nodes and edges appear as the camera moves; old ones may be pruned
Hierarchical — local feature groups cluster into objects, objects into rooms, rooms into buildings
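A co-visibility graph of this kind can be sketched in a few lines. The example below assumes each frame is summarized as the set of feature identifiers it observed; the feature names and the edge weighting (number of shared frames) are illustrative choices rather than anything specified in the paper.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set


def build_covisibility(frames: Iterable[Set[str]]) -> Dict[str, Dict[str, int]]:
    """Connect two features whenever they appear in the same frame.

    The edge weight counts how many frames observed both features.
    """
    graph: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for visible in frames:
        for feature in visible:
            _ = graph[feature]  # make sure isolated features still get a node
        for a, b in combinations(sorted(visible), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return {node: dict(neighbours) for node, neighbours in graph.items()}


# Hypothetical observations from three camera frames.
frames = [
    {"mug", "table"},
    {"mug", "table", "chair"},
    {"chair", "door"},
]
graph = build_covisibility(frames)
for node in sorted(graph):
    print(node, graph[node])
```

Node degrees here are uneven and the graph grows as frames arrive, which illustrates the irregular and dynamic properties listed above. The mug and the door are never seen together, so no edge links them even if they are physically close.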

(c) The Computation Graph

The real-time processing loop itself forms a graph: sensor data flows in, gets processed through tracking, data association, map update, and rendering stages, with feedback loops everywhere. This is a directed graph with cycles — the map affects what you predict, predictions affect how you interpret new data, interpretations update the map.

The matching principle: The image graph maps naturally to GPU-style SIMD arrays (regular grid processing). The map graph maps naturally to graph processors like Graphcore's IPU (irregular, dynamic connectivity). The computation graph dictates the data flow between these processing elements. When all three graph structures are physically realized in hardware, data stays local — and power consumption plummets.

This is the paper's most important structural observation. Traditional SLAM systems force all three graphs onto a single processor type (CPU or GPU), which means at least two of the three graphs are poorly matched to the hardware. The future requires heterogeneous processors where each subsystem's hardware mirrors its algorithmic graph.

Why is the map graph fundamentally different from the image graph?

The image graph is regular, dense, and local (a pixel grid), while the map graph is irregular, multi-scale, dynamic, and linked by co-visibility rather than spatial adjacency
The map graph has more nodes
The map graph is two-dimensional while the image graph is three-dimensional

Chapter 5: The Closed Loop

The essential structure of all SLAM systems — and by extension, all Spatial AI — is a closed loop between a persistent world model and incoming sensor data. Understanding this loop is critical because it dictates the entire computational architecture.

The loop has four stages:

Prediction (Rendering) — The system uses its current world model to predict what the camera should see from its estimated pose. In dense SLAM (like KinectFusion), this means rendering a synthetic depth image from the reconstructed 3D model.
Measurement — The camera captures what the world actually looks like from the current viewpoint.
Data Association + Tracking — The system compares prediction to measurement. The differences tell it two things: (a) how to refine its pose estimate (tracking), and (b) what parts of the scene are new or changed (data association).
Map Update (Fusion) — New observations are fused into the persistent world model, improving its accuracy and completeness.
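The four stages above can be sketched with a deliberately tiny one-dimensional world. Everything here is a toy assumption: the "camera" sees a window of four cells, the pose is an integer offset, tracking is a brute-force search over nearby poses, and fusion is a per-cell running average. It is not how KinectFusion or any real system implements these steps.

```python
from typing import List, Optional

WINDOW = 4  # number of cells the toy camera sees at once
Map = List[Optional[float]]


def render(world_map: Map, pose: int) -> Map:
    """Prediction: what the map says the camera should see from `pose`."""
    return world_map[pose:pose + WINDOW]


def track(world_map: Map, observation: List[float], previous_pose: int, search: int = 2) -> int:
    """Tracking: choose the nearby pose whose prediction best matches the measurement."""
    best_pose, best_cost = previous_pose, float("inf")
    for pose in range(max(0, previous_pose - search), previous_pose + search + 1):
        if pose + WINDOW > len(world_map):
            continue
        predicted = render(world_map, pose)
        # Compare only cells the map already knows about.
        residuals = [abs(p - o) for p, o in zip(predicted, observation) if p is not None]
        if not residuals:
            continue
        cost = sum(residuals) / len(residuals)
        if cost < best_cost:
            best_pose, best_cost = pose, cost
    return best_pose


def fuse(world_map: Map, weights: List[int], observation: List[float], pose: int) -> None:
    """Map update: running average of every observation of each cell."""
    for offset, value in enumerate(observation):
        i = pose + offset
        if world_map[i] is None:
            world_map[i], weights[i] = value, 1
        else:
            world_map[i] = (world_map[i] * weights[i] + value) / (weights[i] + 1)
            weights[i] += 1


def run_loop(true_world: List[float], true_poses: List[int]):
    world_map: Map = [None] * len(true_world)
    weights = [0] * len(true_world)
    estimated = [true_poses[0]]  # the first pose is taken as known
    fuse(world_map, weights, true_world[true_poses[0]:true_poses[0] + WINDOW], true_poses[0])
    for true_pose in true_poses[1:]:
        observation = true_world[true_pose:true_pose + WINDOW]  # measurement (simulated)
        pose = track(world_map, observation, estimated[-1])     # prediction + tracking
        fuse(world_map, weights, observation, pose)             # map update
        estimated.append(pose)
    return estimated, world_map


poses, final_map = run_loop([5, 1, 4, 2, 8, 3, 7, 6], [0, 1, 2, 3, 4])
print(poses)
print(final_map)
```

Each iteration uses the map to explain the new measurement and then uses the measurement to extend the map, which is the feedback cycle that makes the loop closed.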

Why "generative" matters: The prediction step is a generative model, in that the system generates an expected observation from its internal world model. The same render-and-compare idea underlies variational autoencoders, NeRFs, and Gaussian splatting. The world model acts as a generator (hand-designed in KinectFusion, learned in later systems), and SLAM is the process of refining that generator so its outputs match reality. In that sense, dense SLAM systems like KinectFusion were doing "generative AI" before the term became popular.

KinectFusion (Newcombe et al., 2011) is Davison's paradigmatic example of the closed loop done right. It maintains a voxel grid representing the 3D scene, renders predicted depth images from any viewpoint, tracks the camera by aligning predicted and observed depth images, and fuses new depth data back into the voxel grid — all at 30 Hz on a single GPU.

The key insight: in a well-designed closed loop, the system predicts every pixel of the next observation. Anything that differs from prediction is either a tracking error (correct the pose) or new information (update the map). This is maximally informative — nothing is wasted.

What makes the SLAM loop "closed" rather than "open"?

The world model generates predictions that are compared with new measurements, and the differences update both pose and map, creating a continuous feedback cycle
The system processes each frame independently
The camera returns to its starting position

Chapter 6: Co-Design with Hardware

For decades, software developers enjoyed a free ride: Moore's law doubled transistor counts roughly every two years, and Dennard scaling kept power density flat, so each generation delivered more transistors in the same power envelope. But Dennard scaling broke down around 2006. More transistors now means more power, which means more heat, which means you can't actually use all those transistors at once.
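The scaling argument can be written out with the standard first-order model of CMOS switching power. The symbols below (activity factor α, capacitance C, supply voltage V, clock frequency f, area A, scaling factor k) are textbook notation introduced here for illustration and do not come from the paper.

```latex
% Dynamic switching power of a CMOS circuit
P = \alpha C V^2 f

% Dennard scaling by a factor k > 1: dimensions, voltage and capacitance shrink, frequency rises
L \to L/k, \quad V \to V/k, \quad C \to C/k, \quad f \to k f

% Power per transistor falls exactly as fast as area per transistor
P' = \alpha \frac{C}{k} \frac{V^2}{k^2} (k f) = \frac{P}{k^2},
\qquad A' = \frac{A}{k^2}
\quad\Rightarrow\quad \frac{P'}{A'} = \frac{P}{A}

% Once the voltage can no longer shrink (V' = V) while the other terms still scale
P' = \alpha \frac{C}{k} V^2 (k f) = P
\quad\Rightarrow\quad \frac{P'}{A'} = k^2 \frac{P}{A}
```

In this simplified model, power density grows with the square of the scaling factor once voltage stops shrinking, which is why designers stopped raising clock frequency and turned to parallelism instead.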

This created the power wall. The only path forward is parallelism — use many simple cores instead of one fast core, and keep data movement minimal. But here's the catch: generic parallelism (like GPUs) still wastes enormous energy moving data between processing elements and memory. The solution is application-specific parallelism, where the hardware's physical structure matches the algorithm's data flow.

Three hardware paradigms for Spatial AI

Graph Processors (Graphcore IPU): A massively parallel processor with ~1,200 independent tiles, each with its own local memory and compute. Tiles communicate via a configurable interconnect — the physical communication graph can be reshaped to match any algorithm's data dependencies. Perfect for the map graph, where features need to exchange information with their co-visible neighbors.

Neuromorphic Chips (SpiNNaker): Processors inspired by biological neural networks, with asynchronous, event-driven computation. Instead of processing every pixel at a fixed frame rate, they process only changes — a natural fit for event cameras and temporally sparse signals.

Custom Vision ASICs: Purpose-built chips for specific vision tasks (like Intel's Movidius VPU). Extremely power-efficient for their target workload, but inflexible.

The "Spatial AI brain" concept: Davison envisions a heterogeneous processor where different subsystems handle different graph structures. A CNN accelerator (regular grid) processes raw images. A graph processor (irregular) maintains and optimizes the map. A control processor orchestrates the closed loop. And custom sensor interfaces handle close-to-sensor processing. All on one chip, communicating over short, high-bandwidth internal links.

The power numbers tell the story. A desktop system built around a GPU such as the NVIDIA GTX 1080 (rated at 180 watts for the card alone) runs SLAM at 200+ watts. A mobile SoC (Snapdragon) might manage 5 watts. But the target for always-on AR glasses is under 1 watt, and that's for the entire perception pipeline, not just SLAM. Only radical co-design can close this roughly 200x gap.

Why did the end of Dennard scaling create the "power wall" that makes co-design essential?

Without Dennard scaling, more transistors means more power and heat, so the only path to efficiency is parallelism with minimal data movement, which requires matching hardware topology to algorithm structure
Dennard scaling made chips too expensive
It became impossible to manufacture smaller transistors

Chapter 7: Close-to-Sensor Processing

The most radical idea in the paper: the camera itself should become an active participant in the perception loop, not a passive data source. Today's cameras are "dumb" — they capture full frames at a fixed rate and throw megabytes of raw pixels at the processor. Most of those pixels are redundant (nothing changed since last frame). All of them must travel from sensor to processor, consuming precious energy on data movement.

Event Cameras

Event cameras (Dynamic Vision Sensors) are the first step toward smarter sensors. Instead of capturing full frames, each pixel independently reports when its brightness changes. A static scene produces zero data. A moving edge produces a sparse stream of events only at the boundary. Bandwidth drops by 10-100x for typical scenes.
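A frame-based approximation of this behaviour is sketched below. A real event camera is asynchronous and each pixel keeps its own reference level; here we simply compare two frames in log intensity and emit an event wherever the change exceeds a contrast threshold. The threshold value and the tiny frames are illustrative assumptions.

```python
import math
from typing import List, Tuple

Frame = List[List[float]]
Event = Tuple[int, int, int]  # (row, column, polarity)


def generate_events(reference: Frame, frame: Frame, threshold: float = 0.2) -> List[Event]:
    """Emit an event at every pixel whose log intensity changed by at least `threshold`.

    Polarity is +1 for a brightness increase and -1 for a decrease.
    Intensities must be strictly positive.
    """
    events: List[Event] = []
    for r, (ref_row, new_row) in enumerate(zip(reference, frame)):
        for c, (old, new) in enumerate(zip(ref_row, new_row)):
            delta = math.log(new) - math.log(old)
            if abs(delta) >= threshold:
                events.append((r, c, 1 if delta > 0 else -1))
    return events


previous = [[100.0, 100.0], [100.0, 100.0]]
current = [[100.0, 200.0], [50.0, 100.0]]

print(generate_events(previous, previous))  # static scene: no events
print(generate_events(previous, current))   # two pixels changed
```

The output size scales with how much of the scene changed rather than with the sensor resolution, which is where the bandwidth saving comes from.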

SCAMP: Processing on the Image Plane

Even more radical: the SCAMP-5 chip (developed at the University of Manchester) puts a tiny processor at every pixel. A 256x256 array of pixel-processors can perform convolutions, edge detection, and simple CNN operations directly on the image plane, at 1.2 watts. Data never leaves the sensor — computation happens where the photons land.

The Generalized Event Camera

Davison's boldest proposal: a camera that doesn't just report brightness changes (like an event camera), but reports deviations from the world model's predictions.

The vision: The processor sends its predicted image to the camera. The camera compares, pixel by pixel, the predicted image with the actual light falling on the sensor. It reports only the differences. If the world model is accurate, almost no data needs to be transmitted. The camera becomes a "validation sensor" — confirming predictions and flagging surprises. This is bidirectional camera-processor communication.

Think about what this means for power. In a familiar environment where the world model is good, the camera transmits almost nothing — just the occasional surprise. The bandwidth between sensor and processor drops to near zero. All the energy that was spent moving megabytes of redundant pixels per frame is saved.

This concept — that a sensor should report only where received data differs from prediction — is a generalization of the event camera concept. A standard event camera defines "prediction" as "the previous pixel value." The generalized event camera defines "prediction" as "what the world model says this pixel should look like." It's the same principle, but with a much better predictor.
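The difference between the two predictors can be shown on a single row of pixels. In this sketch the scene, the camera shift, and the threshold are all made-up values, and the "world model prediction" is simply handed to the function; how a sensor would receive and compare such a prediction in hardware is left open.

```python
from typing import List


def report_deviations(predicted: List[float], actual: List[float], threshold: float = 5.0) -> List[int]:
    """Return the pixel indices where the sensor reading departs from the predictor."""
    return [i for i, (p, a) in enumerate(zip(predicted, actual)) if abs(a - p) > threshold]


# One image row. The camera moves one pixel to the right between frames,
# and a new bright object appears at the last pixel.
previous_frame = [10.0, 10.0, 50.0, 50.0, 10.0, 10.0]
current_frame = [10.0, 50.0, 50.0, 10.0, 10.0, 90.0]

# Hypothetical world-model prediction: it knows the static scene and the camera
# motion, so it predicts the shifted row, but not the new object.
model_prediction = [10.0, 50.0, 50.0, 10.0, 10.0, 10.0]

standard = report_deviations(previous_frame, current_frame)       # predictor: last frame
generalized = report_deviations(model_prediction, current_frame)  # predictor: world model

print("standard event camera reports pixels:", standard)
print("generalized event camera reports pixels:", generalized)
```

With the previous frame as predictor, camera motion alone triggers reports at every moving edge. With the world model as predictor, only the genuinely new object is reported.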

How does the "generalized event camera" concept differ from a standard event camera?

A standard event camera reports changes from the previous pixel value; a generalized event camera reports deviations from the world model's prediction of what each pixel should look like
A generalized event camera has higher resolution
A generalized event camera works in infrared

Chapter 8: The Spatial AI Brain

Davison synthesizes everything into a single architectural vision: the Spatial AI brain. This is the paper's "Figure 4" — a conceptual diagram of what a complete, integrated Spatial AI processor would look like. Let's build it up piece by piece.

The Map Store

At the center sits the map — a distributed graph of learned features stored across the cores of a graph processor. Each processor core "owns" a local region of the map. The feature at each node encodes geometry, appearance, and semantics in a single learned representation (a latent code). Nearby map features live on physically nearby cores, so local map operations (smoothing, optimization, co-visibility queries) happen without long-distance data movement.
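The benefit of placing neighbouring map features on neighbouring cores can be illustrated with a toy placement cost. The sketch assumes a 2D grid of tiles, a Manhattan hop count as a stand-in for communication cost, and four hypothetical features; none of these details come from the paper or from any particular processor.

```python
import math
from typing import Dict, List, Tuple

Position = Tuple[float, float]
Tile = Tuple[int, int]


def spatial_tile(position: Position, cell_size: float) -> Tile:
    """Assign a feature to the tile that owns its region of space."""
    x, y = position
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def total_hops(assignment: Dict[str, Tile], edges: List[Tuple[str, str]]) -> int:
    """Sum of Manhattan tile distances over all map edges (toy communication cost)."""
    hops = 0
    for a, b in edges:
        (ax, ay), (bx, by) = assignment[a], assignment[b]
        hops += abs(ax - bx) + abs(ay - by)
    return hops


# Hypothetical features with 2D positions and co-visibility edges.
features = {"A": (0.1, 0.1), "B": (0.4, 0.3), "C": (1.2, 0.2), "D": (1.6, 0.7)}
edges = [("A", "B"), ("C", "D"), ("B", "C")]

# Locality-aware placement: each tile owns a 1x1 region of space.
spatial = {name: spatial_tile(pos, cell_size=1.0) for name, pos in features.items()}

# Locality-blind placement: deal features out to tiles in turn.
tiles = [(0, 0), (1, 0)]
round_robin = {name: tiles[i % len(tiles)] for i, name in enumerate(features)}

print("spatial placement hops:", total_hops(spatial, edges))
print("round-robin placement hops:", total_hops(round_robin, edges))
```

Both placements use the same two tiles and store the same data; only the spatial one keeps most edges inside a tile, so most local map operations need no inter-tile traffic.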

The Real-Time Loop

Surrounding the map store is the real-time processing loop: rendering (predict what the cameras should see), tracking (compare prediction to reality), fusion (incorporate new data), and CNN labeling (extract semantic information). This loop runs at frame rate — 30 Hz or faster. Each stage reads from and writes to the map store through short, local connections.

Camera Interfaces

At the periphery, camera interfaces handle bidirectional communication with smart sensors. The processor sends rendered predictions out to the cameras. The cameras send back only the differences. Multiple cameras can be serviced simultaneously, each with its own interface.

Cloud Connection

A network interface connects to the cloud, enabling the device to download previously mapped areas (no need to re-map a known building) and upload new map data. Richard Newcombe's vision: all devices eventually share a single, global "machine perception map" of the entire world — a shared, continuously updated 3D model maintained by billions of devices.

The full picture: A disc-shaped processor. At the center: the map store on graph processor cores. Around it: the real-time perception loop (render, track, fuse, label). At the edges: smart camera interfaces sending/receiving only diffs. A cloud link for shared global maps. All connected by short, high-bandwidth internal links. Total power budget: under 1 watt. This is the "brain" of every future spatial device — from AR glasses to delivery drones to autonomous cars.

What makes this architecture powerful is not any single component — it's the data locality. The map features that are most relevant to the current view are stored on cores that are physically closest to the camera interface processing that view. Predictions flow outward from map to camera. Differences flow inward from camera to map. Nothing travels far.

What is the central organizing principle of the "Spatial AI brain" architecture?

Data locality: the map store, real-time loop, and camera interfaces are physically arranged so that data flows through short, local connections, minimizing energy-expensive long-distance data movement
Maximum clock speed for each component
Cloud-first processing with local caching

Chapter 9: Connections

FutureMapping was written in 2018. In the years since, many of its predictions have begun to materialize — some in ways Davison anticipated, others in surprising directions.

NeRF and Gaussian Splatting (2020-2023)

Neural Radiance Fields and 3D Gaussian Splatting come close to the "learned scene representation" that Davison predicted. A NeRF stores a scene as a neural network's weights; Gaussian splatting stores it as a set of optimized 3D Gaussian primitives with per-primitive appearance parameters. Both encode geometry and appearance together in a form that can be rendered from any viewpoint, much as FutureMapping called for, and Gaussian-based SLAM systems now run the closed loop with such representations at or near real time.

SLAM++ and CodeSLAM

SLAM++ (Salas-Moreno et al., 2013) pioneered object-level SLAM — recognizing known 3D objects and inserting them as graph nodes. CodeSLAM (Bloesch et al., 2018, from Davison's own group) learned compact depth codes that could be optimized jointly with camera poses. Both are direct precursors to the "Level 4" vision.

Foundation Models for Robotics

The explosion of vision-language-action models (RT-2, pi-0, Octo) represents a different path: instead of building explicit 3D maps, these models implicitly encode spatial understanding in massive neural networks. This is precisely the "task-specific embedding" approach that Davison's H1 argues against. The debate remains open — are explicit maps or implicit representations the right answer? Perhaps both, with explicit maps for precision tasks and implicit models for generalization.

Apple Vision Pro (2024)

Apple's spatial computing headset is perhaps the closest realization of FutureMapping's hardware vision. It runs real-time SLAM with semantic understanding on a custom R1 chip designed specifically for sensor processing, paired with an M2 chip for general computation. Apple states that the R1 streams new images to the displays within 12 milliseconds, which is close-to-sensor processing in action. The headset relies on an external battery pack with a rated runtime of roughly two hours, so its power draw is still well above Davison's sub-watt target, though far below desktop SLAM.

Autonomous Driving

Self-driving cars face the same three-graph problem at a larger scale: image graphs from multiple cameras, a dynamic map graph of lanes/vehicles/pedestrians, and a real-time computation graph with hard latency requirements. Tesla's approach (custom inference chips processing raw camera data) echoes the co-design philosophy.

SLAMBench and SemanticFusion

SLAMBench, built by Davison and collaborators (Nardi et al., 2015), is a standardized benchmarking framework for SLAM and an attempt to realize H2's universal metrics. SemanticFusion demonstrated real-time CNN-based semantic labeling fused into dense 3D maps, bridging Levels 2 and 3.

The broader lesson: FutureMapping's enduring insight isn't about any specific algorithm or chip. It's that spatial intelligence is a systems problem, one you can't solve by optimizing a single layer in isolation. The algorithm, the processor, and the sensor must be designed as one integrated system. This co-design philosophy is now mainstream in AI hardware (TPUs, NPUs, Apple Neural Engine), which is consistent with Davison's 2018 argument that general-purpose processors would not be enough.

Key references from the paper

KinectFusion (Newcombe et al., 2011) — real-time dense SLAM on GPU
ORB-SLAM (Mur-Artal et al., 2015) — state-of-the-art sparse feature SLAM
SemanticFusion (McCormac et al., 2017) — CNN semantics in dense SLAM
CodeSLAM (Bloesch et al., 2018) — learned compact depth codes
SLAM++ (Salas-Moreno et al., 2013) — object-level SLAM
SLAMBench (Nardi et al., 2015) — standardized SLAM benchmarking
Graphcore IPU — massively parallel graph processor
SpiNNaker — neuromorphic computing platform
SCAMP-5 — pixel-parallel vision chip

Which post-2018 development most directly realized FutureMapping's prediction of "learned scene representations encoding geometry and appearance"?

NeRF and 3D Gaussian Splatting: both store scenes as learned representations (neural network weights or learned 3D primitives with appearance codes) that can be rendered from any viewpoint
ChatGPT
Self-driving cars