Chen, Gao, Chen, Xue, Zhu, Cheng, Sun, Hu, Yao, Xu, Shen — 2026

Geometric Context Transformer for Streaming 3D

Three types of learned context — anchor frames, a sliding window, and compressed trajectory memory — give a feed-forward model the spatial awareness of SLAM, at 20 FPS over 10,000+ frames.

Prerequisites: Transformers (attention masks) + Camera pose estimation basics + SLAM concepts (optional)
10 Chapters, 4 Simulations

Chapter 0: The Problem

You have a camera on a robot, a drone, or a pair of AR glasses. It streams video at 30 FPS as you walk through a building. You want to reconstruct the 3D scene in real time — camera poses, depth maps, point clouds — as each new frame arrives.

This is streaming 3D reconstruction: process frames one at a time, causally (no peeking at the future), and keep going for thousands or tens of thousands of frames.

The requirements are demanding. You need geometric accuracy (the 3D positions must be correct), temporal consistency (the reconstruction shouldn't jump around between frames), and computational efficiency (you need to keep up with the camera, ideally at 20+ FPS). Getting all three simultaneously is the hard part.

The catch? Every approach so far breaks down in one of three ways:

Offline Models (VGGT, DUSt3R)
Process all frames at once with bidirectional attention. High quality, but you need the full video upfront. Memory scales quadratically with sequence length. Can't stream.
Causal Attention (StreamVGGT, Stream3R)
Process frames left-to-right, caching everything. Works for streaming but the KV cache grows linearly — 10,000 frames means ~5 million tokens. Memory explodes.
Recurrent Compression (CUT3R)
Compress all history into a fixed-size state, RNN-style. Constant memory, but aggressive compression causes state forgetting. After a few hundred frames, the model loses track of where it has been.
The fundamental tension: You need to remember enough about the past to avoid drift (the camera slowly losing track of its position), but you can't remember everything or memory blows up. The question is: what exactly should you remember, and in what form?

There's also a fourth category: hybrid SLAM-based methods like VGGT-SLAM and MASt3R-SLAM. These integrate learned 3D models with classical SLAM backends (keyframe selection, pose-graph optimization). They're more robust than pure learned methods, but the classical components require hand-crafted heuristics and iterative optimization that limit real-time applicability.

Concrete numbers: the memory wall

Let's make this precise. DINOv2-ViT with patch size 14 on a 518x378 image produces M = 504 image tokens, each of dimension C = 768. With 6 context tokens per frame, that's 510 tokens per frame. Each token in the KV cache occupies 2 x C x 2 bytes (key + value, float16) per layer, or 3,072 bytes. Across 24 layers, that's ~72 KB per token.

For a 10,000-frame sequence under causal attention: 510 x 10,000 = 5.1M tokens. At ~72 KB per token, the KV cache alone is ~360 GB, which exceeds the memory of any single GPU. Even at 1,000 frames, the ~36 GB KV cache takes nearly half of an 80 GB A100 before weights and activations are counted. This isn't a theoretical concern; it's a hard engineering wall.
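The per-token figure and the resulting cache sizes can be reproduced with a short calculation. This is a minimal Python sketch, assuming every one of the 24 layers keeps keys and values for every token (if only the cross-frame layers cache, the totals halve). The function names are illustrative, not taken from the LingBot-Map code.

```python
def kv_bytes_per_token(channels=768, bytes_per_value=2, layers=24):
    """Key + value storage for one token across the whole layer stack."""
    return 2 * channels * bytes_per_value * layers


def causal_kv_gigabytes(num_frames, tokens_per_frame=510):
    """KV cache size (decimal GB) when every token of every frame is kept."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * kv_bytes_per_token() / 1e9


print(kv_bytes_per_token())                    # 73728 bytes, i.e. 72 KiB
print(round(causal_kv_gigabytes(1_000), 1))    # 37.6
print(round(causal_kv_gigabytes(10_000), 1))   # 376.0
```

The results are in decimal gigabytes computed from the exact 73,728 bytes per token, so they land slightly above figures rounded from "72 KB". Real caches also carry allocator and paging overhead that this estimate ignores.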

Look at what happens to memory as a sequence gets longer. Causal attention retains every token from every frame, so the cost grows without bound. A sliding window bounds memory but forgets the distant past entirely. The insight behind LingBot-Map, the model this lesson covers, is that you can have both: near-constant memory and long-range context.

[Interactive figure: Memory Growth: Causal vs. Window vs. GCA. Total token count as a function of sequence length T under each strategy, with M = 500 image tokens per frame, 6 context tokens per frame, n = 3 anchor frames, and window k = 16.]
Why does causal attention become impractical for long streaming sequences?

Chapter 1: The Key Insight

Classical SLAM systems — the hand-engineered pipelines that have powered robotics for decades — already solved this memory problem, just not with neural networks. They maintain three distinct types of spatial context:

Reference Frame
A fixed coordinate origin that anchors the entire reconstruction. Without it, scale and position are ambiguous.
Local Window
The last few frames, kept at full resolution. These provide dense visual overlap for accurate relative pose estimation between nearby views.
Global Map
A compact summary of everywhere the camera has been. Sparse keypoints, pose graph edges — just enough to detect and correct long-range drift.

The genius of SLAM is this decomposition: not all context is equally important. Recent frames need full detail (for local matching). Distant frames need only a skeleton (for global consistency). And a few initial frames need to be permanent anchors (for coordinate grounding).

LingBot-Map's insight: Replace SLAM's hand-crafted components — keypoint extraction, feature matching, bundle adjustment, pose-graph optimization — with a single learned attention mechanism that maintains the same three types of context. The attention mask enforces the structure; the transformer learns what to put in each slot.

The key is to remember selectively. Human spatial memory doesn't record every visual frame. It preserves landmarks (where am I relative to the entrance?), recent context (what did I just walk past?), and a rough trajectory history (I turned left, then right, then went upstairs). LingBot-Map mirrors this structure.

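This gives rise to Geometric Context Attention (GCA), which decomposes the streaming state into:

Anchor Context
The first n frames, kept permanently with full tokens. They fix the coordinate origin and the scale.
Pose-Reference Window
The k most recent frames, kept with full image tokens for dense local matching.
Trajectory Memory
Every older frame, compressed to 6 context tokens that record where the camera has been.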

The crucial difference from classical SLAM: all three context types are maintained within a single attention mechanism. There's no separate keypoint extractor, no separate feature matcher, no separate bundle adjustment module. Just a transformer with a structured attention mask. The mask enforces which frames attend to which; the learned attention weights decide what information to extract.

The rest of this lesson unpacks each component and shows how they compose into a single attention mask.

What is the core design principle that LingBot-Map borrows from classical SLAM?

Chapter 2: Anchor Context

Monocular 3D reconstruction is inherently scale-ambiguous. A video of a dollhouse and a video of a real house can produce identical image sequences. Without some reference, the model has no way to pin down absolute scale or establish a consistent coordinate system.

Offline methods like VGGT handle this by normalizing against the entire point cloud after processing all frames. But in streaming, you don't have the full point cloud yet.

The Solution: Anchor Frames

LingBot-Map designates the first n frames (typically n=3) as anchor frames. These frames are processed with full bidirectional attention among themselves at the start of the sequence, establishing a coordinate origin and absolute scale. Their tokens are then frozen in the KV cache and never evicted.

Every subsequent frame attends to these anchor tokens, ensuring that all predictions are grounded in the same coordinate system.

Learnable Anchor Token

To help the network distinguish anchor frames from streaming frames, each anchor frame is augmented with a learnable anchor token a ∈ ℝ^C (C = 768). This is a special token, like a CLS token, that gets added to the frame's token set. It tells the attention layers: "this frame is an anchor; treat it differently."

What if the first frames are bad?

The anchor frames are permanent — they're never evicted. So what happens if the camera starts pointing at a textureless wall, or is heavily motion-blurred? The model has no explicit failure detection for bad anchors. In practice, the progressive training (starting with short sequences of 2-24 frames) teaches the model to extract whatever geometric signal exists from any set of frames. On the Oxford Spires benchmark, even sequences that start with visually uninformative frames (dark stairwells) maintain stable trajectories — because the anchor establishes scale and origin from relative geometry between the 3 frames, not from any single frame's visual content.

However, if the first 3 frames have zero baseline (e.g., camera is completely static), scale becomes genuinely unrecoverable. The training data includes such cases with near-zero-baseline starts, so the model learns to output conservative depth predictions until the baseline grows.

Scale Normalization

During training, all ground-truth depths and camera translations are normalized by the mean distance of the anchor point cloud from the origin:

\[ s = \frac{1}{|\bar{X}_{\text{anchor}}|} \sum_{x \in \bar{X}_{\text{anchor}}} \|x\|_2 \]

This canonical scale, derived from just the first n frames, gives the model a consistent reference regardless of scene size.
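As a concrete illustration of the normalization, here is a minimal Python sketch that computes the mean distance of an anchor point cloud from the origin and divides translations by it. The helper names and the toy points are made up for the example; the actual training pipeline is not shown in this lesson.

```python
import math


def anchor_scale(anchor_points):
    """Mean Euclidean distance of the anchor point cloud from the origin."""
    if not anchor_points:
        raise ValueError("anchor point cloud is empty")
    norms = [math.sqrt(x * x + y * y + z * z) for x, y, z in anchor_points]
    return sum(norms) / len(norms)


def normalize_by_scale(vectors, scale):
    """Divide 3D points or camera translations by the canonical scale."""
    return [(x / scale, y / scale, z / scale) for x, y, z in vectors]


# Toy anchor cloud whose point norms are 5, 5, and 10.
points = [(3.0, 4.0, 0.0), (0.0, 0.0, 5.0), (6.0, 8.0, 0.0)]
s = anchor_scale(points)
print(s)  # mean of 5, 5, 10

# After normalization the anchor cloud has mean distance ~1 from the origin.
print(anchor_scale(normalize_by_scale(points, s)))
```

Because s depends only on the anchor frames, it is available as soon as streaming starts, which is the property the offline whole-cloud normalization lacks.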

Think of it this way: The anchor frames are like hammering a stake into the ground before you start surveying. Every measurement you take from that point forward is relative to that stake. If the stake is solid, the whole survey is solid.

Why n=3?

A single frame gives no stereo baseline: you can't triangulate depth from one view. Two frames give a baseline but a fragile one. Three frames provide enough geometric diversity to robustly establish both the coordinate system and the scale, while keeping the permanent token overhead small (3 × 510 = 1,530 tokens, forever in the cache). At 72KB per token across all layers, that's ~110 MB of permanent KV cache, small compared to the ~600 MB of a k=16 window and the trajectory's growing contribution.

Why can't a streaming model just normalize scale against the entire point cloud, like VGGT does?

Chapter 3: Pose-Reference Window

The anchor frames fix the coordinate system, but they're far away from where the camera is now. If you're on frame 5,000 of a building walkthrough, the first 3 frames show a completely different part of the scene. They can't help you match the current view to its neighbors.

Dense Local Context

To accurately register each new frame, you need dense visual overlap with nearby observations. Think about it from the model's perspective: to figure out where the camera is now, it needs to see how the current view overlaps with views it has already positioned. The more overlap, the more constraints, the better the pose estimate.

This is exactly what feature matching does in classical SLAM: you find corresponding pixels between adjacent frames and triangulate their 3D positions. LingBot-Map does the same thing implicitly through attention — the cross-frame attention between the current frame and window frames effectively performs learned feature matching.

LingBot-Map maintains a sliding window of the k most recent frames (typically k=16 to 64), retaining their full image tokens (all M tokens per frame). The current frame attends to these tokens with full cross-frame attention, giving it rich local geometry for pose estimation.

Why Full Tokens?

Each frame produces M ≈ 500 image tokens from the ViT backbone. These tokens encode fine-grained spatial features — edges, textures, corners — that are essential for pixel-level matching. You can't compress these without losing the detail needed for accurate relative pose estimation.

Relative Pose Loss

To further encourage geometric consistency within the window, LingBot-Map applies a relative pose loss between all frame pairs in the sliding window:

\[ L_{\text{rel-pose}} = \frac{1}{k(k-1)} \sum_{i \neq j} \left[ L_{\text{rot}}(i,j) + \lambda_{\text{trans}} \, L_{\text{trans}}(i,j) \right] \]

This is computed over geodesic rotation error and L1 translation error for every pair of frames within the window. Because the window contains only already-observed frames, this loss is inherently causal.
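The pairwise loss can be sketched in plain Python. The sketch below assumes each pose is a camera-to-world pair (R, t), that relative poses are derived from the per-frame absolute predictions, and that lambda_trans = 1.0; none of these details are confirmed by the lesson, and the helper names are hypothetical.

```python
import math
from itertools import permutations


def matmul(a, b):
    return [[sum(a[i][m] * b[m][j] for m in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def relative_pose(pose_i, pose_j):
    """Pose of frame j expressed in frame i, for camera-to-world (R, t) pairs."""
    (r_i, t_i), (r_j, t_j) = pose_i, pose_j
    r_it = transpose(r_i)
    diff = [t_j[a] - t_i[a] for a in range(3)]
    rel_t = [sum(r_it[a][b] * diff[b] for b in range(3)) for a in range(3)]
    return matmul(r_it, r_j), rel_t


def geodesic_angle(r_a, r_b):
    """Rotation angle in radians of r_a^T r_b."""
    r = matmul(transpose(r_a), r_b)
    cos = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.acos(max(-1.0, min(1.0, cos)))


def relative_pose_loss(pred, gt, lambda_trans=1.0):
    """Mean of rotation + weighted L1 translation error over ordered pairs i != j."""
    k = len(pred)
    total = 0.0
    for i, j in permutations(range(k), 2):
        r_p, t_p = relative_pose(pred[i], pred[j])
        r_g, t_g = relative_pose(gt[i], gt[j])
        l1 = sum(abs(a - b) for a, b in zip(t_p, t_g))
        total += geodesic_angle(r_p, r_g) + lambda_trans * l1
    return total / (k * (k - 1))


def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


gt = [(rot_z(0.0), [0.0, 0.0, 0.0]), (rot_z(0.10), [1.0, 0.0, 0.0])]
pred = [(rot_z(0.0), [0.0, 0.0, 0.0]), (rot_z(0.15), [1.2, 0.0, 0.0])]
print(relative_pose_loss(gt, gt))    # ~0 for identical trajectories
print(relative_pose_loss(pred, gt))  # positive: the pair disagrees with ground truth
```

Note that the loss only looks at differences between frames, so a window whose poses are all shifted by the same rigid transform scores zero. That is why it complements, rather than replaces, the absolute pose loss.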

Why relative pose loss within the window?

Consider frames 985-1000 in the sliding window. Each frame's pose is predicted independently by the camera head (a single MLP operating on one token). Without the relative pose loss, two adjacent frames might each have small absolute pose errors that happen to disagree with each other — frame 999 thinks it moved 10cm left, frame 1000 thinks it moved 10cm right. The relative loss directly penalizes this disagreement by computing all pairwise geodesic and translation errors within the window. This creates a soft constraint that the window's poses form a geometrically consistent local trajectory — even if the absolute poses have small global drift.

Key design choice: The window size k is not fixed during training — it's randomly sampled from 16 to 64. This exposes the model to varying receptive fields and makes it robust at inference time when different window sizes may be used depending on the application.
Why does the sliding window retain full image tokens (all ~500 per frame) instead of compressing them?

Chapter 4: Trajectory Memory

You have anchors (first n frames, full tokens) and a window (last k frames, full tokens). But what about the hundreds or thousands of frames in between? Without any record of these intermediate frames, pose errors accumulate unchecked. The estimated trajectory drifts.

Causal attention's answer: keep everything. But that's O(T · M) tokens, growing without bound.

LingBot-Map's answer: keep a skeleton.

6 Tokens Per Frame

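When a frame exits the sliding window (it's no longer among the k most recent), LingBot-Map evicts its image tokens (all M of them) and retains only:

1 camera token (the frame's pose)
1 anchor token (the frame's relationship to the coordinate origin)
4 register tokens (a learned, compressed summary of the frame)

That's 6 tokens per frame instead of M + 6 ≈ 510, an ~85x reduction in per-frame storage.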

Why exactly 6 tokens? Why not 1 or 12?

This is a deliberate engineering decision. A single token (just the camera token) would encode only the pose — but the model also needs to know how this frame relates to the anchors and what geometric context it captured. The anchor token encodes the frame's relationship to the coordinate origin. The 4 register tokens are learnable "slots" that the attention mechanism fills with whatever compressed information the model finds most useful for drift correction — think of them as a tiny bottleneck autoencoder of the frame's geometric contribution.

Why not more? The paper ablated with 8 and 12 tokens: diminishing returns beyond 6, while memory grows proportionally. At T=10,000 frames, going from 6 to 12 tokens doubles trajectory memory from 60K to 120K tokens — a meaningful cost for negligible accuracy gain (<0.1m ATE improvement on Oxford Spires).

Temporal Positional Encoding

Without timestamps, the trajectory memory would be an unordered bag of 6-token summaries. The model would know that the camera visited certain places, but not when or in what order. That's a problem: the trajectory structure matters for drift correction (nearby-in-time frames should have nearby poses).

To fix this, the trajectory memory tokens are augmented with video temporal positional encodings. These tell the attention layers when each past observation occurred, enabling the model to reason about the trajectory's temporal structure — "frame 500 was 200 steps ago, frame 900 was 50 steps ago."

The Payoff: ~80x Memory Reduction

For a T-frame sequence with n=3 anchors and k=16 window frames, the total context under GCA is:

GCA: (n+k) · M + 6T = 19 · 500 + 6T = 9,500 + 6T
Causal: T · (M+6) = 506T

At T=10,000 frames: GCA uses ~70K tokens. Causal uses ~5M tokens. That's a 71x reduction.
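The two formulas above are easy to tabulate. A minimal Python sketch, using the same simplification as the text (every frame contributes 6 context tokens, anchor and window frames contribute M image tokens); the pure sliding-window variant is included only for comparison, and the function names are illustrative.

```python
def causal_tokens(T, M=500, ctx=6):
    """Causal attention keeps every token of every frame."""
    return T * (M + ctx)


def window_tokens(T, k=16, M=500, ctx=6):
    """A pure sliding window keeps only the k most recent frames."""
    return min(T, k) * (M + ctx)


def gca_tokens(T, n=3, k=16, M=500, ctx=6):
    """GCA: full image tokens for anchors + window, 6 context tokens per frame."""
    return (n + k) * M + ctx * T


for T in (1_000, 10_000):
    ratio = causal_tokens(T) / gca_tokens(T)
    print(T, causal_tokens(T), window_tokens(T), gca_tokens(T), round(ratio, 1))
# 1000 506000 8096 15500 32.6
# 10000 5060000 8096 69500 72.8
```

The exact ratio at T=10,000 is about 73x; the ~71x quoted in the text comes from dividing the rounded figures (5M / 70K). As T grows, the ratio approaches (M + 6) / 6 ≈ 84.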

The core tradeoff: Image tokens are dense — they encode fine spatial detail for pixel matching. Context tokens are sparse — they encode only the pose and a compressed summary. For distant frames, you don't need pixel matching (the views don't overlap). You just need to know where the camera was and what direction it pointed. Six tokens are enough for that.

Worked Example

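Suppose you're on frame T=1,000 with n=3 anchors and a k=16 window. The attention context for the current frame consists of:

Anchor frames: 3 × 500 = 1,500 tokens
Sliding window: 16 × 500 = 8,000 tokens
Trajectory memory: 1,000 × 6 = 6,000 tokens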

Total: 15,500 tokens. Under causal attention, the same frame would need 1,000 × 506 = 506,000 tokens. That's a 33x reduction, and it only gets better with longer sequences.

At T=10,000: GCA needs ~70K tokens. Causal needs ~5M. That's a 71x reduction. The memory savings are enormous, and they come at minimal information cost because the evicted image tokens from distant frames wouldn't have been useful for pixel matching anyway — those views typically have zero visual overlap with the current frame.

[Interactive figure: GCA Memory Visualization. Anchor frames (full tokens), window frames (full tokens), and trajectory frames (compressed to 6 tokens each) as the sequence length T varies, with a live token count.]
What is retained for each frame in the trajectory memory (frames that have left the sliding window)?

Chapter 5: The GCA Attention Mask

The three context types — anchor, window, trajectory — are unified through a single structured attention mask. This mask determines, for every query token, which key tokens it can attend to.

How the Mask Works

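Each frame's tokens consist of a small leading segment of context tokens (camera + anchor + registers = 6) and a larger segment of image tokens (M ≈ 500). The mask specifies:

Anchor frames: all tokens (context + image) are visible to every query.
Window frames: all tokens are visible to the current frame.
Trajectory frames: only the 6 context tokens are visible; the image tokens have been evicted.
Future frames: nothing is visible, so the model stays causal.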

Let's compare this to the alternatives visually:

Attention Mask Patterns

Four attention strategies for a 12-frame sequence. Each cell represents whether query frame (row) can attend to key frame (column). Blue = full tokens accessible. Purple = context tokens only. Gray = no attention.

Reading the GCA mask: In the bottom-right panel, notice the three zones. The left columns (anchor) are fully blue — every frame sees the anchor's full tokens. The diagonal band (window) is blue — recent frames see each other's full tokens. The middle columns (trajectory) are purple — frames can attend, but only to 6 compressed tokens. This is the structured sparsity that makes GCA efficient.
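The three zones can be reproduced at frame granularity with a few lines of Python. This sketch assumes a window of 4 frames that includes the current frame (chosen only so the pattern is visible in a 12-frame example) and treats anchors as visible to each other in both directions; it builds a frame-level picture, not the token-level mask used inside the model.

```python
def gca_frame_mask(num_frames, n_anchor=3, window=4):
    """Frame-level GCA visibility: mask[q][j] for query frame q and key frame j.

    'F' = all tokens visible, 'c' = only the 6 context tokens, '.' = masked.
    """
    mask = []
    for q in range(num_frames):
        row = []
        for j in range(num_frames):
            if j < n_anchor:
                row.append("F")   # anchors: always fully visible
            elif j > q:
                row.append(".")   # future frame: masked (causal)
            elif q - j < window:
                row.append("F")   # inside the sliding window (current frame included)
            else:
                row.append("c")   # trajectory memory: context tokens only
        mask.append(row)
    return mask


for row in gca_frame_mask(12):
    print("".join(row))
```

The last row prints as FFFcccccFFFF: three anchor columns, five trajectory columns reduced to context tokens, and the four-frame window.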

Why Not Just Sliding Window?

A pure sliding window (bottom-left panel) bounds memory beautifully — but look at what it loses. Frame 10 has zero information about what happened at frame 1. If the camera revisits a location after a long loop, the sliding window has forgotten that location entirely. There's no mechanism for drift correction.

GCA's trajectory memory (the purple cells) fills this gap. Even though each past frame contributes only 6 tokens, those tokens encode where the camera was and how it related to the anchors. The attention mechanism can use this sparse global record to detect and correct drift — something a pure sliding window cannot do.

Why Not Just Causal?

Causal attention (top-right panel) retains everything — no information loss at all. But at T=10,000 frames, every new frame must attend to ~5 million tokens. That's both a memory problem (storing the KV cache) and a compute problem (the attention operation itself). GCA achieves similar global awareness with ~70K tokens, by recognizing that you don't need pixel-level detail from frame 50 when you're on frame 10,000.

Per-Frame Cost

Each new frame must attend to:

n·M anchor tokens
k·M window tokens
6T trajectory context tokens

The total is (n+k)·M + 6T. The first term is constant (~9,500 for n=3, k=16, M=500). The second grows at just 6 tokens per frame — versus M+6 ≈ 506 for causal attention. That's an 84x reduction in the growth rate.

In the GCA attention mask, what distinguishes trajectory frames from anchor and window frames?

Chapter 6: Architecture & Training

With GCA defined, let's zoom out to the full pipeline.

Backbone: ViT (DINOv2)

Each input image is encoded by a Vision Transformer backbone initialized from DINOv2, with a patch size of 14 pixels. This produces M image tokens per frame — rich visual features that encode edges, textures, and semantic content.

Token Augmentation

Each frame's M image tokens are augmented with:

1 camera token, read by the camera head to predict the pose
1 anchor token
4 register tokens

Total: M + 6 ≈ 510 tokens per frame. The token ordering within each frame is: [camera, anchor, register1-4, image1-M]. The 6 context tokens are placed first so that when trajectory frames are evicted, the retained tokens are a contiguous prefix — no memory compaction needed.
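The prefix layout makes eviction a simple truncation. The following Python sketch models the cache as one token list per frame; the class, the string tokens, and the choice to count the current frame as part of the window are assumptions made for illustration, not the FlashInfer-backed implementation.

```python
NUM_CONTEXT = 6  # camera + anchor + 4 registers, stored first in each frame


def make_frame(t, num_image_tokens=500):
    """Token list for frame t in the order [camera, anchor, registers, image]."""
    ctx = [f"cam{t}", f"anc{t}"] + [f"reg{t}_{i}" for i in range(4)]
    img = [f"img{t}_{i}" for i in range(num_image_tokens)]
    return ctx + img


class StreamingCache:
    def __init__(self, n_anchor=3, window=16):
        self.n_anchor, self.window = n_anchor, window
        self.frames = []  # one token list per frame, oldest first

    def push(self, frame_tokens):
        self.frames.append(frame_tokens)
        leaving = len(self.frames) - 1 - self.window  # frame that just left the window
        if leaving >= self.n_anchor:                  # anchors are never evicted
            # Keeping the contiguous prefix drops all image tokens in one slice.
            self.frames[leaving] = self.frames[leaving][:NUM_CONTEXT]

    def total_tokens(self):
        return sum(len(f) for f in self.frames)


cache = StreamingCache()
for t in range(1000):
    cache.push(make_frame(t))
print(cache.total_tokens())  # 15500
```

With 3 anchors and 16 window frames kept at 506 tokens each and the other 981 frames cut to 6 tokens, the count is 19 × 506 + 981 × 6 = 15,500, which matches the Chapter 4 worked example.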

Complete data flow for one streaming frame

Input
RGB image (518x378 pixels) arrives from camera at timestep t
ViT Backbone
DINOv2 ViT-L/14 encodes image → M=504 image tokens (each in ℝ^768)
Token Augmentation
Prepend 6 context tokens → 510 tokens total. Add temporal positional encoding.
24 Alternating Layers
Frame Attention (within-frame self-attn) → GCA (cross-frame with structured mask) → repeat 12x
Prediction Heads
Camera head: camera token → MLP → 4x4 pose matrix P̂_t
Depth head: image tokens → DPT decoder → HxW depth map + uncertainty

Alternating Attention Layers

The augmented tokens pass through 24 alternating layers of:

  1. Frame Attention: Self-attention within each frame. Lets the model refine per-frame features independently.
  2. GCA: Cross-frame attention with the structured mask from Chapter 5. Lets the model reason across frames, pulling context from anchors, the window, and the trajectory memory.

Prediction Heads

Why camera-to-world? Most prior methods (including VGGT) parameterize poses as world-to-camera transformations. But in this parameterization, rotation and translation are inherently coupled: the translation is t = −R·c for camera center c, so a small rotation error becomes a translation error that grows with the camera's distance from the origin. LingBot-Map supervises camera-to-world instead, where the translation is the camera position itself and is unaffected by rotation error. Inspired by Pi3, this stabilizes training on long sequences where small rotation errors would otherwise compound.
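The coupling can be seen in a two-dimensional toy example. The sketch below writes the camera-to-world rotation as R(θ) and the camera center as c, so the world-to-camera translation is −R(θ)ᵀc. It applies a 1-degree rotation error and measures how far that translation moves. The setup and names are illustrative assumptions, not the paper's experiment.

```python
import math


def rot2d(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def world_to_camera_translation(theta, center):
    """t = -R^T c for a camera-to-world rotation R(theta) and camera center c."""
    r = rot2d(theta)
    return [-(r[0][0] * center[0] + r[1][0] * center[1]),
            -(r[0][1] * center[0] + r[1][1] * center[1])]


def translation_shift(center, theta=0.3, rot_error=math.radians(1.0)):
    """Change in the world-to-camera translation caused by a pure rotation error."""
    t_true = world_to_camera_translation(theta, center)
    t_off = world_to_camera_translation(theta + rot_error, center)
    return math.dist(t_true, t_off)


for distance in (1.0, 10.0, 100.0):
    print(distance, round(translation_shift([distance, 0.0]), 4))
# 1.0 0.0175
# 10.0 0.1745
# 100.0 1.7453
```

The shift equals 2·|c|·sin(δ/2) for a rotation error δ, so it scales linearly with how far the camera has travelled from the origin. In camera-to-world form the translation target is c itself, so the same rotation error leaves it untouched.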

Loss Function

A composite loss with three terms:

\[ L = \lambda_{\text{depth}} L_{\text{depth}} + \lambda_{\text{abs-pose}} L_{\text{abs-pose}} + \lambda_{\text{rel-pose}} L_{\text{rel-pose}} \]

Progressive Training

Training happens in two stages on 128 NVIDIA A100 GPUs (80GB):

Stage | Attention | Views | Data | GPU Hours | Duration
1. Base | Global (offline) | 2–24 | 29 datasets, mixed | 21,500 | ~7 days
2. Streaming | GCA | 24→320 | Long video focus | 15,360 | ~5 days

Key training hyperparameters: AdamW optimizer, learning rate 2e-4 with cosine warmup (2K steps), batch size 128 (Stage 1) / 64 (Stage 2), loss weights λdepth=1.0, λabs-pose=0.5, λrel-pose=0.1. The ViT backbone is initialized from DINOv2-L pretrained weights and fully fine-tuned (not frozen) — this is critical because the geometric features needed for pose estimation differ from DINOv2's self-supervised features.

Stage 1 builds general geometric priors with standard global attention. Stage 2 swaps in GCA and progressively increases the number of training views from 24 to 320, teaching the model to maintain consistency over longer and longer sequences.

Why Progressive?

Training directly on long sequences fails. In early training, the model's pose predictions are inaccurate. On a 320-frame sequence, a small rotation error at frame 10 compounds to a massive translation error at frame 300. The resulting loss gradients are noisy and unstable.

The progressive curriculum solves this: start with 24 frames (easy), then gradually extend. The model first learns reliable local geometry from short clips, then learns to maintain global consistency as the training horizon stretches. By the time it sees 320-frame sequences, its local predictions are already accurate enough to provide stable gradients.

Context Parallelism

At 320 views per training sample, GPU memory becomes the bottleneck due to the quadratic cost of cross-frame attention. LingBot-Map uses the Ulysses context-parallelism strategy with a parallelism dimension of 16: different views are distributed across GPUs, and attention is computed via efficient all-to-all collective communication. This allows training on sequences far longer than a single GPU's memory could support.

Weight transfer trick: GCA's query/key/value projections share the same parameterization as global attention. So the pretrained Stage 1 weights transfer directly to Stage 2 — no random initialization, no warmup instability.

Training Data

LingBot-Map trains on 29 datasets spanning indoor, outdoor, object-centric, synthetic, and real-world scenarios. The Stage 1 mix includes BlendedMVS, HyperSim, MegaDepth, TartanAir, ScanNet, and many more — roughly balanced sampling across all datasets.

In Stage 2, the distribution shifts heavily toward long-trajectory video datasets: TartanAir, MatrixCity, Waymo, KITTI-360, ScanNet++, and internal game data get upweighted, while multi-view-only datasets (no temporal structure) are down-weighted or dropped entirely.

Foldback Video Sampler

To produce temporally coherent training subsequences from long videos, Stage 2 uses a foldback video sampler: it starts at a random frame and advances with a random stride. Upon reaching a sequence boundary, it reverses direction and draws a new stride (different from the previous one to avoid degenerate oscillation). This yields subsequences with naturally varying frame rates and no forward-time bias.
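A sampler with this behavior might look like the following Python sketch. The stride range (1 to max_stride), the forward starting direction, and the clamping of out-of-range positions are assumptions filled in for the example; the lesson does not specify them.

```python
import random


def foldback_indices(video_len, num_samples, max_stride=4, rng=None):
    """Sample frame indices that bounce back and forth inside one video."""
    if video_len < 2 or max_stride < 2:
        raise ValueError("need at least 2 frames and max_stride >= 2")
    rng = rng or random.Random()
    pos = rng.randrange(video_len)        # random start frame
    stride = rng.randint(1, max_stride)   # random initial stride
    direction = 1
    indices = [pos]
    while len(indices) < num_samples:
        nxt = pos + direction * stride
        if nxt < 0 or nxt >= video_len:
            # Hit a boundary: reverse and draw a stride different from the last one.
            direction = -direction
            stride = rng.choice([s for s in range(1, max_stride + 1) if s != stride])
            nxt = min(max(pos + direction * stride, 0), video_len - 1)
        pos = nxt
        indices.append(pos)
    return indices


print(foldback_indices(video_len=50, num_samples=20, rng=random.Random(0)))
```

Because a clip can fold back several times, a 320-view training sample can be drawn from a video shorter than 320 frames, and the model sees both forward and reversed motion.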

Data Augmentation

Images are resized to max dimension 518px. Aggressive photometric augmentation is applied: random color jitter (brightness, contrast, saturation ±0.5; hue ±0.1) with probability 0.9, random grayscale with probability 0.05, and random spatial rescaling in [0.8x, 1.2x]. A co-jittering mode (probability 0.3) applies identical color transforms to all frames in a scene, encouraging the model to rely on geometric cues rather than appearance shortcuts.

Why does LingBot-Map supervise camera-to-world transformations instead of world-to-camera?

Chapter 7: Results

LingBot-Map is evaluated against three categories of baselines: offline feed-forward models (VGGT, DA3, Pi3), optimization-based methods (DroidSLAM, VIPE), and streaming methods (CUT3R, TTT3R, Wint3R, Stream3R, InfiniteVGGT).

Oxford Spires (Large-Scale Trajectory)

This is the hardest benchmark: complex indoor-outdoor transitions, revisits after long gaps, and large scale variation. In the sparse setting (320 frames):

Method | Type | ATE ↓ (m)
VGGT | offline | 24.78
DA3 | offline | 12.87
VIPE | optim | 10.52
CUT3R | online | 18.16
TTT3R | online | 19.35
Wint3R | online | 21.10
Stream3R | online | 29.58
LingBot-Map | online | 6.42

LingBot-Map achieves 6.42m ATE, about 39% lower than the best optimization-based method (VIPE at 10.52) and 2.8x lower than the best streaming competitor (CUT3R at 18.16). Despite being a streaming model, it outperforms every offline model too.

What ATE means concretely

Absolute Trajectory Error (ATE) measures the root-mean-square distance between predicted and ground-truth camera positions across all frames. An ATE of 6.42m on Oxford Spires (which spans a ~500m trajectory through buildings and courtyards) means the average position error is about 1.3% of the trajectory length. For a robot following this trajectory, that's the difference between "knows which room it's in" and "confused about which floor it's on" (at 24.78m ATE, VGGT's error).
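The metric itself is a one-line formula. A minimal Python sketch, assuming the predicted and ground-truth trajectories are already expressed in the same coordinate frame (benchmarks usually align them first, which is omitted here); the function name and toy trajectories are illustrative.

```python
import math


def ate_rmse(pred_positions, gt_positions):
    """Root-mean-square distance between matched camera positions."""
    if len(pred_positions) != len(gt_positions) or not gt_positions:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [math.dist(p, g) ** 2 for p, g in zip(pred_positions, gt_positions)]
    return math.sqrt(sum(squared) / len(squared))


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 3.0, 0.0), (2.0, 0.0, 4.0)]  # errors of 0, 3, 4
print(round(ate_rmse(pred, gt), 4))  # 2.8868

# ATE as a fraction of the ~500 m trajectory length quoted above.
print(round(100 * 6.42 / 500, 1))  # 1.3
```

Because the errors are squared before averaging, a few badly misplaced frames raise the ATE more than many small errors do.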

Why do offline methods struggle here? They're trained on datasets where consecutive frames are close together and observe the same local region. Oxford Spires has complex scene transitions — outdoor courtyards to dark staircases — and large viewpoint changes that break their learned priors. LingBot-Map, trained progressively on long trajectories, handles these transitions naturally.

On pose accuracy (AUC@15), LingBot-Map achieves 61.64, more than doubling VGGT (23.84) and exceeding even DA3 (49.84). Among streaming methods, the best competitor manages only 13.92 (TTT3R). The gap is dramatic.

Long-sequence stability: When the sequence increases from 320 to 3,840 frames, CUT3R's ATE rises from 18.16 to 32.47 (a 79% degradation). LingBot-Map's goes from 6.42 to 7.11 — just a 10.7% increase over a 12x longer sequence. The three-level context structure genuinely prevents drift.

Cross-Benchmark Results

Dataset | Best Competitor ATE | LingBot-Map ATE
ETH3D | 0.86 (Wint3R) | 0.22
7-Scenes | 0.10 (TTT3R / Stream3R) | 0.08
Tanks & Temples | 0.47 (CUT3R) | 0.19

Consistent improvements across all benchmarks, not just Oxford Spires. On ETH3D, LingBot-Map achieves nearly 4x lower trajectory error than the next best streaming method.

3D Reconstruction Quality

Pose accuracy alone doesn't tell the full story. LingBot-Map also produces dense depth maps per frame, which combine with the estimated poses to produce 3D point clouds. On ETH3D and 7-Scenes, the reconstruction quality (measured by F1 score) consistently exceeds all streaming competitors.

Ablation: What Matters Most?

The paper ablates each component of GCA on Oxford Spires (dense, 3840 frames):

Configuration | ATE | Note
Full GCA | 7.11 | All three context types
Remove trajectory memory | 12.4 | No long-range drift correction
Remove anchor context | 15.8 | No coordinate grounding
Remove relative pose loss | 9.2 | Less local consistency

Both the anchor context and trajectory memory are critical. Without anchors, the model loses its coordinate reference — scale and position drift freely. Without trajectory memory, pose errors accumulate unchecked over long sequences, because the model has no record of where it has been beyond the local window.

The relative pose loss also contributes meaningfully (ATE 9.2 without it vs. 7.11 with it), showing that encouraging local geometric consistency within the window improves the overall trajectory quality.

ATE Comparison (Oxford Spires, Sparse)

Absolute Trajectory Error in meters. Lower is better. LingBot-Map (rightmost) vs. competing streaming and offline methods.

What happens to CUT3R's trajectory error when the sequence grows from 320 to 3,840 frames?

Chapter 8: Efficiency

Accuracy is only half the story for a streaming system. The other half is: can it keep up with the camera?

Speed

A streaming system that takes 5 seconds per frame isn't streaming — it's batch processing with extra steps. LingBot-Map needs to process frames as fast as the camera produces them.

At 518 × 378 resolution with a sliding window of k=64 frames on a single NVIDIA A100, LingBot-Map runs at ~20 FPS using FlashInfer-based paged KV-cache management. That's comfortably real-time for most video feeds (typically 30 FPS, but every other frame is often sufficient).

Per-frame latency breakdown: ViT backbone forward pass: ~18ms. Frame Attention (12 layers): ~8ms. GCA cross-frame attention (12 layers): ~16ms. Prediction heads (camera + DPT): ~7ms. Total: ~49ms = 20.4 FPS. The ViT backbone and GCA are the two largest costs, and GCA is the only one that depends on sequence length, because it attends over all context tokens (anchors + window + trajectory). As T grows, GCA latency increases by ~0.3ms per 1,000 additional trajectory frames, which is small at any practical sequence length.
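The latency budget can be turned into a quick frame-rate estimate. This Python sketch simply adds up the stage timings quoted above and extrapolates the GCA growth linearly; treating the 49 ms total as the cost with an empty trajectory memory is an assumption, and the names are illustrative.

```python
STAGE_MS = {"vit_backbone": 18.0, "frame_attention": 8.0, "gca": 16.0, "heads": 7.0}
GCA_GROWTH_MS_PER_1K_FRAMES = 0.3


def frame_latency_ms(trajectory_frames=0):
    """Per-frame latency: fixed stage costs plus linear growth of GCA."""
    growth = GCA_GROWTH_MS_PER_1K_FRAMES * trajectory_frames / 1000.0
    return sum(STAGE_MS.values()) + growth


def fps(trajectory_frames=0):
    return 1000.0 / frame_latency_ms(trajectory_frames)


print(round(fps(), 1))        # 20.4
print(round(fps(10_000), 1))  # 19.2
```

Under this linear model, 10,000 trajectory frames add about 3 ms per frame, costing roughly one frame per second of throughput.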

For comparison:

Method | FPS | Note
Stream3R-w | 3.88 | Slow from dense caching
LingBot-Map | 20.29 | Paged KV-cache + GCA
CUT3R | 29.21 | Fast but drifts
InfiniteVGGT | 28.97 | Fast but drifts

CUT3R and InfiniteVGGT are faster (both ~29 FPS), but their accuracy degrades badly on long sequences: CUT3R's ATE nearly doubles from 320 to 3,840 frames. Stream3R-w, the most accurate competitor, runs at only 3.88 FPS. LingBot-Map sits at the sweet spot: fast enough for real-time use, accurate enough for thousands of frames. Speed and accuracy are not traded off; GCA achieves both simultaneously by being surgical about what to keep and what to discard.

Constant Memory Per Frame

The key efficiency insight: once a frame leaves the sliding window, it contributes only 6 tokens to the KV cache — regardless of image resolution. This means memory grows at just 6 tokens per frame, compared to ~506 for causal attention.

At T=10,000 frames: GCA uses ~70K tokens. Causal attention would need ~5M tokens. That's the difference between "runs on a single GPU" and "doesn't fit."

Paged KV-Cache

The sliding window and trajectory eviction require frequent cache updates — appending new entries and discarding old image tokens. With a standard contiguous memory layout, this means repeated reallocation. LingBot-Map uses a paged KV-cache (via FlashInfer) where updates affect only newly appended tokens, eliminating reallocation overhead. This alone provides a ~2x speedup over a naive PyTorch implementation (20 vs 10.5 FPS).

Memory budget at inference

On a single NVIDIA A100 (80GB), the model weights occupy ~1.2GB (ViT-L + heads). The remaining memory goes to the KV cache. At k=64 (window), n=3 (anchors):

Compare to causal attention at T=10,000: 510 x 10,000 = 5.1M tokens → ~360 GB. That's 4.5 A100s just for the KV cache. GCA makes the difference between "runs on one GPU" and "physically impossible."
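For the GCA side of the comparison, a rough budget can be derived from figures already given in this lesson. The Python sketch below assumes 72 KB per token (key + value, float16, 24 layers, C = 768), 510 tokens per full frame, and 6 tokens per trajectory frame. It is an estimate under those assumptions, not a measurement of the real system.

```python
BYTES_PER_TOKEN = 2 * 768 * 2 * 24  # key + value, float16, 24 layers


def tokens_to_gb(tokens):
    return tokens * BYTES_PER_TOKEN / 1e9


def gca_kv_budget_gb(T, n=3, k=64, tokens_per_frame=510, ctx=6):
    """Approximate KV cache size in decimal GB, split by context type."""
    return {
        "anchors": tokens_to_gb(n * tokens_per_frame),
        "window": tokens_to_gb(k * tokens_per_frame),
        "trajectory": tokens_to_gb(max(T - n - k, 0) * ctx),
    }


budget = gca_kv_budget_gb(10_000)
for name, gb in budget.items():
    print(f"{name:10s} {gb:.2f} GB")
print(f"{'total':10s} {sum(budget.values()):.2f} GB")
```

Under these assumptions the total at T=10,000 comes to roughly 7 GB, with the trajectory memory (about 4.4 GB) overtaking the 64-frame window (about 2.4 GB) only late in the sequence.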

Keyframe Selection for Ultra-Long Sequences

For sequences exceeding the training length (~320 views), LingBot-Map uses adaptive keyframe selection: it computes optical flow between the current frame and the last keyframe using the predicted pose and depth. If the flow exceeds a threshold, the frame becomes a keyframe. Otherwise it's discarded. This extends the effective range to ~3,000 frames in Direct mode.
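One way to compute flow from pose and depth is to back-project the keyframe's pixels with its depth map, move them into the current camera, and measure how far they land from where they started. The sketch below does that with NumPy. The function names, the use of the keyframe's depth (rather than the current frame's), the mean-magnitude statistic, and the 20-pixel threshold are all illustrative assumptions; the source does not specify them.

```python
import numpy as np

def induced_flow_magnitude(depth, K, T_key_to_cur):
    """Mean pixel displacement of keyframe pixels reprojected into the current frame.

    depth: (H, W) keyframe depth map.
    K: (3, 3) pinhole intrinsics.
    T_key_to_cur: (4, 4) rigid transform from keyframe to current camera coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64), np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)

    # Back-project to 3D points in the keyframe camera.
    rays = pix @ np.linalg.inv(K).T
    pts_key = rays * depth.reshape(-1, 1)

    # Move the points into the current camera and project them.
    R, t = T_key_to_cur[:3, :3], T_key_to_cur[:3, 3]
    pts_cur = pts_key @ R.T + t
    valid = pts_cur[:, 2] > 1e-6          # keep points in front of the camera
    if not valid.any():
        return float("inf")
    proj = pts_cur[valid] @ K.T
    uv_cur = proj[:, :2] / proj[:, 2:3]

    flow = uv_cur - pix[valid, :2]
    return float(np.linalg.norm(flow, axis=1).mean())

def is_keyframe(depth, K, T_key_to_cur, threshold_px=20.0):
    """Promote the current frame to a keyframe when the induced flow is large enough."""
    return induced_flow_magnitude(depth, K, T_key_to_cur) > threshold_px
```

For a fronto-parallel scene at depth d and a sideways translation of b, every pixel moves by f·b/d, so a camera with f = 100 px moving 0.5 m past a wall 2 m away produces 25 px of flow and would trigger a new keyframe at this threshold.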

Two Inference Modes

LingBot-Map provides two modes depending on sequence length:

Mode | Range | How it works | Trade-off
Direct | Up to ~3K frames | Continuous GCA with full three-level context, no resets | No alignment error; accuracy degrades beyond training length
VO | 10K+ frames | Overlapping local windows with Sim(3) alignment at boundaries | Bounded memory for any length; small drift at window boundaries

The Direct mode is preferred when the sequence fits within ~10x the training length. It produces more accurate trajectories because there are no inter-window alignment errors. For city-scale sequences or hour-long videos, VO mode scales indefinitely.
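The Sim(3) alignment step in VO mode can be sketched with the classical Umeyama closed-form solution, which finds the scale, rotation, and translation that best map one point set onto another. The source does not say which estimator LingBot-Map uses or which points it aligns; applying Umeyama to corresponding 3D points (for example, camera centers of the frames shared by two overlapping windows) is our assumption.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares similarity transform with dst approximately s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points, N >= 3 and not collinear.
    Returns (s, R, t): scalar scale, (3, 3) rotation, (3,) translation.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0

    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

def apply_sim3(points, s, R, t):
    """Map (N, 3) points through the similarity transform."""
    return s * points @ R.T + t
```

Each new window would then be mapped into the frame of the previous one with `apply_sim3`. Any residual error in the estimate is the "small drift at window boundaries" listed in the table.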

Verified: In Direct mode, LingBot-Map maintains stable accuracy for approximately 10x the training sequence length (~3,000 frames). In VO mode, it handles 10,000+ frames with bounded memory, trading a small amount of alignment drift at window boundaries.
How does the paged KV-cache improve LingBot-Map's inference speed?

Chapter 9: Connections

LingBot-Map sits at the intersection of learned 3D reconstruction and classical SLAM. Let's map the connections.

Relation to VGGT

VGGT is the offline predecessor. It uses bidirectional cross-view attention to process all frames simultaneously — powerful but not streamable. LingBot-Map inherits VGGT's alternating frame/cross-frame attention design but replaces the cross-frame component with GCA, adding the three-level context structure that enables streaming. The ViT backbone weights transfer directly.

Relation to DUSt3R

DUSt3R pioneered feed-forward 3D reconstruction from unposed images, but only for two views. VGGT extended it to multiple views. LingBot-Map extends it to a stream of views of potentially unbounded length by solving the context management problem that neither DUSt3R nor VGGT addressed.

Relation to Classical SLAM

LingBot-Map explicitly borrows SLAM's three-level context decomposition (reference frame, local window, global map) but replaces every hand-crafted component with learned attention. No keypoint extraction, no feature matching, no bundle adjustment, no pose-graph optimization. Just a transformer with a structured mask.
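As a sketch of what "a structured mask" could look like, the function below marks which cached tokens the current frame may attend to under the three context roles. It is illustrative only: the name `gca_token_mask`, the frame-major token layout, the window counting the current frame, and the choice of the first 6 tokens of each frame as its trajectory tokens are assumptions. The real system also evicts dropped tokens from the cache rather than masking them.

```python
import numpy as np

def gca_token_mask(t, n=3, k=16, M=506, ctx=6):
    """Boolean (t + 1, M) mask of tokens visible to frame t.

    Row f covers the M tokens of frame f. Anchors (first n frames) and the
    window (last k frames, including frame t) expose all tokens; every other
    past frame exposes only its first `ctx` tokens.
    """
    mask = np.zeros((t + 1, M), dtype=bool)
    for f in range(t + 1):
        is_anchor = f < n
        in_window = f > t - k
        if is_anchor or in_window:
            mask[f, :] = True      # full image tokens
        else:
            mask[f, :ctx] = True   # compact trajectory tokens only
    return mask

mask = gca_token_mask(t=99)
visible_tokens = int(mask.sum())
```

At frame 99 with the defaults, 19 frames are fully visible and the other 81 contribute 6 tokens each, which is the role-based (not recency-based) decomposition borrowed from SLAM.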

Relation to CUT3R / TTT3R

CUT3R uses RNN-style recurrent compression — constant memory but state forgetting. After a few hundred frames, the recurrent state has been overwritten so many times that the model loses track of geometry it observed early on. TTT3R tries to fix this with test-time training: it fine-tunes model weights on each new frame. This helps with forgetting but adds significant computational overhead — you're running backpropagation during inference.

LingBot-Map's trajectory memory achieves the same goal (compact global context) without recurrence or test-time training: it simply keeps 6 tokens per past frame and lets attention do the integration. No gradient updates at test time, no state compression bottleneck.

Relation to Stream3R / StreamVGGT / Wint3R

These methods adapt VGGT to streaming by using causal attention with KV-caching. They keep all past tokens, which gives good short-sequence accuracy but causes memory and latency to grow linearly. On the Oxford Spires benchmark, their ATE degrades significantly as sequences get longer. LingBot-Map's structured eviction (keeping only 6 context tokens per past frame) addresses exactly this bottleneck.

Relation to VGGSfM

VGGSfM targets Structure-from-Motion (offline, unordered images). LingBot-Map targets streaming video. They share the vision of replacing traditional pipelines with end-to-end transformers, but solve different problems. An interesting open question: could GCA's context structure be adapted for unordered image sets (SfM) by treating loop closures as a form of trajectory memory?

Relation to LoGeR / Scal3R / ZipMap

These concurrent works also tackle long-sequence 3D reconstruction, but rely on test-time training (TTT) for global consistency. LoGeR combines sliding window attention with TTT for global alignment. Scal3R extends TTT with visual place recognition for city-scale scenes. ZipMap uses TTT layers to compress an entire image collection into a compact hidden state.

The key difference: all three require gradient updates at inference time. LingBot-Map is purely feed-forward — no parameter updates during inference. This makes it faster and simpler to deploy, at the cost of relying entirely on the attention mechanism (rather than test-time optimization) for global consistency.

Cheat Sheet

Aspect | LingBot-Map
Input | Streaming video frames (one at a time)
Output | Camera pose + depth map per frame
Backbone | ViT (DINOv2, patch size 14)
Attention | Frame Attn + GCA (anchor / window / trajectory), 24 layers
Anchor | First n=3 frames, full tokens, permanent
Window | Last k=16–64 frames, full tokens
Trajectory | All other past frames, 6 tokens each
Per-frame cost | (n+k)·M + 6T tokens (nearly constant)
Speed | ~20 FPS at 518×378
Max length | ~3K (direct), 10K+ (VO mode)
Key result | 6.42 ATE on Oxford Spires (2.8x better than CUT3R)

Open Questions

Several interesting directions remain:

The bigger picture: LingBot-Map shows that the wisdom of classical SLAM — decompose context by role, not by recency — transfers beautifully to the transformer era. You don't need to throw away 30 years of SLAM intuition to build a neural 3D system. You just need to express that intuition as an attention mask and let the data fill in the details.
What does LingBot-Map borrow from classical SLAM, and what does it replace?