Three types of learned context — anchor frames, a sliding window, and compressed trajectory memory — give a feed-forward model the spatial awareness of SLAM, at 20 FPS over 10,000+ frames.
You have a camera on a robot, a drone, or a pair of AR glasses. It streams video at 30 FPS as you walk through a building. You want to reconstruct the 3D scene in real time — camera poses, depth maps, point clouds — as each new frame arrives.
This is streaming 3D reconstruction: process frames one at a time, causally (no peeking at the future), and keep going for thousands or tens of thousands of frames.
The requirements are demanding. You need geometric accuracy (the 3D positions must be correct), temporal consistency (the reconstruction shouldn't jump around between frames), and computational efficiency (you need to keep up with the camera, ideally at 20+ FPS). Getting all three simultaneously is the hard part.
The catch? Every approach so far breaks down in one of three ways:
There's also a fourth category: hybrid SLAM-based methods like VGGT-SLAM and MASt3R-SLAM. These integrate learned 3D models with classical SLAM backends (keyframe selection, pose-graph optimization). They're more robust than pure learned methods, but the classical components require hand-crafted heuristics and iterative optimization, both of which limit real-time applicability.
Let's make this precise. DINOv2-ViT with patch size 14 on a 518x378 image produces M = 504 image tokens, each of dimension C = 768. With 6 context tokens per frame, that's 510 tokens per frame. Each token in the KV cache occupies 2 x C x 2 bytes per layer (key + value, 2 bytes per element in float16), which is 3,072 bytes, times 24 layers. That's ~72 KB per token across the full layer stack.
For a 10,000-frame sequence under causal attention: 510 x 10,000 = 5.1M tokens. At ~72 KB per token, the KV cache alone is ~360 GB, far more than any single GPU holds. Even at 1,000 frames, the cache is already ~36 GB, close to half of an 80 GB A100 before weights and activations are counted. This isn't a theoretical concern; it's a hard engineering wall.
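The per-token arithmetic can be reproduced in a few lines. This is a minimal sketch rather than code from LingBot-Map: the function names are made up for illustration, and the defaults simply restate the figures quoted in the text (C = 768, 24 layers, float16, 510 tokens per frame).

```python
def kv_bytes_per_token(channels=768, layers=24, bytes_per_value=2):
    """Bytes one token occupies in the KV cache, summed over all layers."""
    # Each layer stores one key vector and one value vector per token.
    return 2 * channels * bytes_per_value * layers


def causal_kv_cache_bytes(num_frames, tokens_per_frame=510):
    """KV-cache size when every token of every past frame is retained."""
    return num_frames * tokens_per_frame * kv_bytes_per_token()


if __name__ == "__main__":
    print(kv_bytes_per_token() / 1024, "KiB per token")
    for frames in (1_000, 10_000):
        gib = causal_kv_cache_bytes(frames) / 2**30
        print(f"{frames:>6} frames: {gib:.1f} GiB")
```

Changing `layers` or `channels` shows how sensitive the total is to which layers actually cache cross-frame keys and values, a detail the figures above take as given.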
Look at what happens to memory as a sequence gets longer. Causal attention retains every token from every frame — the cost grows without bound. A sliding window bounds memory but forgets the distant past entirely. LingBot-Map's insight is that you can have both: bounded memory and long-range context.
Interactive figure: total token count versus sequence length under each strategy, with M = 500 image tokens per frame, 6 context tokens per frame, n = 3 anchor frames, and a window of k = 16.
Classical SLAM systems — the hand-engineered pipelines that have powered robotics for decades — already solved this memory problem, just not with neural networks. They maintain three distinct types of spatial context:
The genius of SLAM is this decomposition: not all context is equally important. Recent frames need full detail (for local matching). Distant frames need only a skeleton (for global consistency). And a few initial frames need to be permanent anchors (for coordinate grounding).
The key word is selectively. Human spatial memory doesn't record every visual frame. It preserves landmarks (where am I relative to the entrance?), recent context (what did I just walk past?), and a rough trajectory history (I turned left, then right, then went upstairs). LingBot-Map mirrors this structure exactly.
This motivates Geometric Context Attention (GCA), which decomposes the streaming state into:
The crucial difference from classical SLAM: all three context types are maintained within a single attention mechanism. There's no separate keypoint extractor, no separate feature matcher, no separate bundle adjustment module. Just a transformer with a structured attention mask. The mask enforces which frames attend to which; the learned attention weights decide what information to extract.
The rest of this lesson unpacks each component and shows how they compose into a single attention mask.
Monocular 3D reconstruction is inherently scale-ambiguous. A video of a dollhouse and a video of a real house can produce identical image sequences. Without some reference, the model has no way to pin down absolute scale or establish a consistent coordinate system.
Offline methods like VGGT handle this by normalizing against the entire point cloud after processing all frames. But in streaming, you don't have the full point cloud yet.
LingBot-Map designates the first n frames (typically n=3) as anchor frames. These frames are processed with full bidirectional attention among themselves at the start of the sequence, establishing a coordinate origin and absolute scale. Their tokens are then frozen in the KV cache and never evicted.
Every subsequent frame attends to these anchor tokens, ensuring that all predictions are grounded in the same coordinate system.
To help the network distinguish anchor frames from streaming frames, each anchor frame is augmented with a learnable anchor token a ∈ ℝ^C (C = 768). This is a special token, like a CLS token, that gets added to the frame's token set. It tells the attention layers: "this frame is an anchor; treat it differently."
The anchor frames are permanent — they're never evicted. So what happens if the camera starts pointing at a textureless wall, or is heavily motion-blurred? The model has no explicit failure detection for bad anchors. In practice, the progressive training (starting with short sequences of 2-24 frames) teaches the model to extract whatever geometric signal exists from any set of frames. On the Oxford Spires benchmark, even sequences that start with visually uninformative frames (dark stairwells) maintain stable trajectories — because the anchor establishes scale and origin from relative geometry between the 3 frames, not from any single frame's visual content.
However, if the first 3 frames have zero baseline (e.g., camera is completely static), scale becomes genuinely unrecoverable. The training data includes such cases with near-zero-baseline starts, so the model learns to output conservative depth predictions until the baseline grows.
During training, all ground-truth depths and camera translations are normalized by the mean distance of the anchor point cloud from the origin:
This canonical scale, derived from just the first n frames, gives the model a consistent reference regardless of scene size.
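As a rough illustration of this normalization, the sketch below computes the scale as the mean Euclidean distance of the anchor points from the origin and divides depths and translations by it. It assumes the points are already expressed in the coordinate frame that serves as origin (for example the first anchor camera). The function names are hypothetical and not taken from the LingBot-Map code.

```python
import math


def anchor_scale(anchor_points):
    """Mean Euclidean distance of the anchor point cloud from the origin."""
    if not anchor_points:
        raise ValueError("anchor point cloud is empty")
    total = sum(math.hypot(x, y, z) for x, y, z in anchor_points)
    return total / len(anchor_points)


def normalize_targets(depths, translations, anchor_points):
    """Divide ground-truth depths and camera translations by the anchor scale."""
    s = anchor_scale(anchor_points)
    norm_depths = [d / s for d in depths]
    norm_translations = [tuple(c / s for c in t) for t in translations]
    return norm_depths, norm_translations, s


if __name__ == "__main__":
    points = [(3.0, 4.0, 0.0), (0.0, 0.0, 5.0), (0.0, 5.0, 0.0)]
    print(normalize_targets([10.0, 5.0], [(5.0, 0.0, 0.0)], points))
```

Because the scale depends only on the anchor frames, it is available as soon as those frames are processed, which is what makes it usable in a streaming setting.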
A single frame gives no stereo baseline: you can't triangulate depth from one view. Two frames give a baseline but a fragile one. Three frames provide enough geometric diversity to robustly establish both the coordinate system and the scale, while keeping the permanent token overhead small (3 × 510 = 1,530 tokens, forever in the cache). At ~72 KB per token across all layers, that's ~110 MB of permanent KV cache, modest compared to the ~600 MB of a k=16 window and the trajectory's growing contribution.
The anchor frames fix the coordinate system, but they're far away from where the camera is now. If you're on frame 5,000 of a building walkthrough, the first 3 frames show a completely different part of the scene. They can't help you match the current view to its neighbors.
To accurately register each new frame, you need dense visual overlap with nearby observations. Think about it from the model's perspective: to figure out where the camera is now, it needs to see how the current view overlaps with views it has already positioned. The more overlap, the more constraints, the better the pose estimate.
This is exactly what feature matching does in classical SLAM: you find corresponding pixels between adjacent frames and triangulate their 3D positions. LingBot-Map does the same thing implicitly through attention — the cross-frame attention between the current frame and window frames effectively performs learned feature matching.
LingBot-Map maintains a sliding window of the k most recent frames (typically k=16 to 64), retaining their full image tokens (all M tokens per frame). The current frame attends to these tokens with full cross-frame attention, giving it rich local geometry for pose estimation.
Each frame produces M ≈ 500 image tokens from the ViT backbone. These tokens encode fine-grained spatial features — edges, textures, corners — that are essential for pixel-level matching. You can't compress these without losing the detail needed for accurate relative pose estimation.
To further encourage geometric consistency within the window, LingBot-Map applies a relative pose loss between all frame pairs in the sliding window:
This is computed over geodesic rotation error and L1 translation error for every pair of frames within the window. Because the window contains only already-observed frames, this loss is inherently causal.
Consider frames 985-1000 in the sliding window. Each frame's pose is predicted independently by the camera head (a single MLP operating on one token). Without the relative pose loss, two adjacent frames might each have small absolute pose errors that happen to disagree with each other — frame 999 thinks it moved 10cm left, frame 1000 thinks it moved 10cm right. The relative loss directly penalizes this disagreement by computing all pairwise geodesic and translation errors within the window. This creates a soft constraint that the window's poses form a geometrically consistent local trajectory — even if the absolute poses have small global drift.
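To make the pairwise penalty concrete, here is a small pure-Python sketch of a relative pose loss over a window. Several choices are assumptions made for illustration, not details confirmed by the text: poses are camera-to-world `(R, t)` pairs, the loss averages over unordered pairs, and rotation and translation terms are combined with a hypothetical weight `w_trans`.

```python
import math
from itertools import combinations


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]


def relative_pose(pose_i, pose_j):
    """Pose of frame j expressed in frame i's coordinates."""
    (R_i, t_i), (R_j, t_j) = pose_i, pose_j
    R_i_T = transpose(R_i)
    diff = [t_j[a] - t_i[a] for a in range(3)]
    t_rel = [sum(R_i_T[r][c] * diff[c] for c in range(3)) for r in range(3)]
    return matmul(R_i_T, R_j), t_rel


def geodesic_angle(R_a, R_b):
    """Rotation angle (radians) of R_a^T R_b, clamped for numerical safety."""
    R = matmul(transpose(R_a), R_b)
    cos_theta = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def relative_pose_loss(pred, gt, w_trans=1.0):
    """Mean geodesic rotation error plus weighted L1 translation error over pairs."""
    pairs = list(combinations(range(len(pred)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        R_p, t_p = relative_pose(pred[i], pred[j])
        R_g, t_g = relative_pose(gt[i], gt[j])
        l1 = sum(abs(a - b) for a, b in zip(t_p, t_g))
        total += geodesic_angle(R_p, R_g) + w_trans * l1
    return total / len(pairs)


if __name__ == "__main__":
    I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    gt = [(I, [0.0, 0.0, 0.0]), (I, [1.0, 0.0, 0.0])]
    pred = [(I, [0.0, 0.0, 0.0]), (Rz90, [1.0, 0.0, 0.0])]
    print(relative_pose_loss(pred, gt))  # one pair, 90 degree rotation error
```

Because only relative poses enter the loss, a rigid transform applied to every predicted pose in the window leaves it unchanged, which is why it constrains local consistency rather than global drift.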
You have anchors (first n frames, full tokens) and a window (last k frames, full tokens). But what about the hundreds or thousands of frames in between? Without any record of these intermediate frames, pose errors accumulate unchecked. The estimated trajectory drifts.
Causal attention's answer: keep everything. But that's O(T · M) tokens, growing without bound.
LingBot-Map's answer: keep a skeleton.
When a frame exits the sliding window (it's no longer among the k most recent), LingBot-Map evicts its image tokens (all M of them) and retains only:
That's 6 tokens per frame instead of M + 6 ≈ 510: an ~85x reduction in per-frame storage.
This is a deliberate engineering decision. A single token (just the camera token) would encode only the pose — but the model also needs to know how this frame relates to the anchors and what geometric context it captured. The anchor token encodes the frame's relationship to the coordinate origin. The 4 register tokens are learnable "slots" that the attention mechanism fills with whatever compressed information the model finds most useful for drift correction — think of them as a tiny bottleneck autoencoder of the frame's geometric contribution.
Why not more? The paper ablated with 8 and 12 tokens: diminishing returns beyond 6, while memory grows proportionally. At T=10,000 frames, going from 6 to 12 tokens doubles trajectory memory from 60K to 120K tokens — a meaningful cost for negligible accuracy gain (<0.1m ATE improvement on Oxford Spires).
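The eviction rule can be sketched as a small data structure. This toy cache stores plain Python lists in place of key/value tensors and assumes the context tokens sit at the front of each frame's token list. The class and attribute names are hypothetical.

```python
from collections import deque

NUM_CONTEXT_TOKENS = 6  # camera + anchor + 4 registers, stored first in each frame


class GeometricContextCache:
    """Toy token store: anchors and window keep full frames, the rest keep 6 tokens."""

    def __init__(self, num_anchors=3, window_size=16):
        self.num_anchors = num_anchors
        self.window_size = window_size
        self.anchors = []      # full token lists, never evicted
        self.window = deque()  # full token lists for the most recent frames
        self.trajectory = []   # context-token prefixes of evicted frames

    def add_frame(self, tokens):
        if len(self.anchors) < self.num_anchors:
            self.anchors.append(list(tokens))
            return
        self.window.append(list(tokens))
        if len(self.window) > self.window_size:
            evicted = self.window.popleft()
            # Drop the image tokens, keep only the leading context tokens.
            self.trajectory.append(evicted[:NUM_CONTEXT_TOKENS])

    def num_tokens(self):
        full = sum(len(f) for f in self.anchors) + sum(len(f) for f in self.window)
        return full + sum(len(f) for f in self.trajectory)


if __name__ == "__main__":
    cache = GeometricContextCache()
    for frame_idx in range(1000):
        cache.add_frame([(frame_idx, j) for j in range(506)])  # 6 context + 500 image
    print(len(cache.trajectory), cache.num_tokens())
```

With 500 image tokens per frame, feeding 1,000 frames leaves 981 compressed entries and 15,500 tokens in total, matching the (n+k)·M + 6T count used elsewhere in this lesson.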
Without timestamps, the trajectory memory would be an unordered bag of 6-token summaries. The model would know that the camera visited certain places, but not when or in what order. That's a problem: the trajectory structure matters for drift correction (nearby-in-time frames should have nearby poses).
To fix this, the trajectory memory tokens are augmented with video temporal positional encodings. These tell the attention layers when each past observation occurred, enabling the model to reason about the trajectory's temporal structure — "frame 500 was 200 steps ago, frame 900 was 50 steps ago."
For a T-frame sequence with n=3 anchors and k=16 window frames, the total context under GCA is:
At T=10,000 frames: GCA uses ~70K tokens. Causal uses ~5M tokens. That's a 71x reduction.
Suppose you're on frame T=1,000 with n=3 anchors and k=16 window. The attention context for the current frame consists of:
Total: 15,500 tokens. Under causal attention, the same frame would need 1,000 × 506 = 506,000 tokens. That's a 33x reduction, and it only gets better with longer sequences.
At T=10,000: GCA needs ~70K tokens. Causal needs ~5M. That's a 71x reduction. The memory savings are enormous, and they come at minimal information cost because the evicted image tokens from distant frames wouldn't have been useful for pixel matching anyway — those views typically have zero visual overlap with the current frame.
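These token counts are simple enough to check directly. The sketch below uses M = 500 and 6 context tokens per frame as in the worked example. The function names are illustrative, and the GCA formula is meant for sequences longer than n + k frames.

```python
def gca_tokens(T, n=3, k=16, M=500, ctx=6):
    """Context size under GCA: full tokens for anchors and window, ctx tokens per frame.

    Equals (n + k) * (M + ctx) + ctx * (T - n - k), valid for T >= n + k.
    """
    return (n + k) * M + ctx * T


def causal_tokens(T, M=500, ctx=6):
    """Context size when every token of every frame is retained."""
    return T * (M + ctx)


def sliding_window_tokens(T, k=16, M=500, ctx=6):
    """Context size for a pure sliding window: bounded, but no long-range record."""
    return min(T, k) * (M + ctx)


if __name__ == "__main__":
    for T in (1_000, 10_000):
        g, c = gca_tokens(T), causal_tokens(T)
        print(f"T={T}: GCA={g}, causal={c}, ratio={c / g:.1f}x, "
              f"window={sliding_window_tokens(T)}")
```

The ratio keeps improving with T because the causal count grows by M + 6 tokens per frame while the GCA count grows by only 6.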
Interactive figure: the context layout as the sequence length varies, showing anchor frames (full tokens), window frames (full tokens), and trajectory frames (compressed to 6 tokens each), with a live token count.
The three context types — anchor, window, trajectory — are unified through a single structured attention mask. This mask determines, for every query token, which key tokens it can attend to.
Each frame's tokens consist of a small leading segment of context tokens (camera + anchor + registers = 6) and a larger segment of image tokens (M ≈ 500). The mask specifies:
Let's compare this to the alternatives visually:
Four attention strategies for a 12-frame sequence. Each cell represents whether query frame (row) can attend to key frame (column). Blue = full tokens accessible. Purple = context tokens only. Gray = no attention.
A pure sliding window (bottom-left panel) bounds memory beautifully — but look at what it loses. Frame 10 has zero information about what happened at frame 1. If the camera revisits a location after a long loop, the sliding window has forgotten that location entirely. There's no mechanism for drift correction.
GCA's trajectory memory (the purple cells) fills this gap. Even though each past frame contributes only 6 tokens, those tokens encode where the camera was and how it related to the anchors. The attention mechanism can use this sparse global record to detect and correct drift — something a pure sliding window cannot do.
Causal attention (top-right panel) retains everything — no information loss at all. But at T=10,000 frames, every new frame must attend to ~5 million tokens. That's both a memory problem (storing the KV cache) and a compute problem (the attention operation itself). GCA achieves similar global awareness with ~70K tokens, by recognizing that you don't need pixel-level detail from frame 50 when you're on frame 10,000.
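A frame-level version of the GCA mask can be written down in a few lines. This is a sketch under stated assumptions: the window is taken to be the current frame plus the k - 1 frames before it, anchors attend to each other bidirectionally, and the labels `full`, `ctx`, and `none` mirror the blue, purple, and gray cells of the figure. The function name is hypothetical.

```python
FULL, CTX, NONE = "full", "ctx", "none"


def frame_level_mask(num_frames, n=3, k=4):
    """mask[q][j] says what query frame q may read from key frame j."""
    mask = []
    for q in range(num_frames):
        row = []
        for j in range(num_frames):
            if q < n and j < n:
                row.append(FULL)   # anchors see each other in both directions
            elif j > q:
                row.append(NONE)   # no access to future frames
            elif j < n:
                row.append(FULL)   # anchor frames stay fully visible
            elif j > q - k:
                row.append(FULL)   # inside the sliding window
            else:
                row.append(CTX)    # trajectory memory: context tokens only
        mask.append(row)
    return mask


if __name__ == "__main__":
    symbols = {FULL: "F", CTX: "c", NONE: "."}
    for row in frame_level_mask(12):
        print(" ".join(symbols[cell] for cell in row))
```

A token-level mask would expand each `full` cell to all M + 6 tokens of the key frame and each `ctx` cell to only its 6 leading context tokens.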
Each new frame must attend to:
The total is (n+k)·M + 6T. The first term is constant (~9,500 for n=3, k=16, M=500). The second grows at just 6 tokens per frame — versus M+6 ≈ 506 for causal attention. That's an 84x reduction in the growth rate.
With GCA defined, let's zoom out to the full pipeline.
Each input image is encoded by a Vision Transformer backbone initialized from DINOv2, with a patch size of 14 pixels. This produces M image tokens per frame — rich visual features that encode edges, textures, and semantic content.
Each frame's M image tokens are augmented with:
Total: M + 6 ≈ 510 tokens per frame. The token ordering within each frame is: [camera, anchor, register1-4, image1-M]. The 6 context tokens are placed first so that when trajectory frames are evicted, the retained tokens are a contiguous prefix — no memory compaction needed.
The augmented tokens pass through 24 alternating layers of:
A composite loss with three terms:
Training happens in two stages on 128 NVIDIA A100 GPUs (80GB):
| Stage | Attention | Views | Data | GPU Hours | Duration |
|---|---|---|---|---|---|
| 1. Base | Global (offline) | 2–24 | 29 datasets, mixed | 21,500 | ~7 days |
| 2. Streaming | GCA | 24→320 | Long video focus | 15,360 | ~5 days |
Key training hyperparameters: AdamW optimizer, learning rate 2e-4 with cosine warmup (2K steps), batch size 128 (Stage 1) / 64 (Stage 2), loss weights λdepth=1.0, λabs-pose=0.5, λrel-pose=0.1. The ViT backbone is initialized from DINOv2-L pretrained weights and fully fine-tuned (not frozen) — this is critical because the geometric features needed for pose estimation differ from DINOv2's self-supervised features.
Stage 1 builds general geometric priors with standard global attention. Stage 2 swaps in GCA and progressively increases the number of training views from 24 to 320, teaching the model to maintain consistency over longer and longer sequences.
Training directly on long sequences fails. In early training, the model's pose predictions are inaccurate. On a 320-frame sequence, a small rotation error at frame 10 compounds to a massive translation error at frame 300. The resulting loss gradients are noisy and unstable.
The progressive curriculum solves this: start with 24 frames (easy), then gradually extend. The model first learns reliable local geometry from short clips, then learns to maintain global consistency as the training horizon stretches. By the time it sees 320-frame sequences, its local predictions are already accurate enough to provide stable gradients.
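The text gives the endpoints of the curriculum (24 and 320 views) but not its shape. Purely as an illustration, the sketch below assumes a linear ramp over training steps. The function name and the linear schedule are assumptions, not the paper's recipe.

```python
def views_at_step(step, total_steps, start_views=24, end_views=320):
    """Number of training views at a given step, assuming a linear ramp."""
    if total_steps <= 1:
        return end_views
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return round(start_views + frac * (end_views - start_views))


if __name__ == "__main__":
    total = 1001
    for step in (0, 250, 500, 750, 1000):
        print(step, views_at_step(step, total))
```

Any monotone schedule fits the description equally well. The essential property is that long sequences appear only once short-clip predictions are reliable.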
At 320 views per training sample, GPU memory becomes the bottleneck due to the quadratic cost of cross-frame attention. LingBot-Map uses the Ulysses context-parallelism strategy with a parallelism dimension of 16: different views are distributed across GPUs, and attention is computed via efficient all-to-all collective communication. This allows training on sequences far longer than a single GPU's memory could support.
LingBot-Map trains on 29 datasets spanning indoor, outdoor, object-centric, synthetic, and real-world scenarios. The Stage 1 mix includes BlendedMVS, HyperSim, MegaDepth, TartanAir, ScanNet, and many more — roughly balanced sampling across all datasets.
In Stage 2, the distribution shifts heavily toward long-trajectory video datasets: TartanAir, MatrixCity, Waymo, KITTI-360, ScanNet++, and internal game data get upweighted, while multi-view-only datasets (no temporal structure) are down-weighted or dropped entirely.
To produce temporally coherent training subsequences from long videos, Stage 2 uses a foldback video sampler: it starts at a random frame and advances with a random stride. Upon reaching a sequence boundary, it reverses direction and draws a new stride (different from the previous one to avoid degenerate oscillation). This yields subsequences with naturally varying frame rates and no forward-time bias.
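The foldback sampler is easy to prototype. The sketch below follows the description above. The stride range (`max_stride`), the seeding interface, and the clamping used for very short videos are assumptions added to make the example self-contained.

```python
import random


def foldback_sample(num_frames, length, max_stride=4, seed=None):
    """Sample `length` frame indices, reversing direction at sequence boundaries."""
    if num_frames < 1 or max_stride < 2:
        raise ValueError("need at least one frame and max_stride >= 2")
    rng = random.Random(seed)
    idx = rng.randrange(num_frames)      # random starting frame
    stride = rng.randint(1, max_stride)  # random initial stride
    direction = 1
    indices = [idx]
    while len(indices) < length:
        nxt = idx + direction * stride
        if nxt < 0 or nxt >= num_frames:
            # Hit a boundary: reverse and draw a stride different from the last one.
            direction = -direction
            stride = rng.choice([s for s in range(1, max_stride + 1) if s != stride])
            nxt = idx + direction * stride
            # Only relevant when the video is shorter than the stride.
            nxt = max(0, min(num_frames - 1, nxt))
        idx = nxt
        indices.append(idx)
    return indices


if __name__ == "__main__":
    print(foldback_sample(num_frames=50, length=40, seed=0))
```

Requiring the new stride to differ from the old one means the reversed pass does not simply retrace the same frames, which is the degenerate oscillation the text mentions.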
Images are resized to max dimension 518px. Aggressive photometric augmentation is applied: random color jitter (brightness, contrast, saturation ±0.5; hue ±0.1) with probability 0.9, random grayscale with probability 0.05, and random spatial rescaling in [0.8x, 1.2x]. A co-jittering mode (probability 0.3) applies identical color transforms to all frames in a scene, encouraging the model to rely on geometric cues rather than appearance shortcuts.
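The difference between per-frame jitter and co-jittering comes down to where the random parameters are drawn. The sketch below only samples parameters and does not touch pixels. Two things are assumptions: ±0.5 is read as a multiplicative factor in [0.5, 1.5], and only the color parameters are shared in co-jitter mode. All names are hypothetical.

```python
import random


def sample_color_params(rng):
    """One draw of the photometric augmentation parameters."""
    params = {"jitter": None, "grayscale": rng.random() < 0.05}
    if rng.random() < 0.9:  # color jitter is applied with probability 0.9
        params["jitter"] = {
            "brightness": rng.uniform(0.5, 1.5),
            "contrast": rng.uniform(0.5, 1.5),
            "saturation": rng.uniform(0.5, 1.5),
            "hue": rng.uniform(-0.1, 0.1),
        }
    return params


def sample_scene_params(num_frames, rng, co_jitter_prob=0.3):
    """Per-frame parameters for one scene; co-jitter shares one draw across frames."""
    if rng.random() < co_jitter_prob:
        shared = sample_color_params(rng)
        return [shared for _ in range(num_frames)]
    return [sample_color_params(rng) for _ in range(num_frames)]


if __name__ == "__main__":
    rng = random.Random(0)
    for params in sample_scene_params(4, rng):
        print(params)
```

When every frame in a scene receives the same color transform, appearance differences between frames carry no augmentation noise, so the remaining cue for matching is geometry.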
LingBot-Map is evaluated against three categories of baselines: offline feed-forward models (VGGT, DA3, Pi3), optimization-based methods (DroidSLAM, VIPE), and streaming methods (CUT3R, TTT3R, Wint3R, Stream3R, InfiniteVGGT).
This is the hardest benchmark: complex indoor-outdoor transitions, revisits after long gaps, and large scale variation. In the sparse setting (320 frames):
| Method | Type | ATE ↓ |
|---|---|---|
| VGGT | offline | 24.78 |
| DA3 | offline | 12.87 |
| VIPE | optim | 10.52 |
| CUT3R | online | 18.16 |
| TTT3R | online | 19.35 |
| Wint3R | online | 21.10 |
| Stream3R | online | 29.58 |
| LingBot-Map | online | 6.42 |
LingBot-Map achieves 6.42m ATE: about 39% lower error than the best optimization-based method (VIPE at 10.52), and 2.8x lower than the best streaming competitor (CUT3R at 18.16). Despite being a streaming model, it outperforms every offline model too.
Absolute Trajectory Error (ATE) measures the root-mean-square distance between predicted and ground-truth camera positions across all frames. An ATE of 6.42m on Oxford Spires (which spans a ~500m trajectory through buildings and courtyards) means the RMS position error is about 1.3% of the trajectory length. For a robot following this trajectory, that's the difference between "knows which room it's in" and, at VGGT's 24.78m ATE, "confused about which floor it's on".
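For reference, the metric itself is a one-liner. The sketch below assumes the predicted trajectory has already been aligned to the ground-truth frame. The text does not say which alignment the benchmark uses, so that step is left out. The function name is illustrative.

```python
import math


def ate_rmse(pred_positions, gt_positions):
    """Root-mean-square distance between matched camera positions."""
    if len(pred_positions) != len(gt_positions) or not gt_positions:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [math.dist(p, g) ** 2 for p, g in zip(pred_positions, gt_positions)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
    print(ate_rmse(pred, gt))                  # errors of 3 m and 4 m
    print(f"{6.42 / 500:.1%} of a 500 m path")  # relative error quoted in the text
```

Because the errors are squared before averaging, a few badly placed frames raise ATE more than many slightly misplaced ones.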
Why do offline methods struggle here? They're trained on datasets where consecutive frames are close together and observe the same local region. Oxford Spires has complex scene transitions — outdoor courtyards to dark staircases — and large viewpoint changes that break their learned priors. LingBot-Map, trained progressively on long trajectories, handles these transitions naturally.
On pose accuracy (AUC@15), LingBot-Map achieves 61.64, more than doubling VGGT (23.84) and exceeding even DA3 (49.84). Among streaming methods, the best competitor manages only 13.92 (TTT3R). The gap is dramatic.
| Dataset | Best Competitor ATE | LingBot-Map ATE |
|---|---|---|
| ETH3D | 0.86 (Wint3R) | 0.22 |
| 7-Scenes | 0.10 (TTT3R / Stream3R) | 0.08 |
| Tanks & Temples | 0.47 (CUT3R) | 0.19 |
The improvements are consistent across all benchmarks, not just Oxford Spires. On ETH3D, LingBot-Map achieves nearly 4x lower trajectory error than the next best streaming method.
Pose accuracy alone doesn't tell the full story. LingBot-Map also produces dense depth maps per frame, which combine with the estimated poses to produce 3D point clouds. On ETH3D and 7-Scenes, the reconstruction quality (measured by F1 score) consistently exceeds all streaming competitors.
The paper ablates each component of GCA on Oxford Spires (dense, 3840 frames):
| Configuration | ATE | Note |
|---|---|---|
| Full GCA | 7.11 | All three context types |
| Remove trajectory memory | 12.4 | No long-range drift correction |
| Remove anchor context | 15.8 | No coordinate grounding |
| Remove relative pose loss | 9.2 | Less local consistency |
Both the anchor context and trajectory memory are critical. Without anchors, the model loses its coordinate reference — scale and position drift freely. Without trajectory memory, pose errors accumulate unchecked over long sequences, because the model has no record of where it has been beyond the local window.
The relative pose loss also contributes meaningfully (ATE 9.2 without it vs. 7.11 with it), showing that encouraging local geometric consistency within the window improves the overall trajectory quality.
Absolute Trajectory Error in meters. Lower is better. LingBot-Map (rightmost) vs. competing streaming and offline methods.
Accuracy is only half the story for a streaming system. The other half is: can it keep up with the camera?
A streaming system that takes 5 seconds per frame isn't streaming — it's batch processing with extra steps. LingBot-Map needs to process frames as fast as the camera produces them.
At 518 × 378 resolution with a sliding window of k=64 frames on a single NVIDIA A100, LingBot-Map runs at ~20 FPS using FlashInfer-based paged KV-cache management. That's comfortably real-time for most video feeds (typically 30 FPS, but every other frame is often sufficient).
Per-frame latency breakdown: ViT backbone forward pass: ~18ms. Frame Attention (12 layers): ~8ms. GCA cross-frame attention (12 layers): ~16ms. Prediction heads (camera + DPT): ~7ms. Total: ~49ms = 20.4 FPS. The ViT backbone is the largest single cost, but GCA is the only component whose cost depends on sequence length, because it attends over all context tokens (anchors + window + trajectory). As T grows, GCA latency increases by ~0.3ms per 1000 additional trajectory frames, which stays small at practical sequence lengths.
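The breakdown can be turned into a tiny latency model. The stage timings and the 0.3 ms per 1,000 frames growth rate are the figures quoted above. Treating the growth as linear and starting from the 49 ms baseline is an assumption, and the names are illustrative.

```python
STAGE_MS = {
    "vit_backbone": 18.0,
    "frame_attention": 8.0,
    "gca_attention": 16.0,
    "prediction_heads": 7.0,
}
GCA_GROWTH_MS_PER_1K_FRAMES = 0.3


def frame_latency_ms(extra_trajectory_frames=0):
    """Per-frame latency, with GCA cost growing linearly in trajectory length."""
    growth = GCA_GROWTH_MS_PER_1K_FRAMES * extra_trajectory_frames / 1000
    return sum(STAGE_MS.values()) + growth


def fps(extra_trajectory_frames=0):
    return 1000.0 / frame_latency_ms(extra_trajectory_frames)


if __name__ == "__main__":
    for extra in (0, 10_000, 100_000):
        print(f"+{extra} frames: {frame_latency_ms(extra):.1f} ms, {fps(extra):.1f} FPS")
```

Under this model, 10,000 additional trajectory frames add about 3 ms per frame, so throughput drops by roughly one frame per second.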
For comparison:
| Method | FPS | Note |
|---|---|---|
| Stream3R-w | 3.88 | Slow from dense caching |
| LingBot-Map | 20.29 | Paged KV-cache + GCA |
| CUT3R | 29.21 | Fast but drifts |
| InfiniteVGGT | 28.97 | Fast but drifts |
CUT3R and InfiniteVGGT are faster (both at ~29 FPS), but their accuracy degrades badly on long sequences: CUT3R's ATE nearly doubles from 320 to 3,840 frames. Stream3R-w, the most accurate competitor, runs at only 3.88 FPS. LingBot-Map sits at the sweet spot: fast enough for real-time use, accurate enough for thousands of frames. GCA reaches both by being surgical about what to keep and what to discard.
The key efficiency insight: once a frame leaves the sliding window, it contributes only 6 tokens to the KV cache — regardless of image resolution. This means memory grows at just 6 tokens per frame, compared to ~506 for causal attention.
At T=10,000 frames: GCA uses ~70K tokens. Causal attention would need ~5M tokens. That's the difference between "runs on a single GPU" and "doesn't fit."
The sliding window and trajectory eviction require frequent cache updates — appending new entries and discarding old image tokens. With a standard contiguous memory layout, this means repeated reallocation. LingBot-Map uses a paged KV-cache (via FlashInfer) where updates affect only newly appended tokens, eliminating reallocation overhead. This alone provides a ~2x speedup over a naive PyTorch implementation (20 vs 10.5 FPS).
On a single NVIDIA A100 (80GB), the model weights occupy ~1.2GB (ViT-L + heads). The remaining memory goes to the KV cache. At k=64 (window), n=3 (anchors):
Compare to causal attention at T=10,000: 510 x 10,000 = 5.1M tokens → ~360 GB. That's 4.5 A100s just for the KV cache. GCA makes the difference between "runs on one GPU" and "physically impossible."
For sequences exceeding the training length (~320 views), LingBot-Map uses adaptive keyframe selection: it computes optical flow between the current frame and the last keyframe using the predicted pose and depth. If the flow exceeds a threshold, the frame becomes a keyframe. Otherwise it's discarded. This extends the effective range to ~3,000 frames in Direct mode.
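One way to picture the keyframe test is to compute the flow induced by the predicted geometry. The sketch below back-projects pixels of the last keyframe with their predicted depth, moves them into the current camera with the predicted relative pose, and measures the mean pixel displacement. The pinhole intrinsics, the 20-pixel threshold, and the function names are all assumptions chosen for illustration.

```python
import math


def induced_flow(pixels, depths, intrinsics, R_rel, t_rel):
    """Mean displacement (pixels) of keyframe points reprojected into the current frame.

    R_rel, t_rel map keyframe camera coordinates to current camera coordinates.
    """
    fx, fy, cx, cy = intrinsics
    total, count = 0.0, 0
    for (u, v), z in zip(pixels, depths):
        # Back-project the pixel to a 3D point in the keyframe camera.
        X = [(u - cx) / fx * z, (v - cy) / fy * z, z]
        # Move the point into the current camera.
        Y = [sum(R_rel[r][c] * X[c] for c in range(3)) + t_rel[r] for r in range(3)]
        if Y[2] <= 0:
            continue  # point ends up behind the camera
        u2 = fx * Y[0] / Y[2] + cx
        v2 = fy * Y[1] / Y[2] + cy
        total += math.hypot(u2 - u, v2 - v)
        count += 1
    # No usable points means no measurable overlap: treat it as a large flow.
    return total / count if count else float("inf")


def is_keyframe(flow_px, threshold_px=20.0):
    """Promote the frame to a keyframe when the induced flow exceeds the threshold."""
    return flow_px > threshold_px


if __name__ == "__main__":
    I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    K = (500.0, 500.0, 259.0, 189.0)
    pixels, depths = [(100.0, 100.0), (300.0, 200.0)], [1.0, 1.0]
    flow = induced_flow(pixels, depths, K, I, [0.1, 0.0, 0.0])
    print(flow, is_keyframe(flow))
```

A stationary or slowly moving camera produces little induced flow, so its frames are skipped and the number of processed frames tracks scene coverage rather than raw video length.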
LingBot-Map provides two modes depending on sequence length:
| Mode | Range | How it works | Trade-off |
|---|---|---|---|
| Direct | Up to ~3K frames | Continuous GCA with full three-level context, no resets | No alignment error; accuracy degrades beyond training length |
| VO | 10K+ frames | Overlapping local windows with Sim(3) alignment at boundaries | Bounded memory for any length; small drift at window boundaries |
The Direct mode is preferred when the sequence fits within ~10x the training length. It produces more accurate trajectories because there are no inter-window alignment errors. For city-scale sequences or hour-long videos, VO mode scales indefinitely.
LingBot-Map sits at the intersection of learned 3D reconstruction and classical SLAM. Let's map the connections.
VGGT is the offline predecessor. It uses bidirectional cross-view attention to process all frames simultaneously — powerful but not streamable. LingBot-Map inherits VGGT's alternating frame/cross-frame attention design but replaces the cross-frame component with GCA, adding the three-level context structure that enables streaming. The ViT backbone weights transfer directly.
DUSt3R pioneered feed-forward 3D reconstruction from unposed images, but only for two views. VGGT extended it to multiple views. LingBot-Map extends it to streaming multiple views — potentially infinite — by solving the context management problem that neither DUSt3R nor VGGT addressed.
LingBot-Map explicitly borrows SLAM's three-level context decomposition (reference frame, local window, global map) but replaces every hand-crafted component with learned attention. No keypoint extraction, no feature matching, no bundle adjustment, no pose-graph optimization. Just a transformer with a structured mask.
CUT3R uses RNN-style recurrent compression — constant memory but state forgetting. After a few hundred frames, the recurrent state has been overwritten so many times that the model loses track of geometry it observed early on. TTT3R tries to fix this with test-time training: it fine-tunes model weights on each new frame. This helps with forgetting but adds significant computational overhead — you're running backpropagation during inference.
LingBot-Map's trajectory memory achieves the same goal (compact global context) without recurrence or test-time training: it simply keeps 6 tokens per past frame and lets attention do the integration. No gradient updates at test time, no state compression bottleneck.
These methods adapt VGGT to streaming by using causal attention with KV-caching. They keep all past tokens, which gives good short-sequence accuracy but causes memory and latency to grow linearly. On the Oxford Spires benchmark, their ATE degrades significantly as sequences get longer. LingBot-Map's structured eviction (keeping only 6 context tokens per past frame) addresses exactly this bottleneck.
VGGSfM targets Structure-from-Motion (offline, unordered images). LingBot-Map targets streaming video. They share the vision of replacing traditional pipelines with end-to-end transformers, but solve different problems. An interesting open question: could GCA's context structure be adapted for unordered image sets (SfM) by treating loop closures as a form of trajectory memory?
These concurrent works also tackle long-sequence 3D reconstruction, but rely on test-time training (TTT) for global consistency. LoGeR combines sliding window attention with TTT for global alignment. Scal3R extends TTT with visual place recognition for city-scale scenes. ZipMap uses TTT layers to compress an entire image collection into a compact hidden state.
The key difference: all three require gradient updates at inference time. LingBot-Map is purely feed-forward — no parameter updates during inference. This makes it faster and simpler to deploy, at the cost of relying entirely on the attention mechanism (rather than test-time optimization) for global consistency.
| Aspect | LingBot-Map |
|---|---|
| Input | Streaming video frames (one at a time) |
| Output | Camera pose + depth map per frame |
| Backbone | ViT (DINOv2, patch size 14) |
| Attention | Frame Attn + GCA (anchor / window / trajectory), 24 layers |
| Anchor | First n=3 frames, full tokens, permanent |
| Window | Last k=16–64 frames, full tokens |
| Trajectory | All other past frames, 6 tokens each |
| Per-frame cost | (n+k)·M + 6T tokens (nearly constant) |
| Speed | ~20 FPS at 518×378 |
| Max length | ~3K (direct), 10K+ (VO mode) |
| Key result | 6.42 ATE on Oxford Spires (2.8x better than CUT3R) |
Several interesting directions remain: