Kuznetsov, Bhutra, Pal — 2025 (Reimagined)

GAS v2

Rebuilding the semantic mapping pipeline with 2025 foundation models — zero fine-tuning, open vocabulary, no camera calibration.

Prerequisites: GAS v1 + DINOv2 + Basic Python
12 Chapters · 6+ Simulations

Chapter 0: GAS v1 Recap & What Changed

In 2024, Kuznetsov, Bhutra, and Pal built GAS — a pipeline that turns monocular video into semantic floorplans. The pipeline chained four components: GO-SLAM for 3D reconstruction, Faster R-CNN for object detection, SAM for segmentation masks, and a geometric projection step to map everything into 2D.

It worked. On the Replica dataset, GAS produced readable semantic floorplans from nothing but RGB video. But the engineering was painful.

What worked in GAS v1

What was painful

The thesis of GAS v2: Every pain point above is addressed by a foundation model released between 2023 and 2025. No fine-tuning (Grounding DINO is zero-shot). No intrinsics (VGGT estimates everything). No fixed vocabulary (open-vocab detection). Shared backbone (a DINOv2 ViT underlies the geometry and depth components). Learned merging (ConceptGraphs with language features). The pipeline design was right; the components just needed to catch up.

What was the single most labor-intensive part of building GAS v1?

Chapter 1: The Shared Backbone Revolution

GAS v1 ran three completely separate neural networks on every video frame. GO-SLAM used its own CNN encoder. Faster R-CNN used a ResNet-50 backbone with FPN. SAM used a ViT-H image encoder. Each model independently extracted features from the same image, tripling the compute.

In 2025, a single model family dominates visual feature extraction: DINOv2. Meta's self-supervised ViT, trained on 142M curated images, produces features so general that they serve as the backbone for nearly every downstream vision task.

DINOv2 is everywhere

Look at what uses DINOv2 internally:

This is the shared backbone revolution. Instead of running three unrelated networks, GAS v2 can extract DINOv2 features once and route them to multiple task heads. One forward pass through a ViT-G produces features rich enough for geometry estimation, object detection, segmentation, and depth prediction simultaneously.

The data flow through a shared backbone: Input image (518×518×3, resized for DINOv2) → ViT-G encoder (1.1B params) → 37×37 = 1,369 patch tokens of dimension 1536 → these tokens are routed to lightweight task heads. The geometry head (a DPT-style decoder, ~20M params) produces a pointmap (518×518×3, xyz per pixel). The detection head produces N bounding boxes with text-matched labels. The segmentation head produces N binary masks. The depth head (another DPT decoder, ~20M params) produces metric depth (518×518, float32 in meters). Total shared compute: ~800ms for the backbone. Each head adds ~50-100ms. Compare to running three separate networks: 3×800ms = 2.4 seconds.
The practical benefit: On an RTX 4090, running three separate model forward passes takes ~180ms per frame. A shared DINOv2 backbone with lightweight task heads takes ~60ms — a 3x speedup with equal or better quality. This is the difference between real-time (15+ FPS) and batch-only processing.
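
The token counts quoted above follow directly from the patch size. A minimal sketch of that arithmetic, assuming square 14-pixel patches; the helper name `patch_grid` is made up for illustration:

```python
def patch_grid(height: int, width: int, patch: int = 14) -> tuple[int, int, int]:
    """Return (rows, cols, num_tokens) for a ViT with square patches."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


rows, cols, tokens = patch_grid(518, 518)
print(rows, cols, tokens)  # 37 37 1369

# Size of the shared token tensor for one image at embedding width 1536 (ViT-G)
token_floats = tokens * 1536
print(f"{token_floats * 4 / 1e6:.1f} MB in float32")
```

Every task head reads this same token tensor, which is why the backbone cost is paid once per frame rather than once per task.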

How feature sharing works

DINOv2 ViT-G processes an image into a grid of 1536-dimensional patch tokens (e.g., 16×16 = 256 tokens for a 224×224 image at patch size 14, and 37×37 = 1,369 tokens at 518×518). These tokens encode both local appearance (texture, edges, color) and global semantics (object identity, scene category). Task heads are lightweight networks that read these tokens and produce task-specific outputs:

Not all components share weights in practice (yet): VGGT, Grounding DINO, SAM 2, and Depth Anything V2 each ship as separate models with their own weights. VGGT and Depth Anything V2 already build on DINOv2, while Grounding DINO (Swin) and SAM 2 (Hiera) use their own encoders, so a truly shared backbone would mean retraining those heads on DINOv2 features. The shared backbone architecture is where the field is heading (and what makes sense for a production pipeline), but today you still load each model separately.
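
To make the "extract once, route to many heads" idea concrete, here is a toy sketch. It is not a real model: `CountingBackbone` returns random tokens and the heads are random linear maps, all hypothetical stand-ins chosen only to show the control flow.

```python
import numpy as np


class CountingBackbone:
    """Stand-in for a ViT encoder: returns random patch tokens and counts calls."""

    def __init__(self, dim=1536, patch=14, seed=0):
        self.dim = dim
        self.patch = patch
        self.calls = 0
        self.rng = np.random.default_rng(seed)

    def __call__(self, image):
        self.calls += 1
        rows, cols = image.shape[0] // self.patch, image.shape[1] // self.patch
        return self.rng.standard_normal((rows * cols, self.dim)).astype(np.float32)


def make_linear_head(in_dim, out_dim, seed):
    """A 'task head' reduced to one matrix multiply over the shared tokens."""
    weights = np.random.default_rng(seed).standard_normal((in_dim, out_dim)).astype(np.float32)
    return lambda tokens: tokens @ weights


backbone = CountingBackbone()
heads = {
    "depth": make_linear_head(1536, 1, seed=1),  # one value per patch
    "xyz": make_linear_head(1536, 3, seed=2),    # one 3D point per patch
}

image = np.zeros((518, 518, 3), dtype=np.uint8)
tokens = backbone(image)  # the only expensive call
outputs = {name: head(tokens) for name, head in heads.items()}
print(backbone.calls, {name: out.shape for name, out in outputs.items()})
```

Adding a third head costs one more matrix multiply, not another backbone pass.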
Why is DINOv2 suitable as a universal visual backbone?

Chapter 2: 3D Geometry: VGGT or MASt3R-SLAM

GO-SLAM was GAS v1's geometry backbone. It worked, but it required camera intrinsics (the K matrix: focal length, principal point), sometimes stereo pairs, and ran a complex neural SLAM loop. In 2025, two alternatives have made calibrated cameras optional.

Option A: VGGT (offline, batch)

VGGT (Visual Geometry Grounded Transformer) won CVPR 2025 Best Paper. Feed it a set of unposed images — no camera intrinsics, no ordering, no calibration — and it outputs in a single forward pass:

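# VGGT: single forward pass, all geometry
import torch
from vggt.models.vggt import VGGT
from vggt.utils.pose_enc import pose_encoding_to_extri_intri

model = VGGT.from_pretrained("facebook/VGGT-1B")
model.eval().cuda()

# images: (B, N, 3, H, W) - batch of N unposed images, H and W divisible by 14
with torch.no_grad():
    predictions = model(images)

points = predictions["world_points"]     # (B, N, H, W, 3) 3D point per pixel
conf = predictions["world_points_conf"]  # per-pixel confidence for those points
depth = predictions["depth"]             # per-pixel depth for each frame
# Camera matrices are decoded from the pose encoding
extrinsics, intrinsics = pose_encoding_to_extri_intri(
    predictions["pose_enc"], images.shape[-2:]
)  # extrinsics: (B, N, 3, 4), intrinsics: (B, N, 3, 3) estimated K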

No calibration. No sequential processing. No SLAM loop. Just images in, geometry out.

VGGT data flow and engineering details: VGGT takes N unposed images (typically 8-64 frames) at resolution divisible by 14 (patch size). Internally: ViT backbone (DINOv2-initialized, ~1B params) extracts per-image features → cross-view transformer layers attend across image pairs → regression heads output per-pixel pointmaps (N×H×W×3), extrinsics (N×3×4), and intrinsics (N×3×3). Memory requirement: ~12GB VRAM for 32 frames at 504×378. Inference time: ~2.1 seconds for 32 frames on an A10G. The key constraint: H and W must be divisible by 14 or the ViT patch embedding will silently crop pixels. Always resize frames before feeding to VGGT.
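
The divisibility constraint is easy to enforce before resizing. A minimal sketch that only computes the target size (the actual resize can use any image library); the helper name `snap_to_patch` and the round-down policy are illustrative choices, not something VGGT prescribes:

```python
def snap_to_patch(height: int, width: int, patch: int = 14) -> tuple[int, int]:
    """Round each dimension down to the nearest multiple of the patch size."""
    new_h = max(patch, (height // patch) * patch)
    new_w = max(patch, (width // patch) * patch)
    return new_h, new_w


print(snap_to_patch(480, 640))  # (476, 630)
print(snap_to_patch(378, 504))  # (378, 504), already valid
```

Rounding down avoids upsampling; rounding to the nearest multiple would work equally well as long as both dimensions end up divisible by 14.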

Option B: MASt3R-SLAM (real-time, streaming)

MASt3R-SLAM is real-time dense SLAM built on top of MASt3R's learned stereo matching. It processes frames sequentially as they arrive, maintains a running map, and handles loop closures. Unlike VGGT (which processes a batch at once), MASt3R-SLAM is designed for live operation.

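# MASt3R-SLAM: real-time streaming
# Illustrative interface: the class and method names below are placeholders for
# a thin wrapper around the MASt3R-SLAM repo, not its published API.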
from mast3r_slam import MASt3RSLAM

slam = MASt3RSLAM(config="configs/base.yaml")

for frame in video_stream:
    result = slam.process_frame(frame)
    pose = result.pose        # current camera pose
    pointmap = result.points  # current dense 3D map
    if result.loop_closure:
        print("Loop closure detected!")

When to use which

VGGT for offline/batch processing — you have all frames upfront and want highest quality. Process a recorded video and get jointly estimated, globally consistent geometry.

MASt3R-SLAM for real-time/streaming — robot is exploring live, needs poses as it moves. Slightly lower accuracy but runs at 15+ FPS.

Both approaches eliminate GAS v1's most fragile dependency: the camera intrinsics matrix K. GO-SLAM would produce distorted geometry if K was wrong by even a few pixels. VGGT estimates K. MASt3R-SLAM uses learned priors that are robust to approximate intrinsics. This single change eliminates hours of calibration debugging.

What is the key advantage VGGT has over GO-SLAM for semantic mapping?

Chapter 3: Open-Vocabulary Detection: Grounding DINO

GAS v1's detection story was the most painful part of the entire project. To detect 10 object classes (door, chair, table, toilet, sink, sofa, bed, TV, refrigerator, exit sign), the authors had to:

  1. Collect images from 5 different datasets (ScanNet, Matterport3D, ADE20K, NYU Depth V2, COCO)
  2. Convert all annotations to COCO format (different datasets use different schemas)
  3. Balance class distributions (some classes had 50x more samples than others)
  4. Fine-tune Faster R-CNN with ResNet-50-FPN backbone
  5. Tune confidence thresholds per class

All of this to detect 10 classes. Want to add "bookshelf"? Start over: find training data, re-annotate, re-train.

Grounding DINO: zero-shot, open-vocabulary

Grounding DINO detects any object described in natural language. No training data. No fine-tuning. No COCO format conversion. You just describe what you want to find.

# Grounding DINO: detect anything from text
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "groundingdino/config/GroundingDINO_SwinB_cfg.py",
    "weights/groundingdino_swinb_cogcoor.pth"
)

# load_image returns the original array plus the normalized tensor predict() expects
# ("frame.jpg" is a placeholder path)
image_source, image = load_image("frame.jpg")

# Detect ANY objects via text prompt
text_prompt = "door . chair . table . exit sign . bookshelf . window"
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=text_prompt,
    box_threshold=0.3,
    text_threshold=0.25
)
# boxes: (N, 4) in [cx, cy, w, h] normalized
# phrases: ['door', 'chair', 'table', ...] matched labels

The paradigm shift: GAS v1 spent weeks assembling 12,000 images to detect 10 classes. GAS v2 detects any class you can name with a single predict() call. Want to add "fire extinguisher"? Just add it to the text prompt. This is the difference between a closed-vocabulary pipeline (fixed at training time) and an open-vocabulary pipeline (defined at inference time).

How Grounding DINO works

Grounding DINO fuses a vision transformer (Swin-B or Swin-L) with a text encoder (BERT). It uses cross-modality attention to match text tokens to image regions:

  1. The text encoder processes your prompt into text features
  2. The image encoder extracts multi-scale visual features
  3. Cross-attention layers fuse text and image features bidirectionally
  4. A detection head predicts boxes where text-image attention is highest

The dot-separated prompt format ("door . chair . table") tells the model to detect each concept independently. Each detected box comes with the matching text phrase and a confidence score.
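
Building the dot-separated prompt from a class list is a one-liner, but a small helper keeps it consistent across the pipeline. A sketch, assuming lowercase, whitespace-normalized, de-duplicated class names are what you want; `build_prompt` is a hypothetical helper, not part of Grounding DINO:

```python
def build_prompt(classes):
    """Join class names into a 'a . b . c' prompt, dropping blanks and duplicates."""
    cleaned = []
    for name in classes:
        name = " ".join(name.lower().split())  # lowercase, collapse whitespace
        if name and name not in cleaned:
            cleaned.append(name)
    return " . ".join(cleaned)


print(build_prompt(["Door", "chair", " exit  sign ", "chair"]))
# door . chair . exit sign
```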

Grounding DINO data flow: Input image (800×1333 after resize) + text prompt → Swin-B image encoder (88M params) produces multi-scale features at 1/8, 1/16, 1/32 resolution → BERT text encoder (110M params) produces text tokens → 6 cross-attention fusion layers fuse text↔image bidirectionally → detection head outputs boxes in [cx, cy, w, h] normalized format + text-matched phrases. The model's output boxes tensor is (N, 4) and logits is (N,) with confidence scores. Boxes with logits below box_threshold are filtered. Inference: ~125ms per frame on RTX 4090 with Swin-B. The HuggingFace transformers port runs in pure PyTorch — no custom CUDA ops needed.
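
Downstream stages (mask prompting, cropping for CLIP) usually want pixel-space corner boxes rather than normalized center boxes. A minimal conversion sketch; the function name is illustrative:

```python
import numpy as np


def cxcywh_norm_to_xyxy_pixels(boxes, image_w, image_h):
    """Convert (N, 4) normalized [cx, cy, w, h] boxes to pixel [x1, y1, x2, y2]."""
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
    cx, cy, w, h = boxes.T
    x1 = (cx - w / 2) * image_w
    y1 = (cy - h / 2) * image_h
    x2 = (cx + w / 2) * image_w
    y2 = (cy + h / 2) * image_h
    return np.stack([x1, y1, x2, y2], axis=1)


# A box centered in a 640x480 frame that covers half of each dimension
print(cxcywh_norm_to_xyxy_pixels([[0.5, 0.5, 0.5, 0.5]], image_w=640, image_h=480))
# [[160. 120. 480. 360.]]
```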

Practical tips

How does Grounding DINO eliminate the need for fine-tuning in the detection stage?

Chapter 4: Grounded SAM 2 — Detect + Segment + Track

In GAS v1, detection and segmentation were separate steps with separate models. Faster R-CNN produced bounding boxes. Each box was then fed to SAM as a point prompt to generate a mask. For video, masks were merged across frames using IoU matching — a brittle process that lost objects when viewpoints changed significantly.
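
To see why frame-to-frame IoU matching is brittle, consider how quickly mask IoU collapses as an object shifts in the image. A toy sketch with synthetic square masks; the 0.5 match threshold is an assumption for illustration, not a value taken from GAS v1:

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two binary masks of the same shape."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


frame_t = np.zeros((100, 100), dtype=bool)
frame_t[20:60, 20:60] = True                 # 40x40 object

small_shift = np.roll(frame_t, 5, axis=1)    # object moved 5 px
large_shift = np.roll(frame_t, 30, axis=1)   # object moved 30 px

MATCH_THRESHOLD = 0.5                        # assumed threshold
print(mask_iou(frame_t, small_shift) > MATCH_THRESHOLD)  # True: still matched
print(mask_iou(frame_t, large_shift) > MATCH_THRESHOLD)  # False: track is lost
```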

Grounded SAM 2 combines Grounding DINO and SAM 2 into a single system that detects, segments, and tracks objects across video frames. One call replaces three separate steps from v1.

The three stages in one

  1. Detect — Grounding DINO finds objects from text prompt → bounding boxes
  2. Segment — SAM 2 generates pixel-precise masks from boxes → instance masks
  3. Track — SAM 2's memory attention propagates masks across video frames → consistent object IDs
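
# Grounded SAM 2: detect + segment + track
# Illustrative interface: GroundedSAM2Pipeline is a placeholder name for your own
# glue code around Grounding DINO and SAM 2, not a class shipped upstream.
from grounded_sam2 import GroundedSAM2Pipeline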

pipeline = GroundedSAM2Pipeline(
    grounding_model="gdino_swinb",
    sam2_checkpoint="sam2_hiera_large.pt",
    device="cuda"
)

# Single frame: detect + segment
results = pipeline.predict(
    image=frame,
    text_prompt="door . chair . table . exit sign",
    box_threshold=0.3
)
masks = results.masks       # (N, H, W) binary masks
labels = results.labels     # ['door', 'chair', ...]
boxes = results.boxes       # (N, 4) bounding boxes

# Video tracking: propagate masks with memory
tracker = pipeline.create_video_tracker()
for i, frame in enumerate(video_frames):
    if i == 0:
        # Initialize with detections on first frame
        tracked = tracker.init_frame(frame, results)
    else:
        # Propagate masks using SAM 2 memory bank
        tracked = tracker.propagate(frame)
    # tracked.masks has consistent IDs across frames

SAM 2's memory bank is the key innovation: Unlike GAS v1's IoU matching (which compared masks frame-by-frame and lost objects when they were occluded), SAM 2 maintains a memory bank of object appearances. When an object disappears behind a wall and reappears, SAM 2 recognizes it as the same object. This is critical for mapping: you want "chair #3" to be the same chair regardless of which frame it appears in.

SAM 2 internals and what degrades: SAM 2's Hiera image encoder (224M params) produces per-frame features. The memory bank stores up to 16 recent frame features + the initial prompt frame. Memory attention (cross-attention between current frame and stored memories) runs at each frame. Key failure modes: (1) Objects that change appearance drastically between views (e.g., back of a chair looks nothing like the front) can lose tracking, because the memory bank stores 2D appearance, not 3D identity. (2) Multiple similar objects (three identical chairs) can swap IDs when they pass near each other. (3) SAM 2 seeds fresh IDs on each "keyframe" where Grounding DINO re-detects, so post-hoc merging by label + 3D proximity is required to consolidate tracks. Inference cost: ~150ms per frame for mask propagation on an A10G.
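
The post-hoc merge by label + 3D proximity can be sketched with a small union-find. This is one possible approach, not the exact procedure used in the build: the track dictionary layout and the 0.5 m distance threshold are assumptions made for illustration.

```python
import numpy as np


def merge_tracks(tracks, max_dist=0.5):
    """Group tracks with the same label whose centroids lie within max_dist meters.

    tracks: list of dicts with keys 'id', 'label', 'centroid' (3 floats).
    Returns {track_id: merged_id}, where merged_id is the earliest track in the group.
    """
    parent = list(range(len(tracks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(tracks)):
        for j in range(i + 1, len(tracks)):
            if tracks[i]["label"] != tracks[j]["label"]:
                continue
            gap = np.linalg.norm(
                np.asarray(tracks[i]["centroid"]) - np.asarray(tracks[j]["centroid"])
            )
            if gap < max_dist:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    return {t["id"]: tracks[find(k)]["id"] for k, t in enumerate(tracks)}


tracks = [
    {"id": 0, "label": "chair", "centroid": [1.0, 0.0, 2.0]},
    {"id": 1, "label": "chair", "centroid": [1.2, 0.0, 2.1]},  # same chair, re-seeded
    {"id": 2, "label": "table", "centroid": [1.1, 0.0, 2.0]},  # close, different label
    {"id": 3, "label": "chair", "centroid": [4.0, 0.0, 1.0]},  # a different chair
]
print(merge_tracks(tracks))  # {0: 0, 1: 0, 2: 2, 3: 3}
```

Label equality is a blunt test; comparing CLIP features instead of labels would handle prompts where the same object is detected under two different phrases.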

Why this matters for semantic mapping

GAS v1 had to solve two hard problems in post-processing: (1) which masks across different frames correspond to the same object? and (2) how do we merge their 3D projections? Grounded SAM 2 solves most of problem (1) directly: within a tracked segment, every mask carries a consistent track ID. Problem (2) then becomes much simpler: aggregate all 3D points that share a track ID, and consolidate the tracks that were re-seeded on different keyframes.
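
Aggregating 3D points by track ID is mostly bookkeeping. A minimal sketch, assuming each frame provides a per-pixel pointmap, a stack of boolean masks, and one track ID per mask; the data layout and function name are illustrative:

```python
from collections import defaultdict

import numpy as np


def aggregate_points(frames):
    """Collect the 3D points under each tracked mask across frames.

    frames: iterable of (pointmap (H, W, 3), masks (N, H, W) bool, track_ids of length N).
    Returns {track_id: (K, 3) array of points}.
    """
    buckets = defaultdict(list)
    for pointmap, masks, track_ids in frames:
        for mask, track_id in zip(masks, track_ids):
            buckets[track_id].append(pointmap[mask])  # (K_i, 3) points under the mask
    return {tid: np.concatenate(chunks, axis=0) for tid, chunks in buckets.items()}


pointmap = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)

masks_a = np.zeros((1, 4, 4), dtype=bool)
masks_a[0, 0, :2] = True   # track 7 covers 2 pixels in frame A

masks_b = np.zeros((2, 4, 4), dtype=bool)
masks_b[0, 1, :3] = True   # track 7 covers 3 pixels in frame B
masks_b[1, 3, 3] = True    # track 9 covers 1 pixel in frame B

objects = aggregate_points([(pointmap, masks_a, [7]), (pointmap, masks_b, [7, 9])])
print({tid: pts.shape for tid, pts in objects.items()})  # {7: (5, 3), 9: (1, 3)}
```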

What does SAM 2's memory bank provide that GAS v1's IoU matching cannot?

Chapter 5: Metric Depth: Depth Anything V2

GAS v1 relied on GO-SLAM's depth estimation or, when available, RGB-D sensor data. Both approaches have limitations: GO-SLAM's depth is tied to its SLAM quality, and RGB-D sensors (like Intel RealSense) are expensive, fragile, and limited in range.

Depth Anything V2 produces metric depth maps from any single RGB image. No stereo pair. No depth sensor. No camera calibration. Just one image → one depth map with real-world scale.

# Depth Anything V2: metric depth from any image
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2  # from the repo's metric_depth/ folder

model = DepthAnythingV2(
    encoder="vitl",  # vits, vitb, or vitl
    max_depth=20.0   # indoor scenes: 20m max
)
model.load_state_dict(torch.load(
    "checkpoints/depth_anything_v2_metric_indoor_vitl.pth", map_location="cpu"))
model.eval().cuda()

image = cv2.imread("frame.jpg")       # BGR uint8 array (placeholder path)
depth_map = model.infer_image(image)  # (H, W) in meters
# depth_map[100, 200] = 3.7 means 3.7 meters from camera

Why metric depth matters

Depth Anything V1 produced relative depth — it knew that object A is farther than object B, but not how far either one actually is. For mapping, you need metric depth: real distances in meters. Depth Anything V2 added a metric depth head trained on synthetic data (Hypersim, Virtual KITTI) that predicts absolute distances.

Depth Anything V2 data flow: Input (518×518×3, resized) → DINOv2 ViT-L encoder (304M params, frozen during metric training) → DPT decoder (multi-scale feature fusion, ~20M params, trained) → metric depth map (518×518 float32, in meters). The indoor model is capped at 20m max depth. Accuracy: mean absolute error ~0.15m on NYU Depth V2 (typical room-scale scenes). Inference: ~50ms on RTX 4090, ~80ms on A10G. Key limitation: depth is predicted per-frame with no cross-frame consistency. Two frames of the same wall might get 2.8m and 3.1m. VGGT's multi-frame geometry is more consistent because it predicts all frames jointly in one forward pass.

Back-projection without K

GAS v1 used the equation $P = d \cdot K^{-1} [u, v, 1]^{T}$ to project pixels into 3D. This requires camera intrinsics K. With VGGT, K is estimated automatically. But even without VGGT, Depth Anything V2 + estimated field of view gives you approximate 3D:

# Approximate back-projection without exact K
import numpy as np

def backproject(depth, fov_deg=60):
    # depth: (H, W) array in meters; fov_deg: assumed horizontal field of view
    H, W = depth.shape
    # Estimate focal length (in pixels) from field of view
    f = W / (2 * np.tan(np.radians(fov_deg / 2)))
    cx, cy = W / 2, H / 2
    # Create pixel coordinate grid
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project to 3D (camera frame: x right, y down, z forward)
    x = (u - cx) * depth / f
    y = (v - cy) * depth / f
    z = depth
    return np.stack([x, y, z], axis=-1)  # (H, W, 3)

In GAS v2, Depth Anything V2 is a fallback. VGGT already provides depth and pointmaps. But when you're running MASt3R-SLAM and want an independent dense depth estimate for every pixel, Depth Anything V2 fills that gap. It's also useful for quick prototyping: run DA-V2 on a single image to get instant 3D without any SLAM setup.
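
How much does the field-of-view guess matter for the approximate back-projection? A quick sketch of the focal-length arithmetic; the 60° and 70° values are arbitrary examples, and `focal_from_fov` is a name made up here:

```python
import numpy as np


def focal_from_fov(width_px, fov_deg):
    """Pinhole focal length in pixels for a given horizontal field of view."""
    return width_px / (2.0 * np.tan(np.radians(fov_deg) / 2.0))


f_assumed = focal_from_fov(640, 60)  # what the code assumes
f_true = focal_from_fov(640, 70)     # what the camera might actually have

# x = (u - cx) * depth / f, so lateral coordinates scale with 1 / f.
# Assuming a narrower FoV than the real one shrinks x and y by this factor:
lateral_scale = f_true / f_assumed
print(f"f_assumed={f_assumed:.1f}px  f_true={f_true:.1f}px  x,y scale={lateral_scale:.3f}")
```

Depth along the optical axis is unaffected; only the lateral extent of the scene stretches or shrinks, which shows up in a floorplan as rooms that are too narrow or too wide.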
What is the key difference between Depth Anything V1 and V2 for semantic mapping?

Chapter 6: 3D Scene Graphs: ConceptGraphs

GAS v1's merging step was simple: project object masks into 3D, check IoU between point clouds, and merge those above a threshold. This is purely geometric — it knows nothing about what the objects are in a semantic sense. Two spatially overlapping point clouds get merged even if one is a chair and the other is a table leg behind it.

ConceptGraphs (Gu et al., 2023) replaces this with a fundamentally different representation: a 3D scene graph where each node is an object with both geometric and semantic properties.

What is a 3D scene graph?

A scene graph is a data structure where:

This is dramatically richer than a point cloud. You can query the scene graph with natural language: "Where is the nearest exit?" "Which room has a refrigerator?" "Is there a fire extinguisher near the kitchen?"

How ConceptGraphs builds the graph

  1. SLAM — get camera poses and depth (VGGT or MASt3R-SLAM in v2)
  2. Open-vocab detection — Grounding DINO detects objects per frame
  3. Feature extraction — CLIP encodes each detected region into a semantic embedding
  4. 3D lifting — back-project detections using depth to get 3D object positions
  5. Graph merging — merge observations of the same object using both geometric proximity AND CLIP feature similarity
The crucial improvement: ConceptGraphs merges objects using CLIP feature cosine similarity in addition to spatial IoU. Two observations are the "same chair" not just because they overlap spatially, but because their CLIP embeddings are similar. This dramatically reduces false merges (combining different objects that happen to be close) and missed merges (failing to combine the same object seen from different angles).
ConceptGraphs data flow and what degrades: Each detected region is cropped and encoded by CLIP ViT-L/14 (428M params) into a 768-dim feature vector. Merging threshold: cosine similarity > 0.7 AND spatial distance < 0.5m. What degrades: (1) Visually similar but distinct objects (two white chairs) get CLIP similarity ~0.9 and merge incorrectly if within 0.5m. (2) Same object from very different angles (chair front vs. back) can get CLIP similarity < 0.7 and fail to merge. (3) CLIP's features are view-dependent — a mug's side and top produce different embeddings. Mitigation: average CLIP features across multiple observations of each node, making the representation more view-invariant over time. Cost: CLIP encoding adds ~20ms per detected object per frame.
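
The feature-averaging mitigation is easy to check numerically. In this toy sketch, 2-D unit vectors stand in for 768-dim CLIP embeddings: two views of the same object that fall below the 0.7 threshold against each other both clear it against their averaged node feature. The helper names are illustrative.

```python
import numpy as np


def unit(v):
    """L2-normalize so that a dot product equals cosine similarity."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v)


def fuse(features):
    """Average unit features and re-normalize (the node's running representation)."""
    return unit(np.sum([unit(f) for f in features], axis=0))


front = unit([1.0, 0.0])  # stand-in for the "chair front" embedding
back = unit([0.6, 0.8])   # stand-in for the "chair back" embedding
node = fuse([front, back])

sim_front_back = float(front @ back)
sim_node_front = float(node @ front)
sim_node_back = float(node @ back)
print(f"front vs back: {sim_front_back:.2f}")
print(f"node vs front: {sim_node_front:.2f}, node vs back: {sim_node_back:.2f}")
```

The averaged feature sits between the views, so a third observation from either side is more likely to match the node than it would be to match any single earlier view.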

Integration with GAS v2

ConceptGraphs slots perfectly into the GAS v2 pipeline:

The output is a 3D scene graph that can be queried in natural language — the ultimate semantic map.

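# ConceptGraphs-style scene graph construction (simplified sketch)
import numpy as np

def unit(v):
    # Normalize so a dot product equals cosine similarity
    v = np.asarray(v, dtype=np.float32)
    return v / (np.linalg.norm(v) + 1e-8)

class ObjectNode:
    def __init__(self, position_3d, clip_feature, label):
        self.centroid = np.asarray(position_3d, dtype=np.float32)
        self.clip_feature = unit(clip_feature)
        self.label = label
        self.count = 1

    def merge(self, position_3d, clip_feature):
        # Running average of position and feature over all observations
        n = self.count
        self.centroid = (self.centroid * n + np.asarray(position_3d)) / (n + 1)
        self.clip_feature = unit(self.clip_feature * n + unit(clip_feature))
        self.count = n + 1

class SceneGraph:
    def __init__(self, merge_threshold=0.7, max_dist=0.5):
        self.nodes = []  # list of ObjectNode
        self.merge_threshold = merge_threshold
        self.max_dist = max_dist

    def add_observation(self, position_3d, clip_feature, label):
        feature = unit(clip_feature)
        # Check against existing nodes
        for node in self.nodes:
            spatial_dist = np.linalg.norm(node.centroid - position_3d)
            semantic_sim = float(np.dot(node.clip_feature, feature))
            if spatial_dist < self.max_dist and semantic_sim > self.merge_threshold:
                node.merge(position_3d, feature)
                return
        # New object
        self.nodes.append(ObjectNode(position_3d, feature, label))

    def query(self, text_feature):
        # text_feature: CLIP text embedding of the query, computed by the caller
        # with the same CLIP model that produced the node features
        text_feat = unit(text_feature)
        scores = [np.dot(n.clip_feature, text_feat) for n in self.nodes]
        return self.nodes[int(np.argmax(scores))]

How does ConceptGraphs improve on GAS v1's IoU-based merging?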

Chapter 7: 3D Gaussian Splatting for Scene Representation

GAS v1 represented scenes as meshes (from GO-SLAM) and point clouds (from mask back-projection). These are the traditional representations for 3D mapping. But in 2024-2025, a new representation swept the field: 3D Gaussian Splatting (3DGS).

What are 3D Gaussians?

Instead of triangular meshes or discrete points, 3DGS represents a scene as millions of tiny 3D Gaussian blobs. Each Gaussian has:

Rendering is done by "splatting" these Gaussians onto the image plane — no ray marching, no neural network evaluation per ray. This makes rendering extremely fast: 100+ FPS at high resolution.
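
As a concrete picture of the per-Gaussian parameters, the standard 3DGS parameterization stores a mean, a scale and rotation (which together define the covariance), an opacity, and color. A minimal sketch; the `feature` slot is an addition for semantic vectors, and a single RGB color stands in for the spherical-harmonic coefficients a full implementation would use:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Gaussian3D:
    mean: np.ndarray      # (3,) center in world coordinates
    scale: np.ndarray     # (3,) per-axis standard deviations
    rotation: np.ndarray  # (4,) quaternion (w, x, y, z)
    opacity: float        # in [0, 1]
    color: np.ndarray     # (3,) RGB (simplification of spherical harmonics)
    feature: np.ndarray = field(default_factory=lambda: np.zeros(0, dtype=np.float32))

    def covariance(self) -> np.ndarray:
        """Sigma = R S S^T R^T, with R from the quaternion and S = diag(scale)."""
        w, x, y, z = self.rotation / np.linalg.norm(self.rotation)
        R = np.array([
            [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
            [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
            [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T


g = Gaussian3D(
    mean=np.zeros(3),
    scale=np.array([1.0, 2.0, 3.0]),
    rotation=np.array([1.0, 0.0, 0.0, 0.0]),  # identity rotation
    opacity=0.8,
    color=np.array([0.5, 0.5, 0.5]),
)
print(np.round(g.covariance(), 3))  # axis-aligned: diag(1, 4, 9)
```

Storing scale and rotation separately, rather than a raw 3×3 matrix, keeps the covariance symmetric and positive semi-definite during optimization.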

Why Gaussians for semantic mapping?

The key innovation for our purposes is feature splatting. Each Gaussian can carry not just color but arbitrary feature vectors — including CLIP embeddings, semantic labels, or instance IDs. You can render the scene from any viewpoint and get:

Gaussians vs ConceptGraphs: ConceptGraphs builds a discrete scene graph (one node per object). 3DGS builds a continuous scene representation (millions of Gaussians, each carrying semantic features). ConceptGraphs is better for high-level reasoning ("which room has a fridge?"). 3DGS is better for photorealistic rendering and novel view synthesis. In a production GAS v2 pipeline, you might use both: 3DGS as the scene representation and ConceptGraphs as the queryable index.
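
Feature splatting reuses the front-to-back alpha blending that 3DGS applies to color, with a feature vector in place of RGB. A per-pixel sketch, assuming the Gaussians overlapping the pixel are already depth-sorted and their projected opacities are known; the function name is illustrative:

```python
import numpy as np


def composite_features(features, alphas):
    """Blend per-Gaussian features along one pixel's ray, front to back.

    features: (K, D) feature per Gaussian; alphas: (K,) projected opacity per Gaussian.
    Returns the blended (D,) feature and the (K,) blending weights.
    """
    features = np.asarray(features, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    # Transmittance: fraction of light not yet absorbed by Gaussians in front
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * transmittance
    return weights @ features, weights


# Two half-transparent Gaussians carrying one-hot "chair" and "table" features
blended, weights = composite_features([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
print(weights)  # [0.5  0.25]
print(blended)  # [0.5  0.25]
```

The nearer Gaussian dominates, so the rendered feature at a pixel mostly reflects the closest visible surface, exactly as rendered color does.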

Integration with GAS v2

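# Semantic Gaussian Splatting pipeline (pseudocode: helper functions are placeholders)
# 1. Initialize from VGGT pointmaps (one forward pass, reused below)
preds = vggt_model(images)
pointmaps = preds["world_points"]  # (N, H, W, 3)
poses, intrinsics = pose_encoding_to_extri_intri(
    preds["pose_enc"], images.shape[-2:])  # poses: (N, 3, 4)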

# 2. Initialize Gaussians from VGGT points
gaussians = initialize_gaussians(pointmaps, images)

# 3. Attach semantic features (from Grounding DINO + CLIP)
for i, frame in enumerate(frames):
    detections = grounding_dino.predict(frame, "door . chair . table")
    clip_features = clip.encode_regions(frame, detections.boxes)
    assign_features(gaussians, clip_features, poses[i])

# 4. Optimize with photometric + semantic loss
optimize_gaussians(gaussians, images, poses, semantic_labels)

# 5. Query: render semantic map from any viewpoint
semantic_img = render(gaussians, query_pose, mode="semantic")
When to use 3DGS in GAS v2: Use it when the map itself is the product — virtual tours, digital twins, AR overlays. If you only need a 2D floorplan (the original GAS v1 goal), ConceptGraphs or even simple point cloud projection is sufficient and much simpler to set up.
What does "feature splatting" enable for semantic mapping?

Chapter 8: The Complete GAS v2 Pipeline

Let's put it all together. Here's the full GAS v2 data flow, from raw video to semantic map.

The data flow

  1. Video → Frames — extract frames at 2-5 FPS (not every frame — too redundant)
  2. Frames → VGGT — single forward pass produces camera poses, pointmaps, depth maps, and estimated intrinsics
  3. Frames → Grounding DINO — open-vocabulary detection with text prompt describing target objects
  4. Boxes → SAM 2 — generate pixel-precise masks and propagate with video tracking for consistent IDs
  5. Masks + Depth → 3D Lifting — back-project masked regions using VGGT depth (or Depth Anything V2 fallback)
  6. 3D Objects + CLIP → ConceptGraphs — build scene graph with semantic merging
  7. Scene Graph → Output — render as 2D floorplan, 3D scene graph, or Gaussian splat scene
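
For the 2D floorplan output in the last step, the simplest rendering is a top-down projection: drop the height axis and bin the remaining coordinates into a grid. A minimal sketch, assuming the points are already in a gravity-aligned world frame; which axis is "up" and the cell size are assumptions you would set for your own data:

```python
import numpy as np


def floorplan_grid(points, cell=0.05, up_axis=1):
    """Count 3D points per ground-plane cell.

    points: (K, 3) array in meters. Returns (grid, origin), where grid[i, j] is the
    number of points in that cell and origin is the ground-plane minimum corner.
    """
    points = np.asarray(points, dtype=np.float64)
    ground = np.delete(points, up_axis, axis=1)  # keep the two ground-plane axes
    origin = ground.min(axis=0)
    idx = np.floor((ground - origin) / cell).astype(int)
    grid = np.zeros(idx.max(axis=0) + 1, dtype=np.int32)
    np.add.at(grid, (idx[:, 0], idx[:, 1]), 1)   # accumulate, duplicates included
    return grid, origin


pts = np.array([[0.0, 0.0, 0.0], [0.6, 5.0, 0.0], [0.6, 9.0, 1.2]])
grid, origin = floorplan_grid(pts, cell=0.5)
print(grid.shape, int(grid.sum()))  # (2, 3) 3
```

Running this once per object (with that object's aggregated points) gives one occupancy layer per label, which can then be colored and overlaid.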
Full pipeline inference budget (55-second indoor video, 32 keyframes, A10G GPU):
• Frame extraction: negligible (~1s)
• VGGT geometry: ~2.1s (32 frames in one batch)
• Grounding DINO detection: 32 × 125ms = ~4s
• SAM 2 segmentation + tracking: ~4.8s (150ms/frame)
• CLIP feature extraction: ~2s (32 frames × ~5 objects × 12ms)
• ConceptGraphs merging: ~0.5s (CPU-bound, geometric + feature matching)
Total: ~14.5 seconds for a full room semantic map from 55 seconds of video
Cost on Modal (A10G): < $0.10 per video
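
The budget is simple enough to recompute, which helps when you change the keyframe count or swap a stage. A sketch using the per-stage figures listed above; the unrounded sum comes out slightly under the ~14.5 s headline:

```python
# Per-stage times in seconds, taken from the budget above (32 keyframes)
KEYFRAMES = 32
stages = {
    "frame_extraction": 1.0,
    "vggt_geometry": 2.1,
    "grounding_dino": KEYFRAMES * 0.125,
    "sam2_tracking": KEYFRAMES * 0.150,
    "clip_features": KEYFRAMES * 5 * 0.012,  # ~5 objects per frame
    "conceptgraphs_merge": 0.5,
}

total = sum(stages.values())
for name, seconds in sorted(stages.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {seconds:5.2f}s  {100 * seconds / total:4.1f}%")
print(f"{'total':22s} {total:5.2f}s")
```

Detection and tracking together account for well over half of the total, so those are the stages to batch or subsample first.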
Frozen vs trained: Every single component is used with pretrained weights, zero fine-tuning. VGGT (1B params, pretrained on MegaDepth + ScanNet++), Grounding DINO (200M, pretrained on O365 + GoldG + Cap4M), SAM 2 (224M, pretrained on SA-V), Depth Anything V2 (324M, pretrained on synthetic + real depth), CLIP (428M, pretrained on LAION-2B). Total model parameters loaded: ~2.2B. All frozen at inference.
What changed from v1:
• Zero training data: no dataset assembly, no fine-tuning, no COCO conversion
• Zero calibration: no camera intrinsics needed
• Open vocabulary: detect any object class at inference time
• Video-native tracking: SAM 2 memory bank replaces IoU matching
• Language-grounded merging: CLIP features replace pure geometry
• Queryable output: "where is the nearest exit?" on the scene graph

Input/output summary

Input: RGB video from any camera (phone, GoPro, webcam, robot). No depth sensor. No calibration pattern. No training data.

Output (choose one or more):

How many datasets and training runs are needed to build a GAS v2 pipeline?

Chapter 9: Project Setup: From Zero to Running

This chapter is a practical guide. By the end, you'll have a working GAS v2 pipeline on your machine.

Build notes from the trenches — read before you start: We actually built this pipeline end-to-end on Modal, and hit seven distinct failure modes before it worked. The walkthrough lives in the Mirdan build log: Building GASv2 on Modal. The most important lessons, summarized:
• Use the transformers port of Grounding DINO, not the IDEA-Research repo. The official repo compiles a CUDA op at install time that wants nvcc; the HF port is pure PyTorch. Inference-only installs stay on debian_slim with PyTorch cu124 wheels, no CUDA toolkit needed.
• VGGT's actual output dict has keys pose_enc, depth, world_points, world_points_conf (not extrinsic/intrinsic). Use pose_encoding_to_extri_intri() for matrices. Better: world_points already gives you per-pixel 3D points; skip the unprojection math.
• VGGT ViT requires H/W divisible by 14 (patch size). Resize at decode.
• Pin torch at the END of your image with --force-reinstall. SAM 2's install will otherwise upgrade torch to a version bundling CUDA 13 wheels, breaking nvrtc at runtime.
• Wrap SAM 2 calls in torch.autocast("cuda", dtype=torch.bfloat16): the image encoder and mask decoder have different dtype conventions.
• Post-hoc track merging is required. SAM 2 seeds fresh IDs per keyframe; the same chair seen twice becomes two "chair" objects unless you merge by label + proximity (or CLIP features).
• Use 10th/90th percentile AABBs, not min/max. VGGT's per-pixel world points have heavy tails; min/max makes every bounding box span the room.
• For oriented boxes, add Apple's Cubify Transformer (CuTR) as a sibling per-keyframe detector and fuse across keyframes with greedy 3D-IoU. Keep a PCA-OBB fit on every SAM 2 track's merged points as an always-on fallback: when CuTR's coverage drops below ~30% on a scene (killswitch threshold), the pipeline still produces clean geometry. Empirically important for dim rooms, where CuTR's iPad-LiDAR training distribution stops applying.
• Gate CuTR detections against VGGT depth before fusion. Project each candidate OBB's center into the keyframe and compare its z to the median VGGT depth in a small patch around that pixel. Reject if they disagree by more than 30%. This catches most of CuTR's "wall at 3 m, object at 8 m" hallucinations without touching its geometry for the good cases.
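
Two of the notes above, percentile AABBs and the depth-sanity gate, are a few lines of NumPy each. A sketch under stated assumptions: the 5-pixel patch size is a guess, and "disagree by more than 30%" is read here as relative to the median reference depth. Neither helper is taken from the actual build.

```python
import numpy as np


def robust_aabb(points, lo=10.0, hi=90.0):
    """Axis-aligned box from per-axis percentiles instead of min/max."""
    pts = np.asarray(points, dtype=np.float64).reshape(-1, 3)
    return np.percentile(pts, lo, axis=0), np.percentile(pts, hi, axis=0)


def passes_depth_gate(candidate_z, depth_map, u, v, patch=5, tol=0.30):
    """Accept a candidate whose depth agrees with the median depth around pixel (u, v)."""
    h, w = depth_map.shape
    r = patch // 2
    window = depth_map[max(0, v - r):min(h, v + r + 1), max(0, u - r):min(w, u + r + 1)]
    valid = window[np.isfinite(window) & (window > 0)]
    if valid.size == 0:
        return False  # no trustworthy reference depth: reject
    ref = float(np.median(valid))
    return abs(candidate_z - ref) / ref <= tol


# Heavy-tailed points: a tight cluster plus two far outliers
rng = np.random.default_rng(0)
pts = np.vstack([rng.uniform(0, 1, (98, 3)), [[50, 50, 50], [-50, -50, -50]]])
box_lo, box_hi = robust_aabb(pts)
print("min/max span:", pts.max() - pts.min(), " robust span:", (box_hi - box_lo).max())

depth = np.full((48, 64), 3.0)
print(passes_depth_gate(3.2, depth, u=32, v=24))  # True: within 30% of 3 m
print(passes_depth_gate(8.0, depth, u=32, v=24))  # False: "wall at 3 m, object at 8 m"
```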
Comparison callout (day 2): Boxer vs CuTR on phone footage. The stack above (VGGT for geometry, Grounding DINO for labels, SAM 2 for tracks, and CuTR as a per-keyframe 3D-OBB sibling) worked on our calibration clip and started degrading on fresh rooms. We ran a parallel experiment with Meta's facebook/boxer checkpoint in an isolated Modal app and got cleaner OBBs on the same video. The trade-offs below are what changed our mind:

CuTR (Apple, CC-BY-NC-ND weights). Single-image 3D-OBB transformer, class-agnostic, trained on CA-1M (iPad LiDAR walkthroughs). Pairs with Grounding DINO labels via SAM 2 track identity. Strong on bright iPad-like captures; on dim or fast-motion phone video, the depth-sanity gate kills most candidates and the pipeline falls back to PCA-OBB — which is why we built the killswitch.
Boxer (Meta, CC-BY-NC weights). OWLv2 for text-grounded 2D, DINOv3 + camera/depth cross-attention for 3D-OBB regression, Hungarian 3D-IoU fusion. One model, one API, joint-trained across indoor datasets. At 512 px on an A10G: 0.42 s/frame, < 1 GB peak, drop-in replacement for the CuTR+GDINO+SAM2 trio we had been composing by hand. Boxer still misses small objects (chairs, plants, mirrors) at 512 px — recoverable at 960 px on an A100.
Geometry handoff. CuTR only needs RGB, so it lived inside our VGGT pipeline. Boxer wants extrinsics and depth, so it sits downstream of a geometry provider. We paired it with the Lingbot-Map GCT checkpoint in a sibling Modal app; the orchestrator stitches them with two .remote() calls.
When to pick which. If your footage is bright and static, CuTR + SAM 2 labels gives you finer control (track-level rather than detection-level). If your footage is arbitrary phone video and you want one pipeline that mostly works, Boxer is the less-surprising default. The Run 12 update in the Mirdan build log shows both side-by-side with the demo video.

Both routes remain selectable in the dashboard we wrote for this project. Neither is “correct” — they fail on different scenes in different ways.
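
The PCA-OBB fallback mentioned in the build notes can be sketched in a few lines: take the principal axes of a track's merged points as the box orientation and measure the extent along each axis. This is a generic PCA fit, not the project's code, and it assumes outliers were already trimmed (for example with percentile clipping) before the fit.

```python
import numpy as np


def pca_obb(points):
    """Fit an oriented bounding box to (K, 3) points via PCA.

    Returns (center, axes, extent): axes is a 3x3 rotation whose columns are the box
    axes (largest variance first) and extent holds the full side lengths.
    """
    pts = np.asarray(points, dtype=np.float64)
    mean = pts.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov((pts - mean).T))  # ascending eigenvalues
    axes = eigvecs[:, ::-1].copy()                        # largest variance first
    if np.linalg.det(axes) < 0:
        axes[:, -1] *= -1                                 # keep a right-handed frame
    local = (pts - mean) @ axes                           # coordinates in the box frame
    mins, maxs = local.min(axis=0), local.max(axis=0)
    center = mean + axes @ ((mins + maxs) / 2)
    return center, axes, maxs - mins


# Synthetic check: a 4 x 2 x 1 m box rotated 30 degrees about the vertical axis
rng = np.random.default_rng(0)
box = rng.uniform(-0.5, 0.5, size=(5000, 3)) * np.array([4.0, 2.0, 1.0])
c, s = np.cos(np.radians(30)), np.sin(np.radians(30))
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
pts = box @ Rz.T + np.array([1.0, 2.0, 3.0])

center, axes, extent = pca_obb(pts)
print(np.round(center, 2), np.round(extent, 2))
```

PCA becomes ambiguous when two extents are nearly equal (a square table top), which is one reason to treat it as a fallback rather than the primary box estimator.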

Hardware requirements

Step 1: Environment

# Create conda environment
conda create -n gas-v2 python=3.11 -y
conda activate gas-v2

# PyTorch with CUDA (same versions and index as the final re-pin in Step 2)
pip install torch==2.5.1 torchvision==0.20.1 --index-url \
    https://download.pytorch.org/whl/cu124

# Core dependencies
pip install numpy scipy opencv-python pillow tqdm
pip install supervision  # Roboflow's annotation/viz toolkit

Step 2: Install models

What actually worked for us (April 2026): the PyPI packages named in the original recipe (vggt, groundingdino-py, segment-anything-2) are either stale or non-functional. Install from the source repos instead, and use the HuggingFace transformers port of Grounding DINO to avoid the CUDA-toolchain requirement entirely.
# VGGT (3D geometry) — from source
pip install git+https://github.com/facebookresearch/vggt.git

# Grounding DINO — use transformers' port (pure PyTorch, no nvcc)
pip install "transformers>=4.44"
# At load time: AutoModelForZeroShotObjectDetection.from_pretrained(
#   "IDEA-Research/grounding-dino-base")

# SAM 2 (segmentation + tracking) — from source
pip install git+https://github.com/facebookresearch/sam2.git

# Rerun (for visualization)
pip install "rerun-sdk>=0.20"

# Open3D (for PLY + Poisson mesh export)
pip install "open3d>=0.18"

# CLIP (semantic features for ConceptGraphs — if you extend dedup)
pip install open-clip-torch

# IMPORTANT: re-pin torch LAST, after the above installs.
# SAM 2's deps will otherwise drag torch to a CUDA-13 wheel that breaks nvrtc.
pip install --upgrade --force-reinstall \
    torch==2.5.1 torchvision==0.20.1 \
    --index-url https://download.pytorch.org/whl/cu124

Step 3: Download weights

# Create weights directory
mkdir -p weights && cd weights

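# VGGT: no manual download (fetched from HuggingFace on first use)

# Grounding DINO: only needed if you run the original IDEA-Research repo.
# The transformers port from Step 2 fetches its own weights on first use.
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth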

# SAM 2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt

# Depth Anything V2 (optional)
wget https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-Hypersim-Large/resolve/main/depth_anything_v2_metric_hypersim_vitl.pth

Step 4: Capture data

Step 5: Code structure

# Recommended project layout
gas-v2/
  main.py          # Orchestrator: load video, run pipeline
  slam.py          # VGGT or MASt3R-SLAM wrapper
  detect.py        # Grounding DINO + SAM 2 wrapper
  segment.py       # Mask processing, tracking
  merge.py         # ConceptGraphs scene graph builder
  visualize.py     # 2D floorplan + 3D rendering
  config.yaml      # Model paths, thresholds, prompt
  weights/         # Downloaded model weights
  data/            # Input videos
  output/          # Results
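
As a starting point for detect.py, here is a minimal sketch of the Grounding DINO half using the transformers port from Step 2. The function names build_prompt and detect_boxes are hypothetical, thresholds are left at the library defaults, and the SAM 2 masking step is omitted. Treat it as an illustration of the call sequence rather than the project's actual wrapper.

```python
from typing import Sequence

MODEL_ID = "IDEA-Research/grounding-dino-base"  # checkpoint named in Step 2


def build_prompt(classes: Sequence[str]) -> str:
    """Join class names into a lowercase, period-separated text prompt."""
    cleaned = [c.strip().lower() for c in classes if c.strip()]
    if not cleaned:
        raise ValueError("at least one class name is required")
    return " . ".join(cleaned) + " ."


def detect_boxes(image, classes, device="cuda"):
    """Run zero-shot detection on one PIL image and return the result dict."""
    # Heavy imports are deferred so build_prompt works without torch installed.
    import torch
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    # A real wrapper would load these once and reuse them across frames.
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(MODEL_ID).to(device)

    inputs = processor(
        images=image, text=build_prompt(classes), return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        outputs = model(**inputs)

    # target_sizes is (height, width); PIL's image.size is (width, height).
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
    )
    return results[0]  # dict with "scores", "labels", "boxes"
```

The boxes returned here are what a SAM 2 step would take as box prompts to produce masks.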

Step 6: Run the pipeline

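# main.py - simplified orchestrator
import cv2

from slam import run_vggt
from detect import detect_and_segment
from merge import build_scene_graph
from visualize import render_floorplan


def extract_frames(video_path, fps=3):
    """Sample RGB frames from a video at roughly `fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise FileNotFoundError(f"Cannot open video: {video_path}")
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, round(native_fps / fps))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV reads BGR
        index += 1
    cap.release()
    return frames


# 1. Extract frames
frames = extract_frames("data/apartment.mp4", fps=3)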

# 2. 3D geometry
geometry = run_vggt(frames)  # poses, depth, pointmaps

# 3. Detect + segment
objects = detect_and_segment(
    frames, prompt="door . chair . table . sofa . bed . window"
)

# 4. Build scene graph
scene_graph = build_scene_graph(objects, geometry)

# 5. Render output
render_floorplan(scene_graph, output="output/floorplan.png")
print(f"Found {len(scene_graph.nodes)} objects")

Total setup time: ~30 minutes from scratch (environment + model downloads). Compare to GAS v1: ~2 weeks (dataset assembly + format conversion + training + debugging). This is the foundation model advantage.

Want the shortest path to a running pipeline? We published a complete Modal-based implementation with every workaround baked in: Mirdan · Building GASv2 on Modal. The article walks through each crash, explains the underlying cause, and shows the diff that fixed it. If you skim one Mirdan entry before you start, skim that one.

What is the minimum GPU requirement for running GAS v2?

Chapter 10: Evaluation & Benchmarking

How do you know if your GAS v2 pipeline actually works? You need benchmarks with ground truth, metrics to measure quality, and baselines to compare against.

Datasets

Metrics

Each stage of the pipeline has its own evaluation metric:

GAS v1 vs v2: expected comparison

Where v2 wins: Open vocabulary (unlimited classes vs 10 fixed), setup in about 30 minutes instead of about two weeks, no calibration needed, better temporal consistency (SAM 2 tracking), language-queryable output.

Where v1 might win: On the specific 10 classes it was fine-tuned for, Faster R-CNN may have slightly higher mAP than zero-shot Grounding DINO. This is the classic specialist-vs-generalist tradeoff. But v2's generalist capability is far more useful in practice — real environments have hundreds of object types, not 10.

Running evaluation

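# Evaluate on the Replica dataset.
# Assumes a project-local eval.py (the module name shadows Python's builtin eval)
# and an assembled pipeline object named gas_v2; neither is defined in this snippet.
from eval import evaluate_pipeline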

results = evaluate_pipeline(
    pipeline=gas_v2,
    dataset="replica",
    scenes=["office_0", "room_0", "apartment_0"],
    metrics=["ate", "mAP", "mIoU", "completeness"]
)

for scene, m in results.items():
    print(f"{scene}: ATE={m['ate']:.3f}m, mAP={m['mAP']:.1f}, mIoU={m['mIoU']:.1f}%")
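
The internals of evaluate_pipeline are not shown here, so as a rough illustration of what two of the requested metrics measure, the sketch below computes a trajectory error and a mean IoU with NumPy. Both function names are hypothetical. The ATE version assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame, whereas standard ATE evaluation first aligns the two trajectories.

```python
import numpy as np


def ate_rmse(est_xyz, gt_xyz):
    """RMSE of per-frame position error, in the units of the inputs.

    Assumes both trajectories are already aligned and time-synchronized.
    """
    est = np.asarray(est_xyz, dtype=float)
    gt = np.asarray(gt_xyz, dtype=float)
    if est.shape != gt.shape:
        raise ValueError("trajectories must have the same shape")
    squared_error = np.sum((est - gt) ** 2, axis=1)
    return float(np.sqrt(np.mean(squared_error)))


def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean intersection-over-union across classes present in pred or gt."""
    pred = np.asarray(pred_labels).ravel()
    gt = np.asarray(gt_labels).ravel()
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both; skip rather than count as 0 or 1
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Toy check: the second pose is off by 2 m along z.
print(ate_rmse([[0, 0, 0], [1, 0, 0]], [[0, 0, 0], [1, 0, 2]]))
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Skipping classes that appear in neither prediction nor ground truth is one common convention; a benchmark may define it differently, so check the dataset's own protocol before comparing numbers.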

In what specific scenario might GAS v1's fine-tuned detector outperform GAS v2's zero-shot detector?

Chapter 11: What's Next: Embodied Foundation Models

GAS v2 builds a semantic map. But a map is only useful if something acts on it. The next frontier is closing the loop: explore, map, reason, and act.

OK-Robot: pick-and-place from semantic maps

OK-Robot (Liu et al., 2024) demonstrates the end-to-end vision: given a natural language command ("bring me the coffee mug"), a home robot uses an open-vocabulary semantic map to find the mug, plans a path to it, and picks it up. The semantic map is essentially a GAS v2-style scene graph.
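
The "find the mug" step can be pictured as a nearest-neighbour search over per-object language features. The sketch below assumes every object in the map carries a feature vector in the same embedding space as the text query (for example CLIP features). Hand-written toy vectors stand in for real embeddings, and query_map is a hypothetical name. It is not OK-Robot's actual code.

```python
import numpy as np


def query_map(object_features, object_names, query_feature):
    """Return the object whose feature has the highest cosine similarity to the query."""
    feats = np.asarray(object_features, dtype=float)
    query = np.asarray(query_feature, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = feats @ query  # cosine similarity per object
    best = int(np.argmax(scores))
    return object_names[best], float(scores[best])


# Toy 3-D "embeddings"; real ones would come from a text/image encoder.
names = ["coffee mug", "sofa", "door"]
features = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.0, 0.9]]
name, score = query_map(features, names, [1.0, 0.0, 0.0])
print(name, round(score, 3))
```

Once the best-matching object is known, its 3D position in the map becomes the navigation and grasp target.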

SayPlan: LLM planning on 3D scene graphs

SayPlan (Rana et al., 2023) feeds a 3D scene graph (like ConceptGraphs output) directly to an LLM for task planning. The LLM reasons about spatial relationships: "To get to the kitchen, go through the hallway, turn left at the bathroom." The scene graph provides the grounding that prevents the LLM from hallucinating spatial facts.
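
One way to picture "feeding a scene graph to an LLM" is to serialize the graph as JSON and embed it in the prompt. The sketch below is a minimal illustration under that assumption. The Node and SceneGraph classes, their field names, and the prompt wording are all hypothetical, and this is not SayPlan's actual graph format or prompting scheme.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Node:
    id: str
    label: str
    room: str
    position: tuple  # (x, y, z) in metres, map frame


@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)


def to_prompt(graph, task):
    """Serialize the graph to JSON and wrap it in a planning prompt."""
    payload = {
        "nodes": [asdict(n) for n in graph.nodes],
        "edges": [
            {"source": s, "relation": r, "target": t} for s, r, t in graph.edges
        ],
    }
    return (
        "You are planning for a robot. Use only the objects and relations "
        "in this scene graph.\n"
        f"SCENE GRAPH:\n{json.dumps(payload, indent=2)}\n"
        f"TASK: {task}\n"
        "Answer with a numbered list of steps."
    )


graph = SceneGraph(
    nodes=[
        Node("n1", "coffee mug", "kitchen", (2.1, 0.4, 0.9)),
        Node("n2", "table", "kitchen", (2.0, 0.5, 0.0)),
    ],
    edges=[("n1", "on", "n2")],
)
prompt = to_prompt(graph, "bring me the coffee mug")
print(prompt)
```

Because the prompt contains only objects and relations present in the map, the planner's answer can be checked against the same graph before anything is executed.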

The convergence

We're watching three fields merge into one:

The pipeline is becoming: perceive (GAS v2) → reason (LLM + scene graph) → act (VLA). Each piece is a foundation model. Each piece is zero-shot. The system works in any environment without environment-specific training.

The spatial intelligence stack of 2025:
Perception: DINOv2 + VGGT + Grounding DINO + SAM 2 = open-world 3D understanding
Representation: ConceptGraphs + 3DGS = queryable, renderable scene models
Reasoning: GPT-4V / Gemini / Claude on scene graphs = spatial planning
Action: pi-0 / RT-2 / Octo = language-conditioned robot control
GAS v2 is the perception layer. The rest is plugging in.

What is the complete "spatial intelligence stack" that GAS v2 enables?