Rebuilding the semantic mapping pipeline with 2025 foundation models — zero fine-tuning, open vocabulary, no camera calibration.
In 2024, Kuznetsov, Bhutra, and Pal built GAS — a pipeline that turns monocular video into semantic floorplans. The pipeline chained four components: GO-SLAM for 3D reconstruction, Faster R-CNN for object detection, SAM for segmentation masks, and a geometric projection step to map everything into 2D.
It worked. On the Replica dataset, GAS produced readable semantic floorplans from nothing but RGB video. But the engineering was painful.
GAS v1 ran three completely separate neural networks on every video frame. GO-SLAM used its own CNN encoder. Faster R-CNN used a ResNet-50 backbone with FPN. SAM used a ViT-H image encoder. Each model independently extracted features from the same image, tripling the compute.
In 2025, a single model family dominates visual feature extraction: DINOv2. Meta's self-supervised ViT, trained on 142M curated images, produces features so general that they serve as the backbone for nearly every downstream vision task.
Look at what uses DINOv2 internally:
This is the shared backbone revolution. Instead of running three unrelated networks, GAS v2 can extract DINOv2 features once and route them to multiple task heads. One forward pass through a ViT-G produces features rich enough for geometry estimation, object detection, segmentation, and depth prediction simultaneously.
DINOv2 ViT-G processes an image into a grid of 1536-dimensional patch tokens (with 14x14-pixel patches, a 224x224 image yields 16x16 = 256 tokens; higher resolutions yield more). These tokens encode both local appearance (texture, edges, color) and global semantics (object identity, scene category). Task heads are lightweight networks that read these tokens and produce task-specific outputs.
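As a quick check on the token arithmetic, here is a minimal sketch that computes the patch grid for a few input resolutions. It assumes a patch size of 14 pixels (the DINOv2 default) and input sizes that are exact multiples of it; `patch_grid` is an illustrative helper, not part of any library.

```python
def patch_grid(height: int, width: int, patch_size: int = 14) -> tuple[int, int, int]:
    """Return (rows, cols, num_tokens) for a ViT that tiles the image into square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


for h, w in [(224, 224), (448, 448), (518, 518)]:
    rows, cols, n = patch_grid(h, w)
    print(f"{h}x{w}: {rows}x{cols} grid, {n} patch tokens")
```

Token count grows with the square of the side length, so doubling the resolution quadruples the number of tokens every task head has to read.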
GO-SLAM was GAS v1's geometry backbone. It worked, but it required camera intrinsics (the K matrix: focal length, principal point), sometimes stereo pairs, and ran a complex neural SLAM loop. In 2025, two alternatives have made calibrated cameras optional.
VGGT (Visual Geometry Grounded Transformer) won CVPR 2025 Best Paper. Feed it a set of unposed images — no camera intrinsics, no ordering, no calibration — and it outputs in a single forward pass:
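# VGGT: single forward pass, all geometry
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images
from vggt.utils.pose_enc import pose_encoding_to_extri_intri

model = VGGT.from_pretrained("facebook/VGGT-1B").eval().cuda()

# image_paths: list of N unposed image files, in any order
images = load_and_preprocess_images(image_paths).cuda()  # (N, 3, H, W)
with torch.no_grad():
    predictions = model(images)

depth = predictions["depth"]          # per-pixel depth for each view
points = predictions["world_points"]  # per-pixel 3D points in a shared frame
# Cameras come back as a compact pose encoding; decode it into matrices
extrinsics, intrinsics = pose_encoding_to_extri_intri(
    predictions["pose_enc"], images.shape[-2:]
)  # extrinsics: (B, N, 3, 4) camera-from-world; intrinsics: (B, N, 3, 3) estimated K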
No calibration. No sequential processing. No SLAM loop. Just images in, geometry out.
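For a floorplan you usually want camera positions in world coordinates rather than raw extrinsic matrices. The sketch below assumes the extrinsics are world-to-camera matrices [R | t] in the OpenCV convention (worth confirming for the VGGT version you install); `camera_center` is an illustrative helper. It uses C = -Rᵀt, which follows from setting x_cam = R·x_world + t to zero.

```python
def camera_center(extrinsic):
    """Camera position in world coordinates from a 3x4 world-to-camera matrix [R | t]."""
    R = [row[:3] for row in extrinsic]
    t = [row[3] for row in extrinsic]
    # C = -R^T t, written out element by element
    return [-sum(R[k][i] * t[k] for k in range(3)) for i in range(3)]


# A camera rotated 90 degrees about the vertical axis, sitting at (2, 0, 1)
extrinsic = [
    [0, 0, -1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, -2],
]
print(camera_center(extrinsic))
```

If the matrices turn out to be camera-to-world instead, the position is simply the last column and no inversion is needed.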
MASt3R-SLAM is real-time dense SLAM built on top of MASt3R's learned stereo matching. It processes frames sequentially as they arrive, maintains a running map, and handles loop closures. Unlike VGGT (which processes a batch at once), MASt3R-SLAM is designed for live operation.
# MASt3R-SLAM: real-time streaming
# (sketch of a streaming wrapper; adapt the class and method names to the release you install)
from mast3r_slam import MASt3RSLAM

slam = MASt3RSLAM(config="configs/base.yaml")

for frame in video_stream:
    result = slam.process_frame(frame)
    pose = result.pose        # current camera pose
    pointmap = result.points  # current dense 3D map
    if result.loop_closure:
        print("Loop closure detected!")
Both approaches eliminate GAS v1's most fragile dependency: the camera intrinsics matrix K. GO-SLAM would produce distorted geometry if K was wrong by even a few pixels. VGGT estimates K. MASt3R-SLAM uses learned priors that are robust to approximate intrinsics. This single change eliminates hours of calibration debugging.
GAS v1's detection story was the most painful part of the entire project. To detect 10 object classes (door, chair, table, toilet, sink, sofa, bed, TV, refrigerator, exit sign), the authors had to:
All of this to detect 10 classes. Want to add "bookshelf"? Start over: find training data, re-annotate, re-train.
Grounding DINO detects any object described in natural language. No training data. No fine-tuning. No COCO format conversion. You just describe what you want to find.
# Grounding DINO: detect anything from text
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "groundingdino/config/GroundingDINO_SwinB_cfg.py",
    "weights/groundingdino_swinb_cogcoor.pth",
)

# load_image returns the raw RGB array plus the normalized tensor that predict() expects
image_source, image = load_image("frame.jpg")  # placeholder path

# Detect ANY objects via text prompt
text_prompt = "door . chair . table . exit sign . bookshelf . window"
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=text_prompt,
    box_threshold=0.3,
    text_threshold=0.25,
)
# boxes: (N, 4) in [cx, cy, w, h], normalized to [0, 1]
# phrases: ['door', 'chair', 'table', ...] matched labels
Grounding DINO fuses a vision transformer (Swin-B or Swin-L) with a text encoder (BERT). It uses cross-modality attention to match text tokens to image regions:
The dot-separated prompt format ("door . chair . table") tells the model to detect each concept independently. Each detected box comes with the matching text phrase and a confidence score.
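Because the boxes come back as normalized [cx, cy, w, h], most downstream tools (cropping, drawing, box prompts for a segmenter) need them converted to pixel corner coordinates first. A minimal sketch of that conversion for a single box; `cxcywh_to_xyxy` is an illustrative helper name:

```python
def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert one normalized [cx, cy, w, h] box to pixel [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * image_width
    y1 = (cy - h / 2) * image_height
    x2 = (cx + w / 2) * image_width
    y2 = (cy + h / 2) * image_height
    return [x1, y1, x2, y2]


# A box centered in a 640x480 frame, a quarter of the width and half the height
print(cxcywh_to_xyxy([0.5, 0.5, 0.25, 0.5], image_width=640, image_height=480))
# prints [240.0, 120.0, 400.0, 360.0]
```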
The boxes tensor has shape (N, 4) and logits has shape (N,), holding one confidence score per box. Boxes whose score falls below box_threshold are filtered out. Inference takes about 125 ms per frame on an RTX 4090 with Swin-B. The HuggingFace transformers port runs in pure PyTorch, with no custom CUDA ops needed.
Start with box_threshold=0.3, and lower it to 0.15-0.2 for recall-critical applications (semantic mapping needs high recall).
In GAS v1, detection and segmentation were separate steps with separate models. Faster R-CNN produced bounding boxes. Each box was then fed to SAM as a point prompt to generate a mask. For video, masks were merged across frames using IoU matching, a brittle process that lost objects when viewpoints changed significantly.
Grounded SAM 2 combines Grounding DINO and SAM 2 into a single system that detects, segments, and tracks objects across video frames. One call replaces three separate steps from v1.
# Grounded SAM 2: detect + segment + track
# (illustrative wrapper; GroundedSAM2Pipeline and its methods are placeholder names)
import supervision as sv
from grounded_sam2 import GroundedSAM2Pipeline

pipeline = GroundedSAM2Pipeline(
    grounding_model="gdino_swinb",
    sam2_checkpoint="sam2_hiera_large.pt",
    device="cuda",
)

# Single frame: detect + segment
results = pipeline.predict(
    image=video_frames[0],
    text_prompt="door . chair . table . exit sign",
    box_threshold=0.3,
)
masks = results.masks    # (N, H, W) binary masks
labels = results.labels  # ['door', 'chair', ...]
boxes = results.boxes    # (N, 4) bounding boxes

# Video tracking: propagate masks with memory
tracker = pipeline.create_video_tracker()
for i, frame in enumerate(video_frames):
    if i == 0:
        # Initialize with detections on first frame
        tracked = tracker.init_frame(frame, results)
    else:
        # Propagate masks using SAM 2 memory bank
        tracked = tracker.propagate(frame)
    # tracked.masks has consistent IDs across frames
GAS v1 had to solve two hard problems in post-processing: (1) which masks across different frames correspond to the same object? and (2) how do we merge their 3D projections? Grounded SAM 2 solves problem (1) directly — every mask comes with a consistent track ID. Problem (2) becomes trivial: just aggregate all 3D points that share the same track ID.
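The aggregation step can be sketched in a few lines. This assumes each observation has already been back-projected to a 3D point and tagged with the tracker's ID; the `(track_id, (x, y, z))` tuple format and the function name are illustrative choices, not something the tracker emits directly.

```python
from collections import defaultdict


def aggregate_by_track(observations):
    """Group 3D points by track ID and return {track_id: (centroid, num_points)}."""
    grouped = defaultdict(list)
    for track_id, point in observations:
        grouped[track_id].append(point)

    summary = {}
    for track_id, points in grouped.items():
        n = len(points)
        centroid = tuple(sum(p[axis] for p in points) / n for axis in range(3))
        summary[track_id] = (centroid, n)
    return summary


observations = [
    (1, (0.0, 0.0, 2.0)),  # chair, frame 0
    (1, (0.5, 0.0, 2.0)),  # chair, frame 1
    (2, (3.0, 1.0, 4.0)),  # door, frame 0
    (1, (1.0, 0.0, 2.0)),  # chair, frame 2
]
print(aggregate_by_track(observations))
```

In a real pipeline each track would hold thousands of points per frame, so you would likely keep running sums instead of full lists, but the grouping logic stays the same.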
GAS v1 relied on GO-SLAM's depth estimation or, when available, RGB-D sensor data. Both approaches have limitations: GO-SLAM's depth is tied to its SLAM quality, and RGB-D sensors (like Intel RealSense) are expensive, fragile, and limited in range.
Depth Anything V2 produces metric depth maps from any single RGB image. No stereo pair. No depth sensor. No camera calibration. Just one image → one depth map with real-world scale.
# Depth Anything V2: metric depth from any image
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2

model = DepthAnythingV2(
    encoder="vitl",  # vitb, vitl, or vitg (other encoders also need matching features/out_channels)
    max_depth=20.0,  # indoor scenes: 20m max
)
model.load_state_dict(
    torch.load("checkpoints/depth_anything_v2_metric_indoor_vitl.pth", map_location="cpu")
)
model.eval().cuda()

image = cv2.imread("frame.jpg")       # placeholder path; BGR array as loaded by OpenCV
depth_map = model.infer_image(image)  # (H, W) in meters
# depth_map[100, 200] = 3.7 means 3.7 meters from camera
Depth Anything V1 produced relative depth — it knew that object A is farther than object B, but not how far either one actually is. For mapping, you need metric depth: real distances in meters. Depth Anything V2 added a metric depth head trained on synthetic data (Hypersim, Virtual KITTI) that predicts absolute distances.
GAS v1 used the equation $P = d \cdot K^{-1} [u, v, 1]^\top$ to project pixels into 3D. This requires camera intrinsics K. With VGGT, K is estimated automatically. But even without VGGT, Depth Anything V2 + estimated field of view gives you approximate 3D:
# Approximate back-projection without exact K
import numpy as np

def backproject(depth, fov_deg=60):
    H, W = depth.shape  # take the image size from the depth map itself
    # Estimate focal length (in pixels) from the horizontal field of view
    f = W / (2 * np.tan(np.radians(fov_deg / 2)))
    cx, cy = W / 2, H / 2
    # Create pixel coordinate grid
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project to 3D
    x = (u - cx) * depth / f
    y = (v - cy) * depth / f
    z = depth
    return np.stack([x, y, z], axis=-1)  # (H, W, 3)
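How much does the field-of-view guess matter? The sketch below back-projects one pixel under two FOV assumptions so you can see the lateral error directly. It uses the same pinhole model with a centered principal point; the function names are illustrative and the 60 and 70 degree values are arbitrary examples.

```python
import math


def focal_from_fov(width_px, fov_deg):
    """Focal length in pixels for a pinhole camera with the given horizontal FOV."""
    return width_px / (2 * math.tan(math.radians(fov_deg / 2)))


def backproject_pixel(u, v, depth_m, width_px, height_px, fov_deg):
    """Back-project one pixel to camera-frame (x, y, z), assuming a centered principal point."""
    f = focal_from_fov(width_px, fov_deg)
    cx, cy = width_px / 2, height_px / 2
    return ((u - cx) * depth_m / f, (v - cy) * depth_m / f, depth_m)


# Same pixel and depth under two FOV guesses
for fov in (60, 70):
    x, y, z = backproject_pixel(
        u=480, v=240, depth_m=3.0, width_px=640, height_px=480, fov_deg=fov
    )
    print(f"fov={fov}: f={focal_from_fov(640, fov):.1f}px, x={x:.3f}m, y={y:.3f}m, z={z:.1f}m")
```

For this pixel, moving the guess from 60 to 70 degrees shifts the lateral offset from about 0.87 m to about 1.05 m at 3 m depth, while z is unchanged. A wrong FOV stretches or squeezes the map sideways but does not affect depth.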
GAS v1's merging step was simple: project object masks into 3D, check IoU between point clouds, and merge those above a threshold. This is purely geometric — it knows nothing about what the objects are in a semantic sense. Two spatially overlapping point clouds get merged even if one is a chair and the other is a table leg behind it.
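To make the weakness concrete, here is a minimal sketch of purely geometric merging. The source does not say how GAS v1 computed IoU between point clouds, so this assumes a voxel-occupancy IoU and a greedy first-match merge; the function names, voxel size, and threshold are all illustrative.

```python
import math


def voxelize(points, voxel_size=0.05):
    """Map 3D points to the set of voxel indices they occupy."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


def voxel_iou(points_a, points_b, voxel_size=0.05):
    """Intersection over union of the voxels occupied by two point clouds."""
    va, vb = voxelize(points_a, voxel_size), voxelize(points_b, voxel_size)
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0


def merge_by_iou(clouds, threshold=0.3, voxel_size=0.05):
    """Greedy merge: a cloud joins the first existing group it overlaps enough with."""
    merged = []
    for cloud in clouds:
        for group in merged:
            if voxel_iou(group, cloud, voxel_size) >= threshold:
                group.extend(cloud)
                break
        else:
            merged.append(list(cloud))
    return merged


chair = [(0.1, 0.1, 0.1), (1.1, 0.1, 0.1), (2.1, 0.1, 0.1)]
overlapping = [(0.2, 0.2, 0.2), (1.2, 0.2, 0.2), (2.2, 0.2, 0.2), (3.2, 0.2, 0.2)]
far_away = [(10.5, 0.5, 0.5)]
groups = merge_by_iou([chair, overlapping, far_away], threshold=0.3, voxel_size=1.0)
print(len(groups), [len(g) for g in groups])
```

Nothing in `merge_by_iou` looks at labels or appearance: the second cloud is absorbed into the first purely because it occupies the same voxels, whether or not it is the same object.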
ConceptGraphs (Gu et al., 2023) replaces this with a fundamentally different representation: a 3D scene graph where each node is an object with both geometric and semantic properties.
A scene graph is a data structure where:
This is dramatically richer than a point cloud. You can query the scene graph with natural language: "Where is the nearest exit?" "Which room has a refrigerator?" "Is there a fire extinguisher near the kitchen?"
ConceptGraphs slots perfectly into the GAS v2 pipeline:
The output is a 3D scene graph that can be queried in natural language — the ultimate semantic map.
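# ConceptGraphs-style scene graph construction (simplified sketch)
import numpy as np

class ObjectNode:
    def __init__(self, position_3d, clip_feature, label):
        self.centroid = np.asarray(position_3d, dtype=float)
        self.clip_feature = np.asarray(clip_feature, dtype=float)  # assumed unit-normalized
        self.label = label
        self.count = 1

    def merge(self, position_3d, clip_feature):
        # Running average of centroid and feature, then re-normalize the feature
        self.centroid = (self.centroid * self.count + position_3d) / (self.count + 1)
        feature = self.clip_feature * self.count + clip_feature
        self.clip_feature = feature / np.linalg.norm(feature)
        self.count += 1

class SceneGraph:
    def __init__(self, merge_threshold=0.7, encode_text=None):
        self.nodes = []  # list of ObjectNode
        self.merge_threshold = merge_threshold
        self.encode_text = encode_text  # callable: str -> unit-normalized CLIP text feature

    def add_observation(self, position_3d, clip_feature, label):
        # Check against existing nodes
        for node in self.nodes:
            spatial_dist = np.linalg.norm(node.centroid - position_3d)
            semantic_sim = np.dot(node.clip_feature, clip_feature)
            if spatial_dist < 0.5 and semantic_sim > self.merge_threshold:
                node.merge(position_3d, clip_feature)
                return
        # New object
        self.nodes.append(ObjectNode(position_3d, clip_feature, label))

    def query(self, text):
        # Natural language query: return the node most similar to the text
        text_feat = self.encode_text(text)
        scores = [np.dot(n.clip_feature, text_feat) for n in self.nodes]
        return self.nodes[int(np.argmax(scores))]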
GAS v1 represented scenes as meshes (from GO-SLAM) and point clouds (from mask back-projection). These are the traditional representations for 3D mapping. But in 2024-2025, a new representation swept the field: 3D Gaussian Splatting (3DGS).
Instead of triangular meshes or discrete points, 3DGS represents a scene as millions of tiny 3D Gaussian blobs. Each Gaussian has:
Rendering is done by "splatting" these Gaussians onto the image plane — no ray marching, no neural network evaluation per ray. This makes rendering extremely fast: 100+ FPS at high resolution.
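The per-pixel blend behind splatting is ordinary front-to-back alpha compositing. The sketch below shows it for a single pixel, assuming the splats are already sorted nearest-first and that each splat's opacity at this pixel is given directly (a real renderer derives it from the projected 2D Gaussian). The function name and the early-exit cutoff are illustrative.

```python
def composite_front_to_back(splats):
    """Blend per-pixel splat contributions, nearest first.

    Each entry is (value, alpha): `value` is a list of floats (an RGB color or
    any other per-Gaussian vector) and `alpha` is the opacity in [0, 1].
    Returns (blended_value, remaining_transmittance).
    """
    if not splats:
        return [], 1.0
    out = [0.0] * len(splats[0][0])
    transmittance = 1.0  # fraction of light not yet absorbed
    for value, alpha in splats:
        weight = transmittance * alpha
        out = [o + weight * v for o, v in zip(out, value)]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # pixel is effectively opaque, stop early
            break
    return out, transmittance


# A half-transparent red splat in front of a half-transparent blue one
color, remaining = composite_front_to_back([
    ([1.0, 0.0, 0.0], 0.5),
    ([0.0, 0.0, 1.0], 0.5),
])
print(color, remaining)  # prints [0.5, 0.0, 0.25] 0.25
```

Because the blend is a weighted sum, `value` does not have to be a color. The same loop composites any vector attached to a Gaussian, which is what makes rendering non-color attributes possible.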
The key innovation for our purposes is feature splatting. Each Gaussian can carry not just color but arbitrary feature vectors — including CLIP embeddings, semantic labels, or instance IDs. You can render the scene from any viewpoint and get:
# Semantic Gaussian Splatting pipeline (pseudocode: the helper functions are placeholders)
# 1. Run VGGT once and reuse its outputs
predictions = vggt_model(images)
pointmaps = predictions["world_points"]  # per-pixel 3D points for each view
extrinsics, intrinsics = pose_encoding_to_extri_intri(
    predictions["pose_enc"], images.shape[-2:]
)
poses = extrinsics[0]  # (N, 3, 4), batch dimension dropped
# 2. Initialize Gaussians from VGGT points
gaussians = initialize_gaussians(pointmaps, images)
# 3. Attach semantic features (from Grounding DINO + CLIP)
for i, frame in enumerate(frames):
    detections = grounding_dino.predict(frame, "door . chair . table")
    clip_features = clip.encode_regions(frame, detections.boxes)
    assign_features(gaussians, clip_features, poses[i])
# 4. Optimize with photometric + semantic loss
optimize_gaussians(gaussians, images, poses, semantic_labels)
# 5. Query: render semantic map from any viewpoint
semantic_img = render(gaussians, query_pose, mode="semantic")
Let's put it all together. Here's the full GAS v2 data flow, from raw video to semantic map.
Input: RGB video from any camera (phone, GoPro, webcam, robot). No depth sensor. No calibration pattern. No training data.
Output (choose one or more):
This chapter is a practical guide. By the end, you'll have a working GAS v2 pipeline on your machine.
Use the HuggingFace transformers port of Grounding DINO, not the IDEA-Research repo. The official repo compiles a CUDA op at install time that wants nvcc; the HF port is pure PyTorch. Inference-only installs stay on debian_slim with PyTorch cu124 wheels, no CUDA toolkit needed.
VGGT's output keys are pose_enc, depth, world_points, and world_points_conf, not extrinsic/intrinsic. Use pose_encoding_to_extri_intri() for matrices. Better: world_points already gives you per-pixel 3D points; skip the unprojection math.
Re-pin torch last, with --force-reinstall. SAM 2's install will otherwise upgrade torch to a version bundling CUDA 13 wheels, breaking nvrtc at runtime.
Run SAM 2 under torch.autocast("cuda", dtype=torch.bfloat16): the image encoder and mask decoder have different dtype conventions.
.remote() calls.
modal run recipe.
transformers GDINO port + MPS for a degraded-but-working dev loop, or go to the cloud.
# Create conda environment
conda create -n gas-v2 python=3.11 -y
conda activate gas-v2
# PyTorch with CUDA
pip install torch torchvision torchaudio --index-url \
    https://download.pytorch.org/whl/cu121
# Core dependencies
pip install numpy scipy opencv-python pillow tqdm
pip install supervision  # Roboflow's annotation/viz toolkit
The PyPI packages (vggt, groundingdino-py, segment-anything-2) are either stale or non-functional. Install from the source repos instead, and use the HuggingFace transformers port of Grounding DINO to avoid the CUDA-toolchain requirement entirely.
# VGGT (3D geometry), from source
pip install git+https://github.com/facebookresearch/vggt.git
# Grounding DINO: use transformers' port (pure PyTorch, no nvcc)
pip install "transformers>=4.44"
# At load time: AutoModelForZeroShotObjectDetection.from_pretrained(
#     "IDEA-Research/grounding-dino-base")
# SAM 2 (segmentation + tracking), from source
pip install git+https://github.com/facebookresearch/sam2.git
# Rerun (for visualization)
pip install "rerun-sdk>=0.20"
# Open3D (for PLY + Poisson mesh export)
pip install "open3d>=0.18"
# CLIP (semantic features for ConceptGraphs, if you extend dedup)
pip install open-clip-torch
# IMPORTANT: re-pin torch LAST, after the above installs.
# SAM 2's deps will otherwise drag torch to a CUDA-13 wheel that breaks nvrtc.
pip install --upgrade --force-reinstall \
    torch==2.5.1 torchvision==0.20.1 \
    --index-url https://download.pytorch.org/whl/cu124
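Since a later install can silently move torch off the pinned version, it helps to confirm the pins after everything is installed. The sketch below uses only the standard library; `installed_versions` and `check_pins` are illustrative helpers, and the prefix match is there because PyTorch wheels typically report a local suffix such as `+cu124` after the version number.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def check_pins(pins):
    """Compare installed versions against expected prefixes; return mismatch messages."""
    found = installed_versions(pins)
    problems = []
    for name, expected in pins.items():
        actual = found[name]
        if actual is None:
            problems.append(f"{name}: not installed (expected {expected})")
        elif not actual.startswith(expected):
            problems.append(f"{name}: found {actual}, expected {expected}")
    return problems


if __name__ == "__main__":
    for line in check_pins({"torch": "2.5.1", "torchvision": "0.20.1"}) or ["all pins OK"]:
        print(line)
```

Running this as the last step of your setup script turns a confusing runtime failure into an explicit message at install time.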
# Create weights directory
mkdir -p weights && cd weights
# VGGT (auto-downloaded on first use via HuggingFace)
# Grounding DINO
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth
# SAM 2
wget https://dl.fbaipublicfiles.com/segment_anything_2/sam2_hiera_large.pt
# Depth Anything V2 (optional)
wget https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-Indoor-Large/resolve/main/depth_anything_v2_metric_indoor_vitl.pth
# Recommended project layout
gas-v2/
    main.py        # Orchestrator: load video, run pipeline
    slam.py        # VGGT or MASt3R-SLAM wrapper
    detect.py      # Grounding DINO + SAM 2 wrapper
    segment.py     # Mask processing, tracking
    merge.py       # ConceptGraphs scene graph builder
    visualize.py   # 2D floorplan + 3D rendering
    config.yaml    # Model paths, thresholds, prompt
    weights/       # Downloaded model weights
    data/          # Input videos
    output/        # Results
# main.py - simplified orchestrator
from slam import run_vggt
from detect import detect_and_segment
from merge import build_scene_graph
from visualize import render_floorplan

# 1. Extract frames
# (extract_frames is a local helper, not shown here, that samples frames from the video)
frames = extract_frames("data/apartment.mp4", fps=3)

# 2. 3D geometry
geometry = run_vggt(frames)  # poses, depth, pointmaps

# 3. Detect + segment
objects = detect_and_segment(
    frames,
    prompt="door . chair . table . sofa . bed . window",
)

# 4. Build scene graph
scene_graph = build_scene_graph(objects, geometry)

# 5. Render output
render_floorplan(scene_graph, output="output/floorplan.png")
print(f"Found {len(scene_graph.nodes)} objects")
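The orchestrator calls `extract_frames(..., fps=3)` without showing it. One piece of that helper is deciding which frames to keep when downsampling, sketched below. The name `sample_frame_indices` and its signature are hypothetical, and the actual video decoding (for example with OpenCV) is left out so the logic stays dependency-free.

```python
def sample_frame_indices(num_frames, source_fps, target_fps):
    """Indices of the frames to keep when downsampling a video to roughly target_fps."""
    if target_fps <= 0 or source_fps <= 0:
        raise ValueError("frame rates must be positive")
    step = max(source_fps / target_fps, 1.0)  # never upsample
    indices = []
    next_pick = 0.0
    for i in range(num_frames):
        if i >= next_pick:
            indices.append(i)
            next_pick += step
    return indices


# One second of 30 fps video sampled at 3 fps
print(sample_frame_indices(num_frames=30, source_fps=30, target_fps=3))
# prints [0, 10, 20]
```

Tracking the next pick as a float rather than using a fixed integer stride keeps the average rate close to the target when the source rate is not an exact multiple of it.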
How do you know if your GAS v2 pipeline actually works? You need benchmarks with ground truth, metrics to measure quality, and baselines to compare against.
Each stage of the pipeline has its own evaluation metric:
# Evaluate on Replica dataset
from eval import evaluate_pipeline

results = evaluate_pipeline(
    pipeline=gas_v2,
    dataset="replica",
    scenes=["office_0", "room_0", "apartment_0"],
    metrics=["ate", "mAP", "mIoU", "completeness"],
)
for scene, m in results.items():
    print(f"{scene}: ATE={m['ate']:.3f}m, mAP={m['mAP']:.1f}, mIoU={m['mIoU']:.1f}%")
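To show what the `ate` number measures, here is a minimal sketch of absolute trajectory error as an RMSE over camera positions. It assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame; real benchmarks normally run an alignment step first, which is omitted here. The function name is illustrative.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square translation error between two equal-length position lists."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and the same length")
    squared = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.0), (2.0, 0.0, 0.4)]
print(f"ATE RMSE: {ate_rmse(est, gt):.3f} m")
```

Because the errors are squared before averaging, a single badly wrong pose raises the score more than several slightly wrong ones.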
GAS v2 builds a semantic map. But a map is only useful if something acts on it. The next frontier is closing the loop: explore, map, reason, and act.
OK-Robot (Liu et al., 2024) demonstrates the end-to-end vision: give a home robot a natural language command ("bring me the coffee mug"), it uses an open-vocabulary semantic map to find the mug, plans a path to it, and picks it up. The semantic map is essentially a GAS v2-style scene graph.
SayPlan (Rana et al., 2023) feeds a 3D scene graph (like ConceptGraphs output) directly to an LLM for task planning. The LLM reasons about spatial relationships: "To get to the kitchen, go through the hallway, turn left at the bathroom." The scene graph provides the grounding that prevents the LLM from hallucinating spatial facts.
We're watching three fields merge into one:
The pipeline is becoming: perceive (GAS v2) → reason (LLM + scene graph) → act (VLA). Each piece is a foundation model. Each piece is zero-shot. The system works in any environment without environment-specific training.