GAS: Semantic Mapping

Chapter 0: The Problem

Imagine you drop a robot into an unfamiliar apartment. It has a camera. It can move around. But it has no map, no GPS, and no floorplan. To do anything useful, such as "go to the kitchen and grab the mug", it must answer two questions: where am I, and what is around me?

Outdoor mapping is largely solved. We have satellite imagery, LiDAR-equipped cars, and decades of GPS infrastructure. Indoors, none of that exists. Walls block signals. Rooms change — someone moves a chair, stacks boxes, opens a door. Every indoor space is a unique, dynamic, GPS-denied puzzle.

Why cameras?

LiDAR gives detailed 3D scans, but a unit can cost thousands of dollars, adds noticeable weight, and captures only geometry: it can't tell you what something is. A $5 RGB camera, on the other hand, captures appearance, texture, color, and the other cues humans use to recognize objects. The challenge is extracting 3D structure and semantic meaning from flat 2D images.

The semantic mapping dream

A semantic map doesn't just know "there's a surface 2.3 meters ahead." It knows "there's a couch 2.3 meters ahead, and behind it is a bookshelf." This is the difference between a 3D point cloud and a map a robot (or a person) can actually reason about. The GAS paper tackles exactly this pipeline: video in, semantic floorplan out.

The state of the art (2023-2024): ConceptGraphs (Gu et al., 2023) builds open-vocabulary 3D scene graphs from RGB-D video using foundation models. ScanNet and Matterport3D provide benchmark datasets for indoor 3D understanding. The vision is converging: cameras + foundation models = complete spatial intelligence for robots. GAS contributes a practical, modular pipeline that chains existing SOTA components into an end-to-end system.

The GAS approach at a glance

GAS (short for the three authors' initials) processes video through four stages:

SLAM — reconstruct the 3D environment mesh and estimate camera poses
Detect — find objects in each video frame
Segment + Project — create precise masks and project them into 3D
Merge + Map — fuse objects across frames and generate a 2D floorplan

Each stage uses an off-the-shelf or fine-tuned model. The genius isn't in any single component — it's in how they're composed and the engineering choices that make the pipeline robust.

The complete data flow with tensor shapes:
• Input: RGB video at 640×480, 30fps → subsampled to every 10th frame (~3fps)
• GO-SLAM: frames → camera poses T_i (4×4 SE(3) matrices), depth maps D_i (640×480 float32, values in meters), dense mesh (vertices + faces)
• Faster R-CNN: each frame (640×480×3) → N bounding boxes (N×4 in xyxy format), N class labels (int), N confidence scores (float)
• SAM: each bounding box prompt → binary mask (640×480, 0/1)
• Back-projection: each masked pixel (u,v,d) → 3D point (x,y,z) in world coordinates. ~1000-50,000 points per object
• Merge: voxelized point clouds (5cm resolution) → IoU comparison → merged object list
• Output: 2D floor plan image (top-down projection with labeled polygons)
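
The shapes above can be collected into one per-frame record. The following is a minimal sketch, not GAS's actual data structure: the `FrameRecord` and `subsample_indices` names are hypothetical, and the 10-second clip is made up. Note that NumPy stores a 640×480 image as an array of shape (480, 640), rows first.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FrameRecord:
    """Per-frame quantities, using the shapes listed above (H=480, W=640)."""
    pose: np.ndarray    # (4, 4) camera pose T_i
    depth: np.ndarray   # (H, W) float32 depth in meters
    boxes: np.ndarray   # (N, 4) boxes in xyxy format
    labels: np.ndarray  # (N,) integer class ids
    scores: np.ndarray  # (N,) confidence scores
    masks: np.ndarray   # (N, H, W) binary masks, one per box


def subsample_indices(num_frames, step=10):
    """Keep every `step`-th frame: 30 fps with step=10 gives about 3 fps."""
    return list(range(0, num_frames, step))


H, W, N = 480, 640, 3
record = FrameRecord(
    pose=np.eye(4),
    depth=np.full((H, W), 2.0, dtype=np.float32),
    boxes=np.zeros((N, 4)),
    labels=np.zeros(N, dtype=int),
    scores=np.ones(N),
    masks=np.zeros((N, H, W), dtype=bool),
)
kept = subsample_indices(num_frames=300)  # 10 seconds of 30 fps video
print(len(kept), record.depth.shape, record.masks.shape)
```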

Why is indoor semantic mapping harder than outdoor mapping?

No GPS, no satellite imagery, environments change constantly, and every space is unique
Indoor cameras are lower resolution
There are fewer objects indoors

Chapter 1: Visual SLAM Foundations

Before you can label anything, you need to know where you are and what the world looks like in 3D. This is the job of SLAM — Simultaneous Localization and Mapping. The word "simultaneous" is key: you're building the map and figuring out your position at the same time, each helping the other.

The SLAM loop

Every SLAM system runs a continuous loop:

Tracking: Given a new camera frame, estimate how the camera moved (the "pose" — position + orientation in 3D).
Mapping: Using the estimated pose, add new 3D structure to the map.
Loop closure: If you revisit a previously seen area, correct accumulated drift by aligning the old and new observations.

The tracking step typically matches features (distinctive points) between consecutive frames and solves for the relative camera motion. The mapping step triangulates those features into 3D points. Loop closure is the hardest part — you need to recognize "I've been here before" and then globally adjust all poses to be consistent.
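
To see why loop closure matters, here is a toy 2D sketch of drift. It is not how GO-SLAM or any real system tracks: the `se2` and `end_pose` helpers and the 1-degree heading bias are assumptions made up for illustration. The camera walks a 1 m square, and each relative motion is chained onto the previous pose.

```python
import numpy as np


def se2(dx, dy, dtheta):
    """3x3 homogeneous matrix: translate by (dx, dy), then rotate by dtheta."""
    c, s = np.cos(dtheta), np.sin(dtheta)
    return np.array([[c, -s, dx],
                     [s, c, dy],
                     [0.0, 0.0, 1.0]])


def end_pose(heading_bias_deg):
    """Chain four 'move 1 m forward, turn left 90 degrees' steps.

    heading_bias_deg is a hypothetical per-step tracking error in the turn.
    """
    pose = np.eye(3)
    step = se2(1.0, 0.0, np.deg2rad(90.0 + heading_bias_deg))
    for _ in range(4):
        pose = pose @ step  # accumulate relative motion onto the current pose
    return pose


ideal = end_pose(0.0)
drifted = end_pose(1.0)
ideal_error = np.linalg.norm(ideal[:2, 2])  # distance from the start point
drift = np.linalg.norm(drifted[:2, 2])
print(f"ideal end error: {ideal_error:.6f} m, drifted end error: {drift:.3f} m")
```

With a perfect tracker the walk ends where it started. With the biased tracker it ends a few centimeters away after only four steps, and that gap is the kind of error a loop closure would detect and spread back over the trajectory.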

Classical vs. neural SLAM

Classical SLAM (ORB-SLAM3, 2021) detects hand-crafted features like ORB keypoints and tracks them with geometric optimization. It's fast, well-understood, and works on a CPU. But it produces sparse point clouds — a scattered set of 3D dots, not a dense surface.

Neural SLAM replaces parts of this pipeline with learned components. DROID-SLAM (Teed & Deng, 2021) learns to estimate dense optical flow between frames, producing much denser reconstructions. It uses differentiable bundle adjustment — the optimization that refines all camera poses and 3D structure jointly — as a layer in the neural network, so gradients flow through the geometric reasoning.

The DUSt3R revolution (2024): DUSt3R (Wang et al., CVPR 2024) threw out the SLAM loop entirely. Given two images, it directly predicts a 3D pointmap — no feature matching, no triangulation, no explicit pose estimation. MASt3R-SLAM (Murai et al., CVPR 2025) built real-time dense SLAM on top of this, and VGGT (Wang et al., CVPR 2025 Best Paper) pushed it further: one transformer, one forward pass, all 3D geometry from unposed images.

GAS uses GO-SLAM

GAS chose GO-SLAM (Zhang et al., 2023), a neural implicit SLAM system that represents the scene as a neural radiance field (similar to NeRF). GO-SLAM performs global optimization, meaning it refines all camera poses jointly rather than greedily fixing each pose as it goes. This produces more consistent 3D reconstructions, especially in challenging environments with loops or revisited areas.

Engineering decision: why GO-SLAM over alternatives? GAS needed dense depth maps (for pixel-level 3D projection), not sparse keypoints. ORB-SLAM3 gives only sparse points, which is not enough for projecting segmentation masks. DROID-SLAM gives dense depth but no mesh, and it lacks GO-SLAM's online loop closing and full bundle adjustment, so poses can drift over long sequences. GO-SLAM gives dense depth + mesh + globally consistent poses. The tradeoff: GO-SLAM runs at ~2fps on an RTX 3090 (not real-time) and requires ~8GB VRAM for a typical indoor sequence. For an offline mapping pipeline, this is acceptable.

GO-SLAM outputs three critical things that GAS needs downstream:

Camera poses (T_i) — the 4×4 transformation matrix for each frame, encoding where the camera was and which direction it was pointing
Depth maps — per-pixel depth estimates for each frame
3D mesh — a reconstructed surface of the entire environment

The camera poses and depth maps are essential for projecting 2D detections into 3D space. The mesh provides the overall spatial structure.
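
As a concrete picture of what a 4×4 pose holds, here is a minimal sketch. It assumes a camera-to-world convention with the optical axis along the camera's +z and a vertical y axis; the source does not state GO-SLAM's convention, so treat both as assumptions. The helper names are hypothetical.

```python
import numpy as np


def make_pose(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 matrix."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def invert_pose(T):
    """Invert a rigid transform without a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


# Camera at (1, 0, 2), rotated 90 degrees about the vertical (y) axis.
theta = np.pi / 2
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
T = make_pose(R, np.array([1.0, 0.0, 2.0]))

camera_position = T[:3, 3]                                 # where the camera is
viewing_direction = T[:3, :3] @ np.array([0.0, 0.0, 1.0])  # where it points
print(camera_position, np.round(viewing_direction, 6))
```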

How SLAM has evolved

System	Year	Approach	Output density
ORB-SLAM3	2021	Feature-based, CPU	Sparse points
DROID-SLAM	2021	Learned optical flow + diff. BA	Dense
GO-SLAM	2023	Neural implicit + global opt.	Dense mesh
DUSt3R	2024	Direct pointmap prediction	Dense
Gaussian SLAM	2024	3D Gaussians as map representation	Dense + renderable
MASt3R-SLAM	2025	Learned 3D priors, real-time	Dense
VGGT	2025	Feed-forward transformer, no loop	Dense

What three outputs does GO-SLAM provide that GAS needs for downstream processing?

Camera poses (where the camera was), depth maps (per-pixel distance), and a 3D mesh of the environment
RGB images, object labels, and bounding boxes
Feature descriptors, keypoint matches, and loop closure candidates

Chapter 2: Object Detection for Scene Understanding

With the camera poses and 3D structure recovered by SLAM, the next question is: what objects are in each frame? This is the job of an object detector — a model that takes an image and outputs a set of bounding boxes, each with a class label and confidence score.

The two-stage detection paradigm

GAS uses Faster R-CNN (Ren et al., 2015), the workhorse of two-stage detection. The architecture has two components:

Region Proposal Network (RPN): Scans the feature map and proposes "interesting" rectangular regions that might contain objects. It does this by sliding a small network over the feature map and predicting, at each location, whether an object exists and rough box coordinates. Anchor boxes at multiple scales and aspect ratios ensure it catches objects of all shapes.
Detection head: Takes each proposed region, crops the corresponding features (via RoI pooling/align), and classifies it into a specific category while refining the box coordinates.
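
A minimal sketch of anchor generation, assuming three scales and three aspect ratios on a single stride-16 feature map. These values are common defaults chosen for illustration, not settings reported for GAS, and `make_anchors` and `shift_anchors` are hypothetical helper names.

```python
import numpy as np


def make_anchors(scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return (len(scales) * len(ratios), 4) xyxy anchors centered at the origin.

    ratio is height / width; every anchor at a given scale has area scale**2.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)


def shift_anchors(base, feat_h, feat_w, stride):
    """Place the base anchors at the center of every feature-map cell."""
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base[None]).reshape(-1, 4)


base = make_anchors()
# A 640x480 image at stride 16 gives a 40x30 feature map.
all_anchors = shift_anchors(base, feat_h=30, feat_w=40, stride=16)
print(base.shape, all_anchors.shape)
```

The RPN then scores each of these candidate boxes for "objectness" and regresses small offsets to fit the object more tightly.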

GAS specifically uses the Detectron2 implementation from Meta AI, fine-tuned on indoor object datasets. Detectron2 provides a modular framework where you can swap backbones (ResNet-50, ResNet-101, etc.), adjust anchor sizes, and configure training schedules.

Why Faster R-CNN over YOLO? For semantic mapping, accuracy matters more than speed. The pipeline runs offline on pre-recorded video, so there is no real-time constraint. Faster R-CNN with ResNet-50-FPN runs at ~8fps on an RTX 3090, which is fine for a 3fps subsampled video. YOLO runs at 30+fps but, among the models available in 2023 when GAS was built, with lower mAP on small indoor objects. The detector's per-box confidence scores also give GAS a simple filter: detections below 0.5 confidence are discarded. YOLO's single-shot regression can produce more false positives at equivalent recall.
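
The 0.5 confidence filter is a one-line mask over the detector outputs. A minimal sketch with made-up detections; `filter_detections` is a hypothetical helper, not part of Detectron2.

```python
import numpy as np


def filter_detections(boxes, labels, scores, threshold=0.5):
    """Drop detections whose confidence is below the threshold."""
    keep = scores >= threshold
    return boxes[keep], labels[keep], scores[keep]


# Three made-up detections in xyxy format.
boxes = np.array([[10.0, 20.0, 110.0, 220.0],
                  [300.0, 40.0, 380.0, 160.0],
                  [5.0, 5.0, 50.0, 50.0]])
labels = np.array([3, 7, 3])
scores = np.array([0.92, 0.41, 0.50])

kept_boxes, kept_labels, kept_scores = filter_detections(boxes, labels, scores)
print(kept_labels, kept_scores)
```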

The open-vocabulary revolution: Faster R-CNN can only detect categories it was trained on. Grounding DINO (Liu et al., 2023) changed the game: give it any text prompt ("find the coffee mug next to the laptop") and it detects it — zero-shot, no fine-tuning needed. OWL-ViT and YOLO-World push open-vocabulary detection further. GAS's fine-tuning approach works well for known indoor categories but can't handle novel objects. Open-vocabulary detectors are the clear next step.

Why detection matters for mapping

A 3D reconstruction without semantics is just geometry — surfaces and volumes with no meaning. Detection gives each region an identity: "this cluster of 3D points is a table, that one is a chair." But bounding boxes are crude rectangles that include background pixels. To project objects accurately into 3D, we need tighter boundaries. That's where segmentation comes in (Chapter 3).

The detection landscape

Detector	Year	Type	Key innovation
Faster R-CNN	2015	Two-stage, closed-vocab	Region Proposal Networks
YOLO	2015	Single-shot, closed-vocab	Detection as regression, real-time
DETR	2020	Transformer, closed-vocab	Set prediction, no NMS/anchors
OWL-ViT	2022	Open-vocab	Vision-language contrastive detection
Grounding DINO	2023	Open-vocab	Text-grounded detection, any category
YOLO-World	2024	Open-vocab, real-time	Vision-language YOLO at 52 FPS

What is the main limitation of GAS's Faster R-CNN detection approach compared to modern open-vocabulary detectors?

It can only detect the fixed set of categories it was trained on; it can't recognize novel objects at test time
It's too slow for real-time use
It can't output confidence scores

Chapter 3: Segmentation: From Boxes to Masks

A bounding box says "the object is somewhere in this rectangle." A segmentation mask says "these exact pixels belong to the object." For 3D projection, this distinction is critical.

Why masks matter for depth projection

Consider a lamp on a table. The bounding box around the lamp includes the table surface behind it. When you project bounding box pixels into 3D using depth, you get a cloud of 3D points that includes both the lamp and the table. The lamp's point cloud "bleeds" into the table's surface, corrupting both the lamp's estimated position and its shape. A tight segmentation mask captures only lamp pixels, producing a clean 3D object.
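
The lamp example can be reproduced with a synthetic depth map. This is a toy sketch with made-up numbers (lamp at 1.5 m, everything behind it at 2.0 m), meant only to show how many box pixels carry the wrong depth.

```python
import numpy as np

H, W = 480, 640
depth = np.full((H, W), 2.0, dtype=np.float32)  # surface behind the lamp, 2.0 m

# Lamp: a narrow upright region at 1.5 m.
mask = np.zeros((H, W), dtype=bool)
mask[150:350, 300:340] = True
depth[mask] = 1.5

# Bounding box around the lamp with some margin (x1, y1, x2, y2).
x1, y1, x2, y2 = 280, 140, 360, 360
box = np.zeros((H, W), dtype=bool)
box[y1:y2, x1:x2] = True

box_depths = depth[box]
mask_depths = depth[mask]
background_fraction = float(np.mean(box_depths > 1.75))

print(f"box pixels: {box_depths.size}, mask pixels: {mask_depths.size}")
print(f"fraction of box pixels at background depth: {background_fraction:.2f}")
print(f"depth spread in box: {box_depths.std():.3f} m, in mask: {mask_depths.std():.3f} m")
```

In this toy scene more than half of the box pixels land on the surface behind the lamp, so a box-based projection would place most of the "lamp" points half a meter too far away.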

GAS demonstrated this empirically: switching from bounding box projection to mask-based projection produced dramatically cleaner 3D maps with less object confusion.

SAM: Segment Anything Model

GAS uses Meta's Segment Anything Model (SAM, Kirillov et al., 2023), a foundation model for segmentation. SAM was trained on 11 million images with over 1 billion masks. Given a prompt (a point or a box; text prompts were explored in the SAM paper but are not part of the released model), SAM outputs a precise segmentation mask. GAS feeds each Faster R-CNN bounding box to SAM as a box prompt, and SAM returns a tight mask for the object inside.

This is a clever division of labor: Faster R-CNN knows what objects exist and roughly where they are (box + label), and SAM knows exactly which pixels belong to each object (mask). Neither alone is sufficient — SAM needs a prompt to know where to segment, and Faster R-CNN needs SAM for pixel-precise boundaries.

Frozen vs trained components in GAS:
• GO-SLAM: Used off-the-shelf, pretrained weights, no fine-tuning. The neural implicit representation is optimized per-scene at inference time (test-time optimization), not trained on indoor data.
• Faster R-CNN: Fine-tuned by the GAS authors. Backbone (ResNet-50 stages 1-2) frozen, stages 3-4 + FPN + RPN + detection head trained on 12,427 indoor images. Training: 20 epochs, batch size 4, lr=0.0001 with step decay, on 1 RTX 3090, ~6 hours.
• SAM: Completely frozen. Used as-is with box prompts. SAM ViT-H has 636M parameters, all pretrained on SA-1B (11M images, 1.1B masks). No fine-tuning needed — SAM generalizes to indoor objects without modification.
• The pipeline's total inference time per frame: GO-SLAM ~500ms + Faster R-CNN ~125ms + SAM ~100ms per object (typically 5-10 objects, so 500-1000ms) = ~1.1-1.6 seconds per frame.

SAM 2 and video segmentation (2024): SAM 2 (Ravi et al., 2024) extended SAM to video with streaming memory attention — segment an object in one frame, and SAM 2 tracks it through the entire video automatically. Grounded SAM 2 combines Grounding DINO + SAM 2: describe an object in text, and get a tracked mask through the video. This would eliminate GAS's need for per-frame detection + segmentation entirely.

Semantic vs. instance segmentation

Two related but different tasks:

Semantic segmentation: Labels every pixel with a class (floor, wall, chair) but doesn't distinguish between two chairs. All chair pixels get the same label.
Instance segmentation: Labels every pixel and distinguishes individual objects. Chair-1 and Chair-2 get different masks.

GAS needs instance-level masks: two chairs in the same frame must yield two separate point clouds, not one blob. SAM provides these, one mask per box prompt. Knowing that the couch in frame 47 and the couch in frame 92 are the same couch is a separate problem, cross-frame association, which is handled later via IoU-based merging (Chapter 5).

Why does GAS use SAM masks instead of Faster R-CNN bounding boxes for 3D projection?

Bounding boxes include background pixels; when projected to 3D via depth, these create "depth bleeding" where object point clouds absorb nearby surfaces
SAM is faster than Faster R-CNN
Bounding boxes don't have class labels

Chapter 4: The GAS Pipeline

Now let's put all the pieces together. This is the showcase chapter — the full four-stage pipeline from raw video to semantic floorplan.

Stage 1: SLAM → geometry

GO-SLAM processes the input video (monocular, stereo, or RGB-D) and outputs:

Camera pose T_i ∈ SE(3) for each frame i — a 4×4 matrix encoding rotation and translation
Depth map D_i for each frame — per-pixel distance to the surface
A dense 3D mesh of the entire scene

Stage 2: Detect → what's there

Faster R-CNN (fine-tuned Detectron2) runs on every N-th frame (GAS subsamples to reduce computation). For each frame, it outputs a set of bounding boxes {b_j}, class labels {c_j}, and confidence scores {s_j}.

Stage 3: Segment + Project → 3D objects

This is the mathematical heart of the pipeline. For each detected bounding box:

SAM segmentation: Feed the bounding box to SAM as a box prompt. SAM returns a binary mask M_j — 1 for object pixels, 0 for background.
Depth extraction: For each pixel (u, v) where M_j(u, v) = 1, read the depth d = D_i(u, v) from GO-SLAM's depth map.
Back-projection to 3D: Convert each masked pixel to a 3D point using the camera intrinsics and pose.

The back-projection math is the key equation. Given a pixel (u, v) with depth d, camera intrinsic matrix K, and camera pose T_i:

P_cam = d · K⁻¹ · [u, v, 1]^T

This gives the 3D point in camera coordinates. The camera intrinsic matrix K encodes focal lengths (f_x, f_y) and principal point (c_x, c_y):

K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]

To transform from camera coordinates to world coordinates, multiply by the camera pose:

P_world = T_i · [P_cam, 1]^T

Where T_i is the 4×4 transformation matrix (rotation + translation) from GO-SLAM. The result is a set of 3D points in world coordinates for each detected object.
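
The two equations above translate almost line for line into NumPy. This is a minimal sketch, not the GAS implementation: it assumes T_i maps camera coordinates to world coordinates, that the depth map stores distance along the optical axis (z), and that zero depth means "invalid". The intrinsics in the example are made up.

```python
import numpy as np


def backproject_mask(mask, depth, K, T_cam_to_world):
    """Lift every masked pixel (u, v, d) to a 3D point in world coordinates."""
    v, u = np.nonzero(mask)          # array rows are v, columns are u
    d = depth[v, u]
    valid = d > 0                    # assumption: zero depth marks missing data
    u, v, d = u[valid], v[valid], d[valid]

    pixels = np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)  # (3, M)
    p_cam = (np.linalg.inv(K) @ pixels) * d                # P_cam = d * K^-1 [u, v, 1]^T
    p_cam_h = np.vstack([p_cam, np.ones((1, p_cam.shape[1]))])  # homogeneous (4, M)
    p_world = (T_cam_to_world @ p_cam_h)[:3].T             # P_world = T_i [P_cam, 1]^T
    return p_world                                         # (M, 3)


K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[:3, 3] = [1.0, 0.0, 0.0]           # camera sits 1 m along world x, no rotation

depth = np.full((480, 640), 2.0)
mask = np.zeros((480, 640), dtype=bool)
mask[240, 320] = True                # the principal point
mask[240, 420] = True                # 100 pixels to its right

points = backproject_mask(mask, depth, K, T)
print(points)
```

The pixel at the principal point lies straight ahead of the camera, so it lands 2 m in front of the camera position. The second pixel is offset sideways by (420 - 320) × 2.0 / 500 = 0.4 m.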

Why camera intrinsics are critical: GAS found that incorrect camera intrinsics (focal length, principal point) cause the entire 3D reconstruction to be wrong — objects appear at the wrong scale, depth, or position. GO-SLAM's estimated intrinsics sometimes differ from ground truth, and this was one of the biggest sources of error in their experiments. This is a fundamental challenge in monocular 3D vision: you need to know the camera to interpret the image geometrically.

What degrades and why:
• Noisy depth → bad 3D projection. If GO-SLAM estimates depth=3.0m but true depth is 2.5m, the 3D point is placed 0.5m too far. Over thousands of pixels, the object's point cloud is distorted. For a couch at 3m distance, a 10% depth error shifts it by 30cm — enough to overlap with adjacent furniture and corrupt merging.
• Missed detections → incomplete map. If Faster R-CNN misses a chair in frames 20-40 (confidence below threshold), that viewing angle never contributes points. The chair's point cloud has a "hole" and its footprint on the floor plan is undersized.
• Wrong focal length → systematic scale error. If K says f=500 but true f=525, every back-projected x and y coordinate is stretched by ~5% (525/500). Objects near the image edge are displaced by several centimeters. This compounds across frames because SLAM pose estimation also uses K.
• Fast camera motion → SLAM failure. GO-SLAM needs sufficient frame overlap for tracking. At >30 degrees rotation between frames, tracking fails and the pose graph breaks. GAS subsamples at 3fps, so fast rotations are the enemy.
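
The depth and focal-length figures above can be checked with a few lines of arithmetic. The sketch uses the pinhole relation x = (u - c_x) · d / f for a pixel at the right edge of a 640-pixel-wide image; the 3 m depth and c_x = 320 are illustrative values.

```python
def lateral_offset(u, cx, depth, f):
    """Sideways distance from the optical axis for pixel column u (pinhole model)."""
    return (u - cx) * depth / f


u, cx, depth = 640.0, 320.0, 3.0

# Wrong focal length: K says 500, the true value is 525.
x_assumed = lateral_offset(u, cx, depth, f=500.0)
x_true = lateral_offset(u, cx, depth, f=525.0)
displacement = abs(x_assumed - x_true)

# Noisy depth: a 10% error at 3 m.
depth_shift = 0.10 * depth

print(f"edge pixel displacement from focal error: {displacement * 100:.1f} cm")
print(f"shift from 10% depth error at 3 m: {depth_shift * 100:.0f} cm")
```

At 3 m the edge pixel moves by roughly 9 cm, and the error shrinks to zero toward the image center, which is why the distortion looks like a scale change rather than a uniform shift.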

Stage 4: Merge + Map → floorplan

The same couch appears in dozens of frames. Each frame produces a separate 3D point cloud for it. GAS merges these across frames using IoU-based matching (detailed in Chapter 5), then projects the merged objects down to a 2D floorplan (Chapter 6).

In the depth back-projection equation P_cam = d · K⁻¹ · [u, v, 1]^T, what does K⁻¹ do?

It converts pixel coordinates (u, v) into normalized camera ray directions by removing the effects of focal length and principal point
It rotates the point into world coordinates
It computes the depth value from RGB

Chapter 5: 3D Point Cloud Fusion and Object Merging

A single video frame gives you a partial view of an object — maybe the front of a couch. The next frame shows a slightly different angle. After walking around the room, you've seen the couch from many viewpoints. GAS needs to recognize that all these partial observations are the same object and merge their 3D points into a single, complete point cloud.

The merging algorithm

GAS uses a straightforward but effective approach based on quantized 3D IoU (Intersection over Union):

Voxelization: Discretize 3D space into a grid of small cubes (voxels). Each existing object's point cloud occupies a set of voxels.
New observation: When a new detection produces a 3D point cloud, voxelize it too.
IoU comparison: Compute the volumetric IoU between the new point cloud's voxels and each existing object's voxels. IoU = |intersection| / |union|.
Merge or create: If IoU exceeds a threshold, merge the new points into the existing object. Otherwise, create a new object.

The class label must also match — a "chair" point cloud won't merge with a "table" point cloud regardless of spatial overlap.
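
The four steps plus the class check can be sketched as follows. This is an illustration under assumptions, not the GAS code: objects are plain dictionaries, the defaults are the 5 cm voxel size and 0.3 IoU threshold quoted in this chapter, and picking the best-overlapping candidate (rather than the first one above threshold) is a choice made here.

```python
import numpy as np


def voxelize(points, voxel_size=0.05):
    """Map each 3D point to the integer index of the voxel containing it."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    return set(map(tuple, idx))


def voxel_iou(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def merge_observation(objects, label, points, voxel_size=0.05, iou_threshold=0.3):
    """Merge a new labeled point cloud into `objects`, or append it as new."""
    new_voxels = voxelize(points, voxel_size)
    best, best_iou = None, 0.0
    for obj in objects:
        if obj["label"] != label:      # class labels must match
            continue
        iou = voxel_iou(obj["voxels"], new_voxels)
        if iou > best_iou:
            best, best_iou = obj, iou
    if best is not None and best_iou > iou_threshold:
        best["points"] = np.vstack([best["points"], points])
        best["voxels"] |= new_voxels
    else:
        objects.append({"label": label, "points": points, "voxels": new_voxels})
    return objects


def block(x0, n=10, voxel_size=0.05):
    """Toy observation: one point at the center of each voxel in an n^3 cube."""
    i = np.arange(n)
    grid = np.stack(np.meshgrid(i, i, i, indexing="ij"), axis=-1).reshape(-1, 3)
    return (grid + 0.5) * voxel_size + np.array([x0, 0.0, 0.0])


objects = []
merge_observation(objects, "chair", block(0.0))    # first view: new object
merge_observation(objects, "chair", block(0.25))   # half-overlapping view: merged
merge_observation(objects, "table", block(0.0))    # same place, other class: new
merge_observation(objects, "chair", block(2.0))    # same class, 2 m away: new
print([(o["label"], len(o["voxels"])) for o in objects])
```

The second view overlaps the first in 500 of 1,500 total voxels, an IoU of one third, which is just above the 0.3 threshold. That is the situation the permissive threshold is meant to handle.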

Engineering decisions in merging: The voxel resolution is 5cm — chosen empirically as a balance between precision and noise tolerance. The IoU threshold for merging is 0.3 (fairly permissive) because partial views often overlap by only 30-50% even when they're the same object. The algorithm processes frames sequentially: for each new detection, it computes IoU against all existing objects of the same class. Worst case this is O(K×V) where K is objects seen so far and V is voxels per object. For a typical room with 20 objects and ~5000 voxels each, this is fast (<10ms per frame).

Modern alternatives — feature fields: LERF (Kerr et al., 2023) embeds CLIP features directly into a NeRF, so every point in 3D space has both a position and a semantic feature vector. You can query the scene with text ("where is the red mug?") and get a 3D heatmap. 3D Gaussian Splatting (Kerbl et al., 2023) represents scenes as millions of 3D Gaussians that can be rendered in real-time — and researchers have added semantic features to each Gaussian, creating semantically-aware renderable scenes.

Why IoU-based merging works (and when it doesn't)

IoU works well when objects are spatially separated (a table and a chair 2 meters apart will never have high IoU). It struggles when objects of the same class sit close together, such as two adjacent chairs. GAS keeps per-class object lists and requires both spatial overlap and class match, which stops a chair from merging into a neighboring table, but two same-class neighbors are kept apart only by the IoU threshold and the voxel resolution.

The voxel resolution also matters. Too coarse, and nearby objects blur together. Too fine, and partial observations of the same object don't overlap enough to trigger merging. GAS balances this empirically.

The representation question

GAS stores objects as raw point clouds — unordered sets of (x, y, z) coordinates. This is simple but memory-intensive. Modern approaches offer alternatives:

Neural implicit fields (NeRF, SDF networks): Represent the scene as a continuous function. Memory-efficient for large scenes but slow to query.
3D Gaussians: Each object is a set of oriented 3D Gaussians. Renderable in real-time, easily merged.
Scene graphs: Nodes are objects with features; edges encode spatial relationships ("chair is next to table"). This is what ConceptGraphs builds.

How does GAS determine whether a newly detected object should be merged with an existing one?

It voxelizes both point clouds and computes volumetric IoU; if IoU exceeds a threshold and class labels match, they're merged
It compares RGB color histograms
It uses a neural network to predict similarity

Chapter 6: From 3D to 2D: Generating Floor Plans

A 3D point cloud is powerful but hard for humans or planning algorithms to use directly. A 2D floor plan — a top-down view showing rooms, walls, and labelled objects — is immediately interpretable. GAS converts its 3D semantic map into a 2D floorplan through geometric analysis and projection.

Step 1: RANSAC plane fitting

RANSAC (Random Sample Consensus) is a robust method for fitting models to data with outliers. To find the floor plane, GAS:

Randomly samples 3 points from the scene's point cloud
Fits a plane through them (a plane is defined by 3 non-collinear points)
Counts how many other points lie within a threshold distance of this plane (the "inliers")
Repeats many times and keeps the plane with the most inliers

The floor is typically the largest horizontal plane in the scene. GAS also identifies ceiling and wall planes by looking for planes aligned with the principal axes (x, y, z).
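
The RANSAC loop above fits in a short function. This is a minimal sketch: the 2 cm inlier threshold, the 200 iterations, and the synthetic scene (a noisy floor at y = 0 plus clutter above it) are assumptions made for illustration, not values from the paper.

```python
import numpy as np


def ransac_plane(points, threshold=0.02, iterations=200, seed=0):
    """Return ((normal, offset), inlier_mask) for the plane n.p + offset = 0."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(iterations):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                # collinear sample: no unique plane
            continue
        normal = normal / norm
        offset = -normal @ p0
        distances = np.abs(points @ normal + offset)
        inliers = distances < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, offset)
    return best_plane, best_inliers


rng = np.random.default_rng(1)
floor = np.column_stack([rng.uniform(0, 4, 800),
                         rng.normal(0.0, 0.005, 800),   # y near 0, 5 mm noise
                         rng.uniform(0, 4, 800)])
clutter = np.column_stack([rng.uniform(0, 4, 200),
                           rng.uniform(0.2, 2.0, 200),  # furniture above the floor
                           rng.uniform(0, 4, 200)])
scene = np.vstack([floor, clutter])

(normal, offset), inliers = ransac_plane(scene)
print(np.round(normal, 3), int(inliers.sum()))
```

The recovered normal points along ±y, and the inlier set is dominated by the floor points even though a fifth of the scene is clutter.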

Step 2: Top-down projection

Once the floor plane is identified, GAS projects all object point clouds onto it by dropping the vertical (y) coordinate. Each object becomes a 2D cluster of points on the floor plane. The density of projected points forms a 2D histogram — cells with many points indicate where objects sit on the floor.
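
Dropping y and binning the remaining (x, z) coordinates is a single call to `np.histogram2d`. A minimal sketch, assuming the floor is already aligned with the x-z plane; the 5 cm cell size and the 5 m × 5 m bounds are illustrative choices, and `topdown_density` is a hypothetical helper name.

```python
import numpy as np


def topdown_density(points, cell=0.05, bounds=((0.0, 5.0), (0.0, 5.0))):
    """Project (x, y, z) points onto the x-z plane and count points per cell."""
    (x_min, x_max), (z_min, z_max) = bounds
    x_bins = int(round((x_max - x_min) / cell))
    z_bins = int(round((z_max - z_min) / cell))
    density, _, _ = np.histogram2d(points[:, 0], points[:, 2],
                                   bins=(x_bins, z_bins), range=bounds)
    return density


rng = np.random.default_rng(0)
# A made-up object occupying x in [1.1, 1.9] and z in [3.1, 3.9], any height.
obj = np.column_stack([rng.uniform(1.1, 1.9, 5000),
                       rng.uniform(0.0, 1.0, 5000),
                       rng.uniform(3.1, 3.9, 5000)])

density = topdown_density(obj)
print(density.shape, int(density.sum()), int((density > 0).sum()))
```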

Step 3: Gaussian smoothing + convex hull

Raw projected points are noisy and scattered. GAS applies Gaussian smoothing to the 2D density histogram, creating smooth object regions. Then it computes the convex hull of each object's projected points — the smallest convex polygon enclosing all points. This gives each object a clean, labeled patch on the floorplan.
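
Both operations can be written with NumPy alone. The sketch below uses a separable Gaussian filter and Andrew's monotone-chain convex hull. The sigma value is an assumption, and the source does not say exactly how GAS connects the smoothed density to the hull, so the two steps are shown independently.

```python
import numpy as np


def gaussian_kernel(sigma):
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def smooth(density, sigma=1.5):
    """Separable Gaussian blur: filter along rows, then along columns."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(np.convolve, 1, density, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")


def convex_hull(points_2d):
    """Monotone-chain hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points_2d)))
    if len(pts) <= 2:
        return np.array(pts)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return np.array(lower[:-1] + upper[:-1])


# Smoothing spreads a single occupied cell over its neighbors.
density = np.zeros((21, 21))
density[10, 10] = 1.0
smoothed = smooth(density)

# Hull of a square footprint with two interior points.
footprint = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0],
                      [0.5, 0.5], [0.2, 0.7]])
hull = convex_hull(footprint)
print(round(float(smoothed.max()), 3), hull.tolist())
```

A convex hull overestimates the footprint of L-shaped or hollow objects, which is a known cost of this simple boundary choice.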

Neural floor plan generation: Recent work has explored end-to-end neural approaches to floor plan generation. MonteFloor (Stekovic et al., 2021) uses Monte Carlo Tree Search to generate floor plans from point clouds. Structured3D (Zheng et al., 2020) provides a large-scale synthetic dataset for structured 3D understanding. The BIM (Building Information Modeling) community is pushing toward fully automated conversion of 3D scans to architectural plans — a natural extension of what GAS produces.

The final output

The result is a 2D map where each colored patch represents a detected object with its class label. Walls appear as line segments, rooms as enclosed regions, and furniture as labeled polygons. This is directly usable for robot navigation ("go to the table in the living room") or for generating human-readable floor plans.

Why does GAS apply Gaussian smoothing to the projected 2D point density before computing convex hulls?

Raw projected points are noisy and scattered; smoothing fills gaps and creates coherent object regions for clean boundary extraction
To reduce the file size of the output
To convert from RGB to grayscale

Chapter 7: Fine-Tuning Object Detectors

Off-the-shelf Faster R-CNN, pre-trained on COCO (80 categories, many of them outdoor), performs poorly on indoor scenes. COCO has "chair" but doesn't have "nightstand." It has "couch", but its instances look different from the furniture in indoor scan data. GAS fine-tunes the detector specifically for indoor objects.

Dataset assembly

GAS merged 5 datasets into a unified training set of 12,427 images:

ScanNet: 1,513 real indoor RGB-D scans
Hypersim: Photorealistic synthetic indoor renderings
ADE20K: Dense semantic annotations across diverse scenes
SUNRGBD: Indoor RGB-D images with 3D bounding boxes
Custom augmented data: Additional augmentations for underrepresented classes

Each dataset has its own annotation format, class taxonomy, and labeling conventions. GAS unified them into COCO format — a single JSON file per split with standardized bounding box annotations. This required mapping categories across datasets (e.g., "sofa" in one dataset = "couch" in another) and handling class imbalance.
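
A minimal sketch of the unification step. The alias table, the record layout, and the file names are hypothetical; the only fixed facts used are the COCO conventions themselves (top-level `images`, `annotations`, and `categories` lists, and boxes stored as [x, y, width, height]).

```python
# Hypothetical alias table: source-dataset label -> unified category name.
CATEGORY_ALIASES = {
    "sofa": "couch",
    "couch": "couch",
    "night stand": "nightstand",
    "nightstand": "nightstand",
    "chair": "chair",
}


def to_coco(records, aliases):
    """Convert per-image records with xyxy boxes into one COCO-style dict."""
    names = sorted(set(aliases.values()))
    cat_ids = {name: i + 1 for i, name in enumerate(names)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    ann_id = 1
    for img_id, rec in enumerate(records, start=1):
        coco["images"].append({"id": img_id, "file_name": rec["file_name"],
                               "width": rec["width"], "height": rec["height"]})
        for label, (x1, y1, x2, y2) in rec["objects"]:
            name = aliases.get(label.lower())
            if name is None:           # label outside the unified taxonomy
                continue
            w, h = x2 - x1, y2 - y1
            coco["annotations"].append({
                "id": ann_id, "image_id": img_id, "category_id": cat_ids[name],
                "bbox": [x1, y1, w, h], "area": w * h, "iscrowd": 0,
            })
            ann_id += 1
    return coco


records = [
    {"file_name": "a.jpg", "width": 640, "height": 480,
     "objects": [("Sofa", (10, 20, 210, 120)), ("plant", (0, 0, 5, 5))]},
    {"file_name": "b.jpg", "width": 640, "height": 480,
     "objects": [("couch", (50, 60, 150, 160)), ("night stand", (300, 200, 360, 280))]},
]
coco = to_coco(records, CATEGORY_ALIASES)
print(len(coco["images"]), len(coco["annotations"]), coco["categories"])
```

Both "Sofa" and "couch" end up with the same category id, while the unmapped "plant" label is dropped rather than guessed.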

Freezing layers: why and how

Training details: GAS fine-tuned for 20 epochs on 12,427 images (batch size 4, 1 GPU). Optimizer: SGD with momentum 0.9. Training time: ~6 hours on a single RTX 3090. The 12,427 images break down to ~1,500 from ScanNet, ~3,000 from Hypersim, ~4,000 from ADE20K, ~2,500 from SUNRGBD, ~1,400 augmented. Class distribution before balancing: "chair" had 8,000+ instances, "exit sign" had 127. After oversampling rare classes, the minimum per-class count reached ~500.
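
The oversampling arithmetic can be sketched as follows, using the two class counts quoted above. The source does not describe the exact scheme, so the rule here (repeat each rare class an integer number of times until it reaches the target) is an assumption, and real oversampling repeats whole images rather than individual instances.

```python
import math


def repeat_factors(counts, min_count=500):
    """Integer repeat factor per class so every class reaches min_count."""
    return {name: max(1, math.ceil(min_count / n)) for name, n in counts.items()}


counts = {"chair": 8000, "exit sign": 127}
factors = repeat_factors(counts)
balanced = {name: counts[name] * factors[name] for name in counts}
print(factors, balanced)
```

With these numbers "exit sign" is repeated 4 times to reach 508 instances, while "chair" is left untouched.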

A key finding from GAS's experiments: freezing early layers of the backbone during fine-tuning works best under resource constraints. They tested five configurations:

Config	Frozen layers	Learning rate	Performance
1	None (train all)	0.00025	Overfits quickly
2	ResNet stages 1-2	0.00025	Good, but unstable
3	ResNet stages 1-3	0.00025	Stable but slower convergence
4	None, lower LR	0.0001	Better than 1, still overfits
5	Stages 1-2, lower LR	0.0001	Best overall

Transfer learning theory: Why does freezing early layers work? Early convolutional layers learn general visual features (edges, textures, color gradients) that transfer well across domains. Later layers learn task-specific features. When fine-tuning on a small dataset, allowing early layers to update risks destroying these general features (catastrophic forgetting) without gaining enough from the limited indoor data. The same principle underlies modern PEFT methods like LoRA: rather than freezing only some layers, you keep the whole backbone frozen and train small added adapters. LoRA has been applied to detection transformers (DETR-LoRA) with strong results.

The PEFT connection

GAS's layer-freezing approach is a simple form of parameter-efficient fine-tuning (PEFT). Modern PEFT methods are more sophisticated:

LoRA: Add low-rank matrices to attention layers. Often only ~1% of parameters or fewer are trainable, and performance is frequently close to full fine-tuning.
Adapters: Insert small bottleneck layers between existing layers. The original weights are completely frozen.
Prompt tuning: Learn a set of "soft prompts" prepended to the input while freezing the entire model.
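
To make the LoRA entry concrete, here is a small NumPy sketch of the low-rank update W + B·A on a single weight matrix. The 1024×1024 size and rank 8 are illustrative assumptions. Following the LoRA paper's initialization, B starts at zero, so the adapted layer initially behaves exactly like the frozen one.

```python
import numpy as np


def lora_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters as a fraction of the full weight matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)


rng = np.random.default_rng(0)
d_out, d_in, rank = 1024, 1024, 8

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, rank))                 # trainable, zero init

x = rng.normal(size=d_in)
y_base = W @ x
y_lora = (W + B @ A) @ x                    # identical until B is updated

fraction = lora_fraction(d_out, d_in, rank)
print(f"trainable fraction for this matrix: {fraction:.4%}")
```

For this one matrix the adapter trains 16,384 values against 1,048,576 frozen ones, about 1.6%. The fraction over a whole model is lower still, because most layers receive no adapter at all.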

For detection models specifically, LoRA applied to DETR-style models has shown strong results with minimal compute, and would be a natural upgrade to GAS's fine-tuning approach.

Why did Config 5 (frozen stages 1-2 + lower learning rate) perform best in GAS's fine-tuning experiments?

Freezing early layers preserves general visual features while letting later layers specialize for indoor objects, and the lower learning rate prevents overfitting on the small dataset
It used the most GPU memory
It trained for the most epochs

Chapter 8: Results and Lessons Learned

GAS was evaluated on the Replica dataset, which provides photorealistic synthetic indoor environments with ground truth geometry and semantics. The results reveal both successes and fundamental challenges in vision-based semantic mapping.

What worked

Segmentation masks >> bounding boxes: This was the clearest finding. Using SAM masks instead of raw bounding boxes for depth projection dramatically reduced noise in the 3D object point clouds. Background depth bleeding was virtually eliminated.
RGB-D >> monocular: When real depth is available (from a depth sensor), the pipeline produces far more accurate 3D reconstructions than when relying on GO-SLAM's estimated depth from monocular RGB. Estimated depth has scale ambiguity and per-frame inconsistency.
The modular pipeline works: Each component can be independently upgraded. Swap in a better SLAM system, a better detector, or a better segmentor, and the rest of the pipeline still works.

What didn't work (and why)

Camera intrinsics sensitivity: GO-SLAM sometimes estimates camera intrinsics (focal length) that differ from ground truth. Since K appears in the back-projection equation, wrong intrinsics corrupt every 3D point. This was the single largest source of error.
Object confusion: The fine-tuned Faster R-CNN confused visually similar categories: chairs vs. couches (both are "sit-on" furniture), lamps vs. bowls (similar silhouettes from certain angles), tables vs. desks. More training data and open-vocabulary detectors would help.
Partial observations: Objects seen from only one or two viewpoints have incomplete point clouds, leading to poor footprint estimates on the floor plan.

Modern evaluation metrics: The ScanNet benchmark evaluates 3D semantic understanding on real-world scans. Metrics include 3D mAP (mean Average Precision for 3D instance detection), scene graph quality (are spatial relationships correct?), and reconstruction completeness. GAS uses qualitative evaluation on Replica, which is standard for CS231n projects but wouldn't meet the rigor of a top-venue submission. ConceptGraphs and OpenScene provide more systematic evaluation frameworks for semantic 3D scenes.

The bigger picture

GAS's results echo a pattern across robotics: the integration challenge is as hard as the algorithmic challenge. Each component (SLAM, detection, segmentation) works well in isolation on its standard benchmark. Composing them into an end-to-end system exposes failure modes that no single benchmark captures — depth scale mismatches, coordinate frame confusion, class taxonomy mismatches between detection and segmentation datasets.

What was the single largest source of error in GAS's pipeline?

Incorrect camera intrinsics from GO-SLAM: since K appears in the back-projection equation, wrong focal length corrupts every 3D point
SAM producing bad masks
RANSAC failing to find the floor

Chapter 9: The Future — Open-Vocabulary Semantic SLAM

GAS points toward a future where cameras + foundation models = complete spatial intelligence. The gap between GAS's modular pipeline and the end goal is closing rapidly. Let's look at where the field is heading.

From fixed vocabulary to open world

GAS's biggest limitation is its fixed object vocabulary — 12 indoor categories that must be pre-defined and trained. The future is open-vocabulary: describe any object in natural language and the system finds it in 3D space.

ConceptGraphs (Gu et al., 2023) already does this. It builds 3D scene graphs where each node is an object with an open-vocabulary CLIP embedding. You can query "the red book on the third shelf" and get a 3D location. No fine-tuning, no predefined categories.
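
Querying by embedding boils down to a nearest-neighbor search under cosine similarity. The sketch below is a toy illustration only: the 4-dimensional vectors stand in for real CLIP embeddings, the object list and query vector are made up, and none of this is the ConceptGraphs API.

```python
import numpy as np


def cosine_scores(query, embeddings):
    """Cosine similarity between one query vector and each row of embeddings."""
    q = query / np.linalg.norm(query)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return E @ q


# Hypothetical map: each object has a 3D centroid and a feature vector.
scene_objects = [
    {"name": "couch",     "centroid": (2.3, 0.4, 1.0), "embedding": [0.9, 0.1, 0.0, 0.1]},
    {"name": "bookshelf", "centroid": (2.3, 0.9, 2.5), "embedding": [0.1, 0.8, 0.3, 0.0]},
    {"name": "red book",  "centroid": (2.2, 1.4, 2.5), "embedding": [0.0, 0.5, 0.9, 0.1]},
]
embeddings = np.array([o["embedding"] for o in scene_objects], dtype=float)

# Stand-in for the embedding of a text query such as "the red book".
query = np.array([0.0, 0.4, 1.0, 0.0])

scores = cosine_scores(query, embeddings)
best = scene_objects[int(np.argmax(scores))]
print(best["name"], best["centroid"], np.round(scores, 3))
```

Because the query is compared against features rather than class ids, adding a new object type requires no retraining, only a new embedding in the map.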

From mapping to manipulation

A semantic map is only useful if a robot can act on it. OK-Robot (Liu et al., 2024) combines open-vocabulary 3D mapping with a grasping system: "pick up the mug" triggers a query against the semantic map, navigation to the mug's 3D location, and a grasp plan. SayPlan (Rana et al., 2023) uses LLMs to generate task plans from 3D scene graphs: "set the table for dinner" becomes a sequence of pick-and-place operations grounded in the semantic map.

The convergence — spatial intelligence: The field is converging on a unified vision: one model, one pass, complete 3D understanding. VGGT (CVPR 2025 Best Paper) predicts all 3D geometry from unposed images. Add CLIP features to each 3D point (as LERF does), and you have open-vocabulary 3D understanding without any SLAM loop, without separate detection, without explicit segmentation. The entire GAS pipeline — SLAM, detect, segment, project, merge — could collapse into a single forward pass of a foundation model.

GAS's future work

The paper itself identifies several directions:

Zero-shot learning: Replace Faster R-CNN with Grounding DINO for open-vocabulary detection
VLFM integration: Use vision-language frontier maps for semantic navigation — direct a robot toward semantically interesting unexplored areas
Real-world deployment: Test on real RGB-D cameras (RealSense, Azure Kinect) in actual apartments
Dynamic objects: Handle objects that move between visits (someone moves a chair)

Related Veanors lessons

Explore the components and successors of GAS's approach:

DUSt3R — Direct 3D reconstruction from image pairs (replaces SLAM)
MASt3R-SLAM — Real-time dense SLAM from learned priors
VGGT — All 3D geometry in one forward pass
SAM 2 — Segment anything in images and video
Faster R-CNN — Region Proposal Networks for object detection
DETR — Transformer-based end-to-end detection
3D Gaussian Splatting — Real-time radiance field rendering

Cheat sheet: the full GAS pipeline equations

1. SLAM: {T_i, D_i, Mesh} = GO-SLAM(I_1:N)

2. Detect: {b_j, c_j, s_j} = FasterRCNN(I_i)

3. Segment: M_j = SAM(I_i, b_j)

4. Project: P_world = T_i · [d · K⁻¹ [u, v, 1]^T, 1]^T ∀ (u,v) ∈ M_j

5. Merge: IoU_3D(O_k, O_new) > τ ⇒ O_k ∪= O_new

6. Floor: RANSAC → plane ⇒ project(O_k) → smooth → hull

What is the key advantage of ConceptGraphs over GAS for semantic mapping?

ConceptGraphs uses open-vocabulary CLIP embeddings, so you can query for any object using natural language, with no predefined categories or fine-tuning
ConceptGraphs runs faster
ConceptGraphs uses better cameras