A visual navigation framework that consumes monocular, stereo, or RGB-D video and produces semantically labelled 2D and 3D maps — fusing neural SLAM, object detection, segmentation, and geometric projection into a single pipeline.
Imagine you drop a robot into an unfamiliar apartment. It has a camera. It can move around. But it has no map, no GPS, and no floorplan. To do anything useful — "go to the kitchen and grab the mug" — it needs two things: where am I? and what is around me?
Outdoor mapping is largely solved. We have satellite imagery, LiDAR-equipped cars, and decades of GPS infrastructure. Indoors, none of that exists. Walls block signals. Rooms change — someone moves a chair, stacks boxes, opens a door. Every indoor space is a unique, dynamic, GPS-denied puzzle.
LiDAR gives beautiful 3D scans, but it costs thousands of dollars, adds significant weight, and captures only geometry: it can't tell you what something is. A $5 RGB camera, on the other hand, captures appearance, texture, color, and all the cues humans use to recognize objects. The challenge is extracting 3D structure and semantic meaning from flat 2D images.
A semantic map doesn't just know "there's a surface 2.3 meters ahead." It knows "there's a couch 2.3 meters ahead, and behind it is a bookshelf." This is the difference between a 3D point cloud and a map a robot (or a person) can actually reason about. The GAS paper tackles exactly this pipeline: video in, semantic floorplan out.
GAS (short for the three authors' initials) processes video through four stages:
Each stage uses an off-the-shelf or fine-tuned model. The genius isn't in any single component — it's in how they're composed and the engineering choices that make the pipeline robust.
Before you can label anything, you need to know where you are and what the world looks like in 3D. This is the job of SLAM — Simultaneous Localization and Mapping. The word "simultaneous" is key: you're building the map and figuring out your position at the same time, each helping the other.
Every SLAM system runs a continuous loop of tracking, mapping, and loop closure.
The tracking step typically matches features (distinctive points) between consecutive frames and solves for the relative camera motion. The mapping step triangulates those features into 3D points. Loop closure is the hardest part — you need to recognize "I've been here before" and then globally adjust all poses to be consistent.
Classical SLAM (ORB-SLAM3, 2021) detects hand-crafted features like ORB keypoints and tracks them with geometric optimization. It's fast, well-understood, and works on a CPU. But it produces sparse point clouds — a scattered set of 3D dots, not a dense surface.
Neural SLAM replaces parts of this pipeline with learned components. DROID-SLAM (Teed & Deng, 2021) learns to estimate dense optical flow between frames, producing much denser reconstructions. It uses differentiable bundle adjustment — the optimization that refines all camera poses and 3D structure jointly — as a layer in the neural network, so gradients flow through the geometric reasoning.
GAS chose GO-SLAM (Zhang et al., 2023), a neural implicit SLAM system that represents the scene as a neural radiance field (similar to NeRF). GO-SLAM performs global optimization, meaning it refines all camera poses jointly rather than greedily fixing each pose as it goes. This produces more consistent 3D reconstructions, especially in challenging environments with loops or revisited areas.
GO-SLAM outputs three things that GAS needs downstream: camera poses, depth maps, and a mesh.
The camera poses and depth maps are essential for projecting 2D detections into 3D space. The mesh provides the overall spatial structure.
| System | Year | Approach | Output density |
|---|---|---|---|
| ORB-SLAM3 | 2021 | Feature-based, CPU | Sparse points |
| DROID-SLAM | 2021 | Learned optical flow + diff. BA | Dense |
| GO-SLAM | 2023 | Neural implicit + global opt. | Dense mesh |
| DUSt3R | 2024 | Direct pointmap prediction | Dense |
| Gaussian SLAM | 2024 | 3D Gaussians as map representation | Dense + renderable |
| MASt3R-SLAM | 2025 | Learned 3D priors, real-time | Dense |
| VGGT | 2025 | Feed-forward transformer, no iterative optimization | Dense |
With the camera poses and 3D structure recovered by SLAM, the next question is: what objects are in each frame? This is the job of an object detector — a model that takes an image and outputs a set of bounding boxes, each with a class label and confidence score.
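GAS uses Faster R-CNN (Ren et al., 2015), the workhorse of two-stage detection. The architecture has two components: a Region Proposal Network (RPN) that proposes candidate object regions, and a detection head that classifies each proposal and refines its bounding box.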
GAS specifically uses the Detectron2 implementation from Meta AI, fine-tuned on indoor object datasets. Detectron2 provides a modular framework where you can swap backbones (ResNet-50, ResNet-101, etc.), adjust anchor sizes, and configure training schedules.
A 3D reconstruction without semantics is just geometry — surfaces and volumes with no meaning. Detection gives each region an identity: "this cluster of 3D points is a table, that one is a chair." But bounding boxes are crude rectangles that include background pixels. To project objects accurately into 3D, we need tighter boundaries. That's where segmentation comes in (Chapter 3).
| Detector | Year | Type | Key innovation |
|---|---|---|---|
| Faster R-CNN | 2015 | Two-stage, closed-vocab | Region Proposal Networks |
| YOLO | 2015 | Single-shot, closed-vocab | Detection as regression, real-time |
| DETR | 2020 | Transformer, closed-vocab | Set prediction, no NMS/anchors |
| OWL-ViT | 2022 | Open-vocab | Vision-language contrastive detection |
| Grounding DINO | 2023 | Open-vocab | Text-grounded detection, any category |
| YOLO-World | 2024 | Open-vocab, real-time | Vision-language YOLO at 52 FPS |
A bounding box says "the object is somewhere in this rectangle." A segmentation mask says "these exact pixels belong to the object." For 3D projection, this distinction is critical.
Consider a lamp on a table. The bounding box around the lamp includes the table surface behind it. When you project bounding box pixels into 3D using depth, you get a cloud of 3D points that includes both the lamp and the table. The lamp's point cloud "bleeds" into the table's surface, corrupting both the lamp's estimated position and its shape. A tight segmentation mask captures only lamp pixels, producing a clean 3D object.
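The bleeding effect can be seen with a few lines of arithmetic. The sketch below uses a made-up 4×4 depth patch (a lamp at 1.0 m in front of a table at 1.5 m) and compares the depths collected from the full bounding box with those collected from a lamp-only mask. The values and the mask are illustrative assumptions, not data from GAS.

```python
# Toy depth patch in meters: the area covered by the lamp's bounding box.
# Values are invented for illustration only.
LAMP, TABLE = 1.0, 1.5
depth = [
    [TABLE, LAMP,  LAMP, TABLE],
    [TABLE, LAMP,  LAMP, TABLE],
    [TABLE, TABLE, LAMP, TABLE],
    [TABLE, TABLE, LAMP, TABLE],
]

# Hypothetical segmentation mask: True only where the pixel belongs to the lamp.
mask = [[d == LAMP for d in row] for row in depth]


def mean(values):
    return sum(values) / len(values)


# Bounding-box projection uses every pixel in the rectangle.
box_depths = [d for row in depth for d in row]

# Mask-based projection keeps only pixels flagged as lamp.
mask_depths = [
    d
    for depth_row, mask_row in zip(depth, mask)
    for d, is_lamp in zip(depth_row, mask_row)
    if is_lamp
]

box_mean = mean(box_depths)      # pulled toward the table
mask_mean = mean(mask_depths)    # lamp only
background_pixels = len(box_depths) - len(mask_depths)

print(f"box mean depth:  {box_mean:.4f} m ({background_pixels} background pixels)")
print(f"mask mean depth: {mask_mean:.4f} m")
```

In this toy patch, 10 of the 16 box pixels are table, so the box-based depth estimate lands between the two objects rather than on the lamp.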
GAS demonstrated this empirically: switching from bounding box projection to mask-based projection produced dramatically cleaner 3D maps with less object confusion.
GAS uses Meta's Segment Anything Model (SAM, Kirillov et al., 2023), a foundation model for segmentation. SAM was trained on 11 million images with over 1 billion masks. Given a prompt (a point, a box, or a coarse mask; the paper also explores text prompts, but the released model does not accept them), SAM outputs a precise segmentation mask. GAS feeds each Faster R-CNN bounding box to SAM as a box prompt, and SAM returns a tight mask for the object inside.
This is a clever division of labor: Faster R-CNN knows what objects exist and roughly where they are (box + label), and SAM knows exactly which pixels belong to each object (mask). Neither alone is sufficient — SAM needs a prompt to know where to segment, and Faster R-CNN needs SAM for pixel-precise boundaries.
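Two related but different tasks are in play: semantic segmentation assigns a class label to every pixel without separating individual objects, while instance segmentation gives each object its own mask.
GAS needs instance segmentation: two couches in the same frame must stay separate objects, and the couch in frame 47 must eventually be matched to the same couch in frame 92. SAM provides the instance-level masks. The cross-frame association (matching objects across frames) is handled later via IoU-based merging (Chapter 5).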
Now let's put all the pieces together. This is the showcase chapter — the full four-stage pipeline from raw video to semantic floorplan.
GO-SLAM processes the input video (monocular, stereo, or RGB-D) and outputs camera poses, depth maps, and a scene mesh.
Faster R-CNN (fine-tuned Detectron2) runs on every N-th frame (GAS subsamples to reduce computation). For each frame, it outputs a set of bounding boxes $\{b_j\}$, class labels $\{c_j\}$, and confidence scores $\{s_j\}$.
This is the mathematical heart of the pipeline. For each detected bounding box:
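The back-projection math is the key equation. Given a pixel $(u, v)$ with depth $d$, camera intrinsic matrix $K$, and camera pose $T_i$:

$$\mathbf{p}_{\text{cam}} = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

This gives the 3D point in camera coordinates. The camera intrinsic matrix $K$ encodes focal lengths $(f_x, f_y)$ and principal point $(c_x, c_y)$:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

To transform from camera coordinates to world coordinates, multiply by the camera pose:

$$\begin{bmatrix} \mathbf{p}_{\text{world}} \\ 1 \end{bmatrix} = T_i \begin{bmatrix} \mathbf{p}_{\text{cam}} \\ 1 \end{bmatrix}$$

where $T_i$ is the 4×4 transformation matrix (rotation + translation) from GO-SLAM. The result is a set of 3D points in world coordinates for each detected object.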
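The back-projection and pose transform described above can be sketched in plain Python. This is a minimal illustration rather than GAS's code: it assumes a zero-skew pinhole camera, depth measured along the optical axis, and a camera-to-world pose. The function names and the `mask_pixels` / `depth_map` layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, d: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Pixel (u, v) with depth d -> 3D point in camera coordinates.

    Equivalent to d * K^-1 * [u, v, 1]^T for a zero-skew pinhole K
    (assumption: d is depth along the optical axis).
    """
    return ((u - cx) * d / fx, (v - cy) * d / fy, d)


def camera_to_world(p_cam: Point3, T: Sequence[Sequence[float]]) -> Point3:
    """Apply a 4x4 camera-to-world pose T to a camera-frame point."""
    x, y, z = p_cam
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3]
                 for r in range(3))


def project_mask(mask_pixels: Sequence[Tuple[int, int]],
                 depth_map: Sequence[Sequence[float]],
                 intrinsics: Tuple[float, float, float, float],
                 T: Sequence[Sequence[float]]) -> List[Point3]:
    """Back-project every masked pixel with valid depth into world space."""
    fx, fy, cx, cy = intrinsics
    points = []
    for u, v in mask_pixels:
        d = depth_map[v][u]          # row index is v, column index is u
        if d <= 0:                   # skip pixels with missing depth
            continue
        p_cam = backproject_pixel(u, v, d, fx, fy, cx, cy)
        points.append(camera_to_world(p_cam, T))
    return points


# Usage with made-up numbers: a pixel 100 px right of the principal point,
# 2 m away, seen by a camera translated 1 m along world x (no rotation).
p_cam = backproject_pixel(420, 240, 2.0, fx=500, fy=500, cx=320, cy=240)
T_i = [[1, 0, 0, 1.0],
       [0, 1, 0, 0.0],
       [0, 0, 1, 0.0],
       [0, 0, 0, 1.0]]
p_world = camera_to_world(p_cam, T_i)
print(p_cam, p_world)
```

Because the intrinsic matrix has a simple closed-form inverse, the sketch never builds $K^{-1}$ explicitly; it just subtracts the principal point and divides by the focal length.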
The same couch appears in dozens of frames. Each frame produces a separate 3D point cloud for it. GAS merges these across frames using IoU-based matching (detailed in Chapter 5), then projects the merged objects down to a 2D floorplan (Chapter 6).
A single video frame gives you a partial view of an object — maybe the front of a couch. The next frame shows a slightly different angle. After walking around the room, you've seen the couch from many viewpoints. GAS needs to recognize that all these partial observations are the same object and merge their 3D points into a single, complete point cloud.
GAS uses a straightforward but effective approach based on quantized 3D IoU (Intersection over Union):
The class label must also match — a "chair" point cloud won't merge with a "table" point cloud regardless of spatial overlap.
IoU works well when objects are spatially separated (a table and a chair 2 meters apart will never have high IoU). It struggles when objects of the same class sit close together, two adjacent chairs for instance. GAS limits the damage by keeping per-class object lists and requiring both spatial overlap and a class match, so at least objects of different classes never merge.
The voxel resolution also matters. Too coarse, and nearby objects blur together. Too fine, and partial observations of the same object don't overlap enough to trigger merging. GAS balances this empirically.
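A minimal sketch of voxel-quantized IoU merging with per-class object lists follows. It illustrates the general idea, not GAS's implementation: the voxel size, the IoU threshold, and the function names are all assumptions chosen for the example.

```python
import math
from typing import Dict, List, Set, Tuple

Point3 = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize(points: List[Point3], voxel_size: float) -> Set[Voxel]:
    """Quantize points to the set of voxel indices they occupy."""
    return {(math.floor(x / voxel_size),
             math.floor(y / voxel_size),
             math.floor(z / voxel_size)) for x, y, z in points}


def voxel_iou(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Intersection over union of two occupied-voxel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def merge_observation(objects: Dict[str, List[List[Point3]]], label: str,
                      points: List[Point3], voxel_size: float = 0.5,
                      iou_threshold: float = 0.3) -> int:
    """Merge one observation into the best-overlapping object of the same
    class, or start a new object. Returns the object's index in its class list.
    The default voxel size and threshold are illustrative, not GAS's values."""
    new_voxels = voxelize(points, voxel_size)
    candidates = objects.setdefault(label, [])   # per-class object list
    best_idx, best_iou = -1, 0.0
    for idx, existing in enumerate(candidates):
        iou = voxel_iou(voxelize(existing, voxel_size), new_voxels)
        if iou > best_iou:
            best_idx, best_iou = idx, iou
    if best_idx >= 0 and best_iou >= iou_threshold:
        candidates[best_idx].extend(points)      # same object seen again
        return best_idx
    candidates.append(list(points))              # first sighting
    return len(candidates) - 1


# Two overlapping partial views of one couch, plus a distant couch.
view_a = [(0.25, 0.25, 0.25), (0.75, 0.25, 0.25), (1.25, 0.25, 0.25)]
view_b = [(0.75, 0.25, 0.25), (1.25, 0.25, 0.25), (1.75, 0.25, 0.25)]
far_couch = [(5.25, 0.25, 0.25)]

overlap = voxel_iou(voxelize(view_a, 0.5), voxelize(view_b, 0.5))
objects: Dict[str, List[List[Point3]]] = {}
merge_observation(objects, "couch", view_a)
merge_observation(objects, "couch", view_b)     # merges into the first couch
merge_observation(objects, "couch", far_couch)  # no overlap: new object
merge_observation(objects, "chair", view_a)     # different class: never merges
print(overlap, {k: len(v) for k, v in objects.items()})
```

Shrinking `voxel_size` in this example makes the two couch views stop sharing voxels unless their points coincide more closely, which is the trade-off described above.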
GAS stores objects as raw point clouds — unordered sets of (x, y, z) coordinates. This is simple but memory-intensive. Modern approaches offer alternatives:
A 3D point cloud is powerful but hard for humans or planning algorithms to use directly. A 2D floor plan — a top-down view showing rooms, walls, and labelled objects — is immediately interpretable. GAS converts its 3D semantic map into a 2D floorplan through geometric analysis and projection.
RANSAC (Random Sample Consensus) is a robust method for fitting models to data with outliers. To find the floor plane, GAS:
The floor is typically the largest horizontal plane in the scene. GAS also identifies ceiling and wall planes by looking for planes aligned with the principal axes (x, y, z).
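The following sketch shows one common way to run RANSAC for a floor plane: sample three points, fit a plane, count inliers, and keep the best near-horizontal candidate. It is a generic textbook version, not GAS's code. The iteration count, distance threshold, and the assumption that y is the vertical axis are all illustrative.

```python
import math
import random


def plane_from_points(p1, p2, p3):
    """Plane (a, b, c, d) with unit normal through three points,
    or None if the points are (nearly) collinear."""
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    if norm < 1e-9:
        return None
    nx, ny, nz = nx / norm, ny / norm, nz / norm
    return nx, ny, nz, -(nx * p1[0] + ny * p1[1] + nz * p1[2])


def ransac_floor(points, iterations=200, threshold=0.02,
                 min_vertical=0.9, seed=0):
    """Return (plane, inliers) for the largest near-horizontal plane.
    Assumes y is the vertical axis; all defaults are illustrative."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None or abs(plane[1]) < min_vertical:
            continue  # degenerate sample, or plane is not horizontal enough
        a, b, c, d = plane
        inliers = [p for p in points
                   if abs(a * p[0] + b * p[1] + c * p[2] + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Synthetic scene: a 10 x 10 grid of floor points at y = 0 plus 20 clutter
# points floating at least 0.5 m above the floor.
floor = [(float(x), 0.0, float(z)) for x in range(10) for z in range(10)]
clutter = [(0.5 * i % 9, 0.5 + 0.1 * i, 0.3 * i % 9) for i in range(20)]
plane, inliers = ransac_floor(floor + clutter)
print(plane, len(inliers))
```

The `min_vertical` check is what restricts the search to horizontal planes. Dropping it would return the largest plane of any orientation, which in a narrow corridor could be a wall.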
Once the floor plane is identified, GAS projects all object point clouds onto it by dropping the vertical (y) coordinate. Each object becomes a 2D cluster of points on the floor plane. The density of projected points forms a 2D histogram — cells with many points indicate where objects sit on the floor.
Raw projected points are noisy and scattered. GAS applies Gaussian smoothing to the 2D density histogram, creating smooth object regions. Then it computes the convex hull of each object's projected points — the smallest convex polygon enclosing all points. This gives each object a clean, labeled patch on the floorplan.
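A compact sketch of the projection and outline steps is shown below: drop the vertical coordinate, bin the projected points into a 2D density histogram, and compute a convex hull with Andrew's monotone chain algorithm. Gaussian smoothing is omitted to keep the example dependency-free. The cell size and helper names are assumptions, and the hull algorithm is a standard choice rather than the one GAS is known to use.

```python
import math
from collections import Counter


def project_to_floor(points):
    """Drop the vertical (y) coordinate: (x, y, z) -> (x, z)."""
    return [(x, z) for x, _, z in points]


def density_histogram(points_2d, cell=0.5):
    """Count projected points per floor cell (cell size is illustrative)."""
    return Counter((math.floor(x / cell), math.floor(z / cell))
                   for x, z in points_2d)


def convex_hull(points_2d):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points_2d))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]  # endpoints are shared, so drop one copy


# A made-up table: four top corners plus two points on its central leg.
table_points = [
    (0.0, 0.7, 0.0), (1.0, 0.7, 0.0), (1.0, 0.7, 2.0), (0.0, 0.7, 2.0),
    (0.5, 0.4, 1.0), (0.5, 0.0, 1.0),
]
footprint = project_to_floor(table_points)
histogram = density_histogram(footprint)
outline = convex_hull(footprint)
print(outline, histogram[(1, 2)])
```

The two leg points land in the same floor cell, so that cell's count is higher than the corners'. The hull ignores them because they fall inside the table's footprint.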
The result is a 2D map where each colored patch represents a detected object with its class label. Walls appear as line segments, rooms as enclosed regions, and furniture as labeled polygons. This is directly usable for robot navigation ("go to the table in the living room") or for generating human-readable floor plans.
Off-the-shelf Faster R-CNN, pre-trained on COCO (80 general-purpose categories, many of them outdoor), performs poorly on indoor scenes. COCO has "chair" but no "nightstand." It has "couch," but the COCO instances look different from the furniture in the indoor scenes GAS targets. GAS fine-tunes the detector specifically for indoor objects.
GAS merged 5 datasets into a unified training set of 12,427 images:
Each dataset has its own annotation format, class taxonomy, and labeling conventions. GAS unified them into COCO format — a single JSON file per split with standardized bounding box annotations. This required mapping categories across datasets (e.g., "sofa" in one dataset = "couch" in another) and handling class imbalance.
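The unification step can be sketched as follows. The input layout (a list of images, each with `objects` as `(label, bbox)` pairs) and the alias table are hypothetical stand-ins for the real dataset formats. The output follows the standard COCO detection layout with `images`, `annotations`, and `categories`. Class imbalance handling is not shown.

```python
import json


def merge_to_coco(datasets, aliases):
    """Merge several datasets into one COCO-style dict.

    `datasets` is a list of datasets, each a list of image records with
    'file_name', 'width', 'height', and 'objects' = [(label, (x, y, w, h))].
    `aliases` maps dataset-specific labels to a canonical category name.
    Both input structures are assumptions made for this sketch.
    """
    coco = {"images": [], "annotations": [], "categories": []}
    category_ids = {}
    for dataset in datasets:
        for image in dataset:
            image_id = len(coco["images"]) + 1   # fresh id across datasets
            coco["images"].append({
                "id": image_id,
                "file_name": image["file_name"],
                "width": image["width"],
                "height": image["height"],
            })
            for label, (x, y, w, h) in image["objects"]:
                name = aliases.get(label.lower(), label.lower())
                if name not in category_ids:
                    category_ids[name] = len(category_ids) + 1
                    coco["categories"].append(
                        {"id": category_ids[name], "name": name})
                coco["annotations"].append({
                    "id": len(coco["annotations"]) + 1,
                    "image_id": image_id,
                    "category_id": category_ids[name],
                    "bbox": [x, y, w, h],        # COCO uses [x, y, width, height]
                    "area": w * h,
                    "iscrowd": 0,
                })
    return coco


dataset_a = [{"file_name": "a/0001.jpg", "width": 640, "height": 480,
              "objects": [("sofa", (10, 20, 200, 100))]}]
dataset_b = [{"file_name": "b/0001.jpg", "width": 640, "height": 480,
              "objects": [("Couch", (50, 60, 180, 90)),
                          ("nightstand", (300, 200, 60, 80))]}]

merged = merge_to_coco([dataset_a, dataset_b], aliases={"sofa": "couch"})
coco_json = json.dumps(merged, indent=2)  # one JSON file per split
print([c["name"] for c in merged["categories"]])
```

Here "sofa" and "Couch" collapse into a single "couch" category, which is the kind of cross-dataset mapping described above.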
A key finding from GAS's experiments: freezing early layers of the backbone during fine-tuning works best under resource constraints. They tested five configurations:
| Config | Frozen layers | Learning rate | Performance |
|---|---|---|---|
| 1 | None (train all) | 0.00025 | Overfits quickly |
| 2 | ResNet stages 1-2 | 0.00025 | Good, but unstable |
| 3 | ResNet stages 1-3 | 0.00025 | Stable but slower convergence |
| 4 | None (train all) | 0.0001 | Better than 1, still overfits |
| 5 | ResNet stages 1-2 | 0.0001 | Best overall |
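
As a rough sketch, the five configurations in the table could be expressed as override lists for Detectron2's config system. `MODEL.BACKBONE.FREEZE_AT` and `SOLVER.BASE_LR` are real Detectron2 config keys, but the mapping from "stages 1-2" to `FREEZE_AT = 2` is an assumption (Detectron2 counts the stem as the first stage), and the paper's exact settings may differ.

```python
# Table rows as data. freeze_at follows Detectron2's convention, where 0 means
# nothing is frozen and 2 freezes the stem plus the first residual stage.
# The mapping from the table's "stages 1-2" wording is an assumption.
CONFIGS = {
    1: {"freeze_at": 0, "base_lr": 0.00025},
    2: {"freeze_at": 2, "base_lr": 0.00025},
    3: {"freeze_at": 3, "base_lr": 0.00025},
    4: {"freeze_at": 0, "base_lr": 0.0001},
    5: {"freeze_at": 2, "base_lr": 0.0001},
}


def detectron2_overrides(config_id):
    """Alternating key/value list in the form cfg.merge_from_list expects."""
    config = CONFIGS[config_id]
    return [
        "MODEL.BACKBONE.FREEZE_AT", config["freeze_at"],
        "SOLVER.BASE_LR", config["base_lr"],
    ]


best = detectron2_overrides(5)
print(best)

# With Detectron2 installed, the overrides would be applied roughly like this:
#   from detectron2.config import get_cfg
#   cfg = get_cfg()
#   cfg.merge_from_list(best)
```

Keeping the configurations as data makes it easy to loop over all five in a sweep while changing only the two settings the table varies.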
GAS's layer-freezing approach is a simple form of parameter-efficient fine-tuning (PEFT). Modern PEFT methods are more sophisticated:
For detection models specifically, LoRA applied to DETR-style models has shown strong results with minimal compute, and would be a natural upgrade to GAS's fine-tuning approach.
GAS evaluated on the Replica dataset — photorealistic synthetic indoor environments with ground truth geometry and semantics. The results reveal both successes and fundamental challenges in vision-based semantic mapping.
GAS's results echo a pattern across robotics: the integration challenge is as hard as the algorithmic challenge. Each component (SLAM, detection, segmentation) works well in isolation on its standard benchmark. Composing them into an end-to-end system exposes failure modes that no single benchmark captures — depth scale mismatches, coordinate frame confusion, class taxonomy mismatches between detection and segmentation datasets.
GAS points toward a future where cameras + foundation models = complete spatial intelligence. The gap between GAS's modular pipeline and the end goal is closing rapidly. Let's look at where the field is heading.
GAS's biggest limitation is its fixed object vocabulary — 12 indoor categories that must be pre-defined and trained. The future is open-vocabulary: describe any object in natural language and the system finds it in 3D space.
ConceptGraphs (Gu et al., 2023) already does this. It builds 3D scene graphs where each node is an object with an open-vocabulary CLIP embedding. You can query "the red book on the third shelf" and get a 3D location. No fine-tuning, no predefined categories.
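The query step behind this kind of system reduces to nearest-neighbor search over embeddings. The sketch below uses made-up 3-dimensional vectors in place of real CLIP embeddings and a hypothetical `query_scene` helper. It illustrates the retrieval idea only and is not ConceptGraphs' actual API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_scene(objects, query_embedding, top_k=1):
    """Rank mapped objects by similarity to the query embedding."""
    ranked = sorted(
        objects,
        key=lambda obj: cosine_similarity(obj["embedding"], query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Toy scene: each object has a 3D position and a stand-in embedding.
# A real system would store high-dimensional CLIP image embeddings and
# embed the text query with the matching CLIP text encoder.
scene = [
    {"label": "book",  "position": (1.2, 0.9, 3.0), "embedding": (0.9, 0.1, 0.0)},
    {"label": "mug",   "position": (0.4, 0.8, 1.1), "embedding": (0.1, 0.9, 0.1)},
    {"label": "couch", "position": (2.5, 0.4, 0.6), "embedding": (0.0, 0.2, 0.9)},
]
query = (0.8, 0.2, 0.1)  # pretend this encodes the text "a book"

match = query_scene(scene, query)[0]
print(match["label"], match["position"])
```

Because the ranking depends only on embedding similarity, adding a new object category requires no retraining, which is the practical advantage over a fixed vocabulary.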
A semantic map is only useful if a robot can act on it. OK-Robot (Liu et al., 2024) combines open-vocabulary 3D mapping with a grasping system: "pick up the mug" triggers a query against the semantic map, navigation to the mug's 3D location, and a grasp plan. SayPlan (Rana et al., 2023) uses LLMs to generate task plans from 3D scene graphs: "set the table for dinner" becomes a sequence of pick-and-place operations grounded in the semantic map.
The paper itself identifies several directions: