Demo: the finished pipeline running on a 55-second phone walkthrough. Recorded at final-run parameters (Run 10).
Overview: the experiment in 28 seconds (stack, data flow, the ten-run road, the six final artifacts).
Chapter 0

TL;DR

We read the GAS v2 Veanor, decided we wanted a working implementation, and built it on Modal in an afternoon. Input: a 55-second phone walkthrough. Output: a 3D scene graph with 26 labeled objects, a first-person POV video with mask and depth overlays, a 2D floorplan, and a triangle mesh of the point cloud. No fine-tuning, no camera calibration, no local GPU.

Runs: 10 (including the ones that crashed)
Frames processed: 64 (sampled from 1,644)
Objects: 26 (after dedup merge)
Artifacts: 6 (rrd + ply + obj + mp4 + json + png)
GPU: A10G (24 GB, ~$1.10/hr)
Total GPU cost: < $2 (across all ten runs)

💡 The meta-lesson

Stacking four foundation models sounds like "four API calls." It's not. Each model has its own dtype convention, its own memory profile, its own install-time trap, and its own recent API rename. The paper describes a pipeline; the runnable version is the pipeline plus the six hours of version wrangling between them. This article is that six hours.

If you want the final code, it lives at gas2/app.py in the repo — one file, ~500 lines, deployable with modal run. If you want to understand how it got there, read on.

ℹ What this article assumes

You've seen the GAS v2 Veanor or are comfortable with: ViTs, monocular depth, open-vocabulary detection, mask tracking. We don't re-derive the models themselves — we derive the integration. The output of every chapter is either a code diff, a failure mode with an explanation, or a piece of math you need to size things correctly.

Chapter 1

Why this problem

The thing we want is simple to state: I point my phone at a room, record a walkthrough, and get back a labeled 3D map. "There's a chair at (1.2, 0, 0.5). There's a door 2 meters north. There's a bookshelf along the east wall." Queryable in natural language. Works on any room I haven't pre-scanned.

The 2024 paper — "GAS" — did this with the classical recipe: SLAM for 3D, Faster R-CNN for detection, SAM for masks, glue code for fusion. That works, but "works" is carrying weight. To fine-tune Faster R-CNN for the 10 object classes in GAS, the authors assembled 12,000 images across 5 datasets and converted everything to COCO format. Add a new class? Restart the data pipeline.

The 2025 thesis behind GAS v2 is: every painful step above has been obviated by a foundation model that shipped in the last 18 months. Zero-shot open-vocabulary detection (Grounding DINO) eliminates the fine-tuning. One-shot monocular geometry (VGGT) eliminates the calibration step. Video-native mask tracking with a memory bank (SAM 2) eliminates brittle frame-by-frame IoU matching. You describe what you want to find, point the camera, and the pipeline runs.

The paper is a good read. But a paper's claim is never the same thing as "a script I can run on my phone video." The gap between them is where Mirdan lives. We set ourselves a weekend budget and went looking for the bill of materials.

💡 The implicit target

Every system-build piece should set a concrete acceptance criterion before you touch code. Ours: given an arbitrary phone video of a room, produce a 3D scene graph with correctly labeled objects, their approximate positions, and a viewable POV video — in under ten minutes of wall-clock time on a cloud GPU, with under $5 total cost. Everything that follows is in service of that target, or a pivot forced by reality.

Chapter 2

Why not a Mac

The first instinct for a weekend project is to run it on your laptop. Apple Silicon has respectable GPU performance and MPS is reasonably mature now. So let's ask the obvious question: does this stack run on an M-series Mac?

Three of the four models have at least one piece that refuses MPS:

  • Grounding DINO (official repo) ships a custom CUDA kernel for MultiScaleDeformableAttention, the core op of deformable attention. It compiles at pip install time against the installed CUDA toolchain. There is no MPS backend. Without this op, the model doesn't run. Period.
  • VGGT is written as PyTorch modules (no custom kernels), but in practice the released checkpoints, their mixed-precision regime (bf16), and the attention shapes they produce are tuned for CUDA. MPS will "work" in the sense that you can load the model, but inference on a 64-frame batch is orders of magnitude slower — and that's before we discuss bf16 support gaps.
  • SAM 2 has optional fused CUDA ops for connected-components post-processing. These are technically disable-able, but you lose small gains; the memory-bank logic itself runs anywhere PyTorch runs, so this is the least-blocking of the three.

(There is a way to run Grounding DINO on CPU/MPS: use the HuggingFace transformers port, which reimplements MSDA in pure PyTorch. We'll come back to this — it turns out to be load-bearing on Modal too, for a different reason. Chapter 6.)

So: we need a CUDA machine. The options, ranked by how close "dev loop" is to "nothing to set up":

Provider | Billing | Dev loop | Verdict
Modal | Per-second, no minimum | modal run app.py, no Docker, no k8s, no idle servers | Chosen. Python-native.
Colab Pro+ | Monthly, unit credits | Fastest to try one thing; A100s are flaky and you can't persist 50 GB of weights | Fine for a quick poke, not for iteration
Runpod / Lambda / Vast.ai | Per-hour, cheaper | You manage the box; ssh in, scp, manage Docker yourself | Right if you're doing a long batch job
GCP / AWS | Per-hour + reserved | Quota requests for A100s; VPC configuration; IAM | Overkill. Skip unless you have credits.

Modal's pitch specifically: you write a Python function, add one decorator, and that function now runs on an A10G (24 GB, $1.10/hr). You can mount a persistent volume, so you download VGGT's 5 GB checkpoint once and keep it mounted across future runs. The container stays warm for five minutes after your last call, so the second modal run starts in seconds, not minutes. A typical iteration of "edit code, run, see output" costs a few cents.

💡 The billing granularity is the dev loop

Everything else about cloud GPU providers is downstream of one number: what's the smallest unit of time you pay for? If you pay by the hour, you keep the box up all day and your "debug cycle" is "write a script and run a test". If you pay by the second, your debug cycle is the same as local — hit run, see output, fix, hit run again — and the cost scales with your wall-clock, not with your idleness. For a Mirdan-style experimental build-log where you break things twelve times, per-second is orders of magnitude cheaper.

Chapter 3

The stack, from first principles

Four foundation models, each doing one thing, pipelined.

Figure: Pipeline data flow (architecture)
  RGB frames (phone video · 64 sampled from 1644 · 280×504)
    → VGGT (world_points, poses, depth)
    → Grounding DINO (open-vocab boxes · keyframes)
    → SAM 2 (mask tracking · memory bank)
  → 3D lift + dedup (world_points[mask] · label+proximity merge · 10/90 percentile AABB)
  → scene graph (gas2.json) · floorplan (floorplan.png) · point cloud + mesh (scene.ply · scene.obj) · POV + Rerun (pov.mp4 · gas2.rrd)

RGB frames split three ways: VGGT recovers geometry, Grounding DINO detects objects on keyframes, SAM 2 propagates masks across the full clip. The three outputs converge on a 3D lifting + dedup step that produces the scene graph, which fans out to six artifacts.

3.1 — VGGT: one forward pass, three geometric outputs

VGGT (Visual Geometry Grounded Transformer, Meta, 2025) takes a batch of N RGB frames and returns, in a single forward pass: camera extrinsics, camera intrinsics, per-pixel depth, and per-pixel world points, all jointly-consistent. You can think of it as the feed-forward replacement for SfM + monocular depth + pose optimization, packaged as a ViT.

For our purposes, the most useful output is world_points: a tensor of shape [B, N, H, W, 3] where each pixel of each frame has a 3D coordinate in a shared world frame. We do not have to unproject depth through the intrinsic and transform through the extrinsic ourselves — VGGT hands us the answer. (It took us three crashes to realize that. Chapter 9.)

ℹ What VGGT does not give you

Metric scale. VGGT's world frame is anchored to the first camera, and the unit is whatever-the-network-decided-on — roughly meters but not calibrated. If you want to say "the chair is 1.8 meters from the door", you need an absolute depth reference (a LiDAR frame, a known-scale object, Depth Anything V2 metric). For a semantic map with relative spatial relationships, VGGT is enough.
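If you do have one known distance in the scene, recovering a global scale is a single division. The sketch below assumes you can pick two world_points that bound a known-scale object; `metric_scale` and the door-frame numbers are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def metric_scale(p_a, p_b, true_distance_m):
    """Scale factor that maps VGGT units to meters, from one known distance."""
    a = np.asarray(p_a, dtype=float)
    b = np.asarray(p_b, dtype=float)
    measured = np.linalg.norm(a - b)
    if measured == 0:
        raise ValueError("reference points coincide; cannot recover scale")
    return true_distance_m / measured

# Hypothetical example: a door frame known to be 2.0 m tall spans these two
# points in VGGT's (uncalibrated) world frame.
top = np.array([0.10, -0.90, 1.50])
bottom = np.array([0.10, 0.70, 1.50])
scale = metric_scale(top, bottom, true_distance_m=2.0)

# Multiply every world point and camera translation by `scale` to get meters.
print(round(scale, 3))
```

One reference gives one global factor; it does not correct any non-uniform distortion in the reconstruction.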

3.2 — Grounding DINO: text in, boxes out, no fine-tuning

The classical detector (Faster R-CNN) learns a fixed vocabulary at training time. Grounding DINO learns cross-attention between text embeddings and image features, so the vocabulary is set at inference time by the prompt. Give it the string "chair. table. door." and it returns boxes for those classes. Give it "fire extinguisher." and it finds fire extinguishers — with no re-training.

ℹ The period-separator convention

The prompt format is "class1. class2. class3." — lowercase phrases separated by periods, not commas. Internally, each period-segment becomes a separate "text query" that the model aligns to image regions. "chair, table" is treated as a single phrase ("the phrase chair comma table") and performs terribly.
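A small helper makes the convention hard to get wrong. This is a sketch only: `build_prompt` is a hypothetical name, not part of Grounding DINO or transformers, and the de-duplication step is our own assumption about what is convenient.

```python
def build_prompt(labels):
    """Join class names into the period-separated prompt format described above."""
    cleaned = []
    for label in labels:
        # Lowercase, trim whitespace, and drop stray separators the caller typed.
        phrase = label.strip().lower().strip(".,").strip()
        if phrase and phrase not in cleaned:
            cleaned.append(phrase)
    # Terminate every phrase with a period so each class is its own segment.
    return " ".join(f"{phrase}." for phrase in cleaned)


print(build_prompt(["Chair", "table ", "door.", "fire extinguisher"]))
```

Building the prompt from a list also keeps the label set in one place, which helps later when detections have to be mapped back to class names.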

3.3 — SAM 2: masks with a memory bank

SAM (2023) gives you a beautiful per-frame mask given a point or box prompt. SAM 2 (2024) adds a "memory bank" that makes masks persistent across a video. You prompt object #3 at frame 0; it produces a mask. It also stores an appearance embedding. At frame 50, when the object reappears after being occluded, SAM 2 matches against the memory and produces the mask for the same object. No IoU-matching hacks, no tracker state to tune.

In our pipeline, Grounding DINO produces boxes on keyframes (every 8th frame); SAM 2 seeds a track from each box and propagates masks through all frames. This is the fusion the paper recommends, and it works.

3.4 — Two tools the paper doesn't mention

We added two tools not in the paper, purely for the output pipeline:

  • Open3D for point-cloud I/O and Poisson mesh reconstruction. The paper discusses 3D Gaussian Splatting as a representation; for a first pass, a voxel-downsampled PLY and a Poisson mesh give you something you can open in MeshLab.
  • Rerun for visualization. Rerun lets us log camera poses, images, masks, and point clouds under a time-indexed hierarchy, then scrub through the recording. It's the right tool for SLAM-style data because it understands the temporal and spatial structure natively.

Chapter 4

Writing the scaffold

Modal's mental model is: your app.py declares "functions that run in the cloud", a container image, and a persistent volume. You invoke functions locally; Modal serializes arguments, runs them remotely, and streams results back. For a stateful pipeline that loads 5 GB of weights, the right unit is an @app.cls — a class whose @modal.enter() method runs once per container and loads the models; subsequent @modal.method() calls reuse that warm state.

python
import modal

app = modal.App("gas-v2")

weights = modal.Volume.from_name("gas-v2-weights", create_if_missing=True)
WEIGHTS_DIR = "/weights"

image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "ffmpeg", "libgl1", "libglib2.0-0")
    .pip_install("torch==2.4.0", "torchvision==0.19.0",
                 index_url="https://download.pytorch.org/whl/cu124")
    .pip_install("transformers>=4.44", "opencv-python-headless", "pillow",
                 "numpy<2", "huggingface_hub", "rerun-sdk>=0.20")
    .run_commands("pip install git+https://github.com/facebookresearch/sam2.git")
    .run_commands("pip install git+https://github.com/IDEA-Research/GroundingDINO.git")
    .run_commands("pip install git+https://github.com/facebookresearch/vggt.git")
)

@app.function(image=image, volumes={WEIGHTS_DIR: weights}, timeout=3600)
def download_weights():
    # one-time: pull 5 GB VGGT + 2 GB SAM 2 + 1 GB GDINO + 4 GB DINOv2 into the volume
    ...

@app.cls(image=image, gpu="A10G", volumes={WEIGHTS_DIR: weights},
         timeout=1800, scaledown_window=300)
class GasV2Pipeline:
    @modal.enter()
    def load(self):
        # load all four models, keep them in self for subsequent calls
        ...

    @modal.method()
    def run(self, video_bytes: bytes, text_prompt: str) -> dict:
        # the actual pipeline
        ...

That's the shape. The scaledown_window=300 means the container stays alive for 5 minutes after the last call — iterating with the same warm weights costs essentially nothing. Modal caches image layers by content hash, so re-runs with unchanged layers are instant.

Note one thing we got right on the first attempt: we used nvidia/cuda:...-devel because Grounding DINO's official install compiles a CUDA op at pip install time and needs nvcc. That's the textbook answer. It's also the thing that bit us.

Chapter 5

Run 1 · fail: Docker Hub's IPv6 tantrum

First modal run. Image build starts. Modal's build worker shells out to skopeo to pull nvidia/cuda:12.4.1-devel-ubuntu22.04 from Docker Hub, and:

=> Step 0: FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
Failed, retrying in 1s ... (1/1).
Error: initializing source docker://nvidia/cuda:12.4.1-devel-ubuntu22.04: pinging container registry registry-1.docker.io: Get "https://registry-1.docker.io/v2/": dial tcp [2600:1f18:2148:bc01:ba5a:41ea:5a66:2c07]:443: connect: network is unreachable
Runner failed with exit code: -1

Two retries, same error. That IPv6 address belongs to AWS's Docker Hub registry replica. Either Modal's workers have an IPv6 routing issue, or Docker Hub is having one — either way, the blast radius is "you can't use any base image hosted on docker.io."

The fix is not to fight the network. It's to move to a base that isn't hosted on docker.io. Modal has a built-in debian_slim base that it serves from its own infrastructure, always reachable from its own build workers. Switching to it drops the Docker Hub dependency entirely.

- modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
+ modal.Image.debian_slim(python_version="3.11")

But that breaks something else: debian_slim has no CUDA toolkit. No nvcc, no CUDA headers. Grounding DINO's official repo won't install. What now?

💡 Supply chains are failure domains

Any external registry in your build is a correlated failure domain — your deploy is alive only as long as everything it transitively depends on is alive. The standard practical response is: either use your provider's own-hosted images for base layers, or mirror the ones you need into your own registry. We took the first option because it was one line of diff.

Chapter 6

Run 2 · success: Why we never touched nvcc

The reason we thought we needed nvcc is Grounding DINO's custom CUDA op. The MultiScaleDeformableAttention kernel is written in C++/CUDA and compiled at install time. It makes inference somewhat faster and more memory-efficient than a pure-PyTorch implementation of deformable attention.

The important word is "somewhat." The HuggingFace transformers library has a port of Grounding DINO that reimplements MSDA in pure PyTorch. It's not as fast as the CUDA kernel. It needs no compile step. For our use case (8 detection calls per video, ~200ms each), the difference is imperceptible. For the purposes of shipping an integration, it's gold — we can drop the dev CUDA toolchain dependency entirely.

- .run_commands("pip install git+https://github.com/IDEA-Research/GroundingDINO.git")
+ # use transformers' port of Grounding DINO: pure PyTorch, no nvcc needed
+ # no extra install; already covered by transformers>=4.44

And in the model loader:

python
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

self.gdino_proc = AutoProcessor.from_pretrained(f"{WEIGHTS_DIR}/gdino")
self.gdino = AutoModelForZeroShotObjectDetection.from_pretrained(
    f"{WEIGHTS_DIR}/gdino"
).to(self.device).eval()

PyTorch's cu124 wheels bundle the CUDA runtime libraries — libcudart, libcublas, the whole set. You only need the CUDA toolkit (nvcc, headers) if you compile CUDA code at install time. For inference-only stacks where nothing compiles, runtime wheels are sufficient. That's our situation now.

Build succeeds. Weights download (VGGT 5 GB, SAM 2 2 GB, GDINO 1 GB, DINOv2 4 GB — we keep DINOv2 around for future CLIP-style feature merging). Container starts.

💡 The library version vs. the paper version

When you're doing inference integration (not research), prefer the library port over the paper repo. The paper repo optimizes for reproducing the paper. The library port optimizes for playing nicely with other libraries, installing without drama, and surviving your dependency graph. Both produce the same answers. One is two commands to get working; the other is a CUDA toolchain saga.

Chapter 7

Run 3 · fail: The CUDA 13 ambush

Container starts. Models load. VGGT forward pass runs. Grounding DINO runs. SAM 2 gets seeded. SAM 2's mask decoder fires and:

RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.13.0.

The libnvrtc-builtins shared object is part of NVIDIA's runtime compilation library — torch.compile, fused kernels, some autograd paths use it. The error is "we looked for the 13.0 version and couldn't find it."

We pinned torch==2.4.0 with cu124. Torch 2.4 + cu124 ships with NVIDIA's CUDA 12.4 runtime. Why is something looking for CUDA 13?

Answer: SAM 2's pyproject.toml has a dependency on torch>=2.5. When pip ran the SAM 2 install in a later image layer, it saw our pinned torch 2.4, decided that was too old, and silently upgraded torch to the latest (2.8+) from pip's default PyPI index. That wheel bundles the CUDA 13 runtime libraries, not the CUDA 12.4 ones our initial install pulled. Runtime is now a mix of CUDA 12.4 and CUDA 13 files, and libnvrtc-builtins.so.13.0 (from the upgrade) expects companions that never got installed.

This is a pin-resolution collision. We pinned torch 2.4 in our first pip_install layer. The SAM 2 install in a later layer saw that pin as "a starting point I'm allowed to upgrade to satisfy my own constraints." pip is not, by default, a strict pinning tool; it's a best-effort resolver that prefers to satisfy all constraints over respecting your earlier pins.

The fix is surgical. Add one more image layer after SAM 2 and VGGT install, force-reinstalling a pinned torch from pytorch.org's CUDA 12.4 index. "The last pip install wins":

python
# after all other installs: pin torch consistently for runtime
.run_commands(
    "pip install --upgrade --force-reinstall "
    "torch==2.5.1 torchvision==0.20.1 "
    "--index-url https://download.pytorch.org/whl/cu124"
)

torch==2.5.1 satisfies SAM 2's >=2.5 constraint (SAM 2 gets to use its APIs), and the cu124 index ensures the CUDA 12.4 runtime libraries are the ones actually installed. --force-reinstall kicks out whatever SAM 2 pulled.

ℹ How to see this coming next time

The signal that pip silently upgraded torch is in the build log — you'll see lines like Downloading .../nvidia_cublas-13.1.0.3-py3-none-manylinux...whl in a layer where you only expected SAM 2's deps. NVIDIA's CUDA 13 packages have -cu13 in the filename; CUDA 12's have -cu12. If you see cu13 anywhere in your build log and you pinned cu124, something upgraded torch under you.

💡 Pin the last install, not the first

In a multi-layer image, pip's resolver runs independently in each layer. Earlier pins can be overridden by later installs. The reliable pattern is to pin the thing you care about in the last layer that touches it, with --force-reinstall, so nothing gets to override it afterwards.
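A cheap complement to pinning is to check the installed versions when the container starts and fail before any weights load. The sketch below is an assumption about how you might do that, not something from app.py; `find_pin_drift` is a hypothetical helper, and the fake environment stands in for importlib.metadata.version so the example runs without torch installed.

```python
from importlib import metadata

def find_pin_drift(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package that drifted."""
    drift = {}
    for package, wanted in expected.items():
        installed = get_version(package)
        # PyTorch wheels carry a local suffix such as "+cu124"; compare the base version.
        base = installed.split("+")[0]
        if base != wanted:
            drift[package] = (wanted, installed)
    return drift

# Stand-in lookup: pretend a later layer upgraded torch behind our back.
fake_env = {"torch": "2.8.0", "torchvision": "0.20.1+cu124"}
drift = find_pin_drift({"torch": "2.5.1", "torchvision": "0.20.1"}, fake_env.__getitem__)
print(drift)
```

Called with the default lookup at the top of the model-loading method, a non-empty result is a reason to raise immediately rather than discover the mismatch inside a kernel launch.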

Chapter 8

Runs 4 & 5 · fail: How much video fits in 24 GB?

Run 4 kicks off. Models load. Video uploads. Frames decode:

decoded 128 frames at 1280x720
AssertionError: Input image height 720 is not a multiple of patch height 14

VGGT's ViT uses patch size 14. It can only tokenize images whose height and width are divisible by 14. 720 / 14 = 51.4. Assertion.

Easy fix: resize frames to the nearest smaller multiple of 14 at decode time. 720 → 714, 1280 → 1274. Run 4 tries again and gets to VGGT's forward pass — where it immediately hits:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.54 GiB. GPU 0 has a total capacity of 22.06 GiB of which 3.43 GiB is free.

This deserves to be derived, not just fixed. Why does VGGT want 4.5 GiB for one tensor?

The activation memory math

VGGT processes all N frames as a single sequence. After patchification, each frame becomes P = ⌊H/14⌋ × ⌊W/14⌋ tokens. The full sequence is T = N × P tokens. The self-attention layer computes a full T × T attention matrix.

Attention memory

For a ViT processing N frames at resolution H × W with patch size 14 and attention dtype bf16 (2 bytes):

tokens_per_frame = ⌊H/14⌋ · ⌊W/14⌋
total_tokens     = N · tokens_per_frame
attn_bytes       = total_tokens² · 2          (per layer, per head, without flash-attn)

Plug in our Run-4 numbers: N=128, H=714, W=1274 → tokens/frame = 51 · 91 = 4641. Total tokens = 128 · 4641 ≈ 594K. A full attention matrix in bf16 would be 594K² · 2 bytes ≈ 705 GB. That's obviously impossible; VGGT uses flash attention internally, which never materializes the T × T matrix and so trades the O(T²) term for O(T·d) memory. But the O(T·d) activations and other intermediates still scale with T: on the order of a few GB per block, stacked across layers at ViT-G depth.

The practical takeaway: activation memory scales with frames and resolution jointly. You have two knobs, and one of them helps a lot more than the other.
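The arithmetic above is easy to script, which makes it quick to test a frame count and resolution before launching a run. This sketch reproduces only the naive (no flash-attention) bound from the formula; it is an upper-bound intuition pump, not a prediction of VGGT's real peak memory, and the function names are ours.

```python
PATCH = 14  # VGGT's ViT patch size

def total_tokens(n_frames, height, width, patch=PATCH):
    """Tokens in the joint sequence: frames times patches per frame."""
    return n_frames * (height // patch) * (width // patch)

def naive_attention_bytes(tokens, bytes_per_elem=2):
    """Size of one full tokens x tokens attention matrix (bf16 by default)."""
    return tokens * tokens * bytes_per_elem

run4 = total_tokens(128, 714, 1274)   # the configuration that ran out of memory
final = total_tokens(64, 280, 504)    # the configuration that fit

for name, t in (("run 4", run4), ("final", final)):
    print(f"{name}: {t} tokens, naive attention {naive_attention_bytes(t) / 1e9:.1f} GB")
```

Run 4 comes out at 594,048 tokens and roughly 705.8 GB for the naive matrix; the final settings give 46,080 tokens and roughly 4.2 GB, a 13× drop in sequence length.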

Visualizing how total_tokens = N · patches_per_frame, and how that translates to VRAM across three resolutions. Green = fits 24 GB, red = OOM. Our Runs 4 and 5 landed in red; Run 9 in green.

Fix step 1: cap the longest side. We set max_side=504, which is 36 patches along the long side and well inside the memory budget. For our 720×1280 portrait video, the aspect-preserving resize (rounded down to multiples of 14) is 280×504. That's small visually but still fine for detection (humans can identify a chair in a 280-pixel crop just fine; Grounding DINO does better than humans on this).

Fix step 2: cap max_frames to 64 as a safety margin. Then stride-sample across the whole clip, so we still cover all 55 seconds: we sample every 25th frame (stride = ⌊1644 / 64⌋ = 25) instead of reading the first 64 sequentially. At ~1.2 fps effective coverage, that's still dense enough for SAM 2's memory bank to track objects.

Fix step 3: turn on expandable segments. PyTorch's default CUDA caching allocator can fragment memory across repeated allocations of varying sizes; expandable_segments is an allocator option that lets existing segments grow instead of reserving new fixed-size ones, which reduces that fragmentation. It's a free improvement under memory pressure.

python
.env({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"})

With all three changes: 64 frames at 280×504, expandable-segments on. Total tokens now 64 · (20 · 36) = 46K. Peak memory during VGGT: ~18 GB. We fit. Onward.
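The first two fixes reduce to a few lines of integer arithmetic. The following is a sketch under our own naming (`fit_to_patch_grid` and `stride_sample` are hypothetical, and the real app.py may structure this differently); it reproduces the numbers quoted above.

```python
PATCH = 14

def fit_to_patch_grid(width, height, max_side=None, patch=PATCH):
    """Aspect-preserving downscale, then round each side down to a multiple of `patch`."""
    long_side = max(width, height)
    if max_side is not None and long_side > max_side:
        # Integer arithmetic avoids float rounding surprises at exact multiples.
        width = width * max_side // long_side
        height = height * max_side // long_side
    return (width // patch) * patch, (height // patch) * patch

def stride_sample(total_frames, max_frames):
    """Evenly strided frame indices across the clip, capped at max_frames."""
    stride = max(1, total_frames // max_frames)
    return list(range(0, total_frames, stride))[:max_frames], stride

print(fit_to_patch_grid(1280, 720))                 # the first fix: 1274 x 714
print(fit_to_patch_grid(720, 1280, max_side=504))   # with the cap: 280 x 504
indices, stride = stride_sample(1644, 64)
print(stride, len(indices), indices[-1])            # stride 25, 64 frames, last index 1575
```

Rounding down by resizing shifts the aspect ratio by a fraction of a patch; cropping instead would preserve it at the cost of a few edge pixels.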

💡 Know your VRAM before you write the loop

For any ViT-based video model, the cheap mental model is total_tokens = frames × patches_per_frame, and attention memory grows as a power of total tokens (quadratic without flash-attn, linear-ish with it, but still with large constants). Halving the resolution quarters the patches, halving the frames halves the tokens. If something doesn't fit, reach for resolution first (quadratic savings) before frames.

Chapter 9

Run 6 · fail: VGGT's real API

Memory fixed. VGGT forward pass runs. Next line:

KeyError: 'extrinsic'

Our draft code assumed VGGT's output dict has an extrinsic key. It doesn't. What it actually returns:

pred = self.vggt(imgs)
# dict with keys:
pose_enc            # [B, S, 9]        9-dim pose encoding, not a matrix
depth               # [B, S, H, W, 1]
depth_conf          # [B, S, H, W]
world_points        # [B, S, H, W, 3]  per-pixel 3D points, already in world frame
world_points_conf   # [B, S, H, W]

Two things to note. First, extrinsic and intrinsic matrices aren't direct outputs; you decode them via a helper:

python
from vggt.utils.pose_enc import pose_encoding_to_extri_intri

ext34, intr = pose_encoding_to_extri_intri(pred["pose_enc"], imgs.shape[-2:])
# ext34: [1, S, 3, 4] — world-to-camera
# intr:  [1, S, 3, 3]

Second, and more importantly: VGGT already gives you per-pixel world points. We were about to write a pinhole back-projection:

# The manual way (wrong for VGGT, and unnecessary):
- x_cam = (xs - K[0,2]) * z / K[0,0]
- y_cam = (ys - K[1,2]) * z / K[1,1]
- pts_cam = np.stack([x_cam, y_cam, z, np.ones_like(z)], axis=-1)
- pts_world = pts_cam @ E.T

But VGGT's world_points[f, y, x] is already that value. We just index it:

+ world_points = pred["world_points"][0].float().cpu().numpy()  # [S, H, W, 3]
+ # for each mask:
+ pts_world = world_points[frame_idx][mask]  # [N, 3]
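To make the indexing concrete, here is a self-contained sketch on synthetic data. The array stands in for VGGT's world_points, and `lift_mask` is a hypothetical helper; the percentile box follows the "10/90 percentile AABB" noted in the pipeline diagram, though the exact handling in app.py may differ.

```python
import numpy as np

def lift_mask(world_points, frame_idx, mask, lo=10, hi=90):
    """Gather the 3D points under a 2D mask and summarize them robustly."""
    pts = world_points[frame_idx][mask]           # boolean mask [H, W] -> points [N, 3]
    centroid = pts.mean(axis=0)
    # Percentile bounds trim the outermost points on each axis, which limits the
    # effect of pixels where the mask bleeds onto the background.
    box_min = np.percentile(pts, lo, axis=0)
    box_max = np.percentile(pts, hi, axis=0)
    return pts, centroid, box_min, box_max

# Synthetic stand-in for VGGT output: 2 frames of 4x5 pixels.
rng = np.random.default_rng(0)
world_points = rng.normal(size=(2, 4, 5, 3))
mask = np.zeros((4, 5), dtype=bool)
mask[1:3, 1:4] = True                             # 6 masked pixels

pts, centroid, box_min, box_max = lift_mask(world_points, 1, mask)
print(pts.shape, centroid.shape)
```

A min/max box over the same points would be dominated by a handful of outliers, which is one plausible source of the oversized boxes that show up in the first end-to-end run.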

The other subtle detail: VGGT's extrinsic is world-to-camera (the standard computer vision convention). Rerun's Transform3D, by contrast, describes the pose of the logged entity in its parent's space: logged at world/cam, it maps camera coordinates into world coordinates. The names sound alike, but the two matrices are inverses of each other. What Rerun wants is the camera's pose in world coordinates (world_from_cam), not a world-to-camera transform. You invert:

python
E_wc = np.eye(4); E_wc[:3] = extrinsic_wc[i]   # VGGT: world-to-cam
E_cw = np.linalg.inv(E_wc)                      # Rerun wants world-from-cam
rr.log(cam_path, rr.Transform3D(
    translation=E_cw[:3, 3].tolist(),
    mat3x3=E_cw[:3, :3].tolist(),
))
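Because the extrinsic is a rigid transform [R | t], the inverse also has a closed form: rotation Rᵀ and translation −Rᵀt. The sketch below assumes R is orthonormal (which holds for a well-formed extrinsic) and checks the closed form against np.linalg.inv on a made-up camera; `world_from_cam` is a hypothetical helper name.

```python
import numpy as np

def world_from_cam(ext34):
    """Invert a 3x4 world-to-camera matrix [R | t] without a general matrix inverse."""
    R, t = ext34[:, :3], ext34[:, 3]
    pose = np.eye(4)
    pose[:3, :3] = R.T           # a rotation's inverse is its transpose
    pose[:3, 3] = -R.T @ t       # camera center expressed in world coordinates
    return pose

# Made-up example: camera rotated 90 degrees about Y, plus a translation.
theta = np.pi / 2
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
ext34 = np.hstack([R, np.array([[0.5], [0.0], [2.0]])])

E_wc = np.eye(4)
E_wc[:3] = ext34
assert np.allclose(world_from_cam(ext34), np.linalg.inv(E_wc))
```

The last column of the result is the camera position in world coordinates, which is also what a trajectory polyline needs.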

💡 Read the source, not the README

Model README files describe what the model does. They often omit the exact keys of the output dict, the dtype expectations, the pose convention. For foundation models that ship every three months, the fastest path is: clone the repo, grep for the forward method, print the output dict once, keep going. Three minutes of reading source usually saves an hour of runtime error roulette.

Chapter 10

Run 7 · fail: A renamed argument

VGGT now works. Grounding DINO's inference runs. Post-processing line:

TypeError: GroundingDinoProcessor.post_process_grounded_object_detection() got an unexpected keyword argument 'box_threshold'

Somewhere between transformers 4.44 and 4.48, the argument got renamed:

- box_threshold=0.25,
+ threshold=0.25,

One-line fix. But the lesson is bigger than the fix: we pinned transformers>=4.44, not transformers==4.44. Between project setup and run time, the image build picked up 4.48.x, which has this rename. The official docs at the time of writing still show box_threshold.
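If you cannot pin immediately, one stopgap is to ask the function which keyword it accepts. This is a defensive sketch, not a recommendation over pinning: `box_threshold_kwarg` is a hypothetical helper, and the two stand-in functions are simplified guesses at the old and new signatures, used only so the example runs without transformers installed.

```python
import inspect

def box_threshold_kwarg(post_process_fn, value):
    """Pick whichever keyword name this version of the function accepts."""
    params = inspect.signature(post_process_fn).parameters
    for name in ("threshold", "box_threshold"):
        if name in params:
            return {name: value}
    raise TypeError("neither 'threshold' nor 'box_threshold' is accepted")

# Simplified stand-ins for the old and new processor method signatures.
def old_api(outputs, input_ids, box_threshold=0.25, text_threshold=0.25):
    return "old"

def new_api(outputs, input_ids, threshold=0.25, text_threshold=0.25):
    return "new"

print(box_threshold_kwarg(old_api, 0.25))
print(box_threshold_kwarg(new_api, 0.25))
```

In the pipeline the returned dict would be splatted into the post-processing call with `**`. It papers over one rename only; it does nothing for behavioral changes between versions.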

💡 >= is not a pin

Minimum-version constraints are useful for library authors ("I need at least API X"). For reproducible application builds, they're a landmine — your next deploy might land on any later version, which might or might not be source-compatible. Build-time equality pins (== or exact-version lockfiles) are what you want for applications.

Chapter 11

Run 8 · fail: SAM 2's autocast requirement

Grounding DINO fixed. It seeds 40 object boxes across 8 keyframes. SAM 2's propagate_in_video starts, crunches for a second, and:

RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float

SAM 2 is an architecture with mixed precision. The image encoder (Hiera) is trained and stored in bf16 for efficiency. The mask decoder stores its weights in fp32. During inference, without an autocast context, image features arrive at the mask decoder as bf16 tensors; the decoder's fp32 weights can't matmul with them.

The standard fix is to wrap all SAM 2 calls in an autocast block, which promotes the inference path to a consistent dtype:

python
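# Excerpt from the pipeline: `detections` maps keyframe index -> [(label, box), ...]
next_id = 0            # SAM 2 object ids, one per seeded box
masks_per_frame = {}   # frame index -> {object id: boolean mask}

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = self.sam2.init_state(video_path=str(frames_dir))
    for k, boxes in detections.items():
        for label, box in boxes:
            self.sam2.add_new_points_or_box(state, frame_idx=k, obj_id=next_id, box=box)
            next_id += 1
    for f_idx, obj_ids, mask_logits in self.sam2.propagate_in_video(state):
        # logits > 0 is the foreground decision for each tracked object
        masks_per_frame[f_idx] = {
            oid: (mask_logits[i, 0] > 0).cpu().numpy()
            for i, oid in enumerate(obj_ids)
        }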

This is the kind of model-specific calling convention that lives in example scripts, not in the module docstring. The fix is obvious once you know it; discovering it involves reading the repo's notebooks/video_predictor_example.ipynb.

ℹ Why not just .to(torch.bfloat16)?

You can manually cast SAM 2's weights to bf16 at load time. It works, but you lose precision on the fp32 parts (the mask decoder's learned prompt embeddings) that the authors intentionally kept in full precision. Autocast is surgical: it runs the ops that benefit from bf16 (matmuls, convolutions) in bf16 and leaves precision-sensitive ops in fp32, without touching the stored weights. It's the right tool.

Chapter 12

Run 9 · success: First working end-to-end

All fixes compound. Run 9 runs through every stage:

models ready
sampled 64/1644 frames, resized 720x1280 → 280x504 (stride=25)
VGGT: wp range -2.26..1.07, conf p50=1.01
keyframe 0: 2 detections
keyframe 8: 10 detections
keyframe 16: 5 detections
...
8 keyframes total: 39 detections
seeded 39 object tracks
rerun .rrd: 13.9 MB
"n_objects": 39,
wrote out/gas2.rrd

Pipeline works. Celebrate for thirty seconds, then open the output.

It is bad.

Thirty-nine objects, but six of them labeled "chair" and eight labeled "door" — clearly duplicates of the same physical furniture, seeded fresh at different keyframes. Rerun has sixty-four floating camera frustums but no obvious way to "play" the walkthrough and watch the POV move. The floorplan we generated (projecting X-Z of world points) is a mess of overlapping rectangles, each one spanning three meters in every direction, with labels stacked on top of each other.

The scene graph is correct but unusable. The rest of the build log is about making it usable.

A pipeline that finishes is half the work. The other half is the output pipeline — the visualization, the dedup, the spatial summarization. That's where the actual product lives.

Chapter 13

Making the outputs actually useful

Five distinct problems with Run 9's output, addressed in order:

  1. Rerun has no POV view — each camera is its own static entity.
  2. The 2D floorplan is in the wrong coordinate frame — VGGT's world is camera-centric, not gravity-aligned.
  3. There's no 3D mesh artifact for downstream use.
  4. The POV video isn't a video — it's a scrubbable Rerun recording.
  5. The scene graph has duplicates and bloated bounding boxes.

13.1 — Rerun POV: one entity that moves

Run 9 logged each frame's camera as a separate static entity:

world/cam_0000            (Transform3D)
world/cam_0000/image      (Pinhole)
world/cam_0000/image/rgb  (Image)
world/cam_0001            (Transform3D)
... 64 of these ...

Result: 64 camera frustums floating in 3D space. To see frame 32's POV, you click world/cam_0032 in the entity tree, navigate through, and the 2D view updates. It's not a video experience; it's a static 3D scene with lots of cameras.

The Rerun-idiomatic way is a single dynamic camera entity that moves through time. You log world/cam once per frame on a "frame" timeline; Rerun shows the most recently logged value at the current time, and the 2D "POV" view automatically shows the current-time image:

python
import rerun as rr
import rerun.blueprint as rrb

for i in range(len(frames)):
    rr.set_time_sequence("frame", i)
    rr.log("world/cam",       rr.Transform3D(translation=..., mat3x3=...))
    rr.log("world/cam/image", rr.Pinhole(focal_length=..., principal_point=...,
                                         width=W, height=H))
    rr.log("world/cam/image/rgb",   rr.Image(frames[i]))
    rr.log("world/cam/image/depth", rr.DepthImage(depth[i], meter=1.0))
    rr.log("world/cam/image/masks", rr.SegmentationImage(seg[i]))

# Also pin a camera-trajectory polyline so the walk is visible in 3D
rr.log("world/trajectory",
       rr.LineStrips3D([traj_pts], colors=[[200, 200, 200]]),
       static=True)

# And a blueprint: 3D scene left, POV right
rr.send_blueprint(rrb.Blueprint(
    rrb.Horizontal(
        rrb.Spatial3DView(name="Scene", origin="/world"),
        rrb.Spatial2DView(name="POV",   origin="/world/cam/image"),
        column_shares=[2, 1],
    )
))

Now the time scrubber advances the camera. The left panel shows the scene, the camera frustum sliding along the trajectory. The right panel shows the RGB from the current frame, with SAM 2 masks overlaid (from the SegmentationImage), plus 2D detection boxes from Grounding DINO on keyframes. It's the SLAM-visualization pattern Rerun was built for.

13.2 — 2D floorplan via PCA

The paper shows a satisfying 2D floorplan. We had a first attempt at replicating that: project VGGT's world points to the X-Z plane (drop Y), plot object bounding boxes. The result was unreadable — huge overlapping rectangles, labels piled on each other.

The issue: VGGT's world frame is not gravity-aligned. It's camera-centric — the first camera's optical axis defines +Z, its up vector defines -Y. If you held the phone upright perfectly, -Y is world-up and X-Z is the floor. If you didn't, it's tilted by whatever your phone was tilted by, and "X-Z" is an arbitrary diagonal slice through the room.

We need a gravity-aligned 2D plane. You could use IMU data, but phone videos don't carry it by default, and VGGT ignores it anyway. You could estimate the floor normal from the mode of the depth gradients (floors are locally planar), but that's surgery.

Here's the trick: when you walk through a room carrying a camera, your camera positions live in an approximately 2D plane — the plane of the floor, one and a half meters up. You walk along it, you don't levitate. The first two principal components of the camera trajectory are the floor plane.

PCA alignment — derivation

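Collect camera positions $t_i \in \mathbb{R}^3$, $i = 1, \dots, N$. Center and SVD:

$$
\begin{aligned}
\mu &= \frac{1}{N} \sum_{i=1}^{N} t_i \\
T &= \operatorname{stack}(t_i - \mu) && \text{($N \times 3$ matrix)} \\
U \Sigma V^{\top} &= \operatorname{SVD}(T) \\
\text{floor\_basis} &= V[:, :2] && \text{($3 \times 2$, top two principal axes)} \\
u_i &= (t_i - \mu) \cdot \text{floor\_basis} && \text{(2D projection)}
\end{aligned}
$$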

Apply the same (x - μ) · floor_basis to all 3D points (object centroids, AABB corners) to put them in floorplan coordinates. The third principal component (smallest singular value) points roughly along gravity — we drop it.

In code, ten lines:

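python
import itertools
import numpy as np

traj = np.array(trajectory)          # (N, 3) camera centers in VGGT world frame
center = traj.mean(axis=0)
T = traj - center
U, S, Vt = np.linalg.svd(T, full_matrices=False)
basis = Vt[:2]  # 2x3, rows = floor-plane axes

def project(p):
    # Accepts one point (3,) or an array of points (M, 3).
    return (np.asarray(p) - center) @ basis.T

traj_2d = project(traj)
for node in graph:
    # Project all 8 AABB corners: the floor basis is rotated relative to the
    # world axes, so bbox_min / bbox_max alone don't bound the projected box.
    corners = np.array(list(itertools.product(*zip(node["bbox_min"], node["bbox_max"]))))
    corners_2d = project(corners)
    bmin_2d, bmax_2d = corners_2d.min(axis=0), corners_2d.max(axis=0)
    # draw rectangle from bmin_2d to bmax_2d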

[Diagram: PCA finds the floor plane. Left panel: raw camera positions (tilted). Right panel, after SVD: PCA-aligned floor plane with axes e₁, e₂.]

Left: raw camera positions form a flat cloud oriented at a random angle because VGGT's world isn't gravity-aligned. Right: SVD recovers the two dominant axes (orange = floor plane); projecting onto them gives a clean top-down view.
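
A quick way to convince yourself that the trick holds is to run it on a synthetic walk where the true "up" is known. The sketch below is not from the pipeline: the 20° tilt, the 2 cm of vertical bobbing, and the room dimensions are all made-up values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical walk: 64 camera positions on a flat plane (y = 0) with ~2 cm of
# vertical bobbing, then tilted 20 degrees about the x-axis to mimic a world
# frame that is not gravity-aligned.
n = 64
walk = np.column_stack([
    rng.uniform(-2.0, 2.0, n),   # x: across the room
    rng.normal(0.0, 0.02, n),    # y: small vertical jitter
    rng.uniform(-1.5, 1.5, n),   # z: along the room
])
tilt = np.deg2rad(20.0)
R = np.array([[1.0, 0.0, 0.0],
              [0.0, np.cos(tilt), -np.sin(tilt)],
              [0.0, np.sin(tilt),  np.cos(tilt)]])
traj = walk @ R.T                       # tilted camera centers
true_up = R @ np.array([0.0, 1.0, 0.0])

# Same recipe as in the text: center, SVD, keep the top two right-singular vectors.
center = traj.mean(axis=0)
_, S, Vt = np.linalg.svd(traj - center, full_matrices=False)
basis, est_up = Vt[:2], Vt[2]

alignment = abs(float(est_up @ true_up))                  # 1.0 means perfectly aligned
height_spread = float(np.std((traj - center) @ est_up))   # residual along the dropped axis
print(f"|cos(angle to true up)| = {alignment:.4f}, height spread = {height_spread:.3f}")
```

The caveat this makes visible: the method relies on the walk spreading out in two directions. A straight-line walk leaves the second principal axis poorly determined, so the recovered plane can roll around the direction of travel.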

Add to that: a gradient-colored trajectory line (teal start, amber end), transparent AABB outlines instead of filled rectangles (so overlaps are legible), labels above the rectangles not inside them, and the floorplan becomes readable.

PCA-aligned floorplan with 26 labeled objects and the walk trajectory
Final floorplan from Run 10. Trajectory is the yellow-green polyline in the middle (starts teal, ends amber). Rectangle outlines are per-object AABBs, projected onto the PCA-recovered floor plane. Labels sit above each rectangle. The axes are in VGGT's arbitrary units (roughly meters).

13.3 — Point cloud + Poisson mesh

The scene graph has per-object point clouds. A common next artifact is a unified mesh. Open3D does this in four steps:

  1. Concatenate all object points into one cloud, coloring each by its object ID.
  2. Voxel-downsample to ~1 cm resolution — removes duplicate points from overlapping views.
  3. Estimate normals (Poisson needs them), then run Poisson surface reconstruction.
  4. Trim low-density vertices — Poisson hallucinates smooth surfaces into unobserved regions; we clip the bottom 10% of vertex density to kill the worst of it.

python
import numpy as np
import open3d as o3d

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(all_points)
pcd.colors = o3d.utility.Vector3dVector(all_colors)
pcd = pcd.voxel_down_sample(voxel_size=0.01)
pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))

o3d.io.write_point_cloud("scene.ply", pcd)

mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=8)
d = np.asarray(densities)
mesh.remove_vertices_by_mask(d <= np.quantile(d, 0.1))
mesh.compute_vertex_normals()
o3d.io.write_triangle_mesh("scene.obj", mesh)

Output: a 2.4 MB PLY (47K points) and a 7.9 MB OBJ (96K triangles). Good enough to open in MeshLab or drag into a 3D viewer for orientation. Not production-grade; for that you'd use TSDF fusion or a neural SDF. But as a "here's what the pipeline built" artifact, it's a real file someone can interact with.

13.4 — POV MP4 with overlays

Rerun is great for exploratory viewing but assumes the user installs rerun-sdk and runs the viewer. For sharing — a Slack thread, a blog post (hi) — you want a plain MP4. We render one server-side: RGB with mask alpha-blend and detection boxes on top, colorized depth below, stacked vertically for portrait playback.

POV from Run 10. Left: RGB with SAM 2 masks (alpha-blended per object) and Grounding DINO detection boxes on keyframes. Right: VGGT depth in the Turbo colormap. The frame rate is 6 fps, so the 64 sampled frames play back in about 11 seconds, down from 55 seconds of source.

The rendering loop is unremarkable — OpenCV VideoWriter with mp4v fourcc, one frame per sample, mask overlay computed as cv2.addWeighted(rgb, 0.55, mask_colored, 0.45, 0). Depth is cv2.applyColorMap on percentile-normalized values. The only thing to be careful about is that SAM 2's old track IDs need to be remapped through the dedup merge (see 13.5 below) before they're rendered, so the video's label legend matches the scene graph's label legend.
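
The two per-frame operations can be sketched in plain NumPy. This is an approximation of what cv2.addWeighted and the percentile normalization do, not the pipeline's rendering code. The function names, the choice to blend only inside masks, and the 2nd/98th percentile cut-offs are assumptions made for illustration.

```python
import numpy as np

def blend_masks(rgb, seg, palette, alpha=0.45):
    """Blend per-object colors onto an RGB frame (track id 0 = background).

    Equivalent in spirit to cv2.addWeighted(rgb, 0.55, mask_colored, 0.45, 0),
    except that background pixels are left untouched (an assumption).
    """
    colored = palette[seg]                               # (H, W, 3) color lookup by track id
    out = (1.0 - alpha) * rgb + alpha * colored
    out = np.where((seg > 0)[..., None], out, rgb)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

def normalize_depth(depth, lo_pct=2.0, hi_pct=98.0):
    """Map depth to 0..255 using percentile bounds so outliers don't flatten the range."""
    lo, hi = np.percentile(depth, [lo_pct, hi_pct])
    scaled = (depth - lo) / max(hi - lo, 1e-6)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)
```

The uint8 output of normalize_depth is the form cv2.applyColorMap expects, and blend_masks should receive track IDs that have already been remapped through the dedup merge.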

13.5 — The duplicate-track problem

Run 9 had 39 objects, including six "chair" tracks and eight "door" tracks. The cause is simple: SAM 2's add_new_points_or_box at frame 8 with a fresh obj_id doesn't know that track #1 (seeded at frame 0) is probably the same physical chair. Every keyframe's detections seed new IDs, even when they point at the same object.

The proper fix, and what the paper does, is ConceptGraphs-style CLIP-feature merging: for each track, compute a CLIP embedding of its masked region. Merge tracks whose embeddings are cosine-similar and whose centroids are spatially close. That handles "same object seen from different angles produces different masks but the same semantic content."
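
As a rough sketch of that merge test (not something this build implements): the embeddings below are assumed to be precomputed per-track vectors, and both thresholds are placeholder values, not numbers from the paper or from ConceptGraphs.

```python
import numpy as np

def should_merge(emb_a, emb_b, centroid_a, centroid_b,
                 min_cosine=0.8, max_dist=0.3):
    """Merge two tracks if their embeddings agree AND their centroids are close.

    emb_*: 1-D feature vectors (e.g. CLIP embeddings of each masked region).
    centroid_*: 3-D positions in world units.
    min_cosine and max_dist are hypothetical thresholds to be tuned per scene.
    """
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    dist = float(np.linalg.norm(np.asarray(centroid_a, dtype=float)
                                - np.asarray(centroid_b, dtype=float)))
    return cosine >= min_cosine and dist <= max_dist
```

Requiring both conditions is the point: semantic similarity alone would merge two identical chairs on opposite sides of the room, and proximity alone would merge a lamp with the table it sits on.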

We took a cheaper first pass: merge tracks that share a label and are within a small spatial distance. The assumption is stronger (Grounding DINO has to give the same label each time; we rely on that), but the code is twenty lines and fits the data we have.

python
import numpy as np

# For each track (old_id), try to find an existing merged group with the
# same label whose aggregated centroid is close enough.
# `raw` maps old_id -> {"label", "pts" (N x 3 array), "centroid"}.
merge_dist = 0.15  # VGGT world units, roughly 15 cm
merged = {}        # new_id -> {"label", "pts" (list), "ids"}
id_map = {}        # old_id -> new_id

for old_id, rec in raw.items():
    assigned = None
    for new_id, m in merged.items():
        if m["label"] != rec["label"]:
            continue
        m_centroid = np.concatenate(m["pts"], axis=0).mean(axis=0)
        if np.linalg.norm(m_centroid - rec["centroid"]) < merge_dist:
            assigned = new_id
            break
    if assigned is None:
        # No compatible group found: start a new one.
        assigned = len(merged) + 1
        merged[assigned] = {"label": rec["label"], "pts": [], "ids": []}
    merged[assigned]["pts"].append(rec["pts"])
    merged[assigned]["ids"].append(old_id)
    id_map[old_id] = assigned

Applied to Run 10, this collapses 33 raw tracks → 26 unique objects. Not perfect (the label-sharing assumption fails when Grounding DINO flip-flops between "chair" and "sofa chair" for the same object), but a massive qualitative improvement.

13.6 — The fat-AABB problem

In Run 9, object bounding boxes spanned the whole room. That's not a bug in the AABB computation; it's correct. VGGT's per-pixel world points have a heavy tail of low-confidence points at mask boundaries, in reflections, at occlusion edges. If you take min/max over all points in a mask, the min/max are dominated by these outliers. Each AABB becomes the bounding box of the object plus its weirdest few pixels.

The fix is to use percentile-based bounds: 10th percentile for the min, 90th percentile for the max. The bulk of the cloud is inside that box; the tails are clipped.

- "bbox_min": pts.min(axis=0).tolist(),
- "bbox_max": pts.max(axis=0).tolist(),
+ "bbox_min": np.percentile(pts, 10, axis=0).tolist(),
+ "bbox_max": np.percentile(pts, 90, axis=0).tolist(),

[Diagram: Min/max vs. percentile bounds. Left panel: min / max bounds, tails drag the box out. Right panel: 10 / 90 percentile bounds, tight around the bulk.]

Left: min/max takes the extent of every pixel in the mask, including low-confidence tails that reach far into neighboring surfaces. Right: 10th/90th percentile keeps the dense center.

For good measure we also compute the centroid from the percentile-bound inliers only. The centroid of all points (including tails) is pulled toward whichever tail is longest; the centroid of just the percentile inliers sits where the object's visual mass is.
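
Here is a small self-contained sketch of both ideas on synthetic data. The function name robust_box and the point distribution (a tight cluster plus a 5% one-sided tail) are invented for the demonstration. Only the 10/90 percentiles come from the pipeline.

```python
import numpy as np

def robust_box(pts, lo=10, hi=90):
    """Percentile bounds plus a centroid computed from the inliers only."""
    pts = np.asarray(pts, dtype=float)
    bbox_min = np.percentile(pts, lo, axis=0)
    bbox_max = np.percentile(pts, hi, axis=0)
    inside = np.all((pts >= bbox_min) & (pts <= bbox_max), axis=1)
    centroid = pts[inside].mean(axis=0) if inside.any() else pts.mean(axis=0)
    return bbox_min, bbox_max, centroid

rng = np.random.default_rng(0)
bulk = rng.normal([1.0, 0.0, 2.0], 0.1, size=(950, 3))   # the object itself
tail = rng.normal([4.0, 0.0, 6.0], 0.5, size=(50, 3))    # stray points on a far surface
pts = np.vstack([bulk, tail])

bmin, bmax, c = robust_box(pts)
print("min/max extent:   ", np.round(pts.max(axis=0) - pts.min(axis=0), 2))
print("p10/p90 extent:   ", np.round(bmax - bmin, 2))
print("mean of all pts:  ", np.round(pts.mean(axis=0), 2))
print("inlier centroid:  ", np.round(c, 2))
```

Note the trade-off: percentile bounds also shave the true extremities of a clean object, so the box is systematically a little small. That is usually a better failure than a box that spans the room.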

💡 When the data has tails, use percentiles

Nearly every 3D-from-image pipeline produces heavy-tailed point distributions, because the "hard" pixels of depth estimation (mask edges, specular reflections, thin structures, far surfaces) are the ones with the worst depth. Min/max statistics are the wrong summary for heavy-tailed data. 10/90 percentiles are usually the first thing to reach for.

Chapter 14

Run 10 · success

The final pipeline

All improvements stacked. One modal run, one command, six artifacts:

bash
$ modal run app.py::analyze
sending 69.7 MB to Modal; prompt='sofa. couch. chair. table. ...'
  max_frames=64 stride=8 box=0.35 text=0.3 merge=0.15
models ready
sampled 64/1644 frames, resized 720x1280 → 280x504 (stride=25)
VGGT: wp range -2.26..1.07, conf p50=1.01
keyframe 0: 4 detections
keyframe 8: 3 detections
keyframe 16: 3 detections
keyframe 24: 3 detections
keyframe 32: 2 detections
keyframe 40: 4 detections
keyframe 48: 5 detections
keyframe 56: 9 detections
seeded 33 object tracks
merged 33 tracks → 26 unique objects
rerun .rrd: 62.0 MB
scene.ply: 47375 pts, 2.4 MB
scene.obj: 96418 tris, 7.9 MB
pov.mp4: 1.2 MB

wrote out/gas2.json
wrote out/floorplan.png
wrote out/scene.ply (2.4 MB)
wrote out/scene.obj (7.9 MB)
wrote out/pov.mp4 (1.2 MB)
wrote out/gas2.rrd (62.0 MB)

Total runtime, warm container: ~90 seconds. Cost on A10G: about $0.04. If the container was cold (image build + model load), add ~2 minutes on top.

POV
1.2 MB
pov.mp4, RGB + masks + depth
Floorplan
0.1 MB
floorplan.png, PCA-aligned
Point cloud
2.4 MB
scene.ply, 47K points
Mesh
7.9 MB
scene.obj, 96K triangles
Scene graph
0.1 MB
gas2.json, 26 objects
Rerun recording
62 MB
gas2.rrd, scrubbable
Update · April 2026

Run 11 · finish

From pile of points to dollhouse

Run 10 worked. It crashed six times on the way there, but it worked. Twenty-six labeled objects, six artifacts, about a nickel per clip. So why am I writing this? Because "works" is not the same as "finished." If you put Run 10's Rerun recording next to Apple's RoomPlan dollhouse, you can see the gap: RoomPlan looks like a product and Run 10 looks like an experiment. Axis-aligned boxes around rotated sofas, a floorplan that's PCA scatter instead of an indoor layout, a palette where colors mean object IDs rather than categories, and the whole thing driven by a CLI. This update closes that gap.

ℹ What changed between Run 10 and Run 11

Same VGGT → Grounding DINO → SAM 2 backbone. We added Apple's Cubify Transformer (CuTR) per keyframe, fused oriented boxes across frames with a greedy 3D-IoU scheme, kept a PCA-OBB fallback running in parallel so nothing gets lost when CuTR misses, rendered a stylized dollhouse MP4 and a category-keyed floorplan, and put the whole pipeline behind a local FastAPI dashboard. No model was fine-tuned. No dataset was collected. One new checkpoint was downloaded. The rest is plumbing and taste.

The bar we hadn't cleared

Four specific failures of v1:

  • Rotated furniture, axis-aligned boxes. A sofa at 30° fills its AABB with two corners of air. The AABB is the 2×-too-big "shadow" of the object on the world axes.
  • “Heavy tail” geometry on thin things. VGGT's per-pixel world points have a long tail of low-confidence outliers. p10/p90 clipping helps, but it still inflates a poster into a thin slab.
  • Palette = object ID. obj_color(oid) = HSV(oid * 0.37) means the same physical chair is a different color on every rerun. The eye can't lean on “chair is apricot.”
  • CLI as UI. Running the pipeline requires a terminal, a Modal profile, and a file path. The output is a bag of files in out/. Nothing about that feels like a product.
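
The "2×-too-big" figure in the first bullet can be checked with a few lines of trigonometry. For a w × d footprint rotated by θ, the axis-aligned footprint measures (w·|cos θ| + d·|sin θ|) by (w·|sin θ| + d·|cos θ|). The sofa dimensions below are hypothetical, chosen only to be sofa-like.

```python
import math

def aabb_inflation(width, depth, yaw_deg):
    """Ratio of AABB footprint area to true footprint area for a rotated rectangle."""
    t = math.radians(yaw_deg)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    aabb_area = (width * c + depth * s) * (width * s + depth * c)
    return aabb_area / (width * depth)

# Hypothetical 1.84 m x 0.82 m sofa footprint, rotated 30 degrees.
ratio = aabb_inflation(1.84, 0.82, 30.0)
print(f"AABB footprint is {ratio:.2f}x the true footprint")
```

For these numbers the ratio comes out a little over 2, and it grows with the aspect ratio of the object, which is why long thin furniture suffers most.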

A 90-second research tree

The plan's operating principle — stolen from The AI Scientist-v2 — is tree-search over hypotheses, not linear pursuit. Four branches were on the table. One got chosen. One got kept as an always-on safety net. Two got politely declined.

| Branch | What it is | Decision |
| --- | --- | --- |
| CuTR (Apple) | Single-image transformer for class-agnostic 3D OBBs. Trained on CA-1M (1K laser-scanned rooms, 400K objects). 2412.04458 | Chosen as the per-keyframe detector. Clean OBBs, open source, pairs cleanly with our existing labels from Grounding DINO. |
| PCA-OBB | Covariance-SVD on each SAM 2 track's points, p10/p90 extents. ~20 lines of numpy. | Kept always on. Free, cheap, never fails. Becomes the fallback when CuTR misses a track or kills itself on a bad scene. |
| Boxer (Meta) | OWLv2 + DINOv3 + BoxerNet + Hungarian OBB fusion. 2604.05212 | Declined. Two weeks of re-plumbing to swap our entire front end, CC-BY-NC weights, and gives up the SAM 2 track identity we already have. |
| Rooms from Motion | Un-posed images → poses + OBBs jointly. Objects-as-features SfM. 2505.23756 | Declined. No public implementation; re-implementing is a season of work. |
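
The PCA-OBB row is small enough to sketch in full. This is a minimal version under stated assumptions, not the repo's obb.py: the function name and the (center, extent, R) return convention are made up here, and extents are full side lengths.

```python
import numpy as np

def pca_obb(pts, lo=10, hi=90):
    """Oriented box from a point cloud: PCA axes, percentile extents.

    Returns (center, extent, R) where the columns of R are the box axes in
    world coordinates and extent holds full side lengths along those axes.
    """
    pts = np.asarray(pts, dtype=float)
    mean = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - mean, full_matrices=False)
    R = Vt.T
    if np.linalg.det(R) < 0:          # keep a proper (right-handed) rotation
        R[:, 2] *= -1
    local = (pts - mean) @ R          # point coordinates in the box frame
    lo_b = np.percentile(local, lo, axis=0)
    hi_b = np.percentile(local, hi, axis=0)
    center = mean + R @ ((lo_b + hi_b) / 2)
    return center, hi_b - lo_b, R
```

The known weakness is symmetry: for a roughly square or round object the top two principal axes are nearly degenerate, so the box yaw is arbitrary. That is one reason to prefer a learned detector's orientation when it is available.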

The architecture

The trick is that we don't swap any of the v1 components. VGGT still owns poses and depth. Grounding DINO still owns labels. SAM 2 still owns track identity. CuTR is bolted on as a sibling detector that produces geometry only — oriented boxes, no class — and a fusion stage at the end stitches CuTR geometry to SAM 2 labels.

[Diagram: Run 11 architecture]

  • Input: RGB frames
  • VGGT: poses, depth, world points
  • Grounding DINO: 2D boxes + labels (keyframes)
  • SAM 2: mask tracks
  • CuTR (RGB-only): per-keyframe 3D OBB (class-agnostic)
  • PCA-OBB: per track, always on
  • Depth sanity gate: reject if z disagrees > 30%
  • 3D-IoU fusion + SAM 2 track match: CuTR OBB if matched, PCA-OBB otherwise, killswitch < 30 %
  • Outputs: gas2.rrd; pov.mp4 (oriented wireframes); dollhouse.mp4 (pastel orbit); floorplan_v2.png (OBB footprints)

The fusion rule is boring, which is the point. Two OBBs a and b merge when 3D-IoU exceeds a threshold and they agree on label, with a centroid-distance fallback for thin objects where small rotation errors collapse IoU to zero.

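Fusion criterion

$$
\bigl(\mathrm{IoU}_{3D}(a, b) > 0.3 \;\wedge\; \mathrm{label}(a) \in \mathrm{labels}(b)\bigr)
\;\lor\;
\bigl(\mathrm{IoU}_{3D}(a, b) = 0 \;\wedge\; \lVert c_a - c_b \rVert < 0.2\,\mathrm{m} \;\wedge\; \mathrm{label}(a) \in \mathrm{labels}(b)\bigr)
$$

where $c_a$ and $c_b$ are the box centers.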

3D-IoU is computed by rasterizing both OBBs into a shared 24×24×24 voxel grid over their union AABB. Not exact — but within 2 % for furniture-scale boxes (0.3–3 m) and takes under 5 ms per comparison. Good enough, and debuggable.
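
A minimal sketch of the voxelized IoU and the fusion test follows. It is not the repo's fusion.py. Boxes are assumed to be (center, extent, R) tuples with full side lengths and a world-from-box rotation matrix, and all function names are hypothetical. Only the 24-voxel grid and the 0.3 / 0.2 m thresholds come from the text.

```python
import numpy as np

def obb_corners(center, extent, R):
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return np.asarray(center) + (signs * np.asarray(extent) / 2) @ np.asarray(R).T

def inside(points, center, extent, R):
    local = (points - np.asarray(center)) @ np.asarray(R)   # world -> box frame
    return np.all(np.abs(local) <= np.asarray(extent) / 2, axis=1)

def voxel_iou(a, b, n=24):
    """Approximate 3D IoU by testing n^3 voxel centers over the union AABB."""
    corners = np.vstack([obb_corners(*a), obb_corners(*b)])
    lo, hi = corners.min(axis=0), corners.max(axis=0)
    step = (hi - lo) / n
    axes = [lo[k] + (np.arange(n) + 0.5) * step[k] for k in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    in_a, in_b = inside(grid, *a), inside(grid, *b)
    union = np.count_nonzero(in_a | in_b)
    return np.count_nonzero(in_a & in_b) / union if union else 0.0

def should_fuse(a, b, label_a, labels_b, iou_thresh=0.3, max_dist=0.2):
    """Fusion criterion: IoU above threshold, or zero IoU with nearby centers."""
    if label_a not in labels_b:
        return False
    iou = voxel_iou(a, b)
    if iou > iou_thresh:
        return True
    dist = float(np.linalg.norm(np.asarray(a[0], dtype=float) - np.asarray(b[0], dtype=float)))
    return bool(iou == 0.0 and dist < max_dist)
```

Two unit cubes offset by half a side have a true IoU of 1/3, which makes a convenient sanity check for the grid resolution. Two parallel posters 10 cm apart have zero IoU and exercise the centroid fallback.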

And there's a killswitch. If fewer than 30 % of merged SAM 2 tracks match a CuTR cluster, the whole scene falls back to PCA-OBB for every object. Because CuTR was trained on iPad LiDAR walkthroughs, the RGB-only variant can be brittle on dim or motion-blurred phone footage. The killswitch turns “CuTR had a bad day on this clip” into “still a clean scene, just without CuTR's orientation gain.”
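
The killswitch logic itself fits in a few lines. In this sketch the function name and the data shapes (a list of track IDs and a set of matched IDs) are assumptions. The 30% threshold is the one stated above.

```python
def choose_obb_sources(track_ids, cutr_matches, min_match_ratio=0.30):
    """Decide, per track, whether its box comes from CuTR or from PCA-OBB.

    track_ids: merged SAM 2 track ids.
    cutr_matches: set of track ids that matched a CuTR cluster.
    If too few tracks matched, every track falls back to PCA-OBB.
    """
    matched = [t for t in track_ids if t in cutr_matches]
    ratio = len(matched) / len(track_ids) if track_ids else 0.0
    if ratio < min_match_ratio:
        return {t: "pca" for t in track_ids}          # killswitch tripped
    return {t: ("cutr" if t in cutr_matches else "pca") for t in track_ids}
```

With 22 of 26 tracks matched, the result is a mix of 22 CuTR boxes and 4 PCA boxes. With 2 of 10 matched, everything is PCA.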

What you get now

Every run lives under runs/<run_id>/ where <run_id> is YYYYMMDD_HHMMSS_<6char>. No more overwriting, no more “which out/ is which?” Six new or upgraded artifacts:

| Artifact | What it is |
| --- | --- |
| gas2.rrd | Same Rerun recording as v1, but the Boxes3D entries now carry quaternions, so boxes render oriented. Labels include the source: [cutr] vs [pca]. |
| pov.mp4 | Every frame gets oriented wireframes drawn by projecting each fused OBB's 8 corners through that frame's intrinsic + extrinsic. The raw Grounding DINO 2D boxes stay as thin grey rectangles for debugging. |
| dollhouse.mp4 | A 4-second matplotlib 3D orbit over pastel OBB solids sitting on a PCA-fitted floor plane. Rendered locally on the Mac, not on Modal, so there is no EGL dance. |
| floorplan_v2.png | Pastel OBB footprints (the rotated rectangles, not axis-aligned bounding rectangles). The v1 floorplan.png is kept alongside for a before/after comparison. |
| scene.ply / scene.obj | Unchanged. |
| run_manifest.json | A single typed contract: run_id, pipeline_version, every artifact filename, the scene graph with OBB fields, the camera trajectory. The dashboard and the scorecard both read from this. |

Scene-graph records gained five fields. The v1 bbox_min/bbox_max/centroid/n_points fields stay for backward compatibility.

json
{
  "id": 7,
  "label": "sofa",
  "obb_center":         [-1.42,  0.05,  1.88],
  "obb_extent":         [ 1.84,  0.72,  0.82],
  "obb_quaternion_xyzw":[ 0.00,  0.38,  0.00,  0.93],
  "obb_source":         "cutr",
  "obb_confidence":      0.71,
  "bbox_min": [...], "bbox_max": [...], "centroid": [...], "n_points": 4721
}
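
To show how these fields are meant to be consumed, here is a sketch that turns one record into its 8 corners and projects them into a frame, which is the operation the pov.mp4 wireframes depend on. The conventions are assumptions, not confirmed by the source: obb_extent is treated as full side lengths, the extrinsic as a world-to-camera transform, and the intrinsics as a standard pinhole matrix.

```python
import numpy as np

def quat_xyzw_to_matrix(q):
    """Rotation matrix from a quaternion in (x, y, z, w) order."""
    x, y, z, w = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w),     2 * (x * z + y * w)],
        [2 * (x * y + z * w),     1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w),     2 * (y * z + x * w),     1 - 2 * (x * x + y * y)],
    ])

def obb_corners(center, extent, quat_xyzw):
    """8 world-space corners of an oriented box (extent = full side lengths)."""
    R = quat_xyzw_to_matrix(quat_xyzw)
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return np.asarray(center) + (signs * np.asarray(extent) / 2) @ R.T

def project(points_world, R_cw, t_cw, K):
    """Pinhole projection; returns pixel coords and an in-front-of-camera flag."""
    cam = points_world @ R_cw.T + t_cw           # world -> camera
    uv = (cam @ K.T)[:, :2] / cam[:, 2:3]        # perspective divide
    return uv, cam[:, 2] > 0
```

For the sample record, the quaternion [0, 0.38, 0, 0.93] works out to a rotation of roughly 44° about the Y axis (2·atan2(0.38, 0.93)). Corners flagged as behind the camera should be skipped or clipped before drawing, otherwise the wireframe flips through the image.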

A dashboard, because “finished” means not-in-a-terminal

Local FastAPI on the Mac, not Modal ASGI. Artifacts land on local disk by design — the user asked for “find them in some local folder,” and that's easier when there's no authed service boundary. The dashboard is a thin wrapper that uploads the video, kicks off the Modal pipeline as a subprocess, tails stdout, and pattern-matches log lines to pipeline stages for a live SVG flow view.

bash
pip install -e '.[dashboard]'
cd gas2/dashboard
uvicorn main:app --reload      # → http://localhost:8000

Three tabs:

  • Upload & Results. Drag-drop video, prompt field, past runs in the sidebar. When a run completes the dollhouse.mp4 plays as a hero tile and every artifact grid-tiles below with a direct download link.
  • Pipeline Flow. 11-node SVG DAG, each node lit up as stdout announces “VGGT loaded,” “keyframe 0: 4 detections,” “fusion: 6 CuTR clusters; 22/26 tracks matched” and so on. The stdout tail streams into a log pane below the DAG.
  • Rerun (technical). The dashboard spawns rerun serve --web-viewer runs/<id>/gas2.rrd on a fresh port and iframes it. A “launch desktop” button is the escape hatch if the web viewer is behind a firewall.

A scorecard you can hold up to numbers

“Feels better” is not evidence. Five metrics are tracked per scene, written into scorecard.json per run and rolled up to docs/scorecard.csv by the aggregate_scorecard.py script.

| Metric | How computed | v2 target |
| --- | --- | --- |
| Stability CV | std/mean of object count across --keyframe-stride ∈ {6, 8, 12} | < 0.15 |
| Label entropy | distinct labels / total objects | 0.5–0.8 |
| OBB tightness | median (OBB volume) / (convex-hull volume of source points) | 1.1–1.5 |
| Floor plane residual | RANSAC plane inlier ratio + RMS | > 0.7 / < 5 cm |
| Manual rubric | 1–5 on duplicates, OBB orientation, label plausibility, dollhouse aesthetic | ≥ 4 on all four |

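
The first two metrics are simple enough to sketch. The object counts below are invented for illustration, and the use of population standard deviation (rather than sample) is an assumption, since the table doesn't say which one the scorecard uses.

```python
from statistics import mean, pstdev

def stability_cv(object_counts):
    """Coefficient of variation of the object count across keyframe strides."""
    m = mean(object_counts)
    return pstdev(object_counts) / m if m else float("inf")

def label_ratio(labels):
    """Distinct labels divided by total objects (the table's 'label entropy')."""
    return len(set(labels)) / len(labels) if labels else 0.0

# Hypothetical object counts for --keyframe-stride 6, 8 and 12.
counts = {6: 28, 8: 26, 12: 24}
cv = stability_cv(list(counts.values()))
print(f"stability CV = {cv:.3f} (target < 0.15)")
```

A low CV means the number of objects barely changes when the keyframe stride does, which is the behavior you want from a dedup stage that is doing its job.
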
💡 What the killswitch bought us

Without it, the pipeline used to have two failure modes: “works and looks great” or “crashes.” Now there's a third: “works with reduced orientation fidelity on this scene.” That third mode is the one that actually lets you run v2 on arbitrary phone footage without hand-tuning — especially in dim rooms where VGGT depth gets noisy and CuTR's training distribution (bright iPad LiDAR captures) stops applying. The scorecard's obb_sources.cutr / .pca counts are the signal for when you've left CuTR's comfort zone.

Update · April 2026 (day two)

Run 12 · course-correct

The Boxer turnaround & the Modal internals

Run 11 shipped. The dollhouse looked stylized, the floorplan fit a rectangle, gravity pointed down. And then I watched the outputs on a second scene and a third. The CuTR side of the pipeline was working on our one calibration clip and getting progressively worse as the room changed. Furniture drifted through walls, volumes inflated, the depth-sanity gate killed 890/1353 candidates on one run, and the chairs refused to show up. So I did something I had explicitly declined in the Run 11 plan — reached for Meta's Boxer checkpoint, isolated it in its own Modal app, and wired it behind the existing dashboard as a second pipeline choice.

💡 The lesson from the reversal

The Run 11 plan rejected Boxer on licensing (CC-BY-NC) and integration cost. Both were real concerns, but weighing them before putting the model on a test clip was wrong. Day 2's question wasn't “is Boxer's license friendly?” — this is a personal, non-commercial blog experiment — it was “does CuTR's training distribution actually apply to phone walkthroughs of arbitrary rooms?” Running Boxer took four hours end-to-end in a sandbox app. Skipping that for two weeks would have been the real mistake.

Demo

The three artifacts below are the full output of a single 78 MB phone clip, fed through the new dashboard with the Boxer pipeline selected. No fine-tuning. No calibration. The video goes in, the three files below come out. Total Modal spend: a few cents.

dollhouse.mp4 — pastel orbit auto-generated

Room reconstruction from one phone video. Oriented bounding boxes, pastel RoomPlan palette, gravity-snapped, floor rectangle-fitted. Rendered locally by matplotlib after Modal returns the scene graph.

pov.mp4 — oriented wireframes back-projected into every frame auto-generated

Each fused 3D OBB is projected back through every frame's intrinsic + extrinsic, so you can spot when a box is wrong immediately — it won't stick to its object.

floorplan_v2.png — pastel footprints with trajectory auto-generated
Run 12 floorplan with pastel furniture footprints
ℹ What is “far from done” in these outputs

Recall still drops on chairs, plants, and mirrors at the 512-pixel detection resolution Boxer runs at on an A10G. Walls are empty in this scene: phone footage at 64 frames rarely gives the wall band enough confident points to RANSAC a plane. And the Lingbot geometry is not metric, so areas and volumes are in pipeline-internal units, not square or cubic meters. Each of those is a dial to turn (A100 for 960-pixel Boxer, a camera-height prior for metric scale, denser frame sampling for the wall band), not a rewrite. Future runs will knock them down one at a time.

How Modal is actually used (three apps, one dashboard)

Run 11 was one Modal app. Run 12 is three. The same phone clip now passes through two GPU sandboxes that run as separate Modal apps, plus one local process that glues them together. Each Modal app owns its own weights, image, and cold-start latency, so you can change one without re-downloading the others and modal deploy each independently.

[Diagram: Run 12 data flow]

  • Phone video (local, drag-and-drop onto dashboard)
  • FastAPI on Mac: upload, WS, subprocess modal run app.py::analyze_boxer
  • gas-lingbot-experiment (A10G): Lingbot-Map geometry, poses + depth + world points
  • gas-boxer-experiment (A10G): Meta Boxer, OWLv2 + BoxerNet 3D OBBs with Hungarian fusion
  • Room extraction + rectangle-fit + gravitize (local, no GPU; uses the two remote outputs)
  • Outputs: run_manifest.json, dollhouse.mp4, floorplan_v2.png, pov.mp4

The second orchestrator app (the “local” one you invoke with modal run app.py::analyze_boxer) is technically also running on Modal — but as a local_entrypoint, which means the Python body executes on your laptop while its .remote() calls round-trip to the two deployed apps. That's how a single modal run invocation can pull tensors off two separate A10G containers and render the artifacts with local matplotlib.

| Modal app | What's on GPU | Why it's its own app |
| --- | --- | --- |
| gas-lingbot-experiment | Robbyant/lingbot-map GCT checkpoint. Streams RGB frames → poses + depth + per-pixel world points. | Heavy model (~3 GB weights, CUDA 12.8 torch 2.9.1). Isolated image so the Boxer app doesn't need the whole stack just to import. |
| gas-boxer-experiment | Meta facebook/boxer (OWLv2 text-grounded 2D + BoxerNet 3D OBB head + offline 3D-IoU fusion). | Weights are CC-BY-NC. Keeping them behind a non-production sandbox app matches the license. 0.42 s/frame steady, < 1 GB peak. |
| gas-v2 (legacy, Run 11) | VGGT + Grounding DINO + SAM 2 + CuTR. | Selectable from the same dashboard for comparison. Useful when Boxer's class set doesn't match the prompt. |

💡 Why three apps instead of one giant image

Modal caches image layers but doesn't cache weights — those live in modal.Volume objects. Each pipeline has its own gas-<name>-weights volume, mounted at /weights by its own app. Re-downloading VGGT shouldn't cost you anything when you're only iterating on Boxer, and vice versa. When the Boxer team ships a new checkpoint, I re-run download_weights on exactly one app; the other two are untouched.

Dashboard × pipeline architecture (in software terms)

The UI lives at dashboard/main.py. It is intentionally boring FastAPI plus vanilla JS — no React, no SSR, no DB. The interesting part is the contract between the browser and the Modal subprocess.

  • Kickoff. A drag-dropped video plus a prompt lands at POST /api/upload. The server allocates run_id = YYYYMMDD_HHMMSS_<6hex>, writes the bytes to runs/<run_id>/, and starts a background thread that runs subprocess.Popen(["modal", "run", entrypoint, ...]). The entrypoint is whichever pipeline the user picked in the dropdown — app.py::analyze_boxer (Boxer) or app.py::analyze (legacy VGGT+CuTR). Everything is Modal — the dashboard never imports any of the heavy packages.
  • Progress streaming. The thread tails proc.stdout line by line, publishes each line to an in-memory RunBus, and also pattern-matches lines like [boxer] cached geometry or [boxer] raw= to a canonical stage id. Stage completions fire a structured event on the same bus.
  • WebSocket /ws/<run_id>. The browser opens one per run. Events arrive as JSON; the Flow tab's SVG DAG lights up each node with CSS classes .active / .done / .fail as events land. The log tail streams the same stdout for debuggers.
  • Two DAG layouts. flow.js holds a LAYOUTS dict keyed by pipeline id; the run-started event carries the pipeline id and the frontend rebuilds the DAG before the first stage lights. For Boxer that's decode → lingbot → boxer → room → fuse → ply → floorplan → dollhouse → pov → manifest.
  • Artifacts. Every artifact lands under runs/<run_id>/ on local disk. The dashboard serves them through GET /api/runs/<id>/artifacts/<filename>. The hero tile on the results pane is a Three.js RoomViewer that reads run_manifest.json and builds a lit 3D scene in the browser, falling back to the dollhouse MP4 when the viewer can't run.
  • Rerun tab. The dashboard lazily spawns rerun serve --web-viewer runs/<id>/gas2.rrd on a free port and embeds it in an iframe. The Boxer path doesn't produce a Rerun file, so the tab stays idle for those runs; the legacy VGGT path fills it.
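
The log-to-stage mapping in the progress-streaming bullet can be sketched as a table of regular expressions. The pattern table below is hypothetical. The two [boxer] prefixes and the Run 10 log lines appear earlier in this article, but which stage id each maps to (apart from "[boxer] raw=" → "boxer"), and the stage ids used for the legacy pipeline, are guesses made for illustration.

```python
import re

# Hypothetical pattern table: (compiled regex, stage id) per pipeline.
STAGE_PATTERNS = {
    "boxer": [
        (re.compile(r"^\[boxer\] cached geometry"), "lingbot"),
        (re.compile(r"^\[boxer\] raw="), "boxer"),
    ],
    "vggt": [
        (re.compile(r"^models ready"), "models"),
        (re.compile(r"^keyframe \d+: \d+ detections"), "detect"),
        (re.compile(r"^merged \d+ tracks"), "merge"),
    ],
}

def classify_log_line(line, pipeline):
    """Return the stage id a stdout line announces, or None for plain log noise."""
    text = line.strip()
    for pattern, stage in STAGE_PATTERNS.get(pipeline, []):
        if pattern.search(text):
            return stage
    return None
```

Keeping the patterns in data rather than in code means adding a pipeline stage is a one-line change, at the cost of coupling the dashboard to the exact wording of the pipeline's print statements.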

The code, simplified

The real files are ~2 000 lines across app.py / obb.py / room.py / render.py / fusion.py / dashboard/main.py. What follows strips every error handler, every cache, and every percentile clip so you can see the spine. The real code adds the percentile clips back in, because that's where most of the work lives.

python
# dashboard/main.py: spine of the upload → Modal → stream flow.
# RUNS_DIR (a pathlib.Path), bus (the in-memory RunBus), and
# _classify_log_line live elsewhere in the real file.
import subprocess
import threading
import uuid
from datetime import datetime

from fastapi import FastAPI, Form, UploadFile

app = FastAPI()

@app.post("/api/upload")
async def upload(video: UploadFile, prompt: str = Form(...), pipeline: str = Form("boxer")):
    run_id  = datetime.now().strftime("%Y%m%d_%H%M%S_") + uuid.uuid4().hex[:6]
    out_dir = RUNS_DIR / run_id
    out_dir.mkdir()
    video_path = out_dir / video.filename
    video_path.write_bytes(await video.read())

    # Run the pipeline off the request thread so the upload returns immediately.
    threading.Thread(target=_kickoff_modal_run,
                     args=(run_id, video_path, prompt, pipeline), daemon=True).start()
    return {"run_id": run_id, "pipeline": pipeline}

def _kickoff_modal_run(run_id, video_path, prompt, pipeline):
    entrypoint = {"boxer": "app.py::analyze_boxer", "vggt": "app.py::analyze"}[pipeline]
    cmd = ["modal", "run", entrypoint,
           "--video", str(video_path), "--prompt", prompt, "--run-id", run_id]

    # Merge stderr into stdout so the browser sees one ordered log stream.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        bus.publish(run_id, {"stage": "log", "status": "info", "message": line.rstrip()})
        stage = _classify_log_line(line, pipeline)                  # e.g. "[boxer] raw=" -> "boxer"
        if stage:
            bus.publish(run_id, {"stage": stage, "status": "ok"})
    bus.publish(run_id, {"stage": "run", "status": "ok" if proc.wait() == 0 else "failed"})

That's the entire frontend/backend boundary. The dashboard has no idea what a “VGGT” is. It just knows how to spawn a Modal subprocess and how to read stdout.
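
The bus object used above is not shown in the article. One plausible shape for it, sketched here with the standard library only, is a per-run publish/subscribe hub. The class body, the subscribe method, and the replay-on-subscribe behavior are all assumptions, not the real RunBus.

```python
import queue
import threading
from collections import defaultdict

class RunBus:
    """In-memory pub/sub keyed by run_id (hypothetical sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._history = defaultdict(list)       # run_id -> all events so far
        self._subscribers = defaultdict(list)   # run_id -> list of queues

    def publish(self, run_id, event):
        with self._lock:
            self._history[run_id].append(event)
            targets = list(self._subscribers[run_id])
        for q in targets:
            q.put(event)

    def subscribe(self, run_id):
        """Return a queue pre-filled with past events, then fed live ones."""
        q = queue.Queue()
        with self._lock:
            for event in self._history[run_id]:
                q.put(event)
            self._subscribers[run_id].append(q)
        return q
```

Replaying history on subscribe matters because the browser usually opens its WebSocket a moment after the subprocess has started printing. Without the backlog, the first few stages would never light up.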

python
# app.py  — the Boxer local_entrypoint. Runs on your laptop; the .remote()
#             calls round-trip to two deployed Modal apps.

@app.local_entrypoint()
def analyze_boxer(video: str, prompt: str, run_id: str = None):
    run_id  = run_id or _new_run_id()
    out_dir = (Path(__file__).parent / "runs" / run_id); out_dir.mkdir(parents=True)
    video_bytes = Path(video).read_bytes()

    # 1. Geometry: frames → poses + depth + world points. Remote A10G.
    LingBot = modal.Cls.from_name("gas-lingbot-experiment", "LingBotPipeline")
    geom    = LingBot().run.remote(video_bytes, max_frames=64, max_side=512)
    (out_dir / "geometry.npz").write_bytes(geom["geometry_npz"])
    with np.load(io.BytesIO(geom["geometry_npz"])) as z:
        frames, ext, K, world_points, wp_conf, depth = (
            z["frames"], z["extrinsic_wc"], z["intrinsic"],
            z["world_points"], z["world_points_conf"], z["depth"])

    # 2. Detection: frames + extrinsics → oriented 3D boxes + labels. Remote A10G.
    Boxer     = modal.Cls.from_name("gas-boxer-experiment", "BoxerPipeline")
    boxer_out = Boxer().run.remote(video_bytes=video_bytes, text_prompt=prompt,
                                   extrinsic_wc=npy(ext), intrinsic=npy(K),
                                   frames_bytes=npy(frames), depth_bytes=npy(depth),
                                   gravity=[0, 0, 1])

    # 3. Room extraction (local, no GPU): world points → floor polygon + walls.
    room = extract_room(world_points, wp_conf, trajectory=inverse_camera_centers(ext))
    rect = fit_oriented_rectangle(room["floor"]["polygon_xy"])
    room["floor"]["polygon_xy"], room["floor"]["area_m2"] = rect.tolist(), polygon_area(rect)

    # 4. Transform each Boxer OBB to room frame, snap bottom to floor, keep yaw only.
    R_w2r, t_w2r = room["room_transform"]["R"], room["room_transform"]["t"]
    scene_graph = []
    for obj in boxer_out["scene_graph"]:
        obb = OBB.from_record(obj, frame="world").transformed(R_w2r, t_w2r).gravitize(floor_z=0)
        if 1e-4 < np.prod(obb.extent) < 6.0:          # volume sanity
            scene_graph.append(obb.to_record(label=obj["label"]))

    # 5. Artifacts. All local matplotlib + Open3D.
    render_floorplan_v2(scene_graph, out_dir / "floorplan_v2.png")
    render_dollhouse_mp4(scene_graph, out_dir / "dollhouse.mp4")
    render_pov_mp4(frames, depth, ext, K, scene_graph, out_dir / "pov.mp4")

    # 6. Canonical manifest the dashboard reads.
    (out_dir / "run_manifest.json").write_text(json.dumps({
        "run_id": run_id, "pipeline_version": "gas2@0.3-boxer",
        "scene_graph": scene_graph, "room": room,
        "artifacts": {"dollhouse_mp4": "dollhouse.mp4",
                      "floorplan_v2_png": "floorplan_v2.png",
                      "pov_mp4": "pov.mp4"},
    }, indent=2))

Three Modal apps, six steps, one JSON file to rule them all. Every subsequent improvement — metric scale, more detection recall, an A100 for higher-res Boxer, a real Rerun recording for the Boxer path — lands as one more .remote() call or one more transform on the same scene graph.
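
Step 4 is the least self-explanatory line in the listing. As a rough sketch of what a `transformed(...).gravitize(floor_z=0)` chain could do: the function below, its argument layout, and the yaw-from-first-axis convention are assumptions for illustration, not the repo's `OBB` class.

```python
import numpy as np

def to_room_frame_gravitized(center, R_box, extent, R_w2r, t_w2r, floor_z=0.0):
    """Hypothetical stand-in for OBB.transformed(...).gravitize(...).

    center: (3,) box center in the world frame
    R_box:  (3, 3) box-to-world rotation, columns are the box axes
    extent: (3,) full side lengths along the box axes
    Assumes the room frame has +z up and that the box's third axis is
    already close to vertical, so extent[2] is treated as the height.
    """
    center = R_w2r @ np.asarray(center, dtype=float) + np.asarray(t_w2r, dtype=float)
    R_room = R_w2r @ np.asarray(R_box, dtype=float)

    # Keep yaw only: heading of the box's first axis projected onto the floor.
    yaw = float(np.arctan2(R_room[1, 0], R_room[0, 0]))
    c, s = np.cos(yaw), np.sin(yaw)
    R_yaw = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    # Snap the bottom face to the floor: the center sits half a height above it.
    extent = np.asarray(extent, dtype=float)
    center[2] = floor_z + extent[2] / 2.0
    return center, R_yaw, extent
```

Discarding roll and pitch trades fidelity for stability: a slightly tilted detection of a bookshelf becomes an upright box, which is what a floorplan renderer wants anyway.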

ℹ Still far from done

The dashboard kicks off, the DAG lights up, the artifacts land. That's the product skin. Underneath, the detection recall is still scene-dependent, the walls are empty on ordinary phone footage, and the geometry is not metric. I chose to ship what exists today because shipping and iterating was the point. Tomorrow's list starts with a camera-height prior for metric conversion and an A100 run of Boxer at 960 px. None of that invalidates what you see above — it just gets us further.
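
The camera-height prior can be sketched in a few lines. Nothing here comes from the repo: it assumes the room frame puts the floor at z = 0, that the phone was held at roughly 1.4 m, and the function name `scale_from_camera_height` is made up for illustration.

```python
from statistics import median

def scale_from_camera_height(camera_z, assumed_height_m=1.4):
    """Scalar that maps reconstruction units to meters.

    camera_z: camera-center heights above the floor plane, in the
    reconstruction's own (non-metric) units.
    assumed_height_m: guessed real-world holding height of the phone.
    """
    heights = [z for z in camera_z if z > 0]
    if not heights:
        raise ValueError("no camera centers above the floor plane")
    return assumed_height_m / median(heights)

# Example: cameras hover around 0.7 units above the floor.
scale = scale_from_camera_height([0.68, 0.70, 0.71, 0.73, 0.69])
centroid_m = [scale * v for v in (1.2, 0.4, 0.25)]
```

The median keeps a few frames where the phone dips or lifts from skewing the estimate, but the result is only as good as the height guess.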

Chapter 15

Honest evaluation

What works:

  • Pipeline is genuinely zero-shot. Change the prompt, change the classes. No training data, no fine-tuning.
  • POV video is usable product-quality: masks track, detection boxes persist, depth colorization reads.
  • Floorplan is recognizable as a room layout. Trajectory is visible, furniture is in roughly the right relative positions.
  • Scene graph centroids land within roughly 10-20 cm of the true positions in our test video. This is an eyeball comparison against the room we shot, not a measured error, and it inherits the scale caveat below.
  • Total cost of producing all six artifacts from a 55-second video: under five cents on A10G.

What's broken or limited:

  • No metric scale. VGGT's world units aren't meters. Our centroids are "roughly meters", but that's an accident of training-data distributions, not a guarantee. If you want to say "the chair is 1.8 m from the door", you need a reference: the metric variant of Depth Anything V2, a LiDAR frame, or a known-scale object in view.
  • Label duplicates still exist. Our label+proximity merge collapses the easy cases but fails when Grounding DINO flip-flops between adjacent concepts ("chair" vs. "sofa chair"). CLIP-feature merging is the right fix and we haven't done it.
  • 64-frame ceiling on A10G. An A100-40GB would let us push to ~256 frames. For a long walkthrough (> 2 minutes at 30 fps), you'd want to chunk frames with overlapping VGGT passes and stitch the poses, or swap VGGT for MASt3R-SLAM, which handles arbitrary length with loop closure.
  • Poisson mesh is rough. The hallucinated-surface clip helps, but there's no avoiding that Poisson assumes a closed manifold; point clouds from single-pass monocular depth rarely qualify. For a real mesh, TSDF fusion or a per-object SDF is the right tool.
  • Cross-session alignment is not solved. Record the same room twice, and the two scene graphs don't register to each other — each has its own VGGT-anchored world frame. CLIP features + ICP on the point clouds would fix this; we haven't done it.
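
For reference, a label+proximity merge of the kind described in the list above might look like the following. This is a minimal sketch, not the code in `gas2/app.py`: the record layout (`label`, `centroid`), the 0.5 threshold, and the greedy strategy are all assumptions.

```python
import math

def merge_by_label_and_proximity(objects, max_dist=0.5):
    """Greedy dedup: fold an object into an earlier one with the same
    label whose centroid lies within max_dist (reconstruction units)."""
    merged = []
    for obj in objects:
        for kept in merged:
            same_label = kept["label"] == obj["label"]
            close = math.dist(kept["centroid"], obj["centroid"]) <= max_dist
            if same_label and close:
                n = kept["count"]
                # Running mean of the centroids folded in so far.
                kept["centroid"] = tuple(
                    (k * n + o) / (n + 1)
                    for k, o in zip(kept["centroid"], obj["centroid"])
                )
                kept["count"] = n + 1
                break
        else:
            merged.append({"label": obj["label"],
                           "centroid": tuple(obj["centroid"]),
                           "count": 1})
    return merged

detections = [
    {"label": "chair", "centroid": (1.20, 0.50, 0.40)},
    {"label": "chair", "centroid": (1.25, 0.55, 0.40)},       # same chair, merged
    {"label": "sofa chair", "centroid": (1.22, 0.52, 0.40)},  # label flip, kept apart
]
result = merge_by_label_and_proximity(detections)
```

The third detection shows the failure mode: it is the same physical object, but an exact-label test cannot see that, so two entries survive.
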

ℹ What we did not measure

We didn't run this on the Replica dataset to compare with GAS v1's quantitative numbers. That's the next Mirdan entry — a proper eval with IoU of semantic regions vs. the paper's ground truth. This article was about the integration, not the benchmark.
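
The metric such an eval would report is plain intersection-over-union per semantic region. A minimal sketch, assuming regions are rasterized to sets of occupied grid cells; the representation and the function name are assumptions, not the paper's protocol.

```python
def region_iou(pred_cells, gt_cells):
    """IoU between two regions given as sets of (row, col) grid cells."""
    pred_cells, gt_cells = set(pred_cells), set(gt_cells)
    union = pred_cells | gt_cells
    if not union:
        return 0.0  # both empty: define IoU as 0 rather than divide by zero
    return len(pred_cells & gt_cells) / len(union)

# A 4x4 predicted region shifted one column from a 4x4 ground-truth region.
pred = {(r, c) for r in range(4) for c in range(1, 5)}
gt = {(r, c) for r in range(4) for c in range(0, 4)}
iou = region_iou(pred, gt)  # 12 shared cells / 20 total = 0.6
```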

Chapter 16

What's next

The shortest backlog, ranked by leverage:

  1. CLIP-based track merging. Replace the label+proximity merge with cosine-similarity merging on CLIP embeddings of mask regions. This is the "proper" ConceptGraphs step. Expected improvement: the label-flip duplicates (chair vs. sofa chair) collapse correctly; cross-session alignment becomes possible.
  2. Metric calibration via Depth Anything V2. Run the metric variant of Depth Anything V2 (DAv2) on a few keyframes, then solve for a single scalar scale factor against VGGT's depth. Multiplying the centroids by that factor puts them in real meters.
  3. Long-video ingestion. Chunk frames into overlapping 64-frame windows, run VGGT per window, stitch poses using the overlap. Or just swap in MASt3R-SLAM, which is designed for this. Either way, the 64-frame ceiling goes away.
  4. A queryable LLM interface. The scene graph has centroids and labels. Dump it into a prompt with a system message like "you can answer spatial queries about the room below" and you have a floor plan you can query in natural language. Mirdan experiment 02 might be this.
  5. Object-aware mesh reconstruction. Run Poisson per-object with its own voxel grid. The global mesh approach merges objects that should be distinct.
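
Item 3's chunking step is mostly index bookkeeping. A sketch, assuming a minimum overlap of 16 frames; the overlap size and the function name are illustrative choices, and the pose stitching itself is not shown.

```python
def overlapping_windows(n_frames, window=64, overlap=16):
    """Return (start, stop) index pairs covering range(n_frames).

    Consecutive windows share at least `overlap` frames, which is what a
    pose stitcher would align on. The last window is shifted back so it
    is full-length whenever n_frames >= window.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if n_frames <= window:
        return [(0, n_frames)]
    stride = window - overlap
    starts = list(range(0, n_frames - window, stride))
    starts.append(n_frames - window)  # final window ends exactly at n_frames
    return [(s, s + window) for s in starts]

windows = overlapping_windows(200)
```

For 200 sampled frames this yields four windows; the last one overlaps its predecessor by 24 frames rather than 16 because it is pulled back to end on the final frame.
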

Chapter 17

The full recipe

Everything compressed to a runbook:

bash
mkdir gas2 && cd gas2
cat > pyproject.toml << 'EOF'
[project]
name = "gas2"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = ["modal>=0.64", "rerun-sdk>=0.20", "matplotlib>=3.8"]
EOF

# Put app.py here — the full ~500 lines are in the repo. Key design choices:
#   - modal.Image.debian_slim()           (not nvidia/cuda, avoids Docker Hub)
#   - torch==2.5.1 + cu124 in final layer (force-reinstall to avoid cu13 drift)
#   - transformers>=4.44 for Grounding DINO  (HF port, no nvcc)
#   - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
#   - max_side=504, max_frames=64         (fits A10G 24 GB)
#   - bbox via 10/90 percentile           (not min/max)
#   - label+proximity merge dedup         (until CLIP-based)

pip install modal
modal setup                                     # browser OAuth

modal run app.py::download_weights              # one-time, ~10 min, ~50 GB
modal run app.py::analyze --video path/to/your.mp4

# Outputs in out/:
#   floorplan.png  (PCA-aligned top-down, ~0.1 MB)
#   pov.mp4        (RGB + masks + depth, ~1 MB)
#   scene.ply      (colored point cloud, ~2 MB)
#   scene.obj      (Poisson mesh, ~8 MB)
#   gas2.json      (scene graph with centroids)
#   gas2.rrd       (Rerun recording with POV)

# To scrub the POV:
pip install rerun-sdk
rerun out/gas2.rrd
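
The "10/90 percentile, not min/max" choice from the design notes in the runbook is easy to show in isolation. A sketch with NumPy, assuming axis-aligned extents over an (N, 3) array of points; the helper name and the synthetic data are hypothetical.

```python
import numpy as np

def robust_extent(points, lo=10, hi=90):
    """Axis-aligned (lower, upper) corners from per-axis percentiles."""
    points = np.asarray(points, dtype=float)
    lower, upper = np.percentile(points, [lo, hi], axis=0)
    return lower, upper

rng = np.random.default_rng(0)
chair = rng.uniform([1.0, 0.0, 0.0], [1.5, 0.5, 0.9], size=(500, 3))
stray = np.array([[4.0, 3.0, 2.5]])           # one depth outlier
cloud = np.vstack([chair, stray])

print(cloud.max(axis=0) - cloud.min(axis=0))  # min/max box swallows the outlier
lower, upper = robust_extent(cloud)
print(upper - lower)                          # percentile box stays chair-sized
```

The percentile box also trims some real surface, so it slightly underestimates the object; that is the price of ignoring stray depth points.
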

💡 The one-line summary of everything above

To wire four foundation models into a pipeline, you need to know: each model's output schema (not what the README says it is — what it literally returns), each model's dtype regime (does it expect autocast?), each model's input constraints (patch alignment, image dtype), and how your dependency graph resolves across image layers (the last pip install wins). The rest is Python.
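
As one concrete instance of an input constraint: the runbook's `max_side=504` is a multiple of 14. A sketch of the resize arithmetic, assuming a ViT patch size of 14 (an assumption here; check the backbone you actually load) and a hypothetical helper name.

```python
def patch_aligned_size(width, height, max_side=504, patch=14):
    """Shrink so the longer side is <= max_side, then round each side
    down to a multiple of `patch` (never below one patch)."""
    longest = max(width, height)

    def snap(v):
        if longest > max_side:
            v = v * max_side // longest  # integer downscale, no float rounding
        return max(patch, v // patch * patch)

    return snap(width), snap(height)

w, h = patch_aligned_size(1920, 1080)
```

Rounding each side independently changes the aspect ratio by a fraction of a patch, which is usually an acceptable trade for an input shape the model can tile exactly.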

Full source: gas2/app.py in the accompanying repo. Next Mirdan experiment: quantitative eval of this pipeline on Replica vs. the GAS v1 baseline — a proper "is this actually better" comparison, with plots.