TL;DR
We read the GAS v2 Veanor, decided we wanted a working implementation, and built it on Modal in an afternoon. Input: a 55-second phone walkthrough. Output: a 3D scene graph with 26 labeled objects, a first-person POV video with mask and depth overlays, a 2D floorplan, and a triangle mesh of the point cloud. No fine-tuning, no camera calibration, no local GPU.
Stacking four foundation models sounds like "four API calls." It's not. Each model has its own dtype convention, its own memory profile, its own install-time trap, and its own recent API rename. The paper describes a pipeline; the runnable version is the pipeline plus the six hours of version wrangling between them. This article is that six hours.
If you want the final code, it lives at gas2/app.py in the repo — one file, ~500 lines,
deployable with modal run. If you want to understand how it got there, read on.
You've seen the GAS v2 Veanor or are comfortable with: ViTs, monocular depth, open-vocabulary detection, mask tracking. We don't re-derive the models themselves — we derive the integration. The output of every chapter is either a code diff, a failure mode with an explanation, or a piece of math you need to size things correctly.
Why this problem
The thing we want is simple to state: I point my phone at a room, record a walkthrough, and get back a labeled 3D map. "There's a chair at (1.2, 0, 0.5). There's a door 2 meters north. There's a bookshelf along the east wall." Queryable in natural language. Works on any room I haven't pre-scanned.
The 2024 paper — "GAS" — did this with the classical recipe: SLAM for 3D, Faster R-CNN for detection, SAM for masks, glue code for fusion. That works, but "works" is carrying weight. To fine-tune Faster R-CNN for the 10 object classes in GAS, the authors assembled 12,000 images across 5 datasets and converted everything to COCO format. Add a new class? Restart the data pipeline.
The 2025 thesis behind GAS v2 is: every painful step above has been obviated by a foundation model that shipped in the last 18 months. Zero-shot open-vocabulary detection (Grounding DINO) eliminates the fine-tuning. One-shot monocular geometry (VGGT) eliminates the calibration step. Video-native mask tracking with a memory bank (SAM 2) eliminates the brittle IoU-matching frame-by-frame. You describe what you want to find, point the camera, and the pipeline runs.
The paper is a good read. But a paper's claim is never the same thing as "a script I can run on my phone video." The gap between them is where Mirdan lives. We set ourselves a weekend budget and went looking for the bill of materials.
Every system-build piece should set a concrete acceptance criterion before you touch code. Ours: given an arbitrary phone video of a room, produce a 3D scene graph with correctly labeled objects, their approximate positions, and a viewable POV video — in under ten minutes of wall-clock time on a cloud GPU, with under $5 total cost. Everything that follows is in service of that target, or a pivot forced by reality.
Why not a Mac
The first instinct for a weekend project is to run it on your laptop. Apple Silicon has respectable GPU performance and MPS is reasonably mature now. So let's ask the obvious question: does this stack run on an M-series Mac?
Three of the four models have at least one piece that refuses MPS:
- Grounding DINO (official repo) ships a custom CUDA kernel for MultiScaleDeformableAttention, the core op of deformable attention. It compiles at pip install time against the installed CUDA toolchain. There is no MPS backend. Without this op, the model doesn't run. Period.
- VGGT is written as PyTorch modules (no custom kernels), but in practice the released checkpoints, their mixed-precision regime (bf16), and the attention shapes they produce are tuned for CUDA. MPS will "work" in the sense that you can load the model, but inference on a 64-frame batch is orders of magnitude slower, and that's before we discuss bf16 support gaps.
- SAM 2 has optional fused CUDA ops for connected-components post-processing. These can technically be disabled at the cost of small gains; the memory-bank logic itself runs anywhere PyTorch runs, so this is the least-blocking of the three.
(There is a way to run Grounding DINO on CPU/MPS: use the HuggingFace transformers
port, which reimplements MSDA in pure PyTorch. We'll come back to this — it turns out to
be load-bearing on Modal too, for a different reason. Chapter 6.)
So: we need a CUDA machine. The options, ranked by how close "dev loop" is to "nothing to set up":
| Provider | Billing | Dev loop | Verdict |
|---|---|---|---|
| Modal | Per-second, no minimum | modal run app.py, no Docker, no k8s, no idle servers | Chosen. Python-native. |
| Colab Pro+ | Monthly, unit credits | Fastest to try one thing; A100s are flaky and you can't persist 50 GB of weights | Fine for a quick poke, not for iteration |
| Runpod / Lambda / Vast.ai | Per-hour, cheaper | You manage the box; ssh in, scp, manage Docker yourself | Right if you're doing a long batch job |
| GCP / AWS | Per-hour + reserved | Quota requests for A100s; VPC configuration; IAM | Overkill. Skip unless you have credits. |
Modal's pitch specifically: you write a Python function, add one decorator, and that function now
runs on an A10G (24 GB, $1.10/hr). You can mount a persistent volume, so you download VGGT's 5 GB
checkpoint once and keep it mounted across future runs. The container stays warm for five minutes
after your last call, so the second modal run starts in seconds, not minutes.
A typical iteration of "edit code, run, see output" costs a few cents.
Everything else about cloud GPU providers is downstream of one number: what's the smallest unit of time you pay for? If you pay by the hour, you keep the box up all day and your "debug cycle" is "write a script and run a test". If you pay by the second, your debug cycle is the same as local — hit run, see output, fix, hit run again — and the cost scales with your wall-clock, not with your idleness. For a Mirdan-style experimental build-log where you break things twelve times, per-second is orders of magnitude cheaper.
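To put rough numbers on that, here is a back-of-the-envelope sketch. Only the $1.10/hr A10G rate comes from the text above. The twelve 90-second runs, the assumption that the five-minute warm window is billed after every run, the eight-hour always-on day, and the assumption that the hourly box costs the same rate are all illustrative, not measurements.

```python
# Back-of-the-envelope comparison of per-second vs per-hour billing.
# Only the hourly rate comes from the article; every other number is assumed.
A10G_USD_PER_HOUR = 1.10


def per_second_cost(runs: int, seconds_per_run: float, warm_seconds: float = 300.0) -> float:
    """Cost when you pay for run time plus the warm window after each run."""
    billed_seconds = runs * (seconds_per_run + warm_seconds)
    return billed_seconds * A10G_USD_PER_HOUR / 3600.0


def per_hour_cost(hours_box_is_up: int) -> float:
    """Cost when the box stays up, and billed, for whole hours."""
    return hours_box_is_up * A10G_USD_PER_HOUR


debug_day = per_second_cost(runs=12, seconds_per_run=90)
always_on = per_hour_cost(hours_box_is_up=8)
print(f"per-second: ${debug_day:.2f}  per-hour: ${always_on:.2f}")
```

The ratio depends almost entirely on how long the hourly box would otherwise sit idle, so treat the output as a way to plug in your own numbers rather than as a benchmark.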
The stack, from first principles
Four foundation models, each doing one thing, pipelined.
RGB frames split three ways: VGGT recovers geometry, Grounding DINO detects objects on keyframes, SAM 2 propagates masks across the full clip. The three outputs converge on a 3D lifting + dedup step that produces the scene graph, which fans out to six artifacts.
3.1 — VGGT: one forward pass, three geometric outputs
VGGT (Visual Geometry Grounded Transformer, Meta, 2025) takes a batch of N RGB frames and returns, in a single forward pass: camera extrinsics, camera intrinsics, per-pixel depth, and per-pixel world points, all jointly-consistent. You can think of it as the feed-forward replacement for SfM + monocular depth + pose optimization, packaged as a ViT.
For our purposes, the most useful output is world_points: a tensor of shape
[B, N, H, W, 3] where each pixel of each frame has a 3D coordinate in a shared world
frame. We do not have to unproject depth through the intrinsic and transform through the extrinsic
ourselves — VGGT hands us the answer. (It took us three crashes to realize that. Chapter 9.)
Metric scale. VGGT's world frame is anchored to the first camera, and the unit is whatever-the-network-decided-on — roughly meters but not calibrated. If you want to say "the chair is 1.8 meters from the door", you need an absolute depth reference (a LiDAR frame, a known-scale object, Depth Anything V2 metric). For a semantic map with relative spatial relationships, VGGT is enough.
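If you do have a known-scale object in view, the rescaling itself is one division. A minimal sketch, assuming you can pick two 3D points on that object from world_points and that a single global scale factor is good enough; the helper names, the door coordinates, and the 2.0 m door height are hypothetical:

```python
import math


def scale_from_reference(p_a, p_b, known_length_m):
    """Meters per VGGT unit, from two 3D points whose true separation is known."""
    measured = math.dist(p_a, p_b)
    if measured == 0:
        raise ValueError("reference points coincide")
    return known_length_m / measured


def to_meters(point, scale):
    """Apply the global scale factor to one 3D point."""
    return tuple(scale * c for c in point)


# Hypothetical reference: bottom and top of a door in VGGT units, assumed 2.0 m tall.
door_bottom, door_top = (0.4, 0.9, 2.0), (0.4, -0.7, 2.0)
scale = scale_from_reference(door_bottom, door_top, known_length_m=2.0)
chair_m = to_meters((1.0, 0.8, 1.2), scale)
print(scale, chair_m)
```

Because the world frame is anchored at the first camera, scaling every point by the same factor keeps all relative geometry intact; only the unit changes.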
3.2 — Grounding DINO: text in, boxes out, no fine-tuning
The classical detector (Faster R-CNN) learns a fixed vocabulary at training time. Grounding DINO
learns cross-attention between text embeddings and image features, so the vocabulary is
set at inference time by the prompt. Give it the string "chair. table. door."
and it returns boxes for those classes. Give it "fire extinguisher." and it finds
fire extinguishers — with no re-training.
The prompt format is "class1. class2. class3." — lowercase phrases separated
by periods, not commas. Internally, each period-segment becomes a separate "text query" that
the model aligns to image regions. "chair, table" is treated as a single phrase
("the phrase chair comma table") and performs terribly.
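Since the format is easy to get wrong by hand, it helps to build the prompt from a list. A small sketch of a hypothetical helper (build_gdino_prompt is our name, not part of any library) that applies the lowercase, period-separated convention described above:

```python
def build_gdino_prompt(classes):
    """Join class names into the 'a. b. c.' format described above."""
    cleaned = []
    for name in classes:
        name = name.strip().lower().rstrip(".")
        if not name:
            continue  # skip empty entries instead of emitting a bare period
        if "." in name:
            raise ValueError(f"class name must not contain a period: {name!r}")
        cleaned.append(name)
    if not cleaned:
        raise ValueError("need at least one class name")
    return ". ".join(cleaned) + "."


print(build_gdino_prompt(["Chair", "table ", "Fire Extinguisher"]))
# chair. table. fire extinguisher.
```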
3.3 — SAM 2: masks with a memory bank
SAM (2023) gives you a beautiful per-frame mask given a point or box prompt. SAM 2 (2024) adds a "memory bank" that makes masks persistent across a video. You prompt object #3 at frame 0; it produces a mask. It also stores an appearance embedding. At frame 50, when the object reappears after being occluded, SAM 2 matches against the memory and produces the mask for the same object. No IoU-matching hacks, no tracker state to tune.
In our pipeline, Grounding DINO produces boxes on keyframes (every 8th frame); SAM 2 seeds a track from each box and propagates masks through all frames. This is the fusion the paper recommends, and it works.
3.4 — Two models the paper doesn't mention
We added two tools not in the paper, purely for the output pipeline:
- Open3D for point-cloud I/O and Poisson mesh reconstruction. The paper discusses 3D Gaussian Splatting as a representation; for a first pass, a voxel-downsampled PLY and a Poisson mesh give you something you can open in MeshLab.
- Rerun for visualization. Rerun lets us log camera poses, images, masks, and point clouds under a time-indexed hierarchy, then scrub through the recording. It's the right tool for SLAM-style data because it understands the temporal and spatial structure natively.
Writing the scaffold
Modal's mental model is: your app.py declares "functions that run in the cloud",
a container image, and a persistent volume. You invoke functions locally; Modal serializes arguments,
runs them remotely, and streams results back. For a stateful pipeline that loads 5 GB of weights, the
right unit is an @app.cls — a class whose @modal.enter() method runs
once per container and loads the models; subsequent @modal.method() calls reuse that
warm state.
import modal
app = modal.App("gas-v2")
weights = modal.Volume.from_name("gas-v2-weights", create_if_missing=True)
WEIGHTS_DIR = "/weights"
image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "ffmpeg", "libgl1", "libglib2.0-0")
    .pip_install("torch==2.4.0", "torchvision==0.19.0",
                 index_url="https://download.pytorch.org/whl/cu124")
    .pip_install("transformers>=4.44", "opencv-python-headless", "pillow",
                 "numpy<2", "huggingface_hub", "rerun-sdk>=0.20")
    .run_commands("pip install git+https://github.com/facebookresearch/sam2.git")
    .run_commands("pip install git+https://github.com/IDEA-Research/GroundingDINO.git")
    .run_commands("pip install git+https://github.com/facebookresearch/vggt.git")
)
@app.function(image=image, volumes={WEIGHTS_DIR: weights}, timeout=3600)
def download_weights():
    # one-time: pull 5 GB VGGT + 2 GB SAM 2 + 1 GB GDINO + 4 GB DINOv2 into the volume
    ...
@app.cls(image=image, gpu="A10G", volumes={WEIGHTS_DIR: weights},
         timeout=1800, scaledown_window=300)
class GasV2Pipeline:
    @modal.enter()
    def load(self):
        # load all four models, keep them in self for subsequent calls
        ...
    @modal.method()
    def run(self, video_bytes: bytes, text_prompt: str) -> dict:
        # the actual pipeline
        ...
That's the shape. The scaledown_window=300 means the container stays alive for 5 minutes
after the last call — iterating with the same warm weights costs essentially nothing. Modal
caches image layers by content hash, so re-runs with unchanged layers are instant.
Note one thing we got right on the first attempt: we used nvidia/cuda:...-devel
because Grounding DINO's official install compiles a CUDA op at pip install time and
needs nvcc. That's the textbook answer. It's also the thing that bit us.
Run 1 · fail · Docker Hub's IPv6 tantrum
First modal run. Image build starts. Modal's build worker shells out to skopeo
to pull nvidia/cuda:12.4.1-devel-ubuntu22.04 from Docker Hub, and:
Two retries, same error. That IPv6 address belongs to AWS's Docker Hub registry replica. Either Modal's workers have an IPv6 routing issue, or Docker Hub is having one — either way, the blast radius is "you can't use any base image hosted on docker.io."
The fix is not to fight the network. It's to move to a base that isn't hosted on docker.io.
Modal has a built-in debian_slim base that it serves from its own infrastructure, always
reachable from its own build workers. Switching to it drops the Docker Hub dependency entirely.
But that breaks something else: debian_slim has no CUDA toolkit. No nvcc,
no CUDA headers. Grounding DINO's official repo won't install. What now?
Any external registry in your build is a correlated failure domain — your deploy is alive only as long as everything it transitively depends on is alive. The standard practical response is: either use your provider's own-hosted images for base layers, or mirror the ones you need into your own registry. We took the first option because it was one line of diff.
Run 2 · success · Why we never touched nvcc
The reason we thought we needed nvcc is Grounding DINO's custom CUDA op. The
MultiScaleDeformableAttention kernel is written in C++/CUDA and compiled at install time.
It makes inference somewhat faster and more memory-efficient than a pure-PyTorch implementation of
deformable attention.
The important word is "somewhat." The HuggingFace transformers library has a
port of Grounding DINO that reimplements MSDA in pure PyTorch. It's not as fast as
the CUDA kernel. It needs no compile step. For our use case (8 detection calls per video, ~200ms each),
the difference is imperceptible. For the purposes of shipping an integration, it's gold — we
can drop the dev CUDA toolchain dependency entirely.
And in the model loader:
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
self.gdino_proc = AutoProcessor.from_pretrained(f"{WEIGHTS_DIR}/gdino")
self.gdino = AutoModelForZeroShotObjectDetection.from_pretrained(
f"{WEIGHTS_DIR}/gdino"
).to(self.device).eval()
PyTorch's cu124 wheels bundle the CUDA runtime libraries — libcudart,
libcublas, the whole set. You only need the CUDA toolkit (nvcc, headers) if you
compile CUDA code at install time. For inference-only stacks where nothing compiles, runtime wheels
are sufficient. That's our situation now.
Build succeeds. Weights download (VGGT 5 GB, SAM 2 2 GB, GDINO 1 GB, DINOv2 4 GB — we keep DINOv2 around for future CLIP-style feature merging). Container starts.
When you're doing inference integration (not research), prefer the library port over the paper repo. The paper repo optimizes for reproducing the paper. The library port optimizes for playing nicely with other libraries, installing without drama, and surviving your dependency graph. Both produce the same answers. One is two commands to get working; the other is a CUDA toolchain saga.
Run 3 · fail · The CUDA 13 ambush
Container starts. Models load. VGGT forward pass runs. Grounding DINO runs. SAM 2 gets seeded. SAM 2's mask decoder fires and:
The libnvrtc-builtins shared object is part of NVIDIA's runtime compilation library —
torch.compile, fused kernels, some autograd paths use it. The error is "we looked for
the 13.0 version and couldn't find it."
We pinned torch==2.4.0 with cu124. Torch 2.4 + cu124 ships with NVIDIA's
CUDA 12.4 runtime. Why is something looking for CUDA 13?
Answer: SAM 2's pyproject.toml has a dependency on torch>=2.5. When pip
ran the SAM 2 install in a later image layer, it saw our pinned torch 2.4, decided that was too old,
and silently upgraded torch to the latest (2.8+) from pip's default PyPI index. That
wheel bundles the CUDA 13 runtime libraries, not the CUDA 12.4 ones our initial install pulled.
Runtime is now a mix of CUDA 12.4 and CUDA 13 files, and libnvrtc-builtins.so.13.0
(from the upgrade) expects companions that never got installed.
This is a pin-resolution collision. We pinned torch 2.4 in our first pip_install
layer. The SAM 2 install in a later layer saw that pin as "a starting point I'm allowed to
upgrade to satisfy my own constraints." pip is not, by default, a strict pinning tool; it's a
best-effort resolver that prefers to satisfy all constraints over respecting your earlier pins.
The fix is surgical. Add one more image layer after SAM 2 and VGGT install, force-reinstalling
a pinned torch from pytorch.org's CUDA 12.4 index. "The last pip install wins":
# after all other installs: pin torch consistently for runtime
.run_commands(
"pip install --upgrade --force-reinstall "
"torch==2.5.1 torchvision==0.20.1 "
"--index-url https://download.pytorch.org/whl/cu124"
)
torch==2.5.1 satisfies SAM 2's >=2.5 constraint (SAM 2 gets to use its
APIs), and the cu124 index ensures the CUDA 12.4 runtime libraries are the ones actually
installed. --force-reinstall kicks out whatever SAM 2 pulled.
The signal that pip silently upgraded torch is in the build log — you'll see lines like
Downloading .../nvidia_cublas-13.1.0.3-py3-none-manylinux...whl in a layer where you
only expected SAM 2's deps. NVIDIA's CUDA 13 packages have -cu13 in the filename;
CUDA 12's have -cu12. If you see cu13 anywhere in your build log and you
pinned cu124, something upgraded torch under you.
In a multi-layer image, pip's resolver runs independently in each layer. Earlier pins can be
overridden by later installs. The reliable pattern is to pin the thing you care about in the
last layer that touches it, with --force-reinstall, so nothing gets to override
it afterwards.
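A cheap guard against this happening again is to check the torch build at container start and fail loudly. A minimal sketch, assuming the wheel reports its CUDA variant as a local version tag (the cu124 wheels report "2.5.1+cu124" from torch.__version__); check_torch_build is a hypothetical helper, not something Modal or PyTorch provides:

```python
def check_torch_build(version: str, expected_release: str, expected_cuda_tag: str) -> None:
    """Raise if a torch version string does not match the pinned release and CUDA tag.

    `version` is what torch.__version__ reports, e.g. "2.5.1+cu124".
    """
    release, _, local = version.partition("+")
    if release != expected_release:
        raise RuntimeError(f"torch release is {release}, expected {expected_release}")
    if local != expected_cuda_tag:
        raise RuntimeError(
            f"torch CUDA tag is {local or 'missing'}, expected {expected_cuda_tag}"
        )


# In the container you would pass torch.__version__; a literal keeps this runnable anywhere.
check_torch_build("2.5.1+cu124", "2.5.1", "cu124")
```

Calling this at the top of the model-loading method turns a confusing missing-library error deep in a model's forward pass into an immediate, readable failure.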
Runs 4 & 5 · fail · How much video fits in 24 GB?
Run 4 kicks off. Models load. Video uploads. Frames decode:
VGGT's ViT uses patch size 14. It can only tokenize images whose height and width are divisible by 14. 720 / 14 = 51.4. Assertion.
Easy fix: resize frames to the nearest smaller multiple of 14 at decode time. 720 → 714, 1280 → 1274. Run 4 tries again and gets to VGGT's forward pass — where it immediately hits:
This deserves to be derived, not just fixed. Why does VGGT want 4.5 GiB for one tensor?
The activation memory math
VGGT processes all N frames as a single sequence. After patchification, each frame becomes
P = ⌊H/14⌋ × ⌊W/14⌋ tokens. The full sequence is
T = N × P tokens. The self-attention layer computes a full T × T
attention matrix.
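For a ViT processing N frames at resolution H × W with patch size 14 and attention dtype bf16 (2 bytes):

$$
\begin{aligned}
\text{tokens\_per\_frame} &= \lfloor H/14 \rfloor \cdot \lfloor W/14 \rfloor \\
\text{total\_tokens} &= N \cdot \text{tokens\_per\_frame} \\
\text{attn\_bytes} &= \text{total\_tokens}^2 \cdot 2 \quad \text{(per layer, per head, without flash-attn)}
\end{aligned}
$$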
Plug in our Run-4 numbers: N=128, H=714, W=1274 → tokens/frame = 51 · 91 = 4641. Total tokens = 128 · 4641 = 594K. Attention matrix in bf16: 594K² · 2 bytes = 705 GB. That's obviously impossible; VGGT uses flash attention internally, which trades this O(T²) term for O(T·d) memory. But flash attention still has O(T·d) activations and other intermediates that scale with T. On the order of a few GB per block, stacked across layers, with a ViT-G depth.
The practical takeaway: activation memory scales with frames and resolution jointly. You have two knobs, and one of them helps a lot more than the other.
Fix step 1: cap the longest side. We set max_side=504 — that's 36 patches wide, well
inside the memory budget. For our 720×1280 portrait video, the aspect-preserving resize is
280×504. That's small visually but still fine for detection (humans can identify
a chair in a 280-pixel crop just fine; Grounding DINO does better than humans on this).
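The resize rule (cap the longest side, then round both sides down to a multiple of 14) can be sketched as a small helper. This is an illustration under those two assumptions, not necessarily how app.py does it; fit_to_patch_grid is a hypothetical name:

```python
from typing import Optional, Tuple

PATCH = 14  # VGGT's ViT patch size, per the text above


def fit_to_patch_grid(height: int, width: int,
                      max_side: Optional[int] = None) -> Tuple[int, int]:
    """Shrink (height, width) so the longest side is <= max_side and both
    sides are multiples of PATCH, preserving aspect ratio up to rounding."""
    longest = max(height, width)
    if max_side is not None and longest > max_side:
        # Integer arithmetic avoids float round-off right at a patch boundary.
        height = height * max_side // longest
        width = width * max_side // longest
    # Round each side down to the nearest multiple of the patch size.
    new_h = (height // PATCH) * PATCH
    new_w = (width // PATCH) * PATCH
    if new_h == 0 or new_w == 0:
        raise ValueError("frame is smaller than one patch")
    return new_h, new_w


print(fit_to_patch_grid(720, 1280))                # (714, 1274)
print(fit_to_patch_grid(720, 1280, max_side=504))  # (280, 504)
```

The two printed cases reproduce the Run 4 size and the capped size quoted above.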
Fix step 2: cap max_frames to 64 as a safety margin. Then stride-sample across the whole
clip, so we still cover all 55 seconds: we sample every 25th frame (stride = ⌊1644 / 64⌋ = 25)
instead of reading the first 64 sequentially. At ~1.2 fps effective coverage, that's still dense enough
for SAM 2's memory bank to track objects.
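The stride sampling is a few lines. A minimal sketch, assuming an integer stride of total_frames // max_frames and truncation to the cap; strided_indices is a hypothetical helper name:

```python
def strided_indices(total_frames: int, max_frames: int) -> list:
    """Pick at most max_frames indices spread across the clip with a fixed integer stride."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    stride = total_frames // max_frames
    return list(range(0, total_frames, stride))[:max_frames]


idx = strided_indices(1644, 64)
print(len(idx), idx[1] - idx[0], idx[-1])  # 64 25 1575
```

With an integer stride the last sampled frame is 1575 of 1644, so the final couple of seconds go unsampled. If the end of the walkthrough matters, evenly spaced fractional positions would cover it, at the cost of an uneven frame gap.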
Fix step 3: turn on the expandable-segments allocator. PyTorch's default CUDA allocator fragments
memory across repeated allocations; expandable_segments is a newer allocator that
coalesces fragments. It's a free improvement under memory pressure.
.env({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"})
With all three changes: 64 frames at 280×504, expandable-segments on. Total tokens now 64 · (20 · 36) = 46K. Peak memory during VGGT: ~18 GB. We fit. Onward.
For any ViT-based video model, the cheap mental model is total_tokens = frames × patches_per_frame, and attention memory grows as a power of total tokens (quadratic without flash-attn, linear-ish with it, but still with large constants). Halving the resolution quarters the patches, halving the frames halves the tokens. If something doesn't fit, reach for resolution first (quadratic savings) before frames.
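The arithmetic in this chapter is easy to rerun for your own clip. A small sketch using the same formulas; the function names are ours, and naive_attention_bytes is the no-flash-attention upper bound, not a prediction of real peak memory:

```python
PATCH = 14  # VGGT's ViT patch size, per the text above


def total_tokens(n_frames: int, height: int, width: int) -> int:
    """frames x patches_per_frame, with each side floored to whole patches."""
    return n_frames * (height // PATCH) * (width // PATCH)


def naive_attention_bytes(tokens: int, bytes_per_elem: int = 2) -> int:
    """Size of one full T x T attention matrix (per layer, per head, no flash-attn)."""
    return tokens * tokens * bytes_per_elem


run4 = total_tokens(128, 714, 1274)
fixed = total_tokens(64, 280, 504)
print(run4, round(naive_attention_bytes(run4) / 1e9, 1))    # 594048 705.8
print(fixed, round(naive_attention_bytes(fixed) / 1e9, 2))  # 46080 4.25

# The two knobs: halving each side quarters the tokens, halving frames halves them.
print(total_tokens(64, 140, 252), total_tokens(32, 280, 504))  # 11520 23040
```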
Run 6 · fail · VGGT's real API
Memory fixed. VGGT forward pass runs. Next line:
My draft code assumed VGGT's output dict has an extrinsic key. It doesn't. What it actually
returns:
Two things to note. First, extrinsic and intrinsic matrices aren't direct outputs; you decode them via a helper:
from vggt.utils.pose_enc import pose_encoding_to_extri_intri
ext34, intr = pose_encoding_to_extri_intri(pred["pose_enc"], imgs.shape[-2:])
# ext34: [1, S, 3, 4] — world-to-camera
# intr: [1, S, 3, 3]
Second, and more importantly: VGGT already gives you per-pixel world points. We were about to write a pinhole back-projection:
But VGGT's world_points[f, y, x] is already that value. We just index it:
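For reference, this is roughly the work a hand-rolled back-projection has to do for a single pixel. It is an illustrative sketch, not code from app.py. It assumes an undistorted pinhole camera and a world-to-camera extrinsic (X_cam = R · X_world + t); both function names are hypothetical:

```python
def project_point(x_world, fx, fy, cx, cy, R, t):
    """World point -> (u, v, depth) with a world-to-camera extrinsic [R | t]."""
    x_cam = [sum(R[i][k] * x_world[k] for k in range(3)) + t[i] for i in range(3)]
    u = fx * x_cam[0] / x_cam[2] + cx
    v = fy * x_cam[1] / x_cam[2] + cy
    return u, v, x_cam[2]


def backproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Pixel (u, v) with known depth -> world point; the inverse of project_point."""
    # Pixel -> camera frame through the inverse intrinsics.
    x_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Camera -> world: X_world = R^T @ (X_cam - t).
    d = [x_cam[i] - t[i] for i in range(3)]
    return tuple(sum(R[k][i] * d[k] for k in range(3)) for i in range(3))


# Round trip on made-up numbers: a 90-degree rotation about Y plus a translation.
R = [[0.0, 0.0, -1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
t = [0.5, 0.0, 2.0]
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
point = (1.0, 0.5, -0.5)
u, v, depth = project_point(point, R=R, t=t, **intrinsics)
recovered = backproject_pixel(u, v, depth, R=R, t=t, **intrinsics)
print(recovered)
```

Every step here is a place to get a convention wrong (pixel origin, depth along the ray versus along Z, which direction the extrinsic maps), which is the practical argument for indexing world_points directly.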
The other subtle detail: VGGT's extrinsic is world-to-camera (the standard computer vision
convention). Rerun's Transform3D, by contrast, is parent-from-child by default: when you log it
under world/cam, Rerun reads it as "how to take points in cam's frame into world." That is the
inverse of VGGT's matrix: what Rerun wants is the camera's pose in world coordinates
(world_from_cam), not a world-to-camera transform. You invert:
E_wc = np.eye(4); E_wc[:3] = extrinsic_wc[i] # VGGT: world-to-cam
E_cw = np.linalg.inv(E_wc) # Rerun wants world-from-cam
rr.log(cam_path, rr.Transform3D(
translation=E_cw[:3, 3].tolist(),
mat3x3=E_cw[:3, :3].tolist(),
))
Model README files describe what the model does. They often omit the exact keys of the
output dict, the dtype expectations, the pose convention. For foundation models that ship every
three months, the fastest path is: clone the repo, grep for the forward method,
print the output dict once, keep going. Three minutes of reading source usually saves an hour of
runtime error roulette.
Run 7 · fail · A renamed argument
VGGT now works. Grounding DINO's inference runs. Post-processing line:
Somewhere between transformers 4.44 and 4.48, the argument got renamed:
One-line fix. But the lesson is bigger than the fix: we pinned transformers>=4.44,
not transformers==4.44. Between project setup and run time, the image build picked up
4.48.x, which has this rename. The official docs at the time of writing still show box_threshold.
>= is not a pin
Minimum-version constraints are useful for library authors ("I need at least API X"). For
reproducible application builds, they're a landmine — your next deploy might land on any
later version, which might or might not be source-compatible. Build-time equality pins (==
or exact-version lockfiles) are what you want for applications.
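One way to keep yourself honest is to lint the requirement strings before they reach the image definition. A minimal sketch with a deliberately simple regex: it treats anything that is not a plain name==version as loose, including bare names, ranges, wildcards, and unpinned git URLs. The helper name is hypothetical:

```python
import re

# name, optional [extras], then '==' and a version with no wildcard
_EXACT = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=*]+$")


def loose_requirements(requirements):
    """Return the requirement strings that are not exact '==' pins."""
    return [r for r in requirements if not _EXACT.match(r.replace(" ", ""))]


reqs = ["torch==2.4.0", "transformers>=4.44", "numpy<2", "pillow", "rerun-sdk>=0.20"]
print(loose_requirements(reqs))
# ['transformers>=4.44', 'numpy<2', 'pillow', 'rerun-sdk>=0.20']
```

Run against the scaffold's own install lists, this flags everything except torch, which is a fair summary of how Runs 3 and 7 happened.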
Run 8 · fail · SAM 2's autocast requirement
Grounding DINO fixed. It seeds 40 object boxes across 8 keyframes. SAM 2's propagate_in_video
starts, crunches for a second, and:
SAM 2 is an architecture with mixed precision. The image encoder (Hiera) is trained and stored in bf16 for efficiency. The mask decoder stores its weights in fp32. During inference, without an autocast context, image features arrive at the mask decoder as bf16 tensors; the decoder's fp32 weights can't matmul with them.
The standard fix is to wrap all SAM 2 calls in an autocast block, which promotes the
inference path to a consistent dtype:
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = self.sam2.init_state(video_path=str(frames_dir))
    for k, boxes in detections.items():
        for label, box in boxes:
            self.sam2.add_new_points_or_box(state, frame_idx=k, obj_id=next_id, box=box)
            next_id += 1
    for f_idx, obj_ids, mask_logits in self.sam2.propagate_in_video(state):
        masks_per_frame[f_idx] = {
            oid: (mask_logits[i, 0] > 0).cpu().numpy()
            for i, oid in enumerate(obj_ids)
        }
This is the kind of model-specific calling convention that lives in example scripts, not in the
module docstring. The fix is obvious once you know it; discovering it involves reading the repo's
notebooks/video_predictor_example.ipynb.
Why not .to(torch.bfloat16)? You can manually cast SAM 2's weights to bf16 at load time. It works, but you lose precision on the fp32 parts (the mask decoder's learned prompt embeddings) that the authors intentionally kept in full precision. Autocast is surgical: it runs the ops that benefit from bf16 in bf16 and leaves precision-sensitive ops in fp32. It's the right tool.
Run 9 · success · First working end-to-end
All fixes compound. Run 9 runs through every stage:
Pipeline works. Celebrate for thirty seconds, then open the output.
It is bad.
Thirty-nine objects, but six of them labeled "chair" and eight labeled "door" — clearly duplicates of the same physical furniture, seeded fresh at different keyframes. Rerun has sixty-four floating camera frustums but no obvious way to "play" the walkthrough and watch the POV move. The floorplan we generated (projecting X-Z of world points) is a mess of overlapping rectangles, each one spanning three meters in every direction, with labels stacked on top of each other.
The scene graph is correct but unusable. The rest of the build log is about making it usable.
Making the outputs actually useful
Five distinct problems with Run 9's output, addressed in order:
- Rerun has no POV view — each camera is its own static entity.
- The 2D floorplan is in the wrong coordinate frame — VGGT's world is camera-centric, not gravity-aligned.
- There's no 3D mesh artifact for downstream use.
- The POV video isn't a video — it's a scrubbable Rerun recording.
- The scene graph has duplicates and bloated bounding boxes.
13.1 — Rerun POV: one entity that moves
Run 9 logged each frame's camera as a separate static entity:
Result: 64 camera frustums floating in 3D space. To see frame 32's POV, you click
world/cam_0032 in the entity tree, navigate through, and the 2D view updates. It's not
a video experience; it's a static 3D scene with lots of cameras.
The Rerun-idiomatic way is a single dynamic camera entity that moves through time. You log
world/cam once per frame on a sequence timeline; Rerun holds the latest logged value between time steps,
and the 2D "POV" view automatically shows the current-time image:
import rerun as rr
import rerun.blueprint as rrb
for i in range(len(frames)):
    rr.set_time_sequence("frame", i)
    rr.log("world/cam", rr.Transform3D(translation=..., mat3x3=...))
    rr.log("world/cam/image", rr.Pinhole(focal_length=..., principal_point=...,
                                         width=W, height=H))
    rr.log("world/cam/image/rgb", rr.Image(frames[i]))
    rr.log("world/cam/image/depth", rr.DepthImage(depth[i], meter=1.0))
    rr.log("world/cam/image/masks", rr.SegmentationImage(seg[i]))
# Also pin a camera-trajectory polyline so the walk is visible in 3D
rr.log("world/trajectory",
       rr.LineStrips3D([traj_pts], colors=[[200, 200, 200]]),
       static=True)
# And a blueprint: 3D scene left, POV right
rr.send_blueprint(rrb.Blueprint(
    rrb.Horizontal(
        rrb.Spatial3DView(name="Scene", origin="/world"),
        rrb.Spatial2DView(name="POV", origin="/world/cam/image"),
        column_shares=[2, 1],
    )
))
Now the time scrubber advances the camera. The left panel shows the scene, the camera frustum
sliding along the trajectory. The right panel shows the RGB from the current frame, with SAM 2
masks overlaid (from the SegmentationImage), plus 2D detection boxes from Grounding DINO
on keyframes. It's the SLAM-visualization pattern Rerun was built for.
13.2 — 2D floorplan via PCA
The paper shows a satisfying 2D floorplan. We had a first attempt at replicating that: project VGGT's world points to the X-Z plane (drop Y), plot object bounding boxes. The result was unreadable — huge overlapping rectangles, labels piled on each other.
The issue: VGGT's world frame is not gravity-aligned. It's camera-centric — the first camera's optical axis defines +Z, its up vector defines -Y. If you held the phone upright perfectly, -Y is world-up and X-Z is the floor. If you didn't, it's tilted by whatever your phone was tilted by, and "X-Z" is an arbitrary diagonal slice through the room.
We need a gravity-aligned 2D plane. You could use IMU data, but phone videos don't carry it by default, and VGGT ignores it anyway. You could estimate the floor normal from the dominant direction of the depth gradient (floors are locally planar), but that's surgery.
Here's the trick: when you walk through a room carrying a camera, your camera positions lie in an approximately 2D plane, one parallel to the floor and about one and a half meters up. You walk along it, you don't levitate. The first two principal components of the camera trajectory span that plane.
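Collect camera positions $t_i \in \mathbb{R}^3$, $i = 1..N$. Center and SVD:

$$
\begin{aligned}
\mu &= \frac{1}{N} \sum_{i=1}^{N} t_i \\
T &= \operatorname{stack}(t_i - \mu) && \text{($N \times 3$ matrix)} \\
U \Sigma V^{T} &= \operatorname{SVD}(T) \\
\text{floor\_basis} &= V[:, :2] && \text{($3 \times 2$, top two principal axes)} \\
u_i &= (t_i - \mu) \cdot \text{floor\_basis} && \text{(2D projection)}
\end{aligned}
$$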
Apply the same (x - μ) · floor_basis to all 3D points (object centroids, AABB
corners) to put them in floorplan coordinates. The third principal component (smallest singular value)
points roughly along gravity — we drop it.
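In code:
import numpy as np
traj = np.array(trajectory)
center = traj.mean(axis=0)
T = traj - center
U, S, Vt = np.linalg.svd(T, full_matrices=False)
basis = Vt[:2]  # 2x3, rows = floor-plane axes
def project(p):
    # works for a single point (3,) or a stack of points (K, 3)
    return (np.asarray(p) - center) @ basis.T
traj_2d = project(traj)
for node in graph:
    lo, hi = node["bbox_min"], node["bbox_max"]
    # The floor basis is rotated relative to the AABB, so project all 8 corners
    # and take their 2D extent instead of projecting only min and max.
    corners = [(x, y, z) for x in (lo[0], hi[0]) for y in (lo[1], hi[1]) for z in (lo[2], hi[2])]
    corners_2d = project(corners)
    bmin_2d, bmax_2d = corners_2d.min(axis=0), corners_2d.max(axis=0)
    # draw rectangle from bmin_2d to bmax_2d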
Left: raw camera positions form a flat cloud oriented at a random angle because VGGT's world isn't gravity-aligned. Right: SVD recovers the two dominant axes (orange = floor plane); projecting onto them gives a clean top-down view.
Add to that: a gradient-colored trajectory line (teal start, amber end), transparent AABB outlines instead of filled rectangles (so overlaps are legible), labels above the rectangles not inside them, and the floorplan becomes readable.
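If you want to convince yourself the trick works before trusting it on real poses, a synthetic check is cheap. The sketch below fabricates a walk on a tilted plane with a little vertical bob and confirms that the smallest principal axis lines up with the plane's normal. The trajectory, noise level, and function name are made up for the test:

```python
import numpy as np


def floor_basis_from_trajectory(traj):
    """Return (center, 2x3 floor basis, unit normal) from camera positions via SVD."""
    traj = np.asarray(traj, dtype=float)
    center = traj.mean(axis=0)
    _, _, vt = np.linalg.svd(traj - center, full_matrices=False)
    return center, vt[:2], vt[2]


# Synthetic walk: positions on a tilted plane plus small "head bob" along its normal.
rng = np.random.default_rng(0)
normal = np.array([0.2, -1.0, 0.1])
normal /= np.linalg.norm(normal)
axis_a = np.cross(normal, [0.0, 0.0, 1.0])
axis_a /= np.linalg.norm(axis_a)
axis_b = np.cross(normal, axis_a)
s = rng.uniform(-2.0, 2.0, size=64)
t = rng.uniform(-1.5, 1.5, size=64)
bob = rng.normal(0.0, 0.02, size=64)
traj = np.outer(s, axis_a) + np.outer(t, axis_b) + np.outer(bob, normal) + [0.3, 1.5, -0.4]

center, basis, est_normal = floor_basis_from_trajectory(traj)
alignment = abs(float(est_normal @ normal))
print(f"|cos(angle between estimated and true normal)| = {alignment:.4f}")
```

The assumption that carries the method is that the walk actually spreads in two directions. A straight-line walk down a corridor leaves the second principal axis poorly determined, so the recovered plane can roll around the direction of travel.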
13.3 — Point cloud + Poisson mesh
The scene graph has per-object point clouds. A common next artifact is a unified mesh. Open3D does this in four steps:
- Concatenate all object points into one cloud, coloring each by its object ID.
- Voxel-downsample to ~1 cm resolution — removes duplicate points from overlapping views.
- Estimate normals (Poisson needs them), then run Poisson surface reconstruction.
- Trim low-density vertices — Poisson hallucinates smooth surfaces into unobserved regions; we clip the bottom 10% of vertex density to kill the worst of it.
import open3d as o3d
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(all_points)
pcd.colors = o3d.utility.Vector3dVector(all_colors)
pcd = pcd.voxel_down_sample(voxel_size=0.01)
pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))
o3d.io.write_point_cloud("scene.ply", pcd)
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=8)
d = np.asarray(densities)
mesh.remove_vertices_by_mask(d <= np.quantile(d, 0.1))
mesh.compute_vertex_normals()
o3d.io.write_triangle_mesh("scene.obj", mesh)
Output: a 2.4 MB PLY (47K points) and a 7.9 MB OBJ (96K triangles). Good enough to open in MeshLab or drag into a 3D viewer for orientation. Not production-grade; for that you'd use TSDF fusion or a neural SDF. But as a "here's what the pipeline built" artifact, it's a real file someone can interact with.
13.4 — POV MP4 with overlays
Rerun is great for exploratory viewing but assumes the user installs rerun-sdk and runs
the viewer. For sharing — a Slack thread, a blog post (hi) — you want a plain MP4. We
render one server-side: RGB with mask alpha-blend and detection boxes on top, colorized depth below,
stacked vertically for portrait playback.
The rendering loop is unremarkable — OpenCV VideoWriter with mp4v fourcc,
one frame per sample, mask overlay computed as cv2.addWeighted(rgb, 0.55, mask_colored, 0.45, 0).
Depth is cv2.applyColorMap on percentile-normalized values. The only thing to be careful
about is that SAM 2's old track IDs need to be remapped through the dedup merge (see 13.5 below) before
they're rendered, so the video's label legend matches the scene graph's label legend.
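The two per-frame operations can be sketched in plain NumPy, assuming nothing about the real rendering loop beyond the 0.55/0.45 weights quoted above. The function names and the 2nd/98th percentile bounds are assumptions. Unlike a full-frame cv2.addWeighted, this variant blends only inside the mask, so unmasked pixels keep their original brightness.

```python
import numpy as np

def normalize_depth(depth, lo_pct=2.0, hi_pct=98.0):
    """Map depth to uint8 [0, 255] using percentile bounds (bounds are assumed)."""
    lo, hi = np.percentile(depth, [lo_pct, hi_pct])
    scaled = (depth - lo) / max(hi - lo, 1e-6)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)

def blend_mask(rgb, mask, color, alpha=0.45):
    """Blend `color` into `rgb` where `mask` is True: (1 - alpha) * rgb + alpha * color."""
    out = rgb.astype(np.float32)
    out[mask] = (1 - alpha) * out[mask] + alpha * np.asarray(color, np.float32)
    return np.round(out).astype(np.uint8)
```

The uint8 depth image is what a colormap lookup would consume, and the color passed to `blend_mask` should be keyed on the merged track ID rather than the raw SAM 2 ID.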
13.5 — The duplicate-track problem
Run 9 had 39 objects, including six "chair" tracks and eight "door" tracks. The cause
is simple: SAM 2's add_new_points_or_box at frame 8 with a fresh obj_id
doesn't know that track #1 (seeded at frame 0) is probably the same physical chair. Every keyframe's
detections seed new IDs, even when they point at the same object.
The proper fix, and what the paper does, is ConceptGraphs-style CLIP-feature merging: for each track, compute a CLIP embedding of its masked region. Merge tracks whose embeddings are cosine-similar and whose centroids are spatially close. That handles "same object seen from different angles produces different masks but the same semantic content."
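As a rough sketch of that merge rule (not the paper's implementation), assume each track already has an embedding vector and a 3D centroid. The thresholds and the function name are assumptions, and computing the CLIP embeddings themselves is out of scope here.

```python
import numpy as np

def merge_by_embedding(embeddings, centroids, sim_thresh=0.85, dist_thresh=0.3):
    """Group tracks that are both cosine-similar and spatially close.

    embeddings: (N, D) array, centroids: (N, 3) array.
    Returns a list of group ids, numbered 0..k-1 in first-seen order.
    """
    n = len(embeddings)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    dist = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)

    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > sim_thresh and dist[i, j] < dist_thresh:
                parent[find(j)] = find(i)

    remap = {}
    return [remap.setdefault(find(i), len(remap)) for i in range(n)]
```

Union-find makes the merge transitive: if A matches B and B matches C, all three end up in one group even when A and C fall just short of the thresholds.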
We took a cheaper first pass: merge tracks that share a label and are within a small spatial distance. The assumption is stronger (Grounding DINO has to give the same label each time; we rely on that), but the code is twenty lines and fits the data we have.
# For each track (old_id), try to find an existing merged group with the
# same label whose aggregated centroid is close enough.
merge_dist = 0.15  # VGGT world units, roughly 15 cm
merged = {}  # new_id -> {"label", "pts" (list), "ids"}
id_map = {}  # old_id -> new_id
for old_id, rec in raw.items():
    assigned = None
    for new_id, m in merged.items():
        if m["label"] != rec["label"]:
            continue
        m_centroid = np.concatenate(m["pts"], axis=0).mean(axis=0)
        if np.linalg.norm(m_centroid - rec["centroid"]) < merge_dist:
            assigned = new_id
            break
    if assigned is None:
        assigned = len(merged) + 1
        merged[assigned] = {"label": rec["label"], "pts": [], "ids": []}
    merged[assigned]["pts"].append(rec["pts"])
    merged[assigned]["ids"].append(old_id)
    id_map[old_id] = assigned
Applied to Run 10, this collapses 33 raw tracks → 26 unique objects. Not perfect (the label-sharing assumption fails when Grounding DINO flip-flops between "chair" and "sofa chair" for the same object), but a massive qualitative improvement.
13.6 — The fat-AABB problem
In Run 9, object bounding boxes spanned the whole room. That's not a bug in the AABB computation; it's correct. VGGT's per-pixel world points have a heavy tail of low-confidence points at mask boundaries, in reflections, at occlusion edges. If you take min/max over all points in a mask, the min/max are dominated by these outliers. Each AABB becomes the bounding box of the object plus its weirdest few pixels.
The fix is to use percentile-based bounds: 10th percentile for the min, 90th percentile for the max. The bulk of the cloud is inside that box; the tails are clipped.
Left: min/max takes the extent of every pixel in the mask, including low-confidence tails that reach far into neighboring surfaces. Right: 10th/90th percentile keeps the dense center.
For good measure we also compute the centroid from only the percentile-bound inliers. The centroid of all points (including tails) is pulled toward whichever tail is longest; the centroid of just the percentile inliers sits where the object's visual mass is.
Nearly every 3D-from-image pipeline produces heavy-tailed point distributions, because the "hard" pixels of depth estimation (mask edges, specular reflections, thin structures, far surfaces) are the ones with the worst depth. Min/max statistics are the wrong summary for heavy-tailed data. 10/90 percentiles are usually the first thing to reach for.
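A minimal sketch of the percentile box and inlier centroid described in this section. The function name is hypothetical; the 10/90 defaults come from the text.

```python
import numpy as np

def robust_aabb(points, lo_pct=10.0, hi_pct=90.0):
    """Percentile-based AABB plus the centroid of the points inside it.

    points: (N, 3) array of world points belonging to one object mask.
    """
    lo = np.percentile(points, lo_pct, axis=0)
    hi = np.percentile(points, hi_pct, axis=0)
    # Inliers must fall inside the percentile bounds on every axis.
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    centroid = points[inside].mean(axis=0) if inside.any() else points.mean(axis=0)
    return lo, hi, centroid
```

With 2% of the points thrown far out along every axis, plain min/max stretches the box to the outliers while the percentile bounds stay on the dense core.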
Run 10 · success
The final pipeline
All improvements stacked. One modal run, one command, six artifacts:
$ modal run app.py::analyze
sending 69.7 MB to Modal; prompt='sofa. couch. chair. table. ...'
max_frames=64 stride=8 box=0.35 text=0.3 merge=0.15
models ready
sampled 64/1644 frames, resized 720x1280 → 280x504 (stride=25)
VGGT: wp range -2.26..1.07, conf p50=1.01
keyframe 0: 4 detections
keyframe 8: 3 detections
keyframe 16: 3 detections
keyframe 24: 3 detections
keyframe 32: 2 detections
keyframe 40: 4 detections
keyframe 48: 5 detections
keyframe 56: 9 detections
seeded 33 object tracks
merged 33 tracks → 26 unique objects
rerun .rrd: 62.0 MB
scene.ply: 47375 pts, 2.4 MB
scene.obj: 96418 tris, 7.9 MB
pov.mp4: 1.2 MB
wrote out/gas2.json
wrote out/floorplan.png
wrote out/scene.ply (2.4 MB)
wrote out/scene.obj (7.9 MB)
wrote out/pov.mp4 (1.2 MB)
wrote out/gas2.rrd (62.0 MB)
Total runtime, warm container: ~90 seconds. Cost on A10G: about $0.04. If the container was cold (image build + model load), add ~2 minutes on top.
Run 11 · finish
From pile of points to dollhouse
Run 10 worked. It crashed six times on the way there, but it worked. Twenty-six labeled objects, six artifacts, about a nickel per clip. So why am I writing this? Because "works" is not the same as "finished." If you put Run 10's Rerun recording next to Apple's RoomPlan dollhouse, you can see the gap: RoomPlan looks like a product and Run 10 looks like an experiment. Axis-aligned boxes around rotated sofas, a floorplan that's PCA scatter instead of an indoor layout, a palette where colors mean object IDs rather than categories, and the whole thing driven by a CLI. This update closes that gap.
Same VGGT → Grounding DINO → SAM 2 backbone. We added Apple's Cubify Transformer (CuTR) per keyframe, fused oriented boxes across frames with a greedy 3D-IoU scheme, kept a PCA-OBB fallback running in parallel so nothing gets lost when CuTR misses, rendered a stylized dollhouse MP4 and a category-keyed floorplan, and put the whole pipeline behind a local FastAPI dashboard. No model was fine-tuned. No dataset was collected. One new checkpoint was downloaded. The rest is plumbing and taste.
The bar we hadn't cleared
Four specific failures of v1:
- Rotated furniture, axis-aligned boxes. A sofa at 30° fills its AABB with two corners of air. The AABB is the 2×-too-big "shadow" of the object on the world axes.
- “Heavy tail” geometry on thin things. VGGT's per-pixel world points have a long tail of low-confidence outliers. p10/p90 clipping helps, but it still inflates a poster into a thin slab.
- Palette = object ID. obj_color(oid) = HSV(oid * 0.37) means the same physical chair is a different color on every rerun. The eye can't lean on “chair is apricot.”
- CLI as UI. Running the pipeline requires a terminal, a Modal profile, and a file path. The output is a bag of files in out/. Nothing about that feels like a product.
A 90-second research tree
The plan's operating principle — stolen from The AI Scientist-v2 — is tree-search over hypotheses, not linear pursuit. Four branches were on the table. One got chosen. One got kept as an always-on safety net. Two got politely declined.
| Branch | What it is | Decision |
|---|---|---|
| CuTR (Apple) | Single-image transformer for class-agnostic 3D OBBs. Trained on CA-1M (1K laser-scanned rooms, 400K objects). 2412.04458 | Chosen as the per-keyframe detector. Clean OBBs, open source, pairs cleanly with our existing labels from Grounding DINO. |
| PCA-OBB | Covariance-SVD on each SAM 2 track's points, p10/p90 extents. ~20 lines of numpy. | Kept always on. Free, cheap, never fails. Becomes the fallback when CuTR misses a track or kills itself on a bad scene. |
| Boxer (Meta) | OWLv2 + DINOv3 + BoxerNet + Hungarian OBB fusion. 2604.05212 | Declined. Two weeks of re-plumbing to swap our entire front end, CC-BY-NC weights, and gives up the SAM 2 track identity we already have. |
| Rooms from Motion | Un-posed images → poses + OBBs jointly. Objects-as-features SfM. 2505.23756 | Declined. No public implementation; re-implementing is a season of work. |
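The PCA-OBB row is small enough to sketch directly. This is an illustrative version under stated assumptions, not the repo's obb.py: the function name and return convention (center, full side lengths, rotation matrix with box axes as columns) are choices made for this example.

```python
import numpy as np

def pca_obb(points, lo_pct=10.0, hi_pct=90.0):
    """Oriented box from the covariance SVD, with percentile extents.

    points: (N, 3). Returns (center, extent, R); columns of R are the box axes.
    """
    mean = points.mean(axis=0)
    _, _, vt = np.linalg.svd(np.cov((points - mean).T))
    R = vt.T.copy()
    if np.linalg.det(R) < 0:
        R[:, 2] *= -1  # keep the frame right-handed
    local = (points - mean) @ R  # coordinates along the box axes
    lo = np.percentile(local, lo_pct, axis=0)
    hi = np.percentile(local, hi_pct, axis=0)
    center = mean + R @ ((lo + hi) / 2)
    return center, hi - lo, R
```

Note that p10/p90 extents on a uniformly filled box recover about 80% of each true side length, so the box is deliberately conservative.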
The architecture
The trick is that we don't swap any of the v1 components. VGGT still owns poses and depth. Grounding DINO still owns labels. SAM 2 still owns track identity. CuTR is bolted on as a sibling detector that produces geometry only — oriented boxes, no class — and a fusion stage at the end stitches CuTR geometry to SAM 2 labels.
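Figure: pipeline architecture. VGGT → poses, depth, world points. Grounding DINO → 2D boxes + labels (keyframes). SAM 2 → mask tracks. CuTR → per-keyframe 3D OBB (class-agnostic). PCA-OBB → per track, always on. Depth gate → reject if z disagrees > 30 %. Fusion → CuTR OBB if matched, PCA-OBB otherwise, killswitch < 30 %. Outputs: oriented wireframes (pov.mp4), pastel orbit (dollhouse.mp4), OBB footprints (floorplan_v2.png).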
The fusion rule is boring, which is the point. Two OBBs a and b merge when 3D-IoU exceeds a threshold and they agree on label, with a centroid-distance fallback for thin objects where small rotation errors collapse IoU to zero.
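$$
\mathrm{IoU}_{3D}(a, b) > 0.3 \;\wedge\; \mathrm{label}(a) \in \mathrm{labels}(b)
$$
or
$$
\mathrm{IoU}_{3D}(a, b) = 0 \;\wedge\; \lVert c_a - c_b \rVert < 0.2\ \mathrm{m} \;\wedge\; \mathrm{label}(a) \in \mathrm{labels}(b)
$$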
3D-IoU is computed by rasterizing both OBBs into a shared 24×24×24 voxel grid over their union AABB. Not exact — but within 2 % for furniture-scale boxes (0.3–3 m) and takes under 5 ms per comparison. Good enough, and debuggable.
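The rasterized IoU can be sketched as follows. The box representation (center, full side lengths, rotation matrix with axes as columns) and the function names are assumptions for this example; only the 24-per-side grid over the union AABB comes from the text.

```python
import numpy as np

def obb_corners(center, extent, R):
    """8 world-space corners of a box given center, full extents, and rotation."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return center + (signs * extent / 2) @ R.T

def inside(points, center, extent, R):
    """Boolean mask of points that fall inside the box."""
    local = (points - center) @ R  # world -> box frame
    return np.all(np.abs(local) <= extent / 2, axis=1)

def voxel_iou3d(a, b, n=24):
    """Approximate 3D IoU by sampling n^3 voxel centers over the union AABB."""
    corners = np.vstack([obb_corners(*a), obb_corners(*b)])
    lo, hi = corners.min(axis=0), corners.max(axis=0)
    step = (hi - lo) / n
    axes = [lo[i] + (np.arange(n) + 0.5) * step[i] for i in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    in_a, in_b = inside(grid, *a), inside(grid, *b)
    union = np.count_nonzero(in_a | in_b)
    return np.count_nonzero(in_a & in_b) / union if union else 0.0
```

Because voxel volume is constant, the IoU reduces to a ratio of voxel counts; accuracy is limited by the grid step, which grows with the distance between the two boxes.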
And there's a killswitch. If fewer than 30 % of merged SAM 2 tracks match a CuTR cluster, the whole scene falls back to PCA-OBB for every object. Because CuTR was trained on iPad LiDAR walkthroughs, the RGB-only variant can be brittle on dim or motion-blurred phone footage. The killswitch turns “CuTR had a bad day on this clip” into “still a clean scene, just without CuTR's orientation gain.”
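The per-scene decision is easy to express in code. A hedged sketch: the function name and the shape of `cutr_matches` are hypothetical, and only the 30 % threshold and the "cutr"/"pca" source tags come from the article.

```python
def choose_obb_sources(track_ids, cutr_matches, min_match_ratio=0.30):
    """Pick an OBB source per track, with a scene-level killswitch.

    cutr_matches: dict mapping track_id -> matched CuTR cluster
                  (unmatched tracks are simply absent).
    """
    matched = [t for t in track_ids if t in cutr_matches]
    ratio = len(matched) / len(track_ids) if track_ids else 0.0
    if ratio < min_match_ratio:
        # Killswitch: CuTR is unreliable on this clip, so use PCA-OBB everywhere.
        return {t: "pca" for t in track_ids}
    return {t: ("cutr" if t in cutr_matches else "pca") for t in track_ids}
```

A scene with 22 of 26 tracks matched keeps CuTR boxes for those 22; a scene with 2 of 10 matched drops to PCA-OBB for every object.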
What you get now
Every run lives under runs/<run_id>/ where <run_id>
is YYYYMMDD_HHMMSS_<6char>. No more overwriting, no more
“which out/ is which?” Six new or upgraded artifacts:
| Artifact | What it is |
|---|---|
| gas2.rrd | Same Rerun recording as v1, but the Boxes3D entries now carry quaternions, so boxes render oriented. Labels include the source: [cutr] vs [pca]. |
| pov.mp4 | Every frame gets oriented wireframes drawn by projecting each fused OBB's 8 corners through that frame's intrinsic + extrinsic. The raw Grounding DINO 2D boxes stay as thin grey rectangles for debugging. |
| dollhouse.mp4 | A 4 second matplotlib 3D orbit over pastel OBB solids sitting on a PCA-fitted floor plane. Rendered locally on the Mac, not on Modal, so no EGL dance. |
| floorplan_v2.png | Pastel OBB footprints (the rotated rectangles, not axis-aligned bounding rectangles). The v1 floorplan.png is kept alongside for a before/after comparison. |
| scene.ply / scene.obj | Unchanged. |
| run_manifest.json | A single typed contract: run_id, pipeline_version, every artifact filename, the scene graph with OBB fields, the camera trajectory. The dashboard and the scorecard both read from this. |
Scene-graph records gained five fields. The v1 bbox_min/bbox_max/centroid/n_points
fields stay for backward compatibility.
{
  "id": 7,
  "label": "sofa",
  "obb_center": [-1.42, 0.05, 1.88],
  "obb_extent": [1.84, 0.72, 0.82],
  "obb_quaternion_xyzw": [0.00, 0.38, 0.00, 0.93],
  "obb_source": "cutr",
  "obb_confidence": 0.71,
  "bbox_min": [...], "bbox_max": [...], "centroid": [...], "n_points": 4721
}
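To consume a record like this one, a reader needs the quaternion as a rotation matrix and the box as eight corners. The sketch below assumes obb_extent holds full side lengths (not half-extents) and that the quaternion rotates box-frame vectors into the world frame; neither is confirmed by the article, and the function names are hypothetical.

```python
import numpy as np

def quat_xyzw_to_matrix(q):
    """Unit quaternion in (x, y, z, w) order -> 3x3 rotation matrix."""
    x, y, z, w = np.asarray(q, float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])

def record_corners(rec):
    """8 world-space corners of a scene-graph record's oriented box."""
    R = quat_xyzw_to_matrix(rec["obb_quaternion_xyzw"])
    half = np.asarray(rec["obb_extent"], float) / 2  # assumes full side lengths
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return np.asarray(rec["obb_center"], float) + (signs * half) @ R.T
```

The quaternion is normalized before use because the two-decimal values in the example record are only approximately unit length.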
A dashboard, because “finished” means not-in-a-terminal
Local FastAPI on the Mac, not Modal ASGI. Artifacts land on local disk by design — the user asked for “find them in some local folder,” and that's easier when there's no authed service boundary. The dashboard is a thin wrapper that uploads the video, kicks off the Modal pipeline as a subprocess, tails stdout, and pattern-matches log lines to pipeline stages for a live SVG flow view.
pip install -e '.[dashboard]'
cd gas2/dashboard
uvicorn main:app --reload # → http://localhost:8000
Three tabs:
- Upload & Results. Drag-drop video, prompt field, past runs in the sidebar. When a run completes the dollhouse.mp4 plays as a hero tile and every artifact grid-tiles below with a direct download link.
- Pipeline Flow. 11-node SVG DAG, each node lit up as stdout announces “VGGT loaded,” “keyframe 0: 4 detections,” “fusion: 6 CuTR clusters; 22/26 tracks matched” and so on. The stdout tail streams into a log pane below the DAG.
- Rerun (technical). The dashboard spawns rerun serve --web-viewer runs/<id>/gas2.rrd on a fresh port and iframes it. A “launch desktop” button is the escape hatch if the web viewer is behind a firewall.
A scorecard you can hold up to numbers
“Feels better” is not evidence. Five metrics are tracked per scene, written
into scorecard.json per run and rolled up to docs/scorecard.csv
by the aggregate_scorecard.py script.
| Metric | How computed | v2 target |
|---|---|---|
| Stability CV | std/mean of object count across --keyframe-stride ∈ {6, 8, 12} | < 0.15 |
| Label entropy | distinct labels / total objects | 0.5–0.8 |
| OBB tightness | median (OBB volume) / (convex-hull volume of source points) | 1.1–1.5 |
| Floor plane residual | RANSAC plane inlier ratio + RMS | > 0.7 / < 5 cm |
| Manual rubric | 1–5 on duplicates, OBB orientation, label plausibility, dollhouse aesthetic | ≥ 4 on all four |
Without the killswitch and the PCA-OBB fallback, the pipeline had two outcomes: “works and looks great” or “crashes.” Now there's a third: “works with reduced orientation fidelity on this scene.” That third mode is what lets you run v2 on arbitrary phone footage without hand-tuning, especially in dim rooms where VGGT depth gets noisy and CuTR's training distribution (bright iPad LiDAR captures) stops applying. The scorecard's obb_sources.cutr / .pca counts are the signal for when you've left CuTR's comfort zone.
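The first two scorecard metrics are simple enough to sketch. The function names are hypothetical, and the use of the population standard deviation (ddof=0) is an assumption; the article only says std/mean.

```python
import numpy as np

def stability_cv(object_counts):
    """Coefficient of variation of the object count across keyframe strides."""
    counts = np.asarray(object_counts, float)
    return float(counts.std() / counts.mean())  # population std (assumed)

def label_ratio(labels):
    """Distinct labels divided by total objects (the table's 'label entropy')."""
    return len(set(labels)) / len(labels)
```

For object counts of 24, 26, and 28 across the three strides, the CV is about 0.063, comfortably under the 0.15 target.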
Run 12 · course-correct
The Boxer turnaround & the Modal internals
Run 11 shipped. The dollhouse looked stylized, the floorplan fit a rectangle, gravity pointed down. And then I watched the outputs on a second scene and a third. The CuTR side of the pipeline was working on our one calibration clip and getting progressively worse as the room changed. Furniture drifted through walls, volumes inflated, the depth-sanity gate killed 890/1353 candidates on one run, and the chairs refused to show up. So I did something I had explicitly declined in the Run 11 plan — reached for Meta's Boxer checkpoint, isolated it in its own Modal app, and wired it behind the existing dashboard as a second pipeline choice.
The Run 11 plan rejected Boxer on licensing (CC-BY-NC) and integration cost. Both were real concerns, but weighing them before putting the model on a test clip was wrong. Day 2's question wasn't “is Boxer's license friendly?” — this is a personal, non-commercial blog experiment — it was “does CuTR's training distribution actually apply to phone walkthroughs of arbitrary rooms?” Running Boxer took four hours end-to-end in a sandbox app. Skipping that for two weeks would have been the real mistake.
Demo
The three artifacts below are the full output of a single 78 MB phone clip, fed through the new dashboard with the Boxer pipeline selected. No fine-tuning. No calibration. The video goes in, the three files below come out. Total Modal spend: a few cents.
Room reconstruction from one phone video. Oriented bounding boxes, pastel RoomPlan palette, gravity-snapped, floor rectangle-fitted. Rendered locally by matplotlib after Modal returns the scene graph.
Each fused 3D OBB is projected back through every frame's intrinsic + extrinsic, so you can spot when a box is wrong immediately — it won't stick to its object.
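The reprojection itself is a standard pinhole projection. A minimal sketch, assuming the extrinsic is a 3x4 [R|t] matrix that maps world coordinates into the camera frame and that K is a 3x3 intrinsic with last row [0, 0, 1]. The article does not spell out its convention, so check the direction before reusing this.

```python
import numpy as np

def project_points(points_world, K, extrinsic):
    """Project (N, 3) world points to pixels.

    extrinsic: 3x4 [R|t], world -> camera (assumed convention).
    Returns (uv, in_front); uv rows are NaN for points behind the camera.
    """
    R, t = extrinsic[:, :3], extrinsic[:, 3]
    cam = points_world @ R.T + t
    in_front = cam[:, 2] > 1e-6
    z = np.where(in_front, cam[:, 2], np.nan)  # never divide by non-positive depth
    uv = (cam @ K.T)[:, :2] / z[:, None]
    return uv, in_front
```

Feed it the 8 corners of each fused box per frame and draw the 12 edges only when both endpoints are in front of the camera.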
Recall still drops on chairs, plants, and mirrors at the 512-pixel detection resolution Boxer runs at on an A10G. Walls are empty in this scene: phone footage at 64 frames rarely gives the wall band enough confident points to RANSAC a plane. And the Lingbot geometry is not metric, so volumes are in a pipeline-internal unit, not cubic meters. Each of those is a dial to turn, not a rewrite: an A100 for 960-pixel Boxer, a camera-height prior for metric scale, denser frame sampling for the wall band. Future runs will knock them down one at a time.
How Modal is actually used (three apps, one dashboard)
Run 11 was one Modal app. Run 12 is three. The same phone clip now passes
through two GPU sandboxes that run on different Modal apps and one local process that
glues them together. Each Modal app owns a single concern — weights, image,
and cold-start latency — so you can change one without re-downloading the
other, and modal deploy each independently.
Figure: Run 12 topology. Dashboard (upload, WS, subprocess) → modal run app.py::analyze_boxer → gas-lingbot-experiment (A10G · Lingbot‑Map geometry: poses + depth + world points) and gas-boxer-experiment (A10G · Meta Boxer: OWLv2 + BoxerNet 3D OBBs with Hungarian fusion) → local stage (no GPU, uses the two remote outputs).
The second orchestrator app (the “local” one you invoke with
modal run app.py::analyze_boxer) is technically also running on Modal —
but as a local_entrypoint, which means the Python body executes on your
laptop while its .remote() calls round-trip to the two deployed apps.
That's how a single modal run invocation can pull tensors off
two separate A10G containers and render the artifacts with local matplotlib.
| Modal app | What's on GPU | Why it's its own app |
|---|---|---|
| gas-lingbot-experiment | Robbyant/lingbot-map GCT checkpoint. Streams RGB frames → poses + depth + per-pixel world points. | Heavy model (~3 GB weights, CUDA 12.8 torch 2.9.1). Isolated image so the Boxer app doesn't need the whole stack just to import. |
| gas-boxer-experiment | Meta facebook/boxer (OWLv2 text-grounded 2D + BoxerNet 3D OBB head + offline 3D-IoU fusion). | Weights are CC-BY-NC. Keeping them behind a non-production sandbox app matches the license. 0.42 s/frame steady, < 1 GB peak. |
| gas-v2 (legacy, Run 11) | VGGT + Grounding DINO + SAM 2 + CuTR. | Selectable from the same dashboard for comparison. Useful when Boxer's class set doesn't match the prompt. |
Modal caches image layers but doesn't cache weights — those live
in modal.Volume objects. Each pipeline has its own
gas-<name>-weights volume, mounted at /weights by its
own app. Re-downloading VGGT shouldn't cost you anything when you're only
iterating on Boxer, and vice versa. When the Boxer team ships a new checkpoint,
I re-run download_weights on exactly one app; the other two are
untouched.
Dashboard × pipeline architecture (in software terms)
The UI lives at dashboard/main.py. It is intentionally boring FastAPI
plus vanilla JS — no React, no SSR, no DB. The interesting part is the
contract between the browser and the Modal subprocess.
- Kickoff. A drag-dropped video plus a prompt lands at POST /api/upload. The server allocates run_id = YYYYMMDD_HHMMSS_<6hex>, writes the bytes to runs/<run_id>/, and starts a background thread that runs subprocess.Popen(["modal", "run", entrypoint, ...]). The entrypoint is whichever pipeline the user picked in the dropdown: app.py::analyze_boxer (Boxer) or app.py::analyze (legacy VGGT+CuTR). Everything is Modal; the dashboard never imports any of the heavy packages.
- Progress streaming. The thread tails proc.stdout line by line, publishes each line to an in-memory RunBus, and also pattern-matches lines like [boxer] cached geometry or [boxer] raw= to a canonical stage id. Stage completions fire a structured event on the same bus.
- WebSocket /ws/<run_id>. The browser opens one per run. Events arrive as JSON; the Flow tab's SVG DAG lights up each node with CSS classes .active / .done / .fail as events land. The log tail streams the same stdout for debuggers.
- Two DAG layouts. flow.js holds a LAYOUTS dict keyed by pipeline id; the run-started event carries the pipeline id and the frontend rebuilds the DAG before the first stage lights. For Boxer that's decode → lingbot → boxer → room → fuse → ply → floorplan → dollhouse → pov → manifest.
- Artifacts. Every artifact lands under runs/<run_id>/ on local disk. The dashboard serves them through GET /api/runs/<id>/artifacts/<filename>. The hero tile on the results pane is a Three.js RoomViewer that reads run_manifest.json and builds a lit 3D scene in the browser, falling back to the dollhouse MP4 when the viewer can't run.
- Rerun tab. The dashboard lazily spawns rerun serve --web-viewer runs/<id>/gas2.rrd on a free port and redirects into it via an iframe. The Boxer path doesn't produce a Rerun file, so the tab stays idle for those runs; the legacy VGGT path fills it.
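The log-to-stage mapping in the progress-streaming step can be sketched as a small pattern table. Only the two "[boxer]" prefixes and the stage names "lingbot" and "boxer" appear in the article. Which prefix maps to which stage, the "vggt" patterns, and the stage ids "detect" and "merge" are assumptions for illustration.

```python
import re

# Hypothetical pattern table: (compiled regex, stage id) per pipeline.
STAGE_PATTERNS = {
    "boxer": [
        (re.compile(r"^\[boxer\] cached geometry"), "lingbot"),  # assumed mapping
        (re.compile(r"^\[boxer\] raw="), "boxer"),
    ],
    "vggt": [
        (re.compile(r"^keyframe \d+: \d+ detections"), "detect"),  # assumed stage id
        (re.compile(r"^merged \d+ tracks"), "merge"),              # assumed stage id
    ],
}

def classify_log_line(line, pipeline):
    """Return the stage id a stdout line announces, or None for plain log lines."""
    text = line.strip()
    for pattern, stage in STAGE_PATTERNS.get(pipeline, []):
        if pattern.search(text):
            return stage
    return None
```

Keeping the table keyed by pipeline id means adding a third pipeline is a data change, not a code change, which matches the way the frontend keys its DAG layouts.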
The code, simplified
The real files are ~2 000 lines across
app.py / obb.py / room.py /
render.py / fusion.py /
dashboard/main.py. What follows strips every error handler, every
cache, and every percentile clip so you can see the spine. The real code adds the
percentile clips back in, because that's where most of the work lives.
# dashboard/main.py: spine of the upload → Modal → stream flow.
@app.post("/api/upload")
async def upload(video: UploadFile, prompt: str = Form(...), pipeline: str = Form("boxer")):
    run_id = datetime.now().strftime("%Y%m%d_%H%M%S_") + uuid.uuid4().hex[:6]
    out_dir = RUNS_DIR / run_id
    out_dir.mkdir()
    video_path = out_dir / video.filename
    video_path.write_bytes(await video.read())
    threading.Thread(target=_kickoff_modal_run,
                     args=(run_id, video_path, prompt, pipeline), daemon=True).start()
    return {"run_id": run_id, "pipeline": pipeline}

def _kickoff_modal_run(run_id, video_path, prompt, pipeline):
    entrypoint = {"boxer": "app.py::analyze_boxer", "vggt": "app.py::analyze"}[pipeline]
    cmd = ["modal", "run", entrypoint,
           "--video", str(video_path), "--prompt", prompt, "--run-id", run_id]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        bus.publish(run_id, {"stage": "log", "status": "info", "message": line.rstrip()})
        stage = _classify_log_line(line, pipeline)  # e.g. "[boxer] raw=" -> "boxer"
        if stage:
            bus.publish(run_id, {"stage": stage, "status": "ok"})
    bus.publish(run_id, {"stage": "run", "status": "ok" if proc.wait() == 0 else "failed"})
That's the entire frontend/backend boundary. The dashboard has no idea what a “VGGT” is. It just knows how to spawn a Modal subprocess and how to read stdout.
# app.py: the Boxer local_entrypoint. Runs on your laptop; the .remote()
# calls round-trip to two deployed Modal apps.
@app.local_entrypoint()
def analyze_boxer(video: str, prompt: str, run_id: str = None):
    run_id = run_id or _new_run_id()
    out_dir = Path(__file__).parent / "runs" / run_id
    out_dir.mkdir(parents=True)
    video_bytes = Path(video).read_bytes()

    # 1. Geometry: frames → poses + depth + world points. Remote A10G.
    LingBot = modal.Cls.from_name("gas-lingbot-experiment", "LingBotPipeline")
    geom = LingBot().run.remote(video_bytes, max_frames=64, max_side=512)
    (out_dir / "geometry.npz").write_bytes(geom["geometry_npz"])
    with np.load(io.BytesIO(geom["geometry_npz"])) as z:
        frames, ext, K, world_points, wp_conf, depth = (
            z["frames"], z["extrinsic_wc"], z["intrinsic"],
            z["world_points"], z["world_points_conf"], z["depth"])

    # 2. Detection: frames + extrinsics → oriented 3D boxes + labels. Remote A10G.
    Boxer = modal.Cls.from_name("gas-boxer-experiment", "BoxerPipeline")
    boxer_out = Boxer().run.remote(video_bytes=video_bytes, text_prompt=prompt,
                                   extrinsic_wc=npy(ext), intrinsic=npy(K),
                                   frames_bytes=npy(frames), depth_bytes=npy(depth),
                                   gravity=[0, 0, 1])

    # 3. Room extraction (local, no GPU): world points → floor polygon + walls.
    room = extract_room(world_points, wp_conf, trajectory=inverse_camera_centers(ext))
    rect = fit_oriented_rectangle(room["floor"]["polygon_xy"])
    room["floor"]["polygon_xy"], room["floor"]["area_m2"] = rect.tolist(), polygon_area(rect)

    # 4. Transform each Boxer OBB to room frame, snap bottom to floor, keep yaw only.
    R_w2r, t_w2r = room["room_transform"]["R"], room["room_transform"]["t"]
    scene_graph = []
    for obj in boxer_out["scene_graph"]:
        obb = OBB.from_record(obj, frame="world").transformed(R_w2r, t_w2r).gravitize(floor_z=0)
        if 1e-4 < np.prod(obb.extent) < 6.0:  # volume sanity
            scene_graph.append(obb.to_record(label=obj["label"]))

    # 5. Artifacts. All local matplotlib + Open3D.
    render_floorplan_v2(scene_graph, out_dir / "floorplan_v2.png")
    render_dollhouse_mp4(scene_graph, out_dir / "dollhouse.mp4")
    render_pov_mp4(frames, depth, ext, K, scene_graph, out_dir / "pov.mp4")

    # 6. Canonical manifest the dashboard reads.
    (out_dir / "run_manifest.json").write_text(json.dumps({
        "run_id": run_id, "pipeline_version": "gas2@0.3-boxer",
        "scene_graph": scene_graph, "room": room,
        "artifacts": {"dollhouse_mp4": "dollhouse.mp4",
                      "floorplan_v2_png": "floorplan_v2.png",
                      "pov_mp4": "pov.mp4"},
    }, indent=2))
Three Modal apps, six steps, one JSON file to rule them all. Every subsequent
improvement — metric scale, more detection recall, an A100 for higher-res
Boxer, a real Rerun recording for the Boxer path — lands as one more
.remote() call or one more transform on the same scene graph.
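Step 4 of the entrypoint ("snap bottom to floor, keep yaw only") is the least obvious transform in the spine. The sketch below is one plausible reading, not the repo's OBB.gravitize: it assumes +z is up in the room frame, that the box's local z axis is already roughly vertical, and that extent holds full side lengths.

```python
import numpy as np

def gravitize(center, extent, R, floor_z=0.0):
    """Drop roll/pitch from a box and rest it on the floor.

    Assumes +z is up; the yaw is the heading of the box's local x axis
    projected onto the floor plane.
    """
    yaw = np.arctan2(R[1, 0], R[0, 0])
    c, s = np.cos(yaw), np.sin(yaw)
    R_yaw = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Bottom face sits on the floor, so the center is half a height above it.
    new_center = np.array([center[0], center[1], floor_z + extent[2] / 2])
    return new_center, extent, R_yaw
```

Boxes whose local z axis is far from vertical (a box detected lying on its side) would need their axes permuted first; this sketch does not handle that case.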
The dashboard kicks off, the DAG lights up, the artifacts land. That's the product skin. Underneath, the detection recall is still scene-dependent, the walls are empty on ordinary phone footage, and the geometry is not metric. I chose to ship what exists today because shipping and iterating was the point. Tomorrow's list starts with a camera-height prior for metric conversion and an A100 run of Boxer at 960 px. None of that invalidates what you see above — it just gets us further.
Honest evaluation
What works:
- Pipeline is genuinely zero-shot. Change the prompt, change the classes. No training data, no fine-tuning.
- POV video is usable product-quality: masks track, detection boxes persist, depth colorization reads.
- Floorplan is recognizable as a room layout. Trajectory is visible, furniture is in roughly the right relative positions.
- Scene graph centroids are within ~10-20 cm of the true positions in our test video (eyeballed against the ground truth of the room we shot).
- Total cost of producing all six artifacts from a 55-sec video: under five cents on A10G.
What's broken or limited:
- No metric scale. VGGT's world units aren't meters. Our centroids are "roughly meters" but that's an accident of training data distributions, not a guarantee. If you want to say "the chair is 1.8 m from the door", you need a reference — Depth Anything V2 metric, a LiDAR frame, or a known-scale object in view.
- Label duplicates still exist. Our label+proximity merge collapses the easy cases but fails when Grounding DINO flip-flops between adjacent concepts ("chair" vs. "sofa chair"). CLIP-feature merging is the right fix and we haven't done it.
- 64-frame ceiling. On A10G. An A100-40GB would let us push to ~256 frames. For a long walkthrough (> 2 minutes at 30 fps), you'd want to chunk frames with overlapping VGGT passes and stitch the poses — or swap VGGT for MASt3R-SLAM, which handles arbitrary length with loop closure.
- Poisson mesh is rough. The hallucinated-surface clip helps, but there's no avoiding that Poisson assumes a closed manifold; point clouds from single-pass monocular depth rarely qualify. For a real mesh, TSDF fusion or a per-object SDF is the right tool.
- Cross-session alignment is not solved. Record the same room twice, and the two scene graphs don't register to each other — each has its own VGGT-anchored world frame. CLIP features + ICP on the point clouds would fix this; we haven't done it.
We didn't run this on the Replica dataset to compare with GAS v1's quantitative numbers. That's the next Mirdan entry — a proper eval with IoU of semantic regions vs. the paper's ground truth. This article was about the integration, not the benchmark.
What's next
The shortest backlog, ranked by leverage:
- CLIP-based track merging. Replace the label+proximity merge with cosine-similarity merging on CLIP embeddings of mask regions. This is the "proper" ConceptGraphs step. Expected improvement: the label-flip duplicates (chair vs. sofa chair) collapse correctly; cross-session alignment becomes possible.
- Metric calibration via Depth Anything V2. Run DAv2 metric on a few keyframes, solve a single scalar scale factor against VGGT's depth. Now the centroid positions are in real meters.
- Long-video ingestion. Chunk frames into overlapping 64-frame windows, run VGGT per window, stitch poses using the overlap. Or just swap in MASt3R-SLAM, which is designed for this. Either way, the 64-frame ceiling goes away.
- A queryable LLM interface. The scene graph has centroids + labels. Dump it into a prompt with a system message like "you can answer spatial queries about the room below" and you have natural-language floor plans. Mirdan experiment 02 might be this.
- Object-aware mesh reconstruction. Run Poisson per-object with its own voxel grid. The global mesh approach merges objects that should be distinct.
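For the metric-calibration item, the "single scalar scale factor" can be estimated robustly as a median of per-pixel depth ratios. This is a sketch under assumptions: the function name is hypothetical, both depth maps are assumed to be aligned to the same keyframe and resolution, and the median is a choice made here for outlier resistance (a least-squares fit is the obvious alternative).

```python
import numpy as np

def solve_scale(depth_relative, depth_metric, min_depth=1e-3):
    """Scalar s such that s * depth_relative ≈ depth_metric (median of ratios)."""
    valid = (depth_relative > min_depth) & (depth_metric > min_depth)
    return float(np.median(depth_metric[valid] / depth_relative[valid]))
```

Multiplying centroids, extents, and camera translations by the returned scale puts the whole scene graph in the metric model's units.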
The full recipe
Everything compressed to a runbook:
mkdir gas2 && cd gas2
cat > pyproject.toml << 'EOF'
[project]
name = "gas2"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = ["modal>=0.64", "rerun-sdk>=0.20", "matplotlib>=3.8"]
EOF
# Put app.py here — the full ~500 lines are in the repo. Key design choices:
# - modal.Image.debian_slim() (not nvidia/cuda, avoids Docker Hub)
# - torch==2.5.1 + cu124 in final layer (force-reinstall to avoid cu13 drift)
# - transformers>=4.44 for Grounding DINO (HF port, no nvcc)
# - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# - max_side=504, max_frames=64 (fits A10G 24 GB)
# - bbox via 10/90 percentile (not min/max)
# - label+proximity merge dedup (until CLIP-based)
pip install modal
modal setup # browser OAuth
modal run app.py::download_weights # one-time, ~10 min, ~50 GB
modal run app.py::analyze --video path/to/your.mp4
# Outputs in out/:
# floorplan.png (PCA-aligned top-down, ~0.1 MB)
# pov.mp4 (RGB + masks + depth, ~1 MB)
# scene.ply (colored point cloud, ~2 MB)
# scene.obj (Poisson mesh, ~8 MB)
# gas2.json (scene graph with centroids)
# gas2.rrd (Rerun recording with POV)
# To scrub the POV:
pip install rerun-sdk
rerun out/gas2.rrd
To wire four foundation models into a pipeline, you need to know: each model's output schema
(not what the README says it is — what it literally returns), each model's dtype regime
(does it expect autocast?), each model's input constraints (patch alignment, image dtype),
and how your dependency graph resolves across image layers (the last pip install wins).
The rest is Python.
Full source: gas2/app.py in the accompanying repo. Next Mirdan experiment: quantitative
eval of this pipeline on Replica vs. the GAS v1 baseline — a proper "is this actually
better" comparison, with plots.