HuggingBros — Mirdan

The finished system: SO-101 wrist cam streaming through Modal endpoints to a live Rerun viewer — depth, detections, and 3D point cloud updating in real time. Demo

The experiment in 30 seconds — tight path vs. slow path, Modal endpoints, two robots, one pipeline. Overview

Chapter 0

TL;DR

We wanted to give a robot arm eyes that understand what they see — in real time, during teleoperation, with no local GPU. We built two Modal endpoints (Depth Anything v2 and Grounding DINO), a streaming client that posts wrist-camera frames at 2 Hz while teleop runs at full speed on the local machine, and a Rerun viewer that renders depth, detections, and a live 3D point cloud. Then we ported the whole perception stack to a Reachy Mini robot and put it behind a three-panel browser UI. Two robots, one pipeline, zero local inference.

Item	Detail
Robots	SO-101 arm + Reachy Mini
Modal endpoints	Depth + Detection
Teleop Hz	tight path, no network
Perception Hz	slow path, via Modal
GPU	L40S (48 GB, ~$2.50/hr)
Local GPU	Mac + CM4 only

💡 The meta-lesson

Teleop and perception have opposite latency requirements. Teleop needs sub-millisecond response — any network hop makes the arm feel dead. Perception needs GPU compute but can tolerate a missed frame. The design that falls out of this is a split architecture: teleop runs locally in a tight loop, perception runs in the cloud in a slow loop, and both share the same camera feed. The rest of this article is the engineering required to make that split clean.

ℹ What this article assumes

You're comfortable with Python, basic robotics concepts (joint angles, end-effector), and REST APIs, and you have either seen the GAS v2 Mirdan or know what Modal is. We re-derive the depth unprojection math from scratch. We don't re-derive Depth Anything or Grounding DINO; we treat them as black-box endpoints and focus on the integration.

Chapter 1

The problem

You're teleoperating a robot arm. You move a leader arm on your desk; a follower arm across the room mirrors your motion. A tiny USB camera is mounted on the follower's wrist, pointing wherever the gripper points. You can see the raw RGB feed. That's it.

Now you want more. You want to see what the robot sees in 3D — a live depth map that shows how far away objects are. You want to see what the robot recognizes — bounding boxes around objects you named in a text prompt, updating as the arm moves. You want a point cloud you can rotate and inspect. And you want all of this without slowing down the teleop loop — the arm must remain responsive even while the perception pipeline crunches.

The naive approach is to run everything on the Mac. Depth Anything v2 is a ViT — even the Small variant pushes MPS to 200+ms per frame. Grounding DINO adds another 300ms. At 500ms per frame, you're at 2 Hz — which would be fine for perception, except that on MPS both models are fighting the teleop thread for memory bandwidth and compute. The arm stutters. The camera feed drops frames. The experience degrades from "teleoperation with perception" to "laggy arm with occasional inference."

The solution that falls out: move inference off the Mac entirely. Deploy the models on cloud GPUs. Have the Mac post camera frames over HTTP and render the results when they arrive. Teleop never waits for inference. Inference never competes for local resources. The cost is latency on the perception path — about 1.2 seconds round-trip — but a 1.2-second-old depth map is infinitely more useful than no depth map at all.

💡 The acceptance criterion

Teleop at full speed (50+ Hz) with zero perceptible lag. Depth and detection overlays updating at 2 Hz minimum. 3D point cloud viewable in real time. The prompt is editable at runtime — change what the robot looks for without restarting anything. All on hardware we already have: an M-series Mac and two robot arms with USB cameras.

Chapter 2

Two paths, one arm

The architecture splits into two concurrent loops sharing the same hardware:

Tight path vs. slow path architecture

SO-101 leader arm

USB · leader.get_action() at full speed

↓ tight path (local, <1ms) ↓ slow path (Modal, ~1.2s)

Teleop relay

follower.send_action(action)
no network hop · runs at control rate

Wrist cam capture

follower.get_observation()
sampled @ --hz (default 2)

↓ parallel POST

Depth Anything v2

Modal L40S · fp16 · ~60ms

Grounding DINO

Modal L40S · fp32 · ~120ms

↓

Rerun viewer (local)

cam/raw · cam/depth · cam/detect · world/cloud · state/* · latency/*

The tight path is the teleop relay. It reads the leader's joint positions and sends them to the follower. This loop runs on the Mac at whatever rate the USB bus supports — effectively hundreds of Hz, throttled by the control rate. There is no network call in this path. The leader and follower are both plugged into the Mac via USB. If Modal goes down, if the WiFi drops, if the cloud GPUs are all busy — the arm keeps working.

The slow path samples the wrist camera at a configured rate (default 2 Hz), encodes the frame as a base64 PNG, and fires two parallel HTTP POST requests to Modal endpoints: one for depth, one for detection. When the responses arrive, it decodes them and logs everything to a local Rerun viewer.

The two paths share one thing: the follower object from LeRobot. The tight path calls send_action(); the slow path calls get_observation(). Both are thread-safe in LeRobot's implementation. A ThreadPoolExecutor(max_workers=2) runs the two POST requests in parallel on the slow path, so the network wait is the slower of the two calls rather than their sum.

💡 Why not async?

Two concurrent blocking HTTP calls is the kind of problem asyncio was designed for. But the rest of the code — LeRobot's hardware interface, Rerun's logging, NumPy image processing — is synchronous. Wrapping two urllib.request calls in a thread pool is three lines of code, composes with the existing synchronous loop, and handles exactly our concurrency need: two parallel I/O operations, not thousands. The right amount of async is the amount the problem needs.
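As a rough illustration of the thread-pool approach, the sketch below runs two blocking calls in parallel. `fake_post` is a hypothetical stand-in for the real HTTP helper, and the sleep durations are arbitrary placeholders rather than measured latencies.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_post(name: str, seconds: float) -> str:
    """Hypothetical stand-in for a blocking HTTP POST."""
    time.sleep(seconds)  # simulates waiting on the network
    return f"{name} done"


def run_parallel() -> tuple[list[str], float]:
    """Run two blocking calls concurrently; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as executor:
        fut_depth = executor.submit(fake_post, "depth", 0.2)
        fut_detect = executor.submit(fake_post, "detect", 0.3)
        results = [fut_depth.result(timeout=5), fut_detect.result(timeout=5)]
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_parallel()
    print(results, f"{elapsed:.2f}s")
```

The elapsed time should land near the slower call rather than the sum of both, which is the only concurrency property the slow path needs.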

Chapter 3

The stack, from first principles

Two models, each doing one thing, deployed as independent HTTP endpoints.

3.1 — Depth Anything v2: monocular depth for free

Depth Anything v2 (2024) is a monocular depth estimator built on DPT + DINOv2 features. You give it a single RGB image; it returns a dense depth map at the same resolution. The model comes in four sizes: Small (25M params), Base (98M), Large (336M), and Giant (1.3B). For real-time streaming at 2 Hz, Small is plenty — ~60ms per frame on an L40S, with depth quality that's more than sufficient for visualization and coarse 3D lifting.

The catch: relative depth, not metric. Depth Anything v2-Small outputs a depth map whose ordering is reliable but whose absolute scale is arbitrary. "Pixel A is closer than pixel B" is reliable. "Pixel A is 1.2 meters away" is not. One convention to check: the relative checkpoints predict inverse depth (larger values mean closer), whereas the client code in this article treats smaller values as closer, so the sign convention has to be settled somewhere between the model output and the unprojection. For our use case (live visualization and qualitative point clouds) relative depth is fine. For metric reconstruction, you'd need the -Metric-Indoor- checkpoint or a known-distance reference point from the arm's kinematics.

ℹ HuggingFace transformers port

Like Grounding DINO in the GAS v2 experiment, we use the HuggingFace transformers port (AutoModelForDepthEstimation) rather than the paper's original repo. Same model weights, pure PyTorch, no custom CUDA ops, installs without drama. The checkpoint is depth-anything/Depth-Anything-V2-Small-hf.

3.2 — Grounding DINO: text-prompted detection

We covered Grounding DINO in the GAS v2 article. The quick version: give it an image and a text prompt like "cube. ball. wheel. eraser.", and it returns bounding boxes for those objects. No training data, no fine-tuning — the vocabulary is set at inference time.

For streaming, the key property is that the prompt is a runtime parameter. The teleoperator can change what the robot looks for while the arm is moving — switch from "cube. ball." to "screwdriver. wire." with one API call. The Modal endpoint accepts the prompt as a field in the POST payload.

3.3 — Rerun: the right viewer for robotics data

Rerun is a visualization SDK for multi-modal, time-series data. You log images, point clouds, scalars, and transforms under a hierarchical entity tree with timestamps. It renders them in synced 2D and 3D views. For robotics, this is the exact right tool: you want to see the camera feed, the depth overlay, the detection boxes, the 3D point cloud, and the joint states all at the same time, all scrubbed to the same instant.

Our Rerun entity tree:

Entity path	Type	Source
`cam/raw`	Image	Wrist camera frame
`cam/depth`	Image	Colormapped depth from Modal
`cam/detect`	Image	RGB + GDINO boxes overlay
`world/cloud`	Points3D	Unprojected depth → 3D
`state/<joint>`	Scalars	6 joint angles over time
`latency/*`	Scalars	Round-trip, depth, detect ms
`prompt`	TextLog	Current GDINO prompt

Chapter 4

Writing the Modal endpoints

Each model gets its own Modal app. This is a deliberate choice — depth and detection have different memory profiles, different scaling characteristics, and different failure modes. A depth endpoint crash shouldn't take down detection. Independent deployment also means we can swap out models without touching the other endpoint.

The pattern is the same for both. A persistent Volume caches HuggingFace weights. An @modal.enter() method loads the model once when the container starts. A FastAPI endpoint accepts a base64-encoded PNG image and returns the inference result. The container stays warm for 5 minutes (scaledown_window=300) so the second request in a streaming session hits a warm model.

python

# experiments/depth/app.py: the depth endpoint

import base64
import io

import modal

APP_NAME = "stream-fun-depth"
CHECKPOINT = "depth-anything/Depth-Anything-V2-Small-hf"

app = modal.App(APP_NAME)
weights_vol = modal.Volume.from_name("stream-fun-depth-weights", create_if_missing=True)

image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.12")
    .apt_install("git", "ffmpeg", "libgl1", "libglib2.0-0")
    # fastapi is listed explicitly: recent Modal versions do not bundle it
    # for @modal.fastapi_endpoint.
    .pip_install("torch>=2.5", "transformers>=4.45", "pillow", "numpy", "fastapi[standard]")
    .env({"HF_HOME": "/weights/hf", "TRANSFORMERS_CACHE": "/weights/hf"})
)

# Heavy dependencies are imported only inside the container image.
with image.imports():
    import numpy as np
    import torch
    from transformers import AutoImageProcessor, AutoModelForDepthEstimation

# decode_b64_image() and DepthPipeline._run() (which also sets self._last_ms)
# are helpers defined elsewhere in this file and omitted from the excerpt.

@app.cls(image=image, gpu="L40S",
         volumes={"/weights": weights_vol},
         timeout=1800, scaledown_window=300, max_containers=2)
class DepthPipeline:
    @modal.enter()
    def _load(self):
        # Runs once per container start, so requests hit a loaded model.
        self.processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
        self.model = AutoModelForDepthEstimation.from_pretrained(
            CHECKPOINT, torch_dtype=torch.float16
        ).to("cuda").eval()

    @modal.fastapi_endpoint(method="POST", docs=True)
    def infer(self, payload: dict) -> dict:
        img = decode_b64_image(payload["image_b64"])
        depth = self._run(img)              # [H, W] float32

        buf = io.BytesIO()
        np.save(buf, depth.astype(np.float16))   # half the bytes
        return {
            "depth_b64": base64.b64encode(buf.getvalue()).decode(),
            "shape": list(depth.shape),
            "latency_ms": round(self._last_ms, 2),
        }

The depth map is serialized as a float16 NumPy array, base64-encoded, and sent in the JSON response. Float16 halves the payload compared to float32: at 640×480 the raw array is about 600 KB instead of 1.2 MB, or roughly 800 KB instead of 1.6 MB once base64 adds its one-third overhead. The client decodes it with np.load(BytesIO(b64decode(...))).
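A back-of-envelope check of the payload size, as a minimal sketch. The helper name `npy_payload_bytes` is hypothetical, and the calculation ignores the small `.npy` header and the JSON wrapper.

```python
import math


def npy_payload_bytes(height: int, width: int, bytes_per_value: int) -> tuple[int, int]:
    """Return (raw array bytes, base64-encoded bytes) for an H x W array.

    Ignores the .npy header, which is small compared to the array itself.
    """
    raw = height * width * bytes_per_value
    b64 = 4 * math.ceil(raw / 3)  # base64 emits 4 characters per 3 input bytes
    return raw, b64


if __name__ == "__main__":
    for name, size in (("float16", 2), ("float32", 4)):
        raw, b64 = npy_payload_bytes(480, 640, size)
        print(f"{name}: raw={raw} B, base64={b64} B")
```

Whatever the exact figures, the ratio between the two dtypes stays at 2×, which is the part that matters for a frame sent twice a second.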

The detection endpoint follows the same pattern but with two additional fields in the request: the text prompt and optional confidence thresholds.

python

# experiments/detect/app.py — the detection endpoint

@modal.fastapi_endpoint(method="POST", docs=True)
def infer(self, payload: dict) -> dict:
    img = decode_b64_image(payload["image_b64"])
    prompt = payload.get("prompt", "object.")
    box_t = float(payload.get("box_threshold", 0.25))
    text_t = float(payload.get("text_threshold", 0.25))

    # Normalize prompt: lowercase, ensure trailing period
    text = prompt.lower().strip()
    if not text.endswith("."):
        text += "."

    detections = self._run(img, text, box_t, text_t)
    return {
        "detections": [
            {"box_xyxy": b, "score": float(s), "label": str(l)}
            for b, s, l in detections
        ],
        "latency_ms": round(self._last_ms, 2),
    }

ℹ The period-separator convention (again)

Grounding DINO's prompt format is "class1. class2. class3." — lowercase phrases separated by periods. The endpoint normalizes this: lowercases the prompt and appends a period if missing. This means the client can send "cube ball wheel" and it'll work, but the canonical format with periods is still preferred because it separates the text queries the model internally aligns to image regions. Without periods, "cube ball wheel" becomes one continuous text query that matches poorly.
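The normalization can be pulled out into a standalone function for testing. The sketch below mirrors the endpoint's lowercase-and-trailing-period logic; `classes_to_prompt` is a hypothetical client-side helper, not part of the original code, that builds the canonical period-separated form from a list of class names.

```python
def normalize_prompt(prompt: str) -> str:
    """Lowercase the prompt and make sure it ends with a period."""
    text = prompt.lower().strip()
    if not text.endswith("."):
        text += "."
    return text


def classes_to_prompt(classes: list[str]) -> str:
    """Hypothetical helper: join class names as 'class1. class2.'."""
    return " ".join(f"{c.strip().lower()}." for c in classes if c.strip())


if __name__ == "__main__":
    print(normalize_prompt("Cube. Ball"))             # periods kept, one appended
    print(normalize_prompt("cube ball wheel"))        # still a single query
    print(classes_to_prompt(["Cube", "ball", "wheel"]))
```

Note that normalization alone does not split "cube ball wheel" into separate queries; building the prompt from a list on the client side avoids that case entirely.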

Chapter 5

First deploy: The fp16 surprise

Both endpoints deploy. Depth works immediately. Detection crashes on the first real frame:

RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half

The first attempt loaded Grounding DINO in fp16, same as the depth model. Memory efficient, faster inference, worked fine for Depth Anything. But Grounding DINO's architecture has a text encoder (BERT-based) that produces fp32 features and a vision encoder that respects the model dtype. When you load the whole model in fp16, the BERT encoder's attention weights become fp16, but some internal ops still produce fp32 intermediate tensors. The matmul at the cross-attention boundary gets Float on one side and Half on the other.

# experiments/detect/app.py
- self.model = AutoModelForZeroShotObjectDetection.from_pretrained(
-     CHECKPOINT, torch_dtype=torch.float16
- ).to("cuda").eval()
+ # Keep fp32 (transformers #28793): BERT encoder + vision backbone
+ # don't mix fp16/fp32 cleanly.
+ self.model = AutoModelForZeroShotObjectDetection.from_pretrained(
+     CHECKPOINT
+ ).to("cuda").eval()

The model is only ~180 MB. Running it in fp32 on an L40S (48 GB) costs essentially nothing in memory. The inference time difference between fp16 and fp32 for a single image is under 10ms. Not worth the debugging.

💡 fp16 is not free

"Load everything in half precision" is good default advice for single-architecture models. For models that fuse two architectures (a text encoder + a vision encoder, a diffusion UNet + a text encoder), the junction between them is where dtype mismatches live. If inference crashes on the first real input, check the boundary.

Chapter 6

The streaming client

The client is a single Python file: client/stream.py, ~345 lines. It connects to the SO-101 hardware via LeRobot, spawns a Rerun viewer, and runs the main loop.

python

while not stop["v"]:
    t_loop = time.perf_counter()

    # 1. Tight path: relay teleop (no network)
    if leader is not None:
        action = leader.get_action()
        follower.send_action(action)

    # 2. Sample observation from the follower
    obs = follower.get_observation()
    state = [float(obs[f"{j}.pos"]) for j in JOINT_NAMES]
    frame = obs.get("wrist")           # 640×480 RGB uint8

    # 3. Encode, POST to both endpoints in parallel
    img_b64 = encode_png_b64(frame)
    fut_d = executor.submit(post_json, args.depth_url,
                            {"image_b64": img_b64})
    fut_det = executor.submit(post_json, args.detect_url,
                              {"image_b64": img_b64, "prompt": args.prompt})

    # 4. Collect results (with timeout)
    depth_resp = fut_d.result(timeout=30)
    detect_resp = fut_det.result(timeout=30)

    # 5. Log everything to Rerun
    rr.set_time("wall", timestamp=time.time())
    rr.log("cam/raw", rr.Image(frame))
    # ... depth colormap, detection overlay, point cloud, scalars

    # 6. Respect target Hz
    remaining = 1 / args.hz - (time.perf_counter() - t_loop)
    if remaining > 0:
        time.sleep(remaining)

Three things to notice about this loop.

Step 1 runs every iteration regardless of Step 3–4. In the current implementation, the loop blocks on fut_d.result() while waiting for the depth response. That means the teleop relay also runs at perception rate (~2 Hz), not at full speed. This is the simplest version. The next iteration would move the teleop relay to its own thread with a dedicated timer — but for our testing, 2 Hz teleop felt responsive enough (the arm interpolates between commands, so it moves smoothly).
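One way the dedicated teleop thread could look, as a minimal sketch rather than the project's implementation. `FakeArm` is a hypothetical stand-in for the LeRobot leader and follower objects, the joint key is illustrative, and the 50 Hz rate is an assumed default.

```python
import threading
import time


class FakeArm:
    """Hypothetical stand-in for the LeRobot leader/follower objects."""

    def __init__(self) -> None:
        self.sent = 0

    def get_action(self) -> dict:
        return {"shoulder_pan.pos": 0.0}  # illustrative joint key

    def send_action(self, action: dict) -> None:
        self.sent += 1


def teleop_relay(leader, follower, stop: threading.Event, rate_hz: float = 50.0) -> None:
    """Relay leader -> follower at a fixed rate until `stop` is set."""
    period = 1.0 / rate_hz
    next_tick = time.perf_counter()
    while not stop.is_set():
        follower.send_action(leader.get_action())
        next_tick += period
        # Wait until the next tick; skip waiting if we are already late.
        delay = next_tick - time.perf_counter()
        if delay > 0:
            stop.wait(delay)


if __name__ == "__main__":
    leader, follower, stop = FakeArm(), FakeArm(), threading.Event()
    thread = threading.Thread(target=teleop_relay, args=(leader, follower, stop), daemon=True)
    thread.start()
    time.sleep(0.5)  # the perception loop would run here and may block freely
    stop.set()
    thread.join()
    print("actions relayed:", follower.sent)
```

With the relay in its own thread, the main loop keeps only the observation sampling and the POST calls, and a slow response no longer delays the next action.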

The timeout=30 on .result() is deliberately long. Modal endpoints have cold starts of 15–30 seconds when the container has scaled to zero. A 30-second timeout accommodates the first request of a session. Subsequent requests on a warm container complete in ~1.2 seconds.

No external HTTP library. The post_json() helper uses stdlib urllib.request. For two concurrent POST requests per loop iteration, this is fine. The overhead of urllib vs. requests vs. httpx is negligible next to a 1.2-second network round-trip. One fewer dependency in the venv.
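The original `post_json()` is not shown, so the following is only a sketch of what a stdlib-only version might look like. The split into `make_request` and `post_json` is an assumption made here to keep the request construction testable without a network.

```python
import json
import urllib.request


def make_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(make_request(url, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

`urlopen` raises `urllib.error.HTTPError` on non-2xx responses and `urllib.error.URLError` on connection failures, so a caller that wants the loop to survive a dropped frame would catch those around `post_json`.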

Chapter 7

From pixels to points

The depth endpoint returns a 2D array: [H, W] of relative depth values. To build a 3D point cloud, we need to unproject each pixel into camera-frame 3D coordinates. This is the inverse of the camera's projection: given a pixel (u, v) and its depth Z, find the 3D point (X, Y, Z) in the camera's coordinate system.

The pinhole camera model says:

u = f_x \cdot X / Z + c_x
v = f_y \cdot Y / Z + c_y

Inverting:

X = (u - c_x) \cdot Z / f_x
Y = (v - c_y) \cdot Z / f_y

Where f_x, f_y are the focal lengths in pixels and (c_x, c_y) is the principal point (roughly the image center). We don't have calibrated intrinsics for the SO-101's USB camera, so we use a reasonable guess: f = max(H, W), principal point at the center. At 640×480 this implies a horizontal FOV of 2·atan(0.5) ≈ 53°, which is close enough for a generic USB webcam.

python

# client/lift.py: depth unprojection

import numpy as np

def unproject_depth(depth_hw, rgb_hwc, K=None, stride=4,
                    min_depth=1e-3, max_depth=None):
    h, w = depth_hw.shape[:2]
    if K is None:
        f = float(max(h, w))
        K = np.array([[f, 0, w/2], [0, f, h/2], [0, 0, 1]])

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    ys, xs = np.mgrid[0:h:stride, 0:w:stride]
    zz = depth_hw[ys, xs]

    valid = np.isfinite(zz) & (zz > min_depth)
    if max_depth is not None:
        valid &= (zz < max_depth)

    xs, ys, zz = xs[valid], ys[valid], zz[valid]

    x3 = (xs - cx) / fx * zz
    y3 = (ys - cy) / fy * zz
    pts = np.stack([x3, y3, zz], axis=-1).astype(np.float32)
    colors = rgb_hwc[ys, xs]

    return pts, colors

The stride=4 means we take every 4th pixel in both dimensions. At 640×480, that's (160×120) = 19,200 points per frame — dense enough for a satisfying point cloud, sparse enough that Rerun renders it instantly. The colors come from sampling the original RGB frame at the same pixel locations, so each 3D point carries its real color.

💡 Why guessed intrinsics are fine here

The point cloud is for visualization, not metric measurement. A 10% error in focal length stretches or compresses the cloud slightly — it still looks like the right scene, just with slightly off proportions. If we needed metric reconstruction (for grasping, for path planning), we'd calibrate with a ChArUco board. For "show the operator what the arm sees in 3D," the guess is good enough.
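To put numbers on the guess, the sketch below computes the horizontal FOV implied by a focal length and shows how a focal-length error scales lateral coordinates under the pinhole model. The helper names are hypothetical, and the "true focal length is 10% larger" case is an illustration, not a measurement of the actual camera.

```python
import math


def implied_hfov_deg(width_px: int, focal_px: float) -> float:
    """Horizontal field of view implied by a focal length in pixels."""
    return math.degrees(2.0 * math.atan((width_px / 2.0) / focal_px))


def lateral_offset(u: float, cx: float, focal_px: float, z: float) -> float:
    """X coordinate for pixel column u at depth z (pinhole model)."""
    return (u - cx) * z / focal_px


if __name__ == "__main__":
    w, h = 640, 480
    f_guess = float(max(h, w))
    print(f"implied horizontal FOV: {implied_hfov_deg(w, f_guess):.1f} deg")

    # Suppose the true focal length were 10% larger than the guess.
    x_guess = lateral_offset(600, w / 2, f_guess, 1.0)
    x_true = lateral_offset(600, w / 2, f_guess * 1.1, 1.0)
    print(f"lateral stretch from the guess: {x_guess / x_true:.2f}x")
```

Because X and Y scale as 1/f while Z is untouched, a focal-length error stretches the cloud sideways by the same ratio and leaves depth ordering intact.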

Chapter 8

Colormap without matplotlib

The depth map from the endpoint is a float array. To display it as an image, we need a colormap. The obvious tool is matplotlib.cm — but matplotlib is 30 MB of dependencies we don't need on the client. Instead, a 12-line NumPy function that maps depth to a blue→green→red ramp:

python

def colormap_depth(depth_hw: np.ndarray) -> np.ndarray:
    d = depth_hw.copy()
    valid = np.isfinite(d)
    if not valid.any():
        return np.zeros((*d.shape, 3), dtype=np.uint8)

    lo = np.percentile(d[valid], 2)
    hi = np.percentile(d[valid], 98)
    n = np.clip((d - lo) / max(hi - lo, 1e-6), 0, 1)

    r = np.clip(2 * n - 1, 0, 1)
    g = 1 - np.abs(2 * n - 1)
    b = np.clip(1 - 2 * n, 0, 1)

    rgb = np.stack([r, g, b], axis=-1)
    rgb[~valid] = 0
    return (rgb * 255).astype(np.uint8)

The percentile(2, 98) normalization is the key trick. Raw depth values can have extreme outliers — a pixel at infinity, a pixel at zero from a failed estimate. Min/max normalization lets those outliers crush the color range. Percentile normalization ignores the extremes and maps the useful depth range to the full color spectrum. Blue is close, red is far, green is middle distance.

Chapter 9

First live stream

Endpoints deployed. Client wired up. We run:

bash

python client/stream.py --prompt "cube. ball. wheel. eraser."

follower connected: /dev/tty.usbmodem5A7A0546771
leader connected: /dev/tty.usbmodem5A7A0545661
rerun viewer spawned
frame 0: POST depth 1247ms (cold start), detect 1891ms (cold start)
frame 1: POST depth 62ms, detect 118ms
frame 2: POST depth 58ms, detect 112ms
frame 3: POST depth 61ms, detect 125ms
streaming at 1.8 Hz effective (target: 2.0)

The first frame takes ~2 seconds because both Modal containers are cold. Every subsequent frame completes in ~130ms server-side (the max of depth and detect, since they run in parallel). The round-trip from Mac to Modal and back adds another ~1.1 seconds of network latency. Total: about 1.2 seconds from frame capture to Rerun display.

In Rerun, four panels light up simultaneously: the raw camera feed, the colormapped depth, the detection overlay with orange bounding boxes, and a rotating 3D point cloud colored by the original RGB values. Moving the leader arm moves the follower; the camera pans; the depth and detection overlays update with the new view. Six joint-angle timeseries plot at the bottom. Latency scalars track the round-trip.

The first time you see a robot arm's wrist camera feed rendered as a live 3D point cloud while you teleoperate it — that's the moment the project clicks. The numbers said it would work. Seeing it work is different.

Chapter 10

Enter the Reachy Mini

The SO-101 streaming pipeline works. Now we want to port it to a different robot: the Reachy Mini, a small expressive robot by Pollen Robotics. Reachy Mini has a 6-DOF head (via a Stewart platform), a body that rotates around its vertical axis, two antennas, and — critically — a camera.

The Reachy Mini's hardware is very different from the SO-101. It's wireless: a CM4 (Compute Module 4) onboard, connecting to a laptop via WiFi. The camera streams over the network, not USB. The compute is constrained — the CM4 has 4 GB of RAM and a BCM2711 CPU. Running inference locally is out of the question. But the Modal endpoints don't care what robot took the picture. A frame is a frame.

The question is: can we reuse the same depth and detection endpoints, but with a completely different robot SDK, a browser-based UI instead of Rerun, and a wireless camera instead of a wired one?

💡 The value of HTTP endpoints over libraries

If we had embedded the models into the client code (even as a library call), porting to a new robot would mean porting the inference stack too. Because the models are HTTP endpoints, the port is purely client-side: write a new client that speaks the same JSON schema, point it at the same URLs. The Modal endpoints don't change. The only new code is the glue between the Reachy Mini SDK and the POST calls.

Chapter 11

camera_teleop v0.1 — teleop only

Before adding perception, we build a clean teleop baseline. The Reachy Mini SDK exposes head control via goto_target() (smooth interpolation for gestures) and set_target() (real-time control loops at 10+ Hz). For teleoperation, we want real-time control.

The app is a ReachyMiniApp — Pollen's discoverable app framework that runs on the CM4 and serves a web UI. The architecture:

A control loop running at 50 Hz, calling set_target() to update head roll, pitch, and yaw.
An MJPEG stream at 15 FPS, capturing frames from the CM4's camera and encoding them as JPEG for the browser.
A FastAPI server exposing endpoints: /set_target (absolute pose), /delta (relative nudge), /reset (return to center), /look_at (head follows a pixel click).
A browser UI with three input methods: sliders (roll/pitch/yaw), keyboard (WASD for yaw/pitch, Q/E for roll), and a virtual joystick.

The UI also supports click-to-look: click a pixel in the camera feed, and the head turns to center that pixel. The /look_at endpoint maps pixel (u, v) to head angles using the camera's field of view. This is pure teleop — no inference, no Modal calls, no perception. Just a human pointing a robot's head with their browser.
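The pixel-to-angle mapping behind click-to-look can be sketched with the same pinhole model used for depth unprojection. This is an illustration under assumptions, not the app's actual `/look_at` code: the 60° horizontal FOV default, the square-pixel assumption, and the sign conventions are all hypothetical and would need to match the real camera and head axes.

```python
import math


def pixel_to_angles(u: float, v: float, width: int, height: int,
                    hfov_deg: float = 60.0) -> tuple[float, float]:
    """Return (yaw_deg, pitch_deg) offsets that would center pixel (u, v).

    Assumes a pinhole camera with square pixels and the principal point
    at the image center.
    """
    # Focal length in pixels derived from the assumed horizontal FOV.
    f = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    yaw = math.degrees(math.atan((u - width / 2.0) / f))     # + = right of center
    pitch = math.degrees(math.atan((v - height / 2.0) / f))  # + = below center
    return yaw, pitch


if __name__ == "__main__":
    print(pixel_to_angles(320, 240, 640, 480))  # image center
    print(pixel_to_angles(640, 240, 640, 480))  # right edge
```

The offsets are relative to the current head pose, so a handler would add them to the current yaw and pitch (with whatever signs the robot uses) and clamp to the joint limits.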

v0.1 deploys. It works. The head tracks smoothly, the stream is responsive, the joystick feels natural. We tag the HuggingFace Space commit (8ef6528) and back up the directory. Time to add eyes.

Chapter 12

Porting the perception loop

The v0.2 upgrade adds a perception worker — a background thread that runs alongside the control loop and the MJPEG streamer. Here's how the split maps to the Reachy Mini:

Reachy Mini v0.2 architecture

CM4 camera

capture frame · shared between streams

↓ control (50 Hz) ↓ stream (15 FPS) ↓ perception (2 Hz)

Head control

set_target(roll, pitch, yaw)
sliders + keyboard + joystick

MJPEG /stream.mjpg

raw camera feed
640px, JPEG Q=70

Perception worker

capture → 512px → POST → decode

↓ parallel POST to Modal

Depth Anything v2

same Modal endpoint as SO-101

Grounding DINO

same Modal endpoint as SO-101

↓

/stream.mjpg · raw camera

/detect.mjpg · boxes + labels

/depth.mjpg · colormapped depth

The perception worker is a threading.Thread that loops at 2 Hz. Each tick:

Grab the latest camera frame (same frame the MJPEG streamer is encoding).
Downscale to 512px on the longest side. The CM4 captures at 640×480; the downscale reduces the base64 payload from ~600 KB to ~200 KB, important on WiFi.
PNG-encode and base64-encode the frame.
Fire two parallel POST requests to the Modal endpoints via ThreadPoolExecutor.
Decode depth: base64 → float16 npy → float32 → colormap to RGB.
Decode detections: draw bounding boxes and labels on a copy of the RGB frame.
JPEG-encode both results and store them for the MJPEG streams.
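The downscale step in the tick above comes down to one size computation. A minimal sketch, with `fit_longest_side` as a hypothetical helper name:

```python
def fit_longest_side(width: int, height: int, longest: int = 512) -> tuple[int, int]:
    """Return (width, height) scaled so the longest side equals `longest`.

    Preserves aspect ratio and never upscales.
    """
    scale = longest / max(width, height)
    if scale >= 1.0:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    new_w, new_h = fit_longest_side(640, 480)
    print(new_w, new_h)
    print("pixel ratio:", (new_w * new_h) / (640 * 480))
```

The resulting size can be handed to PIL's `Image.resize((new_w, new_h))` before PNG encoding. Detection boxes come back in the coordinates of the downscaled frame, so they should be drawn on that same frame or rescaled before overlaying on the full-resolution one.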

The key code we ported from stream-fun:

python

# camera_teleop/perception.py: ported from stream-fun/client/

import base64
import io

import numpy as np
from PIL import Image, ImageDraw

def encode_png_b64(img_rgb_hwc_uint8: np.ndarray) -> str:
    buf = io.BytesIO()
    Image.fromarray(img_rgb_hwc_uint8).save(buf, format="PNG", compress_level=3)
    return base64.b64encode(buf.getvalue()).decode()

def decode_depth_npy_b64(b64: str) -> np.ndarray:
    raw = base64.b64decode(b64)
    return np.load(io.BytesIO(raw)).astype(np.float32)

def colormap_depth(depth_hw: np.ndarray) -> np.ndarray:
    # ... same 12-line percentile colormap from Chapter 8
    ...

def draw_detections(rgb: np.ndarray, detections: list) -> np.ndarray:
    pil = Image.fromarray(rgb)
    draw = ImageDraw.Draw(pil)
    for det in detections:
        box = det["box_xyxy"]
        label = f'{det["label"]} {det["score"]:.2f}'
        draw.rectangle(box, outline=(255, 140, 50), width=2)
        draw.text((box[0], box[1] - 12), label, fill=(255, 140, 50))
    return np.asarray(pil)

Four functions, all pure Python + NumPy + PIL. No model code, no torch, no CUDA. The CM4 never runs inference — it encodes images, sends HTTP requests, and decodes responses. The heavy lifting is 1,200 miles away on an L40S.

💡 Thread safety on the CM4

Three threads share state: the control loop updates head pose, the MJPEG streamer reads camera frames, and the perception worker reads camera frames + writes depth/detection results. All shared state is protected by threading.Lock(). The perception worker stores its latest depth and detection frames in a dict under lock; the MJPEG streams read from that dict. The control loop touches only the head pose dict, which has its own lock. No thread touches another's data without acquiring the right lock first.
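The locking pattern described above could be wrapped in a small class. This is a sketch, not the app's code: `PerceptionState` and its field names are hypothetical, chosen to mirror the description of a dict guarded by a lock.

```python
import threading
import time


class PerceptionState:
    """Latest perception results, shared between the worker and the streams."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict = {"last_update": None, "depth_jpeg": None, "detect_jpeg": None}

    def publish(self, depth_jpeg: bytes, detect_jpeg: bytes) -> None:
        """Called by the perception worker after each successful tick."""
        with self._lock:
            self._data.update(
                depth_jpeg=depth_jpeg,
                detect_jpeg=detect_jpeg,
                last_update=time.time(),
            )

    def snapshot(self) -> dict:
        """Called by the stream handlers; returns a copy taken under the lock."""
        with self._lock:
            return dict(self._data)  # shallow copy is enough: bytes are immutable
```

Returning a copy keeps the lock held only for the duration of the dict copy, so a slow MJPEG client never makes the perception worker wait.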

Chapter 13

Issue: Cold starts on a CM4

The first perception request after deploying takes 15–30 seconds. The Modal container is cold: the L40S needs to spin up, the Python process needs to start, the model weights need to load from the Volume into GPU memory. On the SO-101 client, this was fine — the Rerun panels just showed nothing for a few seconds. On the Reachy Mini's browser UI, the depth and detection panels show a black rectangle for 30 seconds, and the user thinks it's broken.

The fix is UX, not infrastructure. The perception worker tracks its state and reports it via /perception_state:

python

@app.get("/perception_state")
async def perception_state():
    with perc_lock:
        return {
            "latency_depth_ms": perc.get("latency_depth_ms"),
            "latency_detect_ms": perc.get("latency_detect_ms"),
            "last_update": perc.get("last_update"),
            "detections": perc.get("detections", []),
            "errors": perc.get("errors", []),
        }

The browser polls this endpoint every 700ms. If last_update is null (no results yet), the UI shows "Warming up Modal endpoint…" with a spinner instead of a black rectangle. Once the first result arrives, the panels switch to showing real data and the latency readout starts updating.

ℹ Why not pre-warm?

Modal supports min_containers=1 (previously keep_warm=1) to maintain a container even with no traffic. For a personal dev project, that costs ~$2.50/hr for an idle L40S. The cold start is 30 seconds, once per session. We'll take the 30 seconds.

Chapter 14

The three-panel UI

The Reachy Mini's browser UI has three video panels side by side:

Left panel

/stream.mjpg

Raw camera feed. Click anywhere to point the head at that pixel.

Center panel

/detect.mjpg

Grounding DINO boxes + labels + confidence scores.

Right panel

/depth.mjpg

Colormapped depth. Blue = close, red = far.

Below the panels: a text input for the Grounding DINO prompt. Type cup. book. phone., hit Apply, and the detection panel starts finding those objects. The prompt propagates via POST /set_prompt to the perception worker, which includes it in the next Modal request. No restart, no redeploy — just a different string in the JSON payload.

Below that: the control section. Three range sliders for roll (±40°), pitch (±40°), and yaw (±180°). A circular joystick that maps pointer position to yaw (horizontal) and pitch (vertical). Keyboard controls: arrow keys for yaw/pitch, Q/E for roll, R for reset. A status panel shows the current head pose and perception latencies.

The entire UI is static HTML + CSS + vanilla JS, served from the CM4's Python process. No React, no build step, no npm. The JS is ~200 lines: fetch helpers, slider event handlers, pointer tracking for the joystick, keyboard bindings, and a 700ms poll loop for perception state. It loads in under 100ms on any device.

💡 MJPEG is the right format for robot cameras

WebRTC gives you lower latency and adaptive bitrate. But it requires a signaling server, STUN/TURN for NAT traversal, and codec negotiation. MJPEG is a sequence of JPEG images with a multipart boundary. It works in an <img> tag with no JavaScript. It works through any HTTP proxy. For a robot on a local network at 15 FPS, the quality is fine and the implementation is twenty lines of Python.
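To make the "sequence of JPEG images with a multipart boundary" concrete, here is a minimal sketch of the framing. The boundary name and helper names are hypothetical, and the input is any iterable of already-encoded JPEG bytes.

```python
from typing import Iterable, Iterator

BOUNDARY = "frame"  # hypothetical boundary name; any token works


def mjpeg_chunks(jpeg_frames: Iterable[bytes], boundary: str = BOUNDARY) -> Iterator[bytes]:
    """Wrap each JPEG in one part of a multipart/x-mixed-replace stream."""
    for jpeg in jpeg_frames:
        header = (
            f"--{boundary}\r\n"
            "Content-Type: image/jpeg\r\n"
            f"Content-Length: {len(jpeg)}\r\n\r\n"
        ).encode("ascii")
        yield header + jpeg + b"\r\n"


def mjpeg_media_type(boundary: str = BOUNDARY) -> str:
    """Value for the HTTP Content-Type header of the whole response."""
    return f"multipart/x-mixed-replace; boundary={boundary}"
```

With FastAPI, a handler could return `StreamingResponse(mjpeg_chunks(frames), media_type=mjpeg_media_type())`, where `frames` is a generator that yields the latest JPEG at the target frame rate.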

Chapter 15

Honest evaluation

What works:

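The split architecture delivers on its promise: the Reachy Mini's control loop runs at 50 Hz while perception updates at ~2 Hz in a separate thread, and neither blocks the other. On the SO-101, the current single-loop client still relays teleop at perception rate (see Chapter 6), which felt responsive in testing; moving the relay to its own thread is the remaining step.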
The same two Modal endpoints serve both the SO-101 (via Rerun) and the Reachy Mini (via browser). No model code was duplicated. The port was purely client-side.
The prompt is editable at runtime. Switch from "cube. ball." to "screwdriver. wire." and the detection boxes update on the next perception tick.
Depth colormap is useful for spatial awareness during teleop — the operator can see relative distances at a glance.
The point cloud on the SO-101 side (via Rerun) is convincing enough to orient yourself in 3D even with uncalibrated intrinsics.
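Total operating cost: roughly $0.08 per minute of active streaming. Modal bills per second for the time a container is up, not only for the ~130ms of compute per frame, so two warm L40S containers at ~$2.50/hr each cost about $5/hr while a session runs, plus the 5-minute scaledown window afterwards.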

What's broken or limited:

Relative depth, not metric. Depth Anything v2-Small predicts relative depth, so its output is scale-ambiguous. The point cloud has a plausible shape, but the distances are not in meters. Fix: use the -Metric-Indoor- checkpoint, or scale relative depth using one known-distance reference from the arm's forward kinematics.
No world-frame accumulation. Each frame's point cloud is in camera frame, overwritten by the next frame. There's no "dollhouse" building up over time. Fix: use SO-101's URDF to compute the camera-to-world transform via forward kinematics, then accumulate each frame's cloud in world coordinates.
8 Hz ceiling. The per-frame-two-endpoints design caps out at roughly 8 Hz even with warm containers, consistent with ~130 ms of compute per frame (1 / 0.13 s ≈ 7.7 Hz). Higher rates need batching (send N frames per request) or alternating depth/detect on odd/even frames.
Uncalibrated intrinsics. The f = max(H, W) guess works for visualization but distorts the point cloud. A ChArUco calibration board would take 10 minutes and fix this permanently.
No local fallback. If Modal is down, the perception panels go dark. A lighter-weight local model (MiDaS small on MPS, or a quantized GDINO) could serve as a degraded fallback.
30-second cold starts. The first perception frame after a long idle takes 30 seconds. Acceptable for dev; not acceptable for a demo. keep_warm=1 (min_containers=1 in newer Modal releases) would fix it at ~$2.50/hr per warm container.
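The single-reference fix for relative depth mentioned in the list above can be sketched in a few lines. This is a simplification under stated assumptions: the map is treated as depth-like (larger means farther) and wrong only by one global scale factor. Relative-depth models generally leave a shift undetermined as well, which one reference point cannot recover. The function name and the nested-list representation are illustrative.

```python
def scale_depth_to_metric(rel_depth, ref_pixel, ref_distance_m):
    """Rescale a relative depth map so the reference pixel reads ref_distance_m.

    rel_depth: 2D list (rows of floats), assumed off by a global scale only.
    ref_pixel: (row, col) of a point whose true distance is known,
               e.g. from the arm's forward kinematics.
    ref_distance_m: that known distance, in meters.
    """
    row, col = ref_pixel
    ref_value = rel_depth[row][col]
    if ref_value <= 0:
        raise ValueError("reference pixel has no usable depth")
    scale = ref_distance_m / ref_value
    return [[value * scale for value in line] for line in rel_depth]
```

With a second known distance one could fit scale and shift together by least squares, which is the more defensible version of the same idea.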

ℹ What we did not measure

We didn't benchmark detection accuracy on a held-out dataset. Grounding DINO's zero-shot performance on arbitrary prompts varies widely — it's excellent on common objects ("cup", "chair") and mediocre on fine-grained categories ("M4 hex screw"). For teleop, the operator compensates for missed detections; for autonomous grasping, you'd need to quantify the recall.
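For reference, quantifying recall needs little more than an IoU function and a matching rule. The sketch below is one common recipe (greedy matching at an IoU threshold, labels ignored for brevity), not something we ran. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(ground_truth, detections, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by a distinct detection."""
    if not ground_truth:
        raise ValueError("recall is undefined without ground-truth boxes")
    unused = list(detections)
    matched = 0
    for gt in ground_truth:
        # Greedy: take the best remaining detection for this ground-truth box.
        best = max(unused, key=lambda det: iou(gt, det), default=None)
        if best is not None and iou(gt, best) >= iou_threshold:
            unused.remove(best)  # each detection may match at most one box
            matched += 1
    return matched / len(ground_truth)
```

A real evaluation would also match on the predicted label and report recall per prompt, since that is where zero-shot performance varies most.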

Chapter 16

What's next

The short backlog, ranked by leverage:

Forward kinematics for world-frame clouds. The SO-101's URDF gives us the camera-to-world transform for every joint configuration. With that, each frame's point cloud transforms into a shared world frame, and we get a "dollhouse" that builds up as the arm moves — the GAS v2 concept, but live.
Metric depth via camera-height prior. If the arm's base is at a known height and the camera looks at the table, we have one ground-truth distance. Scale relative depth to that reference and the point cloud becomes metric.
SAM 2 endpoint for mask tracking. Add a third Modal app running SAM 2. Seed it from GDINO's boxes, track masks across frames. The detection boxes already identify objects; masks would give per-pixel segmentation.
On-device fallback for Reachy Mini. A quantized Depth Anything v2-Small running on the CM4's CPU at 0.5 Hz would give degraded-but-functional depth when Modal is cold or unreachable.
Multi-robot dashboard. Both robots stream to the same Modal endpoints. A shared dashboard showing both robots' perception feeds side by side, with a shared scene graph, would be a natural next step.
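The first backlog item, world-frame accumulation, comes down to one matrix multiply per frame plus a buffer. A dependency-free sketch follows, assuming the camera-to-world transform arrives as a 4x4 row-major nested list (computing it from the URDF is the part not shown). The class name and the point cap are hypothetical.

```python
def transform_points(T_world_cam, points_cam):
    """Apply a 4x4 homogeneous transform to a list of (x, y, z) camera-frame points."""
    out = []
    for x, y, z in points_cam:
        out.append(tuple(
            T_world_cam[r][0] * x + T_world_cam[r][1] * y + T_world_cam[r][2] * z + T_world_cam[r][3]
            for r in range(3)
        ))
    return out


class WorldCloud:
    """Accumulates per-frame camera-frame clouds in a shared world frame."""

    def __init__(self, max_points=200_000):
        self.max_points = max_points
        self.points = []

    def add_frame(self, T_world_cam, points_cam):
        """Transform one frame's cloud into world coordinates and append it."""
        self.points.extend(transform_points(T_world_cam, points_cam))
        # Drop the oldest points once the cap is hit so memory stays bounded.
        if len(self.points) > self.max_points:
            self.points = self.points[-self.max_points:]
```

In practice the per-point loop would be a single NumPy matrix product, and the cap would likely be replaced by voxel downsampling, but the data flow is the same: camera-frame cloud in, world-frame cloud appended.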

Chapter 17

The full recipe

Everything compressed to a runbook. Two repos, three deploys, one streaming session.

bash

# ── Deploy Modal endpoints ──────────────────────────────────────
cd stream-fun
pip install modal
modal setup                                    # browser OAuth

modal deploy experiments/depth/app.py          # → stream-fun-depth
modal deploy experiments/detect/app.py         # → stream-fun-detect
# Note the endpoint URLs from the output.

# ── SO-101 streaming (Rerun) ────────────────────────────────────
source ../cs-224r-final-project/.venv/bin/activate
python client/stream.py \
    --prompt "cube. ball. wheel. eraser." \
    --hz 2.0 \
    --depth-url https://YOUR_WS--stream-fun-depth-depthpipeline-infer.modal.run \
    --detect-url https://YOUR_WS--stream-fun-detect-detectpipeline-infer.modal.run

# Move the leader arm → follower mirrors → Rerun shows live
# depth + detections + 3D point cloud.

# ── Reachy Mini (browser UI) ───────────────────────────────────
cd ../reachy_mini
source .venv/bin/activate
# The camera_teleop app runs on the CM4. Deploy via:
reachy-mini-app-assistant create camera_teleop . --publish
# Then open http://reachy-mini.local:8000 in a browser.
# Three panels: raw cam, detection overlay, depth colormap.
# Type a prompt, hit Apply, detections update live.

💡 The one-line summary of everything above

Teleop and perception have different latency budgets. Split them into different loops on different hardware. Make the perception path stateless HTTP so any robot can use it. Serialize the minimum viable payload (float16 depth, JSON boxes). Render locally. The models live in the cloud; the experience lives on the robot.

Full source: stream-fun/ for Modal endpoints and SO-101 client, reachy_mini_apps/camera_teleop/ for the Reachy Mini app. Next Mirdan experiment: forward-kinematics world-frame accumulation — the live dollhouse.