TL;DR
We wanted to give a robot arm eyes that understand what they see — in real time, during teleoperation, with no local GPU. We built two Modal endpoints (Depth Anything v2 and Grounding DINO), a streaming client that posts wrist-camera frames at 2 Hz while teleop runs at full speed on the local machine, and a Rerun viewer that renders depth, detections, and a live 3D point cloud. Then we ported the whole perception stack to a Reachy Mini robot and put it behind a three-panel browser UI. Two robots, one pipeline, zero local inference.
Teleop and perception have opposite latency requirements. Teleop needs sub-millisecond response — any network hop makes the arm feel dead. Perception needs GPU compute but can tolerate a missed frame. The design that falls out of this is a split architecture: teleop runs locally in a tight loop, perception runs in the cloud in a slow loop, and both share the same camera feed. The rest of this article is the engineering required to make that split clean.
This article assumes you're comfortable with Python, basic robotics concepts (joint angles, end-effectors), and REST APIs, and that you have seen the GAS v2 Mirdan or know what Modal is. We re-derive the depth unprojection math from scratch. We don't re-derive Depth Anything or Grounding DINO; we treat them as black-box endpoints and focus on the integration.
The problem
You're teleoperating a robot arm. You move a leader arm on your desk; a follower arm across the room mirrors your motion. A tiny USB camera is mounted on the follower's wrist, pointing wherever the gripper points. You can see the raw RGB feed. That's it.
Now you want more. You want to see what the robot sees in 3D — a live depth map that shows how far away objects are. You want to see what the robot recognizes — bounding boxes around objects you named in a text prompt, updating as the arm moves. You want a point cloud you can rotate and inspect. And you want all of this without slowing down the teleop loop — the arm must remain responsive even while the perception pipeline crunches.
The naive approach is to run everything on the Mac. Depth Anything v2 is a ViT — even the Small variant pushes MPS to 200+ms per frame. Grounding DINO adds another 300ms. At 500ms per frame, you're at 2 Hz — which would be fine for perception, except that on MPS both models are fighting the teleop thread for memory bandwidth and compute. The arm stutters. The camera feed drops frames. The experience degrades from "teleoperation with perception" to "laggy arm with occasional inference."
The solution that falls out: move inference off the Mac entirely. Deploy the models on cloud GPUs. Have the Mac post camera frames over HTTP and render the results when they arrive. Teleop never waits for inference. Inference never competes for local resources. The cost is latency on the perception path — about 1.2 seconds round-trip — but a 1.2-second-old depth map is infinitely more useful than no depth map at all.
Teleop at full speed (50+ Hz) with zero perceptible lag. Depth and detection overlays updating at 2 Hz minimum. 3D point cloud viewable in real time. The prompt is editable at runtime — change what the robot looks for without restarting anything. All on hardware we already have: an M-series Mac and two robot arms with USB cameras.
Two paths, one arm
The architecture splits into two concurrent loops sharing the same hardware:
[Diagram: the tight path (no network hop, runs at control rate) and the slow path (sampled at --hz, default 2).]
The tight path is the teleop relay. It reads the leader's joint positions and sends them to the follower. This loop runs on the Mac at whatever rate the USB bus supports — effectively hundreds of Hz, throttled by the control rate. There is no network call in this path. The leader and follower are both plugged into the Mac via USB. If Modal goes down, if the WiFi drops, if the cloud GPUs are all busy — the arm keeps working.
The slow path samples the wrist camera at a configured rate (default 2 Hz), encodes the frame as a base64 PNG, and fires two parallel HTTP POST requests to Modal endpoints: one for depth, one for detection. When the responses arrive, it decodes them and logs everything to a local Rerun viewer.
The two paths share one thing: the follower object from LeRobot. The tight path calls
send_action(); the slow path calls get_observation(). Both are thread-safe
in LeRobot's implementation. A ThreadPoolExecutor(max_workers=2) handles the parallel
POST requests on the slow path without blocking the tight path.
Two concurrent blocking HTTP calls is the kind of problem asyncio was designed for. But the rest
of the code — LeRobot's hardware interface, Rerun's logging, NumPy image processing —
is synchronous. Wrapping two urllib.request calls in a thread pool is three lines of
code, composes with the existing synchronous loop, and handles exactly our concurrency need:
two parallel I/O operations, not thousands. The right amount of async is the amount the problem
needs.
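The thread-pool pattern described above can be sketched as follows. This is a minimal illustration, not the project's code: `fake_depth_call` and `fake_detect_call` are hypothetical stand-ins that sleep instead of making HTTP requests, so the only thing shown is that two blocking calls submitted to a two-worker pool overlap rather than run back to back.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_depth_call(frame):
    """Hypothetical stand-in for the blocking depth POST."""
    time.sleep(0.3)
    return {"kind": "depth", "frame": frame}


def fake_detect_call(frame, prompt):
    """Hypothetical stand-in for the blocking detection POST."""
    time.sleep(0.3)
    return {"kind": "detect", "frame": frame, "prompt": prompt}


executor = ThreadPoolExecutor(max_workers=2)

t0 = time.perf_counter()
# Both calls start immediately; each runs on its own worker thread.
fut_d = executor.submit(fake_depth_call, "frame-0")
fut_det = executor.submit(fake_detect_call, "frame-0", "cube. ball.")
depth_resp = fut_d.result(timeout=5)
detect_resp = fut_det.result(timeout=5)
elapsed = time.perf_counter() - t0
executor.shutdown()

# Wall time is close to the slower call, not the sum of both.
print(f"both calls finished in {elapsed:.2f}s")
```

The calling code stays synchronous: it submits, waits on both futures, and carries on, which is why this composes with a plain `while` loop.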
The stack, from first principles
Two models, each doing one thing, deployed as independent HTTP endpoints.
3.1 — Depth Anything v2: monocular depth for free
Depth Anything v2 (2024) is a monocular depth estimator built on DPT + DINOv2 features. You give it a single RGB image; it returns a dense depth map at the same resolution. The model comes in four sizes: Small (25M params), Base (98M), Large (336M), and Giant (1.3B). For real-time streaming at 2 Hz, Small is plenty — ~60ms per frame on an L40S, with depth quality that's more than sufficient for visualization and coarse 3D lifting.
The catch: relative depth, not metric. Depth Anything v2-Small outputs a depth map
where values are ordered correctly (closer things have smaller values) but the absolute scale is
arbitrary. "Pixel A is closer than pixel B" is reliable. "Pixel A is 1.2 meters away"
is not. For our use case — live visualization and qualitative point clouds — relative depth
is fine. For metric reconstruction, you'd need the -Metric-Indoor- checkpoint or a
known-distance reference point from the arm's kinematics.
Like Grounding DINO in the GAS v2 experiment, we use the HuggingFace
transformers port (AutoModelForDepthEstimation) rather than the paper's
original repo. Same model weights, pure PyTorch, no custom CUDA ops, installs without drama.
The checkpoint is depth-anything/Depth-Anything-V2-Small-hf.
3.2 — Grounding DINO: text-prompted detection
We covered Grounding DINO in the GAS v2 article.
The quick version: give it an image and a text prompt like "cube. ball. wheel. eraser.",
and it returns bounding boxes for those objects. No training data, no fine-tuning — the vocabulary
is set at inference time.
For streaming, the key property is that the prompt is a runtime parameter. The teleoperator can change what the robot looks for while the arm is moving — switch from "cube. ball." to "screwdriver. wire." with one API call. The Modal endpoint accepts the prompt as a field in the POST payload.
3.3 — Rerun: the right viewer for robotics data
Rerun is a visualization SDK for multi-modal, time-series data. You log images, point clouds, scalars, and transforms under a hierarchical entity tree with timestamps. It renders them in synced 2D and 3D views. For robotics, this is the exact right tool: you want to see the camera feed, the depth overlay, the detection boxes, the 3D point cloud, and the joint states all at the same time, all scrubbed to the same instant.
Our Rerun entity tree:
| Entity path | Type | Source |
|---|---|---|
| cam/raw | Image | Wrist camera frame |
| cam/depth | Image | Colormapped depth from Modal |
| cam/detect | Image | RGB + GDINO boxes overlay |
| world/cloud | Points3D | Unprojected depth → 3D |
| state/<joint> | Scalars | 6 joint angles over time |
| latency/* | Scalars | Round-trip, depth, detect ms |
| prompt | TextLog | Current GDINO prompt |
Writing the Modal endpoints
Each model gets its own Modal app. This is a deliberate choice — depth and detection have different memory profiles, different scaling characteristics, and different failure modes. A depth endpoint crash shouldn't take down detection. Independent deployment also means we can swap out models without touching the other endpoint.
The pattern is the same for both. A persistent Volume caches HuggingFace weights. An
@modal.enter() method loads the model once when the container starts. A FastAPI
endpoint accepts a base64-encoded PNG image and returns the inference result. The container
stays warm for 5 minutes (scaledown_window=300) so the second request in a
streaming session hits a warm model.
# experiments/depth/app.py: the depth endpoint
import base64
import io

import modal

# The torch, numpy (np), and transformers imports, plus decode_b64_image()
# and self._run(), are elided in this excerpt.

APP_NAME = "stream-fun-depth"
CHECKPOINT = "depth-anything/Depth-Anything-V2-Small-hf"

app = modal.App(APP_NAME)
weights_vol = modal.Volume.from_name("stream-fun-depth-weights", create_if_missing=True)

image = (
    modal.Image.from_registry("nvidia/cuda:12.4.1-devel-ubuntu22.04", add_python="3.12")
    .apt_install("git", "ffmpeg", "libgl1", "libglib2.0-0")
    .pip_install("torch>=2.5", "transformers>=4.45", "pillow", "numpy")
    .env({"HF_HOME": "/weights/hf", "TRANSFORMERS_CACHE": "/weights/hf"})
)


@app.cls(image=image, gpu="L40S",
         volumes={"/weights": weights_vol},
         timeout=1800, scaledown_window=300, max_containers=2)
class DepthPipeline:
    @modal.enter()
    def _load(self):
        # Runs once per container start, so requests hit a loaded model.
        self.processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
        self.model = AutoModelForDepthEstimation.from_pretrained(
            CHECKPOINT, torch_dtype=torch.float16
        ).to("cuda").eval()

    @modal.fastapi_endpoint(method="POST", docs=True)
    def infer(self, payload: dict) -> dict:
        img = decode_b64_image(payload["image_b64"])
        depth = self._run(img)                    # [H, W] float32
        buf = io.BytesIO()
        np.save(buf, depth.astype(np.float16))    # half the bytes
        return {
            "depth_b64": base64.b64encode(buf.getvalue()).decode(),
            "shape": list(depth.shape),
            "latency_ms": round(self._last_ms, 2),
        }
The depth map is serialized as a float16 NumPy array, base64-encoded, and sent in the JSON response.
Float16 halves the payload size compared to float32: roughly 600 KB of raw array data per frame at 640×480
instead of 1.2 MB (about 0.8 MB versus 1.6 MB once base64-encoded). The client decodes it with np.load(BytesIO(b64decode(...))).
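To make the wire format concrete, here is a small round-trip sketch under stated assumptions: the depth map is synthetic, and `encode_depth_npy_b64` is a hypothetical helper that mirrors what the endpoint does inline. It shows the raw float16 size at 640×480, the base64 overhead, and the precision lost by the float16 cast.

```python
import base64
import io

import numpy as np


def encode_depth_npy_b64(depth_hw: np.ndarray) -> str:
    """Hypothetical helper: float16 .npy bytes, then base64 text."""
    buf = io.BytesIO()
    np.save(buf, depth_hw.astype(np.float16))
    return base64.b64encode(buf.getvalue()).decode()


def decode_depth_npy_b64(b64: str) -> np.ndarray:
    """Client side: base64 text -> .npy bytes -> float32 array."""
    raw = base64.b64decode(b64)
    return np.load(io.BytesIO(raw)).astype(np.float32)


# Synthetic stand-in for a 640x480 depth map (values are made up).
depth = np.linspace(0.5, 4.0, 480 * 640, dtype=np.float32).reshape(480, 640)

b64 = encode_depth_npy_b64(depth)
restored = decode_depth_npy_b64(b64)

raw_bytes = depth.size * 2  # 2 bytes per float16 value
max_err = float(np.abs(restored - depth).max())
print(f"raw float16: {raw_bytes} bytes, base64 text: {len(b64)} chars")
print(f"max round-trip error: {max_err:.5f}")
```

Base64 adds about a third on top of the raw bytes, so the JSON field is larger than the array itself; the float16 rounding error stays small relative to the depth range used here.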
The detection endpoint follows the same pattern but with two additional fields in the request: the text prompt and optional confidence thresholds.
# experiments/detect/app.py: the detection endpoint
# (method of the detection pipeline class; model loading elided)
@modal.fastapi_endpoint(method="POST", docs=True)
def infer(self, payload: dict) -> dict:
    img = decode_b64_image(payload["image_b64"])
    prompt = payload.get("prompt", "object.")
    box_t = float(payload.get("box_threshold", 0.25))
    text_t = float(payload.get("text_threshold", 0.25))

    # Normalize prompt: lowercase, ensure trailing period
    text = prompt.lower().strip()
    if not text.endswith("."):
        text += "."

    detections = self._run(img, text, box_t, text_t)
    return {
        "detections": [
            {"box_xyxy": box, "score": float(score), "label": str(label)}
            for box, score, label in detections
        ],
        "latency_ms": round(self._last_ms, 2),
    }
Grounding DINO's prompt format is "class1. class2. class3." — lowercase
phrases separated by periods. The endpoint normalizes this: lowercases the prompt and appends a
period if missing. This means the client can send "cube ball wheel" and it'll
work, but the canonical format with periods is still preferred because it separates the text queries
the model internally aligns to image regions. Without periods, "cube ball wheel" becomes
one continuous text query that matches poorly.
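On the client side, one way to always send the canonical format is to build the prompt from a list of class names. The helper below is a hypothetical sketch (`build_prompt` is not part of the project); it lowercases each name, strips stray periods, and joins with ". " so every class becomes its own text query.

```python
def build_prompt(class_names):
    """Join class names into the 'a. b. c.' form, lowercase and period-terminated."""
    phrases = [name.strip().lower().rstrip(".") for name in class_names]
    phrases = [p for p in phrases if p]  # drop empty entries
    if not phrases:
        return "object."  # same fallback string the endpoint uses by default
    return ". ".join(phrases) + "."


print(build_prompt(["Cube", " ball ", "wheel."]))  # cube. ball. wheel.
print(build_prompt([]))                            # object.
```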
First deploy: The fp16 surprise
Both endpoints deploy. Depth works immediately. Detection crashes on the first real frame with a dtype mismatch (Float versus Half).
The first attempt loaded Grounding DINO in fp16, same as the depth model. Memory efficient, faster inference, worked fine for Depth Anything. But Grounding DINO's architecture has a text encoder (BERT-based) that produces fp32 features and a vision encoder that respects the model dtype. When you load the whole model in fp16, the BERT encoder's attention weights become fp16, but some internal ops still produce fp32 intermediate tensors. The matmul at the cross-attention boundary gets Float on one side and Half on the other.
The model is only ~180 MB. Running it in fp32 on an L40S (48 GB) costs essentially nothing in memory. The inference time difference between fp16 and fp32 for a single image is under 10ms. Not worth the debugging.
"Load everything in half precision" is good default advice for single-architecture models. For models that fuse two architectures (a text encoder + a vision encoder, a diffusion UNet + a text encoder), the junction between them is where dtype mismatches live. If inference crashes on the first real input, check the boundary.
The streaming client
The client is a single Python file: client/stream.py, ~345 lines. It connects to the
SO-101 hardware via LeRobot, spawns a Rerun viewer, and runs the main loop.
while not stop["v"]:
    t_loop = time.perf_counter()

    # 1. Tight path: relay teleop (no network)
    if leader is not None:
        action = leader.get_action()
        follower.send_action(action)

    # 2. Sample observation from the follower
    obs = follower.get_observation()
    state = [float(obs[f"{j}.pos"]) for j in JOINT_NAMES]
    frame = obs.get("wrist")  # 640×480 RGB uint8

    # 3. Encode, POST to both endpoints in parallel
    img_b64 = encode_png_b64(frame)
    fut_d = executor.submit(post_json, args.depth_url,
                            {"image_b64": img_b64})
    fut_det = executor.submit(post_json, args.detect_url,
                              {"image_b64": img_b64, "prompt": args.prompt})

    # 4. Collect results (with timeout)
    depth_resp = fut_d.result(timeout=30)
    detect_resp = fut_det.result(timeout=30)

    # 5. Log everything to Rerun
    rr.set_time("wall", timestamp=time.time())
    rr.log("cam/raw", rr.Image(frame))
    # ... depth colormap, detection overlay, point cloud, scalars

    # 6. Respect target Hz
    remaining = 1 / args.hz - (time.perf_counter() - t_loop)
    if remaining > 0:
        time.sleep(remaining)
Three things to notice about this loop.
Step 1 runs on every iteration, but it shares that iteration with Steps 3–4. In the current
implementation, the loop blocks on fut_d.result() and fut_det.result() while waiting for the responses.
That means the teleop relay also runs at perception rate (~2 Hz at best), not at full speed. This is the
simplest version. The next iteration would move the teleop relay to its own thread with a dedicated
timer. For our testing, 2 Hz teleop felt responsive enough (the arm interpolates between
commands, so it moves smoothly).
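A rough sketch of what that decoupled relay could look like, under loud assumptions: `FakeArm` is a hypothetical stand-in for the LeRobot leader and follower objects, the 50 Hz rate is illustrative, and the sketch relies on the thread-safety of the follower described earlier. It is not the project's implementation.

```python
import threading
import time


class FakeArm:
    """Hypothetical stand-in for the LeRobot leader/follower objects."""

    def __init__(self):
        self.sent = []

    def get_action(self):
        return {"j1.pos": 0.0}  # placeholder joint command

    def send_action(self, action):
        self.sent.append(action)


def teleop_relay(leader, follower, stop_event, rate_hz=50.0):
    """Relay leader actions to the follower at a fixed rate until stopped."""
    period = 1.0 / rate_hz
    next_tick = time.perf_counter()
    while not stop_event.is_set():
        follower.send_action(leader.get_action())
        next_tick += period
        # Event.wait doubles as a sleep that ends early when stop is set.
        stop_event.wait(max(0.0, next_tick - time.perf_counter()))


leader, follower = FakeArm(), FakeArm()
stop = threading.Event()
relay = threading.Thread(
    target=teleop_relay, args=(leader, follower, stop), daemon=True
)
relay.start()

time.sleep(0.5)  # stands in for one slow perception round-trip on the main thread
stop.set()
relay.join(timeout=1.0)
print(f"relayed {len(follower.sent)} actions during one perception tick")
```

With this shape, the main loop keeps only Steps 2 through 6, and a slow or failed request no longer delays the next command to the arm.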
The timeout=30 on .result() is deliberately long. Modal
endpoints have cold starts of 15–30 seconds when the container has scaled to zero. A 30-second
timeout accommodates the first request of a session. Subsequent requests on a warm container
complete in ~1.2 seconds.
No external HTTP library. The post_json() helper uses stdlib
urllib.request. For two concurrent POST requests per loop iteration, this is fine.
The overhead of urllib vs. requests vs. httpx is negligible
next to a 1.2-second network round-trip. One fewer dependency in the venv.
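For reference, a stdlib-only helper of this kind can be very small. The version below is a sketch of how such a `post_json()` might be written, not the project's actual helper; `build_request` is a hypothetical split introduced here so the request can be inspected without touching the network, and the URL is a placeholder.

```python
import json
import urllib.request


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request (hypothetical helper, split out for clarity)."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """Blocking POST that returns the decoded JSON body."""
    with urllib.request.urlopen(build_request(url, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Inspect the request without sending it (placeholder URL).
req = build_request(
    "https://example.invalid/infer",
    {"image_b64": "AAAA", "prompt": "cube. ball."},
)
print(req.get_method(), req.full_url, len(req.data), "bytes")
```

Because `post_json` is an ordinary blocking function, it can be handed straight to `executor.submit(...)` as in the loop above.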
From pixels to points
The depth endpoint returns a 2D array: [H, W] of relative depth values. To build a
3D point cloud, we need to unproject each pixel into camera-frame 3D coordinates.
This is the inverse of the camera's projection: given a pixel (u, v) and its depth
z, find the 3D point (x, y, z) in the camera's coordinate system.
The pinhole camera model says:
u = fx · X/Z + cx
v = fy · Y/Z + cy
Inverting, with Z = z (the depth at pixel (u, v)):
X = (u − cx) · Z / fx
Y = (v − cy) · Z / fy
Where fx, fy are the focal lengths in pixels and
cx, cy is the principal point (roughly the image center).
We don't have calibrated intrinsics for the SO-101's USB camera, so we use a reasonable guess:
f = max(H, W), principal point at the center. At 640×480 this corresponds to a horizontal
FOV of about 53°, which is close enough for a generic USB webcam.
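The field of view implied by the f = max(H, W) guess can be checked directly, and the inverted pinhole equations can be tried on a couple of pixels. This is a worked example only; the function names are illustrative and the depth value of 2.0 is arbitrary (relative units).

```python
import math


def horizontal_fov_deg(width_px: float, focal_px: float) -> float:
    """Horizontal field of view implied by a focal length given in pixels."""
    return math.degrees(2.0 * math.atan(width_px / (2.0 * focal_px)))


def unproject_pixel(u, v, z, fx, fy, cx, cy):
    """Invert the pinhole projection for one pixel with depth z."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


w, h = 640, 480
f = float(max(h, w))        # the uncalibrated guess
cx, cy = w / 2, h / 2       # principal point at the image center

print(round(horizontal_fov_deg(w, f), 1))          # 53.1
print(unproject_pixel(cx, cy, 2.0, f, f, cx, cy))  # (0.0, 0.0, 2.0)
print(unproject_pixel(w, cy, 2.0, f, f, cx, cy))   # (1.0, 0.0, 2.0)
```

The center pixel lands on the optical axis, and a pixel at the right edge lands half an image-width-per-focal-length to the side, scaled by depth.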
# client/lift.py: depth unprojection
import numpy as np


def unproject_depth(depth_hw, rgb_hwc, K=None, stride=4,
                    min_depth=1e-3, max_depth=None):
    h, w = depth_hw.shape[:2]
    if K is None:
        # Uncalibrated guess: f = max(H, W), principal point at the center
        f = float(max(h, w))
        K = np.array([[f, 0, w/2], [0, f, h/2], [0, 0, 1]])
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Subsample the pixel grid, then keep only finite, in-range depths
    ys, xs = np.mgrid[0:h:stride, 0:w:stride]
    zz = depth_hw[ys, xs]
    valid = np.isfinite(zz) & (zz > min_depth)
    if max_depth is not None:
        valid &= (zz < max_depth)
    xs, ys, zz = xs[valid], ys[valid], zz[valid]

    # Inverse pinhole projection
    x3 = (xs - cx) / fx * zz
    y3 = (ys - cy) / fy * zz
    pts = np.stack([x3, y3, zz], axis=-1).astype(np.float32)
    colors = rgb_hwc[ys, xs]
    return pts, colors
The stride=4 means we take every 4th pixel in both dimensions. At 640×480,
that's (160×120) = 19,200 points per frame — dense enough for a satisfying point cloud,
sparse enough that Rerun renders it instantly. The colors come from sampling the original RGB
frame at the same pixel locations, so each 3D point carries its real color.
The point cloud is for visualization, not metric measurement. A 10% error in focal length stretches or compresses the cloud slightly — it still looks like the right scene, just with slightly off proportions. If we needed metric reconstruction (for grasping, for path planning), we'd calibrate with a ChArUco board. For "show the operator what the arm sees in 3D," the guess is good enough.
Colormap without matplotlib
The depth map from the endpoint is a float array. To display it as an image, we need a colormap.
The obvious tool is matplotlib.cm — but matplotlib is 30 MB of dependencies
we don't need on the client. Instead, a 12-line NumPy function that maps depth to a
blue→green→red ramp:
import numpy as np


def colormap_depth(depth_hw: np.ndarray) -> np.ndarray:
    d = depth_hw.copy()
    valid = np.isfinite(d)
    if not valid.any():
        return np.zeros((*d.shape, 3), dtype=np.uint8)
    # Percentile bounds keep outliers from crushing the color range
    lo = np.percentile(d[valid], 2)
    hi = np.percentile(d[valid], 98)
    n = np.clip((d - lo) / max(hi - lo, 1e-6), 0, 1)
    # Blue at n=0, green at n=0.5, red at n=1
    r = np.clip(2 * n - 1, 0, 1)
    g = 1 - np.abs(2 * n - 1)
    b = np.clip(1 - 2 * n, 0, 1)
    rgb = np.stack([r, g, b], axis=-1)
    rgb[~valid] = 0
    return (rgb * 255).astype(np.uint8)
The percentile(2, 98) normalization is the key trick. Raw depth values can have
extreme outliers — a pixel at infinity, a pixel at zero from a failed estimate. Min/max
normalization lets those outliers crush the color range. Percentile normalization ignores the
extremes and maps the useful depth range to the full color spectrum. Blue is close, red is far,
green is middle distance.
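The effect of a single outlier on the two normalizations is easy to demonstrate on synthetic data. The sketch below uses made-up depth values with one bad pixel; `normalize_minmax` and `normalize_percentile` are illustrative helpers, not functions from the client.

```python
import numpy as np


def normalize_minmax(d: np.ndarray) -> np.ndarray:
    lo, hi = d.min(), d.max()
    return (d - lo) / max(hi - lo, 1e-6)


def normalize_percentile(d: np.ndarray, p_lo=2, p_hi=98) -> np.ndarray:
    lo, hi = np.percentile(d, p_lo), np.percentile(d, p_hi)
    return np.clip((d - lo) / max(hi - lo, 1e-6), 0, 1)


rng = np.random.default_rng(0)
# Synthetic scene: depths between 1 and 3, plus one failed pixel.
depth = rng.uniform(1.0, 3.0, size=(120, 160)).astype(np.float32)
depth[0, 0] = 1000.0

mm = normalize_minmax(depth)
pc = normalize_percentile(depth)

# Look at every row except the one containing the outlier.
print("min/max range used by real pixels:   ", float(mm[1:].min()), float(mm[1:].max()))
print("percentile range used by real pixels:", float(pc[1:].min()), float(pc[1:].max()))
```

With min/max, the real scene is squeezed into a sliver near zero and renders as a single color; with percentiles, it spans the whole ramp.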
First live stream
Endpoints deployed. Client wired up. We run:
python client/stream.py --prompt "cube. ball. wheel. eraser."
The first frame takes ~2 seconds because both Modal containers are cold. Every subsequent frame completes in ~130ms server-side (the max of depth and detect, since they run in parallel). The round-trip from Mac to Modal and back adds another ~1.1 seconds of network latency. Total: about 1.2 seconds from frame capture to Rerun display.
In Rerun, four panels light up simultaneously: the raw camera feed, the colormapped depth, the detection overlay with orange bounding boxes, and a rotating 3D point cloud colored by the original RGB values. Moving the leader arm moves the follower; the camera pans; the depth and detection overlays update with the new view. Six joint-angle timeseries plot at the bottom. Latency scalars track the round-trip.
Enter the Reachy Mini
The SO-101 streaming pipeline works. Now we want to port it to a different robot: the Reachy Mini, a small expressive robot by Pollen Robotics. Reachy Mini has a 6-DOF head (via a Stewart platform), a body that rotates around its vertical axis, two antennas, and — critically — a camera.
The Reachy Mini's hardware is very different from the SO-101. It's wireless: a CM4 (Compute Module 4) onboard, connecting to a laptop via WiFi. The camera streams over the network, not USB. The compute is constrained — the CM4 has 4 GB of RAM and a BCM2711 CPU. Running inference locally is out of the question. But the Modal endpoints don't care what robot took the picture. A frame is a frame.
The question is: can we reuse the same depth and detection endpoints, but with a completely different robot SDK, a browser-based UI instead of Rerun, and a wireless camera instead of a wired one?
If we had embedded the models into the client code (even as a library call), porting to a new robot would mean porting the inference stack too. Because the models are HTTP endpoints, the port is purely client-side: write a new client that speaks the same JSON schema, point it at the same URLs. The Modal endpoints don't change. The only new code is the glue between the Reachy Mini SDK and the POST calls.
camera_teleop v0.1 — teleop only
Before adding perception, we build a clean teleop baseline. The Reachy Mini SDK exposes head control
via goto_target() (smooth interpolation for gestures) and set_target()
(real-time control loops at 10+ Hz). For teleoperation, we want real-time control.
The app is a ReachyMiniApp — Pollen's discoverable app framework that runs on
the CM4 and serves a web UI. The architecture:
- A control loop running at 50 Hz, calling set_target() to update head roll, pitch, and yaw.
- An MJPEG stream at 15 FPS, capturing frames from the CM4's camera and encoding them as JPEG for the browser.
- A FastAPI server exposing endpoints: /set_target (absolute pose), /delta (relative nudge), /reset (return to center), /look_at (head follows a pixel click).
- A browser UI with three input methods: sliders (roll/pitch/yaw), keyboard (WASD for yaw/pitch, Q/E for roll), and a virtual joystick.
The UI also supports click-to-look: click a pixel in the camera feed, and the head turns to
center that pixel. The /look_at endpoint maps pixel (u, v) to
head angles using the camera's field of view. This is pure teleop — no inference, no
Modal calls, no perception. Just a human pointing a robot's head with their browser.
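The pixel-to-angle mapping behind click-to-look can be sketched with the same pinhole geometry used for unprojection. Everything specific here is an assumption: the function name, the 60° by 45° field of view, and the sign convention (positive yaw for pixels right of center, positive pitch for pixels below center) are illustrative and would need to match the real camera and the SDK's conventions.

```python
import math


def pixel_to_head_delta(u, v, width, height, hfov_deg, vfov_deg):
    """Return (yaw, pitch) offsets in degrees that would center pixel (u, v)."""
    # Focal lengths in pixels, derived from the field of view.
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = (height / 2) / math.tan(math.radians(vfov_deg) / 2)
    yaw = math.degrees(math.atan((u - width / 2) / fx))
    pitch = math.degrees(math.atan((v - height / 2) / fy))
    return yaw, pitch


# Assumed 640x480 stream with a 60 x 45 degree field of view.
print(pixel_to_head_delta(320, 240, 640, 480, 60.0, 45.0))  # center: no motion
print(pixel_to_head_delta(640, 240, 640, 480, 60.0, 45.0))  # right edge: half the HFOV
```

A click at the image center asks for no motion, and a click at the right edge asks for half the horizontal field of view.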
v0.1 deploys. It works. The head tracks smoothly, the stream is responsive, the joystick feels
natural. We tag the HuggingFace Space commit (8ef6528) and back up the directory.
Time to add eyes.
Porting the perception loop
The v0.2 upgrade adds a perception worker — a background thread that runs alongside the control loop and the MJPEG streamer. Here's how the split maps to the Reachy Mini:
[Diagram: the Reachy Mini split. Labels: sliders + keyboard + joystick; 640px, JPEG Q=70.]
The perception worker is a threading.Thread that loops at 2 Hz. Each tick:
- Grab the latest camera frame (same frame the MJPEG streamer is encoding).
- Downscale to 512px on the longest side. The CM4 captures at 640×480; the downscale reduces the base64 payload from ~600 KB to ~200 KB, important on WiFi.
- PNG-encode and base64-encode the frame.
- Fire two parallel POST requests to the Modal endpoints via ThreadPoolExecutor.
- Decode depth: base64 → float16 npy → float32 → colormap to RGB.
- Decode detections: draw bounding boxes and labels on a copy of the RGB frame.
- JPEG-encode both results and store them for the MJPEG streams.
The key code we ported from stream-fun:
# camera_teleop/perception.py: ported from stream-fun/client/
import base64
import io

import numpy as np
from PIL import Image, ImageDraw


def encode_png_b64(img_rgb_hwc_uint8: np.ndarray) -> str:
    buf = io.BytesIO()
    Image.fromarray(img_rgb_hwc_uint8).save(buf, format="PNG", compress_level=3)
    return base64.b64encode(buf.getvalue()).decode()


def decode_depth_npy_b64(b64: str) -> np.ndarray:
    raw = base64.b64decode(b64)
    return np.load(io.BytesIO(raw)).astype(np.float32)


def colormap_depth(depth_hw: np.ndarray) -> np.ndarray:
    # ... same 12-line percentile colormap from Chapter 8
    ...


def draw_detections(rgb: np.ndarray, detections: list) -> np.ndarray:
    pil = Image.fromarray(rgb)
    draw = ImageDraw.Draw(pil)
    for det in detections:
        box = det["box_xyxy"]
        label = f'{det["label"]} {det["score"]:.2f}'
        draw.rectangle(box, outline=(255, 140, 50), width=2)
        draw.text((box[0], box[1] - 12), label, fill=(255, 140, 50))
    return np.asarray(pil)
Four functions, all pure Python + NumPy + PIL. No model code, no torch, no CUDA. The CM4 never runs inference — it encodes images, sends HTTP requests, and decodes responses. The heavy lifting is 1,200 miles away on an L40S.
Three threads share state: the control loop updates head pose, the MJPEG streamer reads camera
frames, and the perception worker reads camera frames + writes depth/detection results. All
shared state is protected by threading.Lock(). The perception worker stores its
latest depth and detection frames in a dict under lock; the MJPEG streams read from that dict.
The control loop touches only the head pose dict, which has its own lock. No thread touches
another's data without acquiring the right lock first.
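The locking discipline described above can be sketched in a few lines. The names `perc` and `perc_lock` follow the article; the dictionary keys, the `publish` and `snapshot` helpers, and the fake worker are hypothetical and only illustrate the pattern of writing under the lock and handing readers a copy.

```python
import threading
import time

# Shared perception state; the keys here are illustrative.
perc = {"last_update": None, "detections": [], "depth_jpeg": None}
perc_lock = threading.Lock()


def publish(detections, depth_jpeg):
    """Perception worker: store the latest results under the lock."""
    with perc_lock:
        perc["detections"] = detections
        perc["depth_jpeg"] = depth_jpeg
        perc["last_update"] = time.time()


def snapshot():
    """Readers (streams, status endpoint): take a shallow copy under the lock."""
    with perc_lock:
        return dict(perc)


def fake_worker(n_ticks=5):
    """Stand-in for the 2 Hz perception loop, with made-up results."""
    for i in range(n_ticks):
        publish([{"label": "cube", "score": 0.9, "tick": i}], b"jpeg-%d" % i)
        time.sleep(0.01)


before = snapshot()
worker = threading.Thread(target=fake_worker)
worker.start()
worker.join()
after = snapshot()

print(before["last_update"], after["detections"][0]["tick"])
```

Because the worker replaces whole values rather than mutating them in place, a reader holding an older snapshot never sees a half-written result.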
Issue: Cold starts on a CM4
The first perception request after deploying takes 15–30 seconds. The Modal container is cold: the L40S needs to spin up, the Python process needs to start, the model weights need to load from the Volume into GPU memory. On the SO-101 client, this was fine — the Rerun panels just showed nothing for a few seconds. On the Reachy Mini's browser UI, the depth and detection panels show a black rectangle for 30 seconds, and the user thinks it's broken.
The fix is UX, not infrastructure. The perception worker tracks its state and reports it via
/perception_state:
@app.get("/perception_state")
async def perception_state():
    with perc_lock:
        return {
            "latency_depth_ms": perc.get("latency_depth_ms"),
            "latency_detect_ms": perc.get("latency_detect_ms"),
            "last_update": perc.get("last_update"),
            "detections": perc.get("detections", []),
            "errors": perc.get("errors", []),
        }
The browser polls this endpoint every 700ms. If last_update is null (no results yet),
the UI shows "Warming up Modal endpoint…" with a spinner instead of a black rectangle.
Once the first result arrives, the panels switch to showing real data and the latency readout
starts updating.
Modal supports min_containers=1 (named keep_warm=1 in older Modal releases) to maintain a container even with no traffic. For a
personal dev project, that costs ~$2.50/hr for an idle L40S. The cold start is 30 seconds,
once per session. We'll take the 30 seconds.
The three-panel UI
The Reachy Mini's browser UI has three video panels side by side: the raw camera feed, the detection overlay, and the depth colormap.
Below the panels: a text input for the Grounding DINO prompt. Type cup. book. phone.,
hit Apply, and the detection panel starts finding those objects. The prompt propagates via
POST /set_prompt to the perception worker, which includes it in the next Modal request.
No restart, no redeploy — just a different string in the JSON payload.
Below that: the control section. Three range sliders for roll (±40°), pitch (±40°), and yaw (±180°). A circular joystick that maps pointer position to yaw (horizontal) and pitch (vertical). Keyboard controls: arrow keys for yaw/pitch, Q/E for roll, R for reset. A status panel shows the current head pose and perception latencies.
The entire UI is static HTML + CSS + vanilla JS, served from the CM4's Python process. No React, no build step, no npm. The JS is ~200 lines: fetch helpers, slider event handlers, pointer tracking for the joystick, keyboard bindings, and a 700ms poll loop for perception state. It loads in under 100ms on any device.
WebRTC gives you lower latency and adaptive bitrate. But it requires a signaling server,
STUN/TURN for NAT traversal, and codec negotiation. MJPEG is a sequence of JPEG images
with a multipart boundary. It works in an <img> tag with no JavaScript.
It works through any HTTP proxy. For a robot on a local network at 15 FPS, the
quality is fine and the implementation is twenty lines of Python.
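To show how little there is to the format, here is a sketch of the multipart framing itself. It is an illustration under assumptions rather than the app's code: the boundary name `frame` is arbitrary, the JPEG bytes are fake placeholders, and in a FastAPI app a generator like this would typically be wrapped in a `StreamingResponse` with the media type shown.

```python
def mjpeg_chunk(jpeg_bytes: bytes, boundary: str = "frame") -> bytes:
    """Wrap one JPEG image as a part of a multipart/x-mixed-replace stream."""
    header = (
        f"--{boundary}\r\n"
        "Content-Type: image/jpeg\r\n"
        f"Content-Length: {len(jpeg_bytes)}\r\n\r\n"
    ).encode("ascii")
    return header + jpeg_bytes + b"\r\n"


def mjpeg_stream(frames, boundary: str = "frame"):
    """Yield one multipart chunk per JPEG frame."""
    for jpeg in frames:
        yield mjpeg_chunk(jpeg, boundary)


# Media type the HTTP response would declare for this stream.
MJPEG_MEDIA_TYPE = "multipart/x-mixed-replace; boundary=frame"

# Fake placeholder frames (not decodable images).
fake_frames = [b"\xff\xd8fake-jpeg-1\xff\xd9", b"\xff\xd8fake-jpeg-2\xff\xd9"]
chunks = list(mjpeg_stream(fake_frames))
print(len(chunks), chunks[0][:40])
```

The browser replaces the displayed image each time a new part arrives, which is why a plain `<img>` tag is enough on the client.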
Honest evaluation
What works:
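- The split architecture delivers on its promise on the Reachy Mini: the control loop runs at 50 Hz, perception updates at ~2 Hz, and neither blocks the other. On the SO-101, the teleop relay still shares the perception loop (see the streaming client notes above) but felt responsive in testing.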
- The same two Modal endpoints serve both the SO-101 (via Rerun) and the Reachy Mini (via browser). No model code was duplicated. The port was purely client-side.
- The prompt is editable at runtime. Switch from "cube. ball." to "screwdriver. wire." and the detection boxes update on the next perception tick.
- Depth colormap is useful for spatial awareness during teleop — the operator can see relative distances at a glance.
- The point cloud on the SO-101 side (via Rerun) is convincing enough to orient yourself in 3D even with uncalibrated intrinsics.
- Total operating cost: under $0.01 per minute of active streaming (L40S per-second billing, two containers, ~130ms compute per frame at 2 Hz).
What's broken or limited:
- Relative depth, not metric. Depth Anything v2-Small is scale-ambiguous. The point cloud looks right but the distances are wrong. Fix: use the -Metric-Indoor- checkpoint, or scale relative depth using one known-distance reference from the arm's forward kinematics.
- No world-frame accumulation. Each frame's point cloud is in camera frame, overwritten by the next frame. There's no "dollhouse" building up over time. Fix: use SO-101's URDF to compute the camera-to-world transform via forward kinematics, then accumulate each frame's cloud in world coordinates.
- 8 Hz ceiling. The per-frame-two-endpoints design caps out at roughly 8 Hz even with warm containers. Higher rates need batching (send N frames per request) or alternating depth/detect on odd/even frames.
- Uncalibrated intrinsics. The f = max(H, W) guess works for visualization but distorts the point cloud. A ChArUco calibration board would take 10 minutes and fix this permanently.
- No local fallback. If Modal is down, the perception panels go dark. A lighter-weight local model (MiDaS small on MPS, or a quantized GDINO) could serve as a degraded fallback.
- 30-second cold starts. The first perception frame after a long idle takes 30 seconds. Acceptable for dev; not acceptable for a demo. keep_warm=1 would fix it at ~$2.50/hr.
We didn't benchmark detection accuracy on a held-out dataset. Grounding DINO's zero-shot performance on arbitrary prompts varies widely — it's excellent on common objects ("cup", "chair") and mediocre on fine-grained categories ("M4 hex screw"). For teleop, the operator compensates for missed detections; for autonomous grasping, you'd need to quantify the recall.
What's next
The shortest backlog, ranked by leverage:
- Forward kinematics for world-frame clouds. The SO-101's URDF gives us the camera-to-world transform for every joint configuration. With that, each frame's point cloud transforms into a shared world frame, and we get a "dollhouse" that builds up as the arm moves — the GAS v2 concept, but live.
- Metric depth via camera-height prior. If the arm's base is at a known height and the camera looks at the table, we have one ground-truth distance. Scale relative depth to that reference and the point cloud becomes metric.
- SAM 2 endpoint for mask tracking. Add a third Modal app running SAM 2. Seed it from GDINO's boxes, track masks across frames. The detection boxes already identify objects; masks would give per-pixel segmentation.
- On-device fallback for Reachy Mini. A quantized Depth Anything v2-Small running on the CM4's CPU at 0.5 Hz would give degraded-but-functional depth when Modal is cold or unreachable.
- Multi-robot dashboard. Both robots stream to the same Modal endpoints. A shared dashboard showing both robots' perception feeds side by side, with a shared scene graph, would be a natural next step.
The full recipe
Everything compressed to a runbook. Two repos, three deploys, one streaming session.
# ── Deploy Modal endpoints ──────────────────────────────────────
cd stream-fun
pip install modal
modal setup # browser OAuth
modal deploy experiments/depth/app.py # → stream-fun-depth
modal deploy experiments/detect/app.py # → stream-fun-detect
# Note the endpoint URLs from the output.
# ── SO-101 streaming (Rerun) ────────────────────────────────────
source ../cs-224r-final-project/.venv/bin/activate
python client/stream.py \
--prompt "cube. ball. wheel. eraser." \
--hz 2.0 \
--depth-url https://YOUR_WS--stream-fun-depth-depthpipeline-infer.modal.run \
--detect-url https://YOUR_WS--stream-fun-detect-detectpipeline-infer.modal.run
# Move the leader arm → follower mirrors → Rerun shows live
# depth + detections + 3D point cloud.
# ── Reachy Mini (browser UI) ───────────────────────────────────
cd ../reachy_mini
source .venv/bin/activate
# The camera_teleop app runs on the CM4. Deploy via:
reachy-mini-app-assistant create camera_teleop . --publish
# Then open http://reachy-mini.local:8000 in a browser.
# Three panels: raw cam, detection overlay, depth colormap.
# Type a prompt, hit Apply, detections update live.
Teleop and perception have different latency budgets. Split them into different loops on different hardware. Make the perception path stateless HTTP so any robot can use it. Serialize the minimum viable payload (float16 depth, JSON boxes). Render locally. The models live in the cloud; the experience lives on the robot.
Full source: stream-fun/ for Modal endpoints and SO-101 client,
reachy_mini_apps/camera_teleop/ for the Reachy Mini app.
Next Mirdan experiment: forward-kinematics world-frame accumulation — the live dollhouse.