BoxBros — Mirdan

The 32k prompted ACT picking up the orange cube. Bounding box selected from the dashboard, policy outputs absolute joint targets at 5 Hz, max 5° per step. Attempt 1

How the prompt flows (interactive 3D demo)

In the real system the dashboard runs GroundingDINO over the base camera feed, you click one of the detected boxes, and the policy gets that 4-D bbox as part of its observation.

Chapter 0

TL;DR

We took an off-the-shelf imitation-learning policy, Action Chunking with Transformers (ACT), and gave it a fourth modality: a 4-dimensional bounding box that says this is the thing I want. Then we collected 80 teleoperation episodes on a 6-DOF SO-101 arm, auto-labelled the target object in each one using GroundingDINO, post-processed the dataset to inject those boxes as a new feature, and trained two policies side by side: a vanilla ACT and a prompted ACT. Both serve from Modal, both are switchable from a single browser dashboard, both are addressed by the same client. The whole stack is built on top of LeRobot's v3.0 dataset format with as few abstractions as we could get away with.

Metric	Value	Note
Episodes	v4 (40) + v5 (40)	
Frames	41 411	~85 min teleop
Policies	vanilla + prompted	
Train steps	32 000	~85 min on A10G
Action dim	5 joints + gripper	
Prompt dim	[x_min, y_min, x_max, y_max]	
Round-trip	~1.0 s	Mac → Modal → Mac
Local GPU	M-series Mac only	

💡 The single idea

ACT already has a slot for a non-image, non-state observation called observation.environment_state. It exists to carry things like the position of a peg in a peg-in-hole task. Nothing in the architecture says it has to mean state — the transformer just embeds it as one extra token. We hijack that slot and stuff a normalized bounding box into it. The policy learns to attend to the token and the resulting behavior is conditional on whatever object the box surrounds. No new losses, no new networks, no architecture changes.

ℹ What this article assumes

You're comfortable with: PyTorch tensor shapes, basic transformer attention, REST APIs, and you've at least heard of imitation learning. You don't need to know ACT, π0, LeRobot, Modal, or GroundingDINO — we re-derive enough of each to follow the choices. Familiarity with our previous build log (HuggingBros) helps but isn't required.

Chapter 1

The problem

You have a robot arm on your desk. There's a small orange cube to its left, a green plush toy in front of it, and a black cube behind that. You want to be able to tell the arm which one to pick — with a click, not by retraining. Tomorrow you'll add a fourth object and you'd like that to work too.

The naive imitation-learning pipeline goes the other way. You collect a demonstration set: arm-picks-cube, arm-picks-cube, arm-picks-cube. You train a policy that takes (camera, joint state) and outputs joint targets. The policy has no idea what a cube is — it just learned that for these pixels, this is the right motion. If the cube moves, you hope the camera generalises. If you swap in a different object, you collect new demos and retrain. Disambiguation between two simultaneous candidate objects is invisible to the policy unless the demonstrations themselves disambiguate — and since every demo terminates in a successful grasp, the dataset never says not that one.

We want a different shape: an extra input at policy time that says here is the thing, and a training procedure that teaches the policy to use it. The shape we landed on is a 4-D bounding box, and the rest of this article is the engineering required to make the loop run end-to-end.

💡 The acceptance criterion

One trained model. Two objects in scene. Click on either bounding box in the dashboard. Arm goes for the one we clicked. Click on the other one. Arm goes for that one instead. No retraining between clicks.

Chapter 2

ACT in 5 minutes

ACT (Action Chunking with Transformers) is the small imitation-learning policy that ships in LeRobot and tends to just work for desk-scale teleop tasks. Three things make it different from the obvious "MLP from pixels to actions":

It's a transformer. Image features (from a frozen ResNet-18 backbone), joint-state encodings, and the optional environment-state vector all become tokens. The decoder produces a chunk of future actions.
It outputs an entire chunk at once. A single forward pass returns a sequence of chunk_size actions (default 50). At control time you execute as many of them as you want before the next forward pass.
It's trained as a CVAE. The CVAE encoder takes the actual demonstration chunk plus the current state and produces a latent code z. The decoder reconstructs the chunk from the observations and the latent. At inference there is no sampling: you set z = 0 and get deterministic, mean-of-distribution actions.

Action chunking, visually

Why chunks? In teleop demonstrations, your hand commits to a multi-step plan even though the policy only sees one frame. Predicting one action from one frame gives you the average of every demonstrator's micro-decisions and produces dithering output. Predicting 50 actions forces the network to commit, and stitched-together chunks produce the smooth wrist-cam paths we see at inference.

One chunk = 50 future actions Interactive

Drag the “commit horizon” slider to see how many actions of the chunk are executed before the policy is queried again. actions_per_chunk = 1 is closed-loop; 50 is full open-loop.
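A minimal sketch of the commit horizon, assuming a hypothetical `query_policy` callable that returns one chunk per call. The function and the stand-in policy are illustrative names, not part of the project's client.

```python
from typing import Callable, List, Sequence

Action = Sequence[float]


def run_chunked(
    query_policy: Callable[[int], List[Action]],  # hypothetical: step index -> chunk
    total_steps: int,
    actions_per_chunk: int,
) -> tuple[List[Action], int]:
    """Execute `total_steps` actions, re-querying after every `actions_per_chunk`."""
    if actions_per_chunk < 1:
        raise ValueError("actions_per_chunk must be >= 1")
    executed: List[Action] = []
    n_queries = 0
    while len(executed) < total_steps:
        chunk = query_policy(len(executed))
        n_queries += 1
        # Commit to a prefix of the chunk and discard the rest.
        remaining = total_steps - len(executed)
        executed.extend(chunk[: min(actions_per_chunk, remaining)])
    return executed, n_queries


def fake_policy(step: int, chunk_size: int = 50) -> List[Action]:
    """Stand-in policy: 'predicts' the next `chunk_size` step indices as 1-D actions."""
    return [[float(step + i)] for i in range(chunk_size)]


_, closed_loop_queries = run_chunked(fake_policy, total_steps=100, actions_per_chunk=1)
_, open_loop_queries = run_chunked(fake_policy, total_steps=100, actions_per_chunk=50)
print(closed_loop_queries, open_loop_queries)  # 100 2
```

The same 100 executed actions cost 100 forward passes in closed-loop mode and 2 when the whole chunk is committed, which is the trade the slider exposes.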


The CVAE trick (briefly)

ACT's loss is the standard CVAE loss: reconstruction + KL. The reconstruction term is L1 between the predicted chunk and the demonstration chunk. The KL term keeps the latent near a unit Gaussian. A KL coefficient of about 10 is the LeRobot default and it means — informally — the model is biased toward "use the latent only when observations aren't enough."
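Written out, the loss described above looks roughly like this. The notation is ours and chosen for illustration: $a_{1:k}$ is the demonstration chunk, $\hat{a}_{1:k}$ the predicted chunk, $s$ the current joint state, $q_\phi$ the CVAE encoder, and $\beta$ the KL coefficient.

```latex
\mathcal{L}
  = \underbrace{\lVert \hat{a}_{1:k} - a_{1:k} \rVert_1}_{\text{reconstruction}}
  + \beta \, \underbrace{D_{\mathrm{KL}}\!\big( q_\phi(z \mid a_{1:k}, s) \,\big\|\, \mathcal{N}(0, I) \big)}_{\text{latent regulariser}},
  \qquad \beta \approx 10, \quad k = 50
```

A large $\beta$ makes it expensive for the encoder to move $z$ away from the prior, which is the informal "use the latent only when observations aren't enough" bias.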

For our purposes, none of this matters at inference: we always set z = 0. What matters is the input side of the transformer encoder, because that's where the prompt token lives.

ACT encoder: token roll-call (architecture)

Input	Encoding	Tokens
camera tokens	ResNet-18 → 8×15 grid per camera, 2 cameras	240
joint state	6 floats via Linear(6, 512)	1
env_state (= bbox)	4 floats via Linear(4, 512)	1
latent z	style code, z = 0 at inference	1

All tokens are concatenated and passed to the Transformer encoder (4 layers). The decoder produces a 50 × 6 action chunk, trained with an L1 loss against the demonstration chunk plus a KL term on z.

💡 The architectural foothold

The env_state input above is what makes ACT promptable without changing a line of model code. It's a stock LeRobot input that defaults to nothing. We give it a meaning.

Chapter 3

Making ACT promptable

env_state, repurposed

ACT accepts three kinds of input key. Two are mandatory in practice: observation.images.* (one or more cameras) and observation.state (current joint angles). The third, observation.environment_state, is genuinely optional: a flat vector of arbitrary length. It exists for tasks where the environment carries information that isn't in the image, such as the position of a peg, the joint angles of a second arm, or the contact force from a sensor.

Nothing in ACT cares what those numbers mean. The vector becomes one token via a linear projection, joins the other tokens, the transformer attends, the actions come out. If we put a 4-D bounding box in there during training, the model learns to associate that box with the eventual gripper trajectory. At inference, give it a different box and you get a different trajectory.

Why a 4-D bounding box (and not, say, a mask, a pixel, or a class label)

A bounding box is the laziest spatial prompt that still works. Four numbers, normalized to [0, 1] over image width and height:

$$\text{bbox} = [x_{\min},\, y_{\min},\, x_{\max},\, y_{\max}] \in [0, 1]^4$$

Compared to alternatives:

Prompt shape	Pros	Cons
4-D bbox	Tiny. Fits in `env_state`. Trivial to auto-label. Trivial to draw with a click.	No object identity, no shape. Not pose-aware.
Pixel coordinate (x, y)	Even tinier (2-D).	Loses scale, which is one of the few cues ACT can actually exploit as a coarse depth signal.
Segmentation mask (256×256 binary)	Most expressive.	Big. Need a mask encoder. Need SAM at inference. Way more brittle.
Class label (one-hot over N)	Smallest possible.	Closed-vocabulary. Can't add a new object without retraining.
CLIP text embedding (512-D)	Open-vocabulary.	Doesn't tell the policy where. Two cubes in scene → same embedding.

A bounding box hits the sweet spot: it's 4 numbers, anyone can produce it (open-vocab text model + image, or just a click and drag), and it carries both where and how big. Open-vocabulary is recovered at label time from the detector; the policy itself only ever sees four floats.
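For concreteness, here is a small sketch of turning a pixel-space box (from a detector or a click-and-drag) into the four normalized floats. The helper name `normalize_bbox` is ours, and the clamping and corner-sorting are assumptions about sensible behaviour rather than something the pipeline is known to do.

```python
from typing import List, Tuple


def normalize_bbox(
    box_px: Tuple[float, float, float, float], width: int, height: int
) -> List[float]:
    """Convert a pixel-space (x0, y0, x1, y1) box to [0, 1]^4, clamped to the image."""
    x0, y0, x1, y1 = box_px
    # Tolerate drags made right-to-left or bottom-to-top.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))

    def clamp(v: float) -> float:
        return min(1.0, max(0.0, v))

    return [clamp(x0 / width), clamp(y0 / height), clamp(x1 / width), clamp(y1 / height)]


print(normalize_bbox((160, 120, 320, 240), width=640, height=480))
# [0.25, 0.25, 0.5, 0.5]
```

Normalizing by width and height keeps the prompt independent of camera resolution, so the same four floats mean the same region at 640×480 or any other size.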

Think of it this way

The bbox is the policy's laser pointer. It doesn't identify the object — it just says “the thing I want is somewhere in this rectangle.” The policy still has to figure out what to do with the gripper from the camera; it just knows where to focus.

Chapter 4

Collecting demonstrations

lerobot-record drives data collection. You hold a leader arm, the follower mirrors your motion via USB, two cameras record the scene, and the entire trajectory (joint angles + camera frames + timestamps) is written to a Hugging Face dataset in the v3.0 chunked-parquet format. Our recording protocol for the prompted variant:

bash · record_v5.sh

lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/tty.usbmodem5A7A0546771 \
  --robot.id=so101_follower_main \
  --robot.cameras='{ wrist: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 15, backend: AVFOUNDATION},
                     base:  {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 15, backend: AVFOUNDATION} }' \
  --teleop.type=so101_leader \
  --teleop.port=/dev/tty.usbmodem5A7A0545661 \
  --teleop.id=so101_leader_main \
  --display_data=true \
  --dataset.repo_id=ozyphus/so101-cleanup-v5 \
  --dataset.single_task="clean up the desk" \
  --dataset.num_episodes=40 \
  --dataset.fps=15 \
  --dataset.episode_time_s=35 \
  --dataset.reset_time_s=15 \
  --dataset.push_to_hub=true \
  --dataset.private=true

A few choices in there matter for the prompted setup:

Two cameras. A wrist cam (USB index 0) for fine-grained grasp information, and a base cam (USB index 1) angled at the workspace. The base cam is the one we'll later run a detector on.
Single task string. ACT takes a task string for compatibility with language-conditioned VLAs but doesn't actually consume it. We use "clean up the desk" for everything — the prompt token does the work.
15 fps, 35-second episodes. ~525 frames per episode × 80 episodes = ~42k frames, which lands in the “enough to learn a single-stage manipulation task, short of the 100k+ where things really start working” zone for ACT.
Both objects in scene every episode. If you only ever record demos with one object, the policy has no reason to attend to the bbox — it'll just learn “there's only one thing here, go for it.” We alternate which one we pick (~20 cube, ~20 toy) and vary positions, so the bbox is the only signal that breaks the symmetry.

The dataset that comes out is a tree of parquet files:

so101-cleanup-v5/
  data/
    chunk-000/
      file-000.parquet          ← one row per frame, all episodes concatenated
  videos/
    observation.images.base/chunk-000/file-000.mp4
    observation.images.wrist/chunk-000/file-000.mp4
  meta/
    info.json                   ← feature schema, codebase_version
    stats.json                  ← global per-feature stats
    episodes/chunk-000/file-000.parquet   ← one row per episode
    tasks.parquet

ℹ The codebase_version branch (a small landmine)

LeRobot's dataset loader doesn't read from main — it resolves a Hub revision matching the codebase_version field in info.json (e.g. "v3.0") and reads from that branch. lerobot-record creates this branch automatically when it pushes. HfApi.upload_folder — which we use after manual surgery — does not. We had to add an explicit api.create_branch(...) after every upload, otherwise LeRobotDataset(...) downstream throws RevisionNotFoundError.
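A sketch of the upload-then-branch step, assuming `api` behaves like `huggingface_hub.HfApi` (whose `upload_folder` and `create_branch` methods are used here). The wrapper name `upload_with_version_branch` is hypothetical, and the default `"v3.0"` simply mirrors the example value above.

```python
from typing import Any


def upload_with_version_branch(
    api: Any, folder: str, repo_id: str, codebase_version: str = "v3.0"
) -> None:
    """Upload a dataset folder, then create the revision the loader resolves.

    `api` is expected to behave like `huggingface_hub.HfApi`.
    """
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="dataset")
    # upload_folder only commits to main. The loader looks for a revision named
    # after codebase_version, so create it explicitly (no-op if it already exists).
    api.create_branch(
        repo_id=repo_id, branch=codebase_version, repo_type="dataset", exist_ok=True
    )
```

In real use you would pass `HfApi()` as `api`. Reading `codebase_version` out of `meta/info.json` instead of hard-coding it would keep the branch name in step with the dataset.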

Chapter 5

Auto-labelling with GroundingDINO

We need a bounding box per episode that says this is the thing the demonstrator picked. There are two reasonable ways to get one:

Annotate the first frame manually. Reliable, but 80 episodes × click-and-drag is exactly the kind of work we set up Modal to avoid.
Run an open-vocabulary detector like GroundingDINO on the first frame, take the highest-scoring match for the prompt “orange cube . green toy . black cube”, and call it a day.

We picked (2) and deployed GroundingDINO behind a Modal endpoint. The detector lives in experiments/detector/app.py as a single @app.cls with a FastAPI endpoint. About 4 GB of weights, runs comfortably on an A10G (24 GB), warm inference around 200 ms.

The chassis bug

The first labeller run did exactly what you'd expect a slightly-overfit detector to do: it decided that the most prominent orange thing in the workspace was the SO-101 itself.

Run 1: 18 / 40 “orange cube” episodes mis-labelled

auto_label_bboxes.py reported high-confidence detections for every cube episode. We pulled the review images and the bounding boxes were almost all on the arm, not the cube. The arm has an orange-red sticker on the elbow, and the cube sometimes occupies less than 5% of the frame. The detector picked the bigger, more saturated “orange thing” every time.

Per-phrase + area filter

Two changes fixed this. First, we noticed GroundingDINO sometimes combined multiple phrases in the prompt onto a single bounding box — a side effect of how its text encoder handles dotted prompts. The fix was to split the prompt by "." and run detection once per phrase, then take the union of results. Second, we added a maximum-area filter: any box that covers more than 8% of the image gets dropped. The arm's elbow region was reliably 30-50% of the frame, so this drops chassis matches without touching real objects.

python · experiments/detector/app.py

DEFAULT_PROMPT = "orange cube . green toy ."   # extend at the dashboard for new objects
DEFAULT_BOX_THRESHOLD = 0.30
DEFAULT_TEXT_THRESHOLD = 0.25
DEFAULT_MAX_AREA_FRAC = 0.08    # drop boxes > 8% of image — filters the arm chassis

def _detect_pil(self, image, prompt, *, box_th, text_th, max_area_frac):
    """Run detect once per phrase and merge — GDINO sometimes collapses
    a multi-phrase prompt into one box. Per-phrase keeps phrases independent."""
    phrases = [p.strip() for p in prompt.split(".") if p.strip()]
    H, W = image.size[1], image.size[0]
    img_area = float(H * W)
    out = []
    for phrase in phrases:
        boxes = self._detect_pil_one_prompt(image, phrase, box_th, text_th)
        for box in boxes:
            x0, y0, x1, y1 = box["bbox"]
            area_frac = ((x1 - x0) * (y1 - y0)) / img_area
            if area_frac > max_area_frac:
                continue   # too big — almost certainly the robot itself
            box["label"] = phrase
            box["bbox_norm"] = [x0/W, y0/H, x1/W, y1/H]
            out.append(box)
    return out

With those two changes, Run 2 labelled all 40 cube episodes correctly and 38 of 40 toy episodes, with 2 toy episodes producing no detection at all (out-of-frame at the first frame). For the unlabelled episodes we fall back to the median bbox across the labelled ones — the workspace centroid — which is a fine prior to learn from.
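One plausible shape for that fallback is a coordinate-wise median over the labelled episodes. This is a sketch under that assumption: the function below is our own illustration, not the project's actual helper, and the label values are made up.

```python
from statistics import median
from typing import Dict, List


def median_bbox(labels: Dict[str, dict]) -> List[float]:
    """Coordinate-wise median over every labelled episode's normalized bbox."""
    boxes = [entry["bbox_norm"] for entry in labels.values()]
    if not boxes:
        raise ValueError("no labelled episodes to take a median over")
    # zip(*boxes) regroups the boxes into four per-coordinate tuples.
    return [median(coord) for coord in zip(*boxes)]


# Made-up labels: episode "2" is missing because the detector found nothing.
labels = {
    "0": {"bbox_norm": [0.10, 0.20, 0.30, 0.40]},
    "1": {"bbox_norm": [0.20, 0.30, 0.40, 0.50]},
    "3": {"bbox_norm": [0.60, 0.10, 0.80, 0.30]},
}
print(median_bbox(labels))  # [0.2, 0.2, 0.4, 0.4]
```

Because every input box has x_min ≤ x_max and y_min ≤ y_max, the coordinate-wise median is still a valid box.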

Per-phrase vs. one-shot prompt Diagram

One prompt: “orange cube . green toy .”

Detector returns <= 4 boxes,
often labelled “orange cube green toy”
on a single combined region. Useless.

Split → per-phrase calls

detect("orange cube") → box A
detect("green toy") → box B
area filter drops chassis match.

Chapter 6

Surgery on parquet

The labels file is a flat dict mapping episode_index to {"bbox_norm": [...], "n_frames_in_episode": N, ...}. To use it for training, we need it to live inside the dataset as a proper LeRobot feature. That means editing four places in the v3.0 dataset tree:

step 1

data/….parquet

Append observation.environment_state as a fixed-size list<float, 4>, broadcasting each episode's bbox over all of its frames.

step 2

meta/episodes/….parquet

Add 10 stats columns (min, max, mean, std, count, q01..q99) for the new feature, computed per-episode. Bbox is constant within an episode → min == max == mean.

step 3

meta/info.json

Declare the feature schema: dtype, shape, names. Without this LeRobot doesn't know it exists.

step 4

meta/stats.json

Add global stats over all frames (length-weighted across episodes). Used for normalisation in some policies.
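Steps 2 and 4 reduce to simple arithmetic when the feature is constant within an episode. The sketch below shows a length-weighted global mean and standard deviation; `global_bbox_stats` is a hypothetical name, and the quantile columns are left out for brevity.

```python
import math
from typing import Dict, List, Tuple


def global_bbox_stats(
    episodes: List[Tuple[List[float], int]]
) -> Dict[str, List[float]]:
    """Frame-weighted stats for a feature that is constant within each episode.

    `episodes` is a list of (bbox_norm, n_frames) pairs.
    """
    total = sum(n for _, n in episodes)
    dims = range(len(episodes[0][0]))
    mean = [sum(b[d] * n for b, n in episodes) / total for d in dims]
    # Within an episode the variance is zero, so only between-episode spread remains.
    var = [sum(n * (b[d] - mean[d]) ** 2 for b, n in episodes) / total for d in dims]
    return {
        "mean": mean,
        "std": [math.sqrt(v) for v in var],
        "min": [min(b[d] for b, _ in episodes) for d in dims],
        "max": [max(b[d] for b, _ in episodes) for d in dims],
        "count": [total],
    }


stats = global_bbox_stats([([0.0, 0.0, 0.5, 0.5], 100), ([0.5, 0.5, 1.0, 1.0], 300)])
print(stats["mean"])  # [0.375, 0.375, 0.875, 0.875]
```

The longer episode pulls the mean toward its box, which is what "length-weighted" means here: every frame counts once, not every episode.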

The full script is tools/add_bbox_to_dataset.py. The interesting bit is step 1, because that's where the bbox actually enters the tensors the policy trains on:

python · tools/add_bbox_to_dataset.py

import pyarrow as pa
import pyarrow.parquet as pq

NEW_FEATURE = "observation.environment_state"
BBOX_DIM = 4


def update_data_parquet(parquet_path, labels):
    """Append fixed_size_list<float>[4] column, broadcasting per-episode bbox to all rows."""
    table = pq.read_table(parquet_path)
    ep_idx_col = table.column("episode_index").to_pylist()

    fallback = _median_bbox(labels)   # defined elsewhere in the script; for episodes the detector skipped
    bboxes_flat = []
    for ep in ep_idx_col:
        key = str(ep)                 # labels JSON is keyed by stringified episode_index
        bbox = labels[key]["bbox_norm"] if key in labels else fallback
        bboxes_flat.extend(float(v) for v in bbox)

    # One flat float32 buffer, viewed as rows of BBOX_DIM values each.
    flat  = pa.array(bboxes_flat, type=pa.float32())
    fixed = pa.FixedSizeListArray.from_arrays(flat, BBOX_DIM)   # 4
    new_table = table.append_column(NEW_FEATURE, fixed)
    pq.write_table(new_table, parquet_path)

We also need info.json to declare the feature, otherwise LeRobot won't plumb it through to ACT's input_features:

json · meta/info.json (added entry)

"observation.environment_state": {
  "dtype": "float32",
  "shape": [4],
  "names": ["x_min", "y_min", "x_max", "y_max"]
}

Once that's in place, LeRobotDataset(repo_id) picks up the new feature automatically. Sanity check: the keys printed by the dataset object now include observation.environment_state alongside the cameras and joint state, and ACT's config.input_features contains an entry with shape (4,). No model code changed.

💡 Why per-episode constant?

We label only the first frame of the demo, then broadcast that single bbox to every row of the episode's parquet. This means the policy sees a static bbox for the entire demo, even as the gripper moves and the object moves with it. We tried the alternative — running the detector frame-by-frame — and found that close to the grasp point, occlusion-by-gripper makes the detector unreliable. A static bbox is also closer to how a user will use the system: they click once at the start, the policy carries the prompt through the whole reach.

Chapter 7

Merging v4 + v5

We trained the first prompted ACT on 40 episodes (v4). It didn't learn the conditioning — clicking different bounding boxes produced the same trajectory. The symptom was diagnostic: when the dashboard did forward the bbox to the policy (we logged the env_state in every payload to confirm), the behavior was unchanged. This is the textbook signature of an ignored input: the gradient through the bbox token never lined up against a useful direction during training, so the policy effectively learned to attend to it with weight zero.

Looking at published ACT recipes, 40 demos is the under-fit edge for any pickup task — most papers use 50-200. So we recorded another 40 (v5, same scene, same protocol, alternating which object got picked) and merged the two into a single 80-episode dataset.

Merging LeRobot v3.0 datasets isn't a one-liner because of the chunked layout: each video file holds multiple episodes' frames, episode indices are global, and per-episode metadata lives in its own parquet. tools/merge_datasets.py handles all of this:

Snapshot both

Pull v4 and v5 to local cache via snapshot_download.

Copy v4 entirely

v4 is the BASE; copy the whole tree to output_dir as-is.

Shift v5 video file_index

v5's file-000.mp4 becomes file-NNN.mp4 where NNN follows v4's last index. No re-encode.

Append v5 data

Write v5's data parquet as a new file, with episode_index shifted by n_eps_v4, global index shifted by n_frames_v4.

Append v5 episode meta

Append v5 meta/episodes rows with chunk_index, file_index, and timestamp ranges shifted to point at the new video locations.

Recompute totals

info.json totals (total_episodes, total_frames) and stats.json get recomputed from the merged data.

Push + branch

Upload to a new HF repo, then explicitly create the v3.0 branch.
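The index arithmetic in the "Append v5 data" step can be sketched on plain rows. This is an illustration only: `shift_rows` is a hypothetical helper, and the column names `episode_index`, `frame_index`, and `index` are assumptions about the frame-level parquet schema.

```python
from typing import Dict, List

Row = Dict[str, int]


def shift_rows(rows: List[Row], n_eps_base: int, n_frames_base: int) -> List[Row]:
    """Return copies of the appended dataset's frame rows with global indices shifted."""
    return [
        {
            **r,
            "episode_index": r["episode_index"] + n_eps_base,  # episodes stay globally unique
            "index": r["index"] + n_frames_base,               # global frame counter continues
        }
        for r in rows
    ]


# Toy example: the base dataset has 2 episodes and 5 frames in total.
v5_rows = [
    {"episode_index": 0, "frame_index": 0, "index": 0},
    {"episode_index": 0, "frame_index": 1, "index": 1},
    {"episode_index": 1, "frame_index": 0, "index": 2},
]
for row in shift_rows(v5_rows, n_eps_base=2, n_frames_base=5):
    print(row)
# {'episode_index': 2, 'frame_index': 0, 'index': 5}
# {'episode_index': 2, 'frame_index': 1, 'index': 6}
# {'episode_index': 3, 'frame_index': 0, 'index': 7}
```

Per-episode counters such as `frame_index` are left alone, because they restart at zero inside each episode regardless of which dataset the episode came from.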

Once so101-cleanup-v4plus5 exists on the Hub (80 episodes, 41 411 frames, ~85 minutes of teleop), building the prompted variant takes two commands against the merged dataset: auto_label_bboxes.py to produce the labels, then add_bbox_to_dataset.py to graft them in:

bash

python tools/auto_label_bboxes.py --repo-id ozyphus/so101-cleanup-v4plus5 \
       --review-dir review_v4plus5/

python tools/add_bbox_to_dataset.py \
       --src-repo-id ozyphus/so101-cleanup-v4plus5 \
       --dst-repo-id ozyphus/so101-cleanup-v4plus5-prompted \
       --labels bbox_labels.json --output-dir build/v4plus5-prompted \
       --push

We get two datasets out of one collection: the unmodified merge for the vanilla baseline, and the prompted variant for the conditioning experiment.

Chapter 8

Three apps from one file

We want three Modal deployments from one source file: a default that we don't break, a vanilla policy, and a prompted policy. Each needs its own URL so the dashboard can switch between them without a redeploy.

A Modal app takes its name from the string passed to modal.App(...), so one source file normally means one app. To get three names without three files, we read an environment variable at deploy time and use it to derive both the app name and the checkpoint priority:

python · experiments/pi0/app.py (top of file)

import os

import modal

POLICY_VARIANT = os.environ.get("POLICY_VARIANT", "current").strip().lower()
APP_NAME_BASE  = "so101-pi0"
APP_NAME = APP_NAME_BASE if POLICY_VARIANT == "current" else f"{APP_NAME_BASE}-{POLICY_VARIANT}"

CHECKPOINTS_BY_VARIANT = {
    "current": [
        ("/weights/ft/act-v4plus5-prompted-32k/checkpoints/032000/pretrained_model", "act"),
        ("/weights/ft/act-v4plus5-vanilla-32k/checkpoints/032000/pretrained_model",  "act"),
        *_FALLBACK_CHECKPOINTS,
    ],
    "prompted": [
        ("/weights/ft/act-v4plus5-prompted-32k/checkpoints/032000/pretrained_model", "act"),
        *_FALLBACK_CHECKPOINTS,
    ],
    "vanilla": [
        ("/weights/ft/act-v4plus5-vanilla-32k/checkpoints/032000/pretrained_model",  "act"),
        *_FALLBACK_CHECKPOINTS,
    ],
}
CHECKPOINT_CANDIDATES = CHECKPOINTS_BY_VARIANT[POLICY_VARIANT]

app = modal.App(APP_NAME)

And the deploy invocation:

bash

POLICY_VARIANT=prompted modal deploy experiments/pi0/app.py
POLICY_VARIANT=vanilla  modal deploy experiments/pi0/app.py

You get two URLs:

https://<workspace>--so101-pi0-prompted-pi0policy-live.modal.run
https://<workspace>--so101-pi0-vanilla-pi0policy-live.modal.run

Both serve from the same Python class. The class loads whichever checkpoint is first in CHECKPOINT_CANDIDATES, and the env-state path inside the policy code gracefully no-ops when the loaded checkpoint doesn't expect an environment_state input:

python · PI0Policy.__init__ (excerpt)

env_feat = self.config.input_features.get("observation.environment_state")
self.expected_env_state_dim = int(env_feat.shape[0]) if env_feat is not None else 0

# at observation-build time:
def _build_raw_obs(self, image_b64, state, env_state):
    obs = {"observation.state": torch.tensor(state, dtype=torch.float32)}
    if self.expected_env_state_dim > 0:
        es = list(env_state) if env_state is not None else [0.0] * self.expected_env_state_dim
        es += [0.0] * (self.expected_env_state_dim - len(es))   # pad
        obs["observation.environment_state"] = torch.tensor(es[:self.expected_env_state_dim], dtype=torch.float32)   # force float32 even if the client sent ints
    return obs

When the vanilla checkpoint is loaded, expected_env_state_dim == 0 and the bbox payload is silently dropped. When the prompted checkpoint is loaded, expected_env_state_dim == 4 and the bbox is the difference between picking the cube and picking the toy.

Chapter 9

The dashboard

The dashboard is a single FastAPI process running on the Mac next to the arm. It does three things:

Talks USB to the arm and to both cameras (it owns the bus).
Streams both cameras to the browser as MJPEG, runs the detector on demand, accepts a click to pick a bounding box, and streams telemetry over a WebSocket.
Posts (image, state, env_state) to the selected Modal endpoint at 5 Hz and applies the returned actions back to the arm.

Dashboard architecture Diagram

SO-101 follower (USB)

5 joints + gripper @ ~50 Hz

2 cameras (USB)

wrist + base @ 15 fps

↓

FastAPI (port 8765, on Mac)

capture thread · inference thread · FrameBuffer per cam
POST /api/{connect, detect, select-bbox, run, stop, config, status}
GET /stream/{wrist,base} (MJPEG · multipart/x-mixed-replace)
WS /ws/telemetry (action, state, latency)

↓ MJPEG <img> ↓ WS telemetry ↓ HTTPS /live

Browser

Tailwind dark UI · click-to-pick canvas overlay
policy variant dropdown · run params

Telemetry stream

step, round-trip ms, joint state vs cmd

Modal endpoint

so101-pi0-{prompted, vanilla}
L40S · ACT · ~280 ms warm

The pick-box flow on the browser side is the part that the user actually touches. Once the policy URL is set and the arm is connected:

connect

USB handshake with the arm and both cams. Both streams light up.

home

Slow move to a known starting pose so every run starts from the same place.

detect

Snapshot the base cam, POST it to the detector endpoint, draw boxes.

pick

Click inside one of the boxes (or press 1-9). Selected bbox is cached.

run

Inference loop starts. Each tick: capture frames, send to Modal, apply chunk.
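The pick step boils down to a point-in-rectangle test on the detector output. Here is a sketch of how that could work, assuming detections carry the `label` and `bbox_norm` fields shown in the detector code. The function name `pick_box` and the smallest-box tie-break are our own assumptions, and the detections are made up.

```python
from typing import List, Optional, Sequence


def _area(box: dict) -> float:
    x0, y0, x1, y1 = box["bbox_norm"]
    return (x1 - x0) * (y1 - y0)


def pick_box(click_xy: Sequence[float], boxes: List[dict]) -> Optional[dict]:
    """Return the detection whose normalized bbox contains the click, if any.

    When several boxes contain the click, prefer the smallest one
    (an assumption of this sketch).
    """
    cx, cy = click_xy
    hits = [
        b for b in boxes
        if b["bbox_norm"][0] <= cx <= b["bbox_norm"][2]
        and b["bbox_norm"][1] <= cy <= b["bbox_norm"][3]
    ]
    return min(hits, key=_area) if hits else None


detections = [
    {"label": "orange cube", "bbox_norm": [0.10, 0.40, 0.20, 0.55]},
    {"label": "green toy", "bbox_norm": [0.45, 0.35, 0.65, 0.60]},
]
selected = pick_box((0.5, 0.5), detections)
print(selected["label"])                 # green toy
print(pick_box((0.9, 0.9), detections))  # None
```

Whatever comes back is what gets cached and later sent as `env_state`. A click outside every box returns `None`, so the caller can keep the previous selection.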

The dashboard owns the bus, which means the inference loop and the capture loop both want to read from the same servo bus. That contention surprised us; see Chapter 11.

Chapter 10

Closed-loop, absolute, capped

The control loop reads joint state, sends an observation to the policy, gets an action chunk, applies some prefix of the chunk, and repeats. Three knobs in this loop matter a lot more than they sound like they should:

action_mode = absolute

ACT trained on SO-101 outputs absolute joint targets, not deltas from the current state. The CLI default in so101_run.py was delta, which adds the “delta” to current state. With absolute outputs treated as deltas, the arm drifts toward joint limits over a few seconds.

Fix: always pass --action-mode=absolute for ACT checkpoints. The dashboard hard-codes this.

actions_per_chunk ∈ {1, 50}

At actions_per_chunk = 1, only the first action of each predicted chunk is executed before re-querying. Closed-loop. Resilient but easily “dithers” when the policy isn't sure.

At actions_per_chunk = 50, the entire chunk is executed open-loop before re-querying. Smooth but commits to bad plans.

The under-trained vanilla 8k policy looped at 1 and committed at 50; the 32k variants behave better at 1.

max_joint_step_deg

Per-tick safety cap on how far any joint is allowed to move in one application of an action. Default 3°, dashboard uses 5°. This is how we sleep at night while the policy explores: even if it hallucinates a 90° jump, the arm only takes a 5° bite. Smaller numbers = safer but slower; bigger = faster but jerkier and more likely to slam the gripper into the table on a bad prediction.

Together: absolute mode so the policy's predictions mean what we trained them to mean, actions_per_chunk = 1 for closed-loop responsiveness, and a 5° per-step cap as a software seatbelt in front of the hardware.
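The interaction between action_mode and the step cap is easy to see in a few lines. This is a sketch with joint values assumed to be in degrees; `next_joint_targets` is a hypothetical helper, not the function in so101_run.py.

```python
from typing import List, Sequence


def next_joint_targets(
    current_deg: Sequence[float],
    action: Sequence[float],
    action_mode: str = "absolute",
    max_joint_step_deg: float = 5.0,
) -> List[float]:
    """Interpret a policy action, then clamp each joint's move to the per-tick cap."""
    if action_mode == "absolute":
        desired = list(action)
    elif action_mode == "delta":
        desired = [c + a for c, a in zip(current_deg, action)]
    else:
        raise ValueError(f"unknown action_mode: {action_mode!r}")
    capped = []
    for cur, des in zip(current_deg, desired):
        step = max(-max_joint_step_deg, min(max_joint_step_deg, des - cur))
        capped.append(cur + step)
    return capped


current = [0.0, 10.0, -20.0]
action = [90.0, 12.0, -20.0]   # an absolute prediction with one wild joint
print(next_joint_targets(current, action, "absolute"))  # [5.0, 12.0, -20.0]
print(next_joint_targets(current, action, "delta"))     # [5.0, 15.0, -25.0]
```

In absolute mode the two well-behaved joints land where the policy asked and the 90° jump is cut to 5°. Misreading the same output as a delta pushes every joint by the full cap each tick, which is the drift toward joint limits described above.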

💡 The “wandering loop” failure mode

With the 8k vanilla policy at actions_per_chunk = 1, the arm would head roughly toward the cube, not quite reach it, the policy would predict a slight adjustment back, the arm would over-correct, and so on: a low-amplitude oscillation that never converges. This is the diagnostic signature of a policy whose mean prediction is approximately right but whose variance is bigger than the gripper-to-cube margin. It's why we tried both actions_per_chunk = 50 (commit to the predicted trajectory) and longer training (drive variance down). Both helped.

Chapter 11

Bus contention

The follower arm's serial bus tolerates exactly one reader at a time. With a naive dashboard:

The capture thread polls follower.get_observation() at 15 Hz to refresh the frame buffer.
The inference thread also calls follower.get_observation() once per inference tick to read the latest joint state to send to the policy.

Two threads, one bus, no lock. The first time we ran inference with the dashboard, we got:

RuntimeError: [PortHandler::setupPort] Port is in use!

The fix is to make the inference thread the sole reader during a run. The capture thread still feeds the camera buffers (cameras are independent USB devices), but it backs off the servo bus while inference is running:

python · client/dashboard/server.py

def _capture_loop(self):
    while self.status.connected:
        if self.status.inference_running:
            time.sleep(0.05)         # yield bus during inference
            continue
        obs = self.follower.get_observation()
        self._update_frame_buffers(obs)

def _inference_loop(self):
    while self.status.inference_running:
        obs = self.follower.get_observation()  # sole reader
        self._update_frame_buffers(obs)        # also feed the cam stream
        action = self._call_policy(obs)
        self._apply_action(action)
        time.sleep(self.tick_period)

This is a small structural change but it eliminates an entire category of intermittent USB errors. It also means the camera streams stay live during inference — the inference thread refreshes the frame buffers itself.
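The flag check in the capture loop still leaves a small window where a capture read that is already in flight can overlap with the first inference read. One way to close it, sketched here as an assumption rather than what the dashboard does, is to route every bus read through a single lock. `BusGuard` is a hypothetical wrapper and the observation dict is made up.

```python
import threading
from typing import Any, Callable


class BusGuard:
    """Serialize every servo-bus read behind one lock."""

    def __init__(self, read_fn: Callable[[], Any]) -> None:
        self._read_fn = read_fn          # e.g. follower.get_observation
        self._lock = threading.Lock()

    def read(self) -> Any:
        # Only one thread at a time can be inside the underlying read call.
        with self._lock:
            return self._read_fn()


# Stand-in for the follower: returns a made-up observation dict.
guard = BusGuard(lambda: {"joint_0": 0.0})
print(guard.read())  # {'joint_0': 0.0}
```

Both loops would then call `guard.read()` instead of touching the follower directly. The back-off in the capture loop remains useful for keeping bus traffic low during a run, and the lock only protects the hand-over.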

Chapter 12

The 32k training story

We trained ACT on the merged 80-episode dataset twice for each variant: first at the LeRobot default of 8 000 steps, then again at 32 000. Both runs use batch size 8, learning rate 1e-5, action chunk size 50, on a single A10G. The 8k run takes about 21 minutes; the 32k run takes about 85 minutes.

Final training loss — 8k vs 32k, vanilla vs prompted From training logs

Final L1+KL after the last training step. Same dataset, same hyperparameters — only training duration changes. The crossover at 32k is the bbox token paying off.

A few things to read off this:

8k is under-trained. Both variants are still on the steep part of the curve. With ~42k frames and batch size 8, 8k steps is ~1.5 epochs — you wouldn't train an MLP that briefly, and ACT is bigger than that.
At 8k, prompted is slightly worse than vanilla. This was our first hint that the bbox token wasn't being used effectively yet — if the model can't extract useful signal from it, it just becomes noise that costs gradient budget elsewhere.
At 32k, prompted is better than vanilla. 0.100 vs 0.109. That delta — about 8% — is the conditional information actually paying off. For the first time the bbox is helping more than it costs.
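The epoch counts and the loss delta quoted above can be checked with a few lines of arithmetic. The inputs are the frame count, batch size, step counts, and final losses reported in this article; the `epochs` helper is ours.

```python
def epochs(train_steps: int, batch_size: int, n_frames: int) -> float:
    """Number of passes over the dataset for a given step budget."""
    return train_steps * batch_size / n_frames


N_FRAMES = 41_411   # merged v4 + v5 dataset
BATCH_SIZE = 8

for steps in (8_000, 32_000):
    print(steps, round(epochs(steps, BATCH_SIZE, N_FRAMES), 2))
# 8000 1.55
# 32000 6.18

vanilla_loss, prompted_loss = 0.109, 0.100
rel_improvement = (vanilla_loss - prompted_loss) / vanilla_loss
print(round(rel_improvement * 100, 1))  # 8.3
```

So the 8k run sees each frame about one and a half times, while the 32k run sees it about six times.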

💡 The crossover

Watch the prompted curve cross under the vanilla curve somewhere between step ~12 000 and ~18 000. That's ACT figuring out how to use the bbox token. Before the crossover, the prompted policy is paying for an extra token's gradient with no gain. After it, the token is providing useful conditioning that the camera and state can't.

Chapter 13

Honest evaluation

What does this look like in the real world?

A second pickup attempt with the same checkpoint. The approach commits in the right direction; the failure at final closure is a grasp-quality problem, not a conditioning problem. Attempt 2

Variant	Steps	Behavior	Pickup rate (est.)
vanilla	8 000	Wandering loop near the cube. Doesn't commit.	~10%
prompted	8 000	Same wander. Bbox has no apparent effect.	~10%
vanilla	32 000	Reliable approach to whatever salient object is centered. Single-object scenes work.	~50%
prompted	32 000	Approach changes when bbox changes. Two-object scenes start to disambiguate.	~55% (per-object)

These numbers are rough — we ran ~10-20 trials per variant in a single afternoon on the same desk, with the same lighting, picking from a small workspace. They are not a benchmark, they are a smell test.

What we’re reading from this

The 32k prompted policy is the first checkpoint where clicking different bounding boxes produces visibly different reaches. That's the property we set out to get, and we got it. It's not yet reliable enough that we'd trust it to clear an unfamiliar workspace. The failure mode isn't “wrong object”; it's “reaches near the right object but doesn't close cleanly”, which is a grasp-quality issue, not a conditioning issue.

Where do we go from here? Three obvious levers, in roughly increasing cost:

More episodes. 80 is the lower edge of where ACT works. 200 is comfortable. Another v6 collection (40 more) is ~90 minutes of teleop including resets and would land us at 120.
Longer training. 32k is better than 8k; 64k might be better still, though we'd want to watch a held-out validation curve to be sure we're not overfitting.
Replace ACT with π0.5 fine-tuned on the same data. π0.5 is bigger, comes pretrained with broad robot priors, and tends to generalise better at low data counts. The infrastructure already supports it — that's why experiments/pi0/app.py is named after pi0 even though we're currently training ACT. A π0.5 fine-tune is the obvious next experiment.
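For the held-out validation curve mentioned under longer training, the split should be by whole episode, because frames from the same episode are nearly identical and would leak across a frame-level split. A minimal sketch follows; `split_episodes` is a hypothetical helper, and whether train_act accepts an episode filter is not something this write-up establishes.

```python
import random
from typing import List, Tuple


def split_episodes(
    num_episodes: int, val_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_ids, val_ids), holding out whole episodes."""
    if num_episodes < 2:
        raise ValueError("need at least two episodes to split")
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be in (0, 1)")
    ids = list(range(num_episodes))
    random.Random(seed).shuffle(ids)  # seeded, so the split is reproducible
    n_val = max(1, round(num_episodes * val_fraction))
    return sorted(ids[n_val:]), sorted(ids[:n_val])


if __name__ == "__main__":
    train_ids, val_ids = split_episodes(80)
    print(f"{len(train_ids)} train episodes, {len(val_ids)} held out: {val_ids}")
```

With 80 episodes and a 10% hold-out this leaves 72 for training, which is one more argument for collecting more data first.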

Chapter 14

The full recipe

End-to-end, from a fresh checkout to a running prompted dashboard:

1. record: lerobot-record × 80 eps. Two cameras, alternate which object you pick.
2. label: auto_label_bboxes.py. GroundingDINO on Modal → bbox per episode.
3. graft: add_bbox_to_dataset.py. Inject env_state column → new HF repo.
4. train: train_act × 32k steps. One run for vanilla, one for prompted.
5. deploy: POLICY_VARIANT=… modal deploy. Two URLs, one source file.
6. run: dashboard server.py. Click bbox, click run.
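The graft in step 3 is conceptually small: one bbox per episode gets normalized and repeated on every frame of that episode as the environment-state feature. The sketch below is illustrative only. It assumes pixel-space boxes, normalization by image width and height, and a 640×480 frame; the actual label schema and the internals of add_bbox_to_dataset.py are not reproduced here.

```python
from typing import Dict, List, Sequence


def normalize_bbox(bbox_px: Sequence[float], width: int, height: int) -> List[float]:
    """Scale [x_min, y_min, x_max, y_max] from pixels to the [0, 1] range."""
    x_min, y_min, x_max, y_max = bbox_px
    if not (0 <= x_min < x_max <= width and 0 <= y_min < y_max <= height):
        raise ValueError(f"bbox {list(bbox_px)} does not fit a {width}x{height} image")
    return [x_min / width, y_min / height, x_max / width, y_max / height]


def graft_env_state(
    frame_episode_ids: Sequence[int],
    labels_px: Dict[int, Sequence[float]],
    width: int,
    height: int,
) -> List[List[float]]:
    """Build the per-frame 4-D column: every frame gets its episode's bbox."""
    normalized = {ep: normalize_bbox(b, width, height) for ep, b in labels_px.items()}
    return [list(normalized[ep]) for ep in frame_episode_ids]


if __name__ == "__main__":
    # Hypothetical labels: episode index -> pixel bbox on an assumed 640x480 frame.
    labels = {0: (160, 120, 320, 240), 1: (0, 0, 640, 480)}
    column = graft_env_state([0, 0, 1], labels, width=640, height=480)
    print(column)
```

A missing episode label raises a KeyError rather than silently writing zeros, which is the behavior you want before pushing a new dataset repo.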

bash · full pipeline

# 1. Record (interactive)
bash record_v4.sh           # 40 eps → ozyphus/so101-cleanup-v4
bash record_v5.sh           # 40 more → ozyphus/so101-cleanup-v5

# 1b. Merge to 80-episode dataset
python tools/merge_datasets.py \
       --base-repo-id ozyphus/so101-cleanup-v4 \
       --add-repo-id  ozyphus/so101-cleanup-v5 \
       --dst-repo-id  ozyphus/so101-cleanup-v4plus5 \
       --output-dir   build/v4plus5 --push

# 2. Auto-label bboxes (GroundingDINO via Modal)
modal deploy experiments/detector/app.py
python tools/auto_label_bboxes.py \
       --repo-id    ozyphus/so101-cleanup-v4plus5 \
       --review-dir review_v4plus5/

# 3. Inject bbox column → prompted variant
python tools/add_bbox_to_dataset.py \
       --src-repo-id ozyphus/so101-cleanup-v4plus5 \
       --dst-repo-id ozyphus/so101-cleanup-v4plus5-prompted \
       --labels      bbox_labels.json \
       --output-dir  build/v4plus5-prompted --push

# 4. Train both variants
modal run --detach experiments/pi0/app.py::train_act \
       --dataset-repo-id=ozyphus/so101-cleanup-v4plus5 \
       --num-steps=32000 --output-name=act-v4plus5-vanilla-32k

modal run --detach experiments/pi0/app.py::train_act \
       --dataset-repo-id=ozyphus/so101-cleanup-v4plus5-prompted \
       --num-steps=32000 --output-name=act-v4plus5-prompted-32k

# 5. Deploy
POLICY_VARIANT=prompted modal deploy experiments/pi0/app.py
POLICY_VARIANT=vanilla  modal deploy experiments/pi0/app.py

# 6. Run the dashboard
.venv/bin/python client/dashboard/server.py
# → open http://127.0.0.1:8765, connect, home, detect, click, run.

Or, if you just want to test vanilla from the CLI without the dashboard:

bash

.venv/bin/python client/so101_run.py \
  --endpoint-url=https://<workspace>--so101-pi0-vanilla-pi0policy-live.modal.run \
  --base-cam-index=1 --task="clean up the desk" --live \
  --action-mode=absolute --max-joint-step-deg=5 --actions-per-chunk=1
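The safety behavior implied by --action-mode=absolute together with --max-joint-step-deg=5 is that the policy proposes absolute joint targets and the client limits how far each joint may move per step. This is a sketch of that idea, not the code in client/so101_run.py; the function name is hypothetical and how the gripper channel is handled is an assumption left out here.

```python
from typing import List, Sequence


def clamp_joint_step(
    current_deg: Sequence[float],
    target_deg: Sequence[float],
    max_step_deg: float = 5.0,
) -> List[float]:
    """Move each joint toward its absolute target by at most max_step_deg."""
    if len(current_deg) != len(target_deg):
        raise ValueError("current and target must have the same number of joints")
    if max_step_deg <= 0:
        raise ValueError("max_step_deg must be positive")
    commanded = []
    for cur, tgt in zip(current_deg, target_deg):
        # Limit the signed change to the range [-max_step_deg, +max_step_deg].
        delta = max(-max_step_deg, min(max_step_deg, tgt - cur))
        commanded.append(cur + delta)
    return commanded


if __name__ == "__main__":
    current = [0.0, 10.0, -20.0]
    target = [12.0, 8.0, -40.0]  # what the policy asked for
    print(clamp_joint_step(current, target))
```

At 5 Hz a 5° cap bounds each joint to 25°/s, so a bad prediction produces a slow drift you can interrupt rather than a lunge.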

ℹ Time budget for the whole thing

Recording: ~3 hours of teleop including resets. Merge: ~3 minutes. Auto-labelling: ~5 minutes for 80 episodes once the detector is warm. Parquet surgery: ~2 minutes. Training: 2 × 85 min on A10G (~$3 at Modal pricing). Deploy: a few seconds each. Total wall-clock from cold to running prompted policy: an evening.
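Adding those figures up makes "an evening" concrete. The sketch uses only the numbers quoted above and ignores the few seconds of deploy time. The GPU rate is back-computed from the ~$3 figure rather than taken from a price list, and running the two training jobs in parallel is an option we assume is available, since both are launched detached.

```python
# Minutes per stage, as quoted in the time budget.
BUDGET_MIN = {
    "recording": 180,
    "merge": 3,
    "auto_label": 5,
    "parquet_surgery": 2,
}
TRAIN_MIN_PER_RUN = 85
TRAIN_RUNS = 2
QUOTED_GPU_COST_USD = 3.0


def total_minutes(parallel_training: bool) -> int:
    """Wall-clock minutes, with the two training runs in parallel or in sequence."""
    train = TRAIN_MIN_PER_RUN if parallel_training else TRAIN_MIN_PER_RUN * TRAIN_RUNS
    return sum(BUDGET_MIN.values()) + train


def implied_gpu_rate_usd_per_hour() -> float:
    """Hourly rate implied by the quoted total cost and total GPU minutes."""
    gpu_hours = TRAIN_MIN_PER_RUN * TRAIN_RUNS / 60
    return QUOTED_GPU_COST_USD / gpu_hours


if __name__ == "__main__":
    print(f"sequential training: {total_minutes(False) / 60:.1f} h")
    print(f"parallel training:   {total_minutes(True) / 60:.1f} h")
    print(f"implied GPU rate:    ${implied_gpu_rate_usd_per_hour():.2f}/h")
```

Recording dominates either way: it is three of the roughly four and a half to six hours.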

References

Zhao et al. “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.” RSS 2023. The original ACT paper. arXiv:2304.13705
Black et al. “π₀: A Vision-Language-Action Flow Model for General Robot Control.” 2024. arXiv:2410.24164
Liu et al. “Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.” ECCV 2024. arXiv:2303.05499
LeRobot Team. “LeRobot: State-of-the-art ML for Real-World Robotics.” Hugging Face. GitHub repository.
Cadene et al. “LeRobot Datasets v3.0 specification.” The chunked-parquet format used here. LeRobot documentation.
Modal Labs. “Modal: Run Python in the Cloud.” modal.com/docs
SO-100 / SO-101 hardware: TheRobotStudio. GitHub repository.