In the real system the dashboard runs GroundingDINO over the base camera feed, you click one of the detected boxes, and the policy gets that 4-D bbox as part of its observation.
TL;DR
We took an off-the-shelf imitation-learning policy — the Action Chunking Transformer (ACT) — and gave it a fourth modality: a 4-dimensional bounding box that says this is the thing I want. Then we collected 80 teleoperation episodes on a 6-DOF SO-101 arm, auto-labelled the target object in each one using GroundingDINO, post-processed the dataset to inject those boxes as a new feature, and trained two policies side by side: a vanilla ACT and a prompted ACT. Both serve from Modal, both are switchable from a single browser dashboard, both are addressed by the same client. The whole stack is built on top of LeRobot's v3.0 dataset format with as few abstractions as we could get away with.
ACT already has a slot for a non-image, non-state observation called
observation.environment_state. It exists to carry things like the position
of a peg in a peg-in-hole task. Nothing in the architecture says it has to mean
state — the transformer just embeds it as one extra token. We hijack that
slot and stuff a normalized bounding box into it. The policy learns to attend to the
token and the resulting behavior is conditional on whatever object the box surrounds.
No new losses, no new networks, no architecture changes.
You're comfortable with: PyTorch tensor shapes, basic transformer attention, REST APIs, and you've at least heard of imitation learning. You don't need to know ACT, π0, LeRobot, Modal, or GroundingDINO — we re-derive enough of each to follow the choices. Familiarity with our previous build log (HuggingBros) helps but isn't required.
The problem
You have a robot arm on your desk. There's a small orange cube to its left, a green plush toy in front of it, and a black cube behind that. You want to be able to tell the arm which one to pick — with a click, not by retraining. Tomorrow you'll add a fourth object and you'd like that to work too.
The naive imitation-learning pipeline goes the other way. You collect a demonstration set: arm-picks-cube, arm-picks-cube, arm-picks-cube. You train a policy that takes (camera, joint state) and outputs joint targets. The policy has no idea what a cube is — it just learned that for these pixels, this is the right motion. If the cube moves, you hope the camera generalises. If you swap in a different object, you collect new demos and retrain. Disambiguation between two simultaneous candidate objects is invisible to the policy unless the demonstrations themselves disambiguate — and since every demo terminates in a successful grasp, the dataset never says not that one.
We want a different shape: an extra input at policy time that says here is the thing, and a training procedure that teaches the policy to use it. The shape we landed on is a 4-D bounding box, and the rest of this article is the engineering required to make the loop run end-to-end.
One trained model. Two objects in scene. Click on either bounding box in the dashboard. Arm goes for the one we clicked. Click on the other one. Arm goes for that one instead. No retraining between clicks.
ACT in 5 minutes
The Action Chunking Transformer is the small imitation-learning policy that ships in LeRobot and just works for desk-scale teleop tasks. Three things make it different from the obvious "MLP from pixels to actions":
- It's a transformer. Image features (from an ImageNet-pretrained ResNet-18 backbone), joint-state encodings, and the optional environment-state vector all become tokens. The decoder produces a chunk of future actions.
- It outputs an entire chunk at once. A single forward pass returns a sequence of chunk_size actions (50 in our runs). At control time you execute as many of them as you want before the next forward pass.
- It's trained as a CVAE. The encoder takes the actual demonstration chunk plus the current state and produces a latent code. The decoder reconstructs the chunk from observations + the latent. At inference, you skip sampling and set z = 0: deterministic, mean-of-distribution actions.
Action chunking, visually
Why chunks? In teleop demonstrations, your hand commits to a multi-step plan even though the policy only sees one frame. Predicting one action from one frame gives you the average of every demonstrator's micro-decisions and produces dithering output. Predicting 50 actions forces the network to commit, and stitched-together chunks produce the smooth wrist-cam paths we see at inference.
The “commit horizon” is how many actions of the chunk are executed before the policy is queried again: actions_per_chunk = 1 is closed-loop; 50 is full open-loop.
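To make the commit horizon concrete, here is a minimal sketch of a control loop that executes a prefix of each chunk before re-querying. The names `run_chunked`, `toy_policy` and `toy_step` are hypothetical stand-ins for illustration, not code from our client.

```python
from typing import Callable, List, Sequence

Action = List[float]


def run_chunked(
    policy: Callable[[Sequence[float]], List[Action]],
    step: Callable[[Action], Sequence[float]],
    state: Sequence[float],
    actions_per_chunk: int,
    total_steps: int,
) -> int:
    """Execute total_steps actions, re-querying the policy after every
    actions_per_chunk executed actions. Returns the number of policy queries."""
    if actions_per_chunk < 1:
        raise ValueError("actions_per_chunk must be >= 1")
    queries = 0
    executed = 0
    while executed < total_steps:
        chunk = policy(state)  # one forward pass returns a whole chunk
        queries += 1
        if not chunk:
            raise RuntimeError("policy returned an empty chunk")
        for action in chunk[:actions_per_chunk]:  # commit to a prefix only
            if executed >= total_steps:
                break
            state = step(action)
            executed += 1
    return queries


def toy_policy(state: Sequence[float]) -> List[Action]:
    """Stand-in policy: a 50-action chunk that nudges joint 0 upward."""
    return [[state[0] + 0.1 * (i + 1)] for i in range(50)]


def toy_step(action: Action) -> Sequence[float]:
    """Stand-in robot: the commanded target becomes the new state."""
    return list(action)
```

With a 50-action chunk, 100 control steps cost 100 policy queries at `actions_per_chunk=1` and 2 queries at `actions_per_chunk=50`. That is the trade between responsiveness and commitment described above.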
The CVAE trick (briefly)
ACT's loss is the standard CVAE loss: reconstruction + KL. The reconstruction term is L1
between the predicted chunk and the demonstration chunk. The KL term keeps the latent
near a unit Gaussian. A
KL coefficient of about 10 is the LeRobot default and it means —
informally — the model is biased toward "use the latent only when observations
aren't enough."
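Written out, the objective looks roughly like the following. The symbols are our own shorthand and are assumptions, not notation from LeRobot: $k$ is the chunk size, $a_t$ and $\hat{a}_t$ are demonstrated and predicted actions, $s$ is the current state, $q_\phi$ is the CVAE encoder with per-dimension mean $\mu_j$ and variance $\sigma_j^2$, and $\beta$ is the KL coefficient. Whether the L1 term is averaged or summed is an implementation detail we gloss over here.

```latex
\mathcal{L}
  = \underbrace{\frac{1}{k}\sum_{t=1}^{k}\lVert \hat{a}_t - a_t \rVert_1}_{\text{reconstruction}}
  \;+\; \beta\,
    \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid a_{1:k}, s)\,\middle\|\,\mathcal{N}(0, I)\right)}_{\text{latent regulariser}},
  \qquad \beta \approx 10

D_{\mathrm{KL}} = -\tfrac{1}{2}\sum_{j}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
```

The second line is the standard closed form for a diagonal Gaussian against a unit Gaussian.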
For our purposes, none of this matters at inference: we always set z = 0.
What matters is the encoder side, because that's where the prompt token lives.
[Figure: ACT encoder inputs. Image features per camera (2 cameras → 240 tokens), joint state via Linear(6, 512), env_state via Linear(4, 512), and the latent with z = 0 at inference.]
The orange box above — env_state — is what makes ACT promptable
without changing a line of model code. It's a stock LeRobot input that defaults to
nothing. We give it a meaning.
Making ACT promptable
env_state, repurposed
ACT accepts three kinds of input keys. Two are mandatory in practice:
observation.images.* (one or more cameras) and observation.state
(current joint angles). The third is observation.environment_state, a flat
vector of arbitrary length. It exists for tasks where the environment carries information
that isn't in the image — the position of a peg, the joint angles of a second arm,
the contact force from a sensor.
Nothing in ACT cares what those numbers mean. The vector becomes one token via a linear projection, joins the other tokens, the transformer attends, the actions come out. If we put a 4-D bounding box in there during training, the model learns to associate that box with the eventual gripper trajectory. At inference, give it a different box and you get a different trajectory.
Why a 4-D bounding box (and not, say, a mask, a pixel, or a class label)
A bounding box is the laziest spatial prompt that still works. Four numbers, normalized to [0, 1] over image width W and height H: (x_min / W, y_min / H, x_max / W, y_max / H).
Compared to alternatives:
| Prompt shape | Pros | Cons |
|---|---|---|
| 4-D bbox | Tiny. Fits in env_state. Trivial to auto-label. Trivial to draw with a click. | No object identity, no shape. Not pose-aware. |
| Pixel coordinate (x, y) | Even tinier (2-D). | Loses scale, one of the few cues ACT can use as a coarse depth signal. |
| Segmentation mask (256×256 binary) | Most expressive. | Big. Need a mask encoder. Need SAM at inference. Way more brittle. |
| Class label (one-hot over N) | Smallest possible. | Closed-vocabulary. Can't add a new object without retraining. |
| CLIP text embedding (512-D) | Open-vocabulary. | Doesn't tell the policy where. Two cubes in scene → same embedding. |
A bounding box hits the sweet spot: it's 4 numbers, anyone can produce it (open-vocab text model + image, or just a click and drag), and it carries both where and how big. Open-vocabulary is recovered at label time from the detector; the policy itself only ever sees four floats.
The bbox is the policy's laser pointer. It doesn't identify the object — it just says “the thing I want is somewhere in this rectangle.” The policy still has to figure out what to do with the gripper from the camera; it just knows where to focus.
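As an illustration of how cheap the prompt is to produce, here is a minimal sketch that turns a click-and-drag in pixel coordinates into the normalized 4-D box. The function name `drag_to_bbox_norm` and the clamping behaviour are assumptions for this example, not the dashboard's actual code.

```python
from typing import Tuple


def drag_to_bbox_norm(
    x_a: float, y_a: float, x_b: float, y_b: float, width: int, height: int
) -> Tuple[float, float, float, float]:
    """Convert two drag corners (pixels, any order) into a normalized
    (x_min, y_min, x_max, y_max) box with every value in [0, 1]."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    def clamp(value: float, upper: float) -> float:
        # Keep drags that leave the image inside the frame.
        return max(0.0, min(float(value), float(upper)))

    x0, x1 = sorted((clamp(x_a, width), clamp(x_b, width)))
    y0, y1 = sorted((clamp(y_a, height), clamp(y_b, height)))
    return (x0 / width, y0 / height, x1 / width, y1 / height)


def area_fraction(bbox_norm: Tuple[float, float, float, float]) -> float:
    """Fraction of the image covered by a normalized box."""
    x0, y0, x1, y1 = bbox_norm
    return (x1 - x0) * (y1 - y0)
```

A drag from (480, 360) back to (160, 120) on a 640×480 frame gives (0.25, 0.25, 0.75, 0.75), which covers a quarter of the image.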
Collecting demonstrations
lerobot-record drives data collection. You hold a leader arm, the follower
mirrors your motion via USB, two cameras record the scene, and the entire trajectory
(joint angles + camera frames + timestamps) is written to a Hugging Face dataset in the
v3.0 chunked-parquet format. Our recording protocol for the prompted variant:
lerobot-record \
--robot.type=so101_follower \
--robot.port=/dev/tty.usbmodem5A7A0546771 \
--robot.id=so101_follower_main \
--robot.cameras='{ wrist: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 15, backend: AVFOUNDATION},
base: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 15, backend: AVFOUNDATION} }' \
--teleop.type=so101_leader \
--teleop.port=/dev/tty.usbmodem5A7A0545661 \
--teleop.id=so101_leader_main \
--display_data=true \
--dataset.repo_id=ozyphus/so101-cleanup-v5 \
--dataset.single_task="clean up the desk" \
--dataset.num_episodes=40 \
--dataset.fps=15 \
--dataset.episode_time_s=35 \
--dataset.reset_time_s=15 \
--dataset.push_to_hub=true \
--dataset.private=true
A few choices in there matter for the prompted setup:
- Two cameras. A wrist cam (USB index 0) for fine-grained grasp information, and a base cam (USB index 1) angled at the workspace. The base cam is the one we'll later run a detector on.
- Single task string. ACT takes a task string for compatibility with language-conditioned VLAs but doesn't actually consume it. We use "clean up the desk" for everything; the prompt token does the work.
- 15 fps, 35-second episodes. ~525 frames per episode × 80 episodes = ~42k frames, which lands in the “enough to learn a single-stage manipulation task, short of the 100k+ where things really start working” zone for ACT.
- Both objects in scene every episode. If you only ever record demos with one object, the policy has no reason to attend to the bbox — it'll just learn “there's only one thing here, go for it.” We alternate which one we pick (~20 cube, ~20 toy) and vary positions, so the bbox is the only signal that breaks the symmetry.
The dataset that comes out is a tree of parquet files, alongside the video files and JSON metadata.
LeRobot's dataset loader doesn't read from main — it resolves a Hub
revision matching the codebase_version field in info.json
(e.g. "v3.0") and reads from that branch.
lerobot-record creates this branch automatically when it pushes.
HfApi.upload_folder — which we use after manual surgery —
does not. We had to add an explicit api.create_branch(...) after every
upload, otherwise LeRobotDataset(...) downstream throws
RevisionNotFoundError.
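The upload-then-branch step can be sketched as below. To keep the example runnable without network access or credentials, it is written against a small protocol instead of importing `huggingface_hub`. The helper name `push_with_version_branch` is hypothetical, and the two methods mirror the keyword arguments of `HfApi.upload_folder` and `HfApi.create_branch` as we understand them.

```python
from typing import Protocol


class HubApi(Protocol):
    """The subset of the Hub client this sketch relies on."""

    def upload_folder(self, *, repo_id: str, folder_path: str, repo_type: str) -> object: ...

    def create_branch(
        self, repo_id: str, *, branch: str, repo_type: str, exist_ok: bool
    ) -> None: ...


def push_with_version_branch(
    api: HubApi, repo_id: str, folder_path: str, codebase_version: str
) -> None:
    """Upload a locally edited dataset, then make sure the branch named after
    codebase_version (for example "v3.0") exists so the loader can resolve it."""
    api.upload_folder(repo_id=repo_id, folder_path=folder_path, repo_type="dataset")
    # exist_ok=True makes the call idempotent when the branch is already there.
    api.create_branch(
        repo_id, branch=codebase_version, repo_type="dataset", exist_ok=True
    )
```

In real use you would pass `huggingface_hub.HfApi()` as `api`. One caveat we would check before relying on this: with `exist_ok=True` an existing branch is left where it is, so after re-uploading to a repo that already has the branch you may need to update that branch explicitly.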
Auto-labelling with GroundingDINO
We need a bounding box per episode that says this is the thing the demonstrator picked. There are two reasonable ways to get one:
- Annotate the first frame manually. Reliable, but 80 episodes × click-and-drag is exactly the kind of work we set up Modal to avoid.
- Run an open-vocabulary detector like GroundingDINO on the first frame, take the highest-scoring match for the prompt “orange cube . green toy . black cube”, and call it a day.
We picked (2) and deployed GroundingDINO behind a Modal endpoint. The detector lives in
experiments/detector/app.py as a single @app.cls with a
FastAPI endpoint. About 4 GB of weights, runs comfortably on an A10G (24 GB), warm
inference around 200 ms.
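The "take the highest-scoring match" step is simple enough to sketch. The detection dict layout here, in particular the `score` key, is an assumption for illustration; only `label` and `bbox_norm` appear in our detector code.

```python
from typing import Dict, List, Optional


def pick_best_detection(
    detections: List[Dict], phrase: Optional[str] = None
) -> Optional[Dict]:
    """Return the highest-scoring detection, optionally restricted to one
    phrase. Returns None when nothing matches, so the caller can fall back."""
    candidates = [
        det for det in detections if phrase is None or det["label"] == phrase
    ]
    return max(candidates, key=lambda det: det["score"], default=None)


# Made-up detections for a single first frame.
example = [
    {"label": "orange cube", "score": 0.41, "bbox_norm": [0.10, 0.50, 0.18, 0.60]},
    {"label": "green toy", "score": 0.77, "bbox_norm": [0.45, 0.40, 0.60, 0.58]},
    {"label": "orange cube", "score": 0.63, "bbox_norm": [0.12, 0.52, 0.20, 0.62]},
]
best_cube = pick_best_detection(example, "orange cube")
```

Returning `None` instead of raising keeps the "no detection" case explicit, which matters for episodes where the object is out of frame.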
The chassis bug
The first labeller run did exactly what you'd expect from a detector matching loosely on colour: it decided that the most prominent orange thing in the workspace was the SO-101 itself.
auto_label_bboxes.py reported high-confidence detections for every cube
episode. We pulled the review images and the bounding boxes were almost all on the
arm, not the cube. The arm has an orange-red sticker on the elbow, and the cube
sometimes occupies less than 5% of the frame. The detector picked the bigger,
more saturated “orange thing” every time.
Per-phrase + area filter
Two changes fixed this. First, we noticed GroundingDINO sometimes combined
multiple phrases in the prompt onto a single bounding box — a side effect of how
its text encoder handles dotted prompts. The fix was to split the prompt by
"." and run detection once per phrase, then take the union of results.
Second, we added a maximum-area filter: any box that covers more than 8% of the image
gets dropped. The arm's elbow region was reliably 30-50% of the frame, so this drops
chassis matches without touching real objects.
DEFAULT_PROMPT = "orange cube . green toy ."  # extend at the dashboard for new objects
DEFAULT_BOX_THRESHOLD = 0.30
DEFAULT_TEXT_THRESHOLD = 0.25
DEFAULT_MAX_AREA_FRAC = 0.08  # drop boxes > 8% of image; filters the arm chassis

def _detect_pil(self, image, prompt, *, box_th, text_th, max_area_frac):
    """Run detect once per phrase and merge. GDINO sometimes collapses
    a multi-phrase prompt into one box; per-phrase keeps phrases independent."""
    phrases = [p.strip() for p in prompt.split(".") if p.strip()]
    H, W = image.size[1], image.size[0]  # PIL reports size as (width, height)
    img_area = float(H * W)
    out = []
    for phrase in phrases:
        boxes = self._detect_pil_one_prompt(image, phrase, box_th, text_th)
        for box in boxes:
            x0, y0, x1, y1 = box["bbox"]  # pixel coordinates
            area_frac = ((x1 - x0) * (y1 - y0)) / img_area
            if area_frac > max_area_frac:
                continue  # too big: almost certainly the robot itself
            box["label"] = phrase
            box["bbox_norm"] = [x0/W, y0/H, x1/W, y1/H]
            out.append(box)
    return out
With those two changes, Run 2 labelled all 40 cube episodes correctly and 38 of 40 toy episodes, with 2 toy episodes producing no detection at all (out-of-frame at the first frame). For the unlabelled episodes we fall back to the median bbox across the labelled ones — the workspace centroid — which is a fine prior to learn from.
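The fallback is a per-coordinate median over the labelled episodes. A minimal sketch, assuming the labels dict has the shape described in the next section (episode index mapped to a dict with a `bbox_norm` entry). The function name `median_bbox` is ours for this example and may not match the helper in the actual script.

```python
from statistics import median
from typing import Dict, List


def median_bbox(labels: Dict[str, Dict]) -> List[float]:
    """Per-coordinate median of every labelled episode's normalized bbox."""
    boxes = [entry["bbox_norm"] for entry in labels.values()]
    if not boxes:
        raise ValueError("no labelled episodes to take a median over")
    # zip(*boxes) regroups the data as (all x_min, all y_min, all x_max, all y_max).
    return [median(coords) for coords in zip(*boxes)]


# Made-up labels for three episodes.
example_labels = {
    "0": {"bbox_norm": [0.10, 0.50, 0.20, 0.60]},
    "1": {"bbox_norm": [0.40, 0.30, 0.50, 0.40]},
    "2": {"bbox_norm": [0.20, 0.40, 0.30, 0.50]},
}
fallback = median_bbox(example_labels)
```

Taking the median per coordinate does not guarantee that the result is one of the observed boxes, but for a "somewhere in the workspace" prior that is good enough.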
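[Figure: before and after. A single multi-phrase prompt often labelled “orange cube green toy” on a single combined region, which is useless. With per-phrase detection, detect("green toy") → box B, and the area filter drops the chassis match.]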
Surgery on parquet
The labels file is a flat dict mapping episode_index to
{"bbox_norm": [...], "n_frames_in_episode": N, ...}. To use it for training,
we need it to live inside the dataset as a proper LeRobot feature. That means
editing four places in the v3.0 dataset tree:
Step 1 adds a column to the data parquet: observation.environment_state as a fixed-size list<float, 4>, broadcasting each episode's bbox over all of its frames.
The full script is tools/add_bbox_to_dataset.py. The interesting bit is
step 1, because that's where we actually inject pixels-of-meaning into the trainable
tensor:
import pyarrow as pa
import pyarrow.parquet as pq

NEW_FEATURE = "observation.environment_state"
BBOX_DIM = 4  # x_min, y_min, x_max, y_max

def update_data_parquet(parquet_path, labels):
    """Append a fixed_size_list<float>[4] column, broadcasting each episode's bbox to all of its rows."""
    table = pq.read_table(parquet_path)
    ep_idx_col = table.column("episode_index").to_pylist()
    fallback = _median_bbox(labels)  # helper from the same script; used for episodes the detector skipped
    bboxes_flat = []
    for ep in ep_idx_col:
        key = str(ep)  # the labels file is keyed by episode index as a string
        bbox = labels[key]["bbox_norm"] if key in labels else fallback
        bboxes_flat.extend(float(v) for v in bbox)
    # One flat float32 array, viewed as rows of BBOX_DIM values each.
    flat = pa.array(bboxes_flat, type=pa.float32())
    fixed = pa.FixedSizeListArray.from_arrays(flat, BBOX_DIM)
    new_table = table.append_column(NEW_FEATURE, fixed)
    pq.write_table(new_table, parquet_path)
We also need info.json to declare the feature, otherwise LeRobot won't
plumb it through to ACT's input_features:
"observation.environment_state": {
    "dtype": "float32",
    "shape": [4],
    "names": ["x_min", "y_min", "x_max", "y_max"]
}
Once that's in place, LeRobotDataset(repo_id) picks up the new feature
automatically. Sanity check: the keys printed by the dataset object now include
observation.environment_state alongside the cameras and joint state, and
ACT's config.input_features contains an entry with shape (4,).
No model code changed.
We label only the first frame of the demo, then broadcast that single bbox to every row of the episode's parquet. This means the policy sees a static bbox for the entire demo, even as the gripper moves and the object moves with it. We tried the alternative — running the detector frame-by-frame — and found that close to the grasp point, occlusion-by-gripper makes the detector unreliable. A static bbox is also closer to how a user will use the system: they click once at the start, the policy carries the prompt through the whole reach.
Merging v4 + v5
We trained the first prompted ACT on 40 episodes (v4). It didn't learn the
conditioning — clicking different bounding boxes produced the same trajectory. The
symptom was diagnostic: when the dashboard did forward the bbox to the policy
(we logged the env_state in every payload to confirm), the behavior was unchanged. This
is the textbook signature of an ignored input: the gradient through the bbox token never
lined up against a useful direction during training, so the policy effectively learned
to attend to it with weight zero.
Looking at published ACT recipes, 40 demos is the under-fit edge for any pickup task —
most papers use 50-200. So we recorded another 40 (v5, same scene, same
protocol, alternating which object got picked) and merged the two into a single 80-episode
dataset.
Merging LeRobot v3.0 datasets isn't a one-liner because of the chunked layout: each video
file holds multiple episodes' frames, episode indices are global, and per-episode metadata
lives in its own parquet. tools/merge_datasets.py handles all of this:
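- Both source datasets are fetched with snapshot_download.
- v4 is copied into output_dir as-is.
- v5's video files are renumbered: file-000.mp4 becomes file-NNN.mp4 where NNN follows v4's last index. No re-encode.
- In v5's data parquet, episode_index is shifted by n_eps_v4 and the global index by n_frames_v4.
- In v5's per-episode metadata, chunk_index, file_index, and timestamp ranges are shifted to point at the new video locations.
- info.json totals (total_episodes, total_frames) and stats.json get recomputed from the merged data.
- The result is pushed to the Hub along with its v3.0 branch.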
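The index arithmetic in the merge is the part that is easy to get wrong, so here is a minimal sketch of it on plain Python rows. It is a simplification under stated assumptions: the real script operates on parquet tables, and the helper names `shift_rows` and `renumbered_video_name` are hypothetical.

```python
from typing import Dict, List


def shift_rows(rows: List[Dict], n_eps_base: int, n_frames_base: int) -> List[Dict]:
    """Re-index rows from the appended dataset so that they continue after
    the base dataset's episodes and frames. Input rows are not mutated."""
    return [
        {
            **row,
            "episode_index": row["episode_index"] + n_eps_base,
            "index": row["index"] + n_frames_base,  # global frame index
        }
        for row in rows
    ]


def renumbered_video_name(file_index: int, last_base_file_index: int) -> str:
    """Name for an appended video file, numbered after the base's last file."""
    return f"file-{last_base_file_index + 1 + file_index:03d}.mp4"


# Made-up rows from the dataset being appended, plus made-up base totals.
added_rows = [
    {"episode_index": 0, "index": 0, "frame_index": 0},
    {"episode_index": 0, "index": 1, "frame_index": 1},
    {"episode_index": 1, "index": 2, "frame_index": 0},
]
merged_tail = shift_rows(added_rows, n_eps_base=40, n_frames_base=20000)
```

Note that `frame_index`, the position within an episode, is left alone. Only the two global counters move.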
Once so101-cleanup-v4plus5 exists on the Hub (80 episodes, 41 411 frames, ~46 minutes of recorded teleop at 15 fps), the prompted variant takes two commands against the merged dataset: auto-label the episodes, then run add_bbox_to_dataset.py:
python tools/auto_label_bboxes.py --repo-id ozyphus/so101-cleanup-v4plus5 \
--review-dir review_v4plus5/
python tools/add_bbox_to_dataset.py \
--src-repo-id ozyphus/so101-cleanup-v4plus5 \
--dst-repo-id ozyphus/so101-cleanup-v4plus5-prompted \
--labels bbox_labels.json --output-dir build/v4plus5-prompted \
--push
We get two datasets out of one collection: the unmodified merge for the vanilla baseline, and the prompted variant for the conditioning experiment.
Three apps from one file
We want three Modal deployments from one source file: a default that we don't break, a vanilla policy, and a prompted policy. Each needs its own URL so the dashboard can switch between them without a redeploy.
A Modal source file normally declares a single named app. To get three names without three files, we read an environment variable at deploy time and use it to derive both the app name and the checkpoint priority:
POLICY_VARIANT = os.environ.get("POLICY_VARIANT", "current").strip().lower()
APP_NAME_BASE = "so101-pi0"
APP_NAME = APP_NAME_BASE if POLICY_VARIANT == "current" else f"{APP_NAME_BASE}-{POLICY_VARIANT}"

# Checkpoints are tried in order; the first one that exists is loaded.
CHECKPOINTS_BY_VARIANT = {
    "current": [
        ("/weights/ft/act-v4plus5-prompted-32k/checkpoints/032000/pretrained_model", "act"),
        ("/weights/ft/act-v4plus5-vanilla-32k/checkpoints/032000/pretrained_model", "act"),
        *_FALLBACK_CHECKPOINTS,
    ],
    "prompted": [
        ("/weights/ft/act-v4plus5-prompted-32k/checkpoints/032000/pretrained_model", "act"),
        *_FALLBACK_CHECKPOINTS,
    ],
    "vanilla": [
        ("/weights/ft/act-v4plus5-vanilla-32k/checkpoints/032000/pretrained_model", "act"),
        *_FALLBACK_CHECKPOINTS,
    ],
}
CHECKPOINT_CANDIDATES = CHECKPOINTS_BY_VARIANT[POLICY_VARIANT]

app = modal.App(APP_NAME)
And the deploy invocation:
POLICY_VARIANT=prompted modal deploy experiments/pi0/app.py
POLICY_VARIANT=vanilla modal deploy experiments/pi0/app.py
You get two URLs, one per variant.
Both serve from the same Python class. The class loads whichever checkpoint is first
in CHECKPOINT_CANDIDATES, and the env-state path inside the policy code
gracefully no-ops when the loaded checkpoint doesn't expect an
environment_state input:
# at load time:
env_feat = self.config.input_features.get("observation.environment_state")
self.expected_env_state_dim = int(env_feat.shape[0]) if env_feat is not None else 0

# at observation-build time:
def _build_raw_obs(self, image_b64, state, env_state):
    obs = {"observation.state": torch.tensor(state, dtype=torch.float32)}
    if self.expected_env_state_dim > 0:
        es = list(env_state) if env_state is not None else [0.0] * self.expected_env_state_dim
        es += [0.0] * (self.expected_env_state_dim - len(es))  # pad
        obs["observation.environment_state"] = torch.tensor(es[:self.expected_env_state_dim])
    return obs
When the vanilla checkpoint is loaded, expected_env_state_dim == 0 and the
bbox payload is silently dropped. When the prompted checkpoint is loaded,
expected_env_state_dim == 4 and the bbox is the difference between picking
the cube and picking the toy.
The dashboard
The dashboard is a single FastAPI process running on the Mac next to the arm. It does three things:
- Talks USB to the arm and to both cameras (it owns the bus).
- Streams both cameras to the browser as MJPEG, runs the detector on demand, accepts a click to pick a bounding box, and streams telemetry over a WebSocket.
- Posts (image, state, env_state) to the selected Modal endpoint at 5 Hz and applies the returned actions back to the arm.
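[Figure: dashboard architecture. The browser (policy variant dropdown, run params) talks to the dashboard server over the routes below; the server forwards observations to the Modal policy endpoint (L40S, ACT, ~280 ms warm).]
POST /api/{connect, detect, select-bbox, run, stop, config, status}
GET /stream/{wrist,base} (MJPEG, multipart/x-mixed-replace)
WS /ws/telemetry (action, state, latency)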
The pick-box flow on the browser side is the part that the user actually touches. Once the policy URL is set and the arm is connected, the flow is: detect, click a box, run.
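Turning a click into a selected box is a small hit-test. A minimal sketch, assuming clicks and boxes are both in normalized image coordinates. The function name `select_box` and the smallest-box tie-break are our assumptions for illustration, not necessarily what the dashboard does.

```python
from typing import List, Optional, Sequence


def select_box(
    click_x: float, click_y: float, boxes: List[Sequence[float]]
) -> Optional[Sequence[float]]:
    """Return the box containing the click. When several boxes overlap at the
    click point, prefer the smallest one, since it is the most specific."""

    def contains(box: Sequence[float]) -> bool:
        x0, y0, x1, y1 = box
        return x0 <= click_x <= x1 and y0 <= click_y <= y1

    def area(box: Sequence[float]) -> float:
        x0, y0, x1, y1 = box
        return (x1 - x0) * (y1 - y0)

    hits = [box for box in boxes if contains(box)]
    return min(hits, key=area, default=None)


# Made-up detections: a small cube box sitting inside a larger toy box.
detected = [
    [0.40, 0.40, 0.70, 0.70],  # toy
    [0.45, 0.45, 0.55, 0.55],  # cube
]
chosen = select_box(0.50, 0.50, detected)
```

The selected box is what gets sent as `env_state` on every subsequent policy request until the user clicks again.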
The dashboard owns the bus, which means the inference loop and the capture loop both want to read from the same servo. That contention surprised us; see the “Bus contention” section below.
Closed-loop, absolute, capped
The control loop reads joint state, sends an observation to the policy, gets an action chunk, applies some prefix of the chunk, and repeats. Three knobs in this loop matter a lot more than they sound like they should:
- Action mode. The default in so101_run.py was delta, which adds the predicted “delta” to the current state. With absolute outputs treated as deltas, the arm drifts toward joint limits over a few seconds. Fix: always pass --action-mode=absolute for ACT checkpoints. The dashboard hard-codes this.
- Actions per chunk. At actions_per_chunk = 1, only the first action of each predicted chunk is executed before re-querying. Closed-loop. Resilient but easily “dithers” when the policy isn't sure. At actions_per_chunk = 50, the entire chunk is executed open-loop before re-querying. Smooth but commits to bad plans. The under-trained vanilla 8k policy looped at 1 and committed at 50; the 32k variants behave better at 1.
- Per-step cap. --max-joint-step-deg=5 limits how far any joint may move in a single step.
Together: absolute mode so the policy's predictions mean what we trained them to mean, actions_per_chunk = 1 for closed-loop responsiveness, and a 5° per-step cap as a hardware-level seatbelt.
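Here is a minimal sketch of how the action mode and the per-step cap interact, in degrees. The function name `next_targets` and its signature are hypothetical; the real client may structure this differently.

```python
from typing import List, Sequence


def next_targets(
    current_deg: Sequence[float],
    predicted: Sequence[float],
    action_mode: str = "absolute",
    max_step_deg: float = 5.0,
) -> List[float]:
    """Turn a policy output into joint targets that are safe to send.

    absolute: predicted values are joint targets.
    delta:    predicted values are offsets added to the current state.
    In both modes each joint moves at most max_step_deg per call."""
    if action_mode == "absolute":
        desired = list(predicted)
    elif action_mode == "delta":
        desired = [cur + off for cur, off in zip(current_deg, predicted)]
    else:
        raise ValueError(f"unknown action_mode: {action_mode!r}")
    capped = []
    for cur, want in zip(current_deg, desired):
        step = max(-max_step_deg, min(max_step_deg, want - cur))
        capped.append(cur + step)
    return capped
```

Feeding the same output through both modes shows the failure described above. For a current state of [0, 10] and a prediction of [20, 8], absolute mode asks joint 1 to move down to 8, while delta mode pushes it up toward 18 and only the cap stops it at 15.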
With the 8k vanilla policy at actions_per_chunk = 1, the arm would head
roughly toward the cube, not quite reach it, the policy would predict a slight
adjustment back, the arm would over-correct, and so on — a low-amplitude
oscillation that never converges. This is the diagnostic signature of a policy whose
mean prediction is approximately right but whose variance is bigger than
the gripper-to-cube margin. It's why we tried both
actions_per_chunk = 50 (commit to the predicted trajectory) and longer
training (drive variance down). Both helped.
Bus contention
The follower arm's serial bus tolerates exactly one reader at a time. With a naive dashboard:
- The capture thread polls follower.get_observation() at 15 Hz to refresh the frame buffer.
- The inference thread also calls follower.get_observation() once per inference tick to read the latest joint state to send to the policy.
Two threads, one bus, no lock. The first time we ran inference with the dashboard, we got intermittent USB errors.
The fix is to make the inference thread the sole reader during a run. The capture thread still feeds the camera buffers (cameras are independent USB devices), but it backs off the servo bus while inference is running:
def _capture_loop(self):
    while self.status.connected:
        if self.status.inference_running:
            time.sleep(0.05)  # yield bus during inference
            continue
        obs = self.follower.get_observation()
        self._update_frame_buffers(obs)

def _inference_loop(self):
    while self.status.inference_running:
        obs = self.follower.get_observation()  # sole reader
        self._update_frame_buffers(obs)        # also feed the cam stream
        action = self._call_policy(obs)
        self._apply_action(action)
        time.sleep(self.tick_period)
This is a small structural change but it eliminates an entire category of intermittent USB errors. It also means the camera streams stay live during inference — the inference thread refreshes the frame buffers itself.
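For comparison, a different way to get the same guarantee is to serialize every bus read behind a lock. This is not what the dashboard does (it uses the back-off flag above); the sketch below is an alternative design, and the class name `LockedBus` is hypothetical.

```python
import threading
from typing import Any, Callable


class LockedBus:
    """Wrap a bus read function so that at most one thread reads at a time."""

    def __init__(self, read_fn: Callable[[], Any]) -> None:
        self._read_fn = read_fn
        self._lock = threading.Lock()

    def read(self) -> Any:
        # Other callers block here until the current read has finished.
        with self._lock:
            return self._read_fn()
```

The lock is simpler to reason about, but both threads still pay for their own reads, so the bus sees more traffic than with a single reader. The back-off approach keeps the capture thread off the bus entirely during a run, which is why we kept it.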
The 32k training story
We trained ACT on the merged 80-episode dataset twice for each variant: first at the
LeRobot default of 8 000 steps, then again at 32 000. Both runs use batch size 8,
learning rate 1e-5, action chunk size 50, on a single A10G. The 8k run
takes about 21 minutes; the 32k run takes about 85 minutes.
Final L1+KL after the last training step. Same dataset, same hyperparameters; only training duration changes. By 32k the prompted variant sits below vanilla, which is the bbox token paying off.
A few things to read off this:
- 8k is under-trained. Both variants are still on the steep part of the curve. With ~42k frames and batch size 8, 8k steps is ~1.5 epochs — you wouldn't train an MLP that briefly, and ACT is bigger than that.
- At 8k, prompted is slightly worse than vanilla. This was our first hint that the bbox token wasn't being used effectively yet — if the model can't extract useful signal from it, it just becomes noise that costs gradient budget elsewhere.
- At 32k, prompted is better than vanilla. 0.100 vs 0.109. That delta — about 8% — is the conditional information actually paying off. For the first time the bbox is helping more than it costs.
Watch the prompted curve cross under the vanilla curve somewhere between step ~12 000 and ~18 000. That's ACT figuring out how to use the bbox token. Before the crossover, the prompted policy is paying for an extra token's gradient with no gain. After it, the token is providing useful conditioning that the camera and state can't.
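The epoch and delta figures quoted above are quick to re-derive. A small sketch using the numbers from this section (41 411 frames, batch size 8, final losses 0.100 and 0.109); the helper names are ours.

```python
def epochs_seen(steps: int, batch_size: int, n_frames: int) -> float:
    """Passes over the dataset, counting one sample per frame."""
    return steps * batch_size / n_frames


def relative_improvement(baseline: float, candidate: float) -> float:
    """Fractional reduction of candidate relative to baseline."""
    return (baseline - candidate) / baseline


N_FRAMES = 41_411
BATCH_SIZE = 8

epochs_8k = epochs_seen(8_000, BATCH_SIZE, N_FRAMES)
epochs_32k = epochs_seen(32_000, BATCH_SIZE, N_FRAMES)
loss_gain = relative_improvement(baseline=0.109, candidate=0.100)

print(f"8k steps  = {epochs_8k:.2f} epochs")
print(f"32k steps = {epochs_32k:.2f} epochs")
print(f"prompted vs vanilla at 32k: {loss_gain:.1%} lower loss")
```

So the 8k run sees each frame about one and a half times, and the 32k run a little over six times.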
Honest evaluation
What does this look like in the real world?
| Variant | Steps | Behavior | Pickup rate (est.) |
|---|---|---|---|
| vanilla | 8 000 | Wandering loop near the cube. Doesn't commit. | ~10% |
| prompted | 8 000 | Same wander. Bbox has no apparent effect. | ~10% |
| vanilla | 32 000 | Reliable approach to whatever salient object is centered. Single-object scenes work. | ~50% |
| prompted | 32 000 | Approach changes when bbox changes. Two-object scenes start to disambiguate. | ~55% (per-object) |
These numbers are rough — we ran ~10-20 trials per variant in a single afternoon on the same desk, with the same lighting, picking from a small workspace. They are not a benchmark, they are a smell test.
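To put a number on how rough, here is a sketch of a Wilson score interval for a success rate estimated from a handful of trials. The counts used below are hypothetical, chosen to match the order of magnitude of our trial counts; they are not our recorded tallies.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


# Hypothetical: 5 successes in 10 trials, and 10 successes in 20 trials.
low_10, high_10 = wilson_interval(5, 10)
low_20, high_20 = wilson_interval(10, 20)
```

With 10 trials, an observed 50% is compatible with anything from roughly 24% to 76%. A gap like 50% versus 55% is far inside that band, so the table should be read as "both 32k policies work about half the time", not as a ranking.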
The 32k prompted policy is the first checkpoint where clicking different bounding boxes produces visibly different reaches. That's the property we set out to get, and we got it. It's not yet reliable enough that we'd trust it to clear an unfamiliar workspace — the failure mode isn't “wrong object”, it's “reaches near the right object but doesn't close cleanly” — which is a grasp-quality issue, not a conditioning issue.
Where do we go from here? Three obvious levers, in roughly increasing cost:
- More episodes. 80 is the lower edge of where ACT works. 200 is comfortable. Another v6 collection (40 more) is ~90 minutes of teleop and would land us at 120.
- Longer training. 32k is better than 8k; 64k might be better still, though we'd want to watch a held-out validation curve to be sure we're not overfitting.
- Replace ACT with π0.5 fine-tuned on the same data. π0.5 is bigger, comes pretrained with broad robot priors, and tends to generalise better at low data counts. The infrastructure already supports it, which is why experiments/pi0/app.py is named after pi0 even though we're currently training ACT. A π0.5 fine-tune is the obvious next experiment.
The full recipe
End-to-end, from a fresh checkout to a running prompted dashboard:
# 1. Record (interactive)
bash record_v4.sh # 40 eps → ozyphus/so101-cleanup-v4
bash record_v5.sh # 40 more → ozyphus/so101-cleanup-v5
# 1b. Merge to 80-episode dataset
python tools/merge_datasets.py \
--base-repo-id ozyphus/so101-cleanup-v4 \
--add-repo-id ozyphus/so101-cleanup-v5 \
--dst-repo-id ozyphus/so101-cleanup-v4plus5 \
--output-dir build/v4plus5 --push
# 2. Auto-label bboxes (GroundingDINO via Modal)
modal deploy experiments/detector/app.py
python tools/auto_label_bboxes.py \
--repo-id ozyphus/so101-cleanup-v4plus5 \
--review-dir review_v4plus5/
# 3. Inject bbox column → prompted variant
python tools/add_bbox_to_dataset.py \
--src-repo-id ozyphus/so101-cleanup-v4plus5 \
--dst-repo-id ozyphus/so101-cleanup-v4plus5-prompted \
--labels bbox_labels.json \
--output-dir build/v4plus5-prompted --push
# 4. Train both variants
modal run --detach experiments/pi0/app.py::train_act \
--dataset-repo-id=ozyphus/so101-cleanup-v4plus5 \
--num-steps=32000 --output-name=act-v4plus5-vanilla-32k
modal run --detach experiments/pi0/app.py::train_act \
--dataset-repo-id=ozyphus/so101-cleanup-v4plus5-prompted \
--num-steps=32000 --output-name=act-v4plus5-prompted-32k
# 5. Deploy
POLICY_VARIANT=prompted modal deploy experiments/pi0/app.py
POLICY_VARIANT=vanilla modal deploy experiments/pi0/app.py
# 6. Run the dashboard
.venv/bin/python client/dashboard/server.py
# → open http://127.0.0.1:8765, connect, home, detect, click, run.
Or, if you just want to test vanilla from the CLI without the dashboard:
.venv/bin/python client/so101_run.py \
--endpoint-url=https://<workspace>--so101-pi0-vanilla-pi0policy-live.modal.run \
--base-cam-index=1 --task="clean up the desk" --live \
--action-mode=absolute --max-joint-step-deg=5 --actions-per-chunk=1
Recording: ~3 hours of teleop including resets. Auto-labelling: ~5 minutes for 80 episodes once the detector is warm. Parquet surgery: ~2 minutes. Merge: ~3 minutes. Training: 2 × 85 min on A10G (~$3 at Modal pricing). Deploy: a few seconds each. Total wall-clock from cold to running prompted policy: an evening.
References
- Zhao et al. “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.” RSS 2023. The original ACT paper. arXiv:2304.13705
- Black et al. “π0: A Vision-Language-Action Flow Model for General Robot Control.” 2024. arXiv:2410.24164
- Liu et al. “Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.” ECCV 2024. arXiv:2303.05499
- LeRobot Team. “LeRobot: State-of-the-art ML for Real-World Robotics.” HuggingFace. github
- Cadene et al. “LeRobot Datasets v3.0 specification.” The chunked-parquet format used here. docs
- Modal Labs. “Modal: Run Python in the Cloud.” modal.com/docs
- SO-100 / SO-101 hardware: TheRobotStudio. github