RTC inpaints at inference time. Helix asks: what if the model learned to handle action prefixes during training? By conditioning on previously-committed actions during training, Helix achieves the same consistency as RTC with zero inference overhead.
RTC solved the latency problem elegantly: freeze the committed actions, inpaint the rest. But it introduced two new problems that are easy to overlook.
Problem 1: Inference overhead. Inpainting requires extra computation during denoising. At each denoising step, you must clamp the frozen prefix back into place. More importantly, the quality of the inpainted tail depends on using enough denoising steps to properly condition on the prefix. In practice, RTC needs 2-3x more denoising steps than standard chunking to achieve good consistency. On a resource-constrained robot, those extra steps eat into the already-tight compute budget.
Problem 2: Distribution mismatch. The flow matching model was trained to generate full, unconditional action chunks. At inference time, RTC asks it to generate chunks conditioned on a frozen prefix -- a task it never saw during training. This works surprisingly well (flow matching is flexible), but it's fundamentally asking the model to do something it wasn't optimized for. There's a gap between what the model learned and what it's asked to do.
Helix's key insight is deceptively simple: if the robot will always have a prefix of committed actions at inference time, why not simulate that during training?
During real deployment, the robot is always in one of two states: (1) starting from scratch (no prefix), or (2) partway through a previous chunk (prefix exists). Case 2 is the common case. RTC handles it at inference time with inpainting. Helix handles it at training time by modifying the training data.
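The two deployment states can be made concrete with a small helper. This is a minimal sketch, not code from the Helix paper: the function name, its arguments, and the rule "commit as many actions as control ticks elapse during one inference call" are all assumptions made for illustration.

```python
import math

def committed_prefix(prev_chunk, steps_executed, inference_latency_ms, control_period_ms):
    """Return the actions that will execute while the next chunk is being computed.

    Hypothetical helper: argument names and the latency-based rule are assumptions.
    """
    if prev_chunk is None:
        return []  # state 1: starting from scratch, no prefix
    # Control ticks that elapse during one inference call, rounded up.
    k = math.ceil(inference_latency_ms / control_period_ms)
    remaining = prev_chunk[steps_executed:]   # actions not yet sent to the robot
    return remaining[:k]                      # state 2: these are already committed


chunk = [[0.1 * i] for i in range(8)]             # toy chunk: H=8, one-dimensional actions
print(committed_prefix(None, 0, 50, 20))          # fresh start: empty prefix
print(len(committed_prefix(chunk, 3, 50, 20)))    # ceil(50 / 20) = 3 committed actions
```

Whatever rule a real system uses to pick the prefix, the point is the same: in state 2 the prefix is known before inference starts, so it can be handed to the model as an input.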
The idea: during training, randomly sample an action prefix from the ground-truth trajectory and provide it as additional input to the model. The model learns to generate the full chunk conditioned on this prefix. At inference time, no inpainting is needed -- the model already knows how to continue from a prefix because it practiced doing exactly that during training.
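During training, each action chunk in the dataset is a sequence [a_0, a_1, ..., a_{H-1}] of H actions. Helix modifies the training procedure as follows: sample a prefix length K between 0 and Kmax, take the first K ground-truth actions as the prefix, and pass that prefix, together with a mask marking which positions it covers, to the model as extra input. The model is still trained with the standard flow matching loss on the full chunk.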
The crucial detail: K is sampled randomly during training, including K=0. This means the model learns both the unconditional case (fresh start) and the conditional case (continuing a prefix). At inference time, you just set K to whatever the actual prefix length is.
Let's be precise about the difference. Both RTC and Helix solve the same problem: generating an action chunk that is consistent with previously committed actions. They differ in when they solve it.
With RTC, the model is trained normally (unconditional chunks). At inference time, consistency is enforced by clamping the prefix during denoising. Each denoising step generates a full chunk, then the prefix portion is overwritten with the committed values. This is an external constraint applied to a model that doesn't know about prefixes.
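The clamping loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the RTC authors' implementation: `denoise_with_clamp` and `toy_velocity` are hypothetical names, actions are scalars, and the "policy" is a toy velocity field that pulls every action toward 1.0.

```python
import random

def denoise_with_clamp(velocity_fn, prefix, horizon, num_steps, seed=0):
    """Euler-integrate a flow from noise (tau=0) to actions (tau=1), clamping the prefix.

    velocity_fn(chunk, tau) is a hypothetical stand-in for a policy that was
    trained without any notion of a prefix.
    """
    rng = random.Random(seed)
    chunk = [rng.gauss(0.0, 1.0) for _ in range(horizon)]   # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = i * dt
        v = velocity_fn(chunk, tau)
        chunk = [x + dt * vx for x, vx in zip(chunk, v)]     # Euler step on the full chunk
        chunk[:len(prefix)] = prefix                         # overwrite the committed prefix
    return chunk

def toy_velocity(chunk, tau):
    # Toy unconditional "policy": drives every action toward 1.0.
    return [(1.0 - x) / (1.0 - tau) for x in chunk]

result = denoise_with_clamp(toy_velocity, prefix=[0.2, 0.3], horizon=6, num_steps=10)
print(result[:2])   # the committed values, restored after every step
```

Note that `toy_velocity` never sees the prefix: the constraint lives entirely in the loop around the model, which is what "external" means here.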
With Helix, the model is trained with random prefixes as input. At inference time, the committed prefix is simply fed as input -- no clamping, no inpainting, no extra steps. The model natively generates continuations because it learned to do so during training.
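For contrast, here is a sketch of inference when the prefix is a model input. It rests on assumptions: `denoise_with_prefix_input` and `toy_conditioned_model` are hypothetical names, actions are scalars, the prefix is zero-padded to the chunk length, and the toy model stands in for a trained policy by steering prefix positions toward the committed values.

```python
import random

def denoise_with_prefix_input(model_fn, prefix, horizon, num_steps, seed=0):
    """Euler-integrate from noise to actions, passing the prefix as an input."""
    rng = random.Random(seed)
    k = len(prefix)
    padded_prefix = list(prefix) + [0.0] * (horizon - k)    # zero-pad to chunk length
    mask = [1] * k + [0] * (horizon - k)                    # 1 = position has a prefix value
    chunk = [rng.gauss(0.0, 1.0) for _ in range(horizon)]   # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = i * dt
        v = model_fn(chunk, tau, padded_prefix, mask)
        chunk = [x + dt * vx for x, vx in zip(chunk, v)]    # plain Euler step, no clamping
    return chunk

def toy_conditioned_model(chunk, tau, padded_prefix, mask):
    # Toy stand-in for a prefix-conditioned policy: prefix positions head toward
    # the committed values, the remaining positions head toward 1.0.
    targets = [p if m else 1.0 for p, m in zip(padded_prefix, mask)]
    return [(t - x) / (1.0 - tau) for t, x in zip(targets, chunk)]

result = denoise_with_prefix_input(toy_conditioned_model, [0.2, 0.3], horizon=6, num_steps=10)
print([round(x, 3) for x in result])
```

The loop body is the same as in ordinary chunk generation. The only difference is the two extra arguments to the model, and passing an empty prefix recovers the unconditional case.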
One of Helix's most appealing properties is its simplicity. The training modification is remarkably small -- the paper describes it as "a few lines of code" on top of any flow matching policy.
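The changes are: sample a prefix length K for each training example, build the prefix and its mask from the ground-truth chunk, and pass both to the model as extra inputs.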
That's it. No architectural changes. No new modules. No new hyperparameters beyond Kmax (which the paper sets to match the commitment horizon used at deployment). The model architecture remains identical to a standard flow matching VLA.
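```python
# Helix training modification (sketch; tensor shapes and the model signature are assumed)
import random

import torch
import torch.nn.functional as F

def helix_training_step(obs, actions_gt, model, k_max):
    """actions_gt: ground-truth chunk of shape (H, action_dim)."""
    H = actions_gt.shape[0]
    K = random.randint(0, k_max)              # random prefix length; K=0 means no prefix
    mask = torch.zeros(H, 1, dtype=actions_gt.dtype, device=actions_gt.device)
    mask[:K] = 1.0                            # 1 = position belongs to the prefix
    prefix = actions_gt * mask                # ground-truth prefix, zero-padded to length H
    noise = torch.randn_like(actions_gt)
    tau = random.uniform(0.0, 1.0)            # flow time: 0 = pure noise, 1 = data
    noisy = tau * actions_gt + (1 - tau) * noise
    # Prefix and mask are passed as extra inputs alongside the noisy chunk
    pred = model(obs, noisy, tau, prefix, mask)
    return F.mse_loss(pred, actions_gt - noise)   # standard flow matching velocity target
```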
Helix is evaluated on both simulation benchmarks and real-world dexterous tasks. The headline result: Helix matches or exceeds RTC's performance with significantly less inference compute.
Espresso making is a bimanual task in which the robot must load a coffee pod, close the machine, place a cup, and press the brew button. It is a multi-step sequential task where each step depends on precise placement in the previous step. Helix handles the transitions between steps smoothly because it was trained to generate continuations that are consistent with committed actions.
Box assembly involves folding a flat cardboard template into a 3D box. This requires bimanual coordination and precise force control. Helix achieves comparable success rates to RTC while using fewer denoising steps -- meaning the robot can operate at a higher effective control frequency.
| Method | Denoising steps | Espresso | Box assembly |
|---|---|---|---|
| Open-loop | 10 | Baseline | Baseline |
| RTC (inpainting) | 20-30 | Good | Good |
| Helix | 10 | Good+ | Good |
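To get a feel for what the step counts in the table mean for latency, here is a back-of-the-envelope calculation. The per-step cost is an assumed value chosen only for illustration, not a measurement from the paper, and `chunk_latency_ms` is a hypothetical helper.

```python
def chunk_latency_ms(denoising_steps, ms_per_step):
    """Latency of generating one chunk if every denoising step costs ms_per_step."""
    return denoising_steps * ms_per_step

MS_PER_STEP = 5.0  # assumed value for illustration only, not a measurement

for method, steps in [("Helix", 10), ("RTC, low end", 20), ("RTC, high end", 30)]:
    latency = chunk_latency_ms(steps, MS_PER_STEP)
    print(f"{method}: {steps} steps -> {latency:.0f} ms per chunk, "
          f"{1000.0 / latency:.1f} chunks/s")
```

Under this linear cost model the ratio between methods does not depend on the assumed per-step cost: 20 to 30 steps take two to three times as long as 10, so the rate at which new chunks can be produced drops by the same factor.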
Helix and RTC are not competitors -- they're complementary solutions at different points in the design space. The right choice depends on your constraints.
Helix fits into a broader trend in robot learning: moving intelligence from inference time to training time. This same pattern appears throughout deep learning.
| Domain | Inference-time | Training-time |
|---|---|---|
| Robot chunking | RTC (inpainting) | Helix (prefix conditioning) |
| Language models | Chain-of-thought prompting | Training on reasoning traces |
| Image generation | Classifier guidance | Classifier-free guidance (CFG) |
| RL | MCTS at inference | Distilling MCTS into policy |
In each case, the training-time approach produces a model that handles the task natively, while the inference-time approach bolts it on externally. The training-time version is usually simpler at deployment and more computationally efficient, but requires access to the training loop.