Physical Intelligence + UC Berkeley, 2025

Helix: Training-Time Action Conditioning

RTC inpaints at inference time. Helix asks: what if the model learned to handle action prefixes during training? By conditioning on previously-committed actions during training, Helix achieves the same consistency as RTC with zero inference overhead.

Prerequisites: pi-0 basics + RTC concepts
8 Chapters · 3+ Simulations

Chapter 0: RTC's Overhead

RTC solved the latency problem elegantly: freeze the committed actions, inpaint the rest. But it introduced two new problems that are easy to overlook.

Problem 1: Inference overhead. Inpainting requires extra computation during denoising. At each denoising step, you must clamp the frozen prefix back into place. More importantly, the quality of the inpainted tail depends on using enough denoising steps to properly condition on the prefix. In practice, RTC needs 2-3x as many denoising steps as standard chunking to achieve good consistency. On a resource-constrained robot, those extra steps eat into the already-tight compute budget.

Problem 2: Distribution mismatch. The flow matching model was trained to generate full, unconditional action chunks. At inference time, RTC asks it to generate chunks conditioned on a frozen prefix -- a task it never saw during training. This works surprisingly well (flow matching is flexible), but it's fundamentally asking the model to do something it wasn't optimized for. There's a gap between what the model learned and what it's asked to do.

The mismatch is subtle but real: During training, the model learns p(full chunk | observation). During RTC inference, you're sampling from p(tail | prefix, observation) using inpainting as an approximation. This approximation is good but not exact -- it can produce slightly inconsistent transitions and occasionally fails to properly account for the prefix.
What are RTC's two main limitations?

Chapter 1: The Training-Time Insight

Helix's key insight is deceptively simple: if the robot will always have a prefix of committed actions at inference time, why not simulate that during training?

During real deployment, the robot is always in one of two states: (1) starting from scratch (no prefix), or (2) partway through a previous chunk (prefix exists). Case 2 is the common case. RTC handles it at inference time with inpainting. Helix handles it at training time by modifying the training data.

The idea: during training, randomly sample an action prefix from the ground-truth trajectory and provide it as additional input to the model. The model learns to generate the full chunk conditioned on this prefix. At inference time, no inpainting is needed -- the model already knows how to continue from a prefix because it practiced doing exactly that during training.

The analogy: Imagine learning to write sentences. RTC is like always writing from scratch but then erasing and rewriting the first few words to match what came before. Helix is like practicing with sentence-completion exercises: "Given the start 'The cat sat on the...', complete the sentence." The second approach is obviously more natural. The model internalizes continuation, not post-hoc editing.
What is Helix's core insight for eliminating inference-time inpainting?

Chapter 2: How Prefix Conditioning Works

During training, each action chunk in the dataset is a sequence [a_0, a_1, ..., a_{H-1}]. Helix modifies the training procedure as follows:

Step 1: Sample prefix length
Randomly choose K from {0, 1, ..., Kmax}. K=0 means no prefix (standard training).
Step 2: Extract prefix
Take the first K actions [a_0, ..., a_{K-1}] as the prefix. These are ground-truth actions from the dataset.
Step 3: Condition
Feed the prefix into the model alongside the observation and noise. The model must denoise the FULL chunk while being consistent with the prefix.
Step 4: Loss
Compute flow matching loss on the ENTIRE chunk. The model is penalized if its output contradicts the prefix.

The crucial detail: K is sampled randomly during training, including K=0. This means the model learns both the unconditional case (fresh start) and the conditional case (continuing a prefix). At inference time, you just set K to whatever the actual prefix length is.

How the prefix is provided: The prefix actions are concatenated with the noisy action chunk as an additional input to the flow matching denoising network. Think of it like giving the model a "hint" -- the first K answers in a fill-in-the-blank test. During training, K varies randomly. During inference, K matches the actual number of committed actions from the previous chunk.
[Interactive simulation: Prefix Conditioning During Training. Each sample draws a random prefix length, as in training; the model must generate the full chunk consistent with the green prefix.]

Why does Helix randomly include K=0 (no prefix) during training?

Chapter 3: Inference-Time vs Training-Time

Let's be precise about the difference. Both RTC and Helix solve the same problem: generating an action chunk that is consistent with previously committed actions. They differ in when they solve it.

RTC (inference-time)

The model is trained normally (unconditional chunks). At inference time, consistency is enforced by clamping the prefix during denoising. Each denoising step generates a full chunk, then the prefix portion is overwritten with the committed values. This is an external constraint applied to a model that doesn't know about prefixes.
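A minimal sketch of the clamping loop described above, assuming Euler integration of the flow from noise (tau = 0) to data (tau = 1). The names `rtc_clamped_denoise` and `velocity_fn` are hypothetical, and the toy velocity field only stands in for a trained policy. It illustrates the overwrite-after-each-step idea as this lesson describes it, and is not meant as RTC's actual implementation.

```python
import numpy as np

def rtc_clamped_denoise(velocity_fn, obs, committed_prefix, horizon, num_steps, rng):
    """Denoise a full chunk, overwriting the prefix after every step.

    velocity_fn(obs, chunk, tau) is a hypothetical stand-in for a policy
    trained WITHOUT prefix inputs; it returns a velocity shaped like chunk.
    """
    k, action_dim = committed_prefix.shape
    chunk = rng.standard_normal((horizon, action_dim))  # start from pure noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * dt
        chunk = chunk + dt * velocity_fn(obs, chunk, tau)  # one Euler step
        chunk[:k] = committed_prefix                       # external clamp
    return chunk

def toy_velocity(obs, chunk, tau):
    # Toy field (assumption): pulls every action toward zero.
    return -chunk

rng = np.random.default_rng(0)
prefix = np.ones((3, 2))  # three committed 2-D actions
chunk = rtc_clamped_denoise(toy_velocity, None, prefix,
                            horizon=8, num_steps=20, rng=rng)
```

The model never receives the prefix as an input here. Consistency comes only from the overwrite inside the loop, which is why the tail needs enough steps to adapt to it.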

Helix (training-time)

The model is trained with random prefixes as input. At inference time, the committed prefix is simply fed as input -- no clamping, no inpainting, no extra steps. The model natively generates continuations because it learned to do so during training.
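A minimal sketch of what this inference loop could look like, assuming Euler integration from noise (tau = 0) to data (tau = 1) and a zero-padded prefix plus binary mask as extra inputs. `helix_denoise`, `velocity_fn`, and the toy velocity field are hypothetical names for illustration, not the paper's code.

```python
import numpy as np

def helix_denoise(velocity_fn, obs, committed_prefix, horizon, num_steps, rng):
    """Denoise a chunk with the committed prefix supplied as a model input.

    velocity_fn(obs, chunk, tau, prefix, mask) is a hypothetical stand-in
    for a policy trained with prefix conditioning.
    """
    k, action_dim = committed_prefix.shape
    prefix = np.zeros((horizon, action_dim))
    prefix[:k] = committed_prefix                   # zero-pad to chunk length
    mask = (np.arange(horizon) < k).astype(float)   # 1 = committed position
    chunk = rng.standard_normal((horizon, action_dim))
    dt = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * dt
        # Plain Euler step: no clamping or overwriting of the prefix.
        chunk = chunk + dt * velocity_fn(obs, chunk, tau, prefix, mask)
    return chunk

def toy_velocity(obs, chunk, tau, prefix, mask):
    # Toy field (assumption): straight-line velocity toward a fixed target
    # that equals the prefix on committed positions and zero elsewhere.
    target = mask[:, None] * prefix
    return (target - chunk) / (1.0 - tau)

rng = np.random.default_rng(0)
committed = np.full((3, 2), 0.5)
chunk = helix_denoise(toy_velocity, None, committed,
                      horizon=8, num_steps=10, rng=rng)
```

In this toy example the output matches the prefix by construction of `toy_velocity`. With a real policy, that agreement is something the model has to learn during training; nothing in the loop enforces it.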

[Interactive simulation: RTC vs Helix, Side by Side. Compares how each method handles the frozen prefix as the denoising process advances step by step.]

The performance difference: RTC requires extra denoising steps to get good consistency. Helix uses the standard number of steps because the model already knows what to do with the prefix. In practice, Helix achieves equal or better consistency with at least 50% fewer denoising steps (10 versus RTC's 20-30). Halving the step count roughly halves denoising time, which is up to a 2x speedup of the denoising loop on top of the already-async pipeline.
What is the practical advantage of training-time conditioning over inference-time inpainting?

Chapter 4: Implementation

One of Helix's most appealing properties is its simplicity. The training modification is remarkably small -- the paper describes it as "a few lines of code" on top of any flow matching policy.

The changes are:

Data loading
Sample a random K for each training example. Extract the first K actions from the ground-truth chunk.
Model input
Provide the prefix alongside the noisy chunk, zero-padded to the full chunk length (all zeros when K=0). Add a binary mask indicating which positions hold a prefix action.
Loss
Standard flow matching loss on all positions. Optionally upweight the prefix positions to enforce strict consistency.

That's it. No architectural changes. No new modules. No new hyperparameters beyond Kmax (which the paper sets to match the commitment horizon used at deployment). The model architecture remains identical to a standard flow matching VLA.
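The optional upweighting in the Loss item above could be sketched as follows. This is one possible formulation, not the paper's: `weighted_flow_loss` and `prefix_weight` are hypothetical names, and normalizing by the sum of weights is a design choice made here so that the loss scale stays comparable across different K.

```python
import numpy as np

def weighted_flow_loss(pred, target, mask, prefix_weight=1.0):
    """Mean squared flow matching error with optional prefix upweighting.

    pred, target: arrays of shape (H, action_dim).
    mask: array of shape (H,), 1.0 on prefix positions and 0.0 elsewhere.
    prefix_weight: hypothetical knob; 1.0 recovers the unweighted loss.
    """
    per_position = np.mean((pred - target) ** 2, axis=-1)  # shape (H,)
    weights = 1.0 + (prefix_weight - 1.0) * mask           # prefix_weight on prefix, 1 elsewhere
    return float(np.sum(weights * per_position) / np.sum(weights))

# Example: the only error sits on the first (prefix) position.
pred = np.zeros((4, 2))
target = np.zeros((4, 2))
target[0] = 1.0
mask = np.array([1.0, 0.0, 0.0, 0.0])

loss_plain = weighted_flow_loss(pred, target, mask)                  # 1/4
loss_up = weighted_flow_loss(pred, target, mask, prefix_weight=4.0)  # 4/7
```

With the same prediction error, the upweighted loss is larger because the error falls on a prefix position.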

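```python
import random

import torch
import torch.nn.functional as F

K_MAX = 8  # illustrative value; the paper matches it to the deployment commitment horizon

# Helix training modification (sketch for a single unbatched chunk)
def helix_training_step(obs, actions_gt, model):
    """actions_gt has shape (H, action_dim); returns the scalar flow matching loss."""
    H = actions_gt.shape[0]
    K = random.randint(0, min(K_MAX, H))   # random prefix length, both ends inclusive
    prefix = torch.zeros_like(actions_gt)
    prefix[:K] = actions_gt[:K]            # ground-truth prefix, zero-padded to length H
    mask = (torch.arange(H, device=actions_gt.device) < K).float()  # 1 = has prefix
    noise = torch.randn_like(actions_gt)
    tau = random.uniform(0.0, 1.0)         # flow time: 0 = pure noise, 1 = clean actions
    noisy = tau * actions_gt + (1 - tau) * noise
    # Prefix and mask are passed as extra inputs alongside the noisy chunk
    pred = model(obs, noisy, tau, prefix, mask)
    loss = F.mse_loss(pred, actions_gt - noise)  # standard flow loss on all positions
    return loss
```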
The mask is important: Without the binary mask, the model can't distinguish "prefix action = 0.0" from "no prefix at this position." The mask lets the model know which positions carry real prefix information and which are just padding. This is the same pattern used in language models with attention masks.
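A small sketch of that ambiguity, assuming the prefix is zero-padded to the chunk length as described above. `build_prefix_inputs` is a hypothetical helper name.

```python
import numpy as np

def build_prefix_inputs(prefix_actions, horizon):
    """Zero-pad a (K, action_dim) prefix to (horizon, action_dim) and build its mask."""
    k, action_dim = prefix_actions.shape
    padded = np.zeros((horizon, action_dim))
    padded[:k] = prefix_actions
    mask = (np.arange(horizon) < k).astype(float)  # 1.0 where a real prefix action exists
    return padded, mask

no_prefix = np.zeros((0, 2))    # K = 0: nothing committed
zero_prefix = np.zeros((2, 2))  # K = 2: two committed actions, both exactly 0.0

padded_a, mask_a = build_prefix_inputs(no_prefix, horizon=4)
padded_b, mask_b = build_prefix_inputs(zero_prefix, horizon=4)

print(np.array_equal(padded_a, padded_b))  # True: the padded prefixes are identical
print(np.array_equal(mask_a, mask_b))      # False: only the mask tells the cases apart
```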
What architectural changes does Helix require compared to a standard flow matching policy?

Chapter 5: Results

Helix is evaluated on both simulation benchmarks and real-world dexterous tasks. The headline result: Helix matches or exceeds RTC's performance with significantly less inference compute.

Espresso making

A bimanual task requiring the robot to load a coffee pod, close the machine, place a cup, and press the brew button. This is a multi-step sequential task where each step depends on precise placement of the previous step. Helix handles the transitions between steps smoothly because it was trained to generate continuations that are consistent with committed actions.

Box assembly

Folding a flat cardboard template into a 3D box. This requires bimanual coordination and precise force control. Helix achieves comparable success rates to RTC while using fewer denoising steps -- meaning the robot can operate at a higher effective control frequency.

Method           | Denoising steps | Espresso | Box assembly
Open-loop        | 10              | Baseline | Baseline
RTC (inpainting) | 20-30           | Good     | Good
Helix            | 10              | Good+    | Good
The efficiency win: Helix achieves RTC-level performance with standard denoising step counts (10 steps vs RTC's 20-30). This isn't just about speed -- fewer denoising steps means the async pipeline's commitment horizon K can be smaller, which means more frequent replanning, which means better reactivity. It's a cascading improvement.
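A back-of-the-envelope sketch of that cascade, under two loud assumptions: inference latency is dominated by the denoising loop, and the commitment horizon must cover every action executed while the next chunk is being computed. The 10 ms per step and 50 Hz control rate are illustrative numbers, not figures from the paper, and `min_commit_horizon` is a hypothetical helper.

```python
import math

def min_commit_horizon(denoise_steps, ms_per_step, control_hz):
    """Smallest number of committed actions that covers the inference latency."""
    latency_ms = denoise_steps * ms_per_step  # assumes denoising dominates latency
    control_period_ms = 1000.0 / control_hz   # time budget of one action
    return math.ceil(latency_ms / control_period_ms)

# Illustrative numbers only: 10 ms per denoising step, 50 Hz control.
for name, steps in [("Helix", 10), ("RTC", 20), ("RTC", 30)]:
    k = min_commit_horizon(steps, ms_per_step=10.0, control_hz=50)
    print(f"{name:5s} {steps:2d} steps -> commit at least {k} actions")
```

Under these assumptions, 10 steps force a commitment of 5 actions while 20-30 steps force 10-15, so the fewer-step policy can replan two to three times as often.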
Why does using fewer denoising steps improve reactivity, not just speed?

Chapter 6: When to Use Helix vs RTC

Helix and RTC are not competitors -- they're complementary solutions at different points in the design space. The right choice depends on your constraints.

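Use Helix when: you control the training loop and can train or fine-tune the policy with prefix conditioning, and inference compute is tight enough that extra denoising steps hurt.

Use RTC when: the policy is already trained and you cannot retrain it, since inpainting needs no change to training.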

They can be combined: You can train with Helix-style prefix conditioning AND use RTC-style inpainting at inference time as an additional consistency check. The paper shows this combination sometimes outperforms either alone, though the gains are marginal since Helix already handles the conditioning well.
When would you choose RTC over Helix?

Chapter 7: Connections

Helix fits into a broader trend in robot learning: moving intelligence from inference time to training time. This same pattern appears throughout deep learning.

Domain           | Inference-time              | Training-time
Robot chunking   | RTC (inpainting)            | Helix (prefix conditioning)
Language models  | Chain-of-thought prompting  | Training on reasoning traces
Image generation | Classifier guidance         | Classifier-free guidance (CFG)
RL               | MCTS at inference           | Distilling MCTS into policy

In each case, the training-time approach produces a model that handles the task natively, while the inference-time approach bolts it on externally. The training-time version is usually simpler at deployment and more computationally efficient, but requires access to the training loop.

[Interactive simulation: Inference-Time vs Training-Time Conditioning. Toggles between RTC and Helix to show the difference in the denoising pipeline: RTC clamps after each step; Helix conditions natively.]
Related lessons: RTC · pi-0 · pi-0.5 · Gleams: Flow Matching
"If you know the constraint at deployment, teach it during training."
— The principle underlying Helix, classifier-free guidance, and distillation