Physical Intelligence + UC Berkeley, 2025

Helix: Training-Time Action Conditioning

RTC inpaints at inference time. Helix asks: what if the model learned to handle action prefixes during training? By conditioning on previously committed actions during training, Helix achieves the same consistency as RTC with no added inference overhead.

Prerequisites: pi-0 basics + RTC concepts


Chapter 0: RTC's Overhead

RTC solved the latency problem elegantly: freeze the committed actions, inpaint the rest. But it introduced two new problems that are easy to overlook.

Problem 1: Inference overhead. Inpainting requires extra computation during denoising. At each denoising step, you must clamp the frozen prefix back into place. More importantly, the quality of the inpainted tail depends on using enough denoising steps to properly condition on the prefix. In practice, RTC needs 2-3x as many denoising steps as standard chunking to achieve good consistency. On a resource-constrained robot, those extra steps eat into an already tight compute budget.

Problem 2: Distribution mismatch. The flow matching model was trained to generate full, unconditional action chunks. At inference time, RTC asks it to generate chunks conditioned on a frozen prefix -- a task it never saw during training. This works surprisingly well (flow matching is flexible), but it's fundamentally asking the model to do something it wasn't optimized for. There's a gap between what the model learned and what it's asked to do.

The mismatch is subtle but real: During training, the model learns p(full chunk | observation). During RTC inference, you're sampling from p(tail | prefix, observation) using inpainting as an approximation. This approximation is good but not exact -- it can produce slightly inconsistent transitions and occasionally fails to properly account for the prefix.
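The gap can be written down compactly. The notation here is an assumption added for illustration, not taken from the paper: o is the observation, a_{0:H-1} is the full chunk of H actions, and K is the number of already-committed actions.

```latex
\begin{align*}
\text{learned in training:} \quad & p_\theta(a_{0:H-1} \mid o) \\
\text{needed at inference:} \quad & p(a_{K:H-1} \mid a_{0:K-1},\, o)
  = \frac{p(a_{0:H-1} \mid o)}{p(a_{0:K-1} \mid o)}
\end{align*}
```

Inpainting tries to draw samples from the second line using only a model of the first, which is where the approximation error comes from.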

What are RTC's two main limitations?

Inference overhead from extra denoising steps and distributional mismatch between unconditional training and conditional inference
It requires too much training data and a larger model
It only works in simulation

Chapter 1: The Training-Time Insight

Helix's key insight is deceptively simple: if the robot will always have a prefix of committed actions at inference time, why not simulate that during training?

During real deployment, the robot is always in one of two states: (1) starting from scratch (no prefix), or (2) partway through a previous chunk (prefix exists). Case 2 is the common case. RTC handles it at inference time with inpainting. Helix handles it at training time by modifying the training data.

The idea: during training, randomly sample an action prefix from the ground-truth trajectory and provide it as additional input to the model. The model learns to generate the full chunk conditioned on this prefix. At inference time, no inpainting is needed -- the model already knows how to continue from a prefix because it practiced doing exactly that during training.

The analogy: Imagine learning to write sentences. RTC is like always writing from scratch but then erasing and rewriting the first few words to match what came before. Helix is like practicing with sentence-completion exercises: "Given the start 'The cat sat on the...', complete the sentence." The second approach is obviously more natural. The model internalizes continuation, not post-hoc editing.

What is Helix's core insight for eliminating inference-time inpainting?

Use a faster GPU for inpainting
Simulate the action prefix during training so the model learns to generate continuations natively, eliminating the need for inference-time conditioning
Reduce the chunk size to avoid the overlap problem

Chapter 2: How Prefix Conditioning Works

During training, each action chunk in the dataset is a sequence of H actions [a₀, a₁, ..., a_{H-1}]. Helix modifies the training procedure as follows:

Step 1: Sample prefix length

Randomly choose K from {0, 1, ..., K_max}. K=0 means no prefix (standard training).

↓

Step 2: Extract prefix

Take the first K actions [a₀, ..., a_{K-1}] as the prefix. These are ground-truth actions from the dataset.

↓

Step 3: Condition

Feed the prefix into the model alongside the observation and noise. The model must denoise the FULL chunk while being consistent with the prefix.

↓

Step 4: Loss

Compute flow matching loss on the ENTIRE chunk. The model is penalized if its output contradicts the prefix.

The crucial detail: K is sampled randomly during training, including K=0. This means the model learns both the unconditional case (fresh start) and the conditional case (continuing a prefix). At inference time, you just set K to whatever the actual prefix length is.

How the prefix is provided: The prefix actions are concatenated with the noisy action chunk as an additional input to the flow matching denoising network. Think of it like giving the model a "hint" -- the first K answers in a fill-in-the-blank test. During training, K varies randomly. During inference, K matches the actual number of committed actions from the previous chunk.
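Steps 1 and 2 and the "hint" input described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the function name `sample_prefix_inputs`, the list-of-lists action format, and the zero-padding of unused positions are all assumptions.

```python
import random


def sample_prefix_inputs(actions_gt, k_max, rng=random):
    """Return (K, padded_prefix, mask) for one ground-truth chunk.

    actions_gt: list of H actions, each a list of floats.
    k_max: largest prefix length to sample.
    rng: any object with a randint(a, b) method (hypothetical hook for seeding).
    """
    horizon = len(actions_gt)
    action_dim = len(actions_gt[0])
    # randint is inclusive on both ends, so K = 0 (no prefix) is a valid draw.
    k = rng.randint(0, min(k_max, horizon))
    # 1 marks positions that carry a real prefix action, 0 marks padding.
    mask = [1 if i < k else 0 for i in range(horizon)]
    # Copy the first K ground-truth actions; fill the rest with zeros.
    padded_prefix = [
        list(actions_gt[i]) if i < k else [0.0] * action_dim
        for i in range(horizon)
    ]
    return k, padded_prefix, mask
```

At deployment the same padded prefix and mask would be built from the committed actions of the previous chunk, with K fixed rather than sampled.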

[Interactive demo: Prefix Conditioning During Training. Random prefix lengths are drawn during training; the model must generate the full chunk consistent with the prefix.]

Why does Helix randomly include K=0 (no prefix) during training?

So the model learns both fresh-start generation (no prefix) and continuation (with prefix) -- at deployment, the first chunk has no prefix while all subsequent chunks do
To reduce training time
K=0 is only used for validation, not training

Chapter 3: Inference-Time vs Training-Time

Let's be precise about the difference. Both RTC and Helix solve the same problem: generating an action chunk that is consistent with previously committed actions. They differ in when they solve it.

RTC (inference-time)

The model is trained normally (unconditional chunks). At inference time, consistency is enforced by clamping the prefix during denoising. Each denoising step generates a full chunk, then the prefix portion is overwritten with the committed values. This is an external constraint applied to a model that doesn't know about prefixes.

Helix (training-time)

The model is trained with random prefixes as input. At inference time, the committed prefix is simply fed as input -- no clamping, no inpainting, no extra steps. The model natively generates continuations because it learned to do so during training.
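The two sampling loops can be put side by side. The sketch below follows the simplified description in this lesson (overwrite the prefix after every step); a production inpainting scheme may differ. Actions are scalars for brevity, and `velocity_fn` and `cond_velocity_fn` are hypothetical stand-ins for the trained flow matching network.

```python
import random


def sample_with_clamping(velocity_fn, prefix, horizon, num_steps, rng=random):
    """Inference-time approach: denoise the full chunk, then overwrite the prefix."""
    chunk = [rng.gauss(0.0, 1.0) for _ in range(horizon)]  # start from noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * dt
        v = velocity_fn(chunk, tau)  # the model knows nothing about the prefix
        chunk = [a + dt * vi for a, vi in zip(chunk, v)]
        chunk[:len(prefix)] = prefix  # clamp committed actions back into place
    return chunk


def sample_with_conditioning(cond_velocity_fn, prefix, horizon, num_steps, rng=random):
    """Training-time approach: prefix and mask are model inputs, no clamping."""
    k = len(prefix)
    padded = list(prefix) + [0.0] * (horizon - k)
    mask = [1] * k + [0] * (horizon - k)
    chunk = [rng.gauss(0.0, 1.0) for _ in range(horizon)]  # start from noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * dt
        v = cond_velocity_fn(chunk, tau, padded, mask)
        chunk = [a + dt * vi for a, vi in zip(chunk, v)]
    return chunk
```

The only structural difference is the clamp line in the first loop and the two extra arguments in the second; the consistency work has moved from the sampler into the network.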

[Interactive demo: RTC vs Helix, side by side. Compares how each method handles the frozen prefix as denoising advances.]

The performance difference: RTC requires extra denoising steps to get good consistency. Helix uses the standard number of steps because the model already knows what to do with the prefix. In practice, Helix achieves equal or better consistency with 50% fewer denoising steps. Since denoising cost grows roughly linearly with the number of steps, that halves the denoising compute per chunk (about a 2x speedup of that stage), on top of the already-async pipeline.

What is the practical advantage of training-time conditioning over inference-time inpainting?

Training is faster
The model is smaller
No extra denoising steps needed at inference -- the model natively generates prefix-consistent continuations, giving ~50% fewer denoising steps for equivalent quality

Chapter 4: Implementation

One of Helix's most appealing properties is its simplicity. The training modification is remarkably small -- the paper describes it as "a few lines of code" on top of any flow matching policy.

The changes are:

Data loading

Sample a random K for each training example. Extract the first K actions from the ground-truth chunk.

↓

Model input

Zero-pad the prefix to the full chunk length (all zeros when K=0) and concatenate it with the noisy chunk. Add a binary mask indicating which positions hold a real prefix action.

↓

Loss

Standard flow matching loss on all positions. Optionally upweight the prefix positions to enforce strict consistency.

That's it. No architectural changes. No new modules. No new hyperparameters beyond K_max (which the paper sets to match the commitment horizon used at deployment). The model architecture remains identical to a standard flow matching VLA.

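python
# Helix training modification (PyTorch sketch; K_MAX and the model signature are illustrative)
import random
import torch
import torch.nn.functional as F

K_MAX = 8  # example value; set to the commitment horizon used at deployment

def helix_training_step(obs, actions_gt, model):
    """One training step. actions_gt has shape (H, action_dim)."""
    H = actions_gt.shape[0]
    K = random.randint(0, min(K_MAX, H))          # random prefix length; 0 = no prefix
    mask = torch.zeros(H, 1, dtype=actions_gt.dtype, device=actions_gt.device)
    mask[:K] = 1.0                                # 1 = position carries a prefix action
    prefix = actions_gt * mask                    # ground-truth prefix, zero-padded to length H
    noise = torch.randn_like(actions_gt)
    tau = random.uniform(0.0, 1.0)                # flow time: 0 = pure noise, 1 = clean data
    noisy = tau * actions_gt + (1 - tau) * noise
    # Prefix and mask go in as extra inputs alongside the noisy chunk
    pred = model(obs, noisy, tau, prefix, mask)
    loss = F.mse_loss(pred, actions_gt - noise)   # standard flow loss on the whole chunk
    return loss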

The mask is important: Without the binary mask, the model can't distinguish "prefix action = 0.0" from "no prefix at this position." The mask tells the model which positions carry real prefix information and which are just padding. This is the same idea as the padding masks that language models pass alongside their inputs.
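The optional upweighting of prefix positions mentioned in the loss step could look like the following. This is a sketch under assumptions: `prefix_weight` is a hypothetical knob (it would be one extra hyperparameter if used), and the weighted-mean normalization is one reasonable choice rather than anything the paper specifies.

```python
def weighted_flow_loss(pred, target, mask, prefix_weight=2.0):
    """Weighted mean squared error over chunk positions.

    pred, target: lists of H per-position vectors (lists of floats).
    mask: list of H ints, 1 where a prefix action was provided.
    prefix_weight: hypothetical weight for prefix positions; 1.0 recovers
        the plain, unweighted loss.
    """
    total, weight_sum = 0.0, 0.0
    for p_vec, t_vec, m in zip(pred, target, mask):
        w = prefix_weight if m else 1.0
        # Mean squared error across the action dimensions at this position.
        sq_err = sum((p - t) ** 2 for p, t in zip(p_vec, t_vec)) / len(p_vec)
        total += w * sq_err
        weight_sum += w
    return total / weight_sum
```

Dividing by the summed weights keeps the loss on the same scale regardless of how many prefix positions a given sample happens to have.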

What architectural changes does Helix require compared to a standard flow matching policy?

None -- just additional input channels for the prefix and a binary mask, with no changes to the model architecture itself
A separate encoder for the prefix
A new attention mechanism

Chapter 5: Results

Helix is evaluated on both simulation benchmarks and real-world dexterous tasks. The headline result: Helix matches or exceeds RTC's performance with significantly less inference compute.

Espresso making

A bimanual task requiring the robot to load a coffee pod, close the machine, place a cup, and press the brew button. This is a multi-step sequential task where each step depends on precise placement of the previous step. Helix handles the transitions between steps smoothly because it was trained to generate continuations that are consistent with committed actions.

Box assembly

Folding a flat cardboard template into a 3D box. This requires bimanual coordination and precise force control. Helix achieves comparable success rates to RTC while using fewer denoising steps -- meaning the robot can operate at a higher effective control frequency.

Method	Denoising steps	Espresso	Box assembly
Open-loop	10	Baseline	Baseline
RTC (inpainting)	20-30	Good	Good
Helix	10	Good+	Good

The efficiency win: Helix achieves RTC-level performance with standard denoising step counts (10 steps vs RTC's 20-30). This isn't just about speed -- fewer denoising steps means the async pipeline's commitment horizon K can be smaller, which means more frequent replanning, which means better reactivity. It's a cascading improvement.
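The cascade can be made concrete with a little arithmetic. The sketch below assumes the commitment horizon is the number of control ticks that elapse during one inference call; the latency figures are invented for illustration and are not measurements from the paper.

```python
def commitment_horizon(num_steps, step_ms, overhead_ms, control_period_ms):
    """Smallest number of control ticks that covers one inference call.

    All durations are integers in milliseconds.
    """
    latency_ms = overhead_ms + num_steps * step_ms
    return -(-latency_ms // control_period_ms)  # integer ceiling division


# Hypothetical budget: 5 ms per denoising step, 30 ms fixed overhead
# (encoders, I/O), and a 20 ms control period (50 Hz).
for method, steps in [("Helix", 10), ("RTC", 20), ("RTC", 30)]:
    k = commitment_horizon(steps, step_ms=5, overhead_ms=30, control_period_ms=20)
    print(f"{method:5s} {steps:2d} steps -> K >= {k} ticks")
```

Under these made-up numbers, 10 steps needs a horizon of 4 ticks while 20-30 steps needs 7-9, so the robot with fewer steps commits to roughly half as many actions before it can react to a new observation.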

Why does using fewer denoising steps improve reactivity, not just speed?

Faster inference means a smaller commitment horizon K is feasible -- the robot can replan more frequently, reacting to changes sooner
Fewer steps produce smoother trajectories
The model uses less memory

Chapter 6: When to Use Helix vs RTC

Helix and RTC are not competitors -- they're complementary solutions at different points in the design space. The right choice depends on your constraints.

Use Helix when:

You control the training pipeline. Helix requires retraining (or fine-tuning) the model. If you're training from scratch anyway, it's nearly free to add.
Inference compute is tight. Helix needs no extra denoising steps, so it's better for resource-constrained robots.
You want maximum reactivity. Fewer denoising steps = smaller K = more frequent replanning.

Use RTC when:

You have a pre-trained model you can't retrain. RTC works with any flow matching policy out of the box. No training modifications needed.
You need a quick fix. RTC is purely an inference-time technique. Deploy it today, no retraining required.
The model is too expensive to retrain. For large foundation models, retraining is costly. RTC sidesteps this entirely.

They can be combined: You can train with Helix-style prefix conditioning AND use RTC-style inpainting at inference time as an additional consistency check. The paper shows this combination sometimes outperforms either alone, though the gains are marginal since Helix already handles the conditioning well.

When would you choose RTC over Helix?

When you have a pre-trained model you can't retrain -- RTC works at inference time with any flow matching policy, requiring no training changes
When you need the best possible performance
When inference speed matters most

Chapter 7: Connections

Helix fits into a broader trend in robot learning: moving intelligence from inference time to training time. This same pattern appears throughout deep learning.

Domain	Inference-time	Training-time
Robot chunking	RTC (inpainting)	Helix (prefix conditioning)
Language models	Chain-of-thought prompting	Training on reasoning traces
Image generation	Classifier guidance	Classifier-free guidance (CFG)
RL	MCTS at inference	Distilling MCTS into policy

In each case, the training-time approach produces a model that handles the task natively, while the inference-time approach bolts it on externally. The training-time version is usually simpler at deployment and more computationally efficient, but requires access to the training loop.

[Interactive demo: Inference-Time vs Training-Time Conditioning. RTC clamps the prefix after each denoising step; Helix conditions on it natively.]

Related lessons: RTC • pi-0 • pi-0.5 • Gleams: Flow Matching

"If you know the constraint at deployment, teach it during training."

— The principle underlying Helix, classifier-free guidance, and distillation
