Action chunking policies are too slow for real-time control. RTC fixes this by computing the next chunk while executing the current one, using flow matching to "inpaint" the uncommitted portion of each chunk so that consecutive plans blend seamlessly.
Modern VLA policies like pi-0 predict actions in chunks -- an entire sequence of 50 future actions at once. This gives them temporal coherence: the robot plans a smooth trajectory, not just a single jittery step. But it creates a problem that is easy to overlook: the model is too slow to keep up.
Generating a 50-step action chunk through flow matching takes 50-200 milliseconds of GPU compute. At 50 Hz control, a new observation arrives every 20 ms, so a single inference call spans roughly 2.5 to 10 control steps. By the time the policy finishes computing, the world has moved on. The observation used to plan is already stale.
The naive fix: execute the chunk open-loop -- just play all 50 actions without replanning. This works in gentle, static environments. But in dynamic settings -- catching a ball, avoiding a moving obstacle, reacting to a human -- open-loop execution is catastrophically fragile. One unexpected perturbation and the entire plan is wrong.
Chunk size is a trade-off: larger chunks are smoother but take longer to compute, so the "stale window" grows.
To understand RTC, we need to be precise about what action chunking does. An action chunking policy takes in the current observation o_t and outputs a sequence of H actions: [a_t, a_{t+1}, ..., a_{t+H-1}]. This is the chunk.
In a flow matching policy like pi-0, producing this chunk means running multiple denoising steps. You start with random noise shaped like H actions, and the model iteratively refines it into a coherent action trajectory. Each denoising step requires a full forward pass of the transformer.
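The denoising loop described above can be sketched in a few lines. This is a toy illustration, not pi-0's sampler: `toy_velocity` is a hypothetical stand-in for the transformer, chosen so that the flow provably ends at a known target, and `sample_chunk`, `target`, and the step count are all assumptions made for the example.

```python
import random


def toy_velocity(target):
    """Hypothetical stand-in for the policy network.

    Returns a velocity field whose straight-line flow carries any starting
    point to `target` at t = 1.
    """
    def velocity_fn(x, t):
        return [(goal - xi) / (1.0 - t) for xi, goal in zip(x, target)]
    return velocity_fn


def sample_chunk(velocity_fn, horizon, num_steps, rng):
    """Euler-integrate a velocity field from noise (t=0) to actions (t=1)."""
    # Start from random noise shaped like H actions.
    x = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the toy field never divides by zero
        v = velocity_fn(x, t)  # one full "forward pass" per denoising step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.1 * i for i in range(50)]  # made-up smooth trajectory
chunk = sample_chunk(toy_velocity(target), horizon=50, num_steps=5,
                     rng=random.Random(0))
```

The point of the sketch is the cost structure: `velocity_fn` is called once per denoising step, so latency scales with the number of steps times the cost of one forward pass.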
The standard approach is temporal ensembling: re-plan at every timestep, but average overlapping predictions. Query the policy at t=0, get actions [a_0 ... a_49]. At t=1, query again, get [a'_1 ... a'_50]. Execute the average of a_1 and a'_1. This reduces jitter but doesn't solve latency: you're still computing a fresh chunk at every step.
The alternative is to replan less often and overlap computation with execution: start computing the next chunk while the current one is still running. This is exactly what RTC does, with a clever trick to ensure that when you do replan, the new chunk is consistent with the actions you already committed to executing.
Here is RTC's core mechanism, built around three key parameters. H is the total chunk length (e.g. 50 actions). d is the inference delay -- the number of steps the old chunk keeps executing while the GPU computes the new one. s is the execution horizon -- the minimum number of freshly generated steps before triggering the next inference call.
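When a new chunk is generated, RTC divides it into three regions based on a guidance weight ω: a frozen prefix (the first d actions, ω = 1), which the robot executes while the new chunk is still being computed; an intermediate zone (0 < ω < 1), which overlaps the previous chunk but can still be revised; and a free tail (the last s actions, ω = 0), which extends past the end of the previous chunk.
This is action inpainting, analogous to image inpainting in diffusion models. At each denoising step, the frozen portion is held to known values from the previous chunk. Only the free tail is generated from scratch. The intermediate zone blends smoothly between the two.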
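The index bookkeeping behind the three regions can be sketched as follows. The alignment is an assumption derived from the definitions of H, d, and s above (inference starts once s actions of the previous chunk have been executed, so H - s actions are left over); `split_previous_chunk` is a hypothetical helper, not a function from the paper's code.

```python
def split_previous_chunk(prev_chunk, d, s):
    """Map the unexecuted part of the previous chunk onto the new chunk.

    Assumes inference starts after `s` actions of `prev_chunk` have run,
    and that `d` further actions run while the new chunk is computed.
    Returns (frozen_targets, blend_targets, free_length).
    """
    H = len(prev_chunk)
    if not 0 <= d <= H - s:
        raise ValueError("need 0 <= d <= H - s")
    leftover = prev_chunk[s:]   # H - s actions not yet executed
    frozen = leftover[:d]       # executed during inference: must not change
    blend = leftover[d:]        # soft targets for the intermediate zone
    free_length = s             # no previous values exist for these slots
    return frozen, blend, free_length


prev_chunk = list(range(10))    # placeholder "actions" 0..9, H = 10
frozen, blend, free_length = split_previous_chunk(prev_chunk, d=2, s=3)
```

With H = 10, d = 2, and s = 3, the new chunk's first two slots are pinned to old actions 3 and 4, the next five slots are softly guided toward old actions 5 through 9, and the last three slots are unconstrained.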
The freeze-inpaint mechanism enables something powerful: asynchronous execution. While the robot executes the frozen K actions from chunk N, the GPU is already computing chunk N+1. By the time the frozen actions are done, the next chunk is ready.
This is a pipeline, much like CPU instruction pipelining. One "stage" executes actions on the robot. Another "stage" computes the next plan on the GPU. The two stages overlap in time.
The key parameter is K, the number of frozen actions (the "commitment horizon"; it plays the role of the inference delay d defined earlier). K must be large enough that the GPU finishes computing the next chunk before the robot finishes executing the frozen actions. If inference takes 100 ms at 50 Hz control, K needs to be at least 5 steps (100 ms / 20 ms per step).
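The arithmetic for the smallest workable K is easy to write down. A minimal sketch, assuming inference latency is known in advance and constant; `min_frozen_steps` is a hypothetical helper name.

```python
import math


def min_frozen_steps(inference_latency_ms, control_hz):
    """Smallest K such that K control periods cover one inference call."""
    period_ms = 1000.0 / control_hz
    # Round up: a partially elapsed control step still needs an action.
    return math.ceil(inference_latency_ms / period_ms)


k_100ms = min_frozen_steps(100, 50)  # the example from the text
k_50ms = min_frozen_steps(50, 50)    # 2.5 periods rounds up to 3
```

In practice latency varies from call to call, so a real system would likely size K against a worst-case estimate rather than the mean.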
Even with inpainting, there can be a visible "seam" where the frozen prefix ends and the newly generated tail begins. The inpainted actions are sampled from the right conditional distribution, but they may not perfectly match the velocity and acceleration at the boundary.
This is where the intermediate region from Chapter 2 does its work. Rather than a hard binary transition from frozen to free, the guidance weight ω stays at 1 over the first d actions (the frozen prefix), decays exponentially across the overlap zone, and is 0 over the last s actions (the free tail).
This exponential schedule is critical: the paper tested hard masking (ω snaps from 1 to 0), linear decay, and exponential decay. Soft exponential masking consistently outperforms the alternatives. Hard masking creates discontinuities. Linear decay over-constrains the model for too long. Exponential decay releases the constraint quickly after the boundary while still maintaining continuity.
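One schedule with these properties can be sketched as below. The specific functional form (a normalized position c in the overlap zone, weighted as c·(e^c − 1)/(e − 1)) is an assumption for illustration and should be checked against the paper before being relied on; `soft_mask_weights` is a hypothetical name.

```python
import math


def soft_mask_weights(H, d, s):
    """Per-action guidance weights: 1 on the prefix, decaying, then 0.

    Assumed form, not verified against the paper: in the overlap zone
    c = (H - s - i) / (H - s - d + 1) and w = c * (e^c - 1) / (e - 1).
    """
    if not 0 <= d <= H - s:
        raise ValueError("need 0 <= d <= H - s")
    weights = []
    for i in range(H):
        if i < d:
            weights.append(1.0)      # frozen prefix
        elif i < H - s:
            c = (H - s - i) / (H - s - d + 1)
            weights.append(c * math.expm1(c) / (math.e - 1.0))
        else:
            weights.append(0.0)      # free tail
    return weights


weights = soft_mask_weights(H=8, d=2, s=3)
```

For H = 8, d = 2, s = 3 the weights are 1 at indices 0 and 1, strictly decreasing values between 0 and 1 at indices 2 through 4, and 0 at indices 5 through 7. Most of the drop happens right after the frozen boundary, which is the "release quickly" behavior described above.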
The blending is applied during the denoising process itself -- at each flow-matching step, the guidance term nudges the trajectory toward consistency with the committed prefix.
The paper evaluates RTC on 12 dynamic tasks in the Kinetix simulation suite, environments that emphasize dynamics and closed-loop control: throwing, catching, balancing, and tracking. The policy for each task is a 4-layer MLP-Mixer with chunk length H=8, trained on 1M transitions per environment (32 epochs, 2,048 rollouts per configuration).
| Method | Mechanism | Dynamic Tasks |
|---|---|---|
| Naive async | Execute old chunk, ignore overlap | Discontinuities and jitter |
| Temporal ensembling (TE) | Replan every step, average overlaps | Fails even at d=0 |
| BID (Bidirectional Decoding) | Sample several candidate chunks, keep the one most consistent with the previous chunk | Moderate |
| RTC (hard mask) | Binary freeze/free split | Good but has seams |
| RTC (soft exponential) | Exponential decay guidance | Best across all delays |
The most striking result: temporal ensembling fails even at zero delay (d=0). Why? Because these tasks have multimodal action distributions. When multiple valid trajectories exist (reach left or right around an obstacle), averaging them produces an invalid trajectory (go straight into the obstacle). Flow matching captures multimodality; TE destroys it.
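The mode-averaging failure is easy to reproduce in a toy setting. The two paths, the obstacle width, and the helper names below are made up for illustration; they are not taken from the Kinetix tasks.

```python
def ensemble(chunks):
    """Average several predicted chunks element-wise (uniform weights)."""
    return [sum(values) / len(values) for values in zip(*chunks)]


# Lateral offsets over time: two equally valid ways around an obstacle
# centered at lateral offset 0.
go_left = [0.0, -0.5, -1.0, -0.5, 0.0]
go_right = [0.0, 0.5, 1.0, 0.5, 0.0]

OBSTACLE_HALF_WIDTH = 0.3  # hypothetical obstacle size


def hits_obstacle(path):
    """The obstacle sits at the midpoint of the path."""
    return abs(path[len(path) // 2]) < OBSTACLE_HALF_WIDTH


averaged = ensemble([go_left, go_right])
```

Each path on its own clears the obstacle, but `averaged` is a straight line through it: averaging two valid modes produced a trajectory that belongs to neither.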
RTC with soft exponential masking achieves the highest robustness across all delay conditions. Execution horizon size inversely correlates with delay sensitivity -- larger s means more fresh actions per chunk, providing a buffer against stale observations.
The real-world evaluation uses the π0.5 base model on a dual-arm 6-DoF robot with parallel jaw grippers. The setup: chunk length H=50, control period 20 ms (50 Hz), n=5 denoising steps, model latency ~97 ms (vs 76 ms baseline -- only ~21 ms overhead from the guidance). They tested with injected delays of +100 ms (d≈11) and +200 ms (d≈16) to stress-test robustness.
| Task | Why it needs reactivity |
|---|---|
| Light candle | Precise force, angle, and speed for the strike -- small variations in match position require continuous correction |
| Ethernet plug | Sub-millimeter alignment into a port -- contact forces shift the cable |
| Mobile bed-making | Cloth dynamics are unpredictable -- the robot must react to how fabric drapes |
| Shirt folding | Bimanual coordination with deformable material |
| Batch folding | Sequential folding of multiple items |
| Mobile dish placement | Navigation + precise placement on an uneven surface |
RTC demonstrated superior throughput under all delay conditions with statistical significance at +100 ms and +200 ms. Synchronous execution degrades linearly with injected delay. Temporal ensembling variants triggered protective stops under high delay. RTC additionally reduced retry frequency and local errors, with the biggest advantage on precision-critical tasks like candle lighting.
RTC solves the latency problem at inference time with zero training changes. But it comes with a cost: the guidance makes each denoising step more expensive (in the real-robot setup, about 97 ms per chunk versus 76 ms without it), and there's a subtle distributional mismatch: the model was trained on unconditional chunks but is asked to generate conditional (prefix-constrained) ones at inference.
This is where Helix enters. Helix asks: what if the model learned to handle action prefixes during training? Instead of inpainting at inference time, you train the model to condition on a prefix of already-committed actions. This eliminates the inference overhead and the distributional mismatch. RTC is the inference-time solution; Helix is the training-time solution to the same problem.
| Method | Where it solves latency | Overhead |
|---|---|---|
| Open-loop chunking | Doesn't solve it | None |
| Temporal ensembling | Inference (averaging) | Recompute every step |
| RTC | Inference (inpainting) | Extra compute per denoising step (guidance) |
| Helix | Training (prefix conditioning) | Zero inference overhead |