Kevin Black, Manuel Y. Galliker, Sergey Levine — Physical Intelligence, 2025

RTC: Real-Time Chunking

Action chunking policies are too slow for real-time control. RTC fixes this by computing the next chunk while executing the current one — using flow matching to "inpaint" the uncommitted portion of each chunk so that consecutive plans blend seamlessly.

Prerequisites: pi-0 basics + Action chunking
8 Chapters, 3+ Simulations

Chapter 0: The Latency Problem

Modern VLA policies like pi-0 predict actions in chunks -- an entire sequence of 50 future actions at once. This gives them temporal coherence: the robot plans a smooth trajectory, not just a single jittery step. But it creates a problem that is easy to overlook: inference is too slow to keep up with the control loop.

Generating a 50-step action chunk through flow matching takes 50-200 milliseconds of GPU compute. At 50 Hz control, a new observation arrives every 20 ms. By the time the policy finishes computing, the world has moved on. The observation used to plan is already stale.

The naive fix: execute the chunk open-loop -- just play all 50 actions without replanning. This works in gentle, static environments. But in dynamic settings -- catching a ball, avoiding a moving obstacle, reacting to a human -- open-loop execution is catastrophically fragile. One unexpected perturbation and the entire plan is wrong.

The latency paradox: The bigger the action chunk, the smoother and more capable the robot. But the bigger the chunk, the longer it takes to compute, and the more stale the observations become. You want large chunks for capability and small chunks for reactivity. RTC resolves this tension.
[Interactive simulation: Latency vs Reactivity Tradeoff. A chunk-size slider (default 50) shows that larger chunks are smoother but take longer to compute -- the "stale window" grows.]
Why can't action chunking policies simply wait for computation to finish before acting?

Chapter 1: Action Chunking Recap

To understand RTC, we need to be precise about what action chunking does. An action chunking policy takes in the current observation o_t and outputs a sequence of H actions: [a_t, a_{t+1}, ..., a_{t+H-1}]. This is the chunk.

In a flow matching policy like pi-0, producing this chunk means running multiple denoising steps. You start with random noise shaped like H actions, and the model iteratively refines it into a coherent action trajectory. Each denoising step requires a full forward pass of the transformer.
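
The denoising loop described above can be sketched in a few lines. This is a minimal illustration, not pi-0's implementation: `sample_chunk` and `CountingToyField` are hypothetical names, and the toy velocity field simply pulls the sample toward a fixed target chunk instead of running a transformer. The point is only that each denoising step costs one forward pass.

```python
import numpy as np

def sample_chunk(velocity_fn, horizon, action_dim, n_steps, rng):
    """Integrate a velocity field from noise (t=0) to actions (t=1) with Euler steps."""
    x = rng.standard_normal((horizon, action_dim))  # random noise shaped like H actions
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # one model forward pass per denoising step
    return x

class CountingToyField:
    """Hypothetical stand-in for the learned model: pulls x toward a fixed target chunk."""
    def __init__(self, target):
        self.target = target
        self.calls = 0  # number of forward passes so far

    def __call__(self, x, t):
        self.calls += 1
        return (self.target - x) / max(1.0 - t, 1e-3)

rng = np.random.default_rng(0)
# Toy "correct" chunk: 50 steps, 7 action dimensions, ramping from 0 to 1.
target = np.linspace(0.0, 1.0, 50)[:, None] * np.ones((1, 7))
field = CountingToyField(target)
chunk = sample_chunk(field, horizon=50, action_dim=7, n_steps=10, rng=rng)
print(chunk.shape, "forward passes:", field.calls)
```

With a real policy, the cost of `velocity_fn` dominates, so inference latency scales roughly with the number of denoising steps.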

The standard approach is temporal ensembling: re-plan at every timestep, but average overlapping predictions. Query the policy at t=0, get actions [a_0, ..., a_49]. At t=1, query again, get [a'_1, ..., a'_50]. Execute the average of a_1 and a'_1. This reduces jitter but doesn't solve latency -- you're still computing a fresh chunk at every step.
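
A small sketch of the averaging step may help. It assumes a plain uniform average, as described above (weighted variants also exist), and the `ensemble_action` helper and the toy chunks are made up for illustration.

```python
import numpy as np

def ensemble_action(chunks, t):
    """Average every stored prediction for timestep t.

    chunks maps the timestep at which a chunk was planned to an array of
    shape (H, action_dim); row i of that array is the action for step start + i.
    """
    preds = [c[t - start] for start, c in chunks.items() if 0 <= t - start < len(c)]
    if not preds:
        raise ValueError(f"no chunk covers timestep {t}")
    return np.mean(preds, axis=0)

H = 50
chunk_0 = np.zeros((H, 1))  # planned at t=0: toy chunk that predicts 0.0 everywhere
chunk_1 = np.ones((H, 1))   # planned at t=1: toy chunk that predicts 1.0 everywhere
chunks = {0: chunk_0, 1: chunk_1}

a1 = ensemble_action(chunks, t=1)  # averages chunk_0[1] and chunk_1[0]
print(a1)
```

Note that every call to `ensemble_action` in a real loop presupposes a freshly computed chunk for that timestep, which is exactly the compute cost the text points out.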

The key realization: In temporal ensembling, you throw away most of each chunk. You compute 50 actions but only use 1 before replanning. That's 49 of 50 actions, or 98%, of the compute wasted. What if you committed to executing more of each chunk and replanned less often?

This is exactly what RTC does -- but with a clever trick to ensure that when you do replan, the new chunk is consistent with the actions you already committed to executing.

What is the core inefficiency of temporal ensembling?

Chapter 2: Freeze + Inpaint

Here is RTC's core mechanism, built around three key parameters. H is the total chunk length (e.g. 50 actions). d is the inference delay -- the number of steps the old chunk keeps executing while the GPU computes the new one. s is the execution horizon -- the minimum number of freshly generated steps before triggering the next inference call.

When a new chunk is generated, RTC divides it into three regions based on a guidance weight ω:

Frozen (first d steps): ω = 1.0 — must match old chunk exactly
Intermediate (overlap zone): ω decays from 1.0 → 0.0 — soft transition
Free (final s steps): ω = 0.0 — freshly generated from new observation

This is action inpainting, directly analogous to image inpainting in diffusion models. At each denoising step, the frozen portion is clamped to known values from the previous chunk. Only the free tail is generated from scratch. The intermediate zone blends smoothly between the two.

Why this works: Flow matching is a generative model -- it doesn't just produce a single trajectory, it models the distribution of trajectories. By clamping the prefix during denoising, you're sampling from the conditional distribution p(future | committed_past). The model naturally produces continuations that are dynamically consistent with what came before.
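
To make the clamping idea concrete, here is a minimal sketch using replacement-style inpainting: after every denoising step, the sample is pulled back toward a noised copy of the committed actions, in proportion to the weight ω. This is a simplification chosen for readability, not the paper's exact procedure. The `inpaint_chunk` helper, the toy velocity field, and the linear placeholder weights are all assumptions for illustration.

```python
import numpy as np

def inpaint_chunk(velocity_fn, prev_chunk, weights, n_steps, rng):
    """Sample a new chunk while pulling it toward prev_chunk wherever weights > 0.

    weights has shape (H,): 1.0 = frozen, 0.0 = free, values in between = soft blend.
    prev_chunk holds the old chunk's remaining actions, aligned to the new chunk.
    """
    noise = rng.standard_normal(prev_chunk.shape)
    x = noise.copy()
    w = weights[:, None]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # ordinary denoising step
        t_next = (k + 1) * dt
        # Committed actions expressed at the current noise level (linear path).
        known = (1.0 - t_next) * noise + t_next * prev_chunk
        x = w * known + (1.0 - w) * x   # clamp (w=1), blend (0<w<1), or leave free (w=0)
    return x

H, d, s = 50, 5, 20                  # illustrative values; d and s are hypothetical here
prev_chunk = np.zeros((H, 2))        # toy "old plan"
fresh_plan = np.ones((H, 2))         # toy "what the model wants given the new observation"

def toy_velocity(x, t):
    """Hypothetical stand-in for the learned velocity field."""
    return (fresh_plan - x) / max(1.0 - t, 1e-3)

# Placeholder weights: 1 on the frozen prefix, linear ramp in the overlap, 0 on the tail.
weights = np.concatenate([np.ones(d), np.linspace(1.0, 0.0, H - d - s), np.zeros(s)])
new_chunk = inpaint_chunk(toy_velocity, prev_chunk, weights, n_steps=5,
                          rng=np.random.default_rng(0))
print(new_chunk[:d, 0], new_chunk[-s:, 0][:3])
```

In this toy setup the first d actions reproduce the old plan exactly, the last s actions follow the fresh plan, and the overlap lands somewhere in between.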
[Interactive simulation: Freeze + Inpaint Mechanism. Replanning preserves the frozen prefix while the tail is regenerated via flow matching.]

How does RTC ensure the new chunk is consistent with already-committed actions?

Chapter 3: The Asynchronous Pipeline

The freeze-inpaint mechanism enables something powerful: asynchronous execution. While the robot executes the K frozen actions from chunk N (K is the commitment horizon, defined below), the GPU is already computing chunk N+1. By the time the frozen actions are done, the next chunk is ready.

This is a pipeline, exactly like CPU instruction pipelining. One "stage" executes actions on the robot. Another "stage" computes the next plan on the GPU. The two stages overlap in time.

Time 0 to K: Robot executes chunk 0 (frozen actions). GPU computes chunk 1 (with frozen prefix from chunk 0).
Time K to 2K: Robot executes chunk 1 (frozen actions). GPU computes chunk 2 (with frozen prefix from chunk 1).
Time 2K to 3K: Robot executes chunk 2. GPU computes chunk 3. And so on...
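
The timeline above can be reproduced with a tiny discrete-time simulation. This is a bookkeeping sketch only: no real robot or GPU is involved, and `Tick`, `simulate_pipeline`, and the chosen numbers (K=6, five steps of inference) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Tick:
    step: int
    executing_chunk: int
    computing_chunk: Optional[int]  # None when the GPU is idle

def simulate_pipeline(total_steps: int, K: int, inference_steps: int) -> List[Tick]:
    """Label each control step with the chunk being executed and the chunk being computed."""
    if inference_steps > K:
        raise ValueError("inference must finish within K steps or the robot stalls")
    ticks = []
    for step in range(total_steps):
        window, offset = divmod(step, K)  # window index = chunk currently executing
        computing = window + 1 if offset < inference_steps else None
        ticks.append(Tick(step, window, computing))
    return ticks

ticks = simulate_pipeline(total_steps=18, K=6, inference_steps=5)
for tick in ticks:
    gpu = f"computing chunk {tick.computing_chunk}" if tick.computing_chunk is not None else "idle"
    print(f"step {tick.step:2d}: robot executes chunk {tick.executing_chunk}, GPU {gpu}")
```

The `ValueError` branch encodes the pipeline's one hard requirement: the next chunk has to be ready before the current window ends.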

The key parameter is K -- the number of frozen actions (the "commitment horizon"). K must be large enough that the GPU finishes computing the next chunk before the robot finishes executing the current one. If inference takes 100 ms at 50 Hz control, K needs to be at least 5 steps (100 ms / 20 ms per step).

The tradeoff: Larger K = more time for GPU computation (can use more denoising steps = better quality). Smaller K = more frequent replanning = more reactive to changes. The paper finds K=5-8 works well for most tasks -- enough time for 10+ denoising steps while replanning every 100-160 ms.
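
The arithmetic behind K is simple enough to write down. The helper names below are made up for illustration; the numbers come from the text (100 ms inference, 50 Hz control, K between 5 and 8).

```python
import math

def min_commitment_horizon(inference_ms: float, control_hz: float) -> int:
    """Smallest K such that K control periods cover one inference call."""
    period_ms = 1000.0 / control_hz
    return math.ceil(inference_ms / period_ms)

def replan_interval_ms(K: int, control_hz: float) -> float:
    """Wall-clock time between replans when K actions are committed per chunk."""
    return K * 1000.0 / control_hz

print(min_commitment_horizon(100, 50))  # 100 ms / 20 ms per step
print(replan_interval_ms(5, 50), replan_interval_ms(8, 50))
```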
What determines the minimum commitment horizon K?

Chapter 4: Chunk Blending

Even with inpainting, there can be a visible "seam" where the frozen prefix ends and the newly generated tail begins. The inpainted actions are sampled from the right conditional distribution, but they may not perfectly match the velocity and acceleration at the boundary.

This is where the intermediate region from Chapter 2 does its work. Rather than a hard binary transition from frozen to free, the guidance weight ω decays exponentially across the overlap zone:

ω(i) = exp(−λ · (i − d) / (H − d − s))

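where i is the action index within the chunk, d is the frozen prefix length, s is the length of the free tail, and λ > 0 sets how fast the weight decays. The formula covers the overlap zone (d ≤ i < H − s): ω starts at 1 at i = d and falls toward e^(−λ) at the end of the zone, while the frozen prefix keeps ω = 1 and the free tail keeps ω = 0. This exponential schedule is critical -- the paper tested hard masking (ω snaps from 1 to 0), linear decay, and exponential decay. Soft exponential masking consistently outperforms the alternatives. Hard masking creates discontinuities. Linear decay over-constrains the model for too long. Exponential decay releases the constraint quickly after the boundary while still maintaining continuity.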
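
As a sketch, the schedule can be built by combining the formula above with the three regions from Chapter 2. The function name `guidance_weights` and the default λ = 3.0 are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def guidance_weights(H, d, s, lam=3.0):
    """Per-action guidance weight: 1 on the frozen prefix, exponential decay in the
    overlap zone, 0 on the free tail. lam is a hypothetical decay rate."""
    if d < 0 or s < 0 or d + s >= H:
        raise ValueError("need d >= 0, s >= 0 and d + s < H so the overlap zone is non-empty")
    i = np.arange(H)
    overlap = H - d - s
    w = np.exp(-lam * (i - d) / overlap)  # formula from the text, valid for d <= i < H - s
    w[i < d] = 1.0                        # frozen prefix
    w[i >= H - s] = 0.0                   # free tail
    return w

w = guidance_weights(H=50, d=5, s=20)
print(np.round(w[:8], 3))     # frozen prefix, then the start of the decay
print(np.round(w[27:32], 3))  # end of the overlap zone, then the free tail
```

Printing the two slices shows the shape of the schedule: flat at 1, a fast drop just after the boundary, then exactly 0 where the model generates freely.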

The blending is applied during the denoising process itself -- at each flow-matching step, the guidance term nudges the trajectory toward consistency with the committed prefix.

Why blend during denoising, not after? If you blend only the final output, the model has no chance to adapt to the constraint. By applying guidance at every denoising step, you're steering the generative process itself. The model sees the guided state and adjusts its predictions accordingly. This produces much smoother transitions than post-hoc interpolation.
Why is chunk blending applied during the denoising process rather than as a post-processing step?

Chapter 5: Benchmark Results

The paper evaluates RTC on 12 dynamic tasks in the Kinetix simulation suite -- environments emphasizing dynamics and closed-loop control: throwing, catching, balancing, and tracking. For each task, a policy is trained with a 4-layer MLP-Mixer, chunk length H=8, on 1M transitions per environment (32 epochs, 2,048 rollouts per configuration).

RTC vs baselines

Method | Mechanism | Dynamic Tasks
Naive async | Execute old chunk, ignore overlap | Discontinuities and jitter
Temporal ensembling (TE) | Replan every step, average overlaps | Fails even at d=0
BID (bidirectional) | Bidirectional blending of chunks | Moderate
RTC (hard mask) | Binary freeze/free split | Good but has seams
RTC (soft exponential) | Exponential decay guidance | Best across all delays

The most striking result: temporal ensembling fails even at zero delay (d=0). Why? Because these tasks have multimodal action distributions. When multiple valid trajectories exist (reach left or right around an obstacle), averaging them produces an invalid trajectory (go straight into the obstacle). Flow matching captures multimodality; TE destroys it.
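
A toy example shows why averaging breaks multimodal plans. The paths, the obstacle radius, and the `hits_obstacle` check are invented for illustration; they are not taken from the Kinetix tasks.

```python
import numpy as np

progress = np.linspace(0.0, 1.0, 11)   # forward progress along the path
go_left = np.sin(np.pi * progress)     # lateral offset: swings to +1 at the midpoint
go_right = -go_left                    # mirror image: swings to -1 at the midpoint
averaged = (go_left + go_right) / 2.0  # what uniform averaging of the two plans produces

obstacle_radius = 0.3                  # hypothetical obstacle centered on the straight line

def hits_obstacle(path):
    """True if the path passes within the obstacle radius at the midpoint."""
    return bool(abs(path[len(path) // 2]) < obstacle_radius)

for name, path in [("left", go_left), ("right", go_right), ("averaged", averaged)]:
    print(f"{name:8s} hits obstacle: {hits_obstacle(path)}")
```

Both individual plans clear the obstacle, while their average runs straight through it.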

RTC with soft exponential masking achieves the highest robustness across all delay conditions. Execution horizon size inversely correlates with delay sensitivity -- larger s means more fresh actions per chunk, providing a buffer against stale observations.

The 12 Kinetix tasks cover throwing, catching, balancing, and obstacle avoidance with injected actuation noise. RTC shows its biggest advantage on tasks requiring fast reactions, exactly where the gap between async computation and real-time control hurts the most.
On which types of tasks does RTC show the largest improvement over open-loop chunking?

Chapter 6: Real-World Tasks

The real-world evaluation uses the π0.5 base model on a dual-arm 6-DoF robot with parallel jaw grippers. The setup: chunk length H=50, control period 20 ms (50 Hz), n=5 denoising steps, model latency ~97 ms (vs 76 ms baseline -- only ~21 ms overhead from the guidance). The authors also injected extra delays of +100 ms (d≈11) and +200 ms (d≈16) to stress-test robustness.

Six dual-arm tasks (480 episodes, 28 machine hours)

Task | Why it needs reactivity
Light candle | Precise force, angle, and speed for the strike -- small variations in match position require continuous correction
Ethernet plug | Sub-millimeter alignment into a port -- contact forces shift the cable
Mobile bed-making | Cloth dynamics are unpredictable -- the robot must react to how fabric drapes
Shirt folding | Bimanual coordination with deformable material
Batch folding | Sequential folding of multiple items
Mobile dish placement | Navigation + precise placement on an uneven surface

RTC demonstrated superior throughput under all delay conditions with statistical significance at +100 ms and +200 ms. Synchronous execution degrades linearly with injected delay. Temporal ensembling variants triggered protective stops under high delay. RTC additionally reduced retry frequency and local errors, with the biggest advantage on precision-critical tasks like candle lighting.

Candle lighting is emblematic: It requires millisecond-level precision in a task that lasts only a second or two. The entire trajectory is dynamic -- there's no "static" phase where open-loop execution would be safe. Every 100 ms of stale observation can mean the difference between ignition and failure. RTC replans every ~6 steps while executing, keeping the trajectory fresh.
Why does candle lighting specifically require RTC rather than open-loop chunking?

Chapter 7: Connections

RTC solves the latency problem at inference time with zero training changes. But it comes with a cost: the inpainting guidance adds compute overhead during denoising (about 21 ms per chunk in the Chapter 6 setup), and there's a subtle distributional mismatch -- the model was trained to generate whole chunks with no committed prefix but is asked to generate prefix-constrained ones at inference.

[Interactive simulation: Async Chunking Timeline. Chunks overlap: one executes while the next is computed. A commitment slider adjusts K (default 6).]

This is where Helix enters. Helix asks: what if the model learned to handle action prefixes during training? Instead of inpainting at inference time, you train the model to condition on a prefix of already-committed actions. This eliminates the inference overhead and the distributional mismatch. RTC is the inference-time solution; Helix is the training-time solution to the same problem.

Method | Where it solves latency | Overhead
Open-loop chunking | Doesn't solve it | None
Temporal ensembling | Inference (averaging) | Recompute every step
RTC | Inference (inpainting) | Extra denoising steps
Helix | Training (prefix conditioning) | Zero inference overhead
Related lessons: Helix, pi-0, pi-0.5, Gleams: Flow Matching
"The best time to start computing the next action is while you're still executing the current one."
— The core principle of RTC