Kevin Black, Manuel Y. Galliker, Sergey Levine — Physical Intelligence, 2025

RTC: Real-Time Chunking

Action chunking policies are too slow for real-time control. RTC fixes this by computing the next chunk while executing the current one — using flow matching to "inpaint" the uncommitted portion of each chunk so that consecutive plans blend seamlessly.

Prerequisites: pi-0 basics + Action chunking


Chapter 0: The Latency Problem

Modern vision-language-action (VLA) policies like pi-0 predict actions in chunks: an entire sequence of 50 future actions at once. This gives them temporal coherence: the robot plans a smooth trajectory rather than a single jittery step. But it creates a problem that is easy to overlook: the model is too slow to keep up with the control loop.

Generating a 50-step action chunk through flow matching takes 50-200 milliseconds of GPU compute. At 50 Hz control, a new observation arrives every 20 ms. By the time the policy finishes computing, the world has moved on. The observation used to plan is already stale.
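The arithmetic behind that staleness can be sketched in a few lines. This is only an illustration of the numbers quoted above; the function name `stale_steps` is a hypothetical helper, not something from the paper.

```python
import math

def stale_steps(inference_ms, control_hz):
    """Number of control ticks that elapse while one chunk is being computed."""
    period_ms = 1000.0 / control_hz  # 20 ms at 50 Hz
    return math.ceil(inference_ms / period_ms)

# The 50-200 ms range from the text, at 50 Hz control.
for latency_ms in (50, 100, 200):
    print(latency_ms, "ms of inference =", stale_steps(latency_ms, 50.0), "control steps")
```

At the slow end of the range, ten control steps pass before the plan based on the original observation is available.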

The naive fix: execute the chunk open-loop -- just play all 50 actions without replanning. This works in gentle, static environments. But in dynamic settings -- catching a ball, avoiding a moving obstacle, reacting to a human -- open-loop execution is catastrophically fragile. One unexpected perturbation and the entire plan is wrong.

The latency paradox: The bigger the action chunk, the smoother and more capable the robot. But the bigger the chunk, the longer it takes to compute, and the more stale the observations become. You want large chunks for capability and small chunks for reactivity. RTC resolves this tension.

Interactive figure: Latency vs Reactivity Tradeoff. A chunk-size slider (default 50) shows that larger chunks are smoother but take longer to compute, so the "stale window" grows.

Why can't action chunking policies simply wait for computation to finish before acting?

Computation takes 50-200 ms, but the world changes every 20 ms -- by the time the plan is ready, the observation is stale and the plan may be wrong
The GPU runs out of memory
The robot joints cannot accept batched commands

Chapter 1: Action Chunking Recap

To understand RTC, we need to be precise about what action chunking does. An action chunking policy takes in the current observation o_t and outputs a sequence of H actions: [a_t, a_t+1, ..., a_t+H-1]. This is the chunk.

In a flow matching policy like pi-0, producing this chunk means running multiple denoising steps. You start with random noise shaped like H actions, and the model iteratively refines it into a coherent action trajectory. Each denoising step requires a full forward pass of the transformer.

The standard approach is temporal ensembling: re-plan at every timestep, but average overlapping predictions. Query the policy at t=0, get actions [a₀...a₄₉]. At t=1, query again, get [a'₁...a'₅₀]. Execute the average of a₁ and a'₁. This reduces jitter but doesn't solve latency -- you're still computing a fresh chunk at every step.
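A minimal sketch of the bookkeeping described above, assuming scalar actions and a uniform average (weighted averages are also used in practice). `toy_policy` is a hypothetical stand-in for the real model, and the chunk length is shortened to 4 for readability.

```python
from collections import defaultdict

H = 4  # short chunk for illustration (the text uses 50)

def toy_policy(t):
    """Hypothetical policy: returns H scalar actions for steps t .. t+H-1.
    Later queries disagree slightly with earlier ones, as a real model would."""
    return [float(t + k) + 0.1 * t for k in range(H)]

predictions = defaultdict(list)  # timestep -> every prediction made for that step
executed = []
for t in range(6):
    chunk = toy_policy(t)  # a full chunk is recomputed at every step
    for k, action in enumerate(chunk):
        predictions[t + k].append(action)
    votes = predictions[t]
    executed.append(sum(votes) / len(votes))  # average the overlapping predictions

print(executed)
```

Each control step still costs one full policy query, which is why ensembling smooths the output without reducing latency.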

The key realization: In temporal ensembling, you throw away most of each chunk. You compute 50 actions but only use 1 before replanning. That's 98% wasted compute. What if you committed to executing more of each chunk and replanned less often?

This is exactly what RTC does -- but with a clever trick to ensure that when you do replan, the new chunk is consistent with the actions you already committed to executing.

What is the core inefficiency of temporal ensembling?

It uses too much memory
You compute an entire chunk of 50 actions but only use 1 before replanning, wasting 98% of the computation
It requires two GPUs

Chapter 2: Freeze + Inpaint

Here is RTC's core mechanism, built around three key parameters. H is the total chunk length (e.g. 50 actions). d is the inference delay -- the number of steps the old chunk keeps executing while the GPU computes the new one. s is the execution horizon -- the minimum number of freshly generated steps before triggering the next inference call.

When a new chunk is generated, RTC divides it into three regions based on a guidance weight ω:

Frozen (first d steps): ω = 1.0 — must match old chunk exactly
Intermediate (overlap zone): ω decays from 1.0 → 0.0 — soft transition
Free (final s steps): ω = 0.0 — freshly generated from new observation

This is action inpainting, directly analogous to image inpainting in diffusion models. At each denoising step, the frozen portion is clamped to known values from the previous chunk. Only the free tail is generated from scratch. The intermediate zone blends smoothly between the two.

Why this works: Flow matching is a generative model -- it doesn't just produce a single trajectory, it models the distribution of trajectories. By clamping the prefix during denoising, you're sampling from the conditional distribution p(future | committed_past). The model naturally produces continuations that are dynamically consistent with what came before.
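The clamping step can be sketched as follows, under strong simplifying assumptions. `toy_velocity` is a hypothetical stand-in for the learned velocity field and treats every action independently, so it shows only the mechanics of clamping, not how a trained model adapts the tail to the prefix. The convention assumed here is that t=0 is pure noise and t=1 is the finished chunk.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical velocity field: points each action at a fixed target value."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def inpaint_chunk(prev_tail, target, n_steps=5, seed=0):
    """Generate len(target) actions whose first len(prev_tail) entries match prev_tail."""
    rng = random.Random(seed)
    d = len(prev_tail)
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    x = list(noise)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        # Clamp: overwrite the frozen prefix with the known actions noised to level t.
        for i in range(d):
            x[i] = (1.0 - t) * noise[i] + t * prev_tail[i]
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler step toward t=1
    x[:d] = prev_tail  # final clamp: committed actions are reproduced exactly
    return x

new_chunk = inpaint_chunk(prev_tail=[0.5, 0.6, 0.7],
                          target=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
print(new_chunk)
```

With a real model, the velocity for the free tail depends on the clamped prefix, which is what makes the continuation consistent with the committed actions.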

Interactive figure: Freeze + Inpaint Mechanism. Replanning preserves the frozen prefix while the tail is regenerated via flow matching.

How does RTC ensure the new chunk is consistent with already-committed actions?

During denoising, committed actions are clamped at each step -- only the uncommitted tail is generated freely, sampling from the conditional distribution given the frozen prefix
It trains a separate consistency loss
It simply concatenates old and new chunks

Chapter 3: The Asynchronous Pipeline

The freeze-inpaint mechanism enables something powerful: asynchronous execution. While the robot executes the K frozen actions from chunk N (K is the number of committed steps), the GPU is already computing chunk N+1. By the time the frozen actions are done, the next chunk is ready.

This is a pipeline, much like CPU instruction pipelining. One "stage" executes actions on the robot. Another "stage" computes the next plan on the GPU. The two stages overlap in time.

Steps 0 to K: Robot executes chunk 0 (frozen actions). GPU computes chunk 1 (with frozen prefix from chunk 0).
Steps K to 2K: Robot executes chunk 1 (frozen actions). GPU computes chunk 2 (with frozen prefix from chunk 1).
Steps 2K to 3K: Robot executes chunk 2. GPU computes chunk 3. And so on...

The key parameter is K -- the number of frozen actions (the "commitment horizon"). K must be large enough that the GPU finishes computing the next chunk before the robot finishes executing the current one. If inference takes 100 ms at 50 Hz control, K needs to be at least 5 steps (100 ms / 20 ms per step).
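The constraint on K can be checked with a small tick-by-tick simulation. This is a simplified sketch: both function names are hypothetical, and a robot that runs out of frozen actions is modeled as stalling, whereas a real system would have to fall back on something else.

```python
import math

def min_commitment_steps(inference_ms, control_period_ms):
    """Smallest K for which the next chunk is ready in time."""
    return math.ceil(inference_ms / control_period_ms)

def simulate(K, inference_steps, total_steps):
    """Per control tick: index of the chunk being executed, or None if stalled."""
    timeline = []
    chunk, started = 0, 0                 # chunk 0 is assumed ready at tick 0
    ready_at = started + inference_steps  # inference for chunk 1 starts at once
    for tick in range(total_steps):
        if tick - started >= K:           # frozen actions are used up
            if tick >= ready_at:          # next chunk finished in time
                chunk, started = chunk + 1, tick
                ready_at = tick + inference_steps
            else:                         # next chunk not ready yet
                timeline.append(None)
                continue
        timeline.append(chunk)
    return timeline

K_min = min_commitment_steps(100, 20)  # 100 ms inference, 20 ms control period
print(K_min)
print(simulate(K=5, inference_steps=5, total_steps=12))
print(simulate(K=3, inference_steps=5, total_steps=12))
```

With K equal to the inference time in steps the timeline has no gaps; with K=3 the robot idles for two ticks before every new chunk.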

The tradeoff: Larger K = more time for GPU computation (can use more denoising steps = better quality). Smaller K = more frequent replanning = more reactive to changes. The paper finds K=5-8 works well for most tasks -- enough time for 10+ denoising steps while replanning every 100-160 ms.

What determines the minimum commitment horizon K?

The number of robot joints
The size of the action chunk H
The GPU inference time divided by the control period -- K must be large enough that the next chunk is ready before the current frozen actions finish executing

Chapter 4: Chunk Blending

Even with inpainting, there can be a visible "seam" where the frozen prefix ends and the newly generated tail begins. The inpainted actions are sampled from the right conditional distribution, but they may not perfectly match the velocity and acceleration at the boundary.

This is where the intermediate region from Chapter 2 does its work. Rather than a hard binary transition from frozen to free, the guidance weight ω decays exponentially across the overlap zone:

ω(i) = exp(−λ · (i − d) / (H − d − s))

where i indexes steps in the overlap zone (d ≤ i < H − s), d is the frozen prefix length, s is the length of the free tail, and λ sets how quickly the weight decays. This exponential schedule is critical -- the paper tested hard masking (ω snaps from 1 to 0), linear decay, and exponential decay. Soft exponential masking consistently outperforms the alternatives. Hard masking creates discontinuities. Linear decay over-constrains the model for too long. Exponential decay releases the constraint quickly after the boundary while still maintaining continuity.
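The schedule above, combined with the three regions from Chapter 2, can be written out as a short function. This is a sketch of the formula as stated in this lesson; the decay rate `lam=3.0` and the values of d and s in the usage line are arbitrary illustrative choices, not values from the paper.

```python
import math

def guidance_weights(H, d, s, lam=3.0):
    """Per-step guidance weight for a chunk of H actions.
    lam (the decay rate) is an assumed illustrative value."""
    overlap = H - d - s  # length of the intermediate zone
    weights = []
    for i in range(H):
        if i < d:                  # frozen prefix: must match the old chunk
            weights.append(1.0)
        elif i < H - s:            # intermediate zone: exponential decay
            weights.append(math.exp(-lam * (i - d) / overlap))
        else:                      # free tail: no guidance
            weights.append(0.0)
    return weights

w = guidance_weights(H=50, d=5, s=10)
print([round(x, 3) for x in w])
```

The weight starts at 1.0 at the first overlap step, so it is continuous with the frozen prefix, and drops to exactly 0 only once the free tail begins.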

The blending is applied during the denoising process itself -- at each flow-matching step, the guidance term nudges the trajectory toward consistency with the committed prefix.

Why blend during denoising, not after? If you blend only the final output, the model has no chance to adapt to the constraint. By applying guidance at every denoising step, you're steering the generative process itself. The model sees the guided state and adjusts its predictions accordingly. This produces much smoother transitions than post-hoc interpolation.

Why is chunk blending applied during the denoising process rather than as a post-processing step?

Blending at each denoising step lets the model see and adapt to the constraint, producing smoother transitions than a one-shot post-hoc blend
Post-processing is computationally expensive
The frozen actions are not available after denoising

Chapter 5: Benchmark Results

The paper evaluates RTC on 12 dynamic tasks in the Kinetix simulation suite -- environments emphasizing dynamics and closed-loop control: throwing, catching, balancing, and tracking. Each task is trained with a 4-layer MLP-Mixer, chunk length H=8, on 1M transitions per environment (32 epochs, 2,048 rollouts per configuration).

RTC vs baselines

Method	Mechanism	Dynamic Tasks
Naive async	Execute old chunk, ignore overlap	Discontinuities and jitter
Temporal ensembling (TE)	Replan every step, average overlaps	Fails even at d=0
BID (bidirectional)	Bidirectional blending of chunks	Moderate
RTC (hard mask)	Binary freeze/free split	Good but has seams
RTC (soft exponential)	Exponential decay guidance	Best across all delays

The most striking result: temporal ensembling fails even at zero delay (d=0). Why? Because these tasks have multimodal action distributions. When multiple valid trajectories exist (reach left or right around an obstacle), averaging them produces an invalid trajectory (go straight into the obstacle). Flow matching captures multimodality; TE destroys it.
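A toy example makes the averaging failure concrete. The trajectories and the obstacle width below are made-up numbers chosen only to illustrate the left-or-right case described above.

```python
# Lateral offset over five steps; the obstacle sits on the centerline at step 2.
left = [-0.2, -0.6, -1.0, -0.6, -0.2]   # pass the obstacle on the left
right = [0.2, 0.6, 1.0, 0.6, 0.2]       # pass the obstacle on the right
averaged = [(a + b) / 2 for a, b in zip(left, right)]

OBSTACLE_HALF_WIDTH = 0.5  # hypothetical obstacle size

def collides(path):
    """True if the path is within the obstacle's width at the midpoint."""
    return abs(path[2]) < OBSTACLE_HALF_WIDTH

print(collides(left), collides(right), collides(averaged))
```

Both individual plans clear the obstacle, but their average runs straight down the centerline and hits it.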

RTC with soft exponential masking achieves the highest robustness across all delay conditions. Execution horizon size inversely correlates with delay sensitivity -- larger s means more fresh actions per chunk, providing a buffer against stale observations.

The 12 Kinetix tasks cover throwing, catching, balancing, and obstacle avoidance with injected actuation noise. RTC shows its biggest advantage on tasks requiring fast reactions, exactly where the gap between async computation and real-time control hurts the most.

On which types of tasks does RTC show the largest improvement over open-loop chunking?

Dynamic tasks requiring real-time reactivity -- catching, dodging, tracking moving objects -- where open-loop execution cannot adapt to changes
Static pick-and-place tasks
Tasks with long planning horizons

Chapter 6: Real-World Tasks

The real-world evaluation uses the π0.5 base model on a dual-arm 6-DoF robot with parallel jaw grippers. The setup: chunk length H=50, control period 20 ms (50 Hz), n=5 denoising steps, model latency ~97 ms (vs 76 ms baseline -- only ~21 ms overhead from the guidance). They tested with injected delays of +100 ms (d≈11) and +200 ms (d≈16) to stress-test robustness.

Six dual-arm tasks (480 episodes, 28 machine hours)

Task	Why it needs reactivity
Light candle	Precise force, angle, and speed for the strike -- small variations in match position require continuous correction
Ethernet plug	Sub-millimeter alignment into a port -- contact forces shift the cable
Mobile bed-making	Cloth dynamics are unpredictable -- the robot must react to how fabric drapes
Shirt folding	Bimanual coordination with deformable material
Batch folding	Sequential folding of multiple items
Mobile dish placement	Navigation + precise placement on an uneven surface

RTC demonstrated superior throughput under all delay conditions with statistical significance at +100 ms and +200 ms. Synchronous execution degrades linearly with injected delay. Temporal ensembling variants triggered protective stops under high delay. RTC additionally reduced retry frequency and local errors, with the biggest advantage on precision-critical tasks like candle lighting.

Candle lighting is emblematic: It requires millisecond-level precision in a task that lasts only a second or two. The entire trajectory is dynamic -- there's no "static" phase where open-loop execution would be safe. Every 100 ms of stale observation can mean the difference between ignition and failure. RTC replans every ~6 steps while executing, keeping the trajectory fresh.

Why does match lighting specifically require RTC rather than open-loop chunking?

Matches are too small for the camera to see
Open-loop is too slow to compute
The entire trajectory is dynamic with no safe open-loop phase -- small variations in match position require continuous replanning of the strike angle and force

Chapter 7: Connections

RTC solves the latency problem at inference time with zero training changes. But it comes with a cost: the inpainting denoising steps add compute overhead (you need extra denoising steps for the conditioning), and there's a subtle distributional mismatch -- the model was trained on unconditional chunks but is asked to generate conditional (prefix-clamped) ones at inference.

Interactive figure: Async Chunking Timeline. Chunks overlap: one executes while the next is computed. A slider sets the commitment horizon K (default 6).

This is where Helix enters. Helix asks: what if the model learned to handle action prefixes during training? Instead of inpainting at inference time, you train the model to condition on a prefix of already-committed actions. This eliminates the inference overhead and the distributional mismatch. RTC is the inference-time solution; Helix is the training-time solution to the same problem.

Method	Where it solves latency	Overhead
Open-loop chunking	Doesn't solve it	None
Temporal ensembling	Inference (averaging)	Recompute every step
RTC	Inference (inpainting)	Extra denoising steps
Helix	Training (prefix conditioning)	Zero inference overhead

Related lessons: Helix • pi-0 • pi-0.5 • Gleams: Flow Matching

"The best time to start computing the next action is while you're still executing the current one."

— The core principle of RTC
