ACT: Action Chunking with Transformers

Chapter 0: The Problem

Imagine teleoperating a robot to open a tiny condiment cup. You tip the cup over with the right gripper, nudge it into the left gripper, close gently, lift, then pry open the lid with a fingertip. Each motion requires millimeter precision. One wrong move and the cup flies off the table.

Now you want a robot to learn this from your demonstrations. The standard approach — behavioral cloning — trains a neural network to predict the next action given the current observation. Simple, right?

The problem is compounding errors. The policy makes a tiny mistake at timestep 1. Now at timestep 2, the robot is in a slightly wrong state — one it never saw in training. It makes a slightly bigger mistake. By timestep 50, the errors have cascaded and the robot is in a completely alien state, flailing helplessly.

Why fine manipulation is the hardest case: Compounding errors scale with precision requirements. For a pick-and-place task with centimeter tolerances, small drift is fine. But threading a zip tie through a 3mm×25mm loop? A 2mm error at the grasp compounds to a 10mm deviation at insertion — total failure. Fine manipulation amplifies every policy mistake.

Previous solutions had painful tradeoffs. DAgger requires an expert to provide corrections during rollouts — impractical with teleoperation. Noise injection during data collection makes demonstrations worse. Synthetic correction data only works with low-dimensional states.

And there's a second problem: non-Markovian demonstrations. Humans pause mid-task, vary their hand-off positions, and use different strategies for the same state. A single-step Markovian policy sees a state where the human paused and learns "do nothing" — then gets stuck forever.

Compounding Errors in Fine Manipulation

Each timestep adds a small error, so cumulative error eventually exceeds the task's precision tolerance. The larger the error per step, the sooner that happens, and fine manipulation (tight tolerance) fails much sooner than coarse manipulation.

Why is compounding error especially devastating for fine manipulation tasks like threading a zip tie?

The precision tolerance is so tight that even small accumulated errors cause total failure
The robot moves faster during fine manipulation
The demonstrations are noisier for fine tasks

Chapter 1: The Key Insight

Here's the core idea, borrowed from neuroscience: humans don't plan one muscle twitch at a time. We group sequences of actions into chunks — "reach for cup," "grasp handle," "lift" — and execute each chunk as a single unit. This is called action chunking in psychology.

ACT applies this idea to robot learning. Instead of predicting one action at a time (π_θ(a_t | s_t)), the policy predicts the next k actions at once:

π_θ(a_t:t+k | s_t)

This simple change has a profound effect: it reduces the effective horizon of the task by a factor of k. If an episode is 500 steps long and k = 100, the policy only needs to make 5 decisions instead of 500. Fewer decisions means fewer chances for errors to compound.

Why k-fold horizon reduction helps: Compounding error grows faster than linearly with the number of sequential decisions. In the classic worst-case analysis of behavioral cloning, the error bound grows quadratically with the horizon, so cutting from 500 decisions to 5 can shrink that bound by up to 100², not just 100. Chunking attacks the superlinear nature of compounding rather than helping only linearly.
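The arithmetic behind the horizon reduction can be checked in a few lines. This is a minimal sketch, assuming a worst-case error bound that scales with the square of the number of decisions; the function names are hypothetical and are not taken from the ACT codebase.

```python
import math


def num_decisions(episode_len, chunk_size):
    """Policy queries needed when each query yields chunk_size actions run open-loop."""
    return math.ceil(episode_len / chunk_size)


def quadratic_bound_ratio(episode_len, chunk_size):
    """How much a T^2-style error bound shrinks after chunking (illustrative assumption)."""
    before = num_decisions(episode_len, 1)
    after = num_decisions(episode_len, chunk_size)
    return (before / after) ** 2


print(num_decisions(500, 100))          # 5
print(quadratic_bound_ratio(500, 100))  # 10000.0
```

The ratio is a statement about a bound, not a measured improvement; real gains depend on the task and the policy.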

But there's a subtlety. If we only observe every k steps and execute k actions open-loop, the robot can't react to changes mid-chunk. The naive approach produces jerky, unresponsive behavior.

ACT solves this with two elegant tricks: (1) query the policy at every timestep (not every k steps), creating overlapping chunks, and (2) blend the overlapping predictions with temporal ensembling. More on this in Chapter 4.

And to handle the variability in human demonstrations — different people use different strategies for the same state — ACT trains as a Conditional VAE. A latent "style variable" z captures which strategy to use, while the policy focuses on executing it precisely. More in Chapter 5.

Single-step policy

π(a_t | s_t) — 500 sequential decisions, high compounding error

↓

ACT (chunked)

π(a_t:t+k | s_t) — only 5 decisions at k=100, dramatically less error

↓

+ Temporal ensemble

Query every step, blend overlapping chunks for smooth, reactive execution

↓

+ CVAE

Latent z captures human style variation; decode deterministically at test time

If an episode has 500 timesteps and the chunk size k = 100, how many sequential decisions does the policy need to make?

5, reducing compounding error dramatically
100, one per chunk
500, chunking doesn't change the number of decisions

Chapter 2: ALOHA Hardware

Before ACT can learn, it needs demonstrations. And demonstrations need a teleoperation system. Previous bimanual systems cost $100k+ (Shadow Robot, ABB YuMi, da Vinci surgical). ALOHA costs under $20k — comparable to a single research arm like a Franka Panda.

Design principles

ALOHA uses two ViperX 6-DOF arms (~$5,600 each) as "followers" and two smaller WidowX arms (~$3,300 each) as "leaders." The operator backdrives the leader arms, and the followers mirror the motion via joint-space mapping.

Why joint-space mapping instead of task-space (IK-based) mapping? Two reasons that matter enormously for fine manipulation:

No singularities: Fine manipulation often requires poses near the robot's kinematic singularities. IK fails frequently there. Joint-space mapping works everywhere within joint limits.
Natural damping: The weight of the leader arm prevents the operator from moving too fast and dampens vibrations. This produces better demonstrations than holding a VR controller in free space.

See-through grippers: A seemingly small detail with outsized impact. ALOHA's 3D-printed transparent fingers let the operator see what's being grasped — critical for sub-millimeter positioning. The grippers are fitted with grip tape for robust hold even on thin plastic films. Total cost for the custom parts: a few dollars in 3D printing filament.

Observation setup

Four Logitech C922x webcams (480×640 RGB at 50Hz): two mounted on the followers' wrists for close-up views, one front, one top. The wrist cameras are essential — the fixed cameras can't see what's happening between the gripper fingers during delicate operations.

Actions are recorded as the leader robot's joint positions (not the follower's), because the difference between leader and follower positions implicitly encodes the force being applied through the PID controller.
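To see why the leader-follower gap carries force information, consider only the proportional term of a joint position controller. This is a simplified sketch: the gain `kp`, the joint angles, and the function name are hypothetical, and a real PID controller also has integral and derivative terms.

```python
def proportional_torque(leader_pos, follower_pos, kp):
    """P-term of a joint position controller: effort scales with the tracking gap."""
    return [kp * (lead - follow) for lead, follow in zip(leader_pos, follower_pos)]


# Free motion: the follower tracks the leader, so the gap and the effort are zero.
free = proportional_torque([0.60], [0.60], kp=8.0)

# Contact: an object blocks the follower at 0.50 rad while the leader is pushed to 0.60 rad.
contact = proportional_torque([0.60], [0.50], kp=8.0)

print(free, contact)
```

Recording the follower's position would give the blocked value in both a gentle and a firm push, while the leader's position distinguishes them.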

ALOHA System Overview

The leader-follower setup with joint-space mapping. The human backdrives the smaller leader arms; the larger followers mirror the motion.

Why does ALOHA record the leader robot's joint positions as actions instead of the follower's?

The leader robots are more accurate
The leader positions are smoother
The difference between leader and follower positions implicitly encodes the force being applied via PID control

Chapter 3: Action Chunking

Let's formalize the chunking idea. In standard behavioral cloning, the policy predicts one action per timestep:

π_θ(a_t | s_t) → 1 action (14-dim: 7 values × 2 arms, i.e. 6 joints plus 1 gripper per arm)

With action chunking at chunk size k, the policy predicts k future actions at once:

π_θ(a_t:t+k | s_t) → k actions (k × 14 tensor)

In the naïve version, you'd observe every k steps, generate k actions, execute them all, then observe again. This has a problem: it's fully open-loop within each chunk. If something changes mid-chunk, the robot can't react.
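The naive scheme can be sketched as a loop that consults the observation only once per chunk. The `policy` and `env_step` callables and the 1-D toy stand-ins below are hypothetical placeholders, not part of ACT.

```python
def run_open_loop(policy, env_step, obs, episode_len, k):
    """Naive chunking: observe once, execute k predicted actions blindly, repeat."""
    num_queries = 0
    t = 0
    while t < episode_len:
        chunk = policy(obs, k)  # one observation in, k actions out
        num_queries += 1
        for action in chunk[: episode_len - t]:
            obs = env_step(obs, action)  # no new observation is consulted here
            t += 1
    return obs, num_queries


# Toy 1-D stand-ins: the state is a float and each action is an increment.
def toy_policy(obs, k):
    return [1.0] * k


def toy_env_step(obs, action):
    return obs + action


final_obs, queries = run_open_loop(toy_policy, toy_env_step, 0.0, episode_len=500, k=100)
print(queries)  # 5
```

Inside the inner loop nothing can correct for a disturbance, which is the weakness that querying at every timestep removes.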

The horizon reduction effect

Consider a real task like Slide Ziploc: 500 timesteps at 50Hz (10 seconds). With k=1, the policy makes 500 sequential decisions. With k=100, only 5. The effective "episode length" for compounding error purposes is just 5 steps.

Chunking also fixes temporal confounders: If a human pauses for 20 timesteps mid-demonstration, a single-step policy sees 20 identical states and learns "do nothing." But with k=100, those 20 pause steps are embedded in a larger chunk that starts before and ends after the pause — the model learns the full action sequence including the pause, without getting stuck on it.
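A toy example makes the pause argument concrete. The sketch below assumes velocity-like 1-D actions (ACT itself predicts joint positions) and simply truncates chunks near the end of the demonstration; `chunk_targets` is a hypothetical helper.

```python
def chunk_targets(actions, k):
    """Training target at step t is actions[t : t + k] (shorter near the end)."""
    return [actions[t : t + k] for t in range(len(actions))]


# Toy demonstration: move, pause for three steps (zero action), then move again.
demo = [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]

single_step = chunk_targets(demo, 1)
chunked = chunk_targets(demo, 4)

print(single_step[2])  # [0.0]
print(chunked[2])      # [0.0, 0.0, 0.0, 1.0]
```

With k=1 every pause step has the target "do nothing"; with k=4 every pause step's target also contains the motion that follows the pause.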

How large should k be?

There's a tradeoff. Larger k means fewer decisions (less compounding error) but also longer open-loop execution (less reactivity). The paper's ablation shows a clear trend:

k=1: ~1% success (no chunking, pure compounding error)
k=100: ~44% success (sweet spot)
k=200-400: slight decline (too open-loop, can't react to perturbations)

Chunk Size vs Success Rate

The figure plots chunk size k against success rate, showing the tradeoff between compounding error reduction and reactivity loss.

Why does increasing the chunk size k beyond ~100 start to hurt performance?

The model runs out of memory
Longer open-loop execution reduces the policy's ability to react to unexpected changes
The training loss becomes unstable

Chapter 4: Temporal Ensembling

Naïve chunking queries the policy every k steps and switches abruptly between chunks. This creates jerky motion at chunk boundaries. ACT fixes this with a beautiful trick: query the policy at every timestep, creating overlapping chunks, then blend the overlapping predictions.

How it works

At every timestep t, the policy generates k actions a_t:t+k. At timestep t+1, it generates another k actions a_t+1:t+1+k. For timestep t+5 (say), we now have predictions from multiple chunks — each made at a different time with different observations.

The temporal ensemble combines these with an exponential weighting scheme:

a_t = ∑_i w_i A_t[i] / ∑_i w_i, where w_i = exp(−m · i)

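Here w₀ is the weight for the oldest prediction (made earliest), and m controls how quickly new observations are incorporated. Because w_i decays with i, older predictions receive more weight; a smaller m flattens the weights, so newer predictions count for relatively more and new information is incorporated faster.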
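The effect of m is easiest to see by printing the normalized weights for a few values. This is a small sketch of the weighting formula only; the function name and the chosen values of m are illustrative assumptions.

```python
import math


def ensemble_weights(num_predictions, m):
    """Normalized weights proportional to exp(-m * i); i = 0 is the oldest prediction."""
    raw = [math.exp(-m * i) for i in range(num_predictions)]
    total = sum(raw)
    return [w / total for w in raw]


for m in (0.01, 0.1, 1.0):
    print(m, [round(w, 3) for w in ensemble_weights(4, m)])
```

As m shrinks the weights approach a uniform average, so the newest prediction's share grows; as m grows the oldest prediction dominates.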

This is NOT typical smoothing: Standard temporal smoothing averages actions at adjacent timesteps, which can introduce bias (the smoothed action is systematically different from any of the originals). Temporal ensembling averages multiple predictions for the same timestep, made at different times, so every term in the average is an estimate of the same target rather than of a neighboring one.

Implementation details

ACT keeps FIFO buffers B[0:T], one per timestep of the episode (T is the episode length), where B[t] stores every action predicted for timestep t. At each timestep t, the new k-step prediction is appended to buffers B[t:t+k], one action per buffer. To get the action for timestep t, we take the weighted average of everything in B[t]. This adds no training cost, only a small amount of extra inference compute.
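The buffer scheme can be sketched as follows. This is not the authors' implementation: actions are scalars for brevity, `policy` and `env_step` are hypothetical callables, and the default value of m is an arbitrary placeholder.

```python
import math


def temporal_ensemble_rollout(policy, env_step, obs, episode_len, k, m=0.01):
    """Query the policy every step and blend all predictions made for the current step."""
    buffers = [[] for _ in range(episode_len)]  # buffers[t]: predictions for step t, oldest first
    executed = []
    for t in range(episode_len):
        chunk = policy(obs, k)  # k predicted actions for steps t .. t+k-1
        for offset, action in enumerate(chunk):
            if t + offset < episode_len:
                buffers[t + offset].append(action)
        preds = buffers[t]
        weights = [math.exp(-m * i) for i in range(len(preds))]  # i = 0 is the oldest
        action = sum(w * a for w, a in zip(weights, preds)) / sum(weights)
        executed.append(action)
        obs = env_step(obs, action)
    return executed


# Toy stand-ins: the policy echoes the observation, and the observation counts steps.
def echo_policy(obs, k):
    return [float(obs)] * k


def tick(obs, action):
    return obs + 1


# m=0.0 gives uniform weights, which makes the result easy to verify by hand.
actions = temporal_ensemble_rollout(echo_policy, tick, obs=0, episode_len=5, k=3, m=0.0)
print(actions)  # [0.0, 0.5, 1.0, 2.0, 3.0]
```

At step 2, for instance, the buffer holds the predictions made at steps 0, 1 and 2 (values 0, 1, 2), and their uniform average is 1.0.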

The ablation shows temporal ensembling adds about 3.3% to ACT's success rate — a meaningful gain when working at 80-90% levels. It helps parametric methods (ACT, BC-ConvMLP) but actually hurts VINN (a non-parametric retrieval method), because VINN returns ground-truth demonstration actions that don't need smoothing.

Temporal Ensembling

Three overlapping chunks (orange, teal, blue) predict actions for the same timestep. The weighted average (white) is smoother and more accurate than any individual chunk. The parameter m controls the blending speed.

Why is temporal ensembling different from (and better than) typical temporal smoothing?

It averages multiple independent predictions FOR the same timestep (no bias), rather than averaging actions at adjacent timesteps (which introduces systematic bias)
It uses exponential weights instead of uniform weights
It runs faster because it doesn't need extra computation

Chapter 5: CVAE Training

Human demonstrations are inherently noisy. Even a single demonstrator will hand off a tape segment at slightly different positions each time — there's no visual or haptic reference for exact consistency. A deterministic policy would try to average over all these variations, producing a "compromise" trajectory that corresponds to none of them.

ACT handles this by training as a Conditional Variational Autoencoder (CVAE). The idea: compress the variation across demonstrations into a low-dimensional latent variable z — the "style variable" — and let the policy condition on z to select one consistent trajectory.

The CVAE encoder (training only)

During training, the CVAE encoder sees both the current observation and the ground-truth action sequence. It compresses this into a distribution over z (diagonal Gaussian). This encoder learns to capture what varies across demonstrations for the same state — the "style" of execution.

For efficiency, the encoder only uses proprioceptive observations (joint positions), not images. This is a deliberate design choice: the style variation is about which strategy to use, not what's in the scene.

The CVAE decoder (the actual policy)

The decoder takes z, the current observations (images + joints), and produces the action chunk. At test time, we set z = 0 (the mean of the prior), making the policy deterministic. The CVAE training effectively teaches the decoder to produce the most "typical" behavior when z = 0.

The training objective

L = L_reconst + β · L_reg

L_reconst = L1(â_t:t+k, a_t:t+k)

L_reg = D_KL(q_φ(z | a_t:t+k, ō_t) ∥ N(0, I))

The reconstruction loss ensures the predicted actions match the demonstrations. The KL divergence term regularizes z toward a standard Gaussian prior. The hyperparameter β controls the information bottleneck — higher β means less information flows through z.
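The objective can be written out for a single flattened action chunk. This is a plain-Python sketch under stated assumptions: the reconstruction term uses L1 (following the note in the Architecture chapter; swap in a squared error if preferred), the encoder is assumed to output a log-variance, and the value of β in the example is arbitrary.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error over a flattened action chunk."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def cvae_loss(pred, target, mu, logvar, beta):
    """L = L_reconst + beta * L_reg."""
    return l1_loss(pred, target) + beta * kl_to_standard_normal(mu, logvar)


# The posterior equals the prior here, so only the reconstruction term contributes.
print(cvae_loss([1.0, 2.0], [1.5, 2.0], mu=[0.0], logvar=[0.0], beta=10.0))  # 0.25
```

Raising `beta` makes any deviation of the posterior from N(0, I) more expensive, which is the bottleneck effect described above.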

Why CVAE matters empirically: The ablation is stark. On scripted (deterministic) demonstrations, removing the CVAE makes almost no difference. On human demonstrations, success rate drops from 35.3% to 2%. The CVAE is the mechanism that lets ACT handle the inherent stochasticity of human behavior.

CVAE: Encoding Style Variation

Multiple demonstrations (colored curves) for the same task state vary in trajectory. The CVAE encoder compresses this variation into a latent z. At test time, z=0 produces the most typical trajectory (white). The KL weight β sets how tight the information bottleneck is and therefore how much the decoded trajectories can vary with z.

Why is the CVAE objective essential for learning from human demonstrations but unnecessary for scripted data?

Human demonstrations are stochastic: the latent z captures which variant to use, preventing the policy from averaging over incompatible strategies. Scripted data has no variation to capture.
The CVAE makes training faster on human data
Scripted data doesn't need a neural network at all

Chapter 6: Architecture

ACT uses transformers for both the CVAE encoder and decoder. Let's trace the data flow through the full system.

CVAE Encoder (training only)

Inputs: [CLS] token + joint positions + action sequence (k steps). That's a (k+2)×512 input to a BERT-style transformer encoder. The [CLS] output is projected to predict the mean μ and variance σ² of z's distribution. z is then sampled via the reparameterization trick.

CVAE Decoder = The Policy

The decoder has three stages:

Vision encoding: Four 480×640 RGB images → ResNet-18 backbones (one per camera) → each produces a 15×20×512 feature map, flattened to 300×512. With 2D sinusoidal position embeddings to preserve spatial info. Total: 1200×512 visual features.
Transformer encoder: Synthesizes the 1200 visual features + joint positions + style variable z. Input is 1202×512. This fuses all observation information into a rich representation.
Transformer decoder: Takes k fixed position embeddings as queries, cross-attends to the encoder output, and generates k×512 features. An MLP projects these to k×14 — the predicted joint positions for both arms over the next k steps.
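The token counts quoted in the three stages follow from simple shape bookkeeping. The sketch below assumes the usual total downsampling factor of 32 for a ResNet-18 final feature map; the function names are hypothetical.

```python
def feature_map_shape(height, width, stride=32, channels=512):
    """Spatial size of the final backbone feature map for one image."""
    return (height // stride, width // stride, channels)


def policy_token_counts(num_cameras=4, height=480, width=640):
    """Sequence lengths seen by the policy's transformer encoder."""
    h, w, _ = feature_map_shape(height, width)
    per_camera = h * w                # 15 * 20 = 300 tokens per camera
    visual = num_cameras * per_camera
    encoder_input = visual + 2        # plus one joint-position token and one z token
    return {"per_camera": per_camera, "visual": visual, "encoder_input": encoder_input}


print(feature_map_shape(480, 640))  # (15, 20, 512)
print(policy_token_counts())        # {'per_camera': 300, 'visual': 1200, 'encoder_input': 1202}
```

The decoder side adds k query positions, so its output is k×512 before the final projection to k×14.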

L1 not L2: ACT uses L1 loss for reconstruction instead of the more common L2. The authors found L1 leads to more precise action modeling. A likely reason: L2 penalizes large residuals quadratically, so its minimizer is the mean of the demonstrated actions (the averaging problem again), while the L1 minimizer is the median, which is less sensitive to occasional outliers and tends to give sharper predictions.
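A one-dimensional toy case shows the difference between the two losses. Suppose five demonstrations give a target value for the same state and one of them is an outlier; the best constant prediction under squared error is the mean, and under absolute error it is the median. The data values are made up for illustration.

```python
import statistics


def sum_sq(c, samples):
    """Total squared error of predicting the constant c."""
    return sum((s - c) ** 2 for s in samples)


def sum_abs(c, samples):
    """Total absolute error of predicting the constant c."""
    return sum(abs(s - c) for s in samples)


samples = [0.0, 0.0, 0.0, 0.0, 1.0]  # four demonstrations agree, one is an outlier

mean_pred = statistics.mean(samples)      # minimizer of the squared error
median_pred = statistics.median(samples)  # a minimizer of the absolute error

print(mean_pred, median_pred)
print(sum_sq(mean_pred, samples), sum_sq(median_pred, samples))
print(sum_abs(mean_pred, samples), sum_abs(median_pred, samples))
```

The squared-error optimum is pulled toward the outlier and matches none of the demonstrations, while the absolute-error optimum stays on the value most demonstrations share.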

Training and inference specs

The model has ~80M parameters. Training takes ~5 hours on a single RTX 2080 Ti (11GB). Inference: ~0.01 seconds per forward pass — fast enough for real-time 50Hz control with temporal ensembling.
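A quick budget check shows why 0.01 seconds is fast enough when the policy is queried on every control step. The helper name is hypothetical, and the calculation ignores camera capture and communication latency.

```python
def control_budget(control_hz, inference_s):
    """Return the control period and the fraction of it consumed by one forward pass."""
    period_s = 1.0 / control_hz
    return period_s, inference_s / period_s


period, fraction = control_budget(control_hz=50, inference_s=0.01)
print(period, fraction)
```

At 50Hz the period is 20ms, so a 10ms forward pass uses about half of each step's budget.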

ACT Architecture

Data flow through the CVAE decoder (policy). Four camera images are encoded via ResNet-18, fused with joints and z in a transformer encoder, then decoded into k actions via cross-attention.

Why does ACT use L1 loss instead of L2 loss for action prediction?

L1 produces sharper, more precise predictions; L2 penalizes outliers quadratically, which encourages averaging over demonstration variability
L1 is faster to compute
L1 doesn't require gradient computation

Chapter 7: Experiments

ACT is evaluated on 8 tasks: 6 real-world (with ALOHA) and 2 simulated (in MuJoCo). All require bimanual, fine-grained manipulation. The real-world tasks are trained with just 50 demonstrations each (~10 minutes of data).

The tasks

Slide Ziploc (88%): Grasp bag body, pinch slider, unzip. Transparent bag is hard to perceive.
Slot Battery (96%): Place battery in slot, push in against spring resistance. Left arm holds remote steady.
Open Cup (84%): Tip over condiment cup, nudge into left gripper, lift, pry open lid. Translucent cup.
Thread Velcro (20%): Pick up cable tie, grasp tail mid-air, insert through 3mm×25mm loop. Hardest task.
Prep Tape (64%): Grasp tape, cut, hand off mid-air, hang on box edge. 4-stage bimanual coordination.
Put On Shoe (92%): Grasp shoe, fit onto mannequin foot, secure velcro strap. Tight fitting requires coordination.

Baselines crushed

Four baselines: BC-ConvMLP, BET, RT-1, VINN. On the two detailed real-world tasks (Slide Ziploc, Slot Battery), all baselines achieve 0% final success. They can sometimes complete the first subtask but fail completely by the end due to compounding errors.

The pattern is consistent: baselines make progress on early subtasks but cascade failures make later subtasks impossible. ACT's chunking breaks this cascade.

Real-World Results

Success rates across 6 real-world tasks. ACT (teal) vs the best baseline BET (orange). Note that BET achieves 0% on 4 out of 6 tasks.

Thread Velcro, the humbling case: At 20% success, this is ACT's hardest task. Cumulative success drops sharply at each stage, 92% → 40% → 20%, so at most half of the attempts that clear one stage also clear the next. The failure modes are perceptual: the black cable tie against a black table has minimal contrast and occupies just a few pixels in the image. This points to a perception bottleneck more than a policy problem.
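The per-stage difficulty can be read off the cumulative numbers by dividing each stage's success rate by the previous one. A small sketch, with a hypothetical helper name:

```python
def conditional_stage_rates(cumulative):
    """Success rate of each stage given that all earlier stages succeeded."""
    rates = []
    previous = 1.0
    for rate in cumulative:
        rates.append(rate / previous)
        previous = rate
    return rates


rates = conditional_stage_rates([0.92, 0.40, 0.20])
print([round(r, 2) for r in rates])  # [0.92, 0.43, 0.5]
```

The first stage is nearly solved, while the second and third each succeed only about half the time once reached.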

Why do all baseline methods (BC-ConvMLP, BET, RT-1, VINN) achieve 0% final success on Slide Ziploc and Slot Battery?

Single-step policies suffer from compounding errors: they complete early subtasks but cascade failures make later subtasks impossible
The baseline architectures are too small to learn the tasks
The baselines can't process image observations

Chapter 8: Ablations

The paper runs careful ablations to isolate the contribution of each design choice. The results paint a clear picture of what matters and why.

Action chunking is generalizable

The authors augment two baselines with action chunking (BC-ConvMLP: increase output dimension; VINN: retrieve k actions). Both benefit enormously — showing that chunking isn't ACT-specific but a generally useful technique for imitation learning.

BC-ConvMLP goes from near-zero to ~25% success with chunking. VINN goes from near-zero to ~15%. ACT still outperforms both augmented baselines, but the gap is smaller — confirming that chunking is the primary driver of improvement.

CVAE: essential for human data, irrelevant for scripted data

The sharpest ablation in the paper: removing the CVAE objective drops success from 35.3% to 2% on human data, but makes almost no difference on scripted data. This isolates the CVAE's role: it is there to handle the stochasticity of human demonstrations.

50Hz is necessary

A user study with 6 participants comparing 5Hz vs 50Hz teleoperation shows 50Hz reduces task completion time by 38% on average (p < 0.001). Many recent works use 5-10Hz control for computational reasons. This study demonstrates that high-frequency control isn't a luxury — it's a necessity for fine manipulation.

Ablation Summary

Impact of each component. CVAE contribution shown separately for scripted vs human data.

Worked example — the full pipeline: At timestep t=200 of Slot Battery: (1) 4 cameras capture 480×640 images, 14-dim joint state recorded; (2) ResNet-18 encodes each image to 300×512; (3) transformer encoder fuses 1202 tokens; (4) transformer decoder generates k=100 actions; (5) temporal ensemble blends with previous predictions; (6) action at t=200 is executed via PID. Total: 10ms inference. Next timestep t=201: repeat. 50 demos, 5 hours training, 96% success.

In the ablation studies, what happens when you augment the simple BC-ConvMLP baseline with action chunking?

Performance improves dramatically, from near-zero to ~25%, showing that chunking is a generally beneficial technique, not specific to ACT's architecture
Performance stays the same because the architecture can't handle chunked outputs
Performance gets worse due to the higher-dimensional output

Chapter 9: Connections

What ACT built on

Behavioral Cloning (Pomerleau, 1988): The simplest imitation learning — supervised learning from state-action pairs. ACT inherits this simplicity but fixes the compounding error problem through chunking.

BET (Shafiullah et al., 2022): Behavior Transformers for multimodal policies. Discretizes the action space. ACT uses continuous actions with CVAE instead — critical for the sub-millimeter precision needed in fine manipulation.

IBC (Florence et al., 2021): Implicit behavioral cloning via energy-based models. Handles multimodality but suffers from training instability. ACT's CVAE achieves similar expressiveness with stable training.

What ACT enabled

ALOHA 2 / Mobile ALOHA (2024): Extended ALOHA to a mobile platform and demonstrated whole-body control — cooking, cleaning, opening doors. Same ACT algorithm, larger scale.

Diffusion Policy (Chi et al., 2023): A concurrent work that also predicts action sequences but uses diffusion-based generation instead of CVAE. Both papers independently discovered that predicting action chunks is the key to effective behavior cloning. Diffusion Policy uses receding horizon control (T_a < T_p) — conceptually similar to ACT's temporal ensembling.

pi-0 (Physical Intelligence, 2024): The robot foundation model builds directly on lessons from ACT/ALOHA: action chunking, high-frequency control, and low-cost data collection enable scaling to dozens of tasks across multiple robot types.

The broader impact: ACT/ALOHA proved that you don't need $100k robots and complex algorithms for fine manipulation. A $20k system with a simple CVAE + action chunking achieves 80-90% on tasks that stump every baseline. This democratized fine manipulation research — dozens of labs worldwide have now built ALOHA systems.

Cheat sheet

Core equation

π_θ(a_t:t+k | o_t, z) — predict k actions conditioned on observation and style z

Training loss

L = L1(â, a) + β · D_KL(q(z|a, ō) ∥ N(0,I))

Key params

k=100 (chunk size), 80M params, 50Hz control, 50 demos per task

Hardware

ALOHA: 2×ViperX + 2×WidowX + 4 webcams, under $20k total

Key finding

Action chunking + CVAE + temporal ensemble = 80-90% on fine manipulation from 10 min demos

What is the key shared insight between ACT and Diffusion Policy?

Both independently discovered that predicting action sequences (chunks) rather than single actions is the key to effective behavior cloning
Both use diffusion models for action generation
Both require expensive robot hardware
