Low-cost bimanual manipulation via ALOHA teleoperation and a CVAE-based imitation learning algorithm that predicts action chunks — achieving 80-90% success on tasks like battery insertion and cup opening from just 10 minutes of demonstrations.
Imagine teleoperating a robot to open a tiny condiment cup. You tip the cup over with the right gripper, nudge it into the left gripper, close gently, lift, then pry open the lid with a fingertip. Each motion requires millimeter precision. One wrong move and the cup flies off the table.
Now you want a robot to learn this from your demonstrations. The standard approach — behavioral cloning — trains a neural network to predict the next action given the current observation. Simple, right?
The problem is compounding errors. The policy makes a tiny mistake at timestep 1. Now at timestep 2, the robot is in a slightly wrong state — one it never saw in training. It makes a slightly bigger mistake. By timestep 50, the errors have cascaded and the robot is in a completely alien state, flailing helplessly.
Previous solutions had painful tradeoffs. DAgger requires an expert to provide corrections during rollouts — impractical with teleoperation. Noise injection during data collection makes demonstrations worse. Synthetic correction data only works with low-dimensional states.
And there's a second problem: non-Markovian demonstrations. Humans pause mid-task, vary their hand-off positions, and use different strategies for the same state. A single-step Markovian policy sees a state where the human paused and learns "do nothing" — then gets stuck forever.
Each timestep adds a small error, and the cumulative error eventually exceeds the task's precision tolerance. Fine manipulation (tight tolerance) fails much sooner than coarse manipulation.
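A toy model makes the tolerance argument concrete. The sketch below assumes every step adds the same fixed error and that nothing corrects it; real compounding errors can grow faster than this. The function name `steps_until_failure` and the numbers are illustrative, not taken from the paper.

```python
def steps_until_failure(error_per_step: float, tolerance: float,
                        max_steps: int = 10_000) -> int:
    """Return the first timestep at which accumulated error exceeds the tolerance.

    Toy assumption: every step adds the same error and nothing corrects it.
    Returns max_steps if the tolerance is never exceeded.
    """
    cumulative = 0.0
    for step in range(1, max_steps + 1):
        cumulative += error_per_step
        if cumulative > tolerance:
            return step
    return max_steps


# Same per-step error, two hypothetical tolerances (arbitrary units).
fine = steps_until_failure(error_per_step=0.25, tolerance=1.0)
coarse = steps_until_failure(error_per_step=0.25, tolerance=10.0)
print(fine, coarse)  # 5 41
```

With identical per-step error, the tight-tolerance task fails after 5 steps while the loose one survives 41.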
Here's the core idea, borrowed from neuroscience: humans don't plan one muscle twitch at a time. We group sequences of actions into chunks — "reach for cup," "grasp handle," "lift" — and execute each chunk as a single unit. This is called action chunking in psychology.
ACT applies this idea to robot learning. Instead of predicting one action at a time, $\pi_\theta(a_t \mid s_t)$, the policy predicts the next $k$ actions at once: $\pi_\theta(a_{t:t+k} \mid s_t)$.
This simple change has a profound effect: it reduces the effective horizon of the task by a factor of k. If an episode is 500 steps long and k = 100, the policy only needs to make 5 decisions instead of 500. Fewer decisions means fewer chances for errors to compound.
But there's a subtlety. If we only observe every k steps and execute k actions open-loop, the robot can't react to changes mid-chunk. The naive approach produces jerky, unresponsive behavior.
ACT solves this with two elegant tricks: (1) query the policy at every timestep (not every k steps), creating overlapping chunks, and (2) blend the overlapping predictions with temporal ensembling. More on this in Chapter 4.
And to handle the variability in human demonstrations — different people use different strategies for the same state — ACT trains as a Conditional VAE. A latent "style variable" z captures which strategy to use, while the policy focuses on executing it precisely. More in Chapter 5.
Before ACT can learn, it needs demonstrations. And demonstrations need a teleoperation system. Previous bimanual systems cost $100k+ (Shadow Robot, ABB YuMi, da Vinci surgical). ALOHA costs under $20k — comparable to a single research arm like a Franka Panda.
ALOHA uses two ViperX 6-DOF arms (~$5,600 each) as "followers" and two smaller WidowX arms (~$3,300 each) as "leaders." The operator backdrives the leader arms, and the followers mirror the motion via joint-space mapping.
Why joint-space mapping instead of task-space (IK-based) mapping? Two reasons that matter enormously for fine manipulation:
ALOHA records from four Logitech C922x webcams (480×640 RGB at 50Hz): two mounted on the followers' wrists for close-up views, one in front, and one on top. The wrist cameras are essential because the fixed cameras can't see what's happening between the gripper fingers during delicate operations.
Actions are recorded as the leader robot's joint positions (not the follower's), because the difference between leader and follower positions implicitly encodes the force being applied through the PID controller.
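To see why the leader's joints carry force information, here is a minimal sketch. It assumes a proportional-only controller with a made-up gain (the real system uses a PID loop), and it assumes the follower's joints are stored as the observation. All function names are hypothetical.

```python
def follower_targets(leader_q):
    """Joint-space mapping: each follower joint target is the matching leader joint."""
    return list(leader_q)


def p_control_torque(target_q, measured_q, kp=10.0):
    """Proportional term only, with a hypothetical gain kp (a stand-in for the PID)."""
    return [kp * (qt - q) for qt, q in zip(target_q, measured_q)]


def record_step(leader_q, follower_q):
    """Store follower joints as the observation (assumed) and leader joints as the action."""
    return {"observation": list(follower_q), "action": list(leader_q)}


# The last joint (say, the gripper) is blocked by an object, so the follower
# cannot reach the leader's position. The gap becomes a sustained torque.
leader_q = [0.0, 0.5, 0.30]
follower_q = [0.0, 0.5, 0.25]
torque = p_control_torque(follower_targets(leader_q), follower_q)
sample = record_step(leader_q, follower_q)
print(torque)
```

Had the follower's positions been recorded as actions, the 0.05 rad gap on the blocked joint, and with it the squeezing force, would be missing from the data.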
The leader-follower setup with joint-space mapping. The human backdrives the smaller leader arms; the larger followers mirror the motion.
Let's formalize the chunking idea. In standard behavioral cloning, the policy predicts one action per timestep: $\pi_\theta(a_t \mid s_t)$.
With action chunking at chunk size $k$, the policy predicts $k$ future actions at once: $\pi_\theta(a_{t:t+k} \mid s_t)$.
In the naïve version, you'd observe every k steps, generate k actions, execute them all, then observe again. This has a problem: it's fully open-loop within each chunk. If something changes mid-chunk, the robot can't react.
Consider a real task like Slide Ziploc: 500 timesteps at 50Hz (10 seconds). With k=1, the policy makes 500 sequential decisions. With k=100, only 5. The effective "episode length" for compounding error purposes is just 5 steps.
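The naive scheme can be sketched as a control loop that observes only once per chunk. The policy and environment below are one-dimensional toys invented for illustration; only the loop structure and the query count matter.

```python
def run_naive_chunking(policy, env_step, initial_obs, episode_len, k):
    """Observe, predict k actions, execute them all open-loop, then observe again."""
    obs = initial_obs
    t = 0
    num_queries = 0
    while t < episode_len:
        chunk = policy(obs, k)  # one decision yields k future actions
        num_queries += 1
        for action in chunk[: episode_len - t]:
            obs = env_step(obs, action)  # state changes, but the policy is not consulted
            t += 1
    return obs, num_queries


def toy_policy(obs, k):
    """Hypothetical policy: step toward zero by a fixed amount, repeated k times."""
    step = -0.01 if obs > 0 else 0.01
    return [step] * k


def toy_env_step(obs, action):
    """Hypothetical 1-D environment: the action is added to the state."""
    return obs + action


_, queries_k1 = run_naive_chunking(toy_policy, toy_env_step, 1.0, episode_len=500, k=1)
_, queries_k100 = run_naive_chunking(toy_policy, toy_env_step, 1.0, episode_len=500, k=100)
print(queries_k1, queries_k100)  # 500 5
```

The inner loop is where reactivity is lost: between queries, nothing the environment does can change the actions already committed.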
There's a tradeoff. Larger k means fewer decisions (less compounding error) but also longer open-loop execution (less reactivity). The paper's ablation shows a clear trend:
How chunk size k affects the tradeoff between compounding error reduction and reactivity loss.
Naïve chunking queries the policy every k steps and switches abruptly between chunks. This creates jerky motion at chunk boundaries. ACT fixes this with a beautiful trick: query the policy at every timestep, creating overlapping chunks, then blend the overlapping predictions.
At every timestep $t$, the policy generates $k$ actions $a_{t:t+k}$. At timestep $t+1$, it generates another $k$ actions $a_{t+1:t+1+k}$. For timestep $t+5$ (say), we now have predictions from multiple chunks, each made at a different time with different observations.
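A small helper shows which queries contribute a prediction for a given timestep. It assumes the slice convention that a chunk queried at time q covers q through q+k-1 (k actions in total). The name `chunks_covering` is illustrative.

```python
def chunks_covering(t: int, k: int) -> list:
    """Query timesteps whose k-step chunk includes timestep t.

    Assumed convention: a chunk queried at time q covers q, q+1, ..., q+k-1.
    """
    first = max(0, t - k + 1)  # episodes start at timestep 0
    return list(range(first, t + 1))


print(chunks_covering(t=5, k=3))            # [3, 4, 5]
print(len(chunks_covering(t=250, k=100)))   # 100
```

Once the episode is past its first k steps, every timestep has exactly k overlapping predictions available for blending.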
The temporal ensemble combines these with an exponential weighting scheme: $w_i = \exp(-m \cdot i)$.
Here $w_0$ is the weight for the oldest prediction (made earliest), and $m$ controls how quickly we incorporate new observations. A smaller $m$ means faster incorporation of new information.
ACT uses a FIFO buffer B of length T (the episode length). At each timestep t, the new k-step prediction is appended to buffers B[t:t+k]. To get the action for timestep t, we take the weighted average of everything in B[t]. This costs nothing in training time — only a small amount of extra inference compute.
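The buffer logic can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes weights of the form exp(-m * i), with i = 0 for the oldest prediction, and represents actions as plain lists of floats. The class name and the default value of m are hypothetical.

```python
import math


class TemporalEnsembler:
    """Per-timestep buffer of overlapping chunk predictions, blended by exponential weights."""

    def __init__(self, episode_len: int, m: float = 0.01):
        self.m = m  # hypothetical default; smaller m gives newer predictions more say
        self.buffer = [[] for _ in range(episode_len)]  # buffer[t]: predictions for t, oldest first

    def add_chunk(self, t: int, chunk):
        """Append the k-step prediction made at time t to buffer[t : t + k]."""
        for offset, action in enumerate(chunk):
            if t + offset < len(self.buffer):
                self.buffer[t + offset].append(action)

    def action_at(self, t: int):
        """Weighted average of every prediction stored for timestep t."""
        preds = self.buffer[t]
        if not preds:
            raise ValueError(f"no predictions stored for timestep {t}")
        weights = [math.exp(-self.m * i) for i in range(len(preds))]  # i = 0 is the oldest
        total = sum(weights)
        dim = len(preds[0])
        return [sum(w * p[d] for w, p in zip(weights, preds)) / total for d in range(dim)]


ens = TemporalEnsembler(episode_len=10, m=0.0)  # m = 0 reduces to a plain average
ens.add_chunk(0, [[0.0], [2.0]])  # chunk predicted at t=0 covers t=0 and t=1
ens.add_chunk(1, [[4.0]])         # chunk predicted at t=1 covers t=1
print(ens.action_at(1))  # [3.0]
```

Because predictions are appended in query order, the list index doubles as the age index i, so no timestamps need to be stored.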
The ablation shows temporal ensembling adds about 3.3% to ACT's success rate — a meaningful gain when working at 80-90% levels. It helps parametric methods (ACT, BC-ConvMLP) but actually hurts VINN (a non-parametric retrieval method), because VINN returns ground-truth demonstration actions that don't need smoothing.
Three overlapping chunks (orange, teal, blue) predict actions for the same timestep. The weighted average (white) is smoother and more accurate than any individual chunk. The parameter m controls blending speed.
Human demonstrations are inherently noisy. Even a single demonstrator will hand off a tape segment at slightly different positions each time — there's no visual or haptic reference for exact consistency. A deterministic policy would try to average over all these variations, producing a "compromise" trajectory that corresponds to none of them.
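A tiny numeric example illustrates the averaging failure. The two demonstrations below are invented: each records a lateral offset at three timesteps, one veering to one side and one to the other.

```python
# Two hypothetical demonstrations of the same hand-off, as lateral offsets over time.
demos = [
    [0.0, 1.0, 0.0],   # demonstrator veers one way
    [0.0, -1.0, 0.0],  # demonstrator veers the other way
]

# A deterministic policy trained with a squared-error loss tends toward the per-step mean.
mean_trajectory = [sum(values) / len(values) for values in zip(*demos)]
print(mean_trajectory)  # [0.0, 0.0, 0.0]
```

The averaged trajectory goes straight down the middle, a path that neither demonstration ever took.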
ACT handles this by training as a Conditional Variational Autoencoder (CVAE). The idea: compress the variation across demonstrations into a low-dimensional latent variable z — the "style variable" — and let the policy condition on z to select one consistent trajectory.
During training, the CVAE encoder sees both the current observation and the ground-truth action sequence. It compresses this into a distribution over z (diagonal Gaussian). This encoder learns to capture what varies across demonstrations for the same state — the "style" of execution.
For efficiency, the encoder only uses proprioceptive observations (joint positions), not images. This is a deliberate design choice: the style variation is about which strategy to use, not what's in the scene.
The decoder takes z, the current observations (images + joints), and produces the action chunk. At test time, we set z = 0 (the mean of the prior), making the policy deterministic. The CVAE training effectively teaches the decoder to produce the most "typical" behavior when z = 0.
The reconstruction loss ensures the predicted actions match the demonstrations. The KL divergence term regularizes z toward a standard Gaussian prior. The hyperparameter β controls the information bottleneck — higher β means less information flows through z.
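The objective described above can be sketched as a reconstruction term plus a β-weighted KL term. This sketch assumes an L1 reconstruction loss and a log-variance parameterization of the diagonal Gaussian; both are assumptions made for illustration, as are the function names and the β value in the example.

```python
import math


def l1_reconstruction(pred_chunk, target_chunk):
    """Mean absolute error over every joint of every action in the chunk (L1 is assumed)."""
    diffs = [abs(p - t)
             for pred_a, target_a in zip(pred_chunk, target_chunk)
             for p, t in zip(pred_a, target_a)]
    return sum(diffs) / len(diffs)


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def cvae_loss(pred_chunk, target_chunk, mu, log_var, beta):
    """Reconstruction term plus beta-weighted KL regularizer."""
    return (l1_reconstruction(pred_chunk, target_chunk)
            + beta * kl_to_standard_normal(mu, log_var))


loss = cvae_loss(pred_chunk=[[1.0, 2.0]], target_chunk=[[0.0, 0.0]],
                 mu=[1.0], log_var=[0.0], beta=10.0)
print(loss)  # 6.5
```

Raising `beta` makes any deviation of z's distribution from the prior more expensive, which is how a higher β squeezes the information bottleneck.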
Multiple demonstrations (colored curves) for the same task state vary in trajectory. The CVAE encoder compresses this variation into a latent z. At test time, z=0 produces the most typical trajectory (white). The β setting determines how strongly the information bottleneck shapes the decoded trajectories.
ACT uses transformers for both the CVAE encoder and decoder. Let's trace the data flow through the full system.
Inputs: [CLS] token + joint positions + action sequence (k steps). That's a (k+2)×512 input to a BERT-style transformer encoder. The [CLS] output is projected to predict the mean μ and variance σ² of z's distribution. z is then sampled via the reparameterization trick.
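The encoder's input size and the two ways of obtaining z (sampled during training, fixed to zero at test time) can be sketched as follows. The log-variance parameterization, the latent dimension used in the example, and all function names are assumptions for illustration.

```python
import math
import random


def encoder_input_shape(k: int, d_model: int = 512):
    """One [CLS] token, one joint-position token, and k action tokens."""
    return (k + 2, d_model)


def sample_z(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def inference_z(latent_dim: int):
    """At test time z is fixed to the prior mean, so the encoder is not needed."""
    return [0.0] * latent_dim


rng = random.Random(0)
print(encoder_input_shape(k=100))  # (102, 512)
z_train = sample_z(mu=[0.1, -0.2], log_var=[0.0, 0.0], rng=rng)
z_test = inference_z(latent_dim=2)  # hypothetical latent size
print(len(z_train), z_test)
```

Writing the sample as `mu + sigma * eps` keeps the randomness in `eps`, so gradients can flow through `mu` and `sigma` during training.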
The decoder has three stages:
The model has ~80M parameters. Training takes ~5 hours on a single RTX 2080 Ti (11GB). Inference: ~0.01 seconds per forward pass — fast enough for real-time 50Hz control with temporal ensembling.
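The real-time claim is a simple budget check: at 50Hz each control step lasts 0.02 seconds, so a 0.01 second forward pass fits with time to spare. The helper names below are illustrative, and the check ignores camera and communication latency.

```python
def control_period_s(control_hz: float) -> float:
    """Time available per control step, in seconds."""
    return 1.0 / control_hz


def fits_realtime(inference_s: float, control_hz: float) -> bool:
    """True if one forward pass finishes within a single control step."""
    return inference_s <= control_period_s(control_hz)


print(control_period_s(50), fits_realtime(0.01, 50))  # 0.02 True
```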
Data flow through the CVAE decoder (policy). Four camera images are encoded via ResNet-18, fused with joints and z in a transformer encoder, then decoded into k actions via cross-attention.
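A rough count of the tokens entering the decoder's transformer encoder follows from the image size. The sketch assumes the standard ResNet-18 total downsampling factor of 32, one token per spatial location of the final feature map, and one token each for the joint positions and z. These are assumptions for illustration, and the function names are hypothetical.

```python
def image_tokens(height: int, width: int, stride: int = 32) -> int:
    """Spatial locations in the final feature map, assuming a total stride of 32."""
    return (height // stride) * (width // stride)


def decoder_context_tokens(num_cameras: int, height: int, width: int,
                           extra_tokens: int = 2) -> int:
    """Image tokens from all cameras plus one token each for joints and z (assumed)."""
    return num_cameras * image_tokens(height, width) + extra_tokens


print(image_tokens(480, 640))               # 300
print(decoder_context_tokens(4, 480, 640))  # 1202
```

Under these assumptions, the images account for nearly all of the context the action queries attend to through cross-attention.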
ACT is evaluated on 8 tasks: 6 real-world (with ALOHA) and 2 simulated (in MuJoCo). All require bimanual, fine-grained manipulation. The real-world tasks are trained with just 50 demonstrations each (~10 minutes of data).
Four baselines: BC-ConvMLP, BET, RT-1, VINN. On the two detailed real-world tasks (Slide Ziploc, Slot Battery), all baselines achieve 0% final success. They can sometimes complete the first subtask but fail completely by the end due to compounding errors.
The pattern is consistent: baselines make progress on early subtasks but cascade failures make later subtasks impossible. ACT's chunking breaks this cascade.
Success rates across 6 real-world tasks. ACT (teal) vs the best baseline BET (orange). Note that BET achieves 0% on 4 out of 6 tasks.
The paper runs careful ablations to isolate the contribution of each design choice. The results paint a clear picture of what matters and why.
The authors augment two baselines with action chunking (BC-ConvMLP: increase output dimension; VINN: retrieve k actions). Both benefit enormously — showing that chunking isn't ACT-specific but a generally useful technique for imitation learning.
BC-ConvMLP goes from near-zero to ~25% success with chunking. VINN goes from near-zero to ~15%. ACT still outperforms both augmented baselines, but the gap is smaller — confirming that chunking is the primary driver of improvement.
The sharpest ablation in the paper: removing the CVAE objective drops success from 35.3% to 2% on human data, but makes almost no difference on scripted data. This isolates the CVAE's role: it exists to handle the stochasticity of human demonstrations.
A user study with 6 participants comparing 5Hz vs 50Hz teleoperation shows 50Hz reduces task completion time by 38% on average (p < 0.001). Many recent works use 5-10Hz control for computational reasons. This study demonstrates that high-frequency control isn't a luxury — it's a necessity for fine manipulation.
Impact of each component. CVAE contribution shown separately for scripted vs human data.
Behavioral Cloning (Pomerleau, 1988): The simplest imitation learning — supervised learning from state-action pairs. ACT inherits this simplicity but fixes the compounding error problem through chunking.
BET (Shafiullah et al., 2022): Behavior Transformers for multimodal policies. Discretizes the action space. ACT uses continuous actions with CVAE instead — critical for the sub-millimeter precision needed in fine manipulation.
IBC (Florence et al., 2021): Implicit behavioral cloning via energy-based models. Handles multimodality but suffers from training instability. ACT's CVAE achieves similar expressiveness with stable training.
ALOHA 2 / Mobile ALOHA (2024): Extended ALOHA to a mobile platform and demonstrated whole-body control — cooking, cleaning, opening doors. Same ACT algorithm, larger scale.
Diffusion Policy (Chi et al., 2023): A concurrent work that also predicts action sequences but uses diffusion-based generation instead of a CVAE. Both papers independently discovered that predicting action chunks is the key to effective behavioral cloning. Diffusion Policy uses receding horizon control ($T_a < T_p$), which is conceptually similar to ACT's temporal ensembling.
pi-0 (Physical Intelligence, 2024): The robot foundation model builds directly on lessons from ACT/ALOHA: action chunking, high-frequency control, and low-cost data collection enable scaling to dozens of tasks across multiple robot types.