A teardown of the architectures, losses, and training recipes that move modern manipulators — from behavior cloning's first sin to flow-matched VLAs and the pixel-RL renaissance. Written for the new grad, the staff engineer, and the director who needs to know which abstraction is load-bearing.
Before you can appreciate what learning buys you, you need to understand what classical control gives you for free — and where it breaks.
The robot as a dynamical system
A robot arm is a chain of rigid bodies connected by joints. Each joint has a motor. The motor applies a torque $\tau$, which accelerates the joint angle $q$. The arm's links have mass, they are spinning, and gravity is always pulling them down. Every controller must fight all three forces simultaneously.
In plain English: a robot arm is heavy, it is moving, and gravity pulls it down. The controller has to fight all three at every instant.
The equation of motion for a robot with $n$ joints:
$$ M(q)\,\ddot{q} + C(q, \dot{q})\,\dot{q} + g(q) = \tau $$
$q \in \mathbb{R}^n$ — the joint angle vector. For a 7-DOF Franka, $q = (q_1, \ldots, q_7)$. Each $q_i$ is the angle of one revolute joint.
$\dot{q}, \ddot{q}$ — the joint velocity and acceleration. $\dot{q}$ is how fast each joint is currently moving; $\ddot{q}$ is the acceleration we are trying to produce.
$M(q) \in \mathbb{R}^{n \times n}$ — the mass (inertia) matrix. Encodes how heavy and how far from the joint each link is. It is symmetric, positive-definite, and configuration-dependent: the same torque produces different accelerations depending on the arm's pose. When the arm is stretched out, the effective inertia is large; when folded close, it is small.
$C(q, \dot{q})\dot{q} \in \mathbb{R}^n$ — the Coriolis and centrifugal forces. These are velocity-dependent terms that arise because the arm is a system of coupled rotating bodies. When joint 2 spins fast, it creates a centrifugal force that tries to fling joint 3 outward. These forces are zero when the arm is stationary and grow quadratically with speed.
$g(q) \in \mathbb{R}^n$ — the gravity torque vector. The torque that gravity exerts on each joint, computed from the mass distribution and the current configuration. A horizontal arm at full extension feels maximum gravity torque; a vertical arm feels almost none.
$\tau \in \mathbb{R}^n$ — the commanded joint torques. This is what the controller sends to the motors. The equation says: the torques must equal the sum of inertial, velocity-coupling, and gravitational loads.
This equation is exact for rigid bodies. If you know $M$, $C$, and $g$ perfectly, you can compute the exact torque needed for any desired motion using computed torque control: $\tau = M(q)\ddot{q}_{\text{des}} + C(q, \dot{q})\dot{q} + g(q)$, which cancels all nonlinearities and reduces the system to $\ddot{q} = \ddot{q}_{\text{des}}$. This is the gold standard of classical control — and it requires a perfect model.
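The cancellation can be checked numerically. The sketch below is a minimal illustration, assuming a single joint with constant inertia and a gravity torque proportional to $\sin(q)$ (so the Coriolis term vanishes); the function names and the numeric values are hypothetical, chosen only for the demonstration.

```python
import numpy as np

# Assumed 1-DOF plant: INERTIA * qdd + K_G * sin(q) = tau (values are illustrative)
INERTIA, K_G = 0.5, 2.0

def computed_torque(q, qdd_des, inertia_hat=INERTIA, k_g_hat=K_G):
    """tau = M_hat * qdd_des + g_hat(q); the Coriolis term is zero for one joint."""
    return inertia_hat * qdd_des + k_g_hat * np.sin(q)

def forward_dynamics(q, tau):
    """Solve the true plant equation for the resulting acceleration."""
    return (tau - K_G * np.sin(q)) / INERTIA

q, qdd_des = 0.7, 1.5

# Perfect model: the nonlinearity cancels and qdd equals qdd_des
qdd_exact = forward_dynamics(q, computed_torque(q, qdd_des))

# Inertia underestimated by 20%: the cancellation is no longer exact
qdd_wrong = forward_dynamics(q, computed_torque(q, qdd_des, inertia_hat=0.4))

print(f"perfect model: {qdd_exact:.3f}, wrong inertia: {qdd_wrong:.3f}")
```

With the exact model the achieved acceleration matches the request; with the inertia misestimated it does not, which is the sense in which computed torque control requires a perfect model.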
PID control — the classical workhorse
When you do not have a perfect model (you never do), you fall back to PID control. Define the error as $e(t) = q_{\text{des}}(t) - q(t)$, the difference between where you want the joint and where it is. The PID controller computes a torque command as:
$$ \tau(t) = K_p \, e(t) + K_i \int_0^t e(s) \, ds + K_d \, \dot{e}(t) $$
$K_p \, e(t)$ — the proportional term. Push harder when the error is larger. If the joint is 10 degrees off, push 10 times harder than if it is 1 degree off. The gain $K_p$ sets the stiffness.
$K_i \int_0^t e(s) \, ds$ — the integral term. Accumulates past error. If there is a persistent offset (say, gravity pulling the joint down by 2 degrees that $K_p$ alone cannot overcome), the integral term slowly builds up torque until the steady-state error vanishes. Too much $K_i$ and the system winds up and oscillates.
$K_d \, \dot{e}(t)$ — the derivative term. Damps oscillation by opposing rapid changes in error. As the joint approaches the target and the error shrinks quickly, $\dot{e}$ is large and negative, so $K_d$ acts as a brake. Without it, a stiff ($K_p$-heavy) controller overshoots and rings.
Worked example: 1-DOF joint tracking a sine wave. Consider a single joint with inertia $I = 0.5\,\text{kg m}^2$ and gravity torque $\tau_g = 2.0 \sin(q)$. The desired trajectory is $q_{\text{des}}(t) = \sin(2\pi \cdot 0.5 \cdot t)$ — a 0.5 Hz sine wave.
With $K_p = 50, K_i = 0, K_d = 10$: the joint tracks smoothly but carries a small residual error throughout the swing, because proportional action alone cannot fully compensate gravity at all angles.
With $K_p = 200, K_i = 0, K_d = 5$: the high proportional gain overshoots every peak and oscillates for ~200ms before settling. The low derivative gain cannot damp the ringing fast enough.
With $K_p = 50, K_i = 20, K_d = 10$: the integral term eliminates the gravity offset, but if the desired trajectory changes abruptly (a step input), the accumulated integral "winds up" and causes a large overshoot.
The lesson: PID tuning is a manual balancing act for every joint, every payload, every speed regime. Change the object in the gripper and the gains are wrong again.
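The three gain settings above can be explored with a short simulation. This is a sketch under stated assumptions, not the article's own experiment: the plant is taken to be $I\ddot{q} + 2\sin(q) = \tau$, the integrator is semi-implicit Euler with a 1 ms step, and the function name `simulate_pid` is hypothetical. It reports only the RMS tracking error; the exact numbers depend on these choices.

```python
import math

def simulate_pid(kp, ki, kd, inertia=0.5, dt=1e-3, duration=10.0):
    """Return RMS tracking error (rad) for a 1-DOF joint following a 0.5 Hz sine."""
    q, qd, integral, sq_err = 0.0, 0.0, 0.0, 0.0
    n_steps = round(duration / dt)
    for k in range(n_steps):
        t = k * dt
        q_des = math.sin(math.pi * t)             # sin(2*pi*0.5*t)
        qd_des = math.pi * math.cos(math.pi * t)  # its time derivative
        e, e_dot = q_des - q, qd_des - qd
        integral += e * dt
        tau = kp * e + ki * integral + kd * e_dot
        # Assumed plant: inertia * qdd + 2*sin(q) = tau
        qdd = (tau - 2.0 * math.sin(q)) / inertia
        qd += qdd * dt   # semi-implicit Euler: update velocity first,
        q += qd * dt     # then position with the new velocity
        sq_err += e * e
    return math.sqrt(sq_err / n_steps)

gain_sets = {
    "soft PD": (50, 0, 10),
    "stiff PD": (200, 0, 5),
    "PID": (50, 20, 10),
}
results = {name: simulate_pid(*gains) for name, gains in gain_sets.items()}
for name, rms in results.items():
    print(f"{name:9s} RMS error = {rms:.4f} rad")
```

RMS error on a smooth sine hides the overshoot and windup behavior described above; swapping the reference for a step input is the quickest way to see those effects.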
Where classical control hits a wall
PID + computed torque + gravity compensation can produce beautiful motion in controlled environments. Industrial robots in factories prove this every day. But four failure modes are structural — no amount of gain tuning fixes them:
Contact. When the robot touches something, the dynamics change discontinuously. The mass matrix $M(q)$ now includes the object. The constraint forces are impulsive. No smooth model covers the transition from free motion to rigid contact to sliding to release. Classical controllers either go stiff (dangerous) or go compliant (imprecise).
Deformable objects. Cloth, rope, dough, cables — these have effectively infinite-DOF dynamics. You cannot write $M(q)$ for a towel. The state space is so large that analytical models are intractable. Every fold pattern is a different dynamical regime.
Partial observability. Classical controllers assume full state knowledge: joint angles, velocities, and the full world model. But a real robot cannot see behind objects, cannot measure the mass of an unknown payload until it lifts it, cannot sense whether a drawer is locked or just stiff. The controller needs information that no sensor provides.
Task diversity. Each task requires its own controller, its own gain schedule, its own state machine with hand-coded transitions. A pick-and-place controller cannot pour water. A pouring controller cannot fold cloth. Scaling to 100 tasks means 100 hand-engineered controllers — 100 person-months of tuning that transfer nothing between them.
Classical control excels when you have an accurate model. Learning excels when you do not. Modern robot learning is not replacing classical control — it is filling the gap where models fail. The best deployed systems in 2026 are hybrids: a learned policy for high-level decisions and contact-rich manipulation, sitting on top of a classical low-level controller that handles joint-space PD regulation and safety limits.
| Dimension | Classical control | Learned policies |
| --- | --- | --- |
| Accuracy (known dynamics) | Millimeter-precise, provably optimal | Sub-optimal; limited by data and generalization |
| Generalization | Zero; each system is hand-tuned | Transfers across objects, scenes, embodiments |
| Model requirement | Full analytical model (URDFs, mass, friction) | None; learns from data |
| Contact handling | Fragile; requires mode-switching logic | Handles contact implicitly from demonstrations |
| Task diversity | One controller per task | One policy, many tasks (with enough data) |
| Safety guarantees | Formal (Lyapunov stability, bounds) | Empirical only; no worst-case guarantees |
| Data requirement | Physics knowledge (free but expert-intensive) | Demonstrations or sim experience (expensive) |
Interactive: PID tuning
Drag the P, I, and D sliders to control a simulated 1-DOF joint tracking a step setpoint. Watch how the gains trade off rise time, overshoot, and steady-state error. The amber line is the setpoint; the blueprint line is the actual joint angle; the terracotta fill shows the error.
Notice the tradeoffs. Crank $K_p$ to 200 and the joint overshoots violently. Add $K_d = 30$ and it damps, but at the cost of sluggish response. Add $K_i$ and the steady-state error vanishes, but the system winds up on sudden changes. Now imagine doing this for 7 joints simultaneously, with a changing payload, on a task the controller was never designed for. That is why we need learning.
01 The problem
A robot policy is a function from sensors to motor commands. The interesting part is everything that word "function" hides.
Formally, a robot operates inside a partially-observed Markov decision process. The world has a true state $s_t$; the robot sees an observation $o_t$ that is a noisy, lossy projection of it. At each tick the robot picks an action $a_t$, the world transitions to $s_{t+1}$ via dynamics $p(s_{t+1} \mid s_t, a_t)$, and (sometimes) emits a reward $r_t$. A policy $\pi(a_t \mid o_t, h_t)$ — possibly stateful via history $h_t$ — maps observations to actions. The goal is a policy that achieves the task, defined either by a reward function the policy maximizes (RL) or by a dataset of demonstrations the policy imitates (BC), or both.
Unpacking the POMDP
Each of those symbols hides an enormous amount of engineering reality. Let us make them concrete for a 7-DOF robot arm picking up a mug from a table.
State $s_t$ is the full physical truth of the world at time $t$. For our mug task it includes: the 7 joint angles and 7 joint velocities of the arm (14 numbers), the end-effector 6-DOF pose (6 numbers), the gripper finger width (1 number), the mug's 6-DOF pose (6 numbers), the mug's mass and friction coefficients, the positions of every other object on the table, contact forces at every contact point, and the deformation state of any soft objects. In principle the state also includes air currents, table vibration, and the thermal expansion of the links. In total, $s_t$ could be hundreds or thousands of dimensions. No sensor suite measures all of it.
Observation $o_t$ is what the robot actually sees. Typically: a 480×640×3 RGB image from a camera (or two), plus a proprioception vector $q_t \in \mathbb{R}^{14}$ (joint positions and velocities) and a gripper width scalar. The camera image is a lossy 2D projection of the 3D scene: it loses depth, cannot see behind objects, is affected by lighting and reflections, and compresses the mug's full 6-DOF pose into a pattern of pixels that a vision encoder must learn to decode. Objects that are occluded by the robot's own arm simply vanish from the observation. This is why we say the process is partially observed: the observation is a lossy, noisy function of the state, and many distinct states produce the same observation.
Action $a_t$ is the motor command the policy sends to the robot at each control tick. Common choices include: target joint positions $q_{\text{target}} \in \mathbb{R}^7$ (the low-level PD controller handles the torques), end-effector pose deltas $\Delta T \in \mathbb{R}^6$, or raw joint torques $\tau \in \mathbb{R}^7$. The choice of action space is a structural commitment that cascades through the entire stack (section 02).
MDP vs. POMDP. In a Markov decision process (MDP), the agent observes the full state: $o_t = s_t$. The current observation is sufficient to make an optimal decision. In a partially-observed MDP (POMDP), the agent sees only a projection: $o_t = g(s_t) + \text{noise}$. The current observation is not sufficient — the agent needs memory (history $h_t$) to infer hidden state. All real robots live in POMDPs. A camera cannot see behind the mug. Proprioception cannot measure the mug's mass. Force sensors at the wrist cannot distinguish a stuck lid from a heavy lid. The correct response to partial observability is either history conditioning (feed the last $k$ observations) or explicit belief tracking (maintain a distribution over hidden state). Modern policies use history conditioning because it is cheap and effective.
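History conditioning is usually implemented as a small rolling buffer in front of the policy. The sketch below is one plausible way to do it, assuming flat observation vectors; the class name `ObservationHistory` and its methods are hypothetical, not taken from any particular library.

```python
from collections import deque

import numpy as np

class ObservationHistory:
    """Keeps the last k observations and returns them as one stacked array."""

    def __init__(self, k):
        self.k = k
        self.buffer = deque(maxlen=k)  # old entries fall off automatically

    def reset(self, first_obs):
        # Pad with copies of the first observation so the output shape
        # is fixed from the very first control step.
        self.buffer.clear()
        for _ in range(self.k):
            self.buffer.append(np.asarray(first_obs, dtype=np.float32))

    def push(self, obs):
        self.buffer.append(np.asarray(obs, dtype=np.float32))
        return np.stack(list(self.buffer))  # shape (k, obs_dim), oldest first

history = ObservationHistory(k=3)
history.reset([0.0, 0.0])
stacked = history.push([1.0, 2.0])
print(stacked.shape)  # (3, 2)
```

The policy then consumes the stacked array in place of a single observation; belief tracking would replace this buffer with an explicit state estimator.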
That's the textbook frame. The reasons robot learning is hard are not in the textbook:
Compounding error. A small mistake at step $t$ moves the robot to a state $s_{t+1}$ slightly outside the training distribution, so the next action is worse, and so on. The error grows in the worst case as $O(T^2)$ in horizon $T$ for naive behavior cloning — a result that is the original sin of the field.
Multimodality. Humans demonstrating the same task move differently. Averaging two valid trajectories produces an invalid one (think: two ways around an obstacle, averaged, hits the obstacle). A policy that regresses to the mean is broken.
Non-stationary distributions. The robot's actions shape its future observations. This is the difference between supervised learning, where the data is fixed, and any learning paradigm where the policy is a participant in data generation.
The reality gap. Simulators are fast, free, and wrong. Real robots are slow, expensive, and right. Bridging the two is the central engineering problem of the field.
Tight latency budgets. A 200ms control loop is luxurious; many tasks need 50ms or less. Architectures compete on inference latency, not just success rate.
Scale of the challenge
To put these numbers in perspective: a table-top pick-and-place runs ~100 steps at 10Hz (a 10-second episode). The action space is 7-dimensional (a 6-DoF end-effector target + gripper). Two cameras produce 480×640 RGB images per step. A single demonstration yields ~100 observation-action pairs.
To train a reliable policy, you need 50–200 demonstrations of one task. At 5 minutes per demonstration (setup, teleop, reset), that is roughly 4–17 hours of human time. A general-purpose robot needing 100 tasks requires 5,000–20,000 demonstrations. This is why data efficiency is the bottleneck, and why pre-trained VLAs that share representations across tasks are so important.
| Domain | Training data | Feedback loop | Failure cost |
| --- | --- | --- | --- |
| Image classification | Millions of labeled images (cheap) | None (iid data) | Wrong label (harmless) |
| Language modeling | Trillions of tokens (free) | None (iid data) | Bad text (harmless) |
| Game RL (Atari, Go) | Billions of sim steps (free) | Yes, in simulation | Lost game (harmless) |
| Robot learning | Hundreds of demos ($$$) | Yes, in physics | Broken robot ($10K+) |
Figure legend: expert (stays on manifold); linear, $O(\epsilon T)$; quadratic, $O(\epsilon T^2)$.
Stated bluntly: a 99%-accurate per-step naive BC policy on a 200-step task fails reliably, because the second-order growth term dominates. Action chunking, DAgger-style correction, and chunked / receding-horizon control are all attacks on the same root cause — they break the feedback loop between policy errors and state distribution shift.
A robot policy is not a classifier. It is a controller embedded in a feedback loop with physics, and almost every architectural choice in modern robot learning is a response to that loop.
Derivation: the $O(\epsilon T^2)$ compounding-error bound
This result, due to Ross and Bagnell (2010), is foundational. Let us derive it step by step.
Setup. Let $\pi^*$ be the expert policy and $\hat\pi$ be the learned policy. Both operate over a horizon of $T$ steps. At each step, $\hat\pi$ makes an error with probability at most $\epsilon$ — meaning the per-step probability of deviating from the expert's action satisfies $\Pr[\hat\pi(o_t) \neq \pi^*(o_t)] \leq \epsilon$.
Step 1: define the distribution shift. Let $d_t^*$ be the state distribution at time $t$ under the expert, and $d_t^{\hat\pi}$ under the learned policy. At $t=0$ they are identical (same start state). At each subsequent step, a mistake by $\hat\pi$ can move the state off the expert's distribution.
Step 2: bound the deviation per step. At step $t$, the total variation distance between the two state distributions satisfies:
Per-step distribution shift
$$ \| d_t^{\hat\pi} - d_t^* \|_{\text{TV}} \leq t \cdot \epsilon $$
This follows by induction. At $t=0$ the distance is zero. Each step, the policy either follows the expert (no additional shift) or deviates (shift grows by at most $\epsilon$). After $t$ steps, the cumulative shift is at most $t\epsilon$.
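Step 3: sum the cost over the horizon. The expected cost (number of mistakes) at step $t$ has two components: (1) the probability $\epsilon$ of a direct mistake even when on-distribution, and (2) the additional cost from being off-distribution, which is at most $t\epsilon$ (since on unfamiliar states, the policy might always err). Thus:
$$ \mathbb{E}[\text{mistakes}] \leq \sum_{t=0}^{T-1} \left( \epsilon + t\,\epsilon \right) = \epsilon T + \epsilon \, \frac{T(T-1)}{2} \approx \epsilon T + \frac{\epsilon T^2}{2} $$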
What this means for your system: The $O(\epsilon T^2)$ bound is the reason you cannot just "train a better model" and expect manipulation to work. A 200-step task with 1% per-step error gives ~200 expected mistakes — certain failure. In practice, this bound forces three architectural responses: action chunking (reduces effective $T$ by predicting $H$ steps at once), DAgger / on-policy correction (reduces the bound to $O(\epsilon T)$ by eliminating distribution shift), or multimodal action heads (reduces $\epsilon$ by not averaging conflicting modes). Every architecture choice in sections 05–14 is an attack on one of these three factors. When you see a new robot learning paper, ask: "which term in this bound does it shrink?"
The $\epsilon T$ term is what you would get if distribution shift did not exist — just per-step error accumulated linearly. The $\epsilon T^2 / 2$ term is the compounding penalty. For $T = 200$ and $\epsilon = 0.01$, the linear term gives 2 expected mistakes but the quadratic term gives 199 — completely dominating.
Worked example. A policy is 99% accurate per step ($\epsilon = 0.01$) on a $T = 200$ step task. The compounding bound gives $\mathbb{E}[\text{mistakes}] \leq 0.01 \times 200 + 0.01 \times \frac{200 \times 199}{2} = 2 + 199 = 201$. The bound exceeds the number of steps, so in the worst case the policy can err at every step and nothing guarantees task success. This is the problem DAgger addresses: by keeping $\hat\pi$'s state distribution close to training, it reduces the bound to $O(\epsilon T)$, eliminating the quadratic term by correcting the distribution shift online.
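The arithmetic is easy to reproduce and to vary. The sketch below evaluates the two terms of the bound, $\epsilon T$ and $\epsilon T(T-1)/2$, and then re-evaluates them with a shorter effective horizon. The chunked case is an illustrative assumption only: it treats each chunk of $H = 16$ actions as one decision with the same error rate $\epsilon$, which real systems will not satisfy exactly. The function name `compounding_bound` is hypothetical.

```python
import math

def compounding_bound(eps, horizon):
    """Return (linear, quadratic) terms of the naive-BC mistake bound."""
    linear = eps * horizon
    quadratic = eps * horizon * (horizon - 1) / 2
    return linear, quadratic

# Worked example from the text: eps = 0.01, T = 200
lin, quad = compounding_bound(0.01, 200)
print(f"T=200:   linear={lin:.2f}  quadratic={quad:.2f}  total={lin + quad:.2f}")

# Assumed chunking model: H = 16 actions per decision, same eps per decision
H = 16
t_eff = math.ceil(200 / H)
lin_c, quad_c = compounding_bound(0.01, t_eff)
print(f"T_eff={t_eff}: linear={lin_c:.2f}  quadratic={quad_c:.2f}  total={lin_c + quad_c:.2f}")
```

Under that assumption the quadratic term shrinks by more than two orders of magnitude, which is the intuition behind reducing the effective horizon.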
Code: compounding error in 1D
The abstraction becomes visceral with a concrete example. Here is a minimal simulation: the "expert" traces a sine-wave trajectory, and a naive BC policy has been trained to imitate it with small Gaussian noise. Watch how a tiny per-step error ($\sigma = 0.02$) compounds into total failure over 200 steps.
naive_bc_compound.py
```python
import numpy as np

T = 200           # horizon
dt = 0.05         # timestep
noise_std = 0.02  # per-step action noise (1% of range)

# Expert trajectory: sine wave
expert_pos = np.sin(np.arange(T) * dt)
expert_vel = np.cos(np.arange(T) * dt) * dt  # actions = velocity

# Naive BC: predict expert action + small noise
bc_pos = np.zeros(T)
for t in range(T - 1):
    # Policy sees its OWN state (not expert's): this is the key
    action = expert_vel[t] + np.random.randn() * noise_std
    bc_pos[t + 1] = bc_pos[t] + action  # integrates from bc_pos, not expert_pos

# By step 200, |bc_pos - expert_pos| ≈ 0.02 * sqrt(200) ≈ 0.28
# But it's worse: the policy was trained on expert states, not its own.
# On unfamiliar states, errors are correlated, not random: they compound.
print(f"Final drift: {abs(bc_pos[-1] - expert_pos[-1]):.3f}")
print(f"Max drift: {np.max(np.abs(bc_pos - expert_pos)):.3f}")
```
The critical detail is the position update inside the loop: the policy integrates from bc_pos[t], not expert_pos[t]. At training time, the observation always came from the expert's trajectory. At test time, the policy sees its own drifted state, one it never trained on. The error is not just additive noise; it is systematic because the policy's predictions become increasingly unreliable on out-of-distribution states.
The control loop: where policy meets physics
A robot policy does not operate in isolation. It is one component in a real-time feedback loop with hard timing constraints:
| Component | Rate | Latency budget | What it does |
| --- | --- | --- | --- |
| Low-level servo | 1–10 kHz | < 1 ms | PD controller that tracks joint position/torque targets. Runs on the robot's internal controller. |
| Policy inference | 5–50 Hz | 20–200 ms | Neural network forward pass: observation → action. This is the bottleneck. |
| Vision encoder | 5–30 Hz | 10–100 ms | ResNet, ViT, or SigLIP encodes camera images into feature vectors. Often dominates inference time. |
| Action head | 5–50 Hz | 1–50 ms | Converts features to actions. MSE head: 1ms. Diffusion head: 10–50ms (iterative denoising). |
The total loop latency determines what tasks are feasible. A pick-and-place on a static table can tolerate 200ms policy latency (5 Hz control). Catching a thrown ball requires < 20ms (50+ Hz). Diffusion policies, which need 10–100 denoising steps, push toward the slow end; this is why DDIM acceleration, consistency distillation, and flow matching (which needs fewer steps) are active research areas.
Worked example: latency budget. Diffusion Policy on a single NVIDIA RTX 3090. Vision: 2 cameras × ResNet-18 = 6.4ms. Denoising: 10 DDIM steps × 2.1ms = 21ms. Total: ~28ms = 36 Hz. Fine for tabletop. With 100 DDPM steps (no DDIM): 210ms of denoising alone, or ~216ms with vision = 4.6 Hz, too slow for reactive tasks. With 4-step consistency distillation: ~15ms = 67 Hz, fast enough for dynamic manipulation.
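The budget is simple enough to tabulate. The sketch below reuses the per-component timings quoted in the worked example (6.4 ms for vision, 2.1 ms per denoising step) and assumes, for simplicity, that vision and denoising run strictly in sequence with no other overhead; the helper name `loop_rate` is hypothetical.

```python
def loop_rate(vision_ms, step_ms, n_steps):
    """Return (total latency in ms, achievable control rate in Hz)."""
    total_ms = vision_ms + step_ms * n_steps
    return total_ms, 1000.0 / total_ms

VISION_MS, STEP_MS = 6.4, 2.1  # timings quoted in the worked example

budgets = {}
for name, n_steps in [("DDIM, 10 steps", 10),
                      ("DDPM, 100 steps", 100),
                      ("consistency, 4 steps", 4)]:
    budgets[name] = loop_rate(VISION_MS, STEP_MS, n_steps)
    total_ms, hz = budgets[name]
    print(f"{name:22s} {total_ms:6.1f} ms  {hz:5.1f} Hz")
```

Measured timings on a given GPU and encoder would replace the two constants; the structure of the calculation stays the same.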
The observation-to-state ratio captures partial observability: from ~1M pixels the policy must infer ~1000 state dimensions. The camera is a massive lossy compressor. Modern vision encoders (SigLIP, DINOv2) trained on internet-scale data make this compression tractable.
Worked example: what the camera cannot see. A robot is picking up a mug. The scene camera shows the mug sitting on the table. From this single image, the policy must infer:
Observable: mug position (2D projected), mug orientation (partial — one viewpoint), whether the mug is right-side-up, the distance from the gripper (from known camera intrinsics).
Partially observable: mug height (foreshortened in 2D), handle orientation (occluded side unknown), whether there is liquid inside (invisible at most angles).
Unobservable: mug mass (requires contact), surface friction coefficient (requires sliding), whether the mug is glued to the table (requires force), internal temperature.
This is why proprioception + force sensing complement vision: they provide precisely the information the camera cannot. Joint torques reveal mass on contact. Wrist force/torque reveals friction during sliding. The policy must fuse these modalities to make informed decisions about grasp force and lift velocity.
Why this is not supervised learning
The non-stationarity point deserves emphasis. In standard supervised learning (image classification, language modeling), the data distribution is fixed: the model's predictions do not change the inputs it will see next. A cat classifier that misclassifies a dog does not somehow cause more dogs to appear.
In robot learning, a policy that drifts left at step 10 will see a different scene at step 11 than it would have seen if it had gone right. The policy's outputs change its future inputs. This creates a feedback loop between the model and its data distribution — a distribution shift that grows over time. This is the fundamental reason why per-step accuracy (the metric supervised learning optimizes) is a poor predictor of task success (the metric that matters). A 99% per-step accuracy policy can have a 0% task success rate, as the worked example above demonstrates.
The key distinction. Supervised learning minimizes $\mathbb{E}_{(o,a) \sim \mathcal{D}}[\ell(\pi_\theta(o), a)]$ — the loss on the training distribution. Task success requires the policy to perform well on $d_t^{\hat\pi}$ — its own induced state distribution. These two distributions diverge as soon as the policy makes a single mistake. Every modern technique in robot learning (DAgger, action chunking, RL fine-tuning, residual policies) is an attempt to close this gap.
Roadmap: a taxonomy of responses
Every major idea in the field attacks one or more of the five failure modes above:
| Problem | Response | Section |
| --- | --- | --- |
| Compounding error | Action chunking (predict $H$ steps at once) | 06 |
| Compounding error | DAgger (train on policy's own states) | 04 |
| Compounding error | RL fine-tuning (correct via reward signal) | 14 |
| Multimodality | Diffusion / flow / VQ action heads | 05, 08, 09 |
| Distribution shift | Temporal ensembling, relative actions | 02, 06 |
| Reality gap | Domain randomization, sim-to-real | 15–16 |
| Latency | DDIM, consistency distillation, flow matching | 08, 09 |
| All of the above | Scale (pre-trained VLAs amortize learning) | 10–13 |
The rest of this article walks through each response, starting from the foundations and building toward the full 2026 stack.
The key mental model. Robot learning is a stack with three layers. At the bottom: representations (how to encode observations and parameterize actions). In the middle: generative models (how to produce multimodal, temporally coherent action sequences). At the top: scale and adaptation (pre-trained backbones, RL fine-tuning, sim-to-real transfer). Each layer depends on the one below. A diffusion policy with the wrong action space fails. A VLA with the wrong action head fails. Understanding the stack from the bottom up is the only way to diagnose failures and design improvements.
Interactive: error growth curves
Legend: expert (zero error); DAgger, $O(\epsilon T)$; naive BC, $O(\epsilon T^2)$.
02 Spaces of action and observation
Choose the action space carelessly and no architecture will save you.
Action representations
The choice of action space is a structural commitment that propagates through the entire stack. Five common options:
| Representation | What the policy outputs | Use when |
| --- | --- | --- |
| Joint positions | Target $q_t \in \mathbb{R}^n$ for a position controller running at 500–1000Hz underneath | Bimanual tabletop, precise contact (ALOHA / ACT) |
| Joint velocities | Target $\dot q_t$ | Compliant control, when integration drift is acceptable |
| EE pose, abs. | $T \in SE(3)$ for the end-effector, solved by IK | Cross-embodiment, when the body shouldn't matter |
| EE pose, rel. | $\Delta T$ relative to current pose | UMI, Diffusion Policy; robust to recovery from drift |
| Torques | $\tau_t$ | Locomotion, rich contact, sim-trained policies |
The relative-EE space is quietly the most important shift of the last three years. A relative-pose policy that drifts can recover by issuing a corrective $\Delta T$; an absolute-pose policy that drifts is permanently confused. Relative actions also factor out the absolute pose of the demonstration, which means the same demo collected at any table works.
Forward and inverse kinematics
To understand why joint-space and task-space (end-effector) representations are fundamentally different, consider the simplest possible robot: a 2-link planar arm with link lengths $l_1, l_2$ and joint angles $\theta_1, \theta_2$.
Forward kinematics (FK) maps joint angles to end-effector position. It is always unique and always differentiable:
2-link planar FK
$$ x = l_1 \cos\theta_1 + l_2 \cos(\theta_1 + \theta_2), \qquad y = l_1 \sin\theta_1 + l_2 \sin(\theta_1 + \theta_2) $$
$l_1, l_2$ — link lengths (fixed constants of the robot)
$\theta_1$ — the shoulder angle, measured from the positive x-axis
$\theta_2$ — the elbow angle, measured relative to link 1 (not the world frame)
$(x, y)$ — the end-effector position in world coordinates
Worked example: FK. Let $l_1 = l_2 = 1.0$, $\theta_1 = 30\degree = \pi/6$, $\theta_2 = 45\degree = \pi/4$.
$x = 1.0 \cos(30\degree) + 1.0 \cos(30\degree + 45\degree) = 0.866 + \cos(75\degree) = 0.866 + 0.259 = 1.125$.
$y = 1.0 \sin(30\degree) + 1.0 \sin(75\degree) = 0.500 + 0.966 = 1.466$.
The end-effector is at $(1.125, 1.466)$. There is exactly one answer — FK is a deterministic function.
Inverse kinematics (IK) goes the other direction: given a desired end-effector position $(x, y)$, find joint angles $(\theta_1, \theta_2)$. This is harder for three reasons:
Multiple solutions. For the 2-link arm, there are generically two configurations that reach the same point — elbow-up and elbow-down. For a 7-DOF arm, there is a continuous manifold of solutions (a 1D null space).
No solution. If $(x, y)$ is beyond reach ($\sqrt{x^2 + y^2} > l_1 + l_2$), no joint angles work.
Singularities. When the arm is fully extended ($\theta_2 = 0$), the Jacobian loses rank and small task-space motions require infinite joint velocities.
This is why policies that output joint positions avoid the IK problem entirely — the FK is computed only for monitoring, not for control. Policies that output end-effector poses must solve IK at every step, introducing a numerical solver into the control loop.
forward_kinematics.py
```python
import numpy as np

def fk_2link(theta1, theta2, l1=1.0, l2=1.0):
    """Forward kinematics for a 2-link planar arm."""
    x = l1 * np.cos(theta1) + l2 * np.cos(theta1 + theta2)
    y = l1 * np.sin(theta1) + l2 * np.sin(theta1 + theta2)
    return x, y

# Verify worked example
x, y = fk_2link(np.radians(30), np.radians(45))
print(f"EE position: ({x:.3f}, {y:.3f})")  # (1.125, 1.466)

# The Jacobian: how EE velocity relates to joint velocity
def jacobian_2link(theta1, theta2, l1=1.0, l2=1.0):
    """2x2 Jacobian: [dx/dθ1, dx/dθ2; dy/dθ1, dy/dθ2]"""
    s1, c1 = np.sin(theta1), np.cos(theta1)
    s12, c12 = np.sin(theta1 + theta2), np.cos(theta1 + theta2)
    return np.array([
        [-l1*s1 - l2*s12, -l2*s12],
        [ l1*c1 + l2*c12,  l2*c12]
    ])
```
Worked example: IK ambiguity. Target: $(x, y) = (1.0, 0.5)$ with $l_1 = l_2 = 1.0$. Using the law of cosines:
$$\cos\theta_2 = \frac{x^2 + y^2 - l_1^2 - l_2^2}{2 l_1 l_2} = \frac{1.0 + 0.25 - 1 - 1}{2} = -0.375$$
This gives $\theta_2 = \pm \arccos(-0.375) = \pm 112.0\degree$. Two solutions — elbow-up and elbow-down — both reaching the same point.
For each $\theta_2$, $\theta_1$ follows from $\theta_1 = \text{atan2}(y, x) - \text{atan2}(l_2 \sin\theta_2, l_1 + l_2 \cos\theta_2)$.
If $(x,y) = (2.5, 0)$, then $\cos\theta_2 = (6.25 - 2)/2 = 2.125 > 1$. No solution — the target is out of reach. The policy would need to output an infeasible action, and the IK solver would fail or clip. This failure mode simply does not exist in joint-space policies.
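The closed-form solution above translates directly into code. The following is a minimal sketch for the 2-link planar arm only; the function name `ik_2link` is hypothetical, and the forward-kinematics helper is repeated so the block runs on its own. It returns both elbow configurations, or an empty list when the target is out of reach.

```python
import numpy as np

def fk_2link(theta1, theta2, l1=1.0, l2=1.0):
    """Forward kinematics, used here only to check the IK answers."""
    x = l1 * np.cos(theta1) + l2 * np.cos(theta1 + theta2)
    y = l1 * np.sin(theta1) + l2 * np.sin(theta1 + theta2)
    return x, y

def ik_2link(x, y, l1=1.0, l2=1.0):
    """Return both (theta1, theta2) solutions, or [] if unreachable."""
    cos_t2 = (x**2 + y**2 - l1**2 - l2**2) / (2 * l1 * l2)
    if abs(cos_t2) > 1.0:
        return []  # target outside the reachable annulus
    solutions = []
    for theta2 in (np.arccos(cos_t2), -np.arccos(cos_t2)):
        theta1 = np.arctan2(y, x) - np.arctan2(
            l2 * np.sin(theta2), l1 + l2 * np.cos(theta2))
        solutions.append((theta1, theta2))
    return solutions

solutions = ik_2link(1.0, 0.5)
for theta1, theta2 in solutions:
    px, py = fk_2link(theta1, theta2)
    print(f"theta1={np.degrees(theta1):7.2f} deg, "
          f"theta2={np.degrees(theta2):7.2f} deg -> ({px:.3f}, {py:.3f})")

unreachable = ik_2link(2.5, 0.0)
print("out of reach:", unreachable)
```

Both printed configurations map back to the same target under forward kinematics, which is the ambiguity a task-space policy hands to its IK solver at every step.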
Rotation parameterization
Never regress to Euler angles or raw quaternions. Both have discontinuities or double-cover problems that confuse gradients. The accepted choice is the 6D continuous representation from Zhou et al. — predict the first two columns of the rotation matrix and Gram–Schmidt them. It has no discontinuities and trains cleanly.
Why Euler angles fail: gimbal lock
Euler angles parameterize rotation as three sequential rotations: $R = R_z(\psi) \, R_y(\theta) \, R_x(\phi)$ (yaw-pitch-roll). When the pitch angle $\theta = \pm 90\degree$, the yaw and roll axes align — two of the three degrees of freedom collapse into one. Formally, the derivative $\partial R / \partial \psi$ and $\partial R / \partial \phi$ become linearly dependent at $\theta = \pm 90\degree$, so the Jacobian from Euler angles to the rotation matrix drops rank from 3 to 2. This means:
A neural network predicting Euler angles near $\theta \approx 90\degree$ faces a discontinuity in the mapping: a tiny change in the target rotation requires a huge jump in the predicted angles.
The gradient signal becomes degenerate — the network cannot tell which direction to move.
This is not a rare edge case. A robot reaching forward and down regularly passes through pitch = $\pm 90\degree$.
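The collapse of two degrees of freedom into one can be seen numerically. The sketch below builds $R = R_z(\psi) R_y(\theta) R_x(\phi)$ with NumPy and shows that, at a pitch of $+90\degree$, adding the same offset to yaw and roll leaves the rotation unchanged, while away from that pitch it does not. The function name and the specific angles are illustrative assumptions.

```python
import numpy as np

def euler_zyx_to_matrix(yaw, pitch, roll):
    """R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

yaw, roll, delta = 0.2, 0.4, 0.3

# At pitch = +90 degrees, R depends only on (roll - yaw)
lock_a = euler_zyx_to_matrix(yaw, np.pi / 2, roll)
lock_b = euler_zyx_to_matrix(yaw + delta, np.pi / 2, roll + delta)
same_at_lock = np.allclose(lock_a, lock_b)

# Away from gimbal lock, the same offset gives a different rotation
free_a = euler_zyx_to_matrix(yaw, 0.3, roll)
free_b = euler_zyx_to_matrix(yaw + delta, 0.3, roll + delta)
same_away = np.allclose(free_a, free_b)

print("identical at pitch = 90 deg:", same_at_lock)
print("identical at pitch = 0.3 rad:", same_away)
```

Two different angle triples producing one rotation is exactly the situation in which a regression target becomes ambiguous.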
Gram-Schmidt orthogonalization: step by step
The 6D representation predicts a raw $3 \times 2$ matrix (6 numbers) and converts it to a valid rotation matrix $R \in SO(3)$ via Gram-Schmidt. The steps are:
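Normalize column 1: $\hat{c}_1 = a_1 / \|a_1\|$, where $a_1, a_2 \in \mathbb{R}^3$ are the two raw predicted columns
Orthogonalize and normalize column 2: $\hat{c}_2 = \dfrac{a_2 - (\hat{c}_1 \cdot a_2)\,\hat{c}_1}{\|a_2 - (\hat{c}_1 \cdot a_2)\,\hat{c}_1\|}$
Cross-product for column 3: $\hat{c}_3 = \hat{c}_1 \times \hat{c}_2$
The resulting $R = [\hat{c}_1 \mid \hat{c}_2 \mid \hat{c}_3]$ is guaranteed to be a valid rotation matrix ($R^\top R = I$, $\det R = +1$), and the mapping from 6D input to $SO(3)$ is continuous wherever the two raw columns are nonzero and not parallel. This continuity is why gradient-based learning works cleanly.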
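A NumPy sketch of the Gram-Schmidt conversion follows. It assumes the six predicted numbers are laid out as the first raw column followed by the second (a layout choice made here for illustration; implementations differ), and the function name `rotation_6d_to_matrix` is hypothetical. A training pipeline would use the autodiff equivalent of the same operations.

```python
import numpy as np

def rotation_6d_to_matrix(a):
    """Map a 6-vector to a rotation matrix via Gram-Schmidt.

    Assumed layout: a[:3] is raw column 1, a[3:] is raw column 2.
    Undefined if either column is zero or the two are parallel.
    """
    a1, a2 = np.asarray(a[:3], dtype=float), np.asarray(a[3:], dtype=float)
    c1 = a1 / np.linalg.norm(a1)           # normalize column 1
    u2 = a2 - np.dot(c1, a2) * c1          # remove the component along c1
    c2 = u2 / np.linalg.norm(u2)           # normalize column 2
    c3 = np.cross(c1, c2)                  # right-handed third column
    return np.stack([c1, c2, c3], axis=1)  # columns are c1, c2, c3

rng = np.random.default_rng(0)
R = rotation_6d_to_matrix(rng.normal(size=6))

orthonormal = np.allclose(R.T @ R, np.eye(3))
det = np.linalg.det(R)
print("orthonormal:", orthonormal, " det:", round(float(det), 6))
```

Any non-degenerate 6-vector lands on a valid rotation, so the network output never needs to be projected or clipped before use.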
Why quaternions fail for regression. Quaternions have a double-cover problem: $q$ and $-q$ represent the same rotation. If two demonstrations of the same motion produce quaternions $q$ and $-q$ (equally valid), their average is $\mathbf{0}$ — which is not a valid rotation. Euler angles have gimbal lock: at $\theta = \pm 90\degree$ pitch, yaw and roll become degenerate, creating discontinuities in the action space. The 6D representation avoids both: it predicts a $3 \times 2$ matrix (6 numbers), Gram-Schmidt orthogonalizes it to get a valid rotation matrix, and the mapping is continuous everywhere. The cost is 6 output dimensions instead of 3 (Euler) or 4 (quaternion), but continuity matters far more than compactness for gradient-based learning.
The Gram-Schmidt map is differentiable wherever the two raw columns are linearly independent (which holds almost everywhere), so gradients flow cleanly from the action loss back through the rotation representation to the network weights. Many modern manipulation policies that output end-effector rotations, Diffusion Policy among them, use this or an equivalent continuous representation.
Observations
The modern observation tuple is some subset of:
RGB images from one or more cameras — fixed scene cams, wrist cams, fisheye on a handheld stick. Wrist cams are extraordinarily helpful for fine manipulation; one wrist camera often beats two scene cameras.
Force / torque — six-axis F/T sensors at the wrist. Crucial for contact-rich tasks; usually ignored because data is hard to collect.
Tactile — DIGIT, GelSight, pressure arrays. Promising; not yet load-bearing in flagship policies.
Language — instruction strings, encoded by a frozen language encoder (T5, CLIP text, an LLM).
Goal images — used by Octo and others to specify task without language.
Worked example: observation tensor shapes for a Diffusion Policy. A typical setup with 2 cameras and proprioception:
Scene camera: 480×640×3 RGB → ResNet-18 → 512-dim feature vector. With 2-step history: (2, 512).
Wrist camera: 480×640×3 RGB → ResNet-18 → 512-dim feature vector. With 2-step history: (2, 512).
Proprioception: 7 joint positions + 7 joint velocities + 1 gripper width = 15-dim. With 2-step history: (2, 15).
Concatenated observation token: (2, 512 + 512 + 15) = (2, 1039), flattened to 2078-dim, or kept as a sequence of 2+2+2 = 6 tokens (one per modality per timestep) for a transformer.
Action output: H = 16 steps × 7-dim (6 DoF EE + gripper) = 16×7 = 112 total values. In a diffusion policy, this is the shape of the noise and the denoised output.
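Memory footprint per sample: each timestep holds 2 cameras × (480×640×3) bytes ≈ 1.84 MB (1.76 MiB) of raw uint8 images, so the 2-step history is about 3.7 MB. After encoding: 2×1039×4 bytes = 8.3 KB of float32 features. The vision encoder dominates compute; the policy itself is cheap.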
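The shape bookkeeping above is easy to get wrong by one factor of two, so it helps to write it as arithmetic. This is a sketch under the worked example's assumptions (2 cameras, 2-step history, 512-dim pooled ResNet-18 features, uint8 images, float32 features); the constant names are illustrative only.

```python
N_CAMERAS = 2
HISTORY = 2                      # observation timesteps fed to the policy
IMG_H, IMG_W, IMG_C = 480, 640, 3
FEAT_DIM = 512                   # pooled ResNet-18 feature size per image
PROPRIO_DIM = 7 + 7 + 1          # joint positions + joint velocities + gripper width
HORIZON, ACT_DIM = 16, 7         # action chunk length x per-step action size

per_step_dim = N_CAMERAS * FEAT_DIM + PROPRIO_DIM        # features for one timestep
flat_obs_dim = HISTORY * per_step_dim                    # flattened policy input
action_values = HORIZON * ACT_DIM                        # size of the denoised output

raw_bytes_per_step = N_CAMERAS * IMG_H * IMG_W * IMG_C   # uint8: 1 byte per channel
raw_bytes_total = HISTORY * raw_bytes_per_step
feature_bytes = flat_obs_dim * 4                         # float32: 4 bytes per value

print(f"per-step feature dim : {per_step_dim}")
print(f"flattened obs dim    : {flat_obs_dim}")
print(f"action chunk values  : {action_values}")
print(f"raw image bytes      : {raw_bytes_total:,} ({raw_bytes_per_step:,} per step)")
print(f"encoded feature bytes: {feature_bytes:,}")
```

The ratio between raw image bytes and encoded feature bytes is the reason the encoder, not the policy head, sets the compute budget.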
The observation stack that works
For most manipulation tasks in 2026, the winning observation is:
One wrist camera (the most informative single sensor for contact-phase manipulation).
One scene camera (for spatial context and reaching).
Joint positions + gripper width (proprioception).
2-step observation history (captures short-term dynamics without causal confusion).
A language instruction or goal image (for multi-task policies).
Force/torque is the next modality to add when it's available; it consistently helps on contact-rich tasks. Tactile is promising but not yet load-bearing in flagship systems.
Why wrist cameras matter so much. A scene camera 1 meter above the table sees the workspace at ~2mm per pixel at the tabletop. But during the final 5cm of a grasp approach, the fingers occlude the object from above. The wrist camera, mounted at the wrist, sees the grasp point from 10–15cm away at ~0.3mm per pixel: roughly 6 to 7× finer resolution on the exact region that matters. This is why wrist cameras often improve success rate by 15–25% on fine manipulation tasks (inserting a USB plug, threading a needle, closing a zipper). The cost: the wrist camera image changes rapidly as the arm moves, making temporal consistency harder.
Observation normalization. Always normalize observations before feeding them to the policy. Joint positions, joint velocities, and end-effector poses live on different scales. Standard practice: compute per-dimension mean and standard deviation from the training data, then normalize to zero mean and unit variance. Image observations are normalized by the vision encoder (ImageNet stats for ResNet, or the encoder's own normalization). Failing to normalize is a common source of "the policy does nothing" bugs — one dimension with 100× larger scale dominates the gradient.
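A minimal sketch of the per-dimension normalization described above, in plain Python so the arithmetic is visible. The function names are hypothetical, and the eps floor is an assumption added here: it guards against dimensions that never change in the training data (a gripper that stays open, say), where the standard deviation would otherwise be zero.

```python
import math


def fit_normalizer(rows, eps=1e-6):
    """Per-dimension mean and std over a list of equal-length vectors."""
    n, dims = len(rows), len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dims)]
    std = [math.sqrt(sum((r[d] - mean[d]) ** 2 for r in rows) / n)
           for d in range(dims)]
    std = [max(s, eps) for s in std]      # avoid dividing by zero on constant dims
    return mean, std


def normalize(row, mean, std):
    return [(x - m) / s for x, m, s in zip(row, mean, std)]


def unnormalize(row, mean, std):
    return [x * s + m for x, m, s in zip(row, mean, std)]


# Toy data: dim 0 in radians, dim 1 on a 1000x larger scale, dim 2 constant.
train = [[0.1, 100.0, 0.04], [0.3, 300.0, 0.04], [0.2, 200.0, 0.04]]
mean, std = fit_normalizer(train)
normed = [normalize(r, mean, std) for r in train]
print(normed[0])
```

Fit the statistics once on the training set, store them with the checkpoint, and apply the same numbers at deployment; recomputing them on live data silently shifts the policy's inputs.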
03 Three paradigms
Imitation, reinforcement, and the rapidly growing hybrid in between.
Every modern robot policy lives somewhere on a triangle whose vertices are imitation learning, reinforcement learning, and model-based / world-model methods. The dominant practical paradigm in 2026 is imitation, often pre-trained at scale, sometimes fine-tuned with RL. The pure-RL vertex remains alive in locomotion, dexterous in-hand manipulation, and any setting where a simulator is faithful enough to train in.
/01
Imitation
Behavior cloning · supervised
Match demonstration actions, conditioned on observations. Cheap to start, expensive to scale (data is the bottleneck). The default for manipulation.
/02
Reinforcement
Reward-driven · trial & error
Maximize expected return. Powerful when the simulator is good or the real-world reward is dense. Brittle when reward shaping is wrong, sample-hungry when it isn't.
/03
Model-based
Imagined rollouts · planning
Learn a dynamics model, then plan or learn inside it. Dreamer, MuZero. Sample-efficient; brittle under distribution shift in the model.
The triangle has interesting interior points. Offline RL trains a value-aware policy on demonstration data without further interaction — useful when you have demos but no simulator. Residual RL trains an RL correction on top of a frozen BC base. HIL-SERL blends online RL with human interventions for sample-efficient real-world learning. The big new entrant — VLA fine-tuning with RL — uses a pre-trained vision-language-action backbone and a small amount of task RL to specialize.
The interesting question is no longer "BC or RL?" but "where in the pipeline does each one belong?" The 2026 stack uses BC for the prior, RL for the polish, and a world model for the dream loop when you can afford it.
Derivation: BC as maximum likelihood
Behavior cloning looks like "just regress to actions," but it is maximum likelihood estimation in disguise. Understanding this connection reveals why MSE is the natural loss and what assumptions it bakes in.
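Setup. We have a dataset of expert demonstrations $\mathcal{D} = \{(o_i, a_i)\}_{i=1}^{N}$. We model the expert's action distribution as a Gaussian conditioned on the observation, with mean $f_\theta(o)$ predicted by the network and a fixed isotropic variance $\sigma^2$:
$$ \pi_\theta(a \mid o) = \mathcal{N}\big(a;\; f_\theta(o),\; \sigma^2 I\big) $$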
In plain English: We're assuming the expert's action at any given observation is "the right action, plus some random wobble." The neural network predicts the right action; the wobble is a fixed-width bell curve around it. This is the simplest possible model of expert behavior — one answer per situation, with noise. It's also the model that makes MSE loss "the correct thing to do" (as we'll derive below). The hidden cost: it assumes there's only ONE right answer per observation. When there are two valid ways to do something, this model averages them.
In code: loss = F.mse_loss(policy(obs), expert_action). That is the entire BC training loss. The $1/(2\sigma^2)$ constant disappears because it doesn't affect which $\theta$ minimizes the loss (a constant scale only changes the effective learning rate). In your training loop, this is the number you watch. If it plateaus above ~0.01 (normalized action space), the policy isn't learning fine motions. If it drops below 0.001, check for overfitting by evaluating on held-out demonstrations.
Taking the negative log-likelihood of the dataset under a Gaussian with mean $f_\theta(o)$ and fixed variance $\sigma^2$, and collecting the terms that do not depend on $\theta$ into a constant:
$$ -\sum_{i=1}^{N} \log \pi_\theta(a_i \mid o_i) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \| a_i - f_\theta(o_i) \|^2 + \text{const} $$
This is MSE loss, scaled by a constant. The derivation reveals the hidden assumption: the expert's actions are unimodal Gaussian around a single mean. When this assumption fails (when the distribution is bimodal, skewed, or heavy-tailed), MSE is the wrong loss, and you need the expressive action heads of section 05.
Worked example: BC as MLE. Policy predicts $f_\theta(o) = [0.32, -0.15]$. Expert: $a = [0.30, -0.12]$. With $\sigma = 0.05$, squared error = $0.02^2 + 0.03^2 = 0.0013$. NLL = $0.0013/(2 \times 0.0025) + \text{const} = 0.26 + \text{const}$. Since the constant does not depend on $\theta$, minimizing NLL = minimizing MSE. This is why BC uses F.mse_loss rather than explicit Gaussian NLL — same optimization problem.
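The numbers in this worked example can be checked in a few lines. The sketch below assumes the fixed-$\sigma$ isotropic Gaussian model; gaussian_nll is a hypothetical helper, not a PyTorch function. It also shows the practical point: the NLL gap between two candidate predictions equals their squared-error gap divided by $2\sigma^2$, so both losses rank predictions identically.

```python
import math


def gaussian_nll(pred, target, sigma):
    """NLL of `target` under N(pred, sigma^2 I). Returns (nll, squared_error)."""
    d = len(pred)
    sq_err = sum((p - t) ** 2 for p, t in zip(pred, target))
    const = d * (math.log(sigma) + 0.5 * math.log(2 * math.pi))  # theta-independent
    return sq_err / (2 * sigma ** 2) + const, sq_err


SIGMA = 0.05
expert = [0.30, -0.12]

nll_a, sq_a = gaussian_nll([0.32, -0.15], expert, SIGMA)   # the worked example
nll_b, sq_b = gaussian_nll([0.31, -0.13], expert, SIGMA)   # a slightly better guess

data_term = sq_a / (2 * SIGMA ** 2)
print(f"squared error = {sq_a:.4f}, data term = {data_term:.2f}")
print(f"NLL gap = {nll_a - nll_b:.4f}, scaled MSE gap = {(sq_a - sq_b) / (2 * SIGMA ** 2):.4f}")
```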
The Gaussian assumption determines uncertainty handling. A learned $\sigma$ inflates variance in ambiguous situations — useful for RL exploration but harmful in BC (sampled actions become noisy). Clamping $\sigma$ to a small constant is standard for BC; learning $\sigma$ is for RL (SAC, PPO).
Derivation: the Bellman equation
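Reinforcement learning is built on a single recursive identity. The value of being in state $s$ under policy $\pi$ is the reward you get now plus the discounted value of where you end up:
$$ V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \Big[ r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} \big[ V^\pi(s') \big] \Big] $$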
In plain English: "How good is it to be here?" equals "what I get right now" plus "how good is where I'll end up, slightly discounted because the future is less certain." It's a recursive definition: the value of THIS state depends on the value of the NEXT state. Think of it like asking "how much is this house worth?" — it's the rent you collect this month, plus the discounted value of all future rent. The Bellman equation is the backbone of every RL algorithm ever written.
$V^\pi(s)$ — the value function: expected cumulative discounted reward starting from state $s$ and following policy $\pi$ forever
$r(s, a)$ — the immediate reward for taking action $a$ in state $s$
$\gamma \in [0, 1)$ — the discount factor: how much we value future reward relative to present reward. $\gamma = 0.99$ means reward 100 steps away is worth $0.99^{100} \approx 0.37$ of today's reward
$p(s' \mid s, a)$ — the transition dynamics: the probability of landing in state $s'$ after taking action $a$ in state $s$
The outer expectation is over the policy's action choice; the inner expectation is over the stochastic dynamics
In code: target = reward + gamma * V_next (with V_next set to 0 at terminal states). That is ONE line in every RL codebase. The TD (temporal difference) update computes the target value from the current reward and the estimated next-state value, then updates the value network to match: loss = F.mse_loss(V(state), target.detach()). The .detach() is critical: it treats the target as a fixed label, so gradients flow only through V(state). Without it, the optimizer also pulls the target toward the prediction, and training tends to diverge. This is bootstrapping: you're using your own (approximate) estimate of $V$ to improve $V$.
The Bellman equation says: the value of a state is self-consistent. If you know $V^\pi$ for all next states, you can compute $V^\pi$ for the current state. This recursive structure is the foundation of every RL algorithm — policy gradient methods estimate $V^\pi$ to compute advantages, and value-based methods iterate the Bellman equation directly to convergence.
Worked example: Bellman computation. A robot arm has three states: $s_1$ (reaching), $s_2$ (grasping), $s_3$ (lifted — terminal). The policy always advances. Rewards: $r(s_1) = 0$, $r(s_2) = +1$. Discount $\gamma = 0.9$. Transitions are deterministic.
Working backward: $V^\pi(s_3) = 0$ (episode ends). $V^\pi(s_2) = 1 + 0.9 \times 0 = 1.0$. $V^\pi(s_1) = 0 + 0.9 \times 1.0 = 0.9$.
The reaching state gets no immediate reward, but the discounted future reward from the upcoming grasp is worth 0.9. Add a 4th state before reaching: its value is $0.9^2 = 0.81$. The discount factor creates a pressure to succeed sooner.
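The backward computation above is what repeated Bellman backups converge to. A small sketch, assuming the deterministic three-state chain and the "always advance" policy from the example (the dictionary layout and the function name evaluate_policy are illustrative choices):

```python
GAMMA = 0.9

# Deterministic chain under the "always advance" policy: s1 -> s2 -> s3 (terminal).
next_state = {"s1": "s2", "s2": "s3"}
reward = {"s1": 0.0, "s2": 1.0}


def evaluate_policy(n_sweeps=50):
    """Iterate V(s) <- r(s) + gamma * V(s') until the values stop changing."""
    V = {"s1": 0.0, "s2": 0.0, "s3": 0.0}      # terminal value stays at 0
    for _ in range(n_sweeps):
        for s in ("s1", "s2"):
            V[s] = reward[s] + GAMMA * V[next_state[s]]
    return V


V = evaluate_policy()
print(V)
```

Starting from all zeros, the reward at s2 reaches s1 on the second sweep. In a long chain each sweep moves the signal back only one state in this update order, which is the tabular version of the credit-assignment problem.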
The Q-function extends the value function to state-action pairs: $Q^\pi(s, a) = r(s,a) + \gamma \, \mathbb{E}_{s'}[V^\pi(s')]$. The optimal policy simply picks $a^* = \arg\max_a Q^*(s, a)$. Deep RL algorithms like SAC (used in HIL-SERL) approximate $Q^*$ with a neural network and derive the policy from it.
The Bellman equation also reveals RL's core computational challenge: credit assignment. In our mug-picking example, the robot receives reward +1 only when the mug is lifted. But the action that caused success was the approach trajectory 30 steps earlier. The Bellman recursion propagates this +1 reward backward through time via the discount factor: the approaching state gets value $0.9^{30} \approx 0.04$ — a faint signal that the network must detect amidst noise. This is why sparse rewards make RL hard (the signal is too faint) and why reward shaping (adding intermediate rewards) helps but introduces its own problems (the policy may exploit the shaped reward without solving the actual task).
Advantage function
$$ A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) $$
$A^\pi(s, a)$ — the advantage: how much better action $a$ is compared to the average action the policy would take in state $s$. Positive advantage = better than average; negative = worse.
Policy gradient methods (PPO, section 14) use the advantage to update the policy: increase the probability of actions with positive advantage, decrease those with negative advantage.
Worked example: advantage computation. The robot is in state $s_2$ (about to grasp). Under the current policy, $V^\pi(s_2) = 0.7$ (the average outcome from this state). The policy considers two actions:
Grasp firmly: $Q^\pi(s_2, \text{firm}) = 0.95$ (almost always succeeds). Advantage: $0.95 - 0.7 = +0.25$.
Grasp loosely: $Q^\pi(s_2, \text{loose}) = 0.3$ (usually drops the object). Advantage: $0.3 - 0.7 = -0.4$.
PPO will increase the probability of "grasp firmly" (positive advantage) and decrease "grasp loosely" (negative advantage). The magnitude of the update is proportional to the advantage: the loose grasp gets a larger negative update because it is further below average.
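One consistency check worth knowing: the advantage averages to zero under the policy itself. The sketch below assumes "firm" and "loose" are the only two actions available in $s_2$, which is an assumption added here; under it, $V^\pi(s_2) = 0.7$ pins down the policy's action probabilities.

```python
Q = {"firm": 0.95, "loose": 0.30}
V = 0.70

# A(s, a) = Q(s, a) - V(s)
advantage = {a: q - V for a, q in Q.items()}

# With only two actions, V = p * Q_firm + (1 - p) * Q_loose fixes p.
p_firm = (V - Q["loose"]) / (Q["firm"] - Q["loose"])
pi = {"firm": p_firm, "loose": 1.0 - p_firm}

expected_advantage = sum(pi[a] * advantage[a] for a in Q)

print(f"advantages: {advantage}")
print(f"implied policy: firm={pi['firm']:.3f}, loose={pi['loose']:.3f}")
print(f"expected advantage under pi: {expected_advantage:.2e}")
```

The current policy already prefers the firm grasp about 62% of the time; the policy-gradient update shifts more probability toward it because its advantage is positive.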
The hybrid spectrum
No modern system sits at a pure vertex. The practical spectrum includes:
Pure BC — ACT, Diffusion Policy with only demo data. Fast to train, limited by data quality.
BC + RL fine-tuning: the "pre-train then adapt" recipe, for example a VLA such as $\pi_0$ fine-tuned with RL. BC provides a strong prior; RL corrects its distribution shift.
Offline RL — run RL algorithms on a fixed dataset without further interaction. IQL fits here. Useful when you have demos but no simulator.
Residual RL — freeze a BC base policy, train an RL "residual" $a = \pi_{\text{BC}}(o) + \pi_{\text{res}}(o)$. BC handles coarse motion; RL learns fine adjustments.
Human-in-the-loop RL — HIL-SERL. A human intervenes during RL rollouts, dramatically improving sample efficiency.
World model + planning — Dreamer, UniSim. Learn dynamics from data, plan inside the model. Most sample-efficient; model errors compound during planning.
The practical takeaway: start with BC, upgrade as needed. For a new task, collect 50–100 demonstrations and train a Diffusion Policy. If the success rate plateaus below target, the bottleneck determines the next step. If the policy fails on novel object positions (distribution shift), try RL fine-tuning or more diverse demonstrations. If the demonstrations are multimodal and the policy hesitates or blends strategies (inconsistent actions), check that the action head is expressive enough. If the policy succeeds in simulation but fails on the real robot, the problem is the sim-to-real gap.
The 2026 recipe in one sentence. Pre-train a VLA on internet-scale data (BC over millions of trajectories from many robots), fine-tune on your task with 50–200 demonstrations (task-specific BC), and optionally polish with 1–2 hours of RL in sim or real. Each stage addresses a different bottleneck: the VLA provides visual understanding and language grounding; task-specific BC provides the motor skill; RL provides the distribution-shift correction that pushes success rate from 80% to 95%.
Paradigm comparison
| Property | Behavior Cloning | Reinforcement Learning | Model-Based |
|---|---|---|---|
| Data | Expert demos (expensive per sample, cheap per step) | Self-generated rollouts (free per sample, slow per step) | Any experience (demos or rollouts) |
| Compute | Low: standard supervised learning | High: millions of env steps in sim, or slow real-world rollouts | Medium: model learning + planning |
| Sim required? | No: train on real data | Usually yes: real-world RL is too sample-hungry without one | Helpful but not required |
| Failure mode | Compounding error, multimodal averaging, distribution shift | Reward hacking, sparse reward starvation, sim-to-real gap | Model error compounds in planning horizon |
| Best for | Manipulation with teleop data, rapid prototyping, pre-training VLAs | Locomotion, dexterous manipulation, fine-tuning after BC | Sample-efficient exploration, when dynamics are learnable |
| Ceiling | Limited by demonstrator skill; cannot exceed expert | Unbounded in principle; can discover superhuman strategies | Limited by model accuracy; compounding model error is the analog of compounding BC error |
BC's hidden ceiling. Behavior cloning can at best reproduce the expert. If the demonstrator's success rate is 85%, the BC policy's ceiling is 85% (and in practice lower, due to compounding error). RL has no such ceiling — it can discover strategies the expert never tried. The 2026 recipe: BC to get to 80% success, then RL to push to 95%. The BC phase is fast (hours of data, minutes of training); the RL phase is slow but goes beyond the human demonstrator.
Why model-based methods are sample-efficient but fragile. A model-based agent that learns dynamics $\hat{p}(s' \mid s, a)$ can generate imagined rollouts for free. If the model is accurate for $k$ steps, the agent effectively multiplies its data by $k$. But model error compounds: if each step has error $\delta$, a $k$-step imagined rollout accumulates $O(\delta k^2)$ error — the same quadratic compounding as naive BC, but in the model's predictions rather than the policy's actions. This is why Dreamer uses short imagined rollouts (15 steps) and why world models work best for short-horizon tasks.
The simplest behavior cloning recipe: collect demonstrations $\{(o_i, a_i)\}$, train a network $\pi_\theta(o)$ to minimize $\sum_i \| \pi_\theta(o_i) - a_i \|^2$, deploy. It works on toy problems and fails on real ones for three reasons that compound.
Compounding error
Train-time observations come from the expert. Test-time observations come from the policy. Even if the policy is $\epsilon$-accurate per step, after $T$ steps it has wandered $O(\epsilon T)$ from the demonstration distribution; the loss bound on the expected number of mistakes is $O(\epsilon T^2)$. This is Ross & Bagnell, 2010. It means a 99%-accurate per-step BC policy fails reliably on a 200-step task.
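To see why "99% accurate per step" is not reassuring, compute the chance of a rollout with no mistakes at all. This back-of-the-envelope sketch assumes per-step errors are independent and that any single mistake is unrecoverable; both are simplifications made here for illustration, not part of the Ross & Bagnell bound. The function names are hypothetical.

```python
import math


def prob_clean_rollout(eps, horizon):
    """Chance of zero mistakes in `horizon` steps if each step errs independently."""
    return (1.0 - eps) ** horizon


def steps_until_coin_flip(eps):
    """Smallest horizon at which a clean rollout is no more likely than not."""
    return math.ceil(math.log(0.5) / math.log(1.0 - eps))


p_200 = prob_clean_rollout(0.01, 200)
half_life = steps_until_coin_flip(0.01)

print(f"P(no mistake in 200 steps) = {p_200:.3f}")
print(f"clean-rollout odds fall below 50% at step {half_life}")
```

Under these assumptions the policy completes a 200-step task without error only about 13% of the time, and the real situation is worse because early mistakes push later observations off-distribution.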
Multimodality
The expert distribution $p(a \mid o)$ is often multimodal. Mean-squared regression converges to the conditional mean, which can be a bad action: imagine demonstrations split between going around an obstacle on the left and on the right; the mean goes through the obstacle.
Figure legend: valid demo modes; mean of modes (invalid); goal.
In plain English: If you ask a neural network "minimize the squared error against ALL the demonstrations," the mathematically optimal answer is the AVERAGE of all demonstrated actions for each observation. The network isn't being dumb — it's doing exactly what you asked. The problem is that the average of two valid actions can be an invalid action, the same way the average of "turn left" and "turn right" is "go straight into the wall."
MSE collapses to the mean
$$ f^* = \arg\min_{f} \, \mathbb{E}_{o,a}\big[\|f(o) - a\|^2\big] \quad \Longrightarrow \quad f^*(o) = \mathbb{E}[a \mid o] $$
What this means for your system: If you are using F.mse_loss and your task has multiple valid strategies (it almost always does in manipulation), your policy WILL average them. This is not a training bug — it is the mathematically guaranteed behavior of MSE. The fix is not more data or longer training; the fix is a different output head. You MUST use a multimodal head: GMM (torch.distributions.MixtureSameFamily), diffusion (predict noise, denoise iteratively), flow matching (predict velocity field), or discretized actions (F.cross_entropy over bins). Every architecture from section 05 onward exists because of this one equation.
Two valid trajectories average to one invalid trajectory. The squared-error minimizer is the conditional mean, and the mean of two perfectly-good modes can sit exactly where neither one ever went. The remedy is not better optimization — it is an output distribution that can represent a bimodal answer.
Derivation: why MSE yields the conditional mean
Claim. For any loss of the form $\mathcal{L}(f) = \mathbb{E}_{o,a}[\|f(o) - a\|^2]$, the minimizer is $f^*(o) = \mathbb{E}[a \mid o]$.
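Proof. Fix an observation $o$ and write $\hat{a} = f(o)$. The inner optimization is:
$$ \min_{\hat{a}} \; \mathbb{E}_{a \sim p(\cdot \mid o)}\big[\|\hat{a} - a\|^2\big] $$
Setting the gradient with respect to $\hat{a}$ to zero:
$$ \nabla_{\hat{a}} \, \mathbb{E}\big[\|\hat{a} - a\|^2 \mid o\big] = 2\big(\hat{a} - \mathbb{E}[a \mid o]\big) = 0 \quad \Longrightarrow \quad \hat{a} = \mathbb{E}[a \mid o] $$
The objective is convex in $\hat{a}$, so this stationary point is the global minimum.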
The result is immediate: the MSE-optimal prediction for observation $o$ is the conditional mean $\mathbb{E}[a \mid o]$. When $p(a \mid o)$ has two modes at $a_L$ and $a_R$ with equal probability, the conditional mean is $(a_L + a_R)/2$ — which is exactly the midpoint between the modes. If those modes go around an obstacle, the midpoint goes through it.
Worked example: bimodal Gaussian. Let $p(a \mid o) = 0.5 \cdot \mathcal{N}(-3, 0.5^2) + 0.5 \cdot \mathcal{N}(+3, 0.5^2)$. The two modes are at $a = -3$ (go left) and $a = +3$ (go right), each with a tight spread of $\sigma = 0.5$. The conditional mean is $0.5 \times (-3) + 0.5 \times (+3) = 0$. The MSE-optimal prediction is $a = 0$, which has probability density essentially zero under the true distribution: it sits exactly between the two valid choices, six standard deviations from each. Under $\mathcal{N}(-3, 0.5^2)$, the density at $a=0$ is $\frac{1}{\sqrt{2\pi}\cdot 0.5}\exp(-\frac{9}{0.5}) = 0.798 \cdot e^{-18} \approx 1.2 \times 10^{-8}$, and by symmetry the mixture density at $a = 0$ is the same. The MSE "optimal" action has never been seen in any demonstration.
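The density claim is easy to check numerically. A minimal sketch using only the standard library; normal_pdf and mixture_pdf are helper names introduced here for the two-component mixture in the example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def mixture_pdf(x):
    """p(a | o) = 0.5 * N(-3, 0.5^2) + 0.5 * N(+3, 0.5^2)."""
    return 0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5)


mean_action = 0.5 * (-3.0) + 0.5 * 3.0          # what MSE regression converges to
density_at_mean = mixture_pdf(mean_action)
density_at_mode = mixture_pdf(3.0)

print(f"MSE-optimal action        : {mean_action}")
print(f"density at the mean       : {density_at_mean:.3e}")
print(f"density at a mode         : {density_at_mode:.3f}")
print(f"mode is more likely by    : {density_at_mode / density_at_mean:.1e}x")
```

The ratio printed on the last line is the point: an action drawn from either mode is tens of millions of times more plausible under the expert's distribution than the action MSE regression converges to.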
Interactive figure: bimodal action distribution, showing the true $p(a \mid o)$, the MSE prediction (mean), and the mode peaks.
Causal confusion
A policy with too much information can learn shortcuts that don't generalize. The classic example: a self-driving model with access to a brake-light indicator learns to brake when the brake light is on — which works perfectly on demonstration data and catastrophically when deployed, because it's predicting its own past action rather than reading the road.
In manipulation, causal confusion often appears when the policy is given proprioceptive history (past joint positions). The policy can learn to "replay" the demonstration trajectory by attending to where the joints were one step ago, rather than looking at the camera to see where the object is. This works perfectly on the training distribution (because the joints were always in the same place at the same point in the demo) and fails completely on new object positions.
The cure for causal confusion. Three practical fixes: (1) Observation dropout: randomly drop proprioceptive inputs during training (zero them out 50% of the time) so the policy cannot rely on them exclusively. (2) No past-action conditioning: never feed the policy its own previous action $a_{t-1}$, because this creates a direct causal shortcut. (3) Limited observation history: use only 1–2 past observations instead of the full trajectory. Longer history = more opportunity for the policy to memorize temporal patterns instead of learning to observe.
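Fix (1) is a small data-augmentation step applied per training sample. A hedged sketch: the function name drop_proprio and the choice to zero the whole vector at once (rather than individual dimensions) are assumptions made here, and the 50% rate is the figure quoted above.

```python
import random


def drop_proprio(proprio, p_drop=0.5, rng=random):
    """With probability p_drop, replace the proprioceptive vector with zeros.

    Apply during training only; at deployment the policy sees real values.
    """
    if rng.random() < p_drop:
        return [0.0] * len(proprio)
    return list(proprio)


rng = random.Random(0)                       # seeded for a repeatable demo
joints = [0.12, -0.40, 0.88, -1.57, 0.05, 1.20, 0.33]

n_trials = 10_000
n_dropped = sum(
    1 for _ in range(n_trials)
    if all(x == 0.0 for x in drop_proprio(joints, rng=rng))
)
print(f"dropped in {n_dropped / n_trials:.1%} of samples")
```

Apply the dropout after normalization so that zero corresponds to the mean joint configuration rather than to an arbitrary pose.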
Worked example: bimodal averaging. Two expert demonstrations from the same observation $o$:
Demo A: move the arm left at $+0.1$ rad/s.
Demo B: move the arm right at $-0.1$ rad/s.
Both are valid (two ways to reach around an obstacle). The MSE minimizer computes:
$$\hat{a} = \mathbb{E}[a \mid o] = 0.5 \times (+0.1) + 0.5 \times (-0.1) = 0.0 \text{ rad/s}$$
The predicted action is "don't move" — which is the one thing neither expert ever did. The robot freezes in front of the obstacle. This is not a bug in the optimizer; it is the mathematically correct answer to the wrong question. MSE asks "what is the mean?" when the question should be "what are the modes?"
DAgger: fixing distribution shift
Dataset Aggregation (Ross et al., 2011) directly attacks the distribution shift problem. The insight: if the policy's mistakes lead it to unfamiliar states, get expert labels for those states and add them to the training set. After enough rounds, the policy has seen its own failure modes and can recover from them.
The algorithm has five steps, repeated in a loop:
1. Train initial policy. Train $\hat\pi_1$ on the initial expert dataset $\mathcal{D}_0 = \{(o_i^*, a_i^*)\}$ using standard supervised learning (MSE or your preferred loss).
2. Roll out the learned policy. Deploy $\hat\pi_n$ in the environment (or simulator). Collect the observations $\{o_1, o_2, \ldots, o_T\}$ that the policy visits, not the ones the expert visits. These are the states where the policy actually operates, including the out-of-distribution ones where it fails.
3. Query the expert. For each policy-visited observation $o_t$, ask the expert: "What would you do here?" Record the expert's action $a_t^* = \pi^*(o_t)$. This is the expensive step: it requires a human or an optimal controller to label novel states.
4. Aggregate the dataset. $\mathcal{D}_n = \mathcal{D}_{n-1} \cup \{(o_t, a_t^*)\}$. The training set now contains both the original expert demonstrations and expert corrections for the policy's mistakes.
5. Retrain. Train $\hat\pi_{n+1}$ on the aggregated dataset $\mathcal{D}_n$. Go to step 2.
Why it works. After $N$ rounds of DAgger, the policy has been trained on states drawn from its own distribution (not just the expert's). The distribution shift vanishes, and the error bound drops from $O(\epsilon T^2)$ to $O(\epsilon T)$ — linear in the horizon, not quadratic. The quadratic term disappears because the policy no longer encounters states it hasn't been trained on; every state it visits is (approximately) in the training set.
Why $O(\epsilon T)$? In the DAgger bound, the policy's state distribution $d_t^{\hat\pi}$ converges to the training distribution (because we train on $d_t^{\hat\pi}$ itself). So the "being off-distribution" penalty in the original bound — the $t\epsilon$ term per step — vanishes. The only remaining error is the per-step mistake rate $\epsilon$, accumulated $T$ times: $\mathbb{E}[\text{mistakes}] \leq \epsilon T$. For $\epsilon = 0.01, T = 200$: 2 mistakes instead of 201.
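The "2 instead of 201" comparison is just two sums. A short sketch, assuming the per-step penalty of $t\epsilon$ for naive BC described above and a flat $\epsilon$ per step for DAgger (function names are illustrative):

```python
def naive_bc_mistake_bound(eps, horizon):
    """Sum of t * eps for t = 1..T: the off-distribution penalty grows with t."""
    return sum(t * eps for t in range(1, horizon + 1))


def dagger_mistake_bound(eps, horizon):
    """eps per step, accumulated T times: no off-distribution penalty."""
    return eps * horizon


EPS, T = 0.01, 200
naive = naive_bc_mistake_bound(EPS, T)
dagger = dagger_mistake_bound(EPS, T)

print(f"naive BC bound : {naive:.0f} expected mistakes")
print(f"DAgger bound   : {dagger:.0f} expected mistakes")
print(f"ratio          : {naive / dagger:.1f}x")
```

The ratio is $(T + 1)/2$, so the gap between the two bounds widens linearly with task length.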
dagger_loop.py
def dagger(env, expert, policy, initial_data, n_rounds=10, n_rollouts=5):
    """DAgger: Dataset Aggregation for imitation learning."""
    dataset = initial_data.copy()  # D_0 = expert demos
    for round_i in range(n_rounds):
        # Step 1 & 5: train on aggregated dataset
        policy.train(dataset)
        # Step 2: roll out the learned policy
        new_data = []
        for _ in range(n_rollouts):
            obs = env.reset()
            for t in range(env.max_steps):
                action = policy.predict(obs)  # policy's action
                # Step 3: query expert for the CORRECT action at this state
                expert_action = expert.label(obs)
                new_data.append((obs, expert_action))
                # Execute the POLICY's action (not the expert's)
                obs, _, done, _ = env.step(action)
                if done:
                    break
        # Step 4: aggregate
        dataset.extend(new_data)
        print(f"Round {round_i}: dataset size = {len(dataset)}")
    # Final retrain so the last round's corrections are not wasted
    policy.train(dataset)
    return policy
The critical detail: the env.step(action) call executes the policy's action (so the rollout visits the policy's distribution), but the pair appended to the dataset holds the expert's action (so the policy learns to correct its own mistakes). This is what closes the distribution gap.
DAgger in practice: limitations and variants
Pure DAgger is rarely used in modern manipulation for a practical reason: step 3 requires an expert to label arbitrary states, including states the robot reaches through failures. For a human teleoperator, this means watching the robot fail and retroactively answering "what would I have done at each frame?" — which is cognitively demanding and error-prone. Several practical variants address this:
HG-DAgger (Human-Gated DAgger): The human watches the robot execute the policy and intervenes only when necessary, taking over control. The intervention data is added to the dataset. This is psychologically easier than labeling every state.
IWR (Intervention Weighted Regression): When the human intervenes, the system records the intervention segment, which shows how to get from the policy's state back onto a good trajectory, and up-weights those samples during training. This "recovery" data teaches the policy to correct its own mistakes.
RL fine-tuning as implicit DAgger: Running RL on top of a BC policy has the same effect: the policy encounters its own failure states and learns to handle them, but through reward signal rather than expert labels. This is why BC + RL fine-tuning has largely replaced DAgger for high-stakes manipulation.
Worked example: DAgger sample efficiency. A pick-and-place task with $T = 100$ steps.
Round 0 (initial BC): 50 expert demonstrations = 5,000 observation-action pairs. Policy success rate: 40%.
Round 1: 5 rollouts of the learned policy = 500 new observations. Expert labels each = 500 new pairs. Dataset: 5,500. Success rate: 62%.
Round 2: 5 more rollouts = 500 new pairs from harder (failure) states. Dataset: 6,000. Success rate: 78%.
Round 3: 500 more pairs. Dataset: 6,500. Success rate: 85%.
With just 1,500 additional labeled observations (30% more data), the success rate more than doubled, from 40% to 85%. The marginal value of DAgger data is much higher than that of additional expert demonstrations because it targets exactly the states where the policy fails.
The fixes
Each failure has a class of solutions:
For compounding error: action chunking, receding-horizon control, history conditioning, DAgger-style on-policy correction, RL fine-tuning.
For multimodality: an expressive action head (GMM, discretized actions, diffusion, flow matching) in place of MSE regression.
For causal confusion: careful observation design, dropout on proprioception during training, and refusing to feed the policy the previous action.
05 The multimodality problem
Five answers to "how do you parameterize a multimodal action distribution?", in roughly chronological order.
Gaussian mixture heads
Predict a mixture: $\pi(a \mid o) = \sum_{k=1}^{K} w_k(o) \, \mathcal{N}(a; \mu_k(o), \Sigma_k(o))$. Train with negative log-likelihood. Robust, simple, interpretable. Limited by the number of mixture components and a tendency toward mode collapse during training. Works well as a baseline; was the basis of BC-RNN in early Robomimic experiments.
Deriving the GMM loss
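The negative log-likelihood for a GMM is more complex than MSE because it involves a log-sum-exp over mixture components. Given an observed action $a$ and observation $o$:
$$ \mathcal{L}_{\text{GMM}}(a, o) = -\log \sum_{k=1}^{K} w_k(o) \, \mathcal{N}\big(a;\, \mu_k(o),\, \Sigma_k(o)\big) $$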
In plain English: Instead of predicting one action (MSE), the network predicts $K$ different "candidate actions" along with how confident it is in each one. The loss says: "look at all your candidates — how well does at least one of them explain the expert's actual action?" If any candidate is close, the loss is low. This lets the network maintain multiple valid strategies simultaneously without averaging them.
In code: PyTorch has this built in: mix = torch.distributions.Categorical(logits=log_weights), comp = torch.distributions.Independent(torch.distributions.Normal(means, stds), 1), gmm = torch.distributions.MixtureSameFamily(mix, comp), then loss = -gmm.log_prob(actions).mean(). Four lines to replace MSE with a multimodal head. In practice, most codebases implement GMM NLL manually (as in the code below) for control over numerical stability via torch.logsumexp.
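The gradient with respect to the mean $\mu_j$ of component $j$ is:
$$ \nabla_{\mu_j} \mathcal{L}_{\text{GMM}} = -\, r_j \, \Sigma_j^{-1} (a - \mu_j), \qquad r_j = \frac{w_j \, \mathcal{N}(a;\, \mu_j, \Sigma_j)}{\sum_{k=1}^{K} w_k \, \mathcal{N}(a;\, \mu_k, \Sigma_k)} $$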
The term $r_j$ is the responsibility of component $j$ for this datapoint — the posterior probability that $a$ was generated by component $j$. Each component's mean only gets pulled toward actions that it "claims." If component 1 has responsibility 0.95 for a left-reaching action, component 2's mean barely moves. This is what lets GMMs maintain separate modes without collapsing.
The difficulty: the gradient involves a ratio of exponentials (the responsibilities), which can saturate. When components are far apart, responsibilities are nearly 0 or 1, and the gradient for the losing component vanishes — leading to mode collapse if training starts with poor initialization.
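The saturation is easy to see with numbers. Below is a 1-D sketch that computes responsibilities in log space (the same log-sum-exp trick used for numerical stability) and the resulting pull on each mean, assuming the standard result that the NLL gradient with respect to $\mu_j$ is $-r_j (a - \mu_j)/\sigma_j^2$. All function names are hypothetical.

```python
import math


def log_normal(a, mu, sigma):
    """Log-density of a 1-D Gaussian N(mu, sigma^2) at a."""
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def responsibilities(a, weights, mus, sigmas):
    """Posterior probability that each component generated the action a."""
    logs = [math.log(w) + log_normal(a, m, s) for w, m, s in zip(weights, mus, sigmas)]
    top = max(logs)                              # subtract the max for stability
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_mu(a, r_j, mu_j, sigma_j):
    """Gradient of the GMM NLL with respect to one component mean (1-D case)."""
    return -r_j * (a - mu_j) / sigma_j ** 2


# Overlapping components: both share the datapoint and both receive gradient.
r_close = responsibilities(0.0, [0.5, 0.5], [-0.1, 0.1], [0.2, 0.2])

# Well-separated components: the expert action sits on mode 1.
r_far = responsibilities(-3.0, [0.5, 0.5], [-3.0, 3.0], [0.5, 0.5])
g_far = grad_mu(-3.0, r_far[1], 3.0, 0.5)        # pull on the far component's mean

print(f"overlapping : r = {r_close}")
print(f"separated   : r = {r_far}")
print(f"gradient on the losing component's mean: {g_far:.2e}")
```

In the separated case the far component receives a gradient dozens of orders of magnitude smaller than the near one. If both components start near the same mode, nothing pushes one of them to go and claim the other.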
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GMMActionHead(nn.Module):
    """Gaussian Mixture Model action head for behavior cloning."""

    def __init__(self, obs_dim, act_dim, n_components=5):
        super().__init__()
        self.K = n_components
        self.act_dim = act_dim
        # Predict per component: 1 weight logit + D means + D log-stds
        self.net = nn.Linear(obs_dim, n_components * (1 + act_dim + act_dim))

    def forward(self, obs):
        """Returns log_weights (B, K), means (B, K, D), log_stds (B, K, D)."""
        raw = self.net(obs)                                # (B, K*(1 + D + D))
        raw = raw.view(-1, self.K, 1 + 2 * self.act_dim)
        log_w = F.log_softmax(raw[..., 0], dim=-1)         # (B, K)
        means = raw[..., 1:1 + self.act_dim]               # (B, K, D)
        log_s = raw[..., 1 + self.act_dim:]                # (B, K, D)
        return log_w, means, log_s

    def nll_loss(self, obs, actions):
        """Negative log-likelihood of actions under the GMM."""
        log_w, means, log_s = self(obs)                    # (B,K), (B,K,D), (B,K,D)
        a = actions.unsqueeze(1)                           # (B, 1, D)
        # Per-component log-prob: sum of D independent Gaussians
        var = torch.exp(2 * log_s)
        log_2pi = math.log(2 * math.pi)                    # 1.8379
        log_p = -0.5 * ((a - means) ** 2 / var + 2 * log_s + log_2pi).sum(-1)  # (B, K)
        # log-sum-exp over components: log sum_k w_k * N(a; mu_k, sig_k^2)
        log_mix = torch.logsumexp(log_w + log_p, dim=-1)   # (B,)
        return -log_mix.mean()
Worked example: GMM with K=2. Observation $o$ shows a mug equidistant from two valid grasp points. The network predicts: $w_1 = 0.55$, $\mu_1 = [0.3, 0.1, 0.0]$ (approach from left), $\sigma_1 = 0.02$; $w_2 = 0.45$, $\mu_2 = [0.3, -0.1, 0.0]$ (approach from right), $\sigma_2 = 0.02$. At inference, sample the mixture: with probability 0.55, go left; with 0.45, go right. The NLL training loss is:
$$\mathcal{L} = -\log\big(0.55 \cdot \mathcal{N}(a; \mu_1, \sigma_1^2 I) + 0.45 \cdot \mathcal{N}(a; \mu_2, \sigma_2^2 I)\big)$$
This cleanly represents the bimodality. The practical limit: when $K$ is smaller than the true number of modes, some modes are lost; when $K$ is too large, redundant components collapse onto each other.
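To make the responsibilities concrete, the sketch below evaluates the two-component mixture from the worked example in plain Python, so it runs without PyTorch. The demonstrated action `a_left` and the helper names are illustrative assumptions, not part of any library.

```python
import math

def log_gaussian(a, mu, sigma):
    """Log-density of an isotropic Gaussian N(a; mu, sigma^2 I)."""
    d = len(a)
    sq = sum((ai - mi) ** 2 for ai, mi in zip(a, mu))
    return -0.5 * (sq / sigma**2 + d * math.log(2 * math.pi * sigma**2))

def gmm_nll_and_responsibilities(a, weights, means, sigmas):
    """Return (NLL, responsibilities) for one action under a GMM."""
    log_terms = [math.log(w) + log_gaussian(a, mu, s)
                 for w, mu, s in zip(weights, means, sigmas)]
    m = max(log_terms)  # log-sum-exp trick for numerical stability
    log_mix = m + math.log(sum(math.exp(t - m) for t in log_terms))
    resp = [math.exp(t - log_mix) for t in log_terms]
    return -log_mix, resp

# Mixture parameters from the worked example.
weights = [0.55, 0.45]
means = [[0.3, 0.1, 0.0], [0.3, -0.1, 0.0]]
sigmas = [0.02, 0.02]

a_left = [0.3, 0.09, 0.0]  # hypothetical demonstration near the left grasp
nll, resp = gmm_nll_and_responsibilities(a_left, weights, means, sigmas)
print(f"NLL = {nll:.3f}, responsibilities = {resp}")
```

For this action the left component takes essentially all of the responsibility, so the right component's mean receives almost no gradient. That is the mechanism that keeps modes separate, and also the one that starves a badly initialized component.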
Discretized / categorical actions
Bin each action dimension into $B$ bins (typically 256), predict a categorical distribution per dimension, sample at test time. Expressive — a 256-bin categorical can represent any 1D distribution to that resolution — and the training objective is just cross-entropy. This is the recipe behind RT-1, RT-2, OpenVLA, and most VLAs. The loss reads cleanly:
In plain English: Instead of predicting continuous numbers, we chop each action dimension into 256 buckets (like rounding to 256 levels) and treat it as a classification problem. "Which bin should joint 3 be in?" is the same kind of question as "which word comes next?" — which is why this approach lets you reuse a pretrained language model.
RT-style discretized BC
$$\mathcal{L} = -\sum_{t=1}^{T_p} \sum_{d=1}^{D} \log p_\theta\big(\text{bin}(a_t^d) \mid o, a_{<t}\big)$$
In code: bin_indices = ((actions - action_min) / (action_max - action_min) * 255).long().clamp(0, 255) to discretize, then loss = F.cross_entropy(logits.view(-1, 256), bin_indices.view(-1)). That's it: standard classification. At inference: action = logits.argmax(-1).float() / 255 * (action_max - action_min) + action_min to convert back to continuous. The 256-bin resolution gives ~0.4% precision per dimension, well below teleop noise for most tasks.
$T_p$ — the prediction horizon, i.e. the number of future timesteps in the action chunk. RT-2 typically uses $T_p = 1$ (single-step); longer horizons (4–8) appear in chunked variants.
$D$ — the action dimensionality. For a 7-DoF arm this is 7 (6 joint/EE dims + 1 gripper). Each dimension is predicted independently.
$\text{bin}(a_t^d)$ — the discretization function that maps a continuous action value in dimension $d$ at timestep $t$ into one of $B$ bins (typically $B = 256$). The bin boundaries are uniformly spaced over each dimension's observed range.
$a_{<t}$ — previously predicted actions. In autoregressive variants (RT-2), actions at earlier timesteps condition later ones. In non-autoregressive variants (RT-1), this term is dropped and all dimensions are predicted in parallel.
$p_\theta(\cdot \mid o, a_{<t})$ — a categorical distribution over $B$ bins, output by the model. The loss is just cross-entropy: maximize the probability assigned to the correct bin.
Per-dimension factorization throws away cross-dimension correlation in a single timestep, which is mostly fine because action chunking gives you temporal structure to pick up the slack. Autoregressive variants restore correlation at the cost of slower inference.
Worked example: the combinatorial explosion and its fix. A 7-DOF arm with $B = 256$ bins per dimension. If we modeled the full joint distribution over all dimensions, the action space would be $256^7 \approx 7.2 \times 10^{16}$ categories — a softmax over 72 quadrillion options is not happening.
Per-dimension factorization fixes this: we predict 7 independent categoricals, each over 256 bins. Total size of the output layer: $7 \times 256 = 1{,}792$ logits. The cost: we assume the 7 dimensions are conditionally independent given the observation. For most manipulation tasks, this assumption is surprisingly benign: joint correlations within a single timestep are weak compared to temporal correlations across timesteps, which action chunking captures.
The cross-entropy loss for one timestep, one dimension $d$, with true bin index $b^* \in \{1, \ldots, 256\}$:
$$\mathcal{L}_d = -\log \frac{\exp(z_{b^*})}{\sum_{j=1}^{256} \exp(z_j)}$$
where $z_j$ are the raw logits from the network. The total loss sums over all dimensions and all timesteps in the chunk.
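The binning, un-binning, and per-dimension cross-entropy can be sketched in a few lines of plain Python. This is a minimal illustration for a single action dimension. The range `[-1, 1]` and the function names are assumptions chosen for the example, and the truncating `int()` mirrors the `.long()` cast in the one-liner above.

```python
import math

N_BINS = 256

def to_bin(value, lo, hi, n_bins=N_BINS):
    """Map a continuous value in [lo, hi] to an integer bin in [0, n_bins - 1]."""
    idx = int((value - lo) / (hi - lo) * (n_bins - 1))
    return max(0, min(n_bins - 1, idx))  # clamp out-of-range values

def from_bin(idx, lo, hi, n_bins=N_BINS):
    """Map a bin index back to a continuous value."""
    return idx / (n_bins - 1) * (hi - lo) + lo

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with log-sum-exp for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

lo, hi = -1.0, 1.0            # assumed per-dimension action range
action = 0.1234
b = to_bin(action, lo, hi)
recovered = from_bin(b, lo, hi)

# An untrained head with all-equal logits pays log(256) nats per dimension.
uniform_logits = [0.0] * N_BINS
loss_uniform = cross_entropy(uniform_logits, b)
```

The round-trip error is bounded by one bin width, $(\text{hi} - \text{lo}) / 255$, which is where the roughly 0.4% precision figure comes from.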
Implicit / energy-based
Train an energy function $E_\theta(o, a)$ and define $\pi(a \mid o) \propto e^{-E_\theta(o, a)}$. Minimize an InfoNCE-style contrastive loss with negatives sampled from a proposal distribution. Implicit BC (Florence et al., 2021) showed this beats MSE on multimodal tasks. The downside is sampling: at inference you have to do gradient descent or rejection sampling on the energy, which is slow and brittle. Largely superseded by diffusion.
The InfoNCE loss works as follows. For each observation-action pair $(o, a^+)$ from the dataset (the "positive"), sample $N$ negative actions $\{a^-_1, \ldots, a^-_N\}$ from a proposal distribution (e.g., uniform over the action space). The loss pushes the energy of the positive down and negatives up:
In plain English: Show the network one correct action and $N$ random wrong actions, and ask "which one is real?" This is the same contrastive setup behind CLIP (match image to caption) and SimCLR (match two augmented views). The energy function learns to assign low energy (high "confidence") to expert actions and high energy to random ones. The network doesn't predict an action directly — it scores candidate actions, and you find the best one at inference by searching.
InfoNCE for implicit BC
$$ \mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(-E_\theta(o, a^+))}{\exp(-E_\theta(o, a^+)) + \sum_{i=1}^{N} \exp(-E_\theta(o, a^-_i))} $$
In code: logits = -torch.cat([E(o, a_pos).unsqueeze(1), E(o, a_neg)], dim=1) then loss = F.cross_entropy(logits, torch.zeros(B, dtype=torch.long)), with the positive always at index 0. This is structurally identical to CLIP's loss. The catch is inference: unlike MSE or GMM where you get an action in one forward pass, here you must search for the action that minimizes $E_\theta(o, a)$ via gradient descent or sampling, at 50–100 optimization steps per control tick. This inference cost is why energy-based policies lost to diffusion.
$E_\theta(o, a)$ — the energy function: a scalar-valued neural network. Low energy = good action, high energy = bad action.
$a^+$ — the positive sample: the expert's actual action for observation $o$.
$a^-_i$ — negative samples: random actions drawn from a proposal distribution. The quality of negatives matters — if they are too easy (far from the expert's action), the loss provides little gradient signal.
The loss is equivalent to $(N+1)$-way classification: "which of these actions is the real one?"
At inference, finding the best action requires minimizing the energy function. Florence et al. use Langevin MCMC: start from a random action, take gradient steps $a \leftarrow a - \eta \nabla_a E_\theta(o, a) + \sqrt{2\eta}\,\epsilon$, and iterate. This is the bottleneck: 50–100 gradient steps per control tick, each requiring a backward pass through the energy network. Diffusion policies are also iterative, but they achieve similar expressiveness with a fixed, predetermined denoising schedule rather than an open-ended energy minimization.
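The two halves of an implicit policy, the contrastive loss and the inference-time search, can be sketched on a toy problem. Everything here is an assumption for illustration: the quadratic `energy` stands in for a trained $E_\theta$, its gradient is written analytically rather than obtained by backpropagation, and `noise_scale` is a hypothetical knob that recovers plain gradient descent at 0 and the Langevin update above at 1.

```python
import math
import random

def energy(o, a):
    """Toy stand-in for E_theta(o, a): lowest when a equals the target 2 * o."""
    return (a - 2.0 * o) ** 2

def energy_grad_a(o, a):
    """Analytic gradient of the toy energy with respect to the action."""
    return 2.0 * (a - 2.0 * o)

def info_nce(o, a_pos, a_negs):
    """(N+1)-way classification loss with the positive at index 0."""
    logits = [-energy(o, a_pos)] + [-energy(o, a) for a in a_negs]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[0]

def langevin_sample(o, steps=100, eta=0.05, noise_scale=0.0, seed=0):
    """Iterate a <- a - eta * dE/da + noise_scale * sqrt(2 * eta) * eps."""
    rng = random.Random(seed)
    a = rng.uniform(-1.0, 1.0)  # random initial action
    for _ in range(steps):
        eps = rng.gauss(0.0, 1.0)
        a = a - eta * energy_grad_a(o, a) + noise_scale * math.sqrt(2 * eta) * eps
    return a

rng = random.Random(1)
o = 0.25
negatives = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # uniform proposal
loss = info_nce(o, a_pos=0.5, a_negs=negatives)
a_star = langevin_sample(o)   # noise_scale=0 reduces to plain gradient descent
```

Even on this one-dimensional toy the search takes on the order of a hundred gradient evaluations to settle, which is the per-tick cost the text describes.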
Diffusion
The current default for high-fidelity manipulation. Train a denoiser $\epsilon_\theta(a^{(k)}, k, o)$ to remove Gaussian noise from a noised action sequence; sample by iterative denoising. Naturally multimodal, expressive, stable to train. Section 08 is dedicated to it.
The key idea: instead of predicting the action directly (which requires choosing one mode), predict how to remove noise from a corrupted action. Start from pure Gaussian noise $a^{(K)} \sim \mathcal{N}(0, I)$ and iteratively denoise: $a^{(k-1)} = a^{(k)} - \epsilon_\theta(a^{(k)}, k, o)$ (simplified). After $K$ steps, $a^{(0)}$ is a clean action. Because the denoiser is trained on all modes of the data, the sampling process naturally produces samples from any mode — no explicit mixture weights needed.
Flow matching
A close cousin of diffusion that learns a velocity field instead of a noise prediction. Cleaner objective, often fewer sampling steps, and the basis of Physical Intelligence's $\pi_0$. Covered in section 09.
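Where diffusion removes noise step by step, flow matching learns a vector field $v_\theta(a_t, t, o)$ that transports samples from noise ($t = 0$) to data ($t = 1$) along straight paths. The training loss is:
$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\,a_0,\,a_1}\Big[\big\| v_\theta(a_t, t, o) - (a_1 - a_0) \big\|^2\Big]$$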
where $a_t = (1-t) a_0 + t \, a_1$ is a linear interpolation between noise $a_0$ and data $a_1$. The target velocity is $(a_1 - a_0)$ — the constant-speed straight line. This is simpler than diffusion's DDPM loss and often requires fewer ODE integration steps at inference (1–10 vs. 10–100).
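A small numeric sketch shows why straight paths need few integration steps. It builds the interpolant and target velocity from the definitions above, then integrates an oracle velocity field with Euler steps. The oracle is an assumption made for illustration: it is the exact field only when the data distribution is a single point `a1`, and a trained $v_\theta$ would only approximate it.

```python
def interpolate(a0, a1, t):
    """Point on the straight path: a_t = (1 - t) * a0 + t * a1."""
    return [(1 - t) * x0 + t * x1 for x0, x1 in zip(a0, a1)]

def target_velocity(a0, a1):
    """Regression target for v_theta: the constant displacement a1 - a0."""
    return [x1 - x0 for x0, x1 in zip(a0, a1)]

def euler_sample(velocity_fn, a0, n_steps):
    """Integrate da/dt = v(a, t) from t = 0 (noise) to t = 1 (data)."""
    a, dt = list(a0), 1.0 / n_steps
    for k in range(n_steps):
        v = velocity_fn(a, k * dt)
        a = [x + dt * vx for x, vx in zip(a, v)]
    return a

a0 = [0.8, -1.2, 0.3]     # a noise sample (fixed here for reproducibility)
a1 = [0.3, 0.1, 0.0]      # a data sample (the demonstrated action)

def oracle_velocity(a, t):
    """Exact field when the data distribution is the single point a1."""
    return [(x1 - x) / (1.0 - t) for x, x1 in zip(a, a1)]

midpoint = interpolate(a0, a1, 0.5)
one_step = euler_sample(oracle_velocity, a0, n_steps=1)
five_step = euler_sample(oracle_velocity, a0, n_steps=5)
```

Because the path is a straight line, one Euler step and five Euler steps land on the same endpoint here. With real multimodal data the learned field curves, which is why practical samplers still use a handful of steps.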
Vector-quantized
Train a VQ-VAE over short action chunks, then learn an autoregressive transformer that predicts the discrete codes. VQ-BeT (Lee et al., 2024) is the canonical example. You get the multimodality benefits of categorical actions without per-dimension factorization, at the cost of a two-stage training pipeline.
Stage 1 trains a VQ-VAE that compresses an action chunk $a_{1:H} \in \mathbb{R}^{H \times D}$ into discrete codebook indices $[z_1, \ldots, z_M]$ where $M \ll H \times D$. Stage 2 trains a transformer to predict these codes autoregressively: $p(z_m \mid z_{<m}, o)$. At inference, sample codes from the transformer and decode through the VQ-VAE to get continuous actions. Because the codebook captures the natural modes of the data (different grasp strategies get different codes), the transformer can represent multimodality by assigning probability mass to different code sequences.
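The quantization step at the heart of stage 1 is a nearest-neighbor lookup. The sketch below shows only that lookup, under loud assumptions: the codebook is hand-written rather than learned, and there is no encoder or decoder network, so this is not the VQ-BeT architecture, just the discrete bottleneck it relies on.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def quantize(chunk, codebook):
    """Return the index of the codebook entry nearest to a flattened chunk."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(chunk, codebook[i]))

def decode(index, codebook):
    """Look the code back up to recover a continuous action chunk."""
    return codebook[index]

# Hypothetical 3-entry codebook over flattened chunks (H = 2 steps, D = 2 dims).
codebook = [
    [0.0, 0.1, 0.0, 0.2],     # "approach from left"
    [0.0, -0.1, 0.0, -0.2],   # "approach from right"
    [0.0, 0.0, 0.0, 0.0],     # "hold still"
]
chunk = [0.01, 0.09, 0.0, 0.22]   # a demonstrated chunk, slightly noisy
code = quantize(chunk, codebook)
reconstruction = decode(code, codebook)
```

The stage-2 transformer then only has to predict the integer `code`, so choosing between "left" and "right" becomes a categorical decision instead of a regression that could average the two.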
Action head comparison
| Head type | Multimodality | Inference | Pros | Cons |
| --- | --- | --- | --- | --- |
| MSE (Gaussian) | None (collapses to mean) | 1 pass (fastest) | Simple, fast, good baseline | Averages modes → invalid actions |
| GMM | $K$ modes (fixed) | 1 pass + sample | Interpretable, moderate expressiveness | Must choose $K$; mode collapse risk |
| Discretized | Arbitrary per-dim | 1 pass (fast) | Native to LMs, easy tokenization | Loses cross-dim correlations; resolution = $B$ |
| Energy-based | Arbitrary | 50–100 grad steps | Maximally expressive in theory | Slow inference, sampling brittleness |
| Diffusion | Arbitrary | 10–100 denoise steps | Expressive, stable, SOTA fidelity | Multi-step latency |
| Flow matching | Arbitrary | 1–10 ODE steps | Fewer steps, clean objective | Newer, less battle-tested |
| VQ | Arbitrary (codebook) | AR over codes | Full correlation, discrete sampling | Two-stage pipeline, codebook collapse |
The taxonomy of action heads is converging. By 2026, you pick between discretized (when you want a single transformer stack and language pretraining), diffusion / flow (when you want maximum fidelity at the cost of multi-step sampling), or VQ (when you want the best of both and can afford a two-stage recipe).
Practical decision tree
Task unimodal? MSE is fine. Rare in manipulation, common in locomotion.
Need language conditioning? Use discretized — only choice that fits natively into autoregressive LMs (RT-2, OpenVLA, $\pi_{0.5}$).
Need maximum fidelity on contact tasks? Use diffusion or flow matching — highest-quality samples.
Need fast inference AND multimodality? Flow matching (1–5 steps) or VQ (fixed code tokens).
Prototyping? Start with GMM ($K = 5$). Simplest multimodal head.
The field consolidates: diffusion/flow for research, discretized for production VLAs. Energy-based models are historically important but effectively superseded by diffusion.
Worked example: choosing an action head. You are building a bimanual manipulation policy for laundry folding with an ALOHA-style robot.
Is the task unimodal? No — multiple valid folding strategies exist.
Need language conditioning? Not initially — single-task.
Latency? Moderate — folding is slow, 10Hz (100ms) is fine. Diffusion with 10 DDIM steps fits.
Action space: 14-DOF (7 per arm), joint positions from teleop.
Recommendation: Diffusion Policy with $H = 16, K = 8$. Handles multimodality naturally, comfortable latency, best-validated for bimanual.
Alternative: ACT with CVAE if you prefer simpler training and have enough demonstrations for the VAE to capture modes.
06 Action chunking
The single most underrated idea in modern robot learning.
The original BC recipe predicts one action per observation. Action chunking predicts a sequence of $H$ future actions per observation. The change is small in code and large in consequence.
[Figure legend: predicted · re-plan · next]
Why it works
Three reasons, all important:
It captures non-Markovian behavior. Real demonstrations have temporal structure — pre-grasps, follow-throughs — that a single-step policy must reproduce from scratch each tick. Chunks let the model commit to a plan.
It reduces the frequency of compounding-error opportunities. If you re-observe and re-decide every $K$ steps instead of every step, the policy has $T/K$ chances to go wrong instead of $T$. The compounding error bound becomes $O(\epsilon (T/K)^2)$ — for $K = 8$ and $T = 200$, this is $O(\epsilon \cdot 625)$ instead of $O(\epsilon \cdot 40000)$, a 64× reduction.
It is a regularizer against pathological idle modes. Single-step policies trained on human demonstrations full of pauses learn to predict "stay still" because most actions are small; chunked policies see the whole motion and stop pausing.
Derivation: why chunking reduces the compounding bound
Let us formalize the second point. In the Ross & Bagnell framework, the compounding error bound is $O(\epsilon T^2)$ where $T$ is the number of independent policy decisions over the horizon. Without chunking, $T$ equals the number of timesteps.
With action chunking (chunk length $H$, execution length $K$), the policy makes a decision every $K$ steps. The number of independent decisions is $\lceil T / K \rceil$. Within each chunk, the $K$ executed actions are a single coherent plan; they do not independently compound. Substituting into the Ross-Bagnell bound:
$$O\big(\epsilon\,T^2\big) \;\longrightarrow\; O\big(\epsilon\,\lceil T/K \rceil^2\big) \approx O\!\left(\frac{\epsilon\,T^2}{K^2}\right)$$
$K$ — the number of steps executed per chunk before re-planning
$T/K$ — the effective horizon (number of independent decisions)
The quadratic term dominates, so the bound scales as $T^2/K^2$
In code (the flow-matching loss from the action-head discussion): t = torch.rand(B, 1, 1), a0 = torch.randn_like(actions), a_t = (1 - t) * a0 + t * actions, loss = F.mse_loss(v_net(a_t, t, obs), actions - a0). Four lines, same as DDPM but simpler: no noise schedule, no $\bar\alpha_k$ bookkeeping. The target is the displacement from noise to data, and the network learns to predict that displacement at any point along the interpolation path. See the full training and inference code below.
Worked example: error reduction by chunk size. $T = 200$ steps, $\epsilon = 0.01$.
No chunking ($K=1$): $\epsilon T^2/2 = 0.01 \times 40{,}000/2 = 200$ expected mistakes.
$K=4$: $\epsilon (T/4)^2/2 = 0.01 \times 2{,}500/2 = 12.5$ mistakes. A 16× reduction.
$K=8$ (Diffusion Policy default): $\epsilon (200/8)^2/2 = 0.01 \times 625/2 = 3.1$ mistakes. A 64× reduction.
$K=16$: $0.01 \times (200/16)^2/2 = 0.01 \times 156.25/2 = 0.78$ mistakes. A 256× reduction.
Doubling $K$ gives a $4\times$ reduction in the quadratic term. The cost: within each chunk of $K$ steps, the policy is open-loop (no feedback), so it cannot react to unexpected perturbations. The optimal $K$ balances compounding-error reduction against open-loop risk.
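The arithmetic in this worked example is easy to reproduce and extend to other horizons. The helper below is a sketch with a hypothetical name. It uses $T/K$ without rounding up, to match the numbers above, whereas the derivation's $\lceil T/K \rceil$ would give slightly larger values when $K$ does not divide $T$.

```python
def compounding_bound(eps, T, K=1):
    """Quadratic bound eps * n^2 / 2 with n = T / K independent decisions."""
    n_decisions = T / K
    return eps * n_decisions**2 / 2

T, eps = 200, 0.01
baseline = compounding_bound(eps, T)          # K = 1, no chunking
table = {K: compounding_bound(eps, T, K) for K in (1, 4, 8, 16)}

for K, bound in table.items():
    print(f"K={K:2d}  bound={bound:8.3f}  reduction={baseline / bound:6.1f}x")
```

The bound only measures compounding error. It says nothing about the open-loop risk that grows with $K$, so it should not be minimized in isolation.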
Receding horizon control
The standard inference recipe: predict $H$ actions, execute the first $K \leq H$, replan. This is classical model-predictive control with a learned policy as the model. Diffusion Policy popularized $H = 16, K = 8$. ACT pushed harder with $H = 100, K = 1$ plus temporal ensembling (next).
Worked example: receding-horizon error reduction. Without chunking ($H = 1, K = 1$): the policy makes $T = 200$ independent decisions. Compounding error bound: $O(\epsilon \times 200^2) = O(40000\epsilon)$.
With Diffusion Policy ($H = 16, K = 8$): the policy makes $200/8 = 25$ independent decisions. Within each chunk of 8, the actions are coherent (planned together). Error bound: $O(\epsilon \times 25^2) = O(625\epsilon)$. A 64× reduction.
With ACT ($H = 100, K = 1$): re-plan every step, but temporal ensembling averages overlapping predictions. The effective number of "independent" decisions depends on $\alpha$. With heavy smoothing ($\alpha = 0.01$), a rough estimate of the effective number of decisions is $\sim T/50 = 4$. Error bound: $O(\epsilon \times 4^2) = O(16\epsilon)$. That is a 2500× reduction by this heuristic, which is why ACT works on 2-second bimanual tasks where naive BC fails immediately.
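The predict-$H$, execute-$K$, replan loop fits in a dozen lines. The sketch below uses stand-in `dummy_policy` and `dummy_env_step` functions (both hypothetical, scalar-valued for brevity) purely to count how many times the policy is queried over $T = 200$ steps.

```python
def dummy_policy(obs, H):
    """Stand-in for a learned chunk predictor: returns H scalar actions."""
    return [obs + 0.01 * (i + 1) for i in range(H)]

def dummy_env_step(obs, action):
    """Stand-in dynamics: the next observation is simply the commanded action."""
    return action

def run_receding_horizon(policy, env_step, obs, T, H, K):
    """Predict H actions, execute the first K, re-observe, and repeat."""
    assert 1 <= K <= H
    n_policy_calls, executed = 0, []
    while len(executed) < T:
        chunk = policy(obs, H)          # one inference call per re-plan
        n_policy_calls += 1
        for action in chunk[:K]:        # open-loop within the chunk
            if len(executed) == T:
                break
            obs = env_step(obs, action)
            executed.append(action)
    return n_policy_calls, executed

calls_chunked, traj = run_receding_horizon(
    dummy_policy, dummy_env_step, 0.0, T=200, H=16, K=8)
calls_single, _ = run_receding_horizon(
    dummy_policy, dummy_env_step, 0.0, T=200, H=1, K=1)
```

With $H = 16, K = 8$ the policy is queried 25 times instead of 200, which is the "25 independent decisions" in the example above. The last $H - K$ predicted actions of each chunk are discarded.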
Temporal ensembling
If you re-predict every step but each prediction is a chunk, you can average overlapping predictions for the same future timestep. ACT's recipe: at inference time $t$, average all predictions of action $a_t$ made at recent timesteps, weighted exponentially by recency. This drops control-signal jitter and is essentially free.
In plain English: Since we re-predict every step but each prediction covers 100 future steps, we have MANY predictions for the same future timestep — one from now, one from 1 step ago, one from 2 steps ago, etc. We blend them using an exponential moving average: recent predictions get more weight, older ones fade. This is a free low-pass filter that smooths out the jitter from noisy camera observations.
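$$a_t = \frac{\sum_{i=0}^{m} w_i\,\hat{a}_t^{(t-i)}}{\sum_{i=0}^{m} w_i}, \qquad w_i = \exp(-\alpha i)$$
$a_t$ — the executed action at timestep $t$, after ensembling. This is what actually goes to the robot.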
$\hat{a}_t^{(t-i)}$ — the prediction of action $a_t$ that was made $i$ steps ago, when the observation was $o_{t-i}$. Because each inference produces a chunk of $H$ future actions, many past inferences will have included a prediction for the current timestep $t$.
$m$ — the ensemble window size, i.e. how many past predictions to average over. Bounded by the chunk length $H$: you can look back at most $H-1$ steps (since a prediction made $H$ steps ago wouldn't have reached timestep $t$). In practice $m = H - 1$.
$w_i = \exp(-\alpha i)$ — the exponential recency weight. Predictions made more recently ($i = 0$) get weight 1.0; older predictions ($i = m$) get weight $e^{-\alpha m}$. This is an exponential moving average kernel.
$\alpha$ — the decay rate. Controls how much to trust recent vs. old predictions. $\alpha = 0.01$ (ACT's default) means very slow decay — nearly uniform averaging, strong smoothing. $\alpha = 0.5$ would trust only the newest 2–3 predictions. Larger $\alpha$ = more responsive but jitterier.
In code: weights = torch.exp(-alpha * torch.arange(len(preds))) then action = (weights[:, None] * torch.stack(preds)).sum(0) / weights.sum(), an exponential moving average over overlapping chunk predictions. The full implementation is ~20 lines (see the TemporalEnsemble class below). In your system, this replaces the raw policy output with no additional network evaluations; it is pure post-processing that runs in microseconds. If your deployed policy is twitchy, add temporal ensembling before anything else.
Worked example: temporal ensembling. At timestep $t = 50$, we want to compute the executed action $a_{50}$. We have predictions from three recent timesteps (with $\alpha = 0.01$, keeping a window of $m = 2$ past predictions):
$\hat{a}_{50}^{(50)}$ (just predicted now) = $[0.12, -0.05, 0.31]$, weight $w_0 = e^{0} = 1.0$.
$\hat{a}_{50}^{(49)}$ (predicted one step ago) = $[0.13, -0.04, 0.30]$, weight $w_1 = e^{-0.01} = 0.990$.
$\hat{a}_{50}^{(48)}$ (predicted two steps ago) = $[0.11, -0.06, 0.32]$, weight $w_2 = e^{-0.02} = 0.980$.
Sum of weights = 2.970. Ensembled action: $a_{50} = (1.0 \times [0.12, -0.05, 0.31] + 0.990 \times [0.13, -0.04, 0.30] + 0.980 \times [0.11, -0.06, 0.32]) / 2.970 = [0.120, -0.050, 0.310]$.
The result is almost identical to the newest prediction because the three predictions nearly agree and, with $\alpha$ this small, they are weighted almost equally. The payoff comes when one prediction is off: if the newest prediction had been $[0.12, 0.50, 0.31]$ (a sudden spike in dimension 2), the ensemble would have damped that dimension to about 0.135, catching the occasional bad prediction that would otherwise cause a jerk.
Code: temporal ensembling
Temporal ensembling is simple to implement. The key data structure is a buffer of recent chunk predictions, indexed by the target timestep.
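A minimal sketch of such a buffer follows, in plain Python so it runs without PyTorch. It is one possible shape for a `TemporalEnsemble` class, not a reproduction of any particular codebase. The method names are assumptions, and the weighting follows the recency convention $w_i = \exp(-\alpha i)$ used in this section. The usage at the bottom replays the three-prediction worked example for $t = 50$.

```python
import math
from collections import defaultdict

class TemporalEnsemble:
    """Exponentially weighted average of overlapping chunk predictions."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        # target timestep -> list of (time the prediction was made, action vector)
        self.buffer = defaultdict(list)

    def add_chunk(self, t_now, chunk):
        """Store a chunk predicted at t_now; chunk[j] targets timestep t_now + j."""
        for j, action in enumerate(chunk):
            self.buffer[t_now + j].append((t_now, list(action)))

    def action_for(self, t):
        """Blend every stored prediction for timestep t, newest weighted most."""
        preds = self.buffer.pop(t)  # raises KeyError if nothing targets t
        weights = [math.exp(-self.alpha * (t - t_made)) for t_made, _ in preds]
        total = sum(weights)
        dim = len(preds[0][1])
        return [sum(w * a[d] for w, (_, a) in zip(weights, preds)) / total
                for d in range(dim)]

# Chunks of length H = 3 made at t = 48, 49, 50. Only the entries that target
# t = 50 carry real values; the zeros are placeholders for other timesteps.
ens = TemporalEnsemble(alpha=0.01)
ens.add_chunk(48, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.11, -0.06, 0.32]])
ens.add_chunk(49, [[0.0, 0.0, 0.0], [0.13, -0.04, 0.30], [0.0, 0.0, 0.0]])
ens.add_chunk(50, [[0.12, -0.05, 0.31], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
a_50 = ens.action_for(50)
```

In a control loop you would call `add_chunk(t, policy(obs))` followed by `action_for(t)` once per tick. Popping the entry for `t` keeps the buffer from growing without bound.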
Worked example: 5 overlapping predictions. With $H = 8$ and $K = 1$ (re-predict every step), at timestep $t = 10$ we have predictions from 5 recent inferences, all of which included a prediction for $a_{10}$. With $\alpha = 0.01$:
$\hat{a}_{10}^{(t=10)}$: age 0, weight $e^0 = 1.000$, value $= 0.250$.
$\hat{a}_{10}^{(t=9)}$: age 1, weight $e^{-0.01} = 0.990$, value $= 0.248$.
$\hat{a}_{10}^{(t=8)}$: age 2, weight $e^{-0.02} = 0.980$, value $= 0.253$.
$\hat{a}_{10}^{(t=7)}$: age 3, weight $e^{-0.03} = 0.970$, value $= 0.246$.
$\hat{a}_{10}^{(t=6)}$: age 4, weight $e^{-0.04} = 0.961$, value $= 0.251$.
Weight sum: $4.901$. Weighted sum: $1.000 \times 0.250 + 0.990 \times 0.248 + 0.980 \times 0.253 + 0.970 \times 0.246 + 0.961 \times 0.251 = 1.2233$.
Ensembled: $a_{10} = 1.2233 / 4.901 = 0.2496$. The five predictions were already close (range 0.246–0.253), so the ensemble barely changes the answer. But if the age-3 prediction had been $0.350$ (a spike from observation noise), the ensemble would return $\approx 0.270$, damping a 40% outlier to an 8% deviation. This is the value: a free low-pass filter on policy jitter.
[Interactive figure: temporal ensembling. Legend: raw predictions (jittery) · ensembled output · true signal]
If your BC policy is twitchy at inference time, temporal ensembling is a one-paragraph code change that often fixes it. If it doesn't, your policy is multimodal in a way that ensembling will worsen — you need a multimodal head, not a smoother.
Choosing H and K in practice
The chunk length $H$ and execution length $K$ are the two most important hyperparameters in modern BC. They interact in non-obvious ways:
| System | $H$ | $K$ | Ensemble | Why |
| --- | --- | --- | --- | --- |
| Diffusion Policy | 16 | 8 | No | Execute half the chunk. Good balance for general manipulation. |
| ACT (ALOHA) | 100 | 1 | Yes | Re-predict every step, ensemble for smoothness. Fine bimanual tasks. |
| RT-2 | 1 | 1 | No | Single-step. Relies on autoregressive tokens for coherence. |
| $\pi_0$ | 50 | 1 | Yes | Long chunk + flow matching. Re-predict every step. |
The tradeoff: large $K$ (execute many steps before re-observing) gives maximum compounding-error reduction but is open-loop within the chunk — the robot cannot react to perturbations. Small $K$ (re-observe frequently) preserves reactivity but needs temporal ensembling to avoid jitter. The sweet spot depends on the task: slow assembly tolerates $K = 8$; reactive catching needs $K = 1$ with ensembling.
When temporal ensembling hurts. If the policy's predictions for the same timestep are bimodal (sometimes "go left," sometimes "go right"), averaging them produces the mean, which is the same pathology as MSE on multimodal actions. Ensembling is actively harmful here: it converts a policy that sometimes picks the right mode into one that always picks the invalid mean. The fix: use a multimodal action head (diffusion, flow, VQ) so each prediction is a committed mode sample, and ensemble only across predictions that agree on the mode.
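A four-number example makes the failure concrete. The values are invented for illustration: a scalar lateral offset where $+0.1$ means "pass left," $-0.1$ means "pass right," and an obstacle sits at $0$.

```python
def ensemble(preds, weights):
    """Weighted average of scalar predictions for one timestep."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, preds)) / total

# Lateral offset of the approach: +0.1 = pass left, -0.1 = pass right.
# An obstacle sits at 0.0, so anything near zero is a collision.
unimodal = [0.10, 0.11, 0.09, 0.10]     # successive predictions agree
bimodal = [0.10, -0.10, 0.10, -0.10]    # successive predictions flip modes
weights = [1.0, 0.99, 0.98, 0.97]       # exp(-0.01 * age), rounded

smooth = ensemble(unimodal, weights)      # stays near +0.1
collapsed = ensemble(bimodal, weights)    # lands near 0.0, on the obstacle
```

In the unimodal case ensembling only removes jitter. In the bimodal case it steers the robot straight at the obstacle, even though every individual prediction was a valid action.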
The information-theoretic view
Action chunking is a form of temporal abstraction. A single-step policy encodes its future plan into $D$ numbers (e.g. 7 for a 7-DOF arm). A chunked policy packs $H \times D$ numbers, giving it $H\times$ more bandwidth to communicate intent to the actuator.
Consider a task requiring a specific velocity profile during approach (slow down, align, contact). A single-step policy re-derives the velocity at each tick from the image alone — any observation noise produces jitter. A chunked policy predicts the entire approach profile at once, encoding the velocity ramp as a smooth curve. The smoothness is built into the output representation, not forced by a post-hoc filter.
Worked example: information bandwidth. 7-DOF arm at 10Hz, $T = 200$ steps:
Single-step ($H = 1$): 7 numbers per decision. 200 decisions, each independently predicted.
Chunked ($H = 16, K = 8$): $16 \times 7 = 112$ numbers per decision. 25 decisions with built-in temporal coherence. The policy communicates 2× more information about its plan while making 8× fewer independent decisions.
07 ACT, in full
A conditional VAE wrapped around a transformer encoder–decoder, predicting one hundred actions at a time.
Action Chunking with Transformers (Zhao et al., 2023) is the policy that ships with ALOHA — a low-cost bimanual teleop platform whose data made fine bimanual tasks tractable for amateurs. The policy itself is small (≈80M parameters) and trains from scratch in hours on a single GPU. It is a useful object lesson in modern BC because every architectural choice answers a specific failure mode of section 04.
The CVAE wrapping
ACT is a conditional variational autoencoder over action chunks. Two encoders, one decoder.
Style encoder $q_\phi(z \mid a_{1:H}, q)$ — a transformer that takes the ground-truth action chunk plus current joint positions and emits parameters of a Gaussian over a latent $z \in \mathbb{R}^{32}$. Used only at training time.
Observation encoder — ResNet-18 image backbones run on each camera (typically four: top, front, two wrist cams). The features get flattened into tokens, joined by a proprioception token and the latent token, and fed to a transformer encoder that fuses them.
Decoder — a transformer decoder cross-attends to the encoder output and emits an action chunk of length $H = 100$ in parallel (not autoregressive — the queries are fixed positional embeddings for the $H$ output slots).
[Figure legend: output path · training-only path · shared]
The loss
In plain English: The ACT loss has two jobs. First, make the predicted actions match the demonstration (the L1 reconstruction term — "did you get the motion right?"). Second, keep the "style encoder" from memorizing the demonstrations too precisely (the KL term — "don't cheat by encoding the answer into z"). At test time, z is set to zero, so if the encoder smuggled too much information into z during training, the decoder won't know what to do without it. The KL penalty forces the decoder to work well even when z is uninformative.
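$$\mathcal{L}_{\text{ACT}} = \big\| a_{1:H} - \hat{a}_{1:H} \big\|_1 + \beta\,\mathrm{KL}\big(q_\phi(z \mid a_{1:H}, q)\,\|\,\mathcal{N}(0, I)\big)$$
$a_{1:H}$ — the ground-truth action chunk from the demonstration: a sequence of $H$ actions (typically $H = 100$ at 50 Hz = 2 seconds of motion). Each action is a vector in $\mathbb{R}^D$ (e.g., $D = 14$ for bimanual).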
$\hat{a}_{1:H}$ — the predicted action chunk from the decoder, conditioned on the observation and the latent $z$.
$\| \cdot \|_1$ — the L1 (Manhattan) norm, summed over all $H \times D$ entries. L1 is more robust to teleoperation jitter than L2 and discourages over-smoothing.
$q_\phi(z \mid a_{1:H}, q)$ — the style encoder (variational posterior). During training, it sees the ground-truth action chunk and the joint positions $q$, and outputs a distribution over the latent $z \in \mathbb{R}^{32}$. It captures which mode the demonstration is in (e.g., approach from left vs. right).
$\mathcal{N}(0, I)$ — the prior over $z$. A standard multivariate Gaussian with zero mean and identity covariance.
$\mathrm{KL}(\cdot \| \cdot)$ — the Kullback-Leibler divergence, measuring how far the learned posterior $q_\phi$ is from the simple prior. Always $\geq 0$; equals zero only when $q_\phi$ exactly matches $\mathcal{N}(0,I)$. Think of it as the "information cost" of encoding style information into $z$.
$\beta = 10$ — the KL weight. A value much larger than the standard $\beta = 1$ aggressively pushes $q_\phi$ toward the prior, so that at test time (when $z = \mathbf{0}$, the prior mean) the decoder still produces sensible actions.
In code: The reconstruction loss is F.l1_loss(predicted_chunk, target_chunk) — L1, not L2, because L1 is more robust to teleop jitter and discourages over-smoothing. The KL is kl = 0.5 * (mu**2 + sigma**2 - 1 - torch.log(sigma**2)).sum(-1).mean(). Combined: loss = l1_loss + 10.0 * kl. The factor of 10 is not a typo — it aggressively pushes the posterior toward the prior so that the decoder works at inference when $z = \mathbf{0}$. If you reduce $\beta$ below ~5, the policy becomes erratic at test time because the decoder has learned to depend on style information that is no longer available.
Fundamental: KL divergence. The Kullback-Leibler divergence $\mathrm{KL}(q \| p) = \mathbb{E}_q[\log(q/p)]$ measures how different distribution $q$ is from distribution $p$. It is always $\geq 0$ and equals zero only when $q = p$. Think of it as the "extra bits" you pay if you use $p$ to encode data that actually comes from $q$. In the CVAE context, it penalizes the learned posterior $q_\phi$ for being too different from the simple prior $\mathcal{N}(0, I)$ — ensuring the decoder works well at test time when $z$ is drawn from the prior (or set to its mean).
Two pieces deserve scrutiny:
L1 not L2. L1 is robust to the small label noise that comes from imperfect teleoperation; it discourages over-smoothing of fine motions. The original paper ablated this — L2 gave noticeably worse fine-control success.
$\beta = 10$ in the original code. A relatively strong KL pulls the posterior toward the prior so that test-time inference (with $z$ set to the prior mean of zero) produces sensible actions.
Derivation: the ELBO that gives ACT's loss
Goal. Derive the ACT loss from the variational lower bound on $\log p_\theta(a_{1:H} \mid o)$.
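We want to maximize $\log p_\theta(a_{1:H} \mid o)$ but cannot compute this directly because it involves marginalizing over $z$. Introduce a variational posterior $q_\phi(z \mid a_{1:H}, o)$ and apply Jensen's inequality:
$$\log p_\theta(a_{1:H} \mid o) \;\geq\; \mathbb{E}_{z \sim q_\phi}\big[\log p_\theta(a_{1:H} \mid z, o)\big] - \mathrm{KL}\big(q_\phi(z \mid a_{1:H}, o)\,\|\,\mathcal{N}(0, I)\big)$$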
ACT makes two concrete choices. First, the reconstruction term: rather than modeling $p_\theta(a \mid z, o)$ as a Gaussian (which would give L2), they use a Laplace likelihood, $p_\theta(a \mid z, o) \propto \exp(-\|a - \hat{a}\|_1 / b)$, giving the L1 loss. Second, $\beta = 10$ instead of $\beta = 1$, over-weighting the KL to ensure the posterior $q_\phi$ stays close to the prior — so that at inference, when $z = \mathbf{0}$ (the prior mean), the decoder still produces coherent actions.
Worked KL computation. Suppose the style encoder outputs $\mu_z = [0.3, -0.1, \ldots]$ and $\log \sigma_z^2 = [-2.0, -1.5, \ldots]$ for a single training example (32-dim latent). The KL divergence from $\mathcal{N}(\mu_z, \mathrm{diag}(\sigma_z^2))$ to $\mathcal{N}(0, I)$ is:
$$\mathrm{KL} = \frac{1}{2}\sum_{j=1}^{32}\left(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\right)$$
For dimension $j=1$: $\sigma_1^2 = e^{-2.0} = 0.135$, $\mu_1^2 = 0.09$. Term = $\frac{1}{2}(0.135 + 0.09 - 1 - (-2.0)) = \frac{1}{2}(1.225) = 0.613$.
For dimension $j=2$: $\sigma_2^2 = e^{-1.5} = 0.223$, $\mu_2^2 = 0.01$. Term = $\frac{1}{2}(0.223 + 0.01 - 1 - (-1.5)) = \frac{1}{2}(0.733) = 0.367$.
Sum all 32 dimensions and multiply by $\beta = 10$. A typical KL per example is around 5–20 nats; the $\beta = 10$ weighting makes this 50–200 in the loss, comparable to the L1 reconstruction term which is roughly $H \times D \times \bar\epsilon \approx 100 \times 14 \times 0.02 = 28$.
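The per-dimension KL terms and the combined loss can be checked numerically. The sketch below is plain Python with hypothetical helper names. It uses a summed L1 term to match the $\|\cdot\|_1$ definition above, and a two-entry "chunk" chosen only so the total is easy to verify by hand.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), returned per dimension."""
    return [0.5 * (math.exp(lv) + m**2 - 1.0 - lv) for m, lv in zip(mu, log_var)]

def l1_loss_sum(pred, target):
    """Sum of absolute errors over every entry of the flattened chunk."""
    return sum(abs(p - t) for p, t in zip(pred, target))

def act_loss(pred, target, mu, log_var, beta=10.0):
    """L1 reconstruction plus beta-weighted KL."""
    return l1_loss_sum(pred, target) + beta * sum(kl_to_standard_normal(mu, log_var))

# First two latent dimensions from the worked computation above.
terms = kl_to_standard_normal(mu=[0.3, -0.1], log_var=[-2.0, -1.5])

# Tiny illustrative chunk (2 entries) so the total is easy to check by hand.
loss = act_loss(pred=[0.10, 0.20], target=[0.12, 0.17],
                mu=[0.3, -0.1], log_var=[-2.0, -1.5])
```

The two entries of `terms` reproduce the hand-computed values for $j = 1$ and $j = 2$. Working in log-variance, as here, avoids taking the log of a variance that has underflowed to zero.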
Inference
At test time the style encoder is discarded; $z$ is fixed to $\mathbf{0}$. The model is now a deterministic regressor: observations in, $H$ actions out. Receding horizon with $K = 1$ and temporal ensembling. The CVAE is therefore not used to sample diverse actions — it's used as a training-time regularizer that lets the model represent action multimodality during training without forcing the deployed model to be stochastic.
ACT's CVAE is a clever cheat: the posterior absorbs the multimodality of human demonstrations during training, the prior collapses to zero at inference, and what you ship is a deterministic transformer with very stable behavior.
What ACT gets right
Action chunking $H = 100$ at 50Hz means one prediction covers two seconds of motion.
Wrist cameras and DETR-style positional queries for the action slots — fast, parallel, no autoregressive bottleneck.
Temporal ensembling with $\alpha \approx 0.01$ — a one-line change with outsized impact.
What ACT doesn't do
It does not generalize across embodiments — it is trained per-platform.
It does not condition on language out of the box.
The 80M parameter size is not enough to absorb a large multi-task dataset; ACT is a strong single-task or few-task policy, not a foundation model.
Hyperparameters that matter
| Knob | Default | If you change it |
| --- | --- | --- |
| Chunk $H$ | 100 | Smaller = more reactive, less smooth. Below 20, multimodality issues return. |
| | | Wrist cams justify the compute; lower is fine for scene cams. |
08 Diffusion Policy, in full
The action distribution as a denoising process. The architecture that ate manipulation.
Diffusion Policy (Chi et al., 2023) is a behavior-cloning architecture that models $p(a_{1:H} \mid o)$ as the reverse of a Gaussian diffusion process. It is among the strongest single-task BC architectures in published benchmarks, and its variants underpin many of the post-2024 generalist policies.
The forward and reverse processes
In plain English: We're going to gradually drown a clean action sequence in static, one step at a time. At step 0, it's the expert's perfect trajectory. At step 50, it's mostly signal with some fuzz. At step 100, it's pure white noise — completely random numbers, no trace of the original. Then we train a network to UNDO this process: given a noisy mess and "you're at noise level k," predict what the noise was. If it can do that, we can start from pure static and iteratively subtract noise until a clean action chunk emerges. The forward process is never run at inference — it exists only to create training data for the denoiser.
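Pick a sequence of noise levels $\{\beta_k\}_{k=1}^{K}$ and define $\alpha_k = 1 - \beta_k$, $\bar\alpha_k = \prod_{i=1}^k \alpha_i$. The forward process gradually corrupts an action chunk:
$$a_{1:H}^{(k)} = \sqrt{\bar\alpha_k}\, a_{1:H}^{(0)} + \sqrt{1-\bar\alpha_k}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$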
$a_{1:H}^{(0)}$ — the clean action chunk from the demonstration. Superscript $(0)$ means zero noise. This is what we want to recover at the end of denoising.
$a_{1:H}^{(k)}$ — the noised action chunk at noise level $k$. As $k$ increases, this looks less like the original and more like pure Gaussian noise.
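$\beta_k$ — the noise schedule at step $k$. Controls how much noise is added per step. The original DDPM schedule starts near 0.0001 and increases linearly to ~0.02 over $K = 1000$ steps; with a shorter schedule such as $K = 100$, the $\beta_k$ must be larger (or follow a cosine schedule) for $\bar\alpha_K$ to reach approximately zero. Think of it as the per-step "corruption rate."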
$\alpha_k = 1 - \beta_k$ — the signal retention at step $k$. The fraction of signal surviving one noise step. Close to 1.0 for early steps (mostly signal preserved), smaller for later steps.
$\bar\alpha_k = \prod_{i=1}^{k} \alpha_i$ — the cumulative signal retention. After $k$ noise steps, this fraction of the original signal survives. $\bar\alpha_k \approx 1.0$ for small $k$ (mostly signal), $\bar\alpha_k \approx 0$ for large $k$ (mostly noise). This is what lets you jump directly to any noise level without iterating.
$\sqrt{\bar\alpha_k}$ — the scaling factor on the clean signal. Shrinks the original action toward zero as $k$ grows.
$\sqrt{1 - \bar\alpha_k}$ — the scaling factor on the noise. Grows toward 1.0 as $k$ increases, so at the final step the sample is nearly pure noise.
$\epsilon \sim \mathcal{N}(0, I)$ — standard Gaussian noise, sampled fresh for each training example. Same shape as the action chunk ($H \times D$).
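To make the symbols above concrete, here is a minimal sketch that builds $\bar\alpha_k$ for a linear $\beta$ schedule and jumps straight to a chosen noise level. The helper names `linear_alpha_bar` and `q_sample`, the choice $K = 1000$, and the toy 3-dim action are assumptions made for illustration.

```python
import math
import random

def linear_alpha_bar(K, beta_start=1e-4, beta_end=0.02):
    """Return [alpha_bar_1, ..., alpha_bar_K] for a linear beta schedule (K >= 2)."""
    alpha_bar, running = [], 1.0
    for i in range(K):
        beta = beta_start + (beta_end - beta_start) * i / (K - 1)
        running *= 1.0 - beta          # alpha_bar_k = prod_i (1 - beta_i)
        alpha_bar.append(running)
    return alpha_bar

def q_sample(a0, alpha_bar_k, rng):
    """Closed-form forward sample: sqrt(ab) * a0 + sqrt(1 - ab) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in a0]
    s, n = math.sqrt(alpha_bar_k), math.sqrt(1.0 - alpha_bar_k)
    return [s * x + n * e for x, e in zip(a0, eps)], eps

alpha_bar = linear_alpha_bar(K=1000)          # assumed schedule length
rng = random.Random(0)
a_clean = [0.5, -0.2, 0.1]                    # toy action, not real data
noised, eps = q_sample(a_clean, alpha_bar[499], rng)
```

With these endpoints and $K = 1000$, the final $\bar\alpha_K$ is close to zero, so the last sample is essentially pure noise. Reusing the same endpoints with only $K = 100$ steps leaves roughly a third of the signal in place, which is why short schedules need larger $\beta_k$.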
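The model learns to undo this. Specifically, it learns a noise predictor $\epsilon_\theta\big(a^{(k)}_{1:H}, k, o\big)$ trained with the simple DDPM objective:
$$\mathcal{L}(\theta) = \mathbb{E}_{a_0, k, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_k}\, a_0 + \sqrt{1-\bar\alpha_k}\,\epsilon,\; k,\; o\big) \right\|^2\right]$$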
$\mathbb{E}_{a_0, k, \epsilon}$ — the expectation over three random variables: $a_0 \sim \mathcal{D}$ (a clean action chunk from the dataset), $k \sim \text{Uniform}\{1, \ldots, K\}$ (a randomly chosen noise level), and $\epsilon \sim \mathcal{N}(0, I)$ (fresh Gaussian noise). Each training step samples one of each.
$\epsilon$ — the true noise that was added to corrupt $a_0$. This is the training target — the network must learn to predict exactly which noise was injected.
$\epsilon_\theta(\cdot, k, o)$ — the noise predictor (the denoiser network). Takes the noised action chunk, the noise level $k$, and the observation $o$ as input. Outputs a prediction of $\epsilon$. This is the only learned component.
$\sqrt{\bar\alpha_k} a_0 + \sqrt{1 - \bar\alpha_k}\epsilon$ — the noised input, constructed from the forward process. This is $a^{(k)}$ — what the action chunk looks like at noise level $k$.
$\| \cdot \|^2$ — squared L2 norm (MSE). The loss is the squared error between the true noise and the predicted noise. Minimizing this is equivalent, up to a known scale factor of $-1/\sqrt{1-\bar\alpha_k}$, to learning the score function $\nabla_{a^{(k)}} \log p(a^{(k)} \mid o)$.
In code, the entire DDPM training step is four lines:

    noise = torch.randn_like(actions)
    k = torch.randint(0, K, (B,))
    noised = sqrt_alpha_bar[k][:, None, None] * actions + sqrt_one_minus[k][:, None, None] * noise
    loss = F.mse_loss(denoiser(noised, k, obs), noise)

The [:, None, None] indexing broadcasts each example's schedule value over its (H, D) action chunk. The network learns to predict which random noise was added, conditioned on the observation. Everything else (the noise schedule, the DDIM sampler, the EMA) is infrastructure around these four lines.
Derivation: DDPM forward process closed form
Goal. Show that $q(a^{(k)} \mid a^{(0)}) = \mathcal{N}(\sqrt{\bar\alpha_k}\, a^{(0)}, (1-\bar\alpha_k) I)$ — that is, we can jump from the clean sample directly to any noise level $k$ without iterating through intermediate steps.
Step 1. The single-step forward process adds Gaussian noise: $a^{(k)} = \sqrt{\alpha_k}\, a^{(k-1)} + \sqrt{1-\alpha_k}\, \epsilon_k$ where $\epsilon_k \sim \mathcal{N}(0, I)$.
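Step 2. Apply this recursively. At step $k=1$: $a^{(1)} = \sqrt{\alpha_1}\, a^{(0)} + \sqrt{1-\alpha_1}\, \epsilon_1$. At step $k=2$:
$$a^{(2)} = \sqrt{\alpha_2}\, a^{(1)} + \sqrt{1-\alpha_2}\, \epsilon_2 = \sqrt{\alpha_1\alpha_2}\, a^{(0)} + \sqrt{\alpha_2(1-\alpha_1)}\, \epsilon_1 + \sqrt{1-\alpha_2}\, \epsilon_2$$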
Step 3. The sum of two independent Gaussians $\mathcal{N}(0, \sigma_1^2 I)$ and $\mathcal{N}(0, \sigma_2^2 I)$ is $\mathcal{N}(0, (\sigma_1^2 + \sigma_2^2) I)$. The combined noise variance is $\alpha_2(1-\alpha_1) + (1-\alpha_2) = 1 - \alpha_1\alpha_2 = 1 - \bar\alpha_2$.
Step 4. By induction, at step $k$: $a^{(k)} = \sqrt{\bar\alpha_k}\, a^{(0)} + \sqrt{1-\bar\alpha_k}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. This is the closed-form sampling formula that makes training efficient — you sample $k$ uniformly, compute $a^{(k)}$ in one step, and train.
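The induction can also be checked numerically. The sketch below applies the single-step recursion from Step 1 to the signal coefficient and the noise variance, then compares both against the closed form. The function name and the example $\alpha$ values are arbitrary illustrations.

```python
import math

def iterate_forward(alphas):
    """Track signal coefficient and noise variance through the one-step recursion."""
    signal_coef, noise_var, alpha_bar = 1.0, 0.0, 1.0
    for a in alphas:
        signal_coef *= math.sqrt(a)            # sqrt(alpha_k) scales the previous sample
        noise_var = a * noise_var + (1.0 - a)  # old noise shrinks, fresh noise is added
        alpha_bar *= a                         # running product alpha_bar_k
    return signal_coef, noise_var, alpha_bar

signal_coef, noise_var, alpha_bar_k = iterate_forward([0.99, 0.97, 0.95, 0.90])
print(math.isclose(signal_coef, math.sqrt(alpha_bar_k)),
      math.isclose(noise_var, 1.0 - alpha_bar_k))
```

Both comparisons hold for any sequence of $\alpha_k$ in $(0, 1)$, which is exactly the statement of Step 4.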
Three tiny details that matter more than the equation:
$\epsilon$-prediction beats $a_0$-prediction in practice; the gradient signal is better-conditioned across noise levels.
Cosine noise schedule from Improved-DDPM works better than linear for action spaces.
The observation $o$ is a short history — typically two timesteps. Longer history hurts: the policy starts inferring its own past actions and gets causally confused.
Two backbone variants
The denoiser $\epsilon_\theta$ has two canonical implementations.
CNN-based: 1D temporal U-Net
A 1D U-Net over the time axis of the action chunk. Observations are encoded once, broadcast as a conditioning vector, and injected via FiLM layers (feature-wise affine modulation: $h \leftarrow \gamma(o) \odot h + \beta(o)$). The CNN exploits the locality of action sequences and is fast.
FiLM conditioning, unpacked. Feature-wise Linear Modulation is an affine transformation on hidden features, conditioned on an external signal. Given a hidden activation $h \in \mathbb{R}^{C \times T}$ (channels × time), and an observation encoding $o \in \mathbb{R}^d$:
1. Project $o$ to scale and shift parameters: $\gamma = W_\gamma o + b_\gamma$, $\beta = W_\beta o + b_\beta$, both in $\mathbb{R}^C$.
2. Apply channel-wise: $h'_c = \gamma_c \cdot h_c + \beta_c$ for each channel $c$.
This lets the observation multiplicatively gate and additively shift the denoiser's features. It is lightweight (two linear projections), effective (the observation controls the denoiser at every layer), and the reason the CNN variant conditions on observations without cross-attention.
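The two FiLM steps above fit in a few lines. This is a dependency-free sketch using nested lists instead of tensors; the helper names, the toy weights, and the tiny shapes ($C = 2$, $T = 3$, $d = 2$) are all made up for illustration.

```python
def linear(W, b, x):
    """Dense layer y = W x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def film(h, o, W_gamma, b_gamma, W_beta, b_beta):
    """h: C x T features, o: length-d observation encoding. Returns modulated C x T."""
    gamma = linear(W_gamma, b_gamma, o)   # per-channel scale
    beta = linear(W_beta, b_beta, o)      # per-channel shift
    return [[g * v + s for v in row] for row, g, s in zip(h, gamma, beta)]

h = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # C = 2 channels, T = 3 timesteps
o = [1.0, 0.5]                            # d = 2 observation encoding
out = film(
    h, o,
    W_gamma=[[1.0, 0.0], [0.0, 2.0]], b_gamma=[0.0, 0.0],
    W_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.5, 0.0],
)
print(out)  # [[1.5, 2.5, 3.5], [5.0, 6.0, 7.0]]
```

Each channel receives a single scale and a single shift that are shared across all $T$ timesteps, which is why the conditioning cost is independent of the chunk length.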
Transformer-based
Action tokens cross-attend to observation tokens. More expressive, slower, the right pick when the task has long-range structure or the observation is multimodal. Octo and most generalist policies use this variant.
[Figure legend: denoiser (the work), iterative loop, data flow]
Inference: DDIM
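Naive DDPM sampling needs all $K = 100$ denoising steps, which is too slow for a 10Hz control loop. The fix is DDIM sampling: a deterministic (or low-noise) reverse process that reuses the same trained denoiser and works with 10–16 sampling steps at negligible quality loss. With $\eta = 0$, one step is:
$$a^{(k-1)} = \sqrt{\bar\alpha_{k-1}}\,\hat a_0\big(a^{(k)}\big) + \sqrt{1-\bar\alpha_{k-1}}\,\epsilon_\theta\big(a^{(k)}, k, o\big)$$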
$a^{(k-1)}$ — the denoised action at the next (less noisy) level. One step closer to a clean action chunk.
$a^{(k)}$ — the current noisy action we are denoising from.
$\hat a_0(a^{(k)})$ — the implied clean prediction, i.e. "what the model thinks the original clean action was" given the current noisy sample. Computed as $\hat a_0 = (a^{(k)} - \sqrt{1-\bar\alpha_k}\,\epsilon_\theta) / \sqrt{\bar\alpha_k}$.
$\epsilon_\theta(a^{(k)}, k, o)$ — the predicted noise at the current level, from the trained denoiser.
$\eta = 0$ — the stochasticity parameter. $\eta = 0$ makes DDIM fully deterministic (same initial noise $\to$ same output). $\eta = 1$ recovers the original stochastic DDPM. Deterministic sampling is preferred for robotics because it gives consistent behavior.
In code, one DDIM step is two lines of tensor arithmetic:

    a0_hat = (a_k - sqrt_1m_ab[k] * eps_pred) / sqrt_ab[k]
    a_prev = sqrt_ab[k-1] * a0_hat + sqrt_1m_ab[k-1] * eps_pred

These two lines are called 16 times in a loop, from pure noise to a clean action chunk. The entire inference function is ~10 lines: sample noise, loop 16 times calling the denoiser and applying this update, return the result. On an NVIDIA RTX 3090, each call to a small CNN denoiser takes ~2ms, so 16 steps cost ~32ms of inference latency, comfortably under 50ms on a single GPU.
Worked DDIM step. Suppose we are at noise level $k=8$ (of 16 total DDIM steps), with $\bar\alpha_8 = 0.5$ and $\bar\alpha_7 = 0.6$. The current noisy action is $a^{(8)} = [0.42, -0.31, \ldots]$. The denoiser predicts $\epsilon_\theta = [0.15, -0.08, \ldots]$.
First, compute the implied clean action:
$$\hat a_0 = \frac{a^{(8)} - \sqrt{1-\bar\alpha_8}\, \epsilon_\theta}{\sqrt{\bar\alpha_8}} = \frac{[0.42, -0.31] - \sqrt{0.5}\,[0.15, -0.08]}{\sqrt{0.5}}$$
$$= \frac{[0.42 - 0.106, -0.31 + 0.057]}{0.707} = \frac{[0.314, -0.253]}{0.707} = [0.444, -0.358]$$
Then step to level $k=7$:
$$a^{(7)} = \sqrt{0.6}\,[0.444, -0.358] + \sqrt{0.4}\,[0.15, -0.08]$$
$$= [0.344, -0.277] + [0.095, -0.051] = [0.439, -0.328]$$
The action moved slightly toward the predicted clean sample. After 16 such steps from pure noise, we arrive at a clean action chunk.
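The worked step can be reproduced in a few lines. This sketch assumes the deterministic ($\eta = 0$) update and uses plain lists for the first two action dimensions; `ddim_step` is an illustrative name, not a library function.

```python
import math

def ddim_step(a_k, eps_pred, ab_k, ab_prev):
    """One deterministic (eta = 0) DDIM update from noise level k to k-1."""
    a0_hat = [
        (a - math.sqrt(1.0 - ab_k) * e) / math.sqrt(ab_k)
        for a, e in zip(a_k, eps_pred)
    ]
    a_prev = [
        math.sqrt(ab_prev) * x + math.sqrt(1.0 - ab_prev) * e
        for x, e in zip(a0_hat, eps_pred)
    ]
    return a0_hat, a_prev

a0_hat, a_prev = ddim_step([0.42, -0.31], [0.15, -0.08], ab_k=0.5, ab_prev=0.6)
print([round(x, 3) for x in a0_hat], [round(x, 3) for x in a_prev])
# [0.444, -0.358] [0.439, -0.328]
```

In a full sampler this function would sit inside the 16-iteration loop, with `eps_pred` recomputed by the denoiser at every level.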
[Interactive figure: diffusion noise levels. Legend: clean signal, noised signal, ε (noise)]
EMA — the unsung hero
The single most underrated detail of training a diffusion policy is the exponential moving average of the model weights. Maintain a shadow copy $\theta_{\text{EMA}} \leftarrow \tau \theta_{\text{EMA}} + (1-\tau)\theta$ with $\tau \approx 0.9999$. Use the EMA copy for inference. Without it, the model is twitchy and unstable; with it, the same training run produces reliable behavior. The reason is mostly empirical — diffusion losses are noisy across noise levels and the EMA averages over the noise.
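The shadow-copy update is a one-liner per parameter. The sketch below uses a plain dict of floats in place of real model weights; `ema_update` and the parameter names are hypothetical, and real training code would apply the same rule to every tensor after each optimizer step.

```python
def ema_update(ema_params, params, tau=0.9999):
    """In-place EMA: theta_ema <- tau * theta_ema + (1 - tau) * theta."""
    for name, value in params.items():
        ema_params[name] = tau * ema_params[name] + (1.0 - tau) * value
    return ema_params

# Toy "weights": the live model jumps around, the shadow copy barely moves.
ema = {"w": 0.0, "b": 1.0}
for step_params in ({"w": 1.0, "b": 1.0}, {"w": -1.0, "b": 1.0}, {"w": 1.0, "b": 1.0}):
    ema_update(ema, step_params)
```

With $\tau = 0.9999$ the shadow copy averages over roughly the last ten thousand updates, which is why it smooths out step-to-step noise in the live weights.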
Why diffusion beat the alternatives
Multimodal by construction. Different draws of the initial Gaussian sample produce different action chunks; the model never has to commit to a single mode.
Stable training. The DDPM loss is well-conditioned and converges reliably; no GAN instabilities, no mixture-component collapse, no contrastive sampling.
Receding-horizon natural fit. Predicting a chunk is what diffusion does anyway; chunking is free.
Composes with vision encoders. The conditioning interface is a flat vector or token sequence — drop in any encoder.
If you are starting a new manipulation policy in 2026 and have no other constraints, train a transformer Diffusion Policy on a relative-EE action space with a strong vision encoder. It is the highest-floor recipe in the field.
09 Flow matching policies
A simpler, faster cousin of diffusion. The basis of $\pi_0$.
Flow matching trains a velocity field that transports a simple base distribution (Gaussian) to the data distribution along straight-ish paths in time. It is mathematically simpler than diffusion and empirically faster to sample. Lipman et al. (2023) introduced conditional flow matching; Physical Intelligence's $\pi_0$ (2024) is the most prominent robotics application.
The objective
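Define a continuous time $t \in [0, 1]$. Pair a noise sample $a_0 \sim \mathcal{N}(0, I)$ with a data sample $a_1$ and define the linear interpolant $a_t = (1-t) a_0 + t \, a_1$. Note the flipped convention relative to the diffusion section: here the subscript 0 denotes noise and 1 denotes data. The "true" velocity along this path is $a_1 - a_0$. Train a network $v_\theta(a_t, t, o)$ to predict it:
$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t \sim U(0,1),\; a_0 \sim \mathcal{N}(0, I),\; a_1 \sim \mathcal{D}}\left[\left\| v_\theta\big((1-t)\,a_0 + t\,a_1,\; t,\; o\big) - (a_1 - a_0) \right\|^2\right]$$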
$t \sim U(0,1)$ — a continuous time variable sampled uniformly from $[0, 1]$. At $t=0$ we are at the noise distribution; at $t=1$ we are at the data distribution. No discrete schedule to tune — just a uniform random number.
$a_0 \sim \mathcal{N}(0, I)$ — a noise sample from the base distribution. This is the starting point of the flow — pure Gaussian noise, same shape as the action chunk.
$a_1 \sim \mathcal{D}$ — a data sample (clean action chunk from the demonstration dataset). This is the endpoint of the flow.
$(1-t)a_0 + t\, a_1$ — the linear interpolant between noise and data. At $t=0$ it is pure noise $a_0$; at $t=1$ it is the clean action $a_1$. The flow path is a straight line in action space.
$v_\theta(\cdot, t, o)$ — the learned velocity field. A neural network that predicts the velocity (direction and magnitude of motion) at any point along the flow, conditioned on the observation $o$.
$(a_1 - a_0)$ — the target velocity. The true velocity along the linear interpolant path. It is constant in time — the "displacement" from noise to data. The network learns to predict this displacement.
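Putting the symbols above together, one training example is built as follows. This is a dependency-free sketch: `fm_training_pair` and `fm_loss` are illustrative names, the clean chunk is a toy vector, and a real implementation would pass `a_t`, `t`, and the observation through the velocity network to obtain `v_pred`.

```python
import random

def fm_training_pair(a1, rng):
    """Build one flow-matching regression example from a clean sample a1."""
    t = rng.random()                                       # t ~ U(0, 1)
    a0 = [rng.gauss(0.0, 1.0) for _ in a1]                 # noise endpoint
    a_t = [(1.0 - t) * n + t * d for n, d in zip(a0, a1)]  # linear interpolant
    target = [d - n for n, d in zip(a0, a1)]               # velocity a1 - a0
    return t, a_t, target

def fm_loss(v_pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - y) ** 2 for p, y in zip(v_pred, target)) / len(target)

rng = random.Random(0)
a1 = [0.1, -0.2, 0.3]                 # toy clean action, not real data
t, a_t, target = fm_training_pair(a1, rng)
loss_if_perfect = fm_loss(target, target)
```

A useful sanity check is that following the target velocity for the remaining time $1 - t$ lands exactly on the data sample: $a_t + (1 - t)(a_1 - a_0) = a_1$.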
Derivation: the flow matching ODE
Why the linear interpolant works. In conditional flow matching, we define a per-sample conditional probability path $p_t(a \mid a_1)$ that starts at $\mathcal{N}(0, I)$ and ends at a Dirac delta at $a_1$. The linear interpolant $a_t = (1-t)a_0 + t\, a_1$ does exactly this.
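The marginal velocity field at time $t$ and position $a_t$ is defined by the ODE $\frac{da_t}{dt} = u_t(a_t)$ where $u_t$ is the velocity that pushes probability mass from the prior to the data distribution. For the linear interpolant, the velocity along a single conditional path is:
$$u_t(a_t \mid a_0, a_1) = \frac{d}{dt}\big[(1-t)\,a_0 + t\,a_1\big] = a_1 - a_0$$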
This is constant in time along each conditional path — the "velocity" is always just the displacement from noise to data. The flow matching loss trains $v_\theta$ to match this velocity, and at inference we integrate the learned velocity field from $t=0$ to $t=1$.
The key insight: unlike diffusion, the paths are straight lines. The ODE integrator can follow them with large steps (fewer function evaluations). In practice, 5–10 Euler steps suffice, compared to 16+ DDIM steps for diffusion.
DDIM vs flow matching: head-to-head
| Property | Diffusion (DDIM) | Flow Matching |
| --- | --- | --- |
| Path shape | Curved (through noise schedule) | Straight lines (linear interpolant) |
| Training target | Noise $\epsilon$ or clean $a_0$ | Velocity $a_1 - a_0$ |
| Time parameterization | Discrete $k \in \{1, \ldots, K\}$ + schedule | Continuous $t \in [0,1]$ + uniform sampling |
| Inference sampler | DDIM deterministic reverse | ODE Euler integration |
| Typical steps | 16 | 5–10 |
| Schedule tuning | Cosine / linear / learned $\beta_k$ | None (uniform $t$) |
| Code complexity | ~100 lines for schedule + sampler | ~40 lines total |
| Quality | State-of-the-art | Comparable to slightly better |
Inference
Sample $a_0 \sim \mathcal{N}(0, I)$, then integrate the ODE $\frac{d a_t}{dt} = v_\theta(a_t, t, o)$ from $t=0$ to $t=1$ using Euler with 5–10 steps. The ODE is not stochastic — there's no noise at inference, just integration of a learned vector field. This is part of why fewer steps suffice.
Euler integration at inference
$$ a_{t + \Delta t} = a_t + \Delta t \cdot v_\theta(a_t, t, o), \qquad \Delta t = \tfrac{1}{N}, \; N \in [5, 10]$$
$a_t$ — the current point in action space at flow time $t$. Starts as Gaussian noise ($a_0$) and progressively becomes a valid action chunk.
$v_\theta(a_t, t, o)$ — the learned velocity at the current point, time, and observation. One forward pass of the network.
$\Delta t = 1/N$ — the step size. With $N = 5$ Euler steps, $\Delta t = 0.2$; with $N = 10$, $\Delta t = 0.1$. Larger steps are faster but less accurate. Because the flow paths are nearly straight, even 5 large steps work well.
$N$ — the number of integration steps. Each step costs one network evaluation. Typical range is 5–10, compared to 16+ for DDIM — this is why flow matching is faster at inference.
Worked example: flow matching inference with 5 Euler steps. Action dimension = 7 (6-DoF EE + gripper). $\Delta t = 1/5 = 0.2$.
Step 0 ($t = 0$): Sample $a_0 \sim \mathcal{N}(0, I)$, e.g. $a_0 = [0.82, -0.45, 1.13, -0.67, 0.29, -0.91, 0.53]$.
Step 1 ($t = 0$): Query the velocity network: $v_\theta(a_0, 0, o) = [0.41, 0.82, -1.35, 1.20, -0.15, 0.65, -0.28]$.
Update: $a_{0.2} = a_0 + 0.2 \times v_\theta = [0.82 + 0.082, -0.45 + 0.164, \ldots] = [0.902, -0.286, 0.86, -0.43, 0.26, -0.78, 0.47]$.
Steps 2–4: Repeat with $t = 0.2, 0.4, 0.6$. Each step, the velocity field steers $a_t$ toward the data manifold.
Step 5 ($t = 0.8 \to 1.0$): Final update gives $a_1 = [0.12, -0.03, 0.08, 0.15, -0.02, 0.04, 0.85]$ — a valid action chunk that started as pure noise.
Total compute: 5 network forward passes. Compare to DDIM: 16 passes. The flow matching trajectory is nearly straight (the velocity barely changes between steps), which is why fewer steps suffice.
Why fewer steps suffice: path curvature
The core reason flow matching needs fewer inference steps than DDPM is path geometry. In DDPM, the sample follows a curved trajectory through action space: the noise schedule $\{\beta_k\}$ defines a non-linear path, and the reverse process must trace that curve backward. Curved paths are hard to follow with large step sizes; the integrator overshoots on bends.
Flow matching defines paths as straight lines: $a_t = (1-t)a_0 + ta_1$. The velocity along a straight line is constant — it does not change between $t=0.1$ and $t=0.9$. An Euler integrator with step size $\Delta t = 0.2$ introduces zero discretization error on a truly straight path. In practice, the learned velocity field is not perfectly constant (the marginal velocity field averages over many conditional paths), so there is some error — but far less than DDPM's curved paths. This is why 5 Euler steps match 16 DDIM steps.
Straight vs curved paths, precisely. For a conditional path $a_t = (1-t)a_0 + ta_1$, the velocity is $\dot{a}_t = a_1 - a_0$ — a constant vector. The acceleration is $\ddot{a}_t = 0$. Euler integration is exact when acceleration is zero: $a_{t+\Delta t} = a_t + \Delta t \cdot \dot{a}_t$ recovers the true $a_{t + \Delta t}$ with no truncation error.
For DDPM, the equivalent reverse path is $a^{(k-1)} = f(a^{(k)}, \epsilon_\theta)$ where $f$ is a nonlinear function of the noise schedule. The "acceleration" (curvature of the reverse path) is non-zero, so each Euler-like step accumulates truncation error proportional to $\ddot{a} \cdot (\Delta t)^2$. More steps are needed to keep total error bounded.
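A toy comparison makes the truncation-error argument tangible. The sketch below integrates two scalar ODEs with 5 Euler steps: a constant-velocity (straight) path, and $\dot a = -a$ as a stand-in for a curved path. The curved ODE is an assumption chosen only because its exact solution is known; it is not the DDPM reverse process.

```python
import math

def euler(velocity, a_start, n_steps):
    """Fixed-step Euler integration of da/dt = velocity(a, t) from t = 0 to t = 1."""
    a, t, dt = a_start, 0.0, 1.0 / n_steps
    for _ in range(n_steps):
        a = a + dt * velocity(a, t)
        t += dt
    return a

a0, a1 = 0.8, -0.3

# Straight path: constant velocity a1 - a0, so Euler lands on a1 (up to rounding).
straight = euler(lambda a, t: a1 - a0, a0, n_steps=5)
straight_err = abs(straight - a1)

# Curved toy path: da/dt = -a has exact solution a0 * exp(-1) at t = 1.
curved = euler(lambda a, t: -a, a0, n_steps=5)
curved_err = abs(curved - a0 * math.exp(-1.0))
```

The straight path is recovered to floating-point precision, while the curved toy path carries a visible error at the same step count. Increasing `n_steps` shrinks `curved_err`, which mirrors why curved reverse paths need more sampler steps.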
    import torch

    @torch.no_grad()
    def flow_matching_sample(v_theta, obs, action_shape, n_steps=5):
        """Generate an action chunk via Euler integration.

        Returns: (1, H, act_dim) action chunk.
        """
        dt = 1.0 / n_steps
        # Start from noise
        a_t = torch.randn(1, *action_shape, device=obs.device)
        t = 0.0
        for _ in range(n_steps):
            # One network evaluation per step
            v = v_theta(a_t, torch.tensor([t], device=obs.device), obs)
            a_t = a_t + dt * v  # Euler step
            t += dt
        return a_t  # a_1 ~ clean action chunk
Compare this to DDPM: the training loop is shorter (no $\bar\alpha_k$ schedule, no $\epsilon$-vs-$x_0$ choice), and the inference loop is a plain Euler integrator (no $\hat{a}_0$ reconstruction, no noise-schedule indexing). This code-level simplicity is not cosmetic: it translates to fewer hyperparameters, fewer bugs, and faster iteration.
Why it matters
Flow matching has three advantages over diffusion at the level of practical robot policies:
Fewer sampling steps — 5–10 vs 16+ — at comparable quality. Lower latency.
Cleaner conditioning — the model is a single network that takes time as a continuous input, no noise-schedule gymnastics.
Simpler loss — no $\sqrt{\bar\alpha_k}$ algebra, no $\epsilon$-vs-$x_0$ choice, no schedule tuning beyond uniform $t$ sampling.
$\pi_0$ specifics
$\pi_0$ couples a pretrained vision-language backbone (PaliGemma, ~3B parameters) with an "action expert": a small transformer that generates actions via flow matching, conditioned on the VLM's hidden states through attention. The action expert is on the order of 300M parameters. The full model produces 50Hz control on bimanual platforms with action chunks of ~50 steps integrated over 10 Euler steps. The follow-up $\pi_{0.5}$ added open-vocabulary transfer and broader cross-embodiment generalization.
The architectural commitment is worth naming explicitly: the VLM does perception and high-level reasoning; the small action expert does motor control. This split is becoming standard — see also Helix's "system 1 / system 2" framing.
From DDPM to flow matching: the conceptual bridge
It helps to see flow matching as a simplification of diffusion, not a replacement. Both methods learn a map from noise to data. The difference is the path taken:
The shared structure. Both DDPM and flow matching define a family of distributions $p_t$ indexed by time, where $p_0$ is a simple base distribution (Gaussian) and $p_1$ is the data distribution. Both train a neural network to approximate a vector field that transforms samples from $p_0$ to samples from $p_1$.
In DDPM, the vector field is the score function $\nabla_a \log p_t(a)$, and the transformation follows a stochastic differential equation (SDE) with both drift and diffusion terms. The noise schedule $\{\beta_k\}$ parameterizes the SDE.
In flow matching, the vector field is a velocity field $v_t(a)$, and the transformation follows an ordinary differential equation (ODE) with drift only — no stochasticity. The linear interpolant $a_t = (1-t)a_0 + ta_1$ defines the path.
The key simplification: the ODE formulation has no noise injection at inference time, which means the trajectory from noise to data is deterministic given the initial sample. Deterministic trajectories are easier to integrate numerically (ODE solvers are simpler and faster than SDE solvers), which is why flow matching needs fewer steps.
There is a deeper mathematical connection: Song et al. (2021) showed that every diffusion SDE has an equivalent probability flow ODE that generates the same marginal distributions. Flow matching can be understood as training the velocity field of this probability flow ODE directly, bypassing the SDE formulation entirely. The result is the same distribution; the path to get there is straighter.
Practical hyperparameters
Flow matching has fewer hyperparameters than diffusion, but the ones it has still matter:
Number of Euler steps $N$. Start with 10, reduce to 5 if latency is tight. Below 3, quality degrades noticeably. Above 15, returns diminish.
Network architecture. Identical to diffusion — either a 1D temporal U-Net with FiLM conditioning or a transformer with cross-attention to observation tokens. The velocity network has the same input/output shape as the noise predictor.
Time conditioning. Sinusoidal positional encoding of $t$ (continuous, not discrete), injected via FiLM or additive embedding. The model must know where it is on the $[0,1]$ timeline.
EMA. Still essential. Use $\tau = 0.9999$ as with diffusion. The EMA model is used for inference.
Observation history length. Two timesteps, same as diffusion. Longer history adds noise more than signal for flow matching policies.
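For the time-conditioning item above, a sinusoidal embedding of continuous $t$ might look like the sketch below. The embedding width, the `max_period` constant, and the `scale` factor that stretches $t \in [0, 1]$ before encoding are all assumptions for illustration; real implementations differ in these details.

```python
import math

def time_embedding(t, dim=8, max_period=10000.0, scale=1000.0):
    """Sinusoidal features of a continuous flow time t in [0, 1].

    dim, max_period and scale are illustrative defaults, not values from any paper.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    angles = [scale * t * f for f in freqs]
    return [math.sin(x) for x in angles] + [math.cos(x) for x in angles]

emb_start = time_embedding(0.0)   # t = 0: pure noise end of the flow
emb_mid = time_embedding(0.5)
```

The resulting vector is what would be fed to the FiLM projection or added to the token embeddings, so the network can tell $t = 0.1$ from $t = 0.9$.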
Worked example: one Euler ODE step, hand-computed. We are at flow time $t = 0.0$ with noise sample $a_0 = [-0.3, 0.1, 0.5, -0.2, 0.8, -0.1, 0.4]$ (7-DoF action). The velocity network predicts $v_\theta(a_0, 0, o) = [0.5, -0.2, -0.6, 0.4, -0.9, 0.3, 0.5]$. Step size $\Delta t = 0.2$.
$$a_{0.2} = a_0 + 0.2 \times v_\theta = [-0.3 + 0.10, \; 0.1 - 0.04, \; 0.5 - 0.12, \; -0.2 + 0.08, \; 0.8 - 0.18, \; -0.1 + 0.06, \; 0.4 + 0.10]$$
$$= [-0.20, \; 0.06, \; 0.38, \; -0.12, \; 0.62, \; -0.04, \; 0.50]$$
The action moved by $\Delta t = 20\%$ of the predicted velocity, toward the data distribution. After 4 more such steps, $a_{1.0}$ will be approximately a valid action. The key observation: the velocity $v_\theta$ is a displacement estimate that predicts where the data sample lies relative to the noise sample the path started from. Each Euler step covers a fraction $\Delta t$ of that displacement.
Flow matching vs diffusion: when to choose which
In practice, the choice between diffusion and flow matching for a robot policy is not about quality — they are comparable. It is about engineering overhead and latency budget:
Choose flow matching when: you are building a new policy from scratch, latency is tight (<30ms), or you want minimal hyperparameter tuning. The simpler codebase means faster iteration.
Choose diffusion when: you are building on existing DDPM infrastructure (e.g., fine-tuning from a pretrained diffusion checkpoint), or you need compatibility with classifier-free guidance or other diffusion-specific conditioning techniques.
Choose FAST tokens when: you are using a VLM backbone and want to share the autoregressive decoding infrastructure. The FAST tokenizer converts continuous actions to tokens that the LLM can predict natively.
The trend is clear: new projects in 2026 default to flow matching. Diffusion is the incumbent with a larger ecosystem. FAST is the disruptor for VLAs specifically. All three produce comparable action quality; the differences are in engineering convenience, inference speed, and composability with pretrained language models.
Flow matching is not a generational leap over diffusion in capability — it's a generational leap in cleanliness. The math is shorter, the training loop is shorter, and the sampler is shorter. That matters at scale.
10 Tokenized actions
When the action head is just another decoder of a transformer that already exists.
Two pressures push toward representing actions as tokens. First, you want to share weights with a pretrained language or vision-language model; the simplest way is to put actions in the model's own vocabulary. Second, transformers handle categorical sequences brilliantly, and decades of NLP optimization apply for free.
Per-dimension binning (RT-1 family)
Each action dimension is binned into 256 buckets uniformly across its observed range. A 7-dof end-effector action becomes seven categorical predictions per timestep. Training is cross-entropy; inference is argmax (or sampled, for diversity).
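A minimal sketch of uniform per-dimension binning, assuming each dimension has a known range `[lo, hi]` and that decoding returns the bin centre. The function names and the example ranges are hypothetical.

```python
def encode_bin(x, lo, hi, n_bins=256):
    """Map a continuous value to an integer token in [0, n_bins - 1]."""
    x = min(max(x, lo), hi)                              # clip to the observed range
    return min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)

def decode_bin(token, lo, hi, n_bins=256):
    """Map a token back to the centre of its bin."""
    return lo + (token + 0.5) * (hi - lo) / n_bins

# A 7-DoF action becomes seven tokens, one per dimension (toy values and ranges).
action = [0.02, -0.01, 0.05, 0.0, 0.1, -0.2, 1.0]
ranges = [(-0.1, 0.1)] * 3 + [(-0.5, 0.5)] * 3 + [(0.0, 1.0)]
tokens = [encode_bin(x, lo, hi) for x, (lo, hi) in zip(action, ranges)]
decoded = [decode_bin(k, lo, hi) for k, (lo, hi) in zip(tokens, ranges)]
```

The round-trip error is at most half a bin width per dimension, which is the fidelity cap that 256-bin discretization imposes.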
RT-1
Image tokens come from EfficientNet + TokenLearner (a small attention module that distills $H \times W$ patch tokens into $\sim$8 informative tokens). Language is encoded by Universal Sentence Encoder. A FiLM layer fuses language with image features. A transformer decoder predicts action token sequences. 35M parameters; trained on 130k Google demonstrations across 700+ tasks.
RT-2
The shift was simple and important: take a pre-trained vision-language model (PaLI-X, PaLM-E), overload some of its existing vocab tokens to mean action bins, and co-finetune on web data + robot data. The model can now answer "what should the robot do" the same way it answers "what's in the image" — by emitting tokens. This is the genealogical root of every modern VLA.
OpenVLA
An open re-implementation of the RT-2 idea. Llama-2 7B + DINOv2 + SigLIP vision, action tokens in the vocabulary, trained on a 970k-trajectory subset of Open X-Embodiment. It works because the recipe was always more important than the secret sauce.
VQ-BeT
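The other path is a learned action codebook. In Stage 1, VQ-BeT trains a VQ-VAE over short action chunks (say, 5 steps), producing a codebook of ~16 codes per chunk:
$$\mathcal{L}_{\text{VQ}} = \|a - \text{Dec}(e_z)\|^2 + \|\text{sg}[h] - e_z\|^2 + \beta\,\|h - \text{sg}[e_z]\|^2$$
In Stage 2, a transformer is trained to predict code indices autoregressively, conditioned on observations:
$$\mathcal{L}_{\text{policy}} = -\sum_t \log p_\theta(z_t \mid o_t, z_{<t})$$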
$a$ — a short action chunk (e.g., 5 timesteps × 7 DoF = 35-dim vector) from the demonstration dataset.
$h$ — the encoder output: a continuous embedding of the action chunk, before quantization.
$e_z$ — the nearest codebook vector to $h$. The VQ-VAE codebook has $N_{\text{codes}}$ entries (e.g., 512); $e_z$ is the one closest to $h$ in Euclidean distance.
$\text{Dec}(e_z)$ — the decoder that reconstructs the action chunk from the discrete code. The first term $\|a - \text{Dec}(e_z)\|^2$ is the reconstruction loss.
$\text{sg}[\cdot]$ — the stop-gradient operator. Blocks gradients from flowing through its argument. This is needed because the argmin (nearest-neighbor lookup) is not differentiable.
$\|\text{sg}[h] - e_z\|^2$ — the codebook loss: moves the codebook vectors toward the encoder outputs (updates codebook only).
$\beta \|h - \text{sg}[e_z]\|^2$ — the commitment loss: moves encoder outputs toward their assigned codebook vectors (updates encoder only). $\beta \approx 0.25$ is typical.
$z_t$ — the discrete code index for the action chunk at timestep $t$. In Stage 2, a transformer predicts these indices autoregressively.
$p_\theta(z_t \mid o_t, z_{<t})$ — the policy's predicted probability of code $z_t$ given the observation and previously predicted codes. Trained with cross-entropy.
In code: the VQ loss (codebook plus commitment terms) is vq_loss = F.mse_loss(z_e.detach(), e) + 0.25 * F.mse_loss(z_e, e.detach()), where z_e is the encoder output and e is the nearest codebook vector. The first term moves the codebook toward the encoder (stop gradient on encoder); the second moves the encoder toward the codebook (stop gradient on codebook). Stage 2 is just F.cross_entropy(logits, code_indices), standard next-token prediction over discrete codes. In practice, VQ-BeT requires training two separate models sequentially, which is why it is less popular than diffusion despite often matching it on quality.
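The nearest-neighbour lookup that the stop-gradient works around can be sketched without any autograd. The code below only evaluates the loss values; the stop-gradient changes which parameters receive gradients, not the numbers, so it appears here only as comments. All names, the two-entry codebook, and the identity decoder are toy assumptions.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(h, codebook):
    """Return (index, vector) of the codebook entry nearest to encoder output h."""
    z = min(range(len(codebook)), key=lambda i: sq_dist(h, codebook[i]))
    return z, codebook[z]

def vq_loss_value(a, h, codebook, decode, beta=0.25):
    z, e_z = quantize(h, codebook)
    recon = sq_dist(a, decode(e_z))        # reconstruction term
    codebook_term = sq_dist(h, e_z)        # with sg[h]: gradient reaches only e_z
    commitment = beta * sq_dist(h, e_z)    # with sg[e_z]: gradient reaches only h
    return z, recon + codebook_term + commitment

codebook = [[0.0, 0.0], [1.0, 1.0]]        # toy 2-entry codebook
z, loss = vq_loss_value(a=[1.0, 1.0], h=[0.9, 0.8], codebook=codebook,
                        decode=lambda e: e)  # identity decoder, for illustration
```

The integer `z` is what Stage 2 learns to predict, and `decode(codebook[z])` turns a predicted index back into an action chunk.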
The advantages over per-dimension binning: the codebook captures cross-dimension structure (an entire pre-grasp posture is one code) and each "action" the policy commits to is a coherent multi-step motion, not seven independent bins.
Worked example: VQ-BeT codebook. After Stage 1 training on 10,000 action chunks (each 5 steps × 7 DOF = 35 dimensions), we have a codebook of 512 codes. Each code is a 35-dim vector representing a common short motion pattern. Example codes:
Code 42: $[0.01, 0.02, -0.01, 0.00, 0.01, 0.02, 1.0, \ldots]$ — small approach + gripper close. This code fires ~8% of the time. It's a "grasp initiation" pattern.
Code 187: $[0.05, 0.0, 0.03, 0.0, -0.01, 0.0, 0.0, \ldots]$ — smooth reach forward. This code represents a "reaching" motion.
Code 3: $[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, \ldots]$ — hold still (idle). High frequency in datasets with pauses.
At Stage 2, the policy sees an observation and predicts the next code index autoregressively. If it predicts code 187 (reach) followed by code 42 (grasp), the decoded action sequence is a coherent reach-and-grasp motion. This is much better than independently binning 35 numbers — the codebook enforces cross-dimension and cross-timestep coherence.
FAST — frequency-space tokenization
The 2025 advance that made autoregressive VLAs competitive with diffusion. The observation behind FAST (Pertsch et al., 2025) is that per-dimension binning fails on high-frequency dexterous tasks because adjacent timesteps in an action chunk are highly correlated — binning them independently produces enormous, redundant token sequences that the autoregressive model can't predict accurately. The fix borrows from JPEG.
[Figure legend: compression core, property, result]
Four steps:
Normalize the action chunk (subtract mean, divide by 99th-percentile range).
DCT each action dimension along the time axis. The discrete cosine transform concentrates signal energy in the low-frequency coefficients — same reason it's the heart of JPEG.
Quantize via scale-and-round, with a hyperparameter trading lossiness for compression. Most of the high-frequency coefficients round to zero and disappear.
BPE the resulting integer sequence using byte-pair encoding to losslessly compress repeated patterns. The output token IDs slot into the least-used positions in the LLM vocabulary.
The training loss is plain next-token cross-entropy. At inference, generate tokens autoregressively, then run the inverse pipeline (un-BPE → de-quantize → inverse DCT) as a post-processing step. The pipeline is invertible up to the rounding error introduced by quantization, the LLM machinery is unchanged, and the released FAST+ tokenizer is universal across embodiments: trained on 1M trajectories, it works zero-shot on new robots.
The empirical headline: π₀-FAST matches diffusion-π₀ on quality while training 5× faster. The result reframes the diffusion-vs-tokens debate. With FAST, autoregressive VLAs are no longer the speed-vs-fidelity compromise — they are competitive on both axes.
Worked example: FAST compression. A 7-DOF action chunk of 50 timesteps:
Naive tokenization: 7 dimensions × 50 timesteps = 350 per-dimension bins. Each is one token. Total: 350 tokens to generate autoregressively. At 20ms/token (for a 3B model), that's 7 seconds — far too slow for real-time control.
FAST pipeline:
1. Normalize the 7×50 matrix.
2. DCT per dimension: most energy in the first 8–12 frequency coefficients. The remaining 38+ coefficients are near-zero.
3. Quantize: multiply by a scale factor (e.g., 32), round to integers. High-frequency zeros disappear. From 350 real numbers down to ~80 non-zero integers.
4. BPE: column-flatten and compress repeated patterns. 80 integers → ~30 BPE tokens.
Result: 30 tokens instead of 350. At 20ms/token: 0.6s instead of 7s. The inverse pipeline (un-BPE, de-quantize, inverse DCT) runs in <1ms on CPU. The reconstruction error is ~2% of the action range — below the noise floor of teleop data.
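Steps 2 and 3 of the pipeline (DCT, then scale-and-round) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the released FAST tokenizer: the input chunk is synthetic sine data, normalization is assumed already done, the BPE stage is omitted, and `dct_matrix`, `fast_encode`, and `fast_decode` are hypothetical names. The scale factor 32 is taken from the example above.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D: coeffs = D @ signal, signal = D.T @ coeffs."""
    k = np.arange(n)[:, None]          # frequency index
    t = np.arange(n)[None, :]          # time index
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)            # DC row scaling makes D orthonormal
    return D

def fast_encode(chunk, scale=32.0):
    """chunk: (T, dims) normalized actions -> integer DCT coefficients (T, dims)."""
    D = dct_matrix(chunk.shape[0])
    return np.round(scale * (D @ chunk)).astype(int)

def fast_decode(q, scale=32.0):
    """Inverse of fast_encode, exact up to the rounding error."""
    D = dct_matrix(q.shape[0])
    return D.T @ (q / scale)

# Synthetic smooth 50-step, 7-DOF chunk (hypothetical data, roughly in [-1, 1]).
T, DIMS = 50, 7
time = np.linspace(0.0, 1.0, T)[:, None]
phases = np.arange(DIMS)[None, :]
chunk = 0.8 * np.sin(2 * np.pi * time + 0.5 * phases)

q = fast_encode(chunk)
recon = fast_decode(q)
nonzero = int(np.count_nonzero(q))
max_err = float(np.max(np.abs(recon - chunk)))
print(f"{nonzero} non-zero of {q.size} coefficients, max error {max_err:.4f}")
```

Lowering `scale` zeroes more coefficients and raises reconstruction error; that single number is the lossiness-versus-compression knob the text refers to.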
The trade-off, restated
Token-based heads are cheap at inference (one transformer forward, no iterative sampling) and they reuse pretrained weights for free. They are less expressive than diffusion or flow matching for fine continuous control — 256-bin discretization caps fidelity, and per-dimension factorization throws away within-step correlation. Empirically: tokens win when scale and multitask transfer dominate; diffusion / flow win when single-task fine motor control is the bottleneck. The 2026 generalist policy stack often combines them: a VLM backbone with tokens for routing, a flow-matching head for execution.
11 UMI and the data shift
The most important paper of 2024 is not about a model. It is about a stick.
Universal Manipulation Interface (Chi et al., 2024) is a handheld parallel-jaw gripper with a GoPro camera, two side mirrors, fiducial markers on the fingertips, and an IMU. A human picks it up and performs the task. Software extracts the 6-DoF gripper trajectory from visual SLAM and the gripper width from the fiducials; the resulting (image, EE-pose, gripper-width) trajectory is used to train a Diffusion Policy, which is then transferred to a real robot with the same parallel-jaw end-effector.
UMI is not a new architecture. The policy on top is vanilla Diffusion Policy. The contribution is the data layer — and the contribution is large enough to reshape the field.
Why this works
Embodiment is the gripper, not the arm. If the policy outputs relative EE poses and gripper width, the body that holds the gripper does not need to match between collection and deployment. A human's wrist is a perfectly good "robot arm" for data purposes.
Mirrors give multi-view from one camera. The fisheye GoPro plus side mirrors yields three pseudo-views in a single frame. The policy gets multi-camera robustness from a single sensor.
SLAM gives proprioception. No motion-capture rig, no instrumented environment. The trajectory is recovered from the camera's own motion.
Latency-matched action representation. UMI shifts the predicted action sequence forward in time to compensate for robot actuation delay, so a policy trained on instantaneous-human-motion data still works on a robot with $\sim$200ms latency.
The policy stack
Two-step observation history of (RGB, EE pose, gripper width).
CLIP-pretrained ViT vision encoder; the EE-pose history is an MLP-encoded vector token.
The UMI gripper is deliberately low-tech. A 3D-printed parallel-jaw gripper body, a GoPro Hero 10 in a fisheye housing, two planar mirrors angled at ~45° on each side, an ArUco fiducial sticker on each fingertip, and an optional IMU for gravity-aligned orientation. Total hardware cost: under $200. The human picks up this device like a pair of tongs and performs the task naturally — no robot arm, no teleoperation rig, no motion-capture suit.
The key physical insight: the gripper is the embodiment that matters. A parallel-jaw gripper has one degree of freedom (open/close width) plus 6-DoF end-effector pose. Whether that gripper is held by a human hand or mounted on a Franka, a UR5, or a Sawyer does not change the gripper's interaction with the object. By decoupling data collection from the robot, UMI reduces the cost of one demonstration to approximately 15 seconds of human effort, with no marginal hardware cost.
Why relative EE actions are essential
The UMI gripper has no absolute reference frame. It does not know where it is in the room, only where it is relative to where it just was. This forces the entire pipeline to operate in relative end-effector coordinates: the action at time $t$ is $\Delta T_t = T_{t+1} \cdot T_t^{-1}$, a relative SE(3) transform. This is a feature, not a limitation:
Robot-agnostic. The same relative-EE actions deploy on any robot that has a parallel-jaw gripper and an operational-space controller.
Translation-invariant. The policy does not memorize workspace positions. It learns motions relative to the current gripper pose, which generalizes across table heights, object placements, and starting configurations.
No calibration. There is no camera-to-robot-base transform to estimate. The SLAM trajectory is in the camera's own coordinate frame, and relative actions cancel the frame origin.
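The frame-cancellation claim can be checked numerically. The sketch below assumes each $T$ is a 4×4 world-from-gripper pose matrix and composes the gripper-frame delta as `inv(T_t) @ T_next`; the order of the product depends on how the pose matrices are defined, so treat that convention as an assumption of this example. `pose` and `relative_action` are hypothetical helpers.

```python
import numpy as np

def pose(yaw, xyz):
    """Build a 4x4 world-from-gripper pose: rotation about z plus a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = xyz
    return T

def relative_action(T_t, T_next):
    """Motion from step t to t+1, expressed in the gripper frame at step t."""
    return np.linalg.inv(T_t) @ T_next

T0 = pose(0.10, [0.30, 0.00, 0.20])
T1 = pose(0.15, [0.32, 0.01, 0.20])
delta = relative_action(T0, T1)

# Re-express both poses in a different, arbitrary world frame G.
G = pose(1.2, [5.0, -3.0, 0.7])
delta_moved = relative_action(G @ T0, G @ T1)

print(np.allclose(delta, delta_moved))  # the world frame cancels out
```

Since `inv(G @ T0) @ (G @ T1)` reduces algebraically to `inv(T0) @ T1`, the SLAM map origin never enters the action, which is why no camera-to-base calibration is needed.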
The latency matching trick
A human demonstrating with the UMI gripper reacts instantly — the delay between intention and motion is effectively zero. A robot has actuation latency: commands issued at time $t$ are not executed until $t + \delta$, where $\delta \approx 100$–$300$ms depending on the robot's control pipeline. If you train on human-speed data and deploy on a latency-ridden robot, the policy is always "behind" — it predicts actions that were appropriate $\delta$ milliseconds ago.
UMI's fix is temporal resampling. During data processing, the action labels are shifted forward in time by $\delta / \Delta t$ steps (where $\Delta t$ is the control period). At step $k$, the training target is the action the human actually performed at step $k + \delta / \Delta t$, that is, $\delta$ later in wall-clock time. At deployment, the robot's actuation delay means the action arrives at the gripper at approximately the right time. This is a simple index shift in the trajectory array, but without it the policy consistently undershoots and lags behind the task.
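The index shift is small enough to show in full. A minimal sketch, assuming latency and control period are given in whole milliseconds and that samples without a shifted label are simply dropped; `latency_shift_steps` and `shift_labels` are hypothetical names, not functions from the UMI codebase.

```python
def latency_shift_steps(latency_ms, control_period_ms):
    """Ceiling of latency / period in whole control steps (integer math)."""
    return -(-latency_ms // control_period_ms)

def shift_labels(observations, actions, shift):
    """Pair observation k with action k + shift; zip drops the unpaired tail."""
    return list(zip(observations, actions[shift:]))

obs = [f"o{k}" for k in range(150)]      # one 15 s episode at 10 Hz
act = [f"a{k}" for k in range(150)]
shift = latency_shift_steps(200, 100)    # 200 ms delay, 100 ms control period
pairs = shift_labels(obs, act, shift)    # o0 is now labelled with a2
```

Rounding up means a 150 ms delay on a 100 ms control period also shifts by two steps, which slightly over-compensates; rounding policy is a design choice worth checking against the real robot.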
Collection cost comparison
| System | Hardware cost | Setup time | Time per demo | Expert needed? | Robot needed? |
| --- | --- | --- | --- | --- | --- |
| UMI | ~$200 (gripper + GoPro) | Minutes | ~15s | No | No |
| ALOHA teleoperation | ~$30K (full ALOHA rig) | Hours (calibration) | ~20s + reset | Trained teleop | Yes |
| DexCap (dexterous) | ~$500 (glove + cameras) | 30 min | ~20s | No | No |
| Kinesthetic teaching | $0 (use robot) | Minutes | ~30s + reset | Robot operator | Yes |
| VR teleoperation | ~$1K (Quest headset) | 30 min | ~25s + reset | Trained teleop | Yes |
The cost difference is not incremental — it is structural. UMI removes the robot from the data collection loop entirely. A lab can hand 10 UMI grippers to 10 undergrads and collect 1,000 demonstrations in an afternoon. No scheduling the robot cell, no teleoperator training, no reset scripts. This is the reason UMI's impact exceeds its technical novelty.
The full data pipeline
The UMI data pipeline has five stages, each solving a specific problem that arises from collecting robot data without a robot:
Stage 1: Raw video capture. The GoPro records 4K video at 30fps with a fisheye lens. The wide field of view captures the gripper, the object, and the surrounding scene in every frame. The side mirrors extend the effective FOV to nearly 270°, so the camera can "see" objects approaching from the sides that a standard lens would miss. Raw output: $\sim$450 frames per 15-second trial (15s × 30fps) at 4K resolution.
Stage 2: SLAM-based pose estimation. ORB-SLAM3 processes the fisheye video to recover the camera's 6-DoF pose at each frame. Because the camera is rigidly mounted to the gripper, the camera pose is the gripper pose. The SLAM system uses visual features (ORB keypoints) to track the camera's motion through 3D space. ArUco fiducial markers on the gripper fingertips provide a secondary measurement of gripper width: the distance between the two fiducials in the camera image, scaled by the known fiducial size, gives the finger separation to $\pm$1mm.
Why SLAM and not AprilTags alone? A common alternative is to place AprilTag fiducials in the workspace and track the gripper relative to them. This requires instrumenting the environment (taping tags to the table, walls, etc.) and limits data collection to that specific workspace. SLAM is environment-agnostic — it builds its own map of the scene on the fly. The UMI gripper can be used in a kitchen, an office, or outdoors without any setup. This is the difference between "robotics-grade data collection" and "anyone can do it anywhere."
Stage 3: Temporal resampling. The raw trajectories are at 30fps (camera rate). The robot policy will run at 10Hz (the standard control frequency for manipulation). Linear interpolation for positions and SLERP (spherical linear interpolation) for rotations downsample each trajectory from 30fps to 10Hz while preserving smooth motion. Each 15-second trial becomes 150 timesteps.
Stage 4: Action space conversion. Absolute 6-DoF poses are converted to relative actions: $\Delta T_t = T_{t+1} \cdot T_t^{-1}$. The relative transform is decomposed into a position delta $(\Delta x, \Delta y, \Delta z)$ and a rotation delta (three Euler angles or a rotation vector). Combined with gripper width, the action vector is 7-dimensional: $a_t = [\Delta x, \Delta y, \Delta z, \Delta\text{roll}, \Delta\text{pitch}, \Delta\text{yaw}, w_{\text{gripper}}]$.
Stage 5: Latency compensation and normalization. Actions are shifted forward by $\lceil \delta / \Delta t \rceil$ timesteps to account for robot actuation delay $\delta$. All action dimensions are normalized to zero mean, unit variance based on the training set statistics. Images are resized to 224×224 and normalized to match the vision encoder's expected input distribution.
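Stage 3's resampling (linear interpolation for positions, SLERP for rotations) can be sketched with the standard library alone. The code assumes unit quaternions in (w, x, y, z) order and a synthetic 15-second track; `lerp`, `slerp`, and `resample` are hypothetical helpers, and the 0.9995 near-parallel threshold is a common heuristic rather than a value from the source.

```python
import math

def lerp(p0, p1, u):
    """Linear interpolation between two equal-length vectors."""
    return [a + u * (b - a) for a, b in zip(p0, p1)]

def slerp(q0, q1, u):
    """Spherical interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                        # flip one end to take the shorter arc
        q1, dot = [-c for c in q1], -dot
    if dot > 0.9995:                     # nearly parallel: lerp, then renormalize
        q = lerp(q0, q1, u)
    else:
        theta = math.acos(dot)
        w0 = math.sin((1 - u) * theta) / math.sin(theta)
        w1 = math.sin(u * theta) / math.sin(theta)
        q = [w0 * a + w1 * b for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]

def resample(times, positions, quats, rate_hz):
    """Resample a pose track onto a uniform grid starting at times[0]."""
    n_out = int(math.floor((times[-1] - times[0]) * rate_hz + 1e-9)) + 1
    out, i = [], 0
    for k in range(n_out):
        t = times[0] + k / rate_hz
        while i + 2 < len(times) and times[i + 1] < t:
            i += 1                       # advance to the bracketing frame pair
        u = (t - times[i]) / (times[i + 1] - times[i])
        out.append((t, lerp(positions[i], positions[i + 1], u),
                    slerp(quats[i], quats[i + 1], u)))
    return out

# Synthetic 15 s track at 30 fps: translate along x, rotate slowly about z.
times = [k / 30 for k in range(450)]
positions = [[t, 0.0, 0.0] for t in times]
quats = [[math.cos(0.05 * t), 0.0, 0.0, math.sin(0.05 * t)] for t in times]
track = resample(times, positions, quats, rate_hz=10)   # 150 samples
```

Going from 30fps to 10Hz lands exactly on every third frame, so interpolation only matters when the two rates are not integer multiples; the same function covers both cases.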
Generalizations of the UMI idea
The UMI principle — decouple data collection from the robot by matching the end-effector — has been extended to other embodiments, each with its own proxy device:
| System | End-effector | Proxy device | Pose estimation | Key innovation |
| --- | --- | --- | --- | --- |
| UMI | Parallel jaw gripper | Handheld gripper + GoPro | SLAM + ArUco | Original concept; $200 BOM |
| DexCap | Dexterous hand (16-DoF) | Glove with finger tracking | Multi-camera hand pose | Per-finger retargeting |
| HumanPlus | Humanoid whole-body | Human body + motion capture | RGB pose estimation | Full-body teleoperation-free demos |
| AnyTeleop | Various | VR headset + controllers | VR tracking | Universal retargeting across embodiments |
The deeper lesson
For two decades the bottleneck on imitation learning was data — specifically, synchronized expert action data, which is expensive because it requires a robot. UMI shows that for parallel-jaw manipulation, much of that data can be collected without a robot at all, by humans acting through a handheld proxy. The implications cascade: cross-embodiment datasets, in-the-wild collection by non-experts, scaled-out pretraining corpora.
The same idea has been generalized: DexCap for dexterous hands, HumanPlus for whole-body humanoids, and a long tail of "make a thing a human can wear or hold to record actions" projects. The common thread is that the action space of the gripper or hand is shared between human and robot; everything else can vary.
Worked example: UMI data pipeline. A human picks up the UMI gripper and performs a "pick up cup and place it on the saucer" task 50 times. Each trial takes ~15 seconds at 30fps = 450 frames. The pipeline:
1. SLAM trajectory extraction: ORB-SLAM3 on the GoPro fisheye camera recovers the 6-DoF gripper pose at each frame. Typical precision: ±2mm position, ±1° rotation. ArUco fiducial on the gripper fingers tracks gripper width.
2. Resampling: Camera runs at 30fps; the robot policy runs at 10Hz. Resample trajectories to 10Hz using linear interpolation for positions and SLERP for rotations. Each 15s trial → 150 timesteps.
3. Relative action conversion: Convert absolute EE poses to relative: $\Delta T_t = T_{t+1} \cdot T_t^{-1}$. This is the action the robot will predict.
4. Latency compensation: The robot has ~200ms actuation latency. Shift the action sequence forward by $200\text{ms} / 100\text{ms/step} = 2$ timesteps. At time $t$, the action label is what the human did at $t + 2$.
5. Dataset: 50 episodes × 150 steps = 7,500 observation-action pairs. Dropping the 3 steps per episode that lack either a full 2-step observation history or a latency-shifted action label leaves 50 × 147 = 7,350 training samples. This is enough to train a Diffusion Policy to ~85% success on this specific task.
Total time: 50 × 15s = 12.5 minutes of human demonstration time. No robot involved in data collection.
The data pipeline, formally
The full transformation from raw UMI video to training-ready dataset can be expressed as a sequence of six operators:
$\text{Norm}$ — normalize action dimensions to zero mean, unit variance.
Each operator is deterministic and invertible (except quantization in normalization). The pipeline takes ~10 minutes per 50 episodes on a laptop CPU, dominated by SLAM processing.
Worked example: UMI deployment on a new robot. The lab has trained a Diffusion Policy on 50 UMI demonstrations of "pick up cup and place on saucer." They want to deploy it on a Franka Panda with a Robotiq 2F-85 gripper.
Step 1: Hardware matching. The Robotiq 2F-85 is a parallel-jaw gripper with similar finger geometry to the UMI gripper. Mount a fisheye camera (Intel RealSense D435 with fisheye firmware) at the same relative position as the GoPro on the UMI gripper. The camera-to-gripper transform is measured once with a calibration pattern.
Step 2: Action space mapping. The policy outputs relative EE poses $(\Delta x, \Delta y, \Delta z, \Delta\text{roll}, \Delta\text{pitch}, \Delta\text{yaw})$ and gripper width $w_{\text{gripper}}$. The Franka's operational-space controller accepts Cartesian velocity commands and a gripper position target. The mapping is: $v_{\text{EE}} \approx (\Delta x, \Delta y, \Delta z, \Delta\text{roll}, \Delta\text{pitch}, \Delta\text{yaw}) / \Delta t$ (pose deltas divided by the control period, a good approximation for the small per-step rotations at 10Hz), and $w_{\text{target}} = w_{\text{gripper}}$, since the predicted width is already an absolute target. The Franka's inverse kinematics solver converts EE velocity to joint velocities internally.
Step 3: Latency recalibration. The Franka + Robotiq system has ~150ms actuation latency (vs 200ms assumed during UMI training). Two options: (a) retrain with the correct latency shift, which requires re-processing the data (10 minutes of compute); (b) add a 50ms software delay to match the training assumption. Option (b) is simpler and works in practice.
Step 4: Deploy. Run the trained policy in a receding-horizon loop at 10Hz. The camera captures the scene, the policy predicts 16 relative-EE actions, the first 8 are sent to the robot. Success rate: ~80% (vs ~85% on the original UMI gripper). The 5-point gap comes from minor differences in finger geometry and gripper stiffness.
Total deployment effort: ~2 hours (camera mounting, calibration, latency tuning). No retraining of the policy network.
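The receding-horizon loop in Step 4 (predict 16, execute 8, replan) has a simple control-flow skeleton. The sketch below uses hypothetical stand-in callables (`fake_policy`, `fake_observation`, `fake_send`) so it runs without a robot, and it leaves out the 10Hz timing, which a real loop would enforce with a rate limiter.

```python
def receding_horizon(policy, get_observation, send_action, n_cycles,
                     execute_steps=8):
    """Predict a chunk, execute its first `execute_steps` actions, then replan."""
    executed = []
    for _ in range(n_cycles):
        chunk = policy(get_observation())       # e.g. 16 predicted actions
        for action in chunk[:execute_steps]:
            send_action(action)
            executed.append(action)
    return executed

# Hypothetical stand-ins so the loop runs without hardware.
state = {"step": 0}

def fake_observation():
    return state["step"]

def fake_policy(obs, horizon=16):
    return [(obs, k) for k in range(horizon)]   # (step the plan was made at, index in chunk)

def fake_send(action):
    state["step"] += 1                          # one control tick per action sent

log = receding_horizon(fake_policy, fake_observation, fake_send, n_cycles=3)
```

Each logged action records which observation its plan came from, which makes it easy to see that the second half of every predicted chunk is discarded in favour of a fresher plan.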
The UMI ecosystem in 2026
The UMI design has spawned a family of handheld data-collection devices, each targeting a different end-effector morphology. The common principle — decouple data collection from the robot by building a human-holdable proxy of the end-effector — has been validated across grippers, dexterous hands, and even whole-body humanoid motion. The limitation remains end-effectors with high force requirements (industrial grippers, heavy-payload arms) where a human cannot replicate the necessary forces. For these, teleoperation remains necessary.
The UMI codebase is open-source (MIT license), with documented instructions for 3D-printing the gripper body, sourcing the GoPro and ArUco markers, and running the data pipeline. Multiple research groups have independently reproduced the setup and confirmed the reported results. This reproducibility is itself a contribution: UMI is not just a paper — it is a protocol that any lab can adopt in an afternoon.
Scaling the UMI approach
The natural question: if 50 demos cost 12.5 minutes of human time, how far can you scale? The answer is limited not by collection cost but by task diversity. Collecting 10,000 demos of the same pick-and-place task yields diminishing returns after ~500. The leverage comes from collecting across many tasks and environments: 50 demos each of 200 different tasks, collected by 20 different humans in 10 different kitchens. This is the UMI vision: not a data collection tool for one lab, but a data collection protocol for the entire field.
Several groups have begun organized UMI data collection campaigns: multiple research labs sharing a common task protocol, shipping UMI grippers to collaborators, and aggregating the resulting datasets into cross-institution training corpora. The target: 100K diverse demonstrations across 50+ tasks and 10+ environments, collected at a cost that would be impossible with robot teleoperation.
Worked example: UMI from zero to deployed policy in 6 hours. You want a robot that folds dish towels. You have never collected robot data before. Here is exactly what happens.
Hour 0–0.5: Hardware setup. Unbox the UMI gripper kit ($200). Attach the GoPro Hero 10 to the fisheye mount. Stick the ArUco fiducials to the fingertips. Charge the GoPro battery. Download the UMI data-processing codebase. Total prep: 30 minutes, no robotics expertise required.
Hour 0.5–2.5: Data collection. Collect 50 demonstrations of the towel-folding task. Each demo: pick up the UMI gripper, grasp one corner of the towel, fold it in half, release. Each trial takes ~30 seconds of actual manipulation plus ~10 seconds of repositioning the towel. The GoPro records continuously. You accumulate 50 × 30s = 25 minutes of task-relevant video at 30fps → 45,000 frames. A non-expert undergraduate can do this with 5 minutes of verbal instruction.
Hour 2.5–3: Data processing. Run the UMI pipeline on a laptop:
(1) Hand keypoint detection: MediaPipe processes each frame to detect the human's hand, confirming that the gripper is being held and providing coarse hand-pose priors. Runtime: ~5 minutes for 45K frames on a laptop GPU.
(2) 6-DOF wrist pose estimation: ORB-SLAM3 processes the fisheye video to recover the camera (= gripper) trajectory. The ArUco markers on the fingertips give gripper width at each frame. Runtime: ~15 minutes, dominated by SLAM.
(3) Relative EE action extraction: Convert absolute SE(3) poses to relative: $\Delta T_t = T_{t+1} \cdot T_t^{-1}$. Resample from 30fps to 10Hz. Shift forward by 2 timesteps for latency compensation (assuming 200ms robot delay). Runtime: seconds.
(4) Normalization: Compute per-dimension mean and standard deviation across all 50 episodes. Normalize actions to zero mean, unit variance. Resize images to 224×224.
Hour 3–3.5: Dataset verification. Replay the extracted trajectories in a visualizer. Check that the gripper poses track the actual motion. Check that gripper-width labels match the actual open/close events. Discard any demos where SLAM lost tracking (typically 2–5 out of 50). Final dataset: ~45 good episodes × ~300 timesteps each (towel folding is longer than a quick pick) = ~13,500 training samples.
Hour 3.5–6: Training. Train a Diffusion Policy on 1 GPU (RTX 4090). Architecture: CLIP-pretrained ViT encoder (frozen) for images, MLP encoder for EE-pose history, Transformer denoiser predicting 16 future action steps. Batch size 256, 200K gradient steps, ~2.5 hours. The loss curve should plateau by 150K steps.
Deployment: Mount a fisheye camera on the robot's Robotiq gripper at the same relative position as the GoPro on the UMI gripper. Run the trained policy at 10Hz in receding-horizon mode (predict 16, execute 8). Expected first-attempt success rate: 60–75% (towel folding is deformable manipulation — harder than rigid pick-and-place). With 30 more targeted demos on failure cases and a 1-hour retrain: 80–85%.
Total wall-clock: 6 hours from unboxing to a deployed towel-folding policy. Total compute cost: ~$2 of GPU time. Total human labor: ~3 hours of active work.
Scaling: how many demos for how hard a task?
The number of demonstrations required depends on the task's complexity along three axes: precision of the required motion, variability of the objects and scene, and number of sequential stages. The following table summarizes empirical findings across multiple UMI deployment campaigns:
| Task category | Example | Demos needed | Expected success | Bottleneck |
| --- | --- | --- | --- | --- |
| Simple pick-and-place | Pick up a mug, place on coaster | 10–20 | 85–95% | Gripper alignment with handle |
| Moderate pick-and-place | Stack 3 blocks in a specific order | 30–50 | 75–85% | Sequencing and block-pose variation |
| Precise insertion | Insert USB connector into port | 80–120 | 70–80% | Sub-millimeter alignment, contact forces |
| Articulated manipulation | Open a drawer, place object inside | 50–80 | 75–85% | Handle grasp + pull trajectory |
| Deformable manipulation | Fold a towel, tie a knot | 200–500+ | 60–80% | Fabric state is high-dimensional and stochastic |
| Tool use | Use a spatula to flip a pancake | 100–200 | 50–70% | Tool-object interaction dynamics |
| Multi-step kitchen | Pour, stir, plate (5+ stages) | 300–500 | 40–60% | Error compounds across stages |
The scaling is sub-linear in task complexity but super-linear in precision requirements. Doubling the number of demos typically adds 5–10 percentage points of success rate until saturation. The saturation point — where more demos stop helping — is determined by the policy architecture's capacity and the irreducible stochasticity of the task (a towel falls differently each time, and no amount of data can eliminate that variance). Beyond saturation, the next lever is environment diversity: collecting the same task in 10 different kitchens with 10 different towels beats collecting 10x more demos in one kitchen with one towel.
Data quality signals: how to spot a bad UMI demo
Not all UMI demos are usable. The SLAM trajectory can fail silently, producing a dataset that looks complete but contains garbage actions. Three quality signals to check before training:
SLAM tracking loss. ORB-SLAM3 reports a "tracking state" per frame. If the tracker enters "lost" state for more than 5 consecutive frames, the recovered pose after re-localization will have a discontinuous jump. Discard any episode with a position jump > 5cm between consecutive timesteps after resampling.
Gripper width consistency. The ArUco-based gripper width should change smoothly (< 2mm/timestep at 10Hz). A spike > 5mm in one timestep means the fiducial detection failed (occlusion, motion blur). Interpolate over short gaps (< 3 frames); discard episodes with long gaps.
Action magnitude distribution. Compute the L2 norm of the relative action at each timestep across all episodes. The distribution should be unimodal with a thin tail. Episodes whose mean action magnitude is > 3 standard deviations from the population mean are likely corrupted (SLAM drift, human fumble, accidental recording). These should be reviewed manually before inclusion.
In a typical UMI data collection campaign, 5–10% of episodes fail quality checks. This is a tolerable loss rate — it takes 15 seconds to collect another demo. The alternative (no quality filtering) produces a dataset where 5% of the training samples have corrupted action labels, which can reduce final policy success rate by 10–15 points.
A simple automated pipeline: after running the SLAM + resampling pipeline on all episodes, compute the per-episode statistics (mean action norm, max gripper-width delta, SLAM tracking loss count). Flag episodes that exceed 2.5 standard deviations from the population mean on any metric. Manually review only the flagged episodes (typically 10–15% of the total) and discard the truly corrupted ones. This takes 5–10 minutes for a 50-episode dataset and catches the most damaging outliers without requiring frame-by-frame inspection of every demo.
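The flagging rule described above (any metric more than 2.5 standard deviations from the population mean) can be sketched as follows. The metric names and the synthetic statistics are hypothetical, and the per-episode numbers are assumed to have been computed already by the SLAM and resampling stages.

```python
from statistics import mean, pstdev

def zscore_flags(values, threshold=2.5):
    """Indices whose value lies more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()                      # all episodes identical on this metric
    return {i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold}

def flag_episodes(episode_stats, threshold=2.5):
    """episode_stats: one dict of metric -> number per episode."""
    flagged = set()
    for metric in episode_stats[0]:
        flagged |= zscore_flags([ep[metric] for ep in episode_stats], threshold)
    return sorted(flagged)

# Synthetic statistics for 20 episodes (hypothetical values).
stats = [{"mean_action_norm": 0.020 + 0.001 * (i % 3),
          "max_width_delta_mm": 1.0 + 0.1 * (i % 4),
          "tracking_lost_frames": 0} for i in range(20)]
stats[7]["tracking_lost_frames"] = 12     # simulated SLAM dropout
stats[13]["max_width_delta_mm"] = 9.0     # simulated fiducial glitch
flagged = flag_episodes(stats)            # episodes to review by hand
```

One caveat of a pure z-score rule: a single outlier among n episodes can sit at most sqrt(n − 1) population standard deviations from the mean, so with fewer than eight episodes the 2.5 threshold never fires. For small batches, absolute limits such as the 5cm jump and 5mm spike checks above are the safer filter.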
Architectures saturate. Data does not. The single highest-leverage move in modern robot learning is finding a way to collect more demonstrations faster, more cheaply, and from less specialized labor.
12 Vision–Language–Action models
When the policy is a frozen LLM with a different output head — and increasingly, with two heads running at different speeds.
A VLA is a single network that ingests images and natural-language instructions and emits robot actions. The bet behind every VLA is that the abstractions a model learns from internet-scale vision-language data — objects, affordances, spatial relations, intent — transfer to robotics, and that they transfer better than anything you could pretrain on robot data alone. By 2026 the bet has paid off, the architectures have converged, and the open question is no longer "do VLAs work" but "what fraction of the stack should be the VLM versus the action expert, and at what frequencies."
The lineage, in one table
| Model | Year | Backbone | Action head | Notable |
| --- | --- | --- | --- | --- |
| RT-1 | 2022 | EfficientNet + USE + FiLM | Discrete tokens (256 bins) | First scaled VLA recipe; 35M params; 130k demos. |
| RT-2 | 2023 | PaLI-X / PaLM-E (12B–55B) | Tokens overloaded into LLM vocab | First true VLA; web + robot co-finetuning. |
| Octo | 2024 | Custom transformer (27M / 93M) | Diffusion (continuous) | Open. Goal-image or language; 800k demos. |
| OpenVLA | 2024 | Llama-2 7B + DINOv2 + SigLIP | Discrete tokens | Open RT-2 recipe; 970k demos. |
| RDT-1B | 2024 | DiT (1B) | Diffusion | Bimanual specialist; 1M+ episodes. |
| π₀ | 2024 | PaliGemma 3B + 300M expert | Flow matching | 50Hz bimanual; cross-embodiment training. |
| π₀-FAST | 2025 | Same backbone | Autoregressive on FAST tokens | 5× faster training; matches diffusion quality. |
| π₀.₅ | 2025 | PaliGemma + action expert | Flow matching | Open-world generalization; new kitchens/bedrooms. |
| π₀.₇ | 2026 | + MEM, RL Token | Flow + RL fine-tuning | Steerable; multi-scale memory; >10-min tasks. |
| GR00T N1 | 2025 | Eagle-2 VLM (1.34B) + DiT | Diffusion / flow matching | Humanoid; 2.2B; 63.9ms / 16-action chunk. |
| Helix | 2025 | 7B VLM at 7–9Hz | Visuomotor at 200Hz | 35-DOF upper body; runs on Jetson Orin; <100ms. |
| SmolVLA | 2025 | SmolVLM (450M) | Flow matching expert | Compact; matches 10× larger models on benchmarks. |
The two-system split, explicitly
The convergent architecture of 2026 has two unequal halves. A large vision-language model — the slow brain — observes the scene at 5–10Hz and emits either a latent plan, a chain-of-thought string, or a sequence of FAST tokens. A small action expert — the fast brain — runs at 50–200Hz, reads the latest observation plus the slow brain's output, and produces continuous joint or end-effector commands. The split is what makes language-conditioned humanoid control viable: a 7B forward pass per control tick is not feasible; a 7B forward pass per plan with a 100M expert per tick is.
[Figure: System 2 · slow, System 1 · fast, bridge signal]
The five families of bridges
Different VLAs disagree about what the slow brain sends to the fast brain:
Hidden states. π₀ and GR00T pass the VLM's last-layer hidden states through cross-attention into the action expert. Highest bandwidth; tightest coupling; requires joint training.
Discrete tokens. RT-2 / OpenVLA / π₀-FAST emit action tokens from the LLM's own vocabulary, decoded back into actions. Lowest latency for the VLM; throws away cross-dimension structure unless paired with FAST.
Latent plan vectors. Helix-style designs emit a small "plan vector" updated at System-2 frequency that conditions System 1. Loose coupling; allows the two halves to be trained separately.
Natural-language reasoning. Gemini Robotics 1.5 interleaves language reasoning steps with action chunks — "first I'll pick up the cup, then place it in the sink" — making behavior interpretable.
Tool calls. Gemini Robotics-ER 1.5 acts as an orchestrator, calling a separate VLA as a tool. The reasoning model never sees the actuators directly.
Motion Transfer and embodiment soup
A VLA trained on Open X-Embodiment sees seven different arms doing similar tasks with different action spaces. Motion Transfer (Gemini Robotics 1.5) and π₀'s zero-padding to the largest action vector are two answers to the same question: how do you make a single policy reuse motor knowledge across robots? The recipe that works is a shared semantic representation in the VLM, plus an action expert whose output is masked to the active embodiment's true degrees of freedom.
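The padding-plus-masking idea can be illustrated in a few lines. This sketch assumes a shared action width of 14 (a hypothetical value) and uses a masked mean-squared error purely for illustration; π₀ itself trains with a flow-matching objective, so only the masking pattern carries over. `pad_action` and `masked_mse` are hypothetical helpers.

```python
MAX_ACTION_DIM = 14  # hypothetical: largest action vector across embodiments

def pad_action(action, max_dim=MAX_ACTION_DIM):
    """Zero-pad a robot's action to the shared width; return (padded, mask)."""
    n = len(action)
    if n > max_dim:
        raise ValueError("action longer than shared action width")
    padded = list(action) + [0.0] * (max_dim - n)
    mask = [1.0] * n + [0.0] * (max_dim - n)
    return padded, mask

def masked_mse(pred, target, mask):
    """Mean squared error over the active dimensions only."""
    active = sum(mask)
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / active

target, mask = pad_action([0.1, -0.2, 0.0, 0.3, 0.0, 0.1, 1.0])   # 7-DoF arm
pred = [0.1, -0.2, 0.0, 0.3, 0.0, 0.1, 0.5] + [9.9] * 7           # junk in the padding
loss = masked_mse(pred, target, mask)   # only the gripper error contributes
```

Whatever the network emits in the padded slots is ignored, so a 7-DoF arm and a 14-DoF bimanual rig can share one output layer without the smaller robot being penalized for dimensions it does not have.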
Embodied thinking
Gemini Robotics 1.5 added an explicit reasoning trace before action emission — the model writes natural language describing what it is about to do, then emits the action tokens. The trace is conditioned on by the action head, so the reasoning is causally upstream of motion. The cost is latency. The benefit is that "pour the milk before the cereal" requires reasoning the model could not previously do at all.
Inside the VLA: tokenization walkthrough
A VLA ingests three modalities and must convert all of them into a common token format that the transformer backbone can process. Here is the data flow, layer by layer:
Image tokenization. Each camera image (typically 224×224 or 336×336) passes through a ViT encoder (DINOv2, SigLIP, or CLIP). The ViT divides the image into non-overlapping patches (14×14 or 16×16 pixels each), projects each patch to an embedding, and outputs $N$ spatial tokens — typically 256 for a 224/14 grid. For multi-camera setups, each camera produces its own $N$ tokens, which are concatenated into the sequence. A two-camera robot thus starts with $2 \times 256 = 512$ image tokens.
Text tokenization. The language instruction ("pick up the red cup") is tokenized by the VLM's text tokenizer (SentencePiece for Gemma-family, BPE for Llama-family). A typical instruction becomes 5–20 text tokens. These are prepended or interleaved with the image tokens.
Action prediction. The transformer processes the combined token sequence and must produce actions. Two families:
Discrete action tokens (RT-2, OpenVLA). Each action dimension is quantized into 256 bins. A 7-DoF action becomes 7 tokens, predicted autoregressively. The loss is cross-entropy per token.
Continuous action heads (Diffusion Policy, flow matching expert). The transformer's last hidden states are fed to a separate action expert network that generates continuous actions. The loss is the diffusion or flow matching objective, conditioned on the transformer's hidden representations.
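The token counts quoted in this walkthrough follow from simple arithmetic, sketched below. The 8-token instruction length is an assumed example within the 5–20 range given above, and `image_tokens` and `sequence_length` are hypothetical helpers.

```python
def image_tokens(image_size, patch_size):
    """Number of non-overlapping ViT patches for a square image."""
    if image_size % patch_size:
        raise ValueError("image size must be a multiple of patch size")
    return (image_size // patch_size) ** 2

def sequence_length(n_cameras, image_size, patch_size, n_text_tokens,
                    action_dims=0):
    """Tokens the backbone sees; action_dims > 0 adds discrete action tokens."""
    return (n_cameras * image_tokens(image_size, patch_size)
            + n_text_tokens + action_dims)

per_camera = image_tokens(224, 14)                            # 16 x 16 patches
two_cam = sequence_length(2, 224, 14, n_text_tokens=8)        # images + text
with_actions = sequence_length(2, 224, 14, 8, action_dims=7)  # plus 7 action tokens
```

Image tokens dominate the sequence: adding a camera costs 256 tokens, while a 7-DoF discrete action costs 7.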
The autoregressive action loss, derived. For discretized actions (RT-2 family), each action dimension $d$ is binned into $B = 256$ buckets. The bin index for dimension $d$ at timestep $t$ is $b_{t,d} = \lfloor (a_{t,d} - a_{\min,d}) / (a_{\max,d} - a_{\min,d}) \times (B-1) \rceil$. The training loss is cross-entropy over bins:
$$\mathcal{L}_{\text{action}} = -\sum_{t=1}^{H} \sum_{d=1}^{D} \log p_\theta(b_{t,d} \mid b_{<(t,d)}, o)$$
where $b_{<(t,d)}$ is all previously predicted bin indices (autoregressive ordering). At inference, the model samples (or argmaxes) one bin per dimension, then converts back: $\hat{a}_{t,d} = a_{\min,d} + b_{t,d} \cdot (a_{\max,d} - a_{\min,d}) / (B-1)$.
The resolution bottleneck: with 256 bins over a 0.4m workspace, adjacent bin centers are $0.4/255 \approx 1.6$mm apart. This is adequate for pick-and-place but coarse for insertion tasks requiring sub-millimeter precision. Diffusion and flow matching heads avoid this quantization ceiling entirely.
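The binning and un-binning formulas above translate directly into code. A minimal sketch, assuming one axis with range ±0.2m (a 0.4m workspace) and clipping of out-of-range values, which the formulas leave unspecified; `encode` and `decode` are hypothetical names.

```python
B = 256

def encode(a, a_min, a_max, bins=B):
    """Nearest-bin index in [0, bins - 1] for a value in [a_min, a_max]."""
    a = min(max(a, a_min), a_max)                  # clip out-of-range values
    return int(round((a - a_min) / (a_max - a_min) * (bins - 1)))

def decode(b, a_min, a_max, bins=B):
    """Map a bin index back to the continuous value at its bin center."""
    return a_min + b * (a_max - a_min) / (bins - 1)

a_min, a_max = -0.2, 0.2             # 0.4 m workspace along one axis
x = 0.1234
b = encode(x, a_min, a_max)
x_hat = decode(b, a_min, a_max)
bin_width = (a_max - a_min) / (B - 1)
print(b, round(abs(x - x_hat) * 1000, 3), "mm error")
```

The round trip can be off by up to half a bin width (about 0.8mm here), and that error is baked into the training labels before the model predicts anything.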
The two-system split: a concrete example
Worked example: two-system execution of "pick up the red cup." The robot is a bimanual humanoid with 35 DOF upper body, a scene camera, and wrist cameras.
System 2 (slow brain, 7B VLM, 7Hz). At $t = 0$: the VLM receives the scene image + "pick up the red cup." It outputs a latent plan vector $z_{\text{plan}} \in \mathbb{R}^{512}$ encoding "reach toward red cup with right hand, grasp." This forward pass takes ~140ms.
System 1 (fast brain, 200M action expert, 200Hz). Between $t = 0$ and $t = 143$ms (the next System 2 tick), System 1 runs ~28 control steps. Each step: read the latest joint positions $q_t$, wrist camera image token, and the cached $z_{\text{plan}}$ from System 2. Output: 35-dimensional joint velocity target. Each forward pass: ~4ms.
At $t = 143$ms: System 2 re-observes the scene. The hand is now closer to the cup. It updates $z_{\text{plan}}$ to encode "close fingers around cup." System 1 seamlessly transitions to grasping motions using the updated plan.
At $t = 286$ms: System 2 sees the cup is grasped. Updates plan: "lift cup." System 1 executes the lift.
Total task time: ~2 seconds. System 2 ran ~14 times. System 1 ran ~400 times. The VLM provided semantic understanding; the action expert provided fast motor control. Neither alone could have done the task — the VLM is too slow for 200Hz control, the action expert has no concept of "red cup."
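The tick counts in this example can be reproduced with a small scheduling sketch. It is a simplification: it assumes System 2's plan becomes available the instant its tick fires (the ~140ms forward-pass latency is ignored), and `run_schedule` is a hypothetical helper, not part of any published system.

```python
S1_HZ = 200          # fast action expert rate
S2_HZ = 7            # slow VLM rate
TASK_SECONDS = 2

def run_schedule(s1_hz, s2_hz, seconds):
    """Count ticks of each system; System 1 always acts on the latest cached plan."""
    s1_steps, s2_ticks = 0, 0
    steps_per_plan = []                     # System 1 steps served by each plan
    for k in range(s1_hz * seconds):        # k-th control step, at time k / s1_hz
        # System 2's j-th tick is due at time j / s2_hz.
        # Comparing integers avoids floating-point drift.
        if k * s2_hz >= s2_ticks * s1_hz:
            s2_ticks += 1                   # stand-in for refreshing z_plan
            steps_per_plan.append(0)
        steps_per_plan[-1] += 1             # System 1 step using the cached plan
        s1_steps += 1
    return s1_steps, s2_ticks, steps_per_plan

s1_steps, s2_ticks, steps_per_plan = run_schedule(S1_HZ, S2_HZ, TASK_SECONDS)
```

Because 200/7 is not an integer, each plan ends up serving either 28 or 29 control steps, which matches the "~28" figure above.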
The data scaling equation
Training a VLA requires data from three very different sources, at very different scales:
| Data source | Scale | What it teaches | Example |
|---|---|---|---|
| Internet text | Trillions of tokens | Language understanding, common-sense reasoning, world knowledge | |
| Robot demonstrations | | Motor skills, contact physics, action-observation mapping | "To grasp this cup, close fingers at this pose" |
The ratio matters. RT-2 used ~4:1 web:robot data; $\pi_0$ used comparable ratios. Too much robot data and the model overfits to the robot distribution, losing web knowledge. Too little robot data and the model knows what a cup is but cannot grasp one. The emerging consensus: pretrain on web data (text + images), then fine-tune on robot data with a low learning rate and frozen early layers. This is why Knowledge Insulating ($\pi_0$, 2025) — freezing most VLM weights during robot fine-tuning — works: it preserves the internet knowledge structurally while adapting only the action-relevant layers.
Co-training on web data
RT-2 introduced the recipe and every successor has confirmed it: keep training on web vision-language data while fine-tuning on robot data. Otherwise the model loses its world knowledge: it can grab the green block, but ask it to "grab the dinosaur" and it no longer knows what a dinosaur looks like. Mix ratios run from 1:1 to 4:1 web:robot. π₀ + Knowledge Insulating (2025) takes this further: freeze most VLM weights through fine-tuning so internet knowledge is preserved structurally, not just statistically.
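One simple way to realize a fixed mix ratio is a repeating batch-source schedule. The sketch below is illustrative only: `mix_schedule` is a hypothetical helper, and real pipelines more often draw each batch at random with the corresponding probabilities.

```python
from itertools import cycle, islice

def mix_schedule(web_parts, robot_parts, n_batches):
    """Deterministic batch-source schedule with a web:robot ratio."""
    pattern = ["web"] * web_parts + ["robot"] * robot_parts
    return list(islice(cycle(pattern), n_batches))

# A 4:1 web:robot mix over ten batches.
schedule = mix_schedule(web_parts=4, robot_parts=1, n_batches=10)
```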
The VLA architecture in detail
A modern VLA has four distinct components, each with different computational profiles:
1. Vision encoder ($\sim$300M params, frozen). Typically a ViT-L/14 (DINOv2 or SigLIP). Processes each camera image into spatial tokens. For a 224×224 image with 14×14-pixel patches: $(224/14)^2 = 256$ tokens per image, each $\in \mathbb{R}^{1024}$. Forward pass: ~8ms on an A100. This is the most expensive per-image operation, but it only runs once per observation (not per denoising step).
2. Language encoder ($\sim$0 additional params, shared). In most VLAs, the language encoder is the same transformer backbone that processes the combined sequence. The text tokens are embedded by the VLM's standard tokenizer and processed alongside the image tokens. In $\pi_0$, the language is processed by PaliGemma's text encoder, which shares parameters with the vision pathway.
3. Transformer backbone ($\sim$1B–7B params, frozen or LoRA). The core of the VLA. Processes the concatenated sequence of [text tokens, image tokens, proprioception tokens, (optional) action tokens]. Self-attention allows every token to attend to every other token, enabling cross-modal reasoning: the model can correlate the word "red" with the red-colored image patches and the proprioceptive state indicating the arm is near a red object.
4. Action head ($\sim$100M–300M params, trained). Converts the backbone's output into robot actions. This is where the architectural diversity lives. The action head might be a diffusion denoiser (conditioned on backbone hidden states), a flow matching expert, an autoregressive token predictor, or a simple MLP. The head is almost always trained from scratch — unlike the backbone, it has no useful pretrained initialization for robot actions.
Worked example: token sequence for one VLA forward pass. An OpenVLA-7B processing a language-conditioned pick task:
Input sequence construction:
1. Language: "pick up the red block" → SentencePiece → 7 text tokens, each $\in \mathbb{R}^{4096}$
2. Image: 224×224 RGB from scene camera → DINOv2 ViT-L/14 → 256 spatial tokens, projected to $\mathbb{R}^{4096}$
3. Image: 224×224 RGB from wrist camera → same encoder → 256 spatial tokens
4. Proprioception: [joint angles (7); gripper width (1); EE pose (6)] = 14D → MLP → 1 token $\in \mathbb{R}^{4096}$
Total input: 7 + 256 + 256 + 1 = 520 tokens. Backbone forward pass: 520 tokens through 32 transformer layers of width 4096, ~120ms on an A100.
Action prediction: The backbone's last hidden state at the proprioception token position is passed to the action head. For discrete tokens: predict 7 bin indices autoregressively (7 × ~3ms = 21ms). For flow matching: 10 Euler steps × ~5ms = 50ms.
Total latency: vision encoding (16ms) + backbone (120ms) + action head (21–50ms) = ~160–190ms. At 5Hz control (200ms budget), this fits, though with little headroom at the upper end. At 10Hz (100ms budget), the full pass does not fit: only the action head can run repeatedly while the backbone is amortized across multiple control steps, which is exactly the two-system split.
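The token and latency bookkeeping in this worked example is easy to get wrong by hand, so here is the same arithmetic as a sketch. All timings are the approximate figures quoted above, not measurements; the helper names are hypothetical.

```python
def vit_tokens(image_px, patch_px):
    """Number of patch tokens for a square image and square patches."""
    per_side = image_px // patch_px
    return per_side * per_side

TEXT_TOKENS, PROPRIO_TOKENS, N_CAMERAS = 7, 1, 2
tokens_per_image = vit_tokens(224, 14)                       # 16 * 16
total_tokens = TEXT_TOKENS + N_CAMERAS * tokens_per_image + PROPRIO_TOKENS

VISION_MS_PER_IMAGE, BACKBONE_MS = 8, 120
HEAD_MS = {"discrete": 7 * 3, "flow_matching": 10 * 5}       # per-step cost x steps

def total_latency_ms(head):
    """End-to-end latency of one full forward pass for a given action head."""
    return N_CAMERAS * VISION_MS_PER_IMAGE + BACKBONE_MS + HEAD_MS[head]

def fits(head, control_hz):
    """True if one full forward pass fits inside a single control period."""
    return total_latency_ms(head) <= 1000 / control_hz

latencies = {head: total_latency_ms(head) for head in HEAD_MS}
```

Calling `fits("flow_matching", 5)` and `fits("discrete", 10)` shows the two regimes: the full pass fits a 200ms period but not a 100ms one.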
Synthetic data is the new Open X-Embodiment
GR00T N1's training mix is real-robot trajectories, human videos, and entirely neural-generated trajectories from video diffusion models. The shift is significant: when image and video generation are themselves at foundation-model scale, the cheapest source of robot training data may be a generative model rather than a teleoperator.
The VLA training pipeline
Training a VLA from a pretrained VLM checkpoint follows a consistent recipe across most published models:
Stage 1: Pretrain on web data (already done). The VLM backbone arrives pretrained on internet text and images. This is the most expensive stage (thousands of GPU-hours) and is done by the model provider, not the robotics lab.
Stage 2: Co-finetune on web + robot data. Interleave batches of web VQA data with robot demonstration data. The web data preserves the VLM's general knowledge; the robot data teaches action prediction. Typical mix ratio: 1:1 to 4:1 web:robot. Learning rate: 1e-5 to 5e-5. Duration: 50K–200K gradient steps on 8–64 GPUs.
Stage 3: Task-specific fine-tuning (LoRA). Freeze the backbone, attach LoRA adapters, and fine-tune on the target robot's demonstrations. This is the step most practitioners perform. Learning rate: 2e-5. Duration: 10–30 epochs on a single GPU. Trainable parameters: 0.2–2% of total.
The key engineering decision in Stage 2 is what to supervise. For discrete action tokens (RT-2 family), the loss is next-token cross-entropy on both web tokens and action tokens — the same loss function for both modalities. For continuous action heads ($\pi_0$ family), the web data is supervised with the VLM's original loss (captioning, VQA) while the robot data is supervised with the action head's loss (flow matching, diffusion). The two loss terms are weighted and summed.
The knowledge insulation problem. When you fine-tune a VLM on robot data, the model's internet knowledge degrades. This is called catastrophic forgetting. The model learns to predict actions but forgets what a dinosaur looks like. Two solutions:
1. Co-training (RT-2 approach): keep web data in the training loop. The model sees both web and robot data at every step. This works but requires maintaining a large web dataset during robot training.
2. Weight freezing ($\pi_0$ approach): freeze most VLM weights. Only the action expert and a few adapter layers are trainable. The internet knowledge is preserved by construction, because the weights that encode it cannot change. This is simpler and increasingly preferred.
The emerging best practice: freeze the VLM backbone entirely, use LoRA adapters for the minimal adaptation needed, and train the action head from scratch. This gives the best of both worlds: internet knowledge preservation + task-specific motor skill learning.
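To see where the "0.2–2% trainable" figure for LoRA fine-tuning comes from, a rough parameter count helps. This is a back-of-the-envelope sketch under stated assumptions: adapters on the four attention projections only, square 4096×4096 weights, rank 32, and a nominal 7B backbone. Other ranks and target modules move the number within (or outside) that range.

```python
def lora_params(d_in, d_out, rank):
    """A rank-r adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

N_LAYERS, D_MODEL = 32, 4096        # backbone shape from the OpenVLA-7B example
ADAPTED_PER_LAYER = 4               # assumption: q, k, v, o projections only
RANK = 32                           # assumption: a common LoRA rank
BACKBONE_PARAMS = 7_000_000_000     # nominal "7B"

trainable = N_LAYERS * ADAPTED_PER_LAYER * lora_params(D_MODEL, D_MODEL, RANK)
fraction = trainable / BACKBONE_PARAMS   # roughly 0.5% under these assumptions
```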
The generalist vs specialist tradeoff, quantified
The empirical data on when generalist VLAs beat specialist policies is now clear enough to state as a rough rule:
The crossover analysis. Consider a deployment with $N$ distinct tasks, each with $D$ demonstrations.
Specialist approach: train $N$ separate Diffusion Policies. Each achieves $\sim$90% success with 50+ demos. Total training: $N$ models × 2 hours = $2N$ GPU-hours. Deployment: one model per task, hot-swap at task boundaries.
Generalist VLA approach: fine-tune one OpenVLA with LoRA on all $N \times D$ demonstrations. Success rate: $\sim$80% on average (worse than specialist on any single task, but covers all tasks with one model). Total training: 1 model × 8 hours = 8 GPU-hours. Deployment: one model, task selected by language instruction.
Crossover at $N \approx 10$–$15$. Below 10 tasks, the specialist is better (90% vs 80%), and for $N < 4$ it is also cheaper to train ($2N < 8$ GPU-hours). Beyond $N = 4$ the VLA is already cheaper to train; above 15 tasks it also wins on engineering cost: one model to maintain, one inference pipeline, no task-switching logic. The success rate gap narrows as the VLA sees more diverse data. At $N = 50$ tasks, the VLA often matches or beats the specialist because cross-task transfer improves the shared representations.
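The training-compute side of this comparison can be tabulated directly. The sketch uses only the figures quoted above (2 GPU-hours per specialist, a flat 8 GPU-hours for the generalist) and deliberately ignores the success-rate gap and maintenance cost, which are what push the overall crossover out to $N \approx 10$–$15$.

```python
SPECIALIST_HOURS_PER_TASK = 2
GENERALIST_HOURS = 8      # treated as fixed, as in the text; in practice it grows with data

def training_hours(n_tasks):
    """GPU-hours to cover n_tasks with each approach."""
    return {"specialist": SPECIALIST_HOURS_PER_TASK * n_tasks,
            "generalist": GENERALIST_HOURS}

def cost_breakeven():
    """Smallest N at which N specialists cost at least as much as one VLA."""
    n = 1
    while training_hours(n)["specialist"] < training_hours(n)["generalist"]:
        n += 1
    return n

breakeven = cost_breakeven()
table = {n: training_hours(n) for n in (1, 4, 10, 50)}
```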
Where VLAs are weak
Raw latency. A 7B forward pass dominates the control budget. Two-system splits, FAST tokenization, INT4 quantization, and speculative decoding are the four levers.
Fine motor control. A generalist policy still underperforms a specialist on its specialty by 5–15 points. RL fine-tuning closes most of the gap.
Out-of-distribution physics. A VLA that never saw deformable cloth does not learn cloth physics from a few demos.
A VLA is not a robot policy that happened to use a language model. It is a language model that happens to have a robot as an output device. The implications of that framing — for data, architecture, evaluation, and team structure — are still being worked out.
12·5 3D representations and equivariance
When the input is a point cloud, the symmetries of physics start paying for themselves.
2D image policies are the dominant paradigm for one reason: 2D images are easy to collect, easy to encode, and have ImageNet-scale priors available. They are also geometrically lossy. A policy trained on RGB images alone has no built-in notion of where things are in 3D space; it has to learn that from data, every time. A small but rapidly growing corner of the field argues that the right move is to give the policy 3D structure directly — and, while you're there, to bake the physical symmetries of 3D space into the architecture.
Why 3D helps
Spatial generalization for free. A policy that sees raw RGB has to learn that an object 30cm to the left looks similar to one straight ahead. A policy that operates on 3D points has the translation built into the input geometry.
Camera invariance. 3D point clouds aggregated from RGB-D or stereo cameras are indifferent to camera placement.
Sample efficiency. 3D Diffusion Policy needs ~10× fewer demos than 2D Diffusion Policy on contact-rich tasks.
The 3D policy zoo
| Model | Input | Architecture | Notable |
|---|---|---|---|
| 3D Diffusion Policy | Sparse point cloud (~512 pts) | 1D embedding + diffusion | Cheap; strong on data-scarce tasks. |
| 3D Diffuser Actor | Multi-view RGB-D → 3D scene tokens | Relative-position 3D attention | Translation equivariant; SOTA on RLBench. |
| EquiBot | Point cloud | Sim(3)-equivariant network | Scale-equivariant; data efficient. |
| Spherical Diffusion Policy | Point cloud | SE(3)-equivariant in spherical Fourier space | Full 3D rotational equivariance. |
The symmetry argument
If you rotate the entire scene by some $R \in SO(3)$, the correct robot action rotates by the same $R$. A policy that doesn't know this has to learn it from data — separately for every angle. A policy that has it baked in is, by construction, correct for every angle the moment it works for one. This is the same argument that made convolutional networks beat MLPs on images: a network that respects translation symmetry sees the same image once, regardless of where the object is. 3D policies extend the argument from $\mathbb{R}^2$ translations to $SE(3)$ rigid motions.
SE(3) equivariance, formally
In plain English: if you rotate and shift the entire scene — the table, the cup, the robot's coordinate frame — the robot's planned motion rotates and shifts by the exact same amount. The policy "understands" 3D geometry well enough that its answer transforms correctly with the world, rather than memorizing specific positions.
A policy $\pi$ is SE(3)-equivariant if for any rigid transform $g = (R, t) \in SE(3)$ (a rotation $R \in SO(3)$ and translation $t \in \mathbb{R}^3$):
$$\pi(g \cdot o) = g \cdot \pi(o)$$
$g \cdot o$ — the transformed observation. If $o$ is a point cloud, $g \cdot o$ rotates and translates every point. If $o$ includes images, $g$ transforms the 3D scene that the images depict.
$\pi(o)$ — the policy output (an action, typically in SE(3) end-effector space). Given the original scene, the policy predicts this action.
$g \cdot \pi(o)$ — the transformed action. If the scene rotates by $R$ and translates by $t$, the correct action rotates and translates by the same $R$ and $t$.
What this means concretely: if you rotate the entire scene 90° clockwise (the table, the cup, the robot's coordinate frame), the predicted end-effector target rotates 90° clockwise too. The policy does not need to learn this rotational equivariance from data; it is built into the architecture via equivariant layers that preserve the group structure through every computation.
What this means for your system: building equivariant layers requires libraries like e3nn or escnn. The upside is 6–10× data efficiency on orientation-diverse tasks. The downside is 2–5× slower inference (tensor products in spherical harmonic space) and 3–4 weeks of implementation time versus 1 week for a standard policy. If you have >100 demos or your objects appear in fixed orientations, skip equivariance and spend the engineering time collecting more data.
Why equivariance gives sample efficiency. An SE(3)-equivariant policy with $N$ demonstrations effectively has $N \times |SE(3)|$ demonstrations — which is infinite, because $SE(3)$ is a continuous group. For any training scene, the equivariance constraint guarantees correct behavior for every possible rigid transform of that scene, without ever seeing those transforms in the data. This is not data augmentation (which is approximate and finite); it is an exact architectural constraint that provides infinite augmentation for free. The practical consequence: 10 demos with equivariance match or beat 100+ demos without it on contact-rich manipulation tasks.
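The equivariance condition is easy to test numerically, which is a useful habit when debugging an equivariant network. The sketch below uses two toy "policies" that are assumptions for illustration, not real architectures: one targets the point farthest from the centroid (equivariant by construction), the other adds a fixed world-frame offset (which breaks the symmetry). The gap $\| \pi(g \cdot o) - g \cdot \pi(o) \|$ is zero only for the first.

```python
import math

def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transform(R, t, p):
    """Apply g = (R, t) to one 3D point: R p + t."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))

def centroid(cloud):
    n = len(cloud)
    return tuple(sum(p[i] for p in cloud) / n for i in range(3))

def policy_equivariant(cloud):
    """Toy policy: target the point farthest from the centroid (e.g. a handle tip)."""
    c = centroid(cloud)
    return max(cloud, key=lambda p: math.dist(p, c))

def policy_world_offset(cloud):
    """Toy policy that breaks the symmetry: centroid plus a fixed world-frame offset."""
    c = centroid(cloud)
    return (c[0] + 0.1, c[1], c[2])

def equivariance_gap(policy, cloud, R, t):
    """Distance between pi(g . o) and g . pi(o); zero for an equivariant policy."""
    moved_cloud = [transform(R, t, p) for p in cloud]
    return math.dist(policy(moved_cloud), transform(R, t, policy(cloud)))

cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.3, 0.05, 0.02)]
R = matmul(rot_z(math.radians(90)), rot_x(math.radians(30)))
t = (0.5, -0.2, 0.1)
gap_equivariant = equivariance_gap(policy_equivariant, cloud, R, t)
gap_world_offset = equivariance_gap(policy_world_offset, cloud, R, t)
```

The same `equivariance_gap` check can be pointed at a learned policy: a gap that stays at numerical noise across random transforms is evidence that the layers are wired correctly.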
The 3D policy zoo, expanded
3D Diffusion Policy (Ze et al., 2024). Takes a sparse point cloud (~512 points from depth cameras), encodes it with a lightweight PointNet-style encoder (a shared MLP followed by max-pooling), and uses the resulting feature vector to condition a standard diffusion action head. The 3D structure is in the input representation, not the network architecture: the diffusion process itself operates in flat action space. Cheap to implement; gives the sample efficiency of 3D input without requiring equivariant network layers.
EquiBot (Yang et al., 2024). A Sim(3)-equivariant network that handles not just rotations and translations but also scaling. The architecture builds on Vector Neuron features (vector-valued channels that rotate with the input), ensuring that the output transforms correctly under any similarity transform of the input. This means the same policy that picks up a small cup can pick up a large bowl without retraining: the scale equivariance handles the geometric adaptation.
Equivariant Diffusion Policy (Wang et al., 2024). Combines SE(3)-equivariant networks with diffusion action generation. The key architectural choice: the denoiser operates in the group's irreducible representations (irreps), so the denoising process itself respects the symmetry. This is harder to implement than 3D Diffusion Policy (which only uses 3D input, not 3D-equivariant layers) but gives stronger generalization guarantees.
The engineering cost: when is equivariance worth it?
Equivariant architectures carry real costs that must be weighed against their sample efficiency benefits:
Implementation complexity. Libraries like e3nn and escnn provide equivariant layers, but they are far less mature than standard PyTorch modules. Debugging requires understanding representation theory (irreps, Wigner-D matrices, Clebsch-Gordan decomposition). A team that takes one week to implement a standard diffusion policy will take 3–4 weeks for an equivariant one.
Inference speed. Equivariant layers involve tensor products in spherical harmonic space, which are 2–5× slower than standard linear layers at comparable parameter counts. For a 200Hz control loop, this overhead can push inference time over budget.
Incompatibility with 2D priors. Foundation model vision encoders (CLIP, DINOv2) produce 2D features. Feeding these into an equivariant 3D network requires lifting — projecting 2D features into 3D space — which loses some of the pretraining benefit.
The decision heuristic: use equivariant architectures when (a) you have fewer than 100 demonstrations, (b) the task involves diverse orientations (e.g., grasping objects in arbitrary poses), and (c) you do not need to leverage 2D foundation model features. If any of these conditions is false, the standard 2D pipeline is likely the better bet.
The downside is engineering. Equivariant networks are harder to write, harder to debug, and harder to compose with foundation-model priors. Spherical-Fourier and steerable-CNN libraries exist but are far less mature than PyTorch's standard transformer. Most of the field is still betting on data + 2D + flexible architectures over symmetry-baked 3D — but the 3D camp's sample efficiency numbers keep getting harder to ignore.
The 3D input representations
Before any equivariance can be applied, the raw sensor data must be converted to a 3D representation. Three options are in common use:
Point clouds. The simplest representation. An RGB-D camera produces a depth image; back-projecting each pixel using the camera intrinsics gives a 3D point $(x, y, z)$ with an associated color $(r, g, b)$. Multiple cameras are merged by transforming each point cloud into a common world frame. The result: $N$ points $\in \mathbb{R}^{N \times 6}$ (XYZ + RGB). Typical $N = 512$–$4096$ after downsampling. Encoded by PointNet++ or DGCNN.
Voxel grids. Discretize the workspace into a 3D grid of voxels, each containing occupancy + features. Resolution is the bottleneck: a 1cm grid over a 0.5m × 0.5m × 0.5m workspace requires $50^3 = 125{,}000$ voxels. Sparse voxel representations (MinkowskiEngine) make this tractable. The advantage: 3D convolutions are well-understood and fast on GPUs. The disadvantage: fixed resolution trades off detail vs memory.
Lifted 2D features. Extract per-pixel features from a ViT (DINOv2 or CLIP), then lift each feature to 3D using the depth map. The result: a 3D feature field where each point has a high-dimensional feature vector instead of just RGB. This is the best of both worlds: 2D pretraining priors + 3D spatial structure. Used by 3D Diffuser Actor and PolarNet.
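The back-projection step behind point clouds and lifted features is the standard pinhole model. A minimal sketch follows, assuming made-up intrinsics (`FX`, `FY`, `CX`, `CY`) and the convention that a depth of 0 marks a missing return; the transform from camera frame into a shared world frame is omitted. The last two lines reproduce the dense voxel count for a cubic workspace 0.5m on a side.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth in metres to camera-frame XYZ."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def depth_to_points(depth, fx, fy, cx, cy):
    """Turn a depth image (rows of metres, 0 = no return) into a list of 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:                        # skip depth holes
                points.append(backproject(u, v, z, fx, fy, cx, cy))
    return points

# Hypothetical intrinsics for a tiny 4x3 depth image.
FX = FY = 2.0
CX, CY = 1.5, 1.0
depth = [[0.5, 0.5, 0.5, 0.5],
         [0.5, 0.0, 0.5, 0.5],
         [0.5, 0.5, 0.5, 0.5]]
points = depth_to_points(depth, FX, FY, CX, CY)

# Dense voxel count for a cubic workspace at 1 cm resolution.
voxels_per_side = round(0.5 / 0.01)
dense_voxels = voxels_per_side ** 3
```

For lifted 2D features, the same `(u, v)` index that produces the 3D point also selects the per-pixel feature vector, so each point simply carries that vector instead of RGB.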
Worked example: SE(3) equivariance in action. A policy must pick a mug from a table. The mug can appear in any orientation.
Without equivariance (2D policy, 50 demos): The policy sees the mug handle pointing right in 40/50 demos and pointing left in 10/50. At test time, the mug handle points toward the camera (never seen). The policy hesitates, predicts an averaged grasp between the "right-handle" and "left-handle" strategies, and misses the handle entirely. Success: 35%.
With SE(3) equivariance (3D policy, 50 demos): The equivariant architecture guarantees that if the policy can grasp a mug with the handle at 0°, it can grasp it at any angle. The 50 demos teach the concept of handle grasping; the equivariance constraint generalizes it to all orientations. At test time, the network behaves as if the mug were rotated to a canonical orientation, the canonical grasp predicted, and the result rotated back (equivariant layers achieve this implicitly, with no explicit canonicalization step). Success: 88%.
The sample efficiency ratio: the 2D policy would need ~300 demos (covering diverse orientations) to match the equivariant policy's 50-demo performance. The equivariance provides a 6× data efficiency multiplier for this task.
Point cloud encoding architectures
The choice of point cloud encoder determines both the quality of 3D features and whether equivariance is possible:
PointNet++ (Qi et al., 2017). The workhorse. Processes raw $(x, y, z, r, g, b)$ points through set abstraction layers: each layer samples a subset of points, groups nearby points, and applies a shared MLP to produce per-group features. Not equivariant — the MLP operates on raw coordinates, so the features change under rotation. But it is fast (~3ms for 512 points on GPU) and well-understood.
DGCNN (Wang et al., 2019). Constructs a $k$-nearest-neighbor graph in feature space at each layer and applies edge convolutions. Slightly better than PointNet++ for shape classification, comparable for manipulation. Also not equivariant.
Vector Neurons (Deng et al., 2021). Replaces scalar features with 3D vector features that rotate with the input. Each "neuron" outputs a vector in $\mathbb{R}^3$ instead of a scalar, and the network operations (linear layers, nonlinearities) are designed to be SO(3)-equivariant. Used in EquiBot.
Spherical CNNs / e3nn (Geiger et al., 2022). Operate in the basis of spherical harmonics, using tensor products of irreducible representations to ensure exact SE(3) equivariance. The most principled approach but also the slowest: tensor products are computationally expensive, and the library ecosystem is immature compared to standard PyTorch.
| Encoder | Equivariant? | Speed (512 pts) | Implementation difficulty | Best for |
|---|---|---|---|---|
| PointNet++ | No | ~3ms | Easy (PyTorch Geometric) | General-purpose 3D policies |
| DGCNN | No | ~4ms | Easy | Shape-sensitive tasks |
| Vector Neurons | SO(3) | ~8ms | Moderate | Rotation-diverse tasks |
| e3nn / Spherical CNN | SE(3) | ~15ms | Hard | Maximum sample efficiency |
Hybrid strategies
Lift, don't replace. Keep the ViT backbone. Use it to extract per-pixel features, then lift those features into 3D via the camera intrinsics + depth. The downstream policy operates on 3D feature points. You get 3D structure without losing the 2D pretraining.
Canonicalize the input. Before feeding a point cloud into a policy, rotate it to a canonical orientation. The policy itself is not equivariant; the preprocessor handles symmetry.
Use 3D only at the contact phase. Run a 2D VLA for high-level reasoning and reaching, switch to a 3D contact-aware policy for the final approach. The slow-fast split, in 3D form.
Worked example: sample efficiency gain from equivariance. Consider a pick task where the object can appear at any of 12 orientations on a table. A non-equivariant 2D policy needs demonstrations at each orientation — 12×50 = 600 demos minimum. An SE(3)-equivariant 3D policy needs demos at one orientation — 50 demos total — because the equivariance constraint guarantees generalization to all orientations. At test time, the object appears at a 13th unseen orientation. The 2D policy has never seen it and must interpolate. The 3D equivariant policy handles it by construction, because the mapping $f(R \cdot \text{input}) = R \cdot f(\text{input})$ is baked into the architecture.
Equivariance vs invariance, precisely. These two properties are often confused. An invariant function satisfies $f(g \cdot x) = f(x)$ — the output does not change when the input is transformed. An equivariant function satisfies $f(g \cdot x) = g \cdot f(x)$ — the output transforms in the same way as the input.
For robot policies, equivariance is the correct constraint, not invariance. If you rotate the scene 90°, the correct action rotates 90° too (equivariance). An invariant policy would predict the same action regardless of rotation — which is wrong.
The distinction matters architecturally: equivariant layers (e3nn, escnn) propagate group actions through the network. Invariant layers (max-pooling over orientations, rotation-invariant features) discard group information. Using invariant features and then trying to predict orientation-dependent actions is fundamentally ill-posed.
The canonicalization alternative
If equivariant architectures are too expensive to implement, there is a simpler alternative: canonicalize the input. Before feeding a point cloud to a standard (non-equivariant) policy, rotate it to a canonical frame. The policy only ever sees point clouds in the canonical orientation, so it only needs to learn one orientation.
Two canonicalization strategies:
PCA-based. Compute the principal axes of the point cloud and rotate so the first principal axis aligns with the $x$-axis. This is fast (~1ms) and deterministic, but fails for symmetric objects (a sphere has no principal axis).
Learned canonicalization. Train a small network to predict the canonical rotation from the point cloud. This handles arbitrary objects but requires training data with canonical annotations. EquiBot and some 3D Diffusion Policy variants use this approach.
The tradeoff: canonicalization is a preprocessing step that provides approximate equivariance without requiring equivariant network layers. It is much easier to implement than true equivariance, but it introduces errors when the canonicalization is imperfect (which it always is for novel objects). True equivariance is exact by construction but costs 2–5× in implementation effort and inference time.
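A minimal sketch of PCA-based canonicalization, simplified to the planar (yaw-only) case that is typical for objects resting on a table; this restriction is an assumption made to keep the example dependency-free, and the full 3D version would use an eigendecomposition of the 3×3 covariance. For a 2×2 covariance the principal-axis angle has the closed form $\theta = \tfrac{1}{2}\operatorname{atan2}(2\sigma_{xy}, \sigma_{xx} - \sigma_{yy})$.

```python
import math

def principal_yaw(points_xy):
    """Angle of the first principal axis of 2D points (closed form, 2x2 covariance)."""
    n = len(points_xy)
    mx = sum(p[0] for p in points_xy) / n
    my = sum(p[1] for p in points_xy) / n
    sxx = sum((p[0] - mx) ** 2 for p in points_xy) / n
    syy = sum((p[1] - my) ** 2 for p in points_xy) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points_xy) / n
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)

def rotate(points_xy, angle):
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points_xy]

def canonicalize(points_xy):
    """Center the cloud and rotate its principal axis onto x (defined up to 180 degrees)."""
    n = len(points_xy)
    mx = sum(p[0] for p in points_xy) / n
    my = sum(p[1] for p in points_xy) / n
    centered = [(x - mx, y - my) for x, y in points_xy]
    return rotate(centered, -principal_yaw(points_xy))

# An elongated object seen at two different yaws ends up on the same axis.
obj = [(-0.10, 0.00), (-0.05, 0.01), (0.00, -0.01), (0.05, 0.01), (0.10, 0.00)]
yaw_a = principal_yaw(canonicalize(rotate(obj, math.radians(40))))
yaw_b = principal_yaw(canonicalize(rotate(obj, math.radians(-75))))
```

Two limitations show up directly in the formula: when $\sigma_{xx} \approx \sigma_{yy}$ and $\sigma_{xy} \approx 0$ (a disc or a square) the angle is determined by noise, and the axis is only defined up to a 180° flip, so a downstream policy may still see two canonical poses per object.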
Depth sensor considerations
Every 3D policy depends on a depth sensor to produce point clouds. The choice of sensor has a direct impact on policy performance:
Structured light (Intel RealSense D435). Projects an IR pattern and triangulates depth. Depth noise: $\sim$1% of range (5mm at 50cm). Fails on shiny, transparent, and black surfaces (IR is absorbed or reflected specularly). The most common sensor in manipulation research. ~$300.
Time-of-flight (Azure Kinect, L515). Measures photon travel time. More robust to surface material than structured light, but higher noise at close range (10mm at 50cm). Better for scenes with mixed materials. ~$400.
Stereo (ZED 2). Triangulates from two RGB cameras. No active illumination, so it works outdoors and in bright light. But depth accuracy depends on texture — featureless surfaces (white walls, smooth plastic) produce noisy depth. ~$450.
The practical advice: use Intel RealSense D435 for tabletop manipulation (cheap, good enough), Azure Kinect for scenes with transparent objects (the ToF sensor handles glass), and ZED for outdoor or mobile applications. Always filter point clouds with statistical outlier removal before feeding them to the policy — depth sensors produce spurious points that can corrupt the 3D features.
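Statistical outlier removal is simple enough to sketch in a few lines: compute each point's mean distance to its $k$ nearest neighbours and drop points whose value sits far above the cloud-wide average. The version below is a brute-force $O(n^2)$ illustration with hypothetical defaults for `k` and `std_ratio`; production code would use a spatial index or a point cloud library.

```python
import math
from statistics import mean, pstdev

def remove_outliers(points, k=3, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours is unusually large."""
    knn_mean = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        knn_mean.append(mean(dists[:k]))
    threshold = mean(knn_mean) + std_ratio * pstdev(knn_mean)
    return [p for p, d in zip(points, knn_mean) if d <= threshold]

# A 3x3 patch of table-top points plus one spurious "flying pixel" 40 cm behind it.
patch = [(0.01 * i, 0.01 * j, 0.50) for i in range(3) for j in range(3)]
spurious = (0.01, 0.01, 0.90)
filtered = remove_outliers(patch + [spurious])
```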
When 3D is not worth the engineering cost
Despite the sample efficiency arguments, most deployed robot policies in 2026 still use 2D images. Three reasons:
Foundation model priors are 2D. CLIP, DINOv2, SigLIP, and every large vision encoder are trained on 2D images. No 3D foundation model exists at comparable scale. Using 3D inputs means giving up the most powerful visual priors available.
Depth sensors are noisy. RGB-D cameras (Intel RealSense, Azure Kinect) have depth noise of 1–5mm at close range and 10–30mm at 1m. Shiny, transparent, and dark surfaces produce depth holes. Point clouds derived from noisy depth are themselves noisy — the 3D input is less clean than the 2D input in practice.
Most tasks don't need 3D equivariance. For pick-and-place with a top-down camera and a fixed set of objects, a 2D policy with 200 demos works fine. The equivariance advantage only manifests when objects appear in diverse orientations — which is common in research benchmarks but less common in structured industrial settings.
The practical decision: use 3D only when the task involves diverse 3D orientations (bin picking, random object poses on a table) AND you have fewer than 100 demos AND you do not need language conditioning. Otherwise, bet on 2D + more data.
3D is to robot policies what convolution was to vision: a re-parameterization that does not give you new capabilities, but lets the network learn the capabilities it was always supposed to learn from a fraction of the data.
12·7 DVA: Direct Video Action
Imagine success, then figure out the motor commands. A two-model architecture that uses video prediction as an intermediate representation for robot control.
Every policy we have discussed so far maps observations directly to actions. Direct Video Action (DVA) takes a detour: first predict what the future looks like, then predict what actions would produce that future. The bet is that video prediction — trained on internet-scale data — captures richer physics, geometry, and task understanding than any robot-only dataset can provide.
The two-model architecture
DVA decomposes the policy into two learned components:
Video model $\mathcal{V}$. A causal video diffusion (or autoregressive) model that takes the current observation frame(s) $o_t$ and a task specification (language instruction $\ell$ or goal image $g$) and generates $N$ future frames: $\hat{I}_{t+1}, \hat{I}_{t+2}, \ldots, \hat{I}_{t+N}$. This is the "imagination" — it predicts what success looks like, without knowing anything about joints or motors.
Inverse dynamics model (IDM) $\phi$. A small network that takes two consecutive frames $(I_t, I_{t+1})$ and predicts the action $a_t$ that would move the robot from the scene in $I_t$ to the scene in $I_{t+1}$. This is the "execution" — it converts visual plans into motor commands.
At inference, the full pipeline is: observe $o_t$ → video model generates $N$ future frames → IDM converts each adjacent pair to an action → execute the first $K$ actions → re-observe and replan. The receding-horizon structure is identical to Diffusion Policy's action chunking, except the "chunk" is derived from imagined video rather than directly predicted.
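The control flow of that pipeline can be sketched with toy stand-ins. Everything here is an assumption made for illustration: a "frame" is just a scalar gripper position rather than an image, `video_model` imagines frames that step toward the goal, and `idm` returns the difference between two frames. Only the receding-horizon structure (imagine $N$ frames, convert pairs to actions, execute $K$, replan) reflects the description above.

```python
def video_model(frame, goal, n_frames, max_step=0.02):
    """Stand-in for V: 'imagine' n future frames moving toward the goal."""
    frames, current = [], frame
    for _ in range(n_frames):
        delta = max(-max_step, min(max_step, goal - current))
        current = current + delta
        frames.append(current)
    return frames

def idm(frame_a, frame_b):
    """Stand-in for phi: the action that explains the change between two frames."""
    return frame_b - frame_a

def dva_control(start, goal, n_frames=8, k_execute=4, max_replans=20, tol=1e-6):
    """Receding horizon: imagine N frames, convert to actions, execute K, replan."""
    position, replans = start, 0
    while abs(goal - position) > tol and replans < max_replans:
        imagined = [position] + video_model(position, goal, n_frames)
        actions = [idm(a, b) for a, b in zip(imagined, imagined[1:])]
        for action in actions[:k_execute]:
            position += action              # toy environment: action is a displacement
        replans += 1
    return position, replans

final_position, n_replans = dva_control(start=0.0, goal=0.25)
```

Executing only the first `k_execute` of the `n_frames` imagined steps is what lets the loop absorb errors in the imagined video: each replan starts from the real observation, not from the model's own prediction.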
Why video prediction as an intermediate representation?
Internet-scale pretraining. Billions of video frames exist on the internet. None of them have action labels, but all of them teach physics: objects fall, liquids pour, hands grasp. A video model pretrained on this data has priors that no robot dataset can match.
Task-agnostic planning. The video model doesn't need to know what a robot arm is. It learns "if you see a hand approaching a cup and the instruction says 'pick up the cup,' the next frames show the hand grasping the cup." The IDM handles the embodiment-specific translation.
Visual reasoning for free. Complex tasks that require spatial reasoning (stacking, insertion, tool use) are hard to express in action space but natural in pixel space. The video model can "see" the solution before the IDM computes the trajectory.
Training pipeline
The two models are trained separately, on different data:
| Component | Training data | Loss | Scale |
|---|---|---|---|
| Video model $\mathcal{V}$ | Internet video + robot video | Diffusion / AR next-frame prediction | Billions of frames |
| IDM $\phi$ | Robot demonstrations only | MSE on predicted actions | Thousands of trajectories |
The video model is pretrained on internet video (no actions needed), then optionally fine-tuned on robot video to improve visual realism in the robot's workspace. The IDM is trained only on robot demonstrations where ground-truth actions are available. This separation is the key economic insight: you can scale the video model with cheap, abundant internet data while the IDM stays small and robot-specific.
Conditioning the video model
The video model conditions on the current frame(s) plus a task specification. Two conditioning modes:
Language conditioning. A text encoder (CLIP, T5) embeds the instruction $\ell$ into tokens that cross-attend into the video diffusion process. "Pick up the red cup and place it on the saucer" → the model generates frames showing exactly that.
Goal-image conditioning. The model is given a goal frame $g$ showing the desired end state. It generates intermediate frames that connect the current observation to the goal. This is visual planning in the literal sense: the model fills in the trajectory between "here" and "there."
The IDM loss
In plain English: show the model two consecutive photos of the workspace. Ask it: "what did the robot do between these two snapshots?" The model guesses a 7D action vector, and you penalize it for how far off it was from the action the robot actually took. That is the entire training signal.
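The inverse dynamics model is trained to predict the action that transitions between two consecutive frames. Given a frame pair $(I_t, I_{t+1})$ from a robot trajectory and the ground-truth action $a_t^*$ (the action the robot actually executed between those frames), the loss is:
$$\mathcal{L}_{\text{IDM}}(\phi) = \mathbb{E}_{(I_t,\, a_t^*,\, I_{t+1}) \sim \mathcal{D}} \left[ \left\| \phi(I_t, I_{t+1}) - a_t^* \right\|^2 \right]$$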
$(I_t, a_t^*, I_{t+1})$ — a transition tuple from the robot demonstration dataset $\mathcal{D}$. $I_t$ is the observation image at time $t$, $a_t^*$ is the action the robot executed, and $I_{t+1}$ is the resulting observation.
$\phi(I_t, I_{t+1})$ — the IDM's predicted action. Takes two consecutive images and outputs the action that would transition the scene from $I_t$ to $I_{t+1}$. Architecturally, this is typically a ResNet or ViT that encodes both frames, concatenates the features, and passes them through an MLP.
$\| \cdot \|^2$ — squared L2 norm (MSE). Works well because the IDM predicts a single deterministic action per frame pair. Multimodality is not an issue here — given two specific frames, there is essentially one correct action.
$a_t^*$ — the ground-truth robot action. Typically a 7D vector: $[\Delta x, \Delta y, \Delta z, \Delta \text{roll}, \Delta \text{pitch}, \Delta \text{yaw}, \text{gripper}]$ in end-effector space.
In code: loss = F.mse_loss(idm(frame_t, frame_t1), action_t). That is literally it. The IDM is a small encoder-MLP that takes two 224×224 images and outputs a 7D action vector. Training takes 2–4 hours on a single GPU. The failure mode to watch: if the two frames look nearly identical (slow motion phases), the predicted action is ill-conditioned and noisy.
The IDM is small (10–50M parameters), fast to train (a few hours on a single GPU), and does not need internet data. It only needs to learn the mapping from "visual change" to "motor command" for a specific robot embodiment.
Video model conditioning: the mechanics
The video model is typically a causal video diffusion transformer (similar to Sora's architecture). It conditions on two signals simultaneously:
Language conditioning via cross-attention. The text instruction $\ell$ is encoded by a frozen text encoder (CLIP or T5) into a sequence of text tokens $z_\ell \in \mathbb{R}^{M \times d}$, where $M$ is the number of text tokens and $d$ is the embedding dimension. At every attention layer of the video diffusion model, the video tokens cross-attend to these text tokens: $\text{Attn}(Q_{\text{video}}, K_{\text{text}}, V_{\text{text}})$. This is the same mechanism that lets Stable Diffusion condition on text prompts — the video model learns to align its generated frames with the semantic content of the instruction.
Current-frame conditioning via concatenation. The current observation frame $o_t$ is typically concatenated as the first frame of the sequence that the video model generates. The model is trained to predict frames $\hat{I}_{t+1}, \ldots, \hat{I}_{t+N}$ conditioned on $I_t$ being real. This grounds the generation: the model cannot hallucinate an entirely different scene, because the first frame is anchored to reality.
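The text cross-attention step can be sketched in a few lines of NumPy. This is a minimal single-head illustration under assumed toy shapes; the projection matrices `W_q`, `W_k`, `W_v`, the token counts, and the function name `cross_attend` are hypothetical and not taken from any particular video model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(video_tokens, text_tokens, W_q, W_k, W_v):
    """Single-head cross-attention: queries from video, keys/values from text."""
    Q = video_tokens @ W_q                      # (T_video, d)
    K = text_tokens @ W_k                       # (M, d)
    V = text_tokens @ W_v                       # (M, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (T_video, M)
    weights = softmax(scores, axis=-1)          # each video token attends over text tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 64                                          # assumed embedding width
video_tokens = rng.normal(size=(128, d))        # assumed number of video latent tokens
text_tokens = rng.normal(size=(12, d))          # M = 12 text tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, weights = cross_attend(video_tokens, text_tokens, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to one, so every video token receives a convex combination of the text value vectors. Real models use multiple heads and add the result back through a residual connection.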
The IDM generalization gap
The IDM is trained on real frame pairs from robot demonstrations. At inference, it must process imagined frame pairs from the video model. These imagined frames look slightly different from real frames — subtly blurred textures, imperfect lighting, occasionally physically impossible configurations. The IDM must generalize across this domain gap.
Three mitigation strategies:
Train the IDM on augmented frame pairs. Apply color jitter, Gaussian blur, random crops, and compression artifacts to the real training frames. This makes the IDM robust to the kinds of imperfections the video model produces.
Fine-tune the video model on robot data. After internet-scale pretraining, fine-tune the video model on the robot's actual camera feed. This closes the visual domain gap between imagined and real frames, making the IDM's job easier.
Use a discriminator to filter impossible frames. Train a small classifier to distinguish physically plausible frames from implausible ones (hand passing through table, objects floating). Reject imagined frame sequences that fail the plausibility check and re-sample. This adds inference cost but prevents the IDM from receiving garbage inputs.
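The third strategy amounts to rejection sampling over imagined rollouts. The sketch below assumes two caller-supplied callables, `generate` (the video model) and `is_plausible` (the frame classifier); both names and the `max_tries` cap are hypothetical.

```python
import random

def sample_plausible_rollout(generate, is_plausible, max_tries=4):
    """Re-sample imagined rollouts until every frame passes the plausibility check.

    generate():       returns a list of imagined frames (hypothetical video-model call)
    is_plausible(f):  True if frame f looks physically plausible (hypothetical classifier)
    Returns (frames, tries), or (None, max_tries) if every attempt was rejected.
    """
    for attempt in range(1, max_tries + 1):
        frames = generate()
        if all(is_plausible(f) for f in frames):
            return frames, attempt
    return None, max_tries

# Toy stand-ins: a "frame" is just a plausibility score in [0, 1].
rng = random.Random(0)

def toy_generate():
    return [rng.random() for _ in range(8)]

def toy_is_plausible(score):
    return score > 0.05

frames, tries = sample_plausible_rollout(toy_generate, toy_is_plausible, max_tries=10)
```

The caller has to decide what to do when the function returns `None`, for example holding position and re-observing. Each extra try costs one full video generation, which is where the added inference cost comes from.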
The error propagation problem. DVA chains two learned models in series. If the video model imagines frame $\hat{I}_{t+3}$ where the gripper has passed through the table, the IDM dutifully predicts the action that would achieve this impossible configuration — which, when executed on the real robot, produces a collision or a wild trajectory. This cascading failure mode is DVA's Achilles heel. The receding-horizon replanning (re-observe reality every $K$ steps) limits the blast radius, but does not eliminate it. A single bad imagined frame within the executed window can cause a real-world failure before the system has a chance to replan.
Worked example: IDM predicts $\Delta$pose from two 224×224 frames. Two consecutive frames from a Franka robot picking a block. Frame $I_t$ shows the gripper 5cm above the block. Frame $I_{t+1}$ shows it 3cm above.
Encoding: Both frames pass through a frozen DINOv2 ViT-B/14, producing 257 tokens each (256 patch + 1 CLS). We take the CLS tokens: $z_t \in \mathbb{R}^{768}$, $z_{t+1} \in \mathbb{R}^{768}$.
Feature fusion: Concatenate: $[z_t; z_{t+1}] \in \mathbb{R}^{1536}$. Pass through a 3-layer MLP: $1536 \to 512 \to 256 \to 7$.
Prediction: $\phi(I_t, I_{t+1}) = [0.001, -0.002, -0.020, 0.003, -0.001, 0.000, 0.85]$. The dominant component is $\Delta z = -0.020$ (2cm downward motion), matching the visual change between frames. The gripper value 0.85 means "mostly closed" — the robot is about to grasp.
Ground truth: $a_t^* = [0.000, -0.001, -0.022, 0.002, 0.000, 0.001, 0.85]$. The squared-error loss: $\|a_t^* - \hat{a}_t\|^2 = 0.001^2 + 0.001^2 + 0.002^2 + 0.001^2 + 0.001^2 + 0.001^2 + 0.0^2 = 9 \times 10^{-6}$ (averaged over the 7 dimensions, as a mean-reduced MSE reports it, this is about $1.3 \times 10^{-6}$). Tiny: the IDM has learned this mapping well.
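The arithmetic in this worked example can be checked directly. The sketch uses plain Python and the two vectors given above; the helper names `squared_l2` and `mse` are illustrative. It separates the summed squared error (the $9 \times 10^{-6}$ figure) from the per-dimension mean that a mean-reduced MSE loss would report.

```python
pred   = [0.001, -0.002, -0.020, 0.003, -0.001, 0.000, 0.85]
target = [0.000, -0.001, -0.022, 0.002,  0.000, 0.001, 0.85]

def squared_l2(a, b):
    """Sum of squared per-dimension errors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mse(a, b):
    """Mean of squared per-dimension errors."""
    return squared_l2(a, b) / len(a)

sq = squared_l2(pred, target)
print(f"squared L2: {sq:.2e}")                 # squared L2: 9.00e-06
print(f"MSE:        {mse(pred, target):.2e}")  # MSE:        1.29e-06
```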
IDM architecture choices
The inverse dynamics model is small and fast but its design still matters. Two architectures dominate:
Siamese encoder + MLP. Both frames pass through the same frozen encoder (shared weights). The CLS tokens or pooled features are concatenated and fed to a 3-layer MLP that predicts the action. This is the simplest and most common design. Pros: fast, easy to implement, leverages pretrained features. Cons: the concatenated CLS tokens lose spatial information — fine-grained motion (rotation, small displacements) is harder to predict.
Feature-difference encoder. Both frames are encoded, then the difference of their feature maps (or spatial tokens) is computed: $\Delta z = z_{t+1} - z_t$. This difference map is processed by a small CNN or transformer to predict the action. Pros: the subtraction highlights what changed between frames, suppressing static background. Cons: requires spatial features (not just CLS tokens), and the subtraction is sensitive to encoder alignment.
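The two fusion strategies differ only in what is handed to the action head. The NumPy sketch below uses random stand-ins for the frozen encoder's output and untrained weights. It assumes the CLS token sits at index 0, and it replaces the small CNN or transformer of the feature-difference variant with simple mean-pooling, so treat it as a shape-level illustration rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random (untrained) weights for an MLP with the given layer widths."""
    return [(rng.normal(size=(m, n)) * 0.02, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, layers):
    """Linear layers with ReLU between them; no activation on the output."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

# Stand-ins for frozen-encoder outputs: 1 CLS + 256 patch tokens, 768-dim each.
tokens_t = rng.normal(size=(257, 768))
tokens_t1 = rng.normal(size=(257, 768))

# Variant 1: Siamese CLS concat -> MLP (1536 -> 512 -> 256 -> 7).
concat_head = init_mlp([1536, 512, 256, 7])
cls_pair = np.concatenate([tokens_t[0], tokens_t1[0]])
action_concat = mlp(cls_pair, concat_head)

# Variant 2: per-patch feature difference, mean-pooled, then a small MLP.
diff_head = init_mlp([768, 256, 7])
delta = tokens_t1[1:] - tokens_t[1:]           # (256, 768): what changed per patch
action_diff = mlp(delta.mean(axis=0), diff_head)

print(action_concat.shape, action_diff.shape)
```

Mean-pooling discards the spatial layout that motivates the difference encoder in the first place; a real implementation would keep the 16×16 grid and process it with convolutions or attention.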
The IDM's accuracy is bounded by the visual resolution of the encoder and the magnitude of the actions. Small actions ($<$1mm displacement between frames) produce nearly identical frames, making the prediction ill-conditioned. This is why DVA typically uses longer frame gaps ($\hat{I}_t$ vs $\hat{I}_{t+2}$ instead of $\hat{I}_{t+1}$) for slow-motion phases of the task.
Inference: the full loop
Observe. Capture current frame $o_t$ and receive language instruction $\ell$.
Imagine. Video model generates $N$ future frames: $\hat{I}_{t+1}, \ldots, \hat{I}_{t+N}$. Typical $N = 8$–$16$.
Translate. IDM converts each adjacent pair to an action: $\hat{a}_{t+i} = \phi(\hat{I}_{t+i}, \hat{I}_{t+i+1})$ for $i = 0, \ldots, N-1$, where $\hat{I}_t := o_t$ is the real current frame. The $N+1$ frames (one real, $N$ imagined) yield $N$ actions.
Execute. Send the first $K$ actions ($K \leq N$) to the robot. Typical $K = 4$–$8$.
Replan. After $K$ steps, re-observe and repeat from step 1.
The replanning loop is essential. Video predictions degrade over long horizons — small errors compound frame by frame. By re-observing reality every $K$ steps, DVA corrects for drift. This is the same receding-horizon principle as Diffusion Policy's action chunking, just applied to imagined frames instead of directly predicted actions.
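The loop above can be sketched as a short function. The four callables (`observe`, `imagine`, `idm`, `execute`) are hypothetical stand-ins for the camera, the video model, the inverse dynamics model, and the robot interface. The sketch treats the real observation as frame 0 of the sequence, so that N imagined frames give N adjacent pairs.

```python
def dva_control_loop(observe, imagine, idm, execute,
                     n_frames=8, k_execute=4, n_replans=3):
    """Receding-horizon DVA loop built from caller-supplied components."""
    if not 1 <= k_execute <= n_frames:
        raise ValueError("k_execute must be between 1 and n_frames")
    executed = []
    for _ in range(n_replans):
        current = observe()
        # The real frame anchors the sequence; imagined frames follow it.
        frames = [current] + list(imagine(current, n_frames))
        # One action per adjacent frame pair.
        actions = [idm(a, b) for a, b in zip(frames[:-1], frames[1:])]
        # Execute only the first K actions, then re-observe.
        for action in actions[:k_execute]:
            execute(action)
            executed.append(action)
    return executed

# Toy 1-D world: a "frame" is the gripper height, an action is a height delta.
state = {"z": 0.30}

def toy_execute(dz):
    state["z"] += dz

log = dva_control_loop(
    observe=lambda: state["z"],
    imagine=lambda z, n: [z - 0.01 * (i + 1) for i in range(n)],
    idm=lambda frame_a, frame_b: frame_b - frame_a,
    execute=toy_execute,
)
print(len(log), round(state["z"], 3))
```

In the toy run, three replans of four 1 cm descents move the gripper from 0.30 to 0.18. The imagined frames beyond the first four in each rollout are discarded, which is the price of limiting drift.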
Worked example: error propagation in DVA. The video model generates 8 future frames for a "pick up cup" task. Frames 1–5 are visually realistic: the gripper approaches the cup from above. Frame 6 has a subtle error: the gripper's shadow is missing, and the cup appears slightly transparent. Frame 7: the gripper appears to pass through the cup's rim (physically impossible). Frame 8: the cup is "grasped" but the fingers are in the wrong position.
IDM on frames 5→6: predicts $\Delta z = -0.015$m (continue descending). Reasonable — the visual change is subtle.
IDM on frames 6→7: predicts $\Delta z = -0.030$m (aggressive descent through the cup). The IDM has never seen a gripper pass through a solid object in training, so it predicts the best-fit action for the impossible visual transition. This action will cause a collision on the real robot.
IDM on frames 7→8: predicts grasp closure. But the gripper is in the wrong position from the previous bad action.
With replanning ($K = 4$): only frames 1–4 are executed. The robot re-observes after frame 4. The bad frames (6–8) are never executed. The replanning catches the error before it matters. This is why short execution horizons ($K = 4$–$6$) are critical for DVA — they limit the time window during which video prediction errors can accumulate.
Computational cost of DVA
The elephant in the room: video generation is expensive. Generating 8 frames at 256×256 resolution with a video diffusion model (50 denoising steps) takes 2–5 seconds on an A100 GPU. For a 10Hz control loop with $K = 4$ executed actions (0.4s between replans), the video model must generate in under 400ms — which requires either a heavily distilled model, fewer denoising steps (10–15 with quality loss), or a latent-space video model that generates at lower resolution and upsamples.
This computational constraint is the primary reason DVA has not replaced direct action prediction. A Diffusion Policy generates a chunk of 7-dimensional actions in ~30ms. A video model generates the same information encoded in $256 \times 256 \times 3 \times 8 = 1.57$M values, at roughly 100× the cost. The information-theoretic argument is clear: predicting in action space is vastly more efficient than predicting in pixel space, unless the pixel-space predictions carry internet-scale priors that the action-space model cannot access.
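The latency budget and the pixel-versus-action size gap are simple arithmetic. The sketch below reuses the numbers quoted in this section; the comparison to a 7-D action per frame over the same 8 steps is an illustrative assumption.

```python
control_hz = 10            # control rate from the text
k_execute = 4              # actions executed between replans
gen_time_s = (2.0, 5.0)    # quoted generation time range for 8 frames

replan_budget_s = k_execute / control_hz
speedup_needed = tuple(round(t / replan_budget_s, 1) for t in gen_time_s)

pixel_values = 256 * 256 * 3 * 8   # values in 8 RGB frames at 256x256
action_values = 7 * 8              # one 7-D action for each of the same 8 steps

print(replan_budget_s)                 # 0.4
print(speedup_needed)                  # (5.0, 12.5)
print(pixel_values)                    # 1572864
print(pixel_values // action_values)   # 28086
```

So the video model needs a 5× to 12.5× speedup to fit the replan window, while carrying roughly 28,000 times more output values than the actions it is ultimately used to produce.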
Video model architectures for DVA
The video generation component in DVA pipelines uses one of three architectures, each with different tradeoffs:
Video diffusion transformers (UniPi, UniSim). The dominant architecture. A causal transformer operates on latent tokens (compressed from pixels by a VAE encoder). The diffusion process adds and removes noise from the latent sequence. Text conditioning is via cross-attention; the current frame is supplied as the first frame of the sequence. Typical model size: 1–3B parameters. Generation cost: 2–8 seconds for 8 frames at 256×256.
Autoregressive video transformers (Genie). Predict the next frame token-by-token, conditioned on previous frames. Faster per-frame generation but lower visual quality than diffusion. Action-conditioned variants (Genie 2) can also generate video given actions, inverting the DVA pipeline — useful for building world models.
Subgoal image generators (SuSIE). Instead of generating a full video, generate a single goal image showing the desired outcome. A low-level policy then navigates from the current observation to the goal image. This dramatically reduces generation cost (one image vs eight) and avoids temporal consistency issues, at the expense of losing intermediate trajectory information.
| Approach | Output | Generation cost | Planning quality | Error accumulation |
| --- | --- | --- | --- | --- |
| Full video diffusion | $N$ future frames | High (2–8s) | High (dense trajectory) | High (per-frame errors compound) |
| Autoregressive video | $N$ future frames | Medium (1–3s) | Medium | High |
| Single subgoal image | 1 goal frame | Low (0.3–1s) | Low (no trajectory) | Low (no cascading) |
| Trajectory sketch | 2D overlay on current frame | Very low (<0.5s) | Medium (2D path only) | Low |
What DVA gains and loses
| | DVA (video + IDM) | Diffusion Policy / ACT (direct) |
| --- | --- | --- |
| Internet pretraining | Yes: video model uses billions of internet frames | No: robot data only |
| Visual reasoning | Strong: video model plans in pixel space | Implicit only: reasoning must emerge in action space |
| Cross-embodiment | Video model transfers; only IDM is embodiment-specific | Entire policy is embodiment-specific |
| Error propagation | Two models in series: video errors compound through IDM | Single model: no cascading errors |
| Inference cost | High: video generation is expensive (diffusion over pixels) | Low: diffusion over action vectors (7D vs 224×224×3) |
| Action precision | Limited by video resolution and IDM accuracy | Direct: sub-millimeter possible |
When to use DVA vs direct action prediction
The decision between DVA and direct action prediction (Diffusion Policy, VLA) comes down to three factors:
1. Data availability. DVA shines when you have abundant internet video but limited robot demonstrations. The video model can be pretrained on billions of internet frames (no actions needed), and only the small IDM requires robot data. If you have <50 robot demonstrations but the task involves common objects, DVA's internet priors give it a significant edge.
2. Task complexity. DVA's video prediction excels at tasks requiring spatial reasoning: stacking, insertion, tool use, arrangement. The video model can "see" the solution in pixel space before the IDM computes the trajectory. Direct action prediction struggles with these tasks because reasoning about spatial outcomes in action space is harder than reasoning in pixel space.
3. Latency requirements. DVA's inference cost is dominated by video generation (2–8 seconds). Direct action prediction runs in <50ms. If the task requires reactive control (catching objects, responding to perturbations), DVA is not viable. If the task is slow enough to tolerate 2-second planning pauses between execution phases, DVA is competitive.
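The three factors can be folded into a rule-of-thumb function. This is only an encoding of the heuristics stated above; the function name `prefer_dva`, its arguments, and the default 2-second latency are illustrative choices, and the thresholds should be tuned to your own setup.

```python
def prefer_dva(n_robot_demos, needs_reactive_control, spatial_reasoning_heavy,
               tolerable_pause_s, dva_latency_s=2.0):
    """Return True when the three factors favor DVA over direct action prediction."""
    # Factor 3 acts as a hard gate: reactive tasks cannot wait for video generation.
    if needs_reactive_control or tolerable_pause_s < dva_latency_s:
        return False
    # Factor 1: with very few demos, internet priors matter most.
    data_scarce = n_robot_demos < 50
    # Factor 2: spatial-reasoning tasks benefit from planning in pixel space.
    return data_scarce or spatial_reasoning_heavy

print(prefer_dva(30, False, False, tolerable_pause_s=3.0))   # True
print(prefer_dva(500, False, False, tolerable_pause_s=3.0))  # False
```

Latency is checked first because it is the only factor that rules DVA out entirely; the other two only tilt the balance.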
The DVA paradigm is most compelling not as a replacement for direct action prediction, but as a pretraining strategy. Train a video model on internet data, use it to generate synthetic robot trajectories (via an IDM), and use those synthetic trajectories to pretrain a direct action prediction model. This pipeline — internet video → imagined robot trajectories → pretrained policy — combines DVA's data advantage with direct prediction's inference speed.
The family tree
DVA is not one paper but a paradigm. The key members:
UniPi (Du et al., 2023). The founding paper. Text-conditioned video diffusion model + IDM. Demonstrated the idea on simulated tasks. arXiv:2302.00111
SuSIE (Black et al., 2023). Generates a single subgoal image rather than a full video, then uses a low-level policy to reach it. Cheaper and more robust than full video generation. The subgoal is generated by an image-editing model conditioned on the instruction. arXiv:2310.10639
RT-Trajectory (Gu et al., 2023). Draws a coarse trajectory sketch over the current image rather than predicting future frames. The policy conditions on this 2D trajectory overlay. The "video" is simplified to a single annotated frame. arXiv:2311.01977
Genie / Genie 2 (Bruce et al., 2024). Learned world models that can generate interactive video from actions. Genie operates in the opposite direction — actions → video — but the shared infrastructure (causal video transformers, latent dynamics) is the same. arXiv:2402.15391
UniSim (Yang et al., 2023). A universal simulator that generates realistic video conditioned on diverse actions (robot commands, human actions, camera motion). Can serve as both the video model in a DVA pipeline and a training simulator. arXiv:2310.06114
The deeper lesson
DVA is a bet on representation. The claim is that pixels — not actions, not latent vectors, not language — are the natural intermediate representation for robot planning, because pixels are what internet-scale pretraining understands best. The counterargument is that pixels are wasteful: you generate 224×224×3×N values only to extract 7×N action values from them. The answer, for now, is that the waste is worth it when the pretraining priors are strong enough. As action-space foundation models (VLAs) improve, the balance may shift — but in 2026, the video-prediction camp remains the only group that can leverage truly internet-scale data for robot control.
DVA in the broader context
DVA is best understood as one instance of a broader pattern: using a foundation model's representation space as an intermediate layer for robot control. The foundation model provides priors that no robot dataset can match; the robot-specific component provides the embodiment mapping. The variants differ in which foundation model and which intermediate representation:
| Paradigm | Foundation model | Intermediate representation | Robot-specific component |
| --- | --- | --- | --- |
| DVA (video) | Video diffusion model | Imagined future frames | Inverse dynamics model (IDM) |
| VLA (language) | Vision-language model | Hidden states / tokens | Action head (diffusion/flow/discrete) |
| Language planning | LLM | Text plans / code | Low-level policy per primitive |
| Value-map planning | LLM + VLM | 3D scalar fields | Motion planner (MPC/optimization) |
The trend is toward unifying these approaches: a single VLM that can reason in language, predict in video, and act in continuous space simultaneously. Gemini Robotics 1.5 is the first model to attempt all three. Whether the unified approach outperforms specialized decompositions remains an open empirical question as of 2026.
DVA decomposes robot policy learning into "what should the world look like next?" and "what motor commands make that happen?" The first question has internet-scale training signal. The second has a simple, well-posed answer. The decomposition is the insight.
12·D The VLA zoo
The full landscape of vision-language-action models — the ones that shipped, the ones that scaled down, and the ones that proved cross-embodiment is real.
Section 12 gave the lineage table. This section opens each row and looks inside. The field moved fast enough in 2024–2025 that "VLA" now covers at least four distinct architectural bets: cross-embodiment pre-training, small-VLA distillation, bimanual specialists, and open-source generalists. Knowing which bet each model makes is the difference between picking the right starting point and wasting a quarter.
Cross-embodiment pre-training
HPT — Heterogeneous Pre-trained Transformers
Wang et al., 2024. The core idea: different robots have different observation and action spaces, but the task semantics are shared. HPT handles heterogeneity by giving each embodiment its own lightweight "stem" encoder — a small MLP or CNN that projects that robot's observations into a shared token space — and a shared transformer trunk that processes the tokens regardless of where they came from. Actions are decoded by per-embodiment "head" MLPs.
The architecture is deliberately modular. Adding a new robot means training a new stem and head while keeping the trunk frozen. This is the "plug-and-play embodiment" idea: the trunk learns task-level abstractions (approach, grasp, place) and the stems/heads handle the geometry of each particular arm.
In plain English: different robots plug different adapters into the same brain. A Franka arm and a UR5 have different cameras and different joints, but the concept of "reach toward the red cup" is the same. Each robot gets its own small translator (the stem) that converts its sensors into a common language the shared brain speaks, and another small translator (the head) that converts the brain's output back into that robot's joint commands.
The forward pass for embodiment $e$:
$$z_t = \text{Stem}_e(o_t^e), \qquad h_t = \text{Trunk}(z_t), \qquad \hat{a}_t = \text{Head}_e(h_t)$$
$o_t^e$ — the observation for embodiment $e$ at time $t$. Different embodiments have different observation shapes (number of cameras, proprioception dimensions).
$\text{Stem}_e$ — the per-embodiment encoder. Projects $o_t^e$ into a fixed-dimensional token $z_t$. Typically a 2–3 layer MLP for proprioception and a small CNN or frozen ViT for images.
$\text{Trunk}$ — the shared transformer. Processes tokens from all embodiments identically. This is where cross-embodiment transfer happens.
$\text{Head}_e$ — the per-embodiment action decoder. Maps the trunk's output back to the action space of embodiment $e$.
In code: z = stem_franka(obs); h = trunk(z); a = head_franka(h). Three calls, three modules. To add a new robot, you write a new stem and head (tiny MLPs, ~2M params each), freeze the 300M-param trunk, and fine-tune on 50 demos. Training: ~1 hour on a single GPU.
HPT was pre-trained on data from over 50 robot embodiments. The result: training a new stem/head on top of the frozen trunk for a novel robot with just 50 demos beats training from scratch with 200. The trunk is genuinely learning transferable motor abstractions, not just averaging.
CrossFormer
Doshi et al., 2024. Same thesis as HPT — cross-embodiment pre-training with heterogeneous inputs — but different architectural choices. CrossFormer uses a single transformer that ingests all modalities (images, proprioception, language) as tokens, with learned "embodiment embeddings" added to each token to tell the model which robot produced it. No separate stems; the transformer does the alignment internally.
The tradeoff: CrossFormer is simpler to implement (one model, one forward pass) but harder to extend to new embodiments without retraining. HPT's modular stems make adding a new embodiment cleaner, since only a new stem and head need training; CrossFormer's monolithic design makes within-distribution performance slightly higher.
The small-VLA wave
TinyVLA
Wen et al., 2024. The first serious attempt to ask: how small can a VLA be and still work? TinyVLA uses a 1B-parameter backbone (a distilled VLM) and shows that with careful LoRA fine-tuning and aggressive data curation, a 1B model matches the 7B OpenVLA on standard benchmarks. The insight is that most of the 7B parameters are dedicated to language understanding that robot control does not exercise — a smaller model with the same visual and motor capacity suffices.
SmolVLA
HuggingFace, 2025. Pushed the frontier further: 450M parameters, flow-matching action expert, and performance matching models 10× its size on LIBERO and SimplerEnv. The recipe: SmolVLM as the vision-language backbone (itself a distillation of larger VLMs), a lightweight flow-matching expert for actions, and LoRA adapters for task-specific fine-tuning. SmolVLA runs on a single consumer GPU at inference — important for labs that don't have an A100 per robot cell.
The small-VLA wave is not about compression for its own sake. It is about deployment economics. A 7B VLA needs a \$10K GPU per robot. A 450M VLA runs on a \$2K Jetson. When you're deploying 100 robots, the difference is \$800K.
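The fleet-level difference follows from the per-robot hardware prices quoted above; the helper name `fleet_compute_cost` is illustrative, and the prices are the article's round numbers rather than current quotes.

```python
def fleet_compute_cost(n_robots, cost_per_robot_usd):
    """Total on-robot compute cost for a fleet, one accelerator per robot."""
    return n_robots * cost_per_robot_usd

large = fleet_compute_cost(100, 10_000)  # 7B VLA: one $10K GPU per robot
small = fleet_compute_cost(100, 2_000)   # 450M VLA: one $2K Jetson per robot
print(large - small)                     # 800000
```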
Bimanual and mobile specialists
Mobile ALOHA
Fu et al., 2024. A mobile base with two ALOHA arms and whole-body teleoperation. The architecture is ACT (a conditional VAE + transformer), not a VLA — but the training trick is the headline: co-training. Mixing mobile ALOHA trajectories with static ALOHA trajectories in a single dataset improves success rates on both setups, even though the embodiments are different. The shared representation of bimanual coordination transfers across the mobile/static divide.
The co-training result is counterintuitive. A policy trained on mobile+static data outperforms one trained on mobile data alone, even on mobile tasks. The explanation: static data provides more diverse manipulation examples that the shared bimanual trunk can leverage, and the mobile base trajectories provide context that the static policy never sees. Both benefit.
ALOHA 2
Aldaco et al., 2024. Hardware iteration, not an architecture paper. Better teleoperation (lower friction, wider range of motion), better cameras (higher resolution, wider FOV), and a systematic study of data quality vs quantity. The headline result: 50 high-quality demos (smooth, consistent, no hesitation) outperform 200 mediocre demos (jittery, varied strategy). The lesson generalizes beyond ALOHA: if you're collecting data, train your teleoperators.
RDT-1B
Liu et al., 2024. A 1.2B-parameter Diffusion Transformer built specifically for bimanual manipulation. The architecture is a standard DiT (the same backbone used in image generation) adapted for action sequences: noised action chunks as input tokens, cross-attention to image and language tokens, iterative denoising at inference. Trained on over 1M bimanual episodes from multiple robot platforms.
RDT-1B's significance is proving that the DiT architecture — which was designed for image generation — works for action generation at scale. The denoising formulation handles bimanual multimodality naturally (two arms have highly multimodal coordination patterns), and the 1B parameter count is large enough to absorb cross-embodiment variation without the modular stem/head design of HPT.
The open-source generalist
Octo
Octo Team, 2024. The first serious open-source generalist robot policy. Transformer backbone (27M or 93M parameters), trained on 800K episodes from the Open X-Embodiment dataset. Supports language conditioning, goal-image conditioning, or both. Action head is a diffusion model (continuous actions) or a discrete tokenizer, selectable at fine-tuning time.
Octo's design philosophy is flexibility over performance. It is not the best policy on any single benchmark, but it is the only open model that can be fine-tuned to a new robot with 50 demos in an afternoon. The provided fine-tuning scripts, data loaders, and evaluation harness make it the practical starting point for most academic labs in 2025.
Octo: the practical starting point
Octo deserves special attention because it is the model most academic labs will actually use. The architecture is deliberately simple: a standard transformer backbone (27M or 93M parameters) with separate tokenizers for image, language, and proprioception inputs. The action head is pluggable — you can choose a diffusion head (continuous actions) or a discrete tokenizer at fine-tuning time.
The fine-tuning workflow is designed for accessibility:
Record 50–200 demonstrations on your robot using any teleoperation method.
Convert to the RLDS format (a standard data format for robot learning datasets).
Run the provided fine-tuning script: python finetune.py --data your_data --model octo-base. Typical time: 2–4 hours on a single A100.
Deploy with the inference script. The model runs at ~10Hz on an RTX 3090.
Octo is not the best policy on any single benchmark. But it is the only open model that provides the complete pipeline — from data collection to deployment — with documentation, scripts, and community support. For a lab that wants to get a VLA running on their robot in a week rather than a quarter, Octo is the answer.
The full comparison
| Model | Params | Action head | Embodiments | Data scale | Open |
| --- | --- | --- | --- | --- | --- |
| HPT | ~300M trunk | Per-embodiment MLP | 50+ (stem per embodiment) | 150K+ episodes | Yes |
| CrossFormer | ~130M | Diffusion / discrete | 10+ (embodiment embeddings) | 900K+ episodes | Yes |
| TinyVLA | 1B | Discrete tokens | Single (fine-tune per robot) | Fine-tune from OpenVLA data | Yes |
| SmolVLA | 450M | Flow matching expert | Single (LoRA per task) | Fine-tune, LIBERO/SimplerEnv | Yes |
| Mobile ALOHA | ~80M (ACT) | CVAE + transformer | Mobile + static bimanual | ~800 co-trained episodes | Yes |
| ALOHA 2 | ~80M (ACT) | CVAE + transformer | Static bimanual | 50–200 per task | Yes |
| RDT-1B | 1.2B | Diffusion (DiT) | Multiple bimanual platforms | 1M+ episodes | Yes |
| Octo | 27M / 93M | Diffusion / discrete | 22 robots (OXE) | 800K episodes | Yes |
| OpenVLA | 7B | Discrete tokens | Single (fine-tune) | 970K episodes | Yes |
| π₀ | 3.3B | Flow matching | 7+ embodiments | 10K+ hours | Partial |
| GR00T N1 | 2.2B | Diffusion / flow | Humanoid-focused | Real + synthetic + video | Partial |
The practical recipe: LoRA fine-tuning a VLA
The standard workflow for deploying a VLA on a new robot in 2026 is not training from scratch; it is LoRA fine-tuning a pretrained checkpoint. Low-Rank Adaptation (LoRA) freezes the pretrained weights and injects small trainable low-rank matrices into the attention layers. The result: you update a small fraction of the parameters (typically 0.2–2%) while preserving the pretrained knowledge.
LoRA fine-tuning recipe for a VLA
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, Trainer, TrainingArguments

# Load pretrained VLA checkpoint (the OpenVLA repo ships custom model code)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", trust_remote_code=True
)

# Configure LoRA: inject low-rank adapters into attention
lora_config = LoraConfig(
    r=32,            # rank of the adapter matrices
    lora_alpha=32,   # scaling factor (alpha/r = 1.0)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts

# Fine-tune on your robot's demonstration data
# Typical: 50-200 demos, 2-8 hours on a single A100
# robot_demos and collate_fn are placeholders for your dataset and batch collator
args = TrainingArguments(
    output_dir="vla-lora",
    learning_rate=2e-5,              # low LR to preserve priors
    num_train_epochs=20,
    per_device_train_batch_size=8,
    warmup_ratio=0.05,
)
trainer = Trainer(model=model, args=args, train_dataset=robot_demos, data_collator=collate_fn)
trainer.train()
The key hyperparameters: rank $r = 16$–$64$ (higher rank = more expressivity, more parameters), learning rate $2 \times 10^{-5}$ (lower than full fine-tuning to avoid catastrophic forgetting), and 10–30 epochs on small datasets. The total trainable parameter count is typically 0.2–2% of the full model. The same pattern applies to other VLA checkpoints whose attention projections PEFT can target, though module names and loading code differ between models.
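To see how rank drives the trainable parameter count, recall that LoRA adds two matrices next to each frozen weight $W \in \mathbb{R}^{d_{out} \times d_{in}}$: $A \in \mathbb{R}^{r \times d_{in}}$ and $B \in \mathbb{R}^{d_{out} \times r}$, so each adapted layer gains $r(d_{in} + d_{out})$ parameters. The sketch below applies this to a hypothetical 32-layer transformer with hidden size 4096 and adapters on four attention projections. Those dimensions are assumptions for illustration, not a statement about any specific checkpoint.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical transformer: 32 layers, hidden size 4096, adapters on q/k/v/o.
n_layers, hidden, n_targets = 32, 4096, 4

totals = {}
for rank in (16, 32, 64):
    totals[rank] = n_layers * n_targets * lora_params(hidden, hidden, rank)
    print(rank, totals[rank])
# 16 16777216
# 32 33554432
# 64 67108864
```

Against a roughly 7B-parameter backbone these counts are about 0.24%, 0.48%, and 0.96% of the model, which sits inside the 0.2–2% range quoted above. The count scales linearly with rank and with the number of targeted projections.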
Worked example: HPT stem architecture. A new lab has a Stretch RE-1 robot (one arm, 6-DoF, wrist camera + base camera, joint positions as proprioception). The pretrained HPT trunk was trained on 50+ embodiments but never saw a Stretch.
Step 1: Design the stem. The Stretch's observation space is: wrist image (224×224 RGB), base image (224×224 RGB), joint positions (6D), gripper width (1D). The stem must map all of these to the trunk's token format (512-dim tokens).
Image stem: frozen ViT-B/14 encodes each image → take CLS token (768D) → linear projection to 512D. Two images → 2 tokens.
Proprio stem: concatenate [joints(6); gripper(1)] = 7D → 2-layer MLP (7 → 256 → 512) → 1 token.
Total input to trunk: 3 tokens per timestep.
Step 2: Design the head. The Stretch's action space is: $\Delta$joint positions (6D) + gripper (1D) = 7D. Head: trunk output token (512D) → 2-layer MLP (512 → 256 → 7).
Step 3: Fine-tune. Freeze the trunk. Train only the stem + head on 50 Stretch demonstrations. Trainable parameters: ~2M (vs. 300M in the trunk). Training time: ~1 hour on a single GPU. The trunk's pretrained knowledge of "reach, grasp, place" transfers; the stems/heads adapt it to the Stretch's specific geometry.
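The shapes in this worked example can be traced with a small NumPy sketch. Everything here is a stand-in: the ViT CLS features are random vectors, the weights are untrained, the trunk is replaced by mean-pooling, and sharing one image projection across both cameras is an assumption the text does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512  # trunk token width from the worked example

def linear(n_in, n_out):
    """Random (untrained) weight and zero bias for one linear layer."""
    return rng.normal(size=(n_in, n_out)) * 0.02, np.zeros(n_out)

def mlp2(x, l1, l2):
    """Two-layer MLP with a ReLU hidden layer."""
    h = np.maximum(x @ l1[0] + l1[1], 0.0)
    return h @ l2[0] + l2[1]

# Trainable pieces for the new embodiment.
img_proj = linear(768, D)                          # CLS (768) -> token (512)
prop_l1, prop_l2 = linear(7, 256), linear(256, D)  # proprio MLP: 7 -> 256 -> 512
head_l1, head_l2 = linear(D, 256), linear(256, 7)  # head MLP: 512 -> 256 -> 7

def stem(wrist_cls, base_cls, joints, gripper):
    img_tokens = [c @ img_proj[0] + img_proj[1] for c in (wrist_cls, base_cls)]
    prop_token = mlp2(np.concatenate([joints, gripper]), prop_l1, prop_l2)
    return np.stack(img_tokens + [prop_token])     # (3, 512): 2 image + 1 proprio

def frozen_trunk(tokens):
    # Placeholder for the pretrained transformer: pool to one output token.
    return tokens.mean(axis=0)

tokens = stem(rng.normal(size=768), rng.normal(size=768),
              np.zeros(6), np.array([0.04]))
action = mlp2(frozen_trunk(tokens), head_l1, head_l2)
print(tokens.shape, action.shape)
```

Only `img_proj`, the proprio MLP, and the head MLP would receive gradients during fine-tuning; the trunk stays fixed.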
Worked example: Mobile ALOHA co-training. You have 200 mobile ALOHA episodes (bimanual tasks on a wheeled base) and 3,000 static ALOHA episodes (bimanual tasks on a fixed table). Training on mobile-only: 62% success. Training on static-only: N/A for mobile tasks. Co-training on both:
The trick: combine both datasets into a single training set. For static episodes, the base velocity action dimensions are masked to zero. The ACT transformer processes both data sources identically — the bimanual arm tokens are the same; only the base tokens differ.
Why it helps: the static dataset provides 15× more diverse manipulation examples (different objects, grasps, placements). The transformer trunk learns general bimanual coordination primitives from this larger dataset. The mobile base trajectories provide the navigational context. Both halves share the manipulation representations.
Result: co-training achieves 78% success on mobile tasks — a 16-point improvement from a dataset that contains zero mobile demonstrations of its own. The static ALOHA data acted as a force multiplier for the mobile policy.
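The data-merging step described above can be sketched as follows. The dimension counts (14 arm dimensions, 2 base-velocity dimensions) and the function names are assumptions for illustration; the text only states that static episodes get their base-velocity dimensions set to zero so both sources share one action layout.

```python
ARM_DIMS = 14   # assumption: 7 action dimensions per arm on a bimanual ALOHA
BASE_DIMS = 2   # assumption: linear and angular base velocity


def pad_static_action(arm_action):
    """Give a static-ALOHA action the mobile layout with a zero base command."""
    if len(arm_action) != ARM_DIMS:
        raise ValueError(f"expected {ARM_DIMS} arm dims, got {len(arm_action)}")
    return list(arm_action) + [0.0] * BASE_DIMS


def build_cotraining_set(mobile_episodes, static_episodes):
    """Merge both sources into one list of (source, actions) with equal action width."""
    merged = [("mobile", [list(a) for a in ep]) for ep in mobile_episodes]
    merged += [("static", [pad_static_action(a) for a in ep]) for ep in static_episodes]
    return merged


# Tiny stand-in data: one 3-step mobile episode, two 3-step static episodes.
mobile = [[[0.1] * (ARM_DIMS + BASE_DIMS) for _ in range(3)]]
static = [[[0.2] * ARM_DIMS for _ in range(3)] for _ in range(2)]

dataset = build_cotraining_set(mobile, static)
for source, actions in dataset:
    print(source, len(actions), "steps,", len(actions[0]), "dims per action")
```

Because every action now has the same width, a single policy can train on the merged list without knowing which source an episode came from.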
The foundation model vs specialist debate
When does scale help? The data says: scale helps when the task distribution is broad, when the robot will encounter novel objects, and when language conditioning is needed. A 7B VLA fine-tuned on 50 demos of "pick up the red block" will lose to a 3M-parameter Diffusion Policy trained on 50 demos of the same task. The VLA's advantage emerges when you ask it to pick up the blue block tomorrow, or the dinosaur next week, or to fold a shirt it has never seen.
The practical heuristic: if your deployment involves fewer than 5 distinct tasks with known objects, use a specialist. If it involves open-vocabulary instructions, novel objects, or a task set that will grow over time, use a VLA. The crossover point is roughly 10–20 tasks — below that, per-task specialists win on accuracy; above it, the VLA wins on engineering cost.
The deployment decision tree
Given the zoo of available models, how do you choose? The decision is primarily about your task distribution, data budget, and compute budget:
Worked example: choosing the right VLA for your deployment. Scenario A: Single task, 50 demos, one robot. Use a specialist policy (Diffusion Policy or ACT), not a VLA. The VLA's internet knowledge adds nothing when the task is fixed. Train time: 2 hours. Inference: <50ms.
Scenario B: 5 tasks, 100 demos each, one robot. Use Octo or SmolVLA with LoRA fine-tuning per task. The shared backbone amortizes the visual representation across tasks. Train time: 4 hours total. Inference: ~100ms.
Scenario C: Open-vocabulary instructions, 500 demos, one robot. Use OpenVLA-7B with LoRA. You need the language grounding from the VLM backbone. Train time: 8 hours on an A100. Inference: ~200ms (need two-system split for 10Hz).
Scenario D: 3 different robot platforms, 200 demos each, shared tasks. Use HPT or CrossFormer for cross-embodiment transfer. Train per-embodiment stems on each robot's data, share the trunk. Train time: 12 hours total (stems train fast, trunk is pretrained).
Scenario E: Humanoid, 35 DOF, bimanual, language instructions. Use $\pi_0$-class architecture: VLM backbone + flow matching action expert. The two-system split is mandatory at this DOF count and control frequency. Budget: A100 or H100 per robot.
The VLA zoo is converging on a common body plan: frozen VLM backbone, lightweight action expert, LoRA or adapter fine-tuning. The variation is in the action head (discrete vs diffusion vs flow), the bridge (tokens vs hidden states vs latent plan), and the training data. If you're starting a new project, pick the smallest model that covers your task distribution and fine-tune it. Don't train from scratch unless you have a novel embodiment and 100K+ demos.
12·E π₀: the flow-matching VLA, in full
The most important VLA architecture of 2024–2025 deserves its own teardown. Here is every layer, every loss, and every training trick.
Section 12 introduced π₀ as one row in the VLA lineage table. Section 12·D placed it in the zoo. This section opens the hood. π₀ is not just another VLA — it is the architecture that proved three things simultaneously: (1) flow matching beats diffusion for action generation, (2) a dedicated action expert outperforms discrete tokenization, and (3) cross-embodiment training across seven robot platforms produces a single policy that generalizes to new tasks with minimal fine-tuning. Understanding π₀ in detail is understanding where the field is going.
Architecture: the two-headed transformer
The core insight behind π₀ is a separation of concerns inside a single transformer. A VLM backbone — based on PaliGemma, a 3B-parameter vision-language model — handles perception and language understanding. It processes camera images and language instructions, producing rich hidden-state representations that encode "what is in the scene" and "what the instruction means." But the VLM backbone never predicts actions directly. Instead, its hidden states are fed to a separate action expert: a dedicated set of transformer layers that speak continuous distributions instead of discrete tokens.
The action expert shares the backbone's attention layers — it participates in the same self-attention computation over image, language, and proprioception tokens — but has its own MLP weights. This is a mixture-of-experts (MoE) design: during the forward pass, each token is routed through either the VLM's MLPs or the action expert's MLPs, depending on whether it is a "perception" token or an "action" token. The attention layers are shared because cross-modal attention is what allows the action expert to "see" the scene and "hear" the instruction. The MLPs are separate because the computation needed to predict continuous action distributions is fundamentally different from the computation needed to answer VQA questions.
Why not just discretize? RT-2 and OpenVLA discretize each action dimension into 256 bins and predict bin indices autoregressively. This works, but it imposes a resolution ceiling: 256 bins over a 0.4m workspace gives ~1.6mm per bin. For insertion tasks, that is too coarse. It also handles cross-dimension correlations awkwardly: each dimension is quantized and decoded as its own token, so coordinated joint motion has to be reassembled one token at a time rather than modeled as a single continuous vector. Flow matching avoids both problems by modeling the full continuous joint distribution in one shot.
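The resolution ceiling is easy to reproduce. The sketch below assumes uniform bins over a 0.4m range and decoding to the bin center; the helper names are illustrative and not taken from RT-2 or OpenVLA.

```python
def quantize(value: float, low: float, high: float, bins: int = 256) -> int:
    """Map a continuous value to a bin index in [0, bins - 1]."""
    width = (high - low) / bins
    return min(int((value - low) / width), bins - 1)


def dequantize(index: int, low: float, high: float, bins: int = 256) -> float:
    """Map a bin index back to the center of its bin."""
    width = (high - low) / bins
    return low + (index + 0.5) * width


LOW, HIGH = 0.0, 0.4                      # workspace extent in meters
bin_width_mm = (HIGH - LOW) / 256 * 1000  # size of one bin in millimeters

# Worst-case round-trip error over a sweep of positions.
worst_error_mm = max(
    abs(dequantize(quantize(v, LOW, HIGH), LOW, HIGH) - v) * 1000
    for v in (HIGH * i / 999 for i in range(1000))
)

print(f"bin width: {bin_width_mm:.4f} mm")
print(f"worst round-trip error in sweep: {worst_error_mm:.4f} mm")
```

The round-trip error is bounded by half a bin width, which is the best a 256-bin tokenizer can do on this range no matter how well the model predicts the index.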
The action expert: flow matching over action chunks
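The action expert generates actions by learning a velocity field that transports noise to data. At training time, it learns to predict the straight-line velocity between a noise sample $x_0 \sim \mathcal{N}(0, I)$ and a ground-truth action chunk $x_1 = a_{1:H}$. At inference time, it starts from pure noise and integrates the learned velocity field forward to produce a clean action chunk. Assembled from the terms defined below, the training objective is the conditional flow matching loss:
$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\left[\,\left\| v_\theta\big((1-t)\,x_0 + t\,x_1,\; t,\; h_{\text{vlm}}\big) - (x_1 - x_0) \right\|^2\,\right]$$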
$t \sim U(0, 1)$ — flow time, sampled uniformly. At $t = 0$ the input is pure noise; at $t = 1$ it is pure data.
$x_0 \sim \mathcal{N}(0, I)$ — noise sample, same shape as the action chunk: $(H \times D)$ where $H$ is the chunk length and $D$ is the action dimension.
$x_1 = a_{1:H}$ — ground-truth action chunk from the demonstration. For a 7-DoF arm with $H = 50$, this is a $(50 \times 7)$ tensor.
$(1-t)x_0 + t\,x_1$ — the linear interpolant between noise and data. The input to the velocity network at flow time $t$.
$v_\theta(\cdot, t, h_{\text{vlm}})$ — the action expert's velocity prediction, conditioned on the VLM's hidden states $h_{\text{vlm}}$. The expert "sees" the scene and instruction through $h_{\text{vlm}}$.
$(x_1 - x_0)$ — the target velocity: the straight-line direction from noise to data. This is what the network must learn to predict.
In plain English: take a real action chunk from a demonstration, mix it with random noise at a random ratio, and ask the network to predict which direction leads from noise toward the real actions. Do this millions of times, and the network learns a vector field that can turn any noise sample into a plausible action chunk, conditioned on what the robot sees and hears.
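In code, building one training example looks roughly like this. It is a minimal sketch on a flattened action chunk using plain Python lists; `make_training_pair` and `flow_matching_loss` are illustrative names, and the velocity network itself is left out.

```python
import random


def make_training_pair(action_chunk, rng):
    """Return (x_t, t, target_velocity) for one flattened action chunk x_1."""
    t = rng.random()                                    # t ~ U(0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in action_chunk]    # x_0 ~ N(0, I)
    # Linear interpolant between noise and data: the network's input.
    x_t = [(1.0 - t) * n + t * a for n, a in zip(x0, action_chunk)]
    # Straight-line velocity from noise to data: the regression target.
    target = [a - n for n, a in zip(x0, action_chunk)]
    return x_t, t, target


def flow_matching_loss(predicted, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


rng = random.Random(0)
chunk = [0.1] * (50 * 7)          # H = 50 steps, D = 7 dims, flattened to 350
x_t, t, target = make_training_pair(chunk, rng)

# A perfect predictor has zero loss; predicting all zeros does not.
perfect_loss = flow_matching_loss(target, target)
zero_loss = flow_matching_loss([0.0] * len(target), target)
print(f"t = {t:.3f}, perfect loss = {perfect_loss}, zero-prediction loss = {zero_loss:.3f}")
```

A useful check on the construction: moving from `x_t` along `target` for the remaining time `1 - t` lands exactly on the data sample, which is why integrating the learned field recovers action chunks.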
Diffusion policies (Section 08) denoise iteratively through $K$ noise levels, typically $K = 100$ steps with DDPM or $K = 10$–$20$ with DDIM. Each step requires a full forward pass of the denoiser. Flow matching uses straight-line paths between noise and data, which means fewer integration steps are needed to produce high-quality samples. In practice, π₀ uses 10 Euler steps at inference: 10× fewer than the 100-step DDPM setting and at the low end of the DDIM range, while achieving equal or better sample quality. Fewer steps = lower latency = higher control frequency.
The second advantage is natural multimodality. When the demonstration data contains multiple valid strategies for the same observation (approach from the left or the right), the flow matching velocity field smoothly routes different noise samples toward different modes. Diffusion can do this too, but flow matching's linear interpolation makes mode separation more stable during training — the velocity targets $(x_1 - x_0)$ are always well-defined, even when $x_1$ is multimodal.
Training recipe: three stages
π₀'s training is a three-stage pipeline, each stage building on the previous:
Stage 1: VLM pre-training (inherited)
The PaliGemma backbone arrives pre-trained on internet-scale vision-language data. This is the most expensive stage — thousands of GPU-hours on billions of image-text pairs — and is done by the model provider, not the robotics lab. The backbone already understands "red cup," "wooden table," "pick up," and thousands of other visual-semantic concepts.
Provides the visual-semantic foundation that robot data alone cannot teach
Stage 2: Co-fine-tuning on web + robot data
The critical stage. The backbone is fine-tuned on a mixture of web VQA data and robot demonstration data. Every training batch contains both: ~70–80% robot samples supervised with the flow matching action loss, and ~20–30% web samples supervised with the standard VQA cross-entropy loss. The two losses are summed with a weighting ratio.
The web data prevents catastrophic forgetting. Without it, the backbone forgets what "red cup" looks like after a few thousand robot-only gradient steps. With it, the visual-semantic features stay alive while the action expert learns to use them for motor control. The ratio matters: too much web data slows convergence on robot tasks; too little and forgetting sets in. The sweet spot is empirically ~75% robot, ~25% web.
Teaches action prediction while preserving internet knowledge
Stage 3: Task-specific LoRA fine-tuning
Freeze the backbone and action expert weights. Attach LoRA adapters (rank 32, ~20M trainable parameters) to the attention layers. Fine-tune on 50–200 demonstrations of the target task on the target robot. This stage takes 1–2 hours on 8 GPUs and adapts the general-purpose policy to the specific geometry, objects, and task requirements of the deployment.
Specializes the generalist policy without destroying it
Co-fine-tuning: why mixing web and robot data helps
The intuition is worth deriving carefully. During web pre-training, the VLM backbone learns a representation where the token for "red" is close to the visual features of red objects. During robot training, the action expert learns that "pick up" means a specific motion trajectory. Co-fine-tuning lets both signals reinforce each other: the backbone maintains its semantic features (red = this visual pattern) while the action expert learns to use those features for motor control (red object at position X = reach to position X).
Without web data in the mix, the backbone's weights drift. After 10K robot-only gradient steps, the representation of "red" has been overwritten by features more useful for predicting the flow matching velocity field but less useful for distinguishing "red cup" from "blue cup." The model can still grasp objects, but it can no longer follow the instruction "pick up the red cup" because "red" no longer activates the correct visual features. This is catastrophic forgetting in action.
Without robot data, the backbone understands language and vision perfectly but has no idea how to translate that understanding into motor commands. It knows what a cup is but cannot close the gripper around one. The action expert starts from random weights and needs robot demonstrations to learn the mapping from VLM hidden states to velocity fields.
The mixing ratio acts as a regularizer strength. More web data = stronger regularization against forgetting = slower convergence on robot tasks. The practical schedule: start with a higher web ratio (40%) in the first 10K steps when the action expert is learning from scratch and the forgetting risk is highest, then decay to 20% for the remainder of training.
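The schedule and the summed loss can be written down in a few lines. This is a sketch under stated assumptions: the text gives the 40% and 20% endpoints and the 10K-step boundary, while the linear shape of the decay, the `decay_steps` window, and the `vqa_weight` default are hypothetical choices made here for illustration.

```python
def web_fraction(step, warm_steps=10_000, start=0.40, end=0.20, decay_steps=5_000):
    """Fraction of each batch drawn from web VQA data at a given training step."""
    if step < warm_steps:
        return start
    # Assumption: linear decay over `decay_steps`, then hold at `end`.
    progress = min((step - warm_steps) / decay_steps, 1.0)
    return start + progress * (end - start)


def batch_split(step, batch_size):
    """Return (num_robot_samples, num_web_samples) for one batch."""
    n_web = round(web_fraction(step) * batch_size)
    return batch_size - n_web, n_web


def combined_loss(action_loss, vqa_loss, vqa_weight=1.0):
    """Flow matching action loss plus weighted VQA cross-entropy loss."""
    return action_loss + vqa_weight * vqa_loss


for step in (0, 9_999, 12_500, 50_000):
    robot, web = batch_split(step, batch_size=256)
    print(f"step {step:>6}: {robot} robot samples, {web} web samples")
```

Treating the web fraction as a function of the step keeps the regularization strength in one place, so it can be tuned without touching the data loaders.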
The π₀ evolution: 0 → 0.5 → 0.7
π₀ (2024): the base architecture
PaliGemma backbone + flow matching action expert. Trained on 10M+ demonstration steps across 7 robot platforms (Franka, UR5, ALOHA, Sawyer, xArm, Google Robot, mobile platforms). 3.3B parameters total (3B PaliGemma backbone + 300M action expert). Proved that a single policy can control multiple robots with different morphologies, sensors, and action spaces.
π₀.5 (2025): task-specific LoRA
Added the third training stage: LoRA fine-tuning for specific tasks and environments. The key finding was that 50–200 demonstrations plus LoRA fine-tuning on a new task achieves higher success rates than the base π₀ policy with 10× more demonstrations of that task in the pre-training mix. The per-task adapters are tiny (20M parameters each) and can be hot-swapped at inference: load the "fold towel" adapter for towel folding, the "pour water" adapter for pouring. It also showed open-world generalization to new kitchens and bedrooms.
π₀.7 (2026): the RL Token
π₀.7 introduced two innovations. First, multi-scale memory (MEM): a hierarchical context window that stores recent observations at full resolution and older observations at compressed resolution, enabling tasks longer than 10 minutes without exceeding the transformer's context length. Second, the RL Token.
The RL Token is a special conditioning token prepended to the input sequence during online RL fine-tuning. When this token is absent, the policy behaves normally — it executes the most likely action for the given observation and instruction. When the RL Token is present, the flow matching sampling process adds an entropy bonus: instead of integrating the velocity field to its most likely endpoint, the sampler adds controlled noise at each Euler step, producing more diverse and exploratory action trajectories.
This is the mechanism that enables online RL polishing: the RL Token turns the deterministic-at-inference policy into a stochastic one, providing the exploration needed for policy gradient methods to discover better-than-demonstration behaviors. Once RL fine-tuning converges, the RL Token is removed, and the policy returns to its high-precision, low-variance mode. The beauty is that the same model weights serve both purposes — no separate exploration policy is needed.
| Spec | Value | Notes |
| --- | --- | --- |
| Robot platforms | 7 | Franka, UR5, ALOHA, Sawyer, xArm, Google Robot, mobile |
| Action chunk size | 50 steps | At 50Hz = 1 second of motion per chunk |
| Inference latency | ~100ms | 10 Euler steps on A100; backbone amortized across chunks |
| LoRA fine-tuning | 1–2 hours | 8 GPUs, 50–200 demonstrations |
| LoRA adapter size | ~20M params | 0.6% of total; hot-swappable per task |
Worked example: one inference step, traced end to end
Worked example: π₀ inference for "pick up the red cup." The robot is a Franka Panda with a wrist camera and a scene camera.
Step 1: Image encoding. Scene camera (224×224×3) → PaliGemma ViT encoder → 256 image tokens $\in \mathbb{R}^{2048}$. Wrist camera → same encoder → 256 tokens. Total: 512 image tokens. Latency: ~12ms.
Step 2: Language encoding. "Pick up the red cup" → PaliGemma tokenizer → 8 text tokens $\in \mathbb{R}^{2048}$. Latency: negligible (embedding lookup).
Step 3: Proprioception encoding. Joint positions (7D) + gripper width (1D) + EE pose (6D) = 14D → MLP → 1 token $\in \mathbb{R}^{2048}$. Latency: negligible.
Step 4: VLM backbone forward pass. Concatenated sequence: [8 text + 512 image + 1 proprio] = 521 tokens. Processed through 18 transformer layers (the depth of the Gemma-2B language model inside PaliGemma). Each layer: shared attention across all tokens, then VLM-specific MLPs for the text/image/proprio tokens. Output: $h_{\text{vlm}} \in \mathbb{R}^{521 \times 2048}$. Latency: ~40ms on A100.
Step 5: Action expert — flow matching generation. Initialize $x_0 \sim \mathcal{N}(0, I)$ with shape $(50, 7)$ — 50-step chunk, 7-DoF action. Flatten to 350D, project to action tokens, append to the sequence. Run 10 Euler steps:
For $k = 0, 1, \ldots, 9$:
$t_k = k / 10$
Feed $[h_{\text{vlm}};\; \text{action\_tokens}(x_{t_k})]$ through shared attention + action expert MLPs
$v_k = \text{expert\_output}$ — predicted velocity, shape $(50, 7)$
$x_{t_{k+1}} = x_{t_k} + (1/10) \cdot v_k$ — Euler integration step
Final: $x_1 = \hat{a}_{1:50}$ — the predicted 50-step action chunk. Latency: 10 × ~5ms = ~50ms.
Step 6: Execute. Send the first 10 actions to the Franka's operational-space controller at 50Hz. After 200ms (10 steps), re-observe and replan — or continue executing the remaining 40 steps if confidence is high.
Total latency: 12ms (vision) + 40ms (backbone) + 50ms (flow matching) = ~102ms. Well within a 5Hz replan budget.
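The Euler loop in Step 5 can be exercised without a trained model. In the sketch below the learned expert is replaced by a hand-written stand-in: the ideal straight-line field toward one fixed target chunk, $(x_1 - x)/(1 - t)$. That stand-in is an assumption for illustration only; a real action expert would be conditioned on $h_{\text{vlm}}$ and would not know the target.

```python
import random


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_straight_line_field(target):
    """Stand-in for the expert: the exact straight-line velocity toward `target`."""
    def velocity(x, t):
        return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
target_chunk = [0.05 * i for i in range(50 * 7)]          # flattened (50, 7) chunk
noise = [rng.gauss(0.0, 1.0) for _ in range(50 * 7)]      # x_0 ~ N(0, I)

chunk = euler_sample(make_straight_line_field(target_chunk), noise, num_steps=10)
max_error = max(abs(c - a) for c, a in zip(chunk, target_chunk))

# Latency budget from the worked example, in milliseconds.
total_latency_ms = 12 + 40 + 10 * 5
print(f"max deviation from target after 10 steps: {max_error:.2e}")
print(f"total latency: {total_latency_ms} ms")
```

With a perfectly straight field, Euler integration is exact at any step count, which is the intuition behind flow matching needing few steps; a learned field is only approximately straight, so the step count remains a quality-versus-latency knob.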
The MoE architecture, precisely
The mixture-of-experts design is worth examining at the layer level. In a standard transformer, each layer has two sub-modules: multi-head self-attention and a feed-forward MLP. In π₀, the attention sub-module is shared across all token types: image tokens, text tokens, proprio tokens, and action tokens all attend to each other in the same attention computation. But the MLP sub-module is split: image/text/proprio tokens are routed through the VLM's MLP weights (initialized from PaliGemma pre-training), while action tokens are routed through the action expert's MLP weights (trained from scratch on robot data).
This design has three consequences. First, the action expert can "read" the scene and instruction through attention without any information bottleneck: it has full access to every image patch and every word. Second, action tokens never pass through the VLM's MLPs, so the action loss reaches those weights only indirectly, through the shared attention layers. Combined with the web data in the Stage 2 mix, this limits how far robot gradients can overwrite internet knowledge. Third, the action expert's MLPs are free to learn action-specific computations (velocity field prediction, temporal correlations across the chunk) without being constrained by the VLM's pre-trained MLP structure.
The π₀ architecture is not "a VLM with an action head bolted on." It is a shared attention backbone with two specialist MLP tracks. The attention layers are the common language; the MLPs are the specialized dialects. This is why it works better than either a pure VLM (which wastes MLP capacity on language tasks irrelevant to robotics) or a pure action model (which lacks the visual-semantic understanding that internet pre-training provides).
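The routing rule is simple enough to show in a toy layer. Everything numeric here is a placeholder: the "attention" is a mean-mixing stand-in and the two MLP tracks are trivial functions, chosen only so the routing is visible. It is a sketch of the shared-attention, split-MLP idea, not of π₀'s actual layers.

```python
def shared_attention(tokens):
    """Stand-in for self-attention: mix every token with the mean of all tokens."""
    dim = len(tokens[0])
    mean = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    return [[0.5 * x + 0.5 * m for x, m in zip(tok, mean)] for tok in tokens]


def vlm_mlp(vec):
    """Placeholder for the VLM MLP track (image, text, proprio tokens)."""
    return [2.0 * x for x in vec]


def action_mlp(vec):
    """Placeholder for the action expert MLP track (action tokens)."""
    return [x - 1.0 for x in vec]


def moe_layer(tokens, kinds):
    """One layer: attention over all tokens, then an MLP chosen by token kind."""
    mixed = shared_attention(tokens)
    out = []
    for vec, kind in zip(mixed, kinds):
        mlp = action_mlp if kind == "action" else vlm_mlp
        out.append(mlp(vec))
    return out


tokens = [[1.0, 1.0], [3.0, 3.0]]
kinds = ["image", "action"]
layer_out = moe_layer(tokens, kinds)
print(layer_out)
```

Note that the action token's output depends on the image token through the shared mixing step even though it never touches `vlm_mlp`; that is the sense in which attention is the common channel and the MLPs are the specialized tracks.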
12·F SmolVLA: the small-VLA revolution
The counter-movement to scaling: efficient VLAs that run on consumer hardware and still match the giants on most benchmarks.
The VLA story through 2024 was one of relentless scaling: RT-2 at 55B, OpenVLA at 7B, π₀ at 3.3B. Each model assumed that more parameters meant better generalization. Then 2025 arrived, and a 450M-parameter model matched the 7B one on the benchmarks that mattered. The bottleneck was never model size. It was data quality and action head design.
SmolVLA: the architecture
Shukor et al., 2025 (Hugging Face). SmolVLA achieves comparable performance to OpenVLA-7B with only 450M parameters. Four architectural choices make this possible:
Efficient VLM backbone: SmolVLM. Instead of a 7B Llama, SmolVLA uses SmolVLM, a compact vision-language model with better token efficiency. SmolVLM was designed for edge deployment from the start, with aggressive knowledge distillation from larger VLMs. The 450M-parameter count includes both the VLM backbone and the action expert.
Flow matching action expert. Borrowed directly from π₀. Instead of discretizing actions into 256 bins (which wastes capacity on the tokenization overhead), SmolVLA routes actions through a continuous flow matching expert. This is more expressive per parameter — the network's capacity goes toward modeling the action distribution, not toward learning a binning scheme.
Aggressive image token pooling. Standard VLAs produce 256 image tokens per camera (from a 224/14 ViT). SmolVLA pools these down to 64 tokens via spatial average pooling before feeding them to the backbone. Fewer tokens = quadratically less attention computation. The information loss is minimal for manipulation tasks, where fine-grained spatial detail matters less than object identity and relative position.
LoRA for task adaptation. Only ~2% of parameters are task-specific. The base model is frozen; per-task LoRA adapters (rank 16, ~9M params) handle specialization. This means the "model" is actually a 450M frozen core plus a library of 9M-parameter adapters — one per task.
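The token pooling step in the list above can be illustrated on a toy feature grid. The text specifies spatial average pooling from 256 to 64 tokens; the 2×2 window on a 16×16 grid is the assumption that produces that reduction, and the two-number "features" are placeholders for real patch embeddings.

```python
def avg_pool_tokens(grid, window=2):
    """Average-pool a square grid of feature vectors with a window x window kernel."""
    side = len(grid)
    pooled = []
    for r in range(0, side, window):
        row = []
        for c in range(0, side, window):
            block = [grid[r + dr][c + dc] for dr in range(window) for dc in range(window)]
            dim = len(block[0])
            row.append([sum(v[i] for v in block) / len(block) for i in range(dim)])
        pooled.append(row)
    return pooled


# 224 / 14 = 16 patches per side; each toy feature stores its (row, col) position.
grid = [[[float(r), float(c)] for c in range(16)] for r in range(16)]
pooled = avg_pool_tokens(grid, window=2)

tokens_before = len(grid) * len(grid[0])
tokens_after = len(pooled) * len(pooled[0])
# Self-attention cost grows with the square of the token count.
attention_ratio = (tokens_before / tokens_after) ** 2

print(f"{tokens_before} -> {tokens_after} tokens per camera")
print(f"pairwise attention cost shrinks by about {attention_ratio:.0f}x for these tokens")
```

The 16× figure applies to attention among image tokens only; text and action tokens in the same sequence dilute the overall saving.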
Why smaller works for robotics
A 7B language model can write poetry, solve calculus, and debate philosophy. A robot manipulator needs to understand "red cup," "pick up," and "to the left of." These are a tiny fraction of the capabilities a 7B model encodes. Most of a large LLM's capacity is dedicated to linguistic and reasoning abilities that are never exercised during manipulation.
SmolVLA's 450M parameters are sufficient because tabletop manipulation requires:
Object recognition at the category level (cup, block, bowl) — not the fine-grained distinctions (Labrador vs Golden Retriever) that large vision models excel at.
Spatial reasoning over a small workspace (~1m³) — not the global reasoning (which country is this?) that large models encode.
Instruction following for simple verb-noun commands ("pick up," "place on") — not the compositional language understanding (nested clauses, sarcasm, metaphor) that large models handle.
Action prediction via the flow matching expert, which is a network of a few hundred million parameters at most (300M in π₀) regardless of backbone size.
The 450M model covers all four requirements. The remaining 6.5B parameters in a 7B VLA are paying for capabilities the robot never uses.
Deployment economics
This is where the small-VLA thesis becomes a business argument, not just a research one:
| Metric | OpenVLA-7B | π₀ 3.3B | SmolVLA 450M |
| --- | --- | --- | --- |
| GPU required | A100 (80GB) | A100 (40GB) | RTX 3090 (24GB) |
| GPU cost | ~$10,000 | ~$10,000 | ~$1,000 |
| Inference latency | ~200ms | ~100ms | ~60ms |
| Cloud inference? | Required for most setups | Required | Optional; runs on-robot |
| INT8 quantized? | Still needs A10G+ | Fits on RTX 4090 | Fits on Jetson Orin |
| Network dependency | Yes (cloud GPU) | Yes | No; on-robot is feasible |
| Cost per 100 robots | ~$1M in GPUs | ~$1M | ~$100K |
The implications cascade. On-robot inference means no network round-trip latency (saving 10–50ms per step depending on the cloud setup). It means the robot works when WiFi drops. It means deployment in warehouses, factories, and homes where cloud connectivity is unreliable or forbidden by policy. SmolVLA does not just save money — it unlocks deployment scenarios that 7B models physically cannot reach.
Training recipe
SmolVLA's training follows the same three-stage pipeline as π₀, scaled down:
Pre-train SmolVLM on web data. Vision-language pre-training on image-text pairs. This is done by the SmolVLM team, not the robotics lab. The result is a compact VLM with strong visual-semantic features.
Co-fine-tune with flow matching action expert on Open X-Embodiment + DROID. Mixed batches of web VQA and robot demonstration data. The flow matching action expert is trained from scratch; the SmolVLM backbone is updated with a low learning rate. Duration: ~48 hours on 8 GPUs.
LoRA fine-tune for target task. Freeze everything, attach rank-16 LoRA adapters, train on 50–200 task-specific demonstrations. Duration: 30–60 minutes on a single RTX 3090.
Performance comparison
| Benchmark | Metric | OpenVLA-7B | π₀ 3.3B | SmolVLA 450M |
| --- | --- | --- | --- | --- |
| LIBERO-Long | Success % | 53.3 | 68.4 | 62.1 |
| LIBERO-Spatial | Success % | 78.9 | 85.2 | 81.0 |
| SimplerEnv (visual matching) | Success % | 26.1 | 41.7 | 38.8 |
| Bridge real-world | Success % | 72.0 | 81.0 | 74.5 |
| Params | - | 7,000M | 3,300M | 450M |
| Inference GPU | - | A100 | A100 | RTX 3090 |
SmolVLA at 450M exceeds OpenVLA-7B on every benchmark in the table while running on a GPU that costs 1/10th as much. It trails π₀ by roughly 3–6.5 points (about 5 on average), a gap that comes primarily from π₀'s larger backbone and more diverse pre-training data, not from the action head design (both use flow matching).
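The per-benchmark gaps can be read straight off the performance table. The snippet below only re-enters the four success rates listed there and subtracts them; the dictionary keys are shorthand chosen here.

```python
results = {
    "LIBERO-Long":    {"openvla": 53.3, "pi0": 68.4, "smolvla": 62.1},
    "LIBERO-Spatial": {"openvla": 78.9, "pi0": 85.2, "smolvla": 81.0},
    "SimplerEnv":     {"openvla": 26.1, "pi0": 41.7, "smolvla": 38.8},
    "Bridge":         {"openvla": 72.0, "pi0": 81.0, "smolvla": 74.5},
}


def gaps(table, model, baseline):
    """Per-benchmark difference (model minus baseline), rounded to one decimal."""
    return {name: round(row[model] - row[baseline], 1) for name, row in table.items()}


vs_openvla = gaps(results, "smolvla", "openvla")
vs_pi0 = gaps(results, "smolvla", "pi0")
mean_gap_to_pi0 = sum(vs_pi0.values()) / len(vs_pi0)

print("SmolVLA minus OpenVLA-7B:", vs_openvla)
print("SmolVLA minus pi0:       ", vs_pi0)
print(f"mean gap to pi0: {mean_gap_to_pi0:.2f} points")
```

Four benchmarks is a small sample, so the average is best read as a rough summary rather than a precise ranking.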
The small-VLA zoo
SmolVLA is not alone. A wave of efficient VLA models appeared in 2024–2025, each making a different bet on how to shrink the model without losing capability:
| Model | Params | Backbone | Action head | Key insight |
| --- | --- | --- | --- | --- |
| TinyVLA | 1B | Distilled VLM | Discrete tokens | Most of 7B is wasted on non-robot capabilities |
| SmolVLA | 450M | SmolVLM | Flow matching expert | Flow matching is more parameter-efficient than discrete tokens |
| RDT-1B | 1.2B | DiT | Diffusion | DiT architecture works for actions, not just images |
| Octo | 27M / 93M | Custom transformer | Diffusion / discrete | Smallest viable generalist; best fine-tuning UX |
The trend is clear: the next generation of VLAs will be measured not by how large they are but by how much performance they deliver per parameter and per dollar of inference hardware.
When to go small vs large
The decision tree is simpler than it looks:
Fewer than 10 tasks on one robot, known objects: SmolVLA or TinyVLA with LoRA. The 450M model handles this with room to spare. Runs on consumer hardware, fine-tunes in under an hour.
10–50 tasks, single or dual embodiment, some novel objects: SmolVLA or π₀ depending on compute budget. If you have an A100, π₀ will give you a few extra points. If you're deploying to edge hardware, SmolVLA.
Cross-embodiment generalization across 50+ tasks and 5+ robot platforms: π₀ or GR00T N1. The larger backbone's capacity is justified by the diversity of the task distribution. You need the extra parameters to encode the motor knowledge for multiple embodiments.
Real-time on cheap hardware (Jetson, consumer GPU, no cloud): SmolVLA with INT8 quantization. No alternative exists at this price point.
The 2025 plot twist: a 450M-parameter model matches a 7B one on most benchmarks. The bottleneck was never model size; it was data quality and action head design. On the benchmarks above, flow matching over continuous actions is more parameter-efficient than discrete tokenization. If you are starting a new VLA project and do not have a cluster of A100s, start with SmolVLA. You can always scale up later if the task distribution demands it.
13 Vision encoders
The eyes of the robot. Where most policies still leave performance on the table.
The vision encoder converts pixels into tokens or feature vectors that the policy consumes. The choice of encoder is a major lever — both for sample efficiency (a good prior cuts demonstrations needed by 3–10×) and for generalization (the encoder is what determines whether "red mug" and "blue mug" share a representation).
The encoders worth knowing
| Encoder | Training | Why it's used |
| --- | --- | --- |
| ResNet-18 | ImageNet supervised | Cheap, fast, enough for single-task BC. The ACT default. |
| CLIP (ViT-B/16) | Image–text contrastive on 400M pairs | Language-aligned features. Standard for VLAs and UMI. |
| DINOv2 (ViT-L/14) | Self-supervised distillation, 142M images | Best raw visual features. Used in OpenVLA alongside SigLIP. |
| SigLIP | Sigmoid contrastive image–text | Stronger language alignment than CLIP at scale. |
| R3M | Time-contrastive + language alignment on Ego4D | Manipulation-aligned. Strong with little data. |
| VC-1 | MAE on Ego4D + ImageNet | Robust low-shot performance. |
Three eras
ImageNet-pretrained ResNet (until ~2022). Standard ResNet-18 or ResNet-50, frozen or fine-tuned. Cheap, good enough, the backbone of ACT and most pre-VLA work.
Self-supervised on robot or egocentric video (2022–2023). R3M, VC-1, MVP. Trained on Ego4D and similar; the priors are closer to manipulation distributions than ImageNet's.
Frontier vision foundation models (2023–present). DINOv2, SigLIP, CLIP. Either used directly or distilled.
DINOv2 vs CLIP: why self-supervised beats language-aligned for manipulation
DINOv2 (Oquab et al., 2023) is a vision transformer trained purely on images via self-supervised distillation — no text, no language, no captions. It learns to produce features where visually similar regions have similar embeddings. The result: DINOv2 features are spatially discriminative — they distinguish "the left edge of the cup handle" from "the right edge of the cup handle" at the patch level.
CLIP (Radford et al., 2021) is trained via image-text contrastive learning. It learns features that align images with their captions. This makes CLIP excellent at semantic understanding ("this is a mug," "this is a sponge") but mediocre at spatial discrimination. CLIP's features are optimized to match an entire image to a sentence, not to distinguish sub-centimeter spatial differences within an image.
For manipulation, spatial discrimination is paramount. The policy needs to know where exactly on the object to place the fingers, not just what the object is. This is why frozen DINOv2 often outperforms fine-tuned CLIP for manipulation: DINOv2's self-supervised objective produces spatially richer features that the downstream policy can exploit for precise positioning.
When CLIP wins anyway. If the task requires language grounding — "pick up the red cup, not the blue one" — CLIP's language alignment becomes essential. The optimal choice for VLAs is often to use both: DINOv2 for spatial features and SigLIP/CLIP for language alignment, concatenated into a dual-encoder. This is exactly what OpenVLA does (DINOv2 + SigLIP).
Encoder comparison: the full picture
| Encoder | Pretraining data | Output type | Params | Best for | Typical use |
| --- | --- | --- | --- | --- | --- |
| ResNet-18 | ImageNet (1.3M images) | Global (avg pool) | 11M | Single-task BC, speed-critical | ACT, legacy BC |
| CLIP ViT-B/16 | 400M image-text pairs | Global (CLS) + spatial (patch) | 86M | Language-conditioned tasks | UMI, RT-2 family |
| SigLIP ViT-L | 3B image-text pairs | Global + spatial | 304M | Stronger language alignment at scale | OpenVLA, PaliGemma |
| DINOv2 ViT-L/14 | 142M images (self-supervised) | Spatial (per-patch) | 304M | Spatial discrimination, manipulation | OpenVLA (paired), 3D policies |
| R3M | Ego4D + language | Global | ~50M | Low-data manipulation | Small-lab BC |
| VC-1 | Ego4D + ImageNet (MAE) | Spatial | ~300M | Robust low-shot | Academic benchmarks |
Freeze vs fine-tune: the decision boundary
The question of whether to freeze or fine-tune the vision encoder is primarily a function of dataset size, and the transition is sharper than most practitioners realize:
<100 demonstrations. Freeze everything. A frozen foundation model encoder with a linear probe on top. Fine-tuning any part of the encoder will overfit catastrophically — you have orders of magnitude fewer samples than the encoder has parameters.
100–1,000 demonstrations. Freeze the encoder, train a small adapter (2–3 transformer layers, ~8–16M parameters) on top. This is the sweet spot for most manipulation research. The adapter learns task-specific feature combinations without destroying the pretrained representations.
1,000–10,000 demonstrations. You can begin fine-tuning the last 2–4 layers of the encoder with a low learning rate (10× lower than the adapter). The earlier layers stay frozen — they contain low-level features (edges, textures) that are universal.
>10,000 demonstrations. Full fine-tuning is viable and often beneficial. At this scale, even ViT-L encoders improve from task-specific adaptation. But monitor for overfitting: track validation loss per epoch and stop early.
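The four regimes above reduce to a lookup on dataset size. A minimal sketch follows; the function name is illustrative, and the handling of the exact boundaries (a dataset of exactly 1,000 demonstrations falls in the partial fine-tuning bucket, exactly 10,000 stays there too) is an assumption, since the ranges in the text overlap at their endpoints.

```python
def encoder_training_strategy(num_demos: int) -> str:
    """Map a demonstration count to the freeze/fine-tune regime described above."""
    if num_demos < 0:
        raise ValueError("num_demos must be non-negative")
    if num_demos < 100:
        return "freeze encoder; linear probe on top"
    if num_demos < 1_000:
        return "freeze encoder; train a small adapter (~8-16M params)"
    if num_demos <= 10_000:
        return "fine-tune last 2-4 encoder layers at a low learning rate"
    return "full fine-tuning; track validation loss and stop early"


for n in (50, 500, 5_000, 50_000):
    print(f"{n:>6} demos: {encoder_training_strategy(n)}")
```

Encoding the thresholds once makes them easy to revisit when a new encoder or dataset shifts where overfitting actually begins.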
Frozen or fine-tuned?
The dominant practice in 2026 is frozen encoder + small adapter for foundation models, and full fine-tune for ResNet-scale encoders. The reasons:
Fine-tuning a 300M+ parameter ViT on a few thousand robot demonstrations destroys the pretraining priors. The robot data is too narrow to support the fine-tune.
A frozen encoder + a learnable linear probe or small transformer adapter preserves the priors and trains in hours.
For ResNet-18-scale encoders, the prior is weak enough that fine-tuning helps — and the data is abundant enough to support it.
Worked example: encoder choice decision tree. Q1: How many demos do you have?
<100: Use a frozen foundation model (CLIP or DINOv2). No fine-tuning. Linear probe or small MLP on top.
100–1,000: Frozen ViT + small transformer adapter (8–16M learnable params). This is the sweet spot for most manipulation.
1,000–10,000: You can fine-tune ResNet-18 end-to-end, or use a frozen ViT with a larger adapter.
>10,000: Fine-tune everything. At this scale, even ViTs benefit.
Q2: Do you need language conditioning?
Yes: Use CLIP or SigLIP (language-aligned). DINOv2 has no language alignment.
No: DINOv2 gives the best raw visual features. Pair with CLIP only if you need text later.
Q3: How fast does inference need to be?
<10ms: ResNet-18 (single forward: ~2ms).
10–50ms: ViT-B/16 (~8ms frozen, batch-1 GPU).
>50ms: ViT-L/14 (~20ms). Only viable with the slow-fast split.
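The three questions above can be sketched as three small lookup functions. This is only an illustration of the decision tree: the function names are hypothetical, and the handling of the exact boundary values (100, 1,000, 10,000 demos; 10 ms and 50 ms) is an assumption, since the text leaves the boundaries open.

```python
def finetune_regime(n_demos: int) -> str:
    """Q1: map dataset size to a fine-tuning regime."""
    if n_demos < 100:
        return "frozen encoder + linear probe"
    if n_demos < 1_000:
        return "frozen encoder + small adapter"
    if n_demos < 10_000:
        return "fine-tune last encoder layers at low LR"
    return "full fine-tune"


def encoder_family(needs_language: bool) -> str:
    """Q2: language-aligned encoders only when the policy is conditioned on text."""
    return "CLIP or SigLIP" if needs_language else "DINOv2"


def largest_backbone(latency_budget_ms: float) -> str:
    """Q3: largest backbone that fits the per-step vision latency budget."""
    if latency_budget_ms < 10:
        return "ResNet-18"
    if latency_budget_ms <= 50:
        return "ViT-B/16"
    return "ViT-L/14"


print(finetune_regime(200), "|", encoder_family(False), "|", largest_backbone(30))
```

The three answers are independent in this sketch; in practice they interact (a 5 ms budget rules out a ViT regardless of how many demos you have), so treat the output as a starting point rather than a verdict.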
What makes DINOv2 special for manipulation
DINOv2 deserves special attention because its design choices align unusually well with manipulation requirements. Three properties matter:
Patch-level spatial features. Unlike CLIP, which is optimized for image-level classification (matching an image to a caption), DINOv2's self-supervised objective (self-distillation with no labels) forces every patch token to be informative about its local region. The result: DINOv2's 256 patch tokens form a spatial map where nearby patches in the image have nearby representations in feature space. For a manipulation policy, this means the encoder preserves the fine-grained spatial structure needed to distinguish "top of the cup" from "side of the cup" at the feature level.
Robustness to viewpoint changes. DINOv2's training includes aggressive multi-crop augmentation: the student network sees small crops (covering 5–20% of the image) and must match the teacher's representation of the full image. This forces the features to be robust to scale and viewpoint changes — a property that transfers directly to manipulation, where the wrist camera's view of the object changes dramatically as the gripper approaches.
No language bias. CLIP features are biased toward the kinds of visual distinctions that language describes well ("red" vs "blue", "cat" vs "dog") and de-emphasize distinctions that language ignores (spatial layout, fine texture, sub-object structure). DINOv2 has no such bias — it treats all visual information equally. For manipulation, the spatially fine-grained information that DINOv2 preserves (edge geometry, surface normals implied by shading, grasp affordance cues) is exactly what CLIP discards.
The practical recommendation. For language-conditioned policies (VLAs), use DINOv2 + SigLIP (or CLIP) as a dual encoder — DINOv2 for spatial features, SigLIP for language alignment. For single-task BC without language, use DINOv2 alone. For speed-critical deployments where inference latency matters more than feature quality, use ResNet-18 fine-tuned end-to-end. Never use CLIP alone for manipulation unless language grounding is the primary requirement.
The adapter architecture
When using a frozen encoder, the adapter that sits between the encoder and the policy is the only learnable visual component. Two designs dominate:
Linear probe. A single linear layer from encoder dimension to policy input dimension. The simplest adapter: $z_{\text{policy}} = W z_{\text{encoder}} + b$ where $W \in \mathbb{R}^{d_{\text{policy}} \times d_{\text{encoder}}}$. Trainable parameters: $d_{\text{policy}} \times d_{\text{encoder}} \approx 200K$. Works surprisingly well with <100 demos. The linear probe is a good diagnostic: if it performs poorly, the encoder's features are not suited for the task.
Small transformer adapter. 2–4 transformer layers that process the encoder's spatial tokens and output a fixed number of "policy tokens" via cross-attention. Trainable parameters: 2–16M. This adapter can learn non-linear feature combinations and spatial aggregation patterns that a linear probe cannot. The cross-attention allows the adapter to dynamically focus on task-relevant regions of the image — attending to the gripper and object during contact, the broader scene during navigation.
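As a quick sanity check on the parameter count quoted for the linear probe, the arithmetic can be sketched as below. The widths 768 (a ViT-B token) and 256 (the policy input) are assumptions chosen for illustration; they are not stated in the text.

```python
def linear_probe_params(d_encoder: int, d_policy: int) -> int:
    """Weights plus biases of a single linear layer z_policy = W z_encoder + b."""
    return d_policy * d_encoder + d_policy


# Assumed sizes: encoder token width 768, policy input width 256.
n_params = linear_probe_params(d_encoder=768, d_policy=256)
print(n_params)  # 196864, i.e. roughly 200K
```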
Multi-camera fusion
Two strategies. Late fusion: encode each camera independently, concatenate or attention-fuse the resulting tokens before the policy. This is the standard. Early fusion: stitch images side-by-side or stack channels. Cheap but throws away camera identity.
Cross-attention works better than concatenation when one camera dominates (e.g., the wrist cam during contact). The policy can route attention to the camera that matters at each timestep.
Depth as an input channel
An alternative to full 3D point clouds is to add a depth channel to the 2D image: feed the encoder a 4-channel RGBD image instead of 3-channel RGB. This preserves the 2D pipeline (ViT, CNN) while giving the policy access to depth information. Two approaches:
Naive concatenation. Stack the depth channel alongside RGB to get a 4-channel input. The encoder's first convolutional layer must be modified to accept 4 channels (typically by copying the red channel's weights to initialize the depth channel). The rest of the network is unchanged. This is the simplest approach and works well with from-scratch training, but frozen pretrained ViTs cannot easily accept a 4th channel without re-training.
Separate depth encoder. Process RGB and depth with separate encoders, then fuse the features (concatenation or cross-attention). This preserves the pretrained RGB encoder while adding depth information. The depth encoder can be small (ResNet-18) because depth is lower-dimensional than RGB. The fusion point matters: early fusion (before the policy) gives the policy more information; late fusion (inside the policy) is more flexible.
The empirical finding: depth helps most for tasks where occlusion is the bottleneck (objects behind other objects, cluttered scenes) and least for tasks where appearance is the bottleneck (color-based sorting, texture-based grasping). For standard tabletop manipulation with an overhead camera, depth provides a 3–8% success rate improvement over RGB alone.
The wrist camera question
One of the most impactful architectural decisions in robot vision is whether to include a wrist-mounted camera. The wrist camera sees the object from the gripper's perspective — close up, during contact, at the moment when precision matters most. A scene camera 60cm above the workspace sees the broad layout but loses fine detail at the contact point.
The empirical evidence is clear: wrist cameras improve success rates by 10–25% on contact-rich tasks (insertion, precision pick, tool use) and have little effect on large-motion tasks (reaching, navigation). The reason is resolution: at a distance of 5cm, a 224×224 wrist camera image covers a ~4cm×4cm area at ~0.2mm/pixel resolution. The same object viewed by a scene camera 60cm away covers ~12×12 pixels — far too coarse for sub-millimeter positioning.
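The resolution argument is a one-line division, sketched here with the numbers from the paragraph above. The helper name is hypothetical, and the scene-camera figure assumes the same 4 cm patch spans about 12 pixels.

```python
def mm_per_pixel(field_of_view_mm: float, pixels: int) -> float:
    """Spatial resolution when `field_of_view_mm` is spread across `pixels`."""
    return field_of_view_mm / pixels


wrist = mm_per_pixel(40.0, 224)  # 4 cm across 224 px (wrist camera at 5 cm)
scene = mm_per_pixel(40.0, 12)   # the same 4 cm across ~12 px (scene camera at 60 cm)
print(f"wrist {wrist:.2f} mm/px, scene {scene:.2f} mm/px, ratio {scene / wrist:.1f}x")
```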
The cost is engineering: the wrist camera adds a cable to the robot's arm (risk of snagging), requires calibrating the camera-to-gripper transform, and doubles the vision encoder's compute budget. For mobile robots, cable routing is particularly painful. The practical compromise: use a wrist camera for manipulation, omit it for navigation-only tasks.
Augmentation
Three augmentations earn their seat:
Random shifts ($\pm$4 pixels): simulates camera calibration error and reduces the sim-to-real gap.
Color jitter — mild brightness, contrast, saturation. Critical for any policy that will see different lighting at deploy time.
Random crops at test time: DrQ-v2's trick. Sample multiple crops at inference and average the encoder outputs.
Augmentations that don't earn their seat: heavy cutout, MixUp, anything that changes the geometry between the wrist camera and the gripper. The policy is not invariant to these — it depends on them.
Image preprocessing for robot vision
The standard image preprocessing pipeline for robot policies has subtle but important differences from the standard vision pipeline:
Resolution. 224×224 is the default (matches ViT-B pretraining). 336×336 for tasks requiring fine detail (insertion, threading). Never go below 128×128 — the policy loses critical spatial information.
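Normalization. Match the encoder's pretraining statistics, and read them from the checkpoint's preprocessing config rather than assuming. DINOv2 uses ImageNet mean/std. OpenAI CLIP ships its own mean/std (close to ImageNet's but not identical), and SigLIP checkpoints typically normalize with mean 0.5 and std 0.5 per channel. For a from-scratch ResNet: compute mean/std from your robot data.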
Crop vs resize. Center-crop to square before resizing. Do not stretch: aspect ratio distortion confuses spatial reasoning. If the camera has a 4:3 aspect ratio (wider than tall), crop the left and right edges to make it 1:1.
Color space. Always RGB, never BGR. OpenCV defaults to BGR; failing to convert is a silent bug that degrades performance by 5–10% (the encoder's features become misaligned).
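A minimal sketch of the crop, channel-order, and normalization steps, assuming NumPy and a BGR `uint8` frame as delivered by OpenCV. The function names are hypothetical, the ImageNet statistics shown are the DINOv2 case, and the resize to the encoder's input resolution is left to whatever image library you already use.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def center_crop_square(img: np.ndarray) -> np.ndarray:
    """Crop the longer side symmetrically so an (H, W, C) image becomes square."""
    h, w = img.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return img[top:top + side, left:left + side]


def preprocess_bgr(img_bgr: np.ndarray, mean=IMAGENET_MEAN, std=IMAGENET_STD) -> np.ndarray:
    """BGR uint8 frame -> square, RGB, normalized float32 (resize not included)."""
    rgb = center_crop_square(img_bgr)[..., ::-1]  # reverse channel axis: BGR -> RGB
    x = rgb.astype(np.float32) / 255.0
    return (x - mean) / std


frame = np.zeros((480, 640, 3), dtype=np.uint8)  # 4:3 camera frame
frame[..., 0] = 255                              # pure blue in BGR order
out = preprocess_bgr(frame)
print(out.shape)
```

After the channel flip, the blue intensity sits in the last channel, which is a cheap assertion to keep in a data-loading test so the BGR bug cannot return silently.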
Worked example: vision encoder ablation. Task: pick up a randomly placed object on a table. 200 demonstrations. Franka robot with one scene camera and one wrist camera.
ResNet-18, fine-tuned end-to-end: 72% success. Fast inference (4ms). Overfits slightly at 200 demos but still serviceable. The encoder's limited capacity forces the policy to learn simple spatial features.
CLIP ViT-B/16, frozen + linear probe: 68% success. Surprisingly, worse than ResNet-18. CLIP's language alignment pulls features toward semantic similarity rather than spatial precision. The linear probe cannot compensate for the lost spatial information.
DINOv2 ViT-B/14, frozen + 2-layer adapter: 81% success. The best result. DINOv2's patch-level features preserve fine spatial structure. The adapter learns to focus on contact-relevant patches. 8ms encoder + 2ms adapter = 10ms total.
DINOv2 + SigLIP dual encoder, frozen + adapter: 80% success (no language in this task, so SigLIP adds nothing). But when language conditioning is added ("pick up the RED object"), this dual encoder reaches 85% while DINOv2-only drops to 65% (its features carry no language alignment, so the policy cannot ground the word "red").
Takeaway: the encoder choice depends on the task. For spatial precision: DINOv2. For language grounding: SigLIP/CLIP. For both: dual encoder. For minimum cost: ResNet-18 fine-tuned.
The three augmentations that matter. Extensive ablation studies across manipulation benchmarks consistently find that three augmentations earn their seat. Their effects are additive and their computational cost is negligible (<1ms per image):
Random shifts. Translate the image by $\pm$4 pixels in each direction (pad edges with border pixels). This simulates camera calibration error — between sessions, a robot's cameras shift by a few pixels due to thermal expansion, vibration, or accidental bumps. Without shift augmentation, a policy trained on Monday fails on Tuesday because the pixels moved. With it, the policy is robust to $\pm$4-pixel camera motion. The cost of not using this augmentation is typically 10–15% success rate degradation in real-world deployment.
Color jitter. Randomly perturb brightness (±20%), contrast (±20%), and saturation (±10%). Different rooms have different lighting; the same task under fluorescent lights vs. natural light looks dramatically different to a pixel-level policy. Color jitter forces the encoder to focus on shapes and spatial structure rather than absolute color values.
Random crop at test time (DrQ-v2 trick). At inference, take $M = 2$ random crops of the input image, run the encoder on each, and average the resulting features. This is a test-time augmentation that smooths out the policy's sensitivity to exact crop position. The compute cost is $M\times$ the encoder cost, so it is only viable with fast encoders (ResNet-18) or when the encoder is already cached.
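The first two augmentations can be sketched in a few lines of NumPy. This is an illustrative version, not a reference implementation: the function names are hypothetical, images are assumed to be float arrays in [0, 1] with shape (H, W, C), and saturation jitter is omitted to keep the sketch short.

```python
import numpy as np


def random_shift(img: np.ndarray, rng: np.random.Generator, pad: int = 4) -> np.ndarray:
    """Pad with edge pixels, then crop back at a random offset in [-pad, +pad]."""
    h, w = img.shape[:2]
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    dy, dx = rng.integers(0, 2 * pad + 1, size=2)
    return padded[dy:dy + h, dx:dx + w]


def color_jitter(img: np.ndarray, rng: np.random.Generator,
                 brightness: float = 0.2, contrast: float = 0.2) -> np.ndarray:
    """Scale brightness, then stretch contrast around the image mean."""
    b = 1.0 + rng.uniform(-brightness, brightness)
    c = 1.0 + rng.uniform(-contrast, contrast)
    out = img * b
    out = (out - out.mean()) * c + out.mean()
    return np.clip(out, 0.0, 1.0)


rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
aug = color_jitter(random_shift(img, rng), rng)
print(aug.shape)
```

When a policy consumes several cameras, a common choice is to draw the shift independently per camera, since calibration drift is not shared across mounts; that choice is an assumption here, not something the text prescribes.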
Feature caching and batch inference
A practical optimization that saves 30–50% of inference time: because the vision encoder is frozen, its outputs can be cached and reused. If the camera image has not changed significantly between control steps (which is common at 50Hz+ control rates), the encoder features from the previous step can be reused without recomputing. A simple L2 distance threshold on the raw image determines whether to recompute: if $\|I_t - I_{t-1}\|_2 / N_{\text{pixels}} < \epsilon$ (typical $\epsilon = 0.01$), reuse the cached features.
For training, the optimization is even more dramatic: pre-compute all encoder features for the entire dataset before training begins. Store them as tensors on disk. The training loop reads features directly, bypassing the encoder entirely. This cuts training time by 2–3× for large ViT encoders and reduces GPU memory (no encoder in the training graph). The only requirement is that the encoder is truly frozen — if any gradient flows through the encoder, pre-computation is invalid.
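The inference-time cache can be sketched as a thin wrapper around any frozen encoder. Two assumptions to flag: the class name and interface are hypothetical, and the comparison is made against the last frame that was actually encoded rather than the immediately previous frame, so slow drift cannot accumulate unnoticed. Note also that the normalized L2 test is scale- and resolution-dependent; the default threshold below assumes pixel values on the 0–255 scale.

```python
import numpy as np


class FeatureCache:
    """Reuse encoder features while incoming frames stay nearly identical."""

    def __init__(self, encoder, eps: float = 0.01):
        self.encoder = encoder       # any callable: image -> features
        self.eps = eps
        self.last_image = None       # last frame that was actually encoded
        self.last_features = None
        self.encoder_calls = 0

    def __call__(self, image: np.ndarray) -> np.ndarray:
        if self.last_image is not None:
            # ||I_t - I_cached||_2 / N, with N counting every array element.
            diff = np.linalg.norm((image - self.last_image).ravel()) / image.size
            if diff < self.eps:
                return self.last_features
        self.last_image = image.copy()
        self.last_features = self.encoder(image)
        self.encoder_calls += 1
        return self.last_features


# Stand-in "encoder": per-channel mean. Frames are float32 on the 0-255 scale.
cache = FeatureCache(encoder=lambda img: img.mean(axis=(0, 1)))
frame = np.full((224, 224, 3), 100.0, dtype=np.float32)
f1 = cache(frame)            # first frame: encoder runs
f2 = cache(frame + 1.0)      # one intensity level of change: features reused
f3 = cache(frame + 50.0)     # large change: encoder runs again
print(cache.encoder_calls)   # 2
```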
13·B Language-conditioned planning
The bridge between LLMs and robot actions — using language models as high-level planners that decompose tasks into primitives a low-level policy can execute.
A VLA puts language understanding and motor control inside a single forward pass. Language-conditioned planning takes the opposite approach: a large language model plans, and a separate low-level policy acts. The LLM never touches the joints. It proposes a sequence of subgoals or primitive calls, and a pre-trained controller executes each one. The appeal is compositionality: a robot that knows 20 primitives can, in principle, solve any task that decomposes into a sequence of those 20 primitives — without any new training data.
The hierarchy runs three levels deep: a VLM or LLM for "what to do" (task decomposition and common-sense reasoning), a planner for "how to do it" (sequencing, constraint satisfaction, spatial reasoning), and a low-level policy for "muscle memory" (the actual motor commands). Each level operates at a different frequency and a different level of abstraction. The open question is where to draw the boundaries.
SayCan — affordance grounding
Ahn et al., 2022. The founding paper of the paradigm. The setup: a large language model (PaLM) proposes candidate next actions in natural language ("pick up the sponge", "go to the sink", "wipe the counter"). For each candidate, a pre-trained value function scores the probability that the robot can actually execute it right now, given its current state. The selected action is the one that maximizes the product of language usefulness (from the LLM) and physical feasibility (from the value function).
In plain English: the language model plays the role of a chef calling out orders — it knows what dish to make but has no idea which ingredients are within arm's reach. The value function plays the role of the cook at the station — it knows exactly what it can grab right now but has no idea about the recipe. Multiply the two scores together and the highest-scoring action is both useful for the task and physically doable.
$\mathcal{A}$ — the set of available primitives. Each is a short natural-language description paired with a pre-trained low-level policy. Typical set: 50–100 primitives covering navigation, picking, placing, opening, closing.
$p_{\text{LLM}}(a_i \mid \text{instruction}, \text{history})$ — the language model's score for how useful action $a_i$ is toward completing the instruction, given what has already been done. This is the LLM's next-token probability for the action string.
$V(s, a_i)$ — the affordance score. A value function trained via RL or BC that estimates the probability of successfully executing primitive $a_i$ from the current state $s$. This is the robot's self-knowledge: "I can pick up the sponge from here, but I can't reach the shelf."
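Written out with the symbols just defined, the selection rule described above takes the following form (the notation follows this section and may differ from the paper's):

$$a^{*} = \arg\max_{a_i \in \mathcal{A}} \; p_{\text{LLM}}(a_i \mid \text{instruction}, \text{history}) \cdot V(s, a_i)$$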
In code: scores = llm_probs * affordance_values over the candidate skills, then best = skills[scores.argmax()]. The LLM probabilities come from the model's next-token logits over the skill name strings. The affordance values come from pre-trained per-skill value functions evaluated on the current observation. Total inference: one LLM forward pass + N value-function forward passes (N = number of skills, typically 50–100).
The product is elegant: the LLM says what's useful, the robot says what's possible. Neither alone is sufficient — the LLM doesn't know the robot's reach, and the value function doesn't know what task the human wants. Together they ground language in physical reality.
Worked example: SayCan in action. Instruction: "I spilled my drink, can you help?" The robot is in a kitchen.
Step 1. LLM proposes candidates: "pick up sponge" (0.35), "go to table" (0.20), "find a towel" (0.25), "open fridge" (0.05), "pick up cup" (0.15).
Step 2. Value function scores feasibility from current state (near counter): "pick up sponge" (0.92 — sponge is visible), "go to table" (0.85), "find a towel" (0.30 — no towel in view), "open fridge" (0.90), "pick up cup" (0.70).
Step 3. Products: sponge = 0.35 × 0.92 = 0.322, table = 0.20 × 0.85 = 0.170, towel = 0.25 × 0.30 = 0.075, fridge = 0.05 × 0.90 = 0.045, cup = 0.15 × 0.70 = 0.105.
Selected: "pick up sponge" (0.322). The robot executes that primitive, then replans. The LLM, now conditioned on "picked up sponge", scores "go to spill" highest.
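The worked example reduces to a dictionary product and an argmax. A minimal sketch using the numbers from Steps 1 and 2 (the function name `saycan_select` is hypothetical, and the scores are hard-coded rather than produced by a real LLM or value function):

```python
def saycan_select(llm_probs: dict, affordances: dict):
    """Return (best_skill, combined_scores) under the product rule."""
    scores = {skill: llm_probs[skill] * affordances[skill] for skill in llm_probs}
    best = max(scores, key=scores.get)
    return best, scores


llm_probs = {"pick up sponge": 0.35, "go to table": 0.20, "find a towel": 0.25,
             "open fridge": 0.05, "pick up cup": 0.15}
affordances = {"pick up sponge": 0.92, "go to table": 0.85, "find a towel": 0.30,
               "open fridge": 0.90, "pick up cup": 0.70}

best, scores = saycan_select(llm_probs, affordances)
print(best, round(scores[best], 3))  # pick up sponge 0.322
```

"Open fridge" is the instructive case: it is highly feasible (0.90) but scores lowest overall because the LLM sees no use for it, which is exactly the filtering the product is meant to provide.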
SayCan scoring: the math in detail
Worked numerical example: SayCan scoring with 3 skills. The robot has 3 available skills: [pick(cup), place(table), pour(cup)]. The instruction is "fill the cup with water."
LLM scores (probability that each skill is the useful next step):
pick(cup) = 0.8, place(table) = 0.1, pour(cup) = 0.1.
Value function scores (probability of successful execution from current state):
pick(cup) = 0.9 (cup is visible and reachable), place(table) = 0.7 (table is clear), pour(cup) = 0.2 (robot is not holding anything — can't pour).
Combined scores ($p_{\text{LLM}} \times V$):
pick(cup) = $0.8 \times 0.9 = 0.72$
place(table) = $0.1 \times 0.7 = 0.07$
pour(cup) = $0.1 \times 0.2 = 0.02$
Selected action: pick(cup) with score 0.72. The LLM wanted to pick the cup, and the value function confirmed the robot can do it.
After picking the cup, the LLM is re-queried with the updated history. Now pour(cup) gets a high LLM score (0.7) and a high value score (0.85, since the robot is now holding the cup near the faucet). The multiplicative scoring automatically sequences the task: first pick (because you can't pour without holding), then pour (because the instruction says "fill").
Inner Monologue — closed-loop language feedback
Huang et al., 2022. SayCan is open-loop with respect to execution outcomes: the LLM conditions on the list of primitives it has already selected, not on whether they succeeded, so if something goes wrong mid-execution, it doesn't know. Inner Monologue closes the loop. After each primitive execution, the robot generates a text description of what it observes (via an image captioner or object detector), and that description is appended to the LLM's context. If the primitive failed ("the sponge was not picked up, it is still on the counter"), the LLM replans.
The feedback sources are heterogeneous: success/failure detectors, scene descriptions from a VLM, human corrections typed into a chat interface. The LLM treats them all as text. This is both the strength (any sensor can contribute) and the weakness (text is a lossy representation of the world state).
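The closed loop can be sketched as a plain Python function in which every outcome is turned into text and appended to the context. Everything here is a stand-in: `plan_next` plays the LLM, `execute` the low-level policy, and `describe` the captioner or success detector; none of these names come from the paper.

```python
def inner_monologue(instruction, plan_next, execute, describe, max_steps=10):
    """Closed-loop planning: each outcome is appended to the LLM context as text."""
    history = [f"Human: {instruction}"]
    for _ in range(max_steps):
        skill = plan_next(history)            # LLM call (stubbed below)
        if skill == "done":
            break
        history.append(f"Robot: {skill}")
        success = execute(skill)              # low-level policy
        history.append(f"Scene: {describe(skill, success)}")
    return history


plan = ["pick up sponge", "wipe counter"]
attempts = {}


def plan_next(history):
    # Stand-in for the LLM: repeat a skill until the scene text reports success.
    done = [s for s in plan if f"Scene: {s} succeeded" in history]
    remaining = [s for s in plan if s not in done]
    return remaining[0] if remaining else "done"


def execute(skill):
    attempts[skill] = attempts.get(skill, 0) + 1
    return not (skill == "pick up sponge" and attempts[skill] == 1)  # first grasp fails


def describe(skill, success):
    return f"{skill} {'succeeded' if success else 'failed'}"


history = inner_monologue("clean up the spill", plan_next, execute, describe)
print("\n".join(history))
```

The retry of "pick up sponge" happens only because the failure text reached the planner; remove the `Scene:` lines and the same loop would move on to wiping with an empty gripper.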
Code as Policies — the policy is the program
Liang et al., 2023. Instead of scoring a fixed set of primitives, the LLM writes Python code that calls perception APIs and motion primitives directly. The "policy" is the generated program. Give the LLM a prompt with API documentation — get_obj_pos("mug"), move_to(x, y, z), grasp() — and a natural-language instruction, and it produces executable code.
Code as Policies — LLM output
# Instruction: "put the red block on top of the blue block"
# Generated by LLM:
red_pos = get_obj_pos("red block")
blue_pos = get_obj_pos("blue block")
move_to(red_pos[0], red_pos[1], red_pos[2] + 0.05)  # approach from above
grasp()
move_to(blue_pos[0], blue_pos[1], blue_pos[2] + 0.08)  # place above blue
release()
The power is combinatorial: the LLM can compose primitives in arbitrary ways, use loops and conditionals, and call perception mid-execution. The fragility is also combinatorial: one wrong coordinate, one hallucinated API name, one off-by-one in a loop, and the robot does something wrong or dangerous. There is no learned recovery — the code either works or it doesn't.
Worked example: Code as Policies for a multi-step task. Instruction: "sort the fruits into the bowls by color: red fruits in the left bowl, green fruits in the right bowl."
The LLM receives the API documentation and generates:
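A sketch of the kind of program an LLM might produce for this instruction. The API is hypothetical and stubbed so the example runs without a robot: `get_obj_names` and `pick_and_place` are invented for illustration, `get_obj_color` is assumed to return a color word, and the scene contents are made up.

```python
# Hypothetical perception/motion API, stubbed for illustration only.
SCENE = {"apple": "red", "lime": "green", "strawberry": "red", "pear": "green"}
BOWLS = {"left bowl": [], "right bowl": []}


def get_obj_names():
    return list(SCENE)


def get_obj_color(name):
    return SCENE[name]


def pick_and_place(obj, target):
    BOWLS[target].append(obj)


# Instruction: "red fruits in the left bowl, green fruits in the right bowl"
# The kind of code the LLM might generate:
for fruit in get_obj_names():
    if get_obj_color(fruit) == "red":
        pick_and_place(fruit, "left bowl")
    else:
        # Anything not labeled "red" lands here, including misdetected colors.
        pick_and_place(fruit, "right bowl")

print(BOWLS)
```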
This program composes perception (color detection, object enumeration), control flow (loops, conditionals), and motion primitives into a behavior that no fixed primitive set could express. The LLM effectively serves as a program synthesizer that translates natural language into executable robot code. The failure mode is also visible: if get_obj_color returns "orange" for a tomato, it goes in the wrong bowl. There is no graceful degradation.
VoxPoser — 3D value maps from language
Huang et al., 2023. A different interface between language and action. Instead of generating code that calls motion primitives, the LLM generates code that writes 3D voxel maps: a value map (where the end-effector should go) and a constraint map (where it should not go). A classical motion planner then optimizes a trajectory through the voxel space.
The key insight: 3D value maps are a natural interface between language-level semantics ("put it on the shelf") and motion-planner-level geometry ("the end-effector should converge to coordinates [0.3, 0.5, 0.8] while avoiding the obstacle at [0.3, 0.3, 0.6]"). The LLM provides the semantics; the voxel map provides the geometry; the planner provides the dynamics.
Worked example: VoxPoser 3D value map. Instruction: "move to the cup." The workspace is a 0.5m cube discretized into a 40×40×40 voxel grid (~1.25cm resolution).
Step 1: LLM generates code. The LLM receives the instruction and a list of detected objects with their 3D positions. It generates Python code that writes a scalar field, indexed [x, y, z]:
value_map[cup_x-5:cup_x+5, cup_y-5:cup_y+5, cup_z:cup_z+10] = 1.0
This creates a "hot zone" of high value (1.0) in the voxels near and above the cup. All other voxels remain at 0.0.
Step 2: LLM generates constraint map. Obstacle avoidance: constraint_map[:, :, table_surface_z-2:table_surface_z+2] = -1.0. Voxels within two cells of the table surface height have negative value (repulsive).
Step 3: Motion planner. A gradient-based planner (MPC or trajectory optimization) finds the end-effector path that maximizes cumulative value while respecting the constraint map. The robot's end-effector follows the gradient of the value map, ascending toward the cup from its current position while avoiding the table surface.
The 3D value map serves as a "potential field" that the motion planner navigates. The LLM never generates a trajectory directly; it generates the landscape, and the planner finds the path.
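The two maps are just NumPy arrays, which makes the interface easy to prototype. A minimal sketch under stated assumptions: arrays are indexed [x, y, z], the cup, table, and end-effector voxel coordinates are made up, and the "planner" is reduced to picking the nearest best-scoring voxel as a goal, which stands in for a real trajectory optimizer.

```python
import numpy as np

GRID = 40                                     # 40 x 40 x 40 voxels
value_map = np.zeros((GRID, GRID, GRID))      # indexed [x, y, z]
constraint_map = np.zeros_like(value_map)

cup_x, cup_y, cup_z = 20, 25, 10              # hypothetical cup voxel
table_surface_z = 8                           # hypothetical table height in voxels

# Attractive region: voxels around and above the cup.
value_map[cup_x - 5:cup_x + 5, cup_y - 5:cup_y + 5, cup_z:cup_z + 10] = 1.0
# Repulsive slab around the table surface.
constraint_map[:, :, table_surface_z - 2:table_surface_z + 2] = -1.0

# Stand-in for the planner: choose the best-scoring voxel closest to the end-effector.
score = value_map + constraint_map
ee = np.array([5, 5, 30])                     # hypothetical end-effector voxel
candidates = np.argwhere(score == score.max())
goal = candidates[np.argmin(np.linalg.norm(candidates - ee, axis=1))]
print(int(value_map.sum()), goal.tolist())    # 1000 [15, 20, 19]
```

A binary map like this has zero gradient almost everywhere, so a gradient-following planner would need the field smoothed (for example with a distance transform or Gaussian blur) before it can make progress from a distant start.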
ReKep — relational keypoint constraints
Huang et al., 2024. The LLM specifies constraints not in voxel space but as relational keypoint constraints: "keypoint A (the cup handle) should be within 2cm of keypoint B (the hook), and keypoint C (the cup bottom) should be above keypoint D (the shelf surface)." A numerical optimizer finds a trajectory satisfying all constraints simultaneously.
In plain English: the LLM says "keep the gripper above the cup rim" and "bring the spout close to the mug opening." The optimizer then finds a smooth arm trajectory that satisfies all of these spatial relationships simultaneously — without the LLM ever touching a joint angle. The LLM writes the rules of the game; the optimizer plays it.
$\tau$ — the robot trajectory. A sequence of end-effector poses over time.
$c_i(\tau)$ — a soft constraint cost. Penalizes violations of relational keypoint constraints specified by the LLM. Example: $\| p_A(\tau_T) - p_B \|_2^2$ (keypoint A should reach keypoint B at the final timestep).
$\mathcal{C}_{\text{hard}}$ — the set of hard constraints: collision avoidance, joint limits, stability.
$\lambda_i$ — priority weights for soft constraints, also specified by the LLM. "The cup must not spill" gets a higher weight than "approach from the left."
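Assembled from the four symbols above, the optimization the text describes can be written as follows (a form consistent with these definitions; the paper's exact formulation may differ):

$$\min_{\tau} \; \sum_i \lambda_i \, c_i(\tau) \quad \text{s.t.} \quad \tau \text{ satisfies every constraint in } \mathcal{C}_{\text{hard}}$$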
What this means for your system: the LLM generates Python cost functions like lambda traj: (traj[-1].ee_pos - hook_pos).norm(), and the optimizer (typically scipy.optimize.minimize or a shooting method) minimizes the weighted sum. Latency is dominated by the optimizer: 50–200ms per replan, which is acceptable for 1–2Hz subgoal planning but too slow for servo-rate control. Keypoint detection accuracy is the primary failure mode — a 2cm localization error on "the cup handle" propagates directly into a 2cm placement error.
ReKep's advantage over VoxPoser: keypoint constraints are more interpretable, easier for the LLM to specify correctly, and more sample-efficient for the optimizer. The cost is that the keypoints must be detected in the scene — which requires a vision model that can localize "the cup handle" and "the hook" from a language description.
CoPa and SpatialVLM — VLM-based spatial planners
The newest wave replaces the LLM + separate perception pipeline with a single VLM that can reason about 3D space directly. CoPa (Huang et al., 2024) uses a VLM to generate manipulation plans by reasoning over object contact points and post-contact trajectories. SpatialVLM (Chen et al., 2024) trains a VLM on spatial reasoning data so it can answer quantitative questions ("how far is the mug from the edge?") and use those answers to parameterize actions. The direction is clear: collapse the LLM + perception stack into a single model that sees and reasons simultaneously.
The comparison
| Method | Interface | Closed-loop? | Spatial reasoning | Failure mode |
| --- | --- | --- | --- | --- |
| SayCan | Score fixed primitive set | Open-loop (per step) | Via value functions only | Missing primitive = stuck |
| Inner Monologue | Score primitives + text feedback | Yes (text observations) | Via captioner | Captioner error → bad replan |
| Code as Policies | Generated Python code | Optional (re-call LLM) | Via perception APIs | Bad code = bad action |
| VoxPoser | 3D voxel value/constraint maps | Replan per phase | Explicit 3D voxel grid | Coarse voxels = imprecise |
| ReKep | Relational keypoint constraints | Replan on failure | Keypoint coordinates | Bad keypoint detection = wrong target |
| CoPa / SpatialVLM | VLM-generated contact plans | Yes (VLM observes) | Native VLM spatial reasoning | VLM hallucination |
The latency problem
Every language-conditioned planning method shares a fundamental timing constraint: LLM inference is slow. A single forward pass through a 7B model takes 500ms–2s, depending on hardware and prompt length. For a 10Hz control loop (100ms per step), you cannot call the LLM at every timestep. The latency budget simply does not fit.
The standard solution is a hierarchical frequency split:
High-level planner (LLM): 0.1–1 Hz. Called once per primitive (every 2–10 seconds). Selects the next subgoal or writes the next code snippet. The 500ms–2s latency is acceptable because the planner is not on the inner control loop.
Low-level policy: 10–50 Hz. Executes the selected primitive reactively. Does not involve the LLM at all. Runs a pre-trained BC or RL policy conditioned on the subgoal.
This is the same System 2 / System 1 split as the VLA architecture (Section 12), but with the boundary drawn at the language-action interface rather than inside a single model. The LLM does high-level reasoning at human decision speed; the policy does motor control at robot servo speed.
Worked example: latency budget for SayCan. The robot must execute "bring me a coke from the fridge."
Decomposition (LLM, called 6 times):
1. "Navigate to fridge" — LLM scores + selects (1.2s). Low-level nav policy executes (8s).
2. "Open fridge door" — LLM scores + selects (1.0s). Low-level policy executes (3s).
3. "Pick up coke can" — LLM scores + selects (1.1s). Low-level policy executes (4s).
4. "Close fridge door" — LLM scores + selects (0.9s). Low-level policy executes (3s).
5. "Navigate to user" — LLM scores + selects (1.0s). Low-level policy executes (6s).
6. "Hand over coke" — LLM scores + selects (0.8s). Low-level policy executes (2s).
Total LLM time: 6 × ~1s = 6s. Total execution time: 26s. Total task time: 32s. The LLM accounts for ~19% of the task time (6s of 32s). If the LLM were instead called at every 10Hz control step during the 26s of execution (260 calls at ~1s each), it would add ~260s of latency, ten times longer than the execution itself.
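The budget is simple enough to recompute whenever the primitive list changes. A short sketch using the six steps listed above (the variable names are illustrative):

```python
steps = [  # (primitive, LLM seconds, execution seconds), as listed above
    ("navigate to fridge", 1.2, 8),
    ("open fridge door", 1.0, 3),
    ("pick up coke can", 1.1, 4),
    ("close fridge door", 0.9, 3),
    ("navigate to user", 1.0, 6),
    ("hand over coke", 0.8, 2),
]

llm_s = sum(llm for _, llm, _ in steps)
exec_s = sum(ex for _, _, ex in steps)
total_s = llm_s + exec_s
llm_share = llm_s / total_s

print(f"LLM {llm_s:.1f}s + execution {exec_s}s = {total_s:.1f}s; LLM share {llm_share:.1%}")
```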
Limitations
Language-conditioned planning is powerful for long-horizon multi-step tasks where the planning horizon exceeds what a single policy can handle. But the limitations are real and should inform when you reach for this approach vs. an end-to-end VLA:
Latency. An LLM call takes 500ms–2s. If you're replanning every primitive (5–10 seconds), this is fine. If you need reactive control at 10Hz, it's not. The two-system VLA split solves this by amortizing the LLM call over many control steps.
Brittleness. One bad code generation, one hallucinated coordinate, one misidentified object, and the entire plan fails. There is no graceful degradation — the failure mode is binary. End-to-end VLAs degrade more smoothly because the policy is a continuous function, not a program.
The "last mile" problem. Language is too coarse for fine manipulation. "Pick up the needle" does not tell the policy how to orient the fingers, how hard to squeeze, or how to compensate for the needle's flex. The low-level policy must handle all of this, and the planner cannot help.
Primitive coverage. SayCan-style methods are limited to the primitives that have been pre-trained. If the task requires a motion the primitive set doesn't cover, the system is stuck. Code-based methods (Code as Policies, VoxPoser) are more flexible but shift the coverage problem to the API surface.
System integration: the full stack in 2026
The convergent architecture uses language-conditioned planning as one layer in a multi-layer system:
| Layer | Frequency | Component | Input | Output |
| --- | --- | --- | --- | --- |
| Task decomposition | 0.1–0.5 Hz | LLM / VLM | Language instruction + scene image | Sequence of subgoals |
| Subgoal planning | 1–2 Hz | VoxPoser / ReKep / Code-as-Policies | Current subgoal + scene | Spatial targets or constraint maps |
| Motor execution | 10–200 Hz | Diffusion Policy / VLA action expert | Spatial target + observation | Joint or EE commands |
Each layer operates at a different timescale and abstraction level. The task decomposition layer runs once per task or per major phase transition. The subgoal planning layer runs once per primitive (every few seconds). The motor execution layer runs at servo rate. Information flows downward (higher layers condition lower layers); feedback flows upward (motor failures and scene changes trigger replanning at higher layers).
When to use what
Use language-conditioned planning when: the task has 5+ sequential stages, requires common-sense reasoning ("the cup goes in the dishwasher, not the trash"), or must generalize to instructions the policy has never seen. Use an end-to-end VLA when: the task is short-horizon, requires precise manipulation, or latency matters. Use both when: the VLM plans at 1Hz and the VLA executes at 50Hz — which is, increasingly, the dominant architecture.
The future: unified VLM planners
The separation between "planning" (LLM-based, discrete, symbolic) and "execution" (policy-based, continuous, learned) is an artifact of the current technology stack. The convergent architecture is already collapsing these layers: Gemini Robotics 1.5 interleaves reasoning tokens with action tokens in a single model. The LLM does not "plan" and then "hand off" — it thinks and acts in the same token stream. The reasoning is causally upstream of the actions, providing the same compositional benefits as a separate planner but without the latency and integration overhead of a two-model system.
This trend — from separate LLM + policy to unified VLM that reasons and acts — is likely to make the SayCan/Code-as-Policies paradigm obsolete within 2–3 years. But the concepts (affordance grounding, spatial value maps, relational constraints) will survive as inductive biases or training objectives for the unified models. Understanding them now is essential for understanding what comes next.
Language-conditioned planning is not a competitor to VLAs. It is a layer above them. The convergent architecture of 2026 uses a VLM for task decomposition, a language-conditioned planner for sequencing, and a VLA or low-level policy for execution. The debate is not "which one" but "where to draw the boundaries."
14 PPO
The locomotion workhorse. The reason simulator-trained quadrupeds walk.
Proximal Policy Optimization (Schulman et al., 2017) is an on-policy actor-critic algorithm that became the dominant RL method for robotics-in-simulation.
The objective
In plain English: imagine you have a room full of robot arms, and each one tried a slightly different approach to the task. Some did well, some did badly. PPO says: for the ones that did well, make that behavior a little more likely next time, but not too much more, or you'll overcommit to a fluke. For the ones that did badly, make that behavior a little less likely, again only up to a limit. The "clipping" is the speed limit on how fast you can change your mind in either direction.
$$ \mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat A_t,\; \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat A_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} $$
$r_t(\theta)$ — the importance ratio. Measures how much more (or less) likely the current policy $\pi_\theta$ is to take action $a_t$ compared to the old policy $\pi_{\theta_\text{old}}$ that actually collected the data. $r_t = 1$ means no change; $r_t = 2$ means the action is now twice as likely.
$\pi_\theta(a_t \mid s_t)$ — the current policy's probability of taking action $a_t$ in state $s_t$. This changes as we update $\theta$.
$\pi_{\theta_\text{old}}(a_t \mid s_t)$ — the old policy's probability (frozen snapshot from before the current update). The data was collected under this policy.
$\hat A_t$ — the estimated advantage of action $a_t$ in state $s_t$. Positive means "better than average," negative means "worse than average." Computed via Generalized Advantage Estimation (GAE), which blends 1-step, 2-step, ..., $n$-step TD errors with exponential weighting $\lambda \approx 0.95$.
$\epsilon$ — the clip range, typically 0.2. Limits the policy ratio to $[0.8, 1.2]$, preventing any single update from changing the policy too dramatically. This is PPO's replacement for TRPO's hard KL constraint.
$\min(\cdot, \cdot)$ — the pessimistic bound. Takes the lower of the clipped and unclipped objective. When the advantage is positive (good action), the objective stops rewarding increases in the action's probability once $r_t > 1+\epsilon$. When the advantage is negative (bad action), it stops rewarding decreases once $r_t < 1-\epsilon$. In the opposite direction, where the ratio has already moved the wrong way, the unclipped term is the smaller one, so the gradient stays active and the mistake can be fully corrected.
In code: ratio = torch.exp(log_prob_new - log_prob_old), then loss = -torch.min(ratio * adv, torch.clamp(ratio, 1-eps, 1+eps) * adv).mean(). Two lines. The ratio is computed in log-space for numerical stability. The advantage adv comes from a backward GAE loop over the trajectory (see the worked example below). In Isaac Gym with 4096 parallel environments, one PPO update takes ~200ms and processes ~65K transitions (e.g., 16 steps per environment).
Three pieces:
Importance ratio $r_t$ — corrects for the fact that the data was collected by the old policy.
Advantage $\hat A_t$ — typically generalized advantage estimation, a $\lambda$-weighted blend of $n$-step TD errors. $\lambda \approx 0.95$ is standard.
Clipping — when $r_t$ exceeds $1 \pm \epsilon$ (typically $\epsilon = 0.2$), the surrogate flattens.
Derivation: the policy gradient theorem
In plain English: try random things, observe what happens, and do more of what worked. The policy gradient says: if an action led to a high reward, nudge the policy to make that action more likely. If it led to a low reward, nudge the other way. The beauty is that you never need to know the physics of the environment — you only need to know which actions you took and how much reward you got.
Goal. Show that $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot Q^{\pi_\theta}(s, a)\right]$.
Step 1. The objective is $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ where $R(\tau) = \sum_t \gamma^t r_t$. The trajectory distribution is $p_\theta(\tau) = \mu(s_0) \prod_t \pi_\theta(a_t \mid s_t) p(s_{t+1} \mid s_t, a_t)$.
Step 2. Take the gradient. The dynamics $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$, so:
$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right] $$
This is the REINFORCE estimator. It has high variance because $R(\tau)$ includes rewards from before and after action $a_t$.
Step 3. Apply the "future-only" simplification: rewards before time $t$ don't depend on $a_t$, so they add zero-mean noise. Replace $R(\tau)$ with $Q^{\pi}(s_t, a_t) = \mathbb{E}[\sum_{k \geq t} \gamma^{k-t} r_k \mid s_t, a_t]$. Subtracting a baseline $V(s_t)$ gives the advantage $A_t = Q(s_t, a_t) - V(s_t)$, which further reduces variance without changing the expectation.
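The advantage in Step 3 is usually estimated with GAE, computed in a backward pass over the trajectory. The following is a minimal sketch in plain Python, assuming a single trajectory with no episode boundary inside it; the function name `compute_gae`, the bootstrap argument `last_value`, and the toy numbers are illustrative and not taken from the text.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (no episode end inside it)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backward so each step reuses the accumulated future TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    # Critic regression targets: advantage plus the value prediction.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Toy trajectory (made-up numbers).
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8]
adv, ret = compute_gae(rewards, values, last_value=0.0)
print(adv, ret)
```

Setting `lam=0` reduces each advantage to the one-step TD error (low variance, more bias); `lam=1` recovers the discounted return minus the baseline (unbiased, higher variance).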
Derivation: why clipping works
The issue with vanilla policy gradient: a single gradient step can change the policy drastically, especially when the advantage is large. TRPO solved this with a hard KL constraint, which requires second-order optimization. PPO replaces this with a simpler mechanism.
The surrogate objective without clipping is $\mathbb{E}[r_t \hat A_t]$ where $r_t = \pi_\theta / \pi_{\theta_{\text{old}}}$. This objective has the same gradient as the true policy gradient at $r_t = 1$ (i.e., when the new policy equals the old), but far from $r_t = 1$ it can be misleading.
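Clipping removes the incentive to move $r_t$ far from 1. When $\hat A_t > 0$ (good action), $\min(r_t \hat A_t, (1+\epsilon)\hat A_t)$ caps the benefit of increasing $r_t$ above $1+\epsilon$. When $\hat A_t < 0$ (bad action), $\min(r_t \hat A_t, (1-\epsilon)\hat A_t)$ caps the benefit of decreasing $r_t$ below $1-\epsilon$: past that point the objective is flat and the gradient is zero. The asymmetry lies elsewhere. If $r_t$ has moved the wrong way ($r_t > 1+\epsilon$ with $\hat A_t < 0$, or $r_t < 1-\epsilon$ with $\hat A_t > 0$), the unclipped term is the minimum, so there is no clip on the corrective side. The policy can always undo a harmful change, but within one update it can neither rush toward good actions nor run arbitrarily far from bad ones.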
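The behavior of the clipped term can be checked numerically. Below is a small sketch, assuming a scalar ratio and advantage; `ppo_clip_term` and the finite-difference helper `slope_wrt_ratio` are illustrative names. A slope of zero means that sample contributes no gradient.

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * adv, clipped * adv)


def slope_wrt_ratio(ratio, adv, h=1e-6):
    """Finite-difference derivative of the clipped term with respect to the ratio."""
    return (ppo_clip_term(ratio + h, adv) - ppo_clip_term(ratio - h, adv)) / (2 * h)


for adv in (+1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        print(f"A={adv:+.0f} r={ratio:.1f} "
              f"term={ppo_clip_term(ratio, adv):+.2f} "
              f"slope={slope_wrt_ratio(ratio, adv):+.2f}")
```

With a positive advantage the slope vanishes once the ratio exceeds 1 + eps; with a negative advantage it vanishes once the ratio drops below 1 - eps. In the two remaining out-of-range cases the slope is still nonzero, which is what lets the optimizer pull back a ratio that has drifted in the harmful direction.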
[Interactive figure: PPO clipping visualization. Legend: unclipped surrogate $r \cdot A$, PPO clipped objective $\mathcal{L}^{\text{CLIP}}$, clip boundary, flat region.]
The full PPO loss adds a value function regression term and an entropy bonus:
Full PPO loss
$$ \mathcal{L} = -\mathcal{L}^{\text{CLIP}} + c_1 \cdot \mathcal{L}^{\text{VF}} - c_2 \cdot \mathcal{H}[\pi_\theta(\cdot \mid s_t)]$$
$-\mathcal{L}^{\text{CLIP}}$ — the negated clipped surrogate. Negated because we minimize the total loss but want to maximize the clipped objective (i.e., increase probability of good actions).
$\mathcal{L}^{\text{VF}} = (V_\theta(s_t) - V_t^{\text{target}})^2$ — the value function loss. Trains the critic to predict expected returns. $V_t^{\text{target}}$ is typically the GAE-based return estimate. $c_1 = 0.5$ is standard.
$\mathcal{H}[\pi_\theta(\cdot \mid s_t)]$ — the entropy of the policy's action distribution at state $s_t$. For a Gaussian policy, $\mathcal{H} = \frac{1}{2}\log(2\pi e \sigma^2)$ per dimension.
$c_2$ — the entropy coefficient, typically 0.0 to 0.01 for continuous control. The $-c_2 \mathcal{H}$ term (note the minus sign) rewards exploration by penalizing overly deterministic policies. Higher $c_2$ = more exploration, slower convergence.
In code: loss = ppo_clip_loss + 0.5 * F.mse_loss(v_pred, v_target) - 0.01 * entropy.mean(), where ppo_clip_loss is the already-negated clipped surrogate. The three terms are computed on the same batch and summed. The GAE advantages are computed in a backward loop over the trajectory before the PPO update, not inside it. Typical training: 4096 parallel environments in Isaac Gym, 24 steps per rollout = ~98K transitions per update, each reused for 4 PPO epochs (~393K sample passes).
Vanilla policy gradient (REINFORCE) is high-variance. TRPO (the predecessor) is correct but uses second-order optimization that's painful to implement at scale. PPO replaces the trust region with a clipped first-order objective that recovers most of the stability benefit at a fraction of the engineering cost.
Where PPO shines
Locomotion in simulation. Isaac Gym, MuJoCo MJX, Brax. With 4096+ parallel environments, billions of timesteps cost an afternoon.
Sim-to-real with domain randomization.
Where PPO loses
Real-world robots. On-policy means each batch of data is discarded after a few epochs of updates, so every improvement has to be paid for with fresh robot time.
Sparse rewards. PPO needs reward signal; without shaping, it doesn't explore well.
15 SAC and the off-policy family
Maximum-entropy reinforcement learning. The default for sample-efficient RL.
Soft Actor-Critic (Haarnoja et al., 2018) is an off-policy actor-critic that adds an entropy bonus to the reward.
The maximum-entropy objective
In plain English: be good at the task, but also keep your options open. A standard RL agent that finds one way to grasp the cup commits to it fully — and breaks when the cup moves 2cm. A max-entropy agent maintains several viable grasping strategies simultaneously. The entropy bonus is the mathematical expression of "don't put all your eggs in one basket."
$$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t \gamma^t \Big(r_t + \alpha\, \mathcal{H}[\pi(\cdot \mid s_t)]\Big)\right] $$
$J(\pi)$ — the objective to maximize. Standard RL maximizes cumulative reward; max-entropy RL adds a bonus for being "random in a useful way."
$\gamma$ — the discount factor, typically 0.99. Weights future rewards: a reward $k$ steps from now is worth $\gamma^k$ as much as an immediate reward. $\gamma = 0.99$ means the agent cares about roughly the next 100 steps.
$r_t$ — the reward received at timestep $t$. Defined by the task (e.g., +1 for grasping the object, -0.01 per step as a time penalty).
$\alpha$ — the temperature parameter. Controls the exploration-exploitation trade-off. High $\alpha$ = more exploration (the agent values entropy nearly as much as reward). Low $\alpha$ = near-greedy (standard RL). Can be learned automatically (see below).
$\mathcal{H}[\pi(\cdot \mid s_t)] = -\mathbb{E}_\pi[\log \pi(a \mid s_t)]$ — the entropy of the policy at state $s_t$. High entropy means the policy spreads probability over many actions (exploratory); low entropy means it concentrates on one action (exploitative). The entropy bonus prevents premature collapse to a deterministic policy.
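To make the objective concrete, here is a toy evaluation of the entropy-augmented return. It is a sketch under simplifying assumptions: a discrete action space with four actions (the article's policies are continuous), a three-step trajectory, and made-up rewards. The names `entropy` and `soft_return` are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_return(rewards, action_probs, gamma=0.99, alpha=0.2):
    """Discounted sum of reward plus alpha-weighted policy entropy at each step."""
    return sum(
        (gamma ** t) * (r + alpha * entropy(p))
        for t, (r, p) in enumerate(zip(rewards, action_probs))
    )


rewards = [0.0, 0.0, 1.0]
uniform = [[0.25] * 4] * 3            # keeps all four actions alive
greedy = [[1.0, 0.0, 0.0, 0.0]] * 3   # fully committed to one action

print(soft_return(rewards, uniform))  # task reward plus an entropy bonus
print(soft_return(rewards, greedy))   # task reward only
```

Both policies collect the same task reward here; the objective simply scores the uniform one higher because of the bonus. In a real task the two terms trade off, and $\alpha$ sets the exchange rate.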
Derivation: the soft Bellman equation
Goal. Show that the optimal soft Q-function satisfies a modified Bellman equation with an entropy term.
In plain English: the robot dreams about what will happen next, considers all possible next actions (not just the single best one), and values states where it has many good options more than states where it has only one. The "soft" Bellman equation is the standard one but with a blurry maximum that keeps multiple strategies alive.
In standard RL, the Bellman equation is $Q^*(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}[\max_{a'} Q^*(s', a')]$. The max-entropy version replaces the hard max with a soft max (log-sum-exp):
$$ Q_{\text{soft}}^*(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\left[\alpha \log \sum_{a'} \exp\big(Q_{\text{soft}}^*(s', a')/\alpha\big)\right] $$
$Q_{\text{soft}}^*(s,a)$ — the optimal soft Q-function. Like the standard Q-function, it represents the value of taking action $a$ in state $s$ and acting optimally thereafter — but "optimally" now includes the entropy bonus.
$r(s,a)$ — the immediate reward for taking action $a$ in state $s$.
$\mathbb{E}_{s'}[\cdot]$ — expectation over the next state $s'$, drawn from the environment dynamics $p(s' \mid s, a)$.
$\alpha \log \sum_{a'} \exp(Q/\alpha)$ — the soft maximum (log-sum-exp). This is a smooth, differentiable approximation to $\max_{a'} Q(s', a')$. When $\alpha \to 0$, it becomes a hard max (standard Bellman). When $\alpha$ is larger, it "softens" the max, encouraging the agent to keep multiple near-optimal actions viable rather than committing to one.
The optimal policy is the Boltzmann distribution: $\pi^*(a \mid s) \propto \exp(Q_{\text{soft}}^*(s,a) / \alpha)$. With this policy, we can write the soft value as $V_{\text{soft}}(s) = \alpha \log \sum_a \exp(Q(s,a)/\alpha) = \mathbb{E}_\pi[Q(s,a) - \alpha \log \pi(a \mid s)]$.
Substituting back gives SAC's practical Bellman target: $y_t = r_t + \gamma \mathbb{E}_{a' \sim \pi}[\min_j \bar Q_j(s', a') - \alpha \log \pi(a' \mid s')]$. The $\min$ over two Q-networks prevents overestimation; the $-\alpha \log \pi$ term is the entropy bonus.
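The two forms of the soft value can be compared numerically. The sketch below assumes three discrete actions with made-up Q-values; `soft_value`, `boltzmann_policy`, and `entropy_form` are illustrative helper names. It evaluates the log-sum-exp form and the $\mathbb{E}_\pi[Q - \alpha \log \pi]$ form side by side for several temperatures.

```python
import math


def soft_value(q_values, alpha):
    """alpha * log(sum(exp(Q / alpha))), shifted by max(Q) for numerical stability."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def boltzmann_policy(q_values, alpha):
    """pi(a) proportional to exp(Q(a) / alpha)."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def entropy_form(q_values, alpha):
    """E_pi[Q - alpha * log pi] under the Boltzmann policy."""
    pi = boltzmann_policy(q_values, alpha)
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(pi, q_values) if p > 0.0)


q = [1.0, 2.0, 2.1]   # made-up Q-values for three discrete actions
for alpha in (1.0, 0.1, 0.001):
    print(alpha, soft_value(q, alpha), entropy_form(q, alpha))
```

The two columns agree at every temperature, and as `alpha` shrinks both approach the hard maximum of the Q-values.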
Derivation: automatic $\alpha$ tuning
The temperature $\alpha$ controls the exploration-exploitation trade-off. Setting it manually is fragile. The automatic tuning formulation solves a constrained optimization: maximize the expected return subject to $\mathcal{H}[\pi(\cdot \mid s_t)] \geq \bar{\mathcal{H}}$ for all $s_t$. By duality, this gives the loss:
$$ \mathcal{L}(\alpha) = \mathbb{E}_{a \sim \pi}\left[-\alpha \left(\log \pi(a \mid s) + \bar{\mathcal{H}}\right)\right] $$
$\alpha$ — the learnable temperature. Treated as an optimization variable with its own loss and learning rate. Updated alongside the actor and critic.
$\log \pi(a \mid s)$ — the log-probability of the sampled action under the current policy. Very negative values mean the policy is spread out (high entropy); large values mean it is concentrated (low entropy). Because $\pi$ is a density over continuous actions, $\log \pi$ can be positive.
$\bar{\mathcal{H}} = -\dim(\mathcal{A})$ — the target entropy. For a 7-DoF action space, $\bar{\mathcal{H}} = -7$. This is a heuristic: $-1$ nat per dimension is the differential entropy of a Gaussian with $\sigma \approx 0.09$ (a unit Gaussian has about $+1.42$ nats per dimension), so the target describes a fairly concentrated policy. If the policy's entropy drops below this, $\alpha$ increases to encourage more exploration.
$-\alpha \cdot (\log \pi + \bar{\mathcal{H}})$ — the loss pushes $\alpha$ up when $\log \pi + \bar{\mathcal{H}} > 0$ (entropy below target, policy too deterministic) and down when $\log \pi + \bar{\mathcal{H}} < 0$ (entropy above target, policy too stochastic). At equilibrium, $\mathbb{E}[\log \pi] = -\bar{\mathcal{H}}$, i.e. the policy entropy equals the target.
When the policy is too deterministic ($\mathcal{H} < \bar{\mathcal{H}}$, meaning $\log \pi$ is large), the loss drives $\alpha$ up, increasing the entropy bonus. When the policy is too stochastic, $\alpha$ decreases. The target entropy $-\dim(\mathcal{A})$ is heuristic but works: it corresponds to a uniform distribution of width $1/e \approx 0.37$ in each action dimension, a small fraction of the $[-1, 1]$ action range.
What this means for your system: you never have to hand-tune the exploration rate. Set target_entropy = -action_dim and let $\alpha$ adjust itself. In practice, $\alpha$ starts high (0.2–0.5) when the policy is random at initialization, drops as the policy improves, and stabilizes around 0.01–0.05 in late training. If $\alpha$ stays high throughout training, the reward signal is too weak and the policy cannot find a good exploitation strategy.
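The direction of the temperature update can be sketched in a few lines. This is a simplified illustration, assuming plain gradient descent directly on $\alpha$ with a made-up learning rate; common implementations optimize $\log \alpha$ instead so that $\alpha$ stays positive. The helper name `alpha_step` and the log-probability values are hypothetical.

```python
def alpha_step(alpha, mean_log_pi, target_entropy, lr=0.01):
    """One gradient-descent step on L(alpha) = -alpha * (log_pi + target_entropy)."""
    grad = -(mean_log_pi + target_entropy)   # dL/dalpha
    return max(alpha - lr * grad, 1e-6)      # crude clamp to keep alpha positive


action_dim = 7
target_entropy = -float(action_dim)

# Entropy is -E[log pi], so an entropy of -10 corresponds to E[log pi] = +10.
too_deterministic = alpha_step(0.15, mean_log_pi=10.0, target_entropy=target_entropy)
too_stochastic = alpha_step(0.15, mean_log_pi=4.0, target_entropy=target_entropy)
at_target = alpha_step(0.15, mean_log_pi=7.0, target_entropy=target_entropy)
print(too_deterministic, too_stochastic, at_target)
```

The temperature rises when the policy's entropy is below the target, falls when it is above, and stays put when the two match.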
Twin Q and target networks
Two stabilizers, the first borrowed from TD3 and the second inherited from DQN and DDPG:
Twin Q-networks with the $\min$ operator combat overestimation bias in the Q-learning target.
Target networks updated as a slow EMA of the online networks ($\tau \approx 0.005$) stabilize the bootstrap target.
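The slow EMA update of the target networks is a one-liner per parameter. A minimal sketch, assuming parameters are stored as flat lists of floats (a real implementation would loop over network tensors); `polyak_update` is an illustrative name.

```python
def polyak_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


target = [0.0, 0.0]
online = [1.0, -2.0]   # held fixed here; in training these keep changing
for _ in range(200):
    target = polyak_update(target, online)
print(target)
```

With $\tau = 0.005$, the target closes about 63% of the gap to a fixed online network after $1/\tau = 200$ updates, which is why the bootstrap target moves slowly relative to the critic being trained.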
Why a single Q-network overestimates
The Bellman target for Q-learning is $y = r + \gamma \max_{a'} Q(s', a')$. The $\max$ is the problem. Suppose the true Q-values for four actions in state $s'$ are all exactly 5.0, but the network's estimates have zero-mean noise: $\hat{Q} = [4.7, 5.3, 4.9, 5.1]$. The true max is 5.0, but $\max(\hat{Q}) = 5.3$. This is Jensen's inequality applied to the max operator: $\mathbb{E}[\max_a \hat{Q}(s',a)] \geq \max_a \mathbb{E}[\hat{Q}(s',a)]$. The max of noisy estimates is a biased-upward estimate of the max of the true values.
This is not a one-time error. The overestimated target is used to update the Q-network, which produces even more overestimated values at the next step, which produces an even more biased target. The feedback loop compounds: $Q$ values drift upward, the policy chases phantom high-Q actions that don't actually lead to high reward, and the whole system diverges. This is the overestimation spiral that killed early Q-learning methods (DDPG was notoriously unstable for this reason).
The twin-Q fix
Train two Q-networks $Q_{\phi_1}$ and $Q_{\phi_2}$ with independent initializations. In the standard implementation both are regressed onto the same target using the same mini-batches, so their errors differ mainly because of initialization. Use the minimum for the Bellman target:
Twin-Q target
$$ y = r + \gamma \left( \min(Q_{\bar\phi_1}(s', a'), Q_{\bar\phi_2}(s', a')) - \alpha \log \pi(a' \mid s') \right), \quad a' \sim \pi(\cdot \mid s') $$
Why does $\min$ help? Each network has independent estimation error. The probability that both networks simultaneously overestimate the same action is much lower than the probability that one does. Taking the minimum is conservative: it slightly underestimates on average (the symmetric counterpart of the overestimation bias). But underestimation is benign — the policy becomes slightly cautious, which is safe. Overestimation is catastrophic — the policy becomes delusional, which is not.
Worked example: twin-Q vs. single-Q overestimation. True Q-values for 3 actions in state $s'$: $[5.0, 5.0, 5.0]$. Network noise: $\sigma = 0.5$.
Single Q-network: estimates $\hat{Q}_1 = [4.6, 5.4, 5.1]$. Target uses $\max = 5.4$. Overestimation: $+0.4$.
Twin Q-networks: $\hat{Q}_1 = [4.6, 5.4, 5.1]$, $\hat{Q}_2 = [5.3, 4.8, 5.2]$. For the actor's sampled action $a'$ (say $a_2$): $\min(5.4, 4.8) = 4.8$. Underestimation: $-0.2$. The policy learns a slightly conservative value, but does not spiral.
Over 100k updates, single-Q drifts $Q$-values to $\sim 50$ (when the true max is 5). Twin-Q stays within $[4.5, 5.5]$. This is the difference between a training run that converges and one that diverges.
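The size of the two biases can be estimated with a quick Monte Carlo experiment. The sketch below makes a strong simplifying assumption: each critic's error is independent zero-mean Gaussian noise with $\sigma = 0.5$ around a true value of 5.0, as in the example above. Real critic errors are correlated, so treat the numbers as an idealization. `simulate_bias` is an illustrative name.

```python
import random


def simulate_bias(n_actions=3, true_q=5.0, sigma=0.5, trials=20000, seed=0):
    """Average error of a max-over-actions target vs. a min-over-two-critics target."""
    rng = random.Random(seed)
    max_err = 0.0   # single critic, max over actions (Q-learning style target)
    min_err = 0.0   # two critics, min at one sampled action (SAC style target)
    for _ in range(trials):
        q1 = [rng.gauss(true_q, sigma) for _ in range(n_actions)]
        q2 = [rng.gauss(true_q, sigma) for _ in range(n_actions)]
        max_err += max(q1) - true_q
        min_err += min(q1[0], q2[0]) - true_q
    return max_err / trials, min_err / trials


single_bias, twin_bias = simulate_bias()
print(f"single-Q max bias: {single_bias:+.3f}, twin-Q min bias: {twin_bias:+.3f}")
```

Under these assumptions the max-based target is biased upward and the min-based target is biased downward by a somewhat smaller amount; the first error feeds on itself through bootstrapping, the second does not.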
Worked example: $\alpha$ auto-tuning dynamics. 7-DOF arm, so target entropy $\bar{\mathcal{H}} = -\dim(\mathcal{A}) = -7$. Current $\alpha = 0.15$.
Scenario 1 — policy too deterministic. Current entropy $\mathcal{H}(\pi) \approx -10$, below the target (the policy concentrates on a narrow action range). Since entropy is $-\mathbb{E}[\log \pi]$, the average log-probability of sampled actions is $\mathbb{E}[\log \pi(a \mid s)] \approx +10$.
The $\alpha$ loss: $\mathcal{L}(\alpha) = -\alpha (\log \pi + \bar{\mathcal{H}})$. The inner term is $\log \pi + \bar{\mathcal{H}} = 10 + (-7) = 3$. Since $3 > 0$, the gradient is $\partial \mathcal{L}/\partial \alpha = -3$. Minimizing: $\alpha \leftarrow \alpha - \eta \times (-3) = \alpha + 3\eta$. $\alpha$ increases. Higher $\alpha$ means stronger entropy bonus, pushing the policy to explore more. Correct: the policy was too deterministic.
Scenario 2 — policy too stochastic. Current entropy $\mathcal{H}(\pi) \approx -4$, above the target, so $\mathbb{E}[\log \pi] \approx +4$. Inner term: $4 - 7 = -3 < 0$. Gradient: $-(-3) = +3$. $\alpha \leftarrow \alpha - 3\eta$. $\alpha$ decreases. Less entropy bonus, policy sharpens. Correct.
At equilibrium: $\mathbb{E}[\log \pi] = -\bar{\mathcal{H}} = 7$, i.e. $\mathcal{H}(\pi) = -7$. That is $-1$ nat per dimension, roughly a Gaussian with $\sigma \approx 0.09$ in each of the 7 dimensions.
SAC critic loss
$$ \mathcal{L}_Q(\phi_i) = \mathbb{E}\Big[\big(Q_{\phi_i}(s_t, a_t) - y_t\big)^2\Big], \qquad y_t = r_t + \gamma \Big(\min_j \bar Q_{\phi_j}(s_{t+1}, a') - \alpha \log \pi(a' \mid s_{t+1})\Big), \quad a' \sim \pi(\cdot \mid s_{t+1}) $$
$Q_{\phi_i}(s_t, a_t)$ — the $i$-th Q-network's prediction of the value of action $a_t$ in state $s_t$. SAC trains two Q-networks ($i \in \{1, 2\}$) to combat overestimation.
$y_t$ — the Bellman target. What the Q-value "should be" according to the one-step bootstrap: the immediate reward plus the discounted future value.
$\min_j \bar Q_{\phi_j}(s_{t+1}, a')$ — the pessimistic target Q-value. Takes the minimum of the two target Q-networks (slow EMA copies, denoted by the bar) to prevent optimistic extrapolation.
$-\alpha \log \pi(a' \mid s_{t+1})$ — the entropy bonus in the target. Actions with lower probability (more exploratory) get a bonus, baked into the Q-target itself.
$a' \sim \pi$ — next action sampled from the current policy. This is what makes SAC off-policy: the data $(s_t, a_t, r_t, s_{t+1})$ can be old, but $a'$ is always fresh from the latest policy.
SAC actor loss
$$ \mathcal{L}_\pi(\theta) = \mathbb{E}\Big[ \alpha \log \pi_\theta(a \mid s) - \min_j Q_{\phi_j}(s, a) \Big]$$
$\alpha \log \pi_\theta(a \mid s)$ — the entropy penalty. Pushes the policy to stay stochastic by penalizing high-probability (low-entropy) actions. Without this term, the actor would collapse to a deterministic greedy policy.
$-\min_j Q_{\phi_j}(s, a)$ — the negated Q-value (negated because we minimize the loss, so maximizing Q-value means minimizing its negation). Pushes the policy toward high-value actions.
$a$ is sampled via the reparameterization trick: $a = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \epsilon)$ with $\epsilon \sim \mathcal{N}(0, I)$. The $\tanh$ squashes actions to $[-1, 1]$; this requires a log-det-Jacobian correction in $\log \pi$.
Worked example: SAC actor update. State $s$, action dimension $d = 7$. The policy $\pi_\theta$ is a squashed Gaussian: sample $\epsilon \sim \mathcal{N}(0, I)$, compute $a_{\text{raw}} = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon$, then $a = \tanh(a_{\text{raw}})$.
Suppose $\mu_\theta(s) = [0.3, -0.1, 0.5, 0.0, 0.2, -0.4, 0.1]$, $\sigma_\theta(s) = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]$, $\epsilon = [0.5, -0.3, 0.8, 0.1, -0.6, 0.2, -0.4]$.
$a_{\text{raw}} = [0.4, -0.16, 0.66, 0.02, 0.08, -0.36, 0.02]$. $a = \tanh(a_{\text{raw}}) = [0.380, -0.159, 0.578, 0.020, 0.080, -0.345, 0.020]$.
$\log \pi$: sum of log-Gaussian density of $a_{\text{raw}}$ minus the tanh correction $\sum_i \log(1 - \tanh^2(a_{\text{raw},i}))$, which equals $\sum_i \log(1 - a_i^2)$.
$\min(Q_1(s, a), Q_2(s, a)) = 3.45$. Current $\alpha = 0.2$.
Actor loss = $\alpha \cdot \log \pi(a \mid s) - 3.45$. The gradient pushes the policy toward actions with high Q-values while maintaining entropy (the $\alpha \log \pi$ term penalizes overly deterministic policies).
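The worked example above can be evaluated end to end. This sketch uses the same $\mu$, $\sigma$, $\epsilon$, $\alpha$, and min-Q values as the text and plain Python instead of an autodiff framework, so it computes the loss value only, not its gradient. Variable names are illustrative.

```python
import math

mu = [0.3, -0.1, 0.5, 0.0, 0.2, -0.4, 0.1]
sigma = [0.2] * 7
eps = [0.5, -0.3, 0.8, 0.1, -0.6, 0.2, -0.4]
alpha, min_q = 0.2, 3.45

# Reparameterized sample, then squash to [-1, 1].
a_raw = [m + s * e for m, s, e in zip(mu, sigma, eps)]
a = [math.tanh(x) for x in a_raw]

# Log-density of a_raw under the diagonal Gaussian; (a_raw - mu) / sigma equals eps.
log_gauss = sum(
    -0.5 * e * e - math.log(s) - 0.5 * math.log(2.0 * math.pi)
    for s, e in zip(sigma, eps)
)
# Change of variables for a = tanh(a_raw): subtract log|da/da_raw| per dimension.
log_det = sum(math.log(1.0 - math.tanh(x) ** 2) for x in a_raw)
log_pi = log_gauss - log_det

actor_loss = alpha * log_pi - min_q
print([round(x, 3) for x in a], round(log_pi, 3), round(actor_loss, 3))
```

Note that $\log \pi$ comes out positive here (roughly 4.8): with $\sigma = 0.2$ the density is concentrated enough to exceed 1, which is normal for continuous action distributions.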
If you have ten thousand environment steps to spend, SAC is your default. If you have one thousand, you want REDQ or DroQ. If you have a hundred, you want offline pretraining first.
16 Sim-to-real
The reality gap is the central engineering problem of pure-RL robotics.
Training in simulation is fast, free, and produces policies that fail spectacularly when deployed on a real robot — unless you do specific things to close the gap. The interventions cluster into four categories: domain randomization, system identification, real-to-sim, and online adaptation.
Domain randomization
The dominant technique. At each simulator reset, sample physical and visual parameters from a wide distribution: friction, mass, motor gains, latency, observation noise, lighting, textures, camera pose, gravity. The policy is forced to learn a control law robust across the distribution; the real world is treated as one more sample from it.
Three regimes:
Static randomization. Fixed ranges, sampled once per episode. Simple, works for many tasks.
Adversarial randomization. Sample parameters that the policy currently fails on. Faster to converge, requires more infrastructure.
Automatic Domain Randomization (ADR). Start narrow, widen the range when success rate exceeds a threshold. Introduced in OpenAI's Rubik's cube paper. Gives a curriculum for free.
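The "start narrow, widen on success" idea behind ADR can be sketched as a small range object. This is a deliberate simplification and not the published algorithm: the thresholds, step size, the narrowing rule, and the class name `AdrRange` are all assumptions made for illustration, and the success rates are made up.

```python
import random


class AdrRange:
    """One randomized parameter whose range widens as the policy succeeds."""

    def __init__(self, nominal, step, max_half_width):
        self.nominal = nominal
        self.step = step
        self.max_half_width = max_half_width
        self.half_width = 0.0   # start narrow: no randomization at all

    def sample(self, rng):
        """Draw the parameter for one episode from the current range."""
        return rng.uniform(self.nominal - self.half_width,
                           self.nominal + self.half_width)

    def update(self, success_rate, widen_above=0.8, narrow_below=0.4):
        """Widen after good evaluations, narrow after bad ones (assumed thresholds)."""
        if success_rate > widen_above:
            self.half_width = min(self.half_width + self.step, self.max_half_width)
        elif success_rate < narrow_below:
            self.half_width = max(self.half_width - self.step, 0.0)


rng = random.Random(0)
friction = AdrRange(nominal=0.8, step=0.05, max_half_width=0.5)
for success_rate in (0.9, 0.95, 0.85, 0.3, 0.9):   # made-up evaluation results
    friction.update(success_rate)
print(friction.half_width, friction.sample(rng))
```

Static randomization is the special case where `half_width` is fixed from the start and `update` is never called.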
The randomization that matters
Not all parameters are equal. Empirical priorities, in rough order:
Motor / actuator dynamics. Latency, PD gains, torque limits, deadbands.
Mass and inertia. Especially for objects being manipulated.
Friction. Both ground and contact.
Observation noise and latency. A policy trained on perfect proprioception fails on a real robot with 5ms IMU latency and quantization.
Visuals. For pixel-based policies, lighting and texture randomization are mandatory. Include random camera FOV, exposure, and white balance.
Why domain randomization works: the robustness argument. Let $\xi$ denote the physical parameters (friction, mass, latency, etc.) and let $\pi(a \mid o; \xi)$ be the optimal policy for a specific parameter setting. Domain randomization trains a single policy $\pi_\theta(a \mid o)$ that minimizes the expected loss over the parameter distribution:
$$ \min_\theta \mathbb{E}_{\xi \sim p(\xi)}\left[\mathcal{L}(\pi_\theta, \xi)\right] $$
The resulting policy is not optimal for any single $\xi$, but it performs well on average across the distribution. If the real world $\xi_{\text{real}}$ falls within the support of $p(\xi)$, the policy has trained on conditions like it. Note that this is an average-case objective, not a worst-case one: true distributionally robust optimization would replace the expectation with a maximum over $\xi$ (or over a set of distributions), which is closer to what adversarial randomization does.
The key engineering insight: wider randomization buys robustness at the cost of peak performance. A policy trained on friction $\mu \in [0.5, 0.6]$ outperforms one trained on $\mu \in [0.1, 1.5]$ when the real friction is $\mu = 0.55$. But the narrow policy fails catastrophically at $\mu = 0.3$, while the wide policy degrades gracefully. Since you rarely know the real parameters precisely, wider is usually the right bet.
System identification: the alternative to DR
Instead of training over a wide distribution, measure the real-world parameters and set the simulator to match. Drop a calibration object on the table, record the bounce, fit the coefficient of restitution. Run a motor sweep, fit the actuator model. System identification is cheaper than DR when it works — one simulation with accurate parameters is better than ten thousand with random ones. But it doesn't cover unknown unknowns: the cable that catches on the table edge, the motor backlash that varies with temperature, the slight camera misalignment. DR covers these because they fall somewhere in the randomized range even if you never modeled them explicitly.
In practice, the two techniques combine. System identification narrows the randomization ranges to a plausible neighborhood. DR widens them enough to absorb the residual uncertainty. The hybrid usually beats either alone.
A practical example: for a Franka arm grasping task, you can system-identify the PD gains by running a step-response test (command a joint to move to a target and measure the rise time and overshoot). Suppose this pins the nominal $K_p$ to 620 and $K_d$ to 48. Then you randomize $K_p \in [527, 713]$ and $K_d \in [41, 55]$, a $\pm$15% range centered on the measured values, rather than the $\pm$30% range you would need without system identification. Narrower ranges mean the policy can learn a tighter, more precise control law while still being robust to the remaining uncertainty.
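Computing the exact bounds for a range centered on measured gains is a one-line helper. The sketch below uses the nominal gains from the example; `centered_range` is an illustrative name.

```python
def centered_range(nominal, fraction):
    """Symmetric randomization range of +/- fraction around a measured nominal."""
    return (nominal * (1.0 - fraction), nominal * (1.0 + fraction))


kp_measured, kd_measured = 620.0, 48.0
for name, nominal in (("Kp", kp_measured), ("Kd", kd_measured)):
    lo15, hi15 = centered_range(nominal, 0.15)
    lo30, hi30 = centered_range(nominal, 0.30)
    print(f"{name}: +/-15% -> [{lo15:.1f}, {hi15:.1f}]   "
          f"+/-30% -> [{lo30:.1f}, {hi30:.1f}]")
```

Keeping the ranges as a function of the measured nominal means a new step-response test updates every bound at once, instead of leaving stale hand-typed numbers in the config.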
RMA: Rapid Motor Adaptation
RMA (Kumar et al., 2021) is the bridge between system identification and domain randomization. The idea: train two modules in simulation.
Privileged policy. Train a policy $\pi(a \mid o, e)$ where $e$ is a vector of ground-truth environment parameters (friction, mass, terrain slope, motor strength, etc.) that the simulator provides but the real world doesn't. The policy learns to adapt its behavior to these parameters — walk differently on ice vs. gravel, push harder when the object is heavy.
Adaptation module. Train a separate network $\hat{e} = g_\phi(o_{t-H:t})$ that infers $e$ from a short history ($H \approx 50$ timesteps) of proprioceptive observations (joint positions, velocities, torques). This module learns to identify the environment from its effects: if the robot slips, friction is low; if the robot is slow to respond, there is latency.
At deployment, the privileged parameters $e$ are unavailable. The adaptation module fills in: $\pi(a \mid o, g_\phi(o_{t-H:t}))$. Within 50 timesteps (~0.5s at 100Hz), the adaptation module converges on an estimate of the real-world parameters, and the policy adapts accordingly. This is why RMA-trained quadrupeds can walk on grass, gravel, and stairs without retraining — the adaptation module identifies the terrain in real time.
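The deployment-time data flow can be sketched with a fixed-length history buffer. Everything learned is replaced by a stand-in here: `adaptation_module` and `policy` are toy functions rather than trained networks, and the one-line "dynamics" is not a robot model. The point is only the plumbing: keep the last $H$ observations, re-estimate $\hat{e}$ from them, and condition the policy on that estimate.

```python
from collections import deque

H = 50  # history length in control steps, as in the text


def adaptation_module(history):
    """Stand-in for g_phi: here just the mean of each observation channel."""
    n = len(history)
    return [sum(obs[i] for obs in history) / n for i in range(len(history[0]))]


def policy(obs, e_hat):
    """Stand-in for pi(a | o, e_hat): a toy proportional law scaled by the estimate."""
    return [-(1.0 + abs(e)) * o for o, e in zip(obs, e_hat)]


history = deque(maxlen=H)   # the oldest observation drops out automatically
obs = [0.1, -0.2]           # made-up proprioceptive reading
for step in range(120):
    history.append(list(obs))
    e_hat = adaptation_module(history)   # re-estimated from recent history
    action = policy(obs, e_hat)
    obs = [o + 0.1 * a for o, a in zip(obs, action)]   # toy dynamics

print(len(history), e_hat, action)
```

Until the buffer fills, the estimate is based on fewer than $H$ observations, which matches the short convergence window described above.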
The sim-to-real gap checklist
When your sim-trained policy fails on the real robot, the cause is almost always one of these five. Diagnose in this order:
Contact dynamics. Simulators approximate contacts with penalty forces or LCP solvers. Neither models deformation, stiction, or surface micro-texture. Symptom: the policy grasps perfectly in sim, drops objects in real. Fix: increase contact friction randomization; add stochastic slip events; test with softer and harder objects than the sim default.
Observation gap. Sim images look different from real images — different lighting, textures, no reflections, no motion blur. Symptom: a pixel-based policy that works in sim but freezes or oscillates in real. Fix: visual domain randomization (random textures, lighting, camera noise); or use a frozen foundation-model encoder (DINOv2, CLIP) that generalizes across the gap.
Action delay. Real motors have latency (5–25ms between command and execution). Communication buses add more. If the sim policy assumes instant execution, the real robot overshoots. Symptom: jerky, oscillatory motion. Fix: randomize action delay in sim (uniform in [0, 3] control steps); add action history to the observation.
State estimation error. Sim knows exact joint positions. Real robots have encoder quantization, drift, and cable routing that adds compliance. Symptom: gradual drift or occasional large errors. Fix: add Gaussian noise + bias to joint observations in sim; include IMU latency.
Unmodeled dynamics. Cables, hoses, table compliance, air resistance on large objects. Sim ignores them; the real world does not. Symptom: the policy fails on a task that should be easy. Fix: widen DR ranges; add random external forces; consider a real-to-sim calibration step.
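Of the fixes in the checklist, randomized action delay is the easiest to get subtly wrong, so here is a minimal sketch of it. It assumes scalar actions and a delay drawn once per episode from 0 to 3 control steps; the class name `DelayedActions` and the `hold_action` argument are illustrative.

```python
import random
from collections import deque


class DelayedActions:
    """Applies each commanded action a fixed number of control steps late."""

    def __init__(self, delay_steps, hold_action):
        # Pre-fill so the first `delay_steps` outputs are the hold action.
        self.queue = deque([hold_action] * delay_steps)

    def step(self, commanded):
        """Queue the new command and return the one that is due now."""
        self.queue.append(commanded)
        return self.queue.popleft()


rng = random.Random(0)
delay = rng.randint(0, 3)            # resampled at every episode reset
actuator = DelayedActions(delay, hold_action=0.0)
applied = [actuator.step(cmd) for cmd in (1.0, 2.0, 3.0, 4.0, 5.0)]
print(delay, applied)
```

In a simulator loop the wrapper sits between the policy output and the physics step; feeding the last few commanded actions back into the observation gives the policy what it needs to infer the delay.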
Domain randomization config — Isaac Lab
# Domain randomization for manipulation — Isaac Lab (NVIDIA)
# Applied at each environment reset
domain_rand = {
    # Physics parameters
    "friction_range": [0.5, 1.5],          # unitless, nominal ~0.8
    "object_mass_range": [0.05, 2.0],      # kg
    "object_com_offset": [-0.02, 0.02],    # m, center-of-mass shift
    "gravity_noise": [-0.5, 0.5],          # m/s^2, added to 9.81
    "restitution_range": [0.0, 0.3],       # bounciness
    # Actuator parameters
    "motor_strength_range": [0.7, 1.3],    # fraction of nominal torque
    "kp_range": [420, 780],                # PD position gain (nom 600)
    "kd_range": [35, 65],                  # PD velocity gain (nom 50)
    "action_delay_steps": [0, 3],          # control steps of latency
    # Observation noise
    "joint_pos_noise": 0.005,              # rad, Gaussian std
    "joint_vel_noise": 0.1,                # rad/s
    "proprio_latency_steps": [0, 2],       # steps of observation delay
    # Visual randomization (pixel-based policies)
    "camera_pos_noise": 0.02,              # m, Gaussian std per axis
    "camera_rot_noise": 5.0,               # degrees
    "light_color_temp": [3000, 6500],      # Kelvin
    "light_intensity_range": [0.5, 2.0],   # multiplier
    "texture_randomize": True,             # random colors/patterns
    "table_color_hsv": [(0,0,0.2), (360,0.3,0.8)],
}
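A config like the one above still needs something to draw from it at reset time. The following is a simulator-agnostic sketch of such a sampler, not Isaac Lab's own API: it copies a subset of the keys, treats two-element lists as uniform ranges and scalars as Gaussian standard deviations, and the function names are hypothetical.

```python
import random

# A subset of the config above; lists are [low, high], scalars are Gaussian std.
domain_rand = {
    "friction_range": [0.5, 1.5],
    "object_mass_range": [0.05, 2.0],
    "kp_range": [420, 780],
    "action_delay_steps": [0, 3],
    "joint_pos_noise": 0.005,
}


def sample_episode_params(cfg, rng):
    """Draw one set of physical parameters at environment reset."""
    params = {}
    for key, value in cfg.items():
        if isinstance(value, list):
            low, high = value
            if key.endswith("_steps"):
                params[key] = rng.randint(low, high)   # integer step counts
            else:
                params[key] = rng.uniform(low, high)
        else:
            params[key] = value                        # std, applied every step
    return params


def noisy_joint_pos(q, std, rng):
    """Per-step Gaussian observation noise on joint positions."""
    return [qi + rng.gauss(0.0, std) for qi in q]


rng = random.Random(0)
params = sample_episode_params(domain_rand, rng)
print(params)
print(noisy_joint_pos([0.0, 0.5, -0.5], params["joint_pos_noise"], rng))
```

The split matters: physical parameters are drawn once per episode so the policy sees a consistent world, while observation noise is redrawn at every step.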
Real-to-sim and digital twins
Build a simulator that matches your specific real environment — calibrated geometry, measured friction, characterized actuators. Useful when one specific deployment matters more than generality. Less helpful when the goal is broad generalization.
Online adaptation
Fine-tune in the real world after sim training. Sometimes via RL (slow, dangerous), sometimes via fast supervised correction signals (preferred). The unifying lesson: sim training gets you to 80%, and the last 20% is real data.
Worked example: domain randomization ranges for a Franka arm. A typical sim-to-real setup for a Franka Panda manipulator randomizes:
Motor parameters: PD gains ±30% from nominal (K_p = 600 ± 180, K_d = 50 ± 15). Actuation latency: uniform in [5ms, 25ms]. Torque limits: 87 Nm ± 10% (the Panda's four proximal joints; the three wrist joints are rated at 12 Nm).
Physics: Object mass: 0.1–2.0 kg (vs nominal 0.5 kg). Friction: μ = 0.3–1.2 (table), 0.5–1.5 (object). Gravity: 9.81 ± 0.5 m/s².
Observation noise: Joint position: ±0.005 rad. Joint velocity: ±0.1 rad/s. Camera: ±4 pixel shift, ±5° rotation, brightness ±30%.
The real robot is one sample from this distribution. If the ranges are too narrow, the policy fails on the real robot. If too wide, the policy is too conservative. ADR (automatic domain randomization) tunes these ranges automatically during training.
Domain randomization ranges: what, how much, and why
Not all DR parameters are created equal. Some dominate transfer performance; others are noise. The following table consolidates empirical ranges from published sim-to-real work (OpenAI Rubik's cube, Legged Gym, HORA dexterous, IsaacGym manipulation). Each range is chosen to bracket the plausible real-world variation with margin:
| Parameter | Randomization range | Nominal value | Why this range |
|---|---|---|---|
| Surface friction ($\mu$) | 0.3–1.5 | ~0.7 | Real surfaces range from smooth laminate (~0.3) to high-grip rubber (~1.5). Under-randomizing friction is the #1 cause of sim-trained grasps that slip in the real world. |
| Object mass | ±50% of nominal | Task-dependent | A "500g object" might actually weigh 350g (hollow) or 750g (wet, or different material). Mass affects both inertia during transport and required grip force. |
| Camera pose (translation) | ±3 cm per axis | Calibrated position | Camera mounts shift due to vibration, bumps during operation, or imprecise installation. A 3 cm error in extrinsics can shift pixel coordinates by 50+ pixels at working distance. |
| Camera pose (rotation) | ±5° per axis | Calibrated orientation | Mounting imprecision and bracket flex. Even 2° of tilt shifts object centroids in the image by 10–20 pixels, enough to throw off a pixel-based policy. |
| Lighting color temperature | 3000–7000 K | ~5000 K (daylight) | Real environments range from warm tungsten (2700 K) to cool fluorescent (6500 K) to daylight. Policies trained on one lighting fail under another if the vision encoder is not robust. |
| Lighting intensity | ±40% of nominal | Task-dependent | Overhead lights turn on/off, windows let in variable sunlight, shadows from people passing by. Intensity shifts change image brightness, contrast, and shadow patterns. |
| Action delay | 0–3 control steps | 1 step | Communication bus latency (USB: 1–5 ms, EtherCAT: <1 ms), motor controller delays, OS scheduling jitter. At 50 Hz control, 3 steps = 60 ms, enough to overshoot by several mm. |
| Joint damping | ±30% of nominal | Motor-specific | Damping varies with temperature (cold motors are stiffer), wear (lubricant degradation), and cable routing (added compliance). Under-damped joints oscillate; over-damped joints are sluggish. |
| Joint position noise | ±0.005 rad (Gaussian σ) | 0 | Encoder quantization on a 13-bit encoder is ~0.0008 rad, but cable routing, gear backlash, and flex add 3–5× more effective noise. |
| Center-of-mass offset | ±2 cm per axis | Geometric center | Real objects have non-uniform density (a mug's handle shifts its CoM). A 2 cm CoM error changes the torque required to hold the object stable by ~20%. |
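A minimal sketch of how ranges like these might be consumed: one independent draw per episode, uniform within each range. The dictionary keys and the function name are hypothetical, not part of any simulator API, and the scale factors simply restate the "± percent of nominal" rows as multipliers.

```python
import random

# Ranges copied from the table above; key names are illustrative assumptions.
DR_RANGES = {
    "friction": (0.3, 1.5),
    "mass_scale": (0.5, 1.5),               # +/-50% of nominal
    "light_intensity_scale": (0.6, 1.4),    # +/-40% of nominal
    "joint_damping_scale": (0.7, 1.3),      # +/-30% of nominal
    "color_temp_K": (3000.0, 7000.0),
}

def sample_episode_params(rng):
    """Draw one set of simulator parameters for a single episode."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in DR_RANGES.items()}
    params["action_delay_steps"] = rng.randint(0, 3)   # integer steps, inclusive
    params["camera_offset_m"] = [rng.uniform(-0.03, 0.03) for _ in range(3)]
    params["com_offset_m"] = [rng.uniform(-0.02, 0.02) for _ in range(3)]
    return params

rng = random.Random(0)
episode_params = sample_episode_params(rng)
```

Sampling once per episode, rather than once per step, keeps the physics consistent within a rollout, so the policy has to infer the parameters from how the episode unfolds.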
The ranges above are starting points. The right process: start with these ranges, train a policy, deploy it on the real robot, identify the failure mode, and then widen the range for the parameter that explains the failure. If the gripper slips, widen friction. If the arm overshoots, widen action delay and damping. If the image-based policy freezes, widen visual randomization. Widening only where a real failure demands it is how you converge on the minimal DR set for your specific task.
A common failure mode: randomizing too many parameters simultaneously. If you randomize 15 parameters each with wide ranges, the resulting distribution is so broad that the policy cannot find a control law that works across all samples. The training loss plateaus at a high value and the policy is mediocre everywhere. The fix is progressive widening: start with narrow ranges (or zero randomization), train until convergence, then widen the parameter that the real-world deployment suggests is the bottleneck. This is ADR done by hand: less elegant than the automatic version, but more debuggable and faster for a single task.
Real2Sim2Real: the iterative calibration loop
Pure domain randomization treats the real world as a black box: randomize everything and hope reality falls within the distribution. Real2Sim2Real is a more targeted approach: use real-world failures to improve the simulator, then retrain in the improved simulator. The loop has six steps:
Worked example: Real2Sim2Real for a peg-insertion task. The goal: insert a cylindrical peg (8mm diameter) into a hole (8.5mm diameter) on a PCB fixture. Tolerance: 0.25mm lateral, 2° angular.
Iteration 1 — Naive sim training.
Train PPO in Isaac Gym with default physics parameters and moderate DR. Deploy on the real Franka. Result: 45% success rate. Failure analysis from 55 failure videos: 30 failures are "peg catches on hole edge and jams" (contact dynamics mismatch), 15 are "peg approaches at wrong angle" (camera pose error), 10 are "arm oscillates near contact" (action delay mismatch).
Iteration 2 — Tune sim to reproduce failures.
Collect 100 real-world failure trajectories with full state logging (joint positions, torques, end-effector wrench). Replay these trajectories in simulation with different physics parameters. Use Bayesian optimization to find the sim parameters that best reproduce the real contact forces during jamming:
• Real contact stiffness: ~5000 N/m. Default sim: 2000 N/m. Fix: set sim stiffness to 4500–5500 N/m.
• Real friction during peg-hole contact: ~0.4. Default sim: 0.8. Fix: set friction range to 0.3–0.6.
• Real action delay: ~40ms (2 control steps at 50Hz). Default sim: 0 steps. Fix: randomize 1–3 steps.
Retrain with calibrated sim parameters. Result: 62% success rate. Improvement: +17 points.
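The calibration step above can be sketched on a toy problem. Everything here is a stand-in: a linear spring replaces the simulator replay, the "logged" forces are synthetic, and plain random search replaces Bayesian optimization to keep the example dependency-free. Only the 2000 N/m default and the ~5000 N/m real stiffness come from the text.

```python
import random

def simulate_contact_force(stiffness, penetrations_m):
    """Toy stand-in for a simulator replay: linear spring contact, F = k * x."""
    return [stiffness * x for x in penetrations_m]

def replay_error(stiffness, penetrations_m, real_forces):
    """Mean squared error between simulated and logged contact forces."""
    sim = simulate_contact_force(stiffness, penetrations_m)
    return sum((s - r) ** 2 for s, r in zip(sim, real_forces)) / len(real_forces)

rng = random.Random(0)
# Synthetic "logged" data generated with the real stiffness quoted above (~5000 N/m).
penetrations = [rng.uniform(0.0, 0.002) for _ in range(200)]
real_forces = [5000.0 * x + rng.gauss(0.0, 0.05) for x in penetrations]

# Random search over stiffness, starting from the default sim value of 2000 N/m.
best_k = 2000.0
best_err = replay_error(best_k, penetrations, real_forces)
for _ in range(2000):
    k = rng.uniform(1000.0, 10000.0)
    err = replay_error(k, penetrations, real_forces)
    if err < best_err:
        best_k, best_err = k, err
```

In a real pipeline the error function would replay logged joint commands in the simulator and compare full wrench traces. The structure stays the same: propose parameters, replay, score, keep the best.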
Iteration 3 — Add failure scenarios to training.
The remaining 38% failures cluster into two modes: (a) the peg approaches at a slight angle and the chamfer doesn't correct it, (b) the robot hesitates at contact and the PD controller oscillates. Add to training:
• Adversarial initial peg orientations: tilt 0–5° from vertical (vs. 0–2° previously).
• Contact event curriculum: 20% of training episodes start with the peg already touching the hole edge, forcing the policy to learn recovery.
• PD gain randomization widened from ±10% to ±30%.
Retrain. Result: 78% success rate. Improvement: +16 points.
Total improvement: 45% → 62% → 78% across three iterations. Each iteration required ~4 hours of sim training + 2 hours of real-robot data collection + 2 hours of analysis. The remaining 22% failure rate is dominated by edge cases (unusual peg orientations, fixture wear) that would require either more DR breadth or a small amount of real-world RL fine-tuning to close.
The Real2Sim2Real loop converges when the sim failures match the real failures in distribution: if the policy fails at the same rate and in the same ways in sim and real, the simulator is well-calibrated and further sim training will transfer. The practical signal: track the failure mode distribution (not just the success rate) across sim and real. When the pie charts match, you are done calibrating.
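One way to turn "the pie charts match" into a number is the total variation distance between the two failure-mode distributions. The sketch below uses the Iteration 1 real-robot counts from the worked example. The sim counts and the mode names are made up for illustration, and any threshold for "close enough" would be a project-specific choice.

```python
def failure_distribution(counts):
    """Normalize raw failure counts into a probability distribution."""
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()}

def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    modes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in modes)

# Real counts: the 55 failures from Iteration 1. Sim counts: hypothetical.
real_counts = {"jam_on_edge": 30, "wrong_angle": 15, "oscillation": 10}
sim_counts = {"jam_on_edge": 5, "wrong_angle": 20, "oscillation": 15}

tv = total_variation(failure_distribution(real_counts),
                     failure_distribution(sim_counts))
```

Here the distance is about 0.42, driven mostly by jamming being rare in sim and dominant on the real robot, which points at contact dynamics as the first thing to calibrate.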
Domain randomization config — YAML format
```yaml
# dr_config.yaml: Domain randomization for peg insertion
# Each parameter: [low, high] for uniform sampling per episode
physics:
  friction_range: [0.3, 0.6]           # calibrated from real contact data
  contact_stiffness: [4500, 5500]      # N/m, from force-replay matching
  object_mass_range: [0.008, 0.015]    # kg, peg mass variation
  restitution: [0.0, 0.1]              # metal-on-metal, low bounce
actuator:
  kp_range: [420, 780]                 # PD position gain (nominal 600)
  kd_range: [35, 65]                   # PD velocity gain (nominal 50)
  action_delay_steps: [1, 3]           # from real latency measurement
  torque_limit_frac: [0.85, 1.0]       # fraction of nominal max torque
observation:
  joint_pos_noise_std: 0.003           # rad, Gaussian
  joint_vel_noise_std: 0.08            # rad/s
  ee_force_noise_std: 0.5              # N, wrist F/T sensor noise
visual:
  camera_pos_noise_std: 0.03           # m per axis, from mount calibration
  camera_rot_noise_std: 5.0            # degrees per axis
  light_color_temp: [3000, 7000]       # Kelvin
  light_intensity_frac: [0.6, 1.4]     # multiplier on nominal
  texture_randomize: true              # random table/object colors
curriculum:
  initial_tilt_range: [0, 5]           # degrees, peg approach angle
  contact_start_frac: 0.2              # 20% of episodes start at contact
```
When to stop iterating the Real2Sim2Real loop. The loop has diminishing returns after 3–4 iterations. The first iteration (naive sim → real deployment) reveals the largest gaps and typically yields a 15–20 point improvement. The second iteration (calibrated physics) yields 10–15 points. The third (failure-scenario augmentation) yields 5–10 points. After that, the remaining failures are either irreducible stochasticity (the peg slips in a way no sim can model) or require real-world RL to close. The practical stopping criterion: when the failure mode distribution in sim matches the failure mode distribution in real (same failure types in similar proportions), your simulator is well-calibrated and further sim-only improvements will transfer. If the failure modes diverge (sim fails on X, real fails on Y), you have an unmodeled gap that calibration cannot fix — switch to residual RL or HIL-SERL for the final push.
17 Pixel-based RL
Learning from images is harder than learning from state, in ways that are now well-understood.
State-based RL — where the agent observes a low-dimensional state vector — has been a solved problem in many simulated benchmarks since 2018. Pixel-based RL — observing only RGB frames — was a much harder problem until DrQ (Kostrikov et al., 2020) demonstrated that aggressive data augmentation closes most of the gap.
The DrQ family
DrQ
Augment image observations with random shifts ($\pm 4$ pixels), then run SAC. Average $K$ Q-values per state-action pair, computed on $K$ different augmentations. Shockingly, this single change closed the gap to state-based RL on DeepMind Control benchmarks.
DrQ-v2
Replaces SAC with DDPG (deterministic policy + exploration noise schedule), drops the ensembling. Faster, simpler, better on DeepMind Control. The standard pixel-RL baseline since 2021. Key ingredients: n-step returns (3-step), large replay buffer (1M), exploration noise schedule linearly decayed from 1.0 to 0.1 over 500k steps.
DrM
Adds a dormant-neuron reset and a layer-norm tweak. Marginal gains but the right diagnostic frame for why pixel-RL is unstable: large fractions of the network become inactive during training and stop contributing.
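Two of the DrQ-v2 ingredients listed above are simple enough to write out. This is a sketch, not the reference implementation: the function names are hypothetical and the discount $\gamma = 0.99$ is an assumed value that the text does not specify.

```python
def linear_schedule(step, start=1.0, end=0.1, duration=500_000):
    """Exploration-noise std: linear decay from start to end, then held constant."""
    frac = min(max(step / duration, 0.0), 1.0)
    return start + frac * (end - start)

def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """n-step TD target: discounted rewards plus a discounted bootstrap value."""
    n = len(rewards)
    ret = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return ret + (gamma ** n) * bootstrap_q

std_mid = linear_schedule(250_000)    # halfway through the decay
target = n_step_target([1.0, 1.0, 1.0], bootstrap_q=10.0)   # 3-step return
```

The 3-step target leans on three real rewards before it bootstraps, which reduces how much each update depends on the (noisy) target network.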
The augmentation insight
Augmentations work in pixel-RL for the same reason they work in supervised vision: they enforce a useful invariance and act as a regularizer. But the deeper reason is that RL targets are noisy; without augmentation, the network overfits to whatever artifacts the noise produces. Random shifts force the encoder's output to be invariant to small translations and starve the network of the specific pixel-coordinate features it would otherwise memorize.
Why random crop works for RL specifically (the DrQ insight)
In supervised learning, augmentation creates more training data. In RL, the mechanism is different and more fundamental. The Q-network is trained by regression onto a target that is already noisy (a bootstrapped estimate, not a ground-truth label). Without augmentation, the Q-network overfits to the noise: it memorizes "when the red block is at pixel (43, 67), the Q-value is 3.7." This is a spurious correlation: the Q-value should be the same whether a small camera shift puts the block at pixel (43, 67) or at pixel (47, 63).
Random crops enforce exactly this invariance. Two augmented views of the same observation produce slightly different pixel features but must map to the same Q-value. This acts as a consistency regularizer on the encoder: features that are stable under shifts (object positions relative to each other, shapes, colors) survive training, while features sensitive to absolute pixel position (memorized coordinates) do not.
Representation learning for pixel RL
Random crop is the minimum viable augmentation. A family of methods goes further, learning visual representations explicitly useful for control:
CURL (Srinivas et al., 2020) — contrastive learning. Two augmented views of the same frame should be close in embedding space; views from different frames should be far apart. Uses a momentum encoder and InfoNCE loss, exactly like MoCo.
SPR (Schwarzer et al., 2021) — self-predictive representations. The encoder must predict its own future representations: $f(o_{t+k}) \approx g(f(o_t), a_t, \ldots, a_{t+k-1})$. Forces the encoder to capture dynamics-relevant features.
ATC (Stooke et al., 2021) — augmented temporal contrast. Combines contrastive learning with a temporal component: the encoder must match representations of temporally adjacent frames under different augmentations.
All three share the same principle: give the encoder an auxiliary self-supervised objective so that its features are shaped by more than the noisy reward signal. CURL enforces augmentation invariance; SPR and ATC additionally tie the representation to what happens next, which favors features useful for predicting the future over features useful for memorizing the past. On standard benchmarks (DMC-100k, DMC-500k), these methods provide diminishing returns over plain DrQ-v2 when the augmentation is strong. Their value shows on harder tasks: sparse rewards, visual complexity, long horizons.
When pixel RL beats state RL
Counterintuitive — why would learning from images ever be better than learning from ground-truth state? Three cases:
Poor state estimation. If the "state" is noisy or incomplete (no object tracking, no contact sensing), pixels carry strictly more information. A camera sees the object; the state vector might not include its position.
Visual features matter for the task. Sorting by color, quality inspection (detect defects), food handling (ripe vs. unripe). The state vector would need a classifier; the pixel policy learns one implicitly.
Sim-to-real with visual DR. Pixel policies trained with visual domain randomization generalize to new environments. The image encoder serves as a de facto state estimator that works across scenes.
The CURL objective, unpacked
CURL (Contrastive Unsupervised Representation for RL) applies the InfoNCE contrastive loss directly to the RL encoder. The idea: two different random augmentations of the same observation should produce similar representations, while augmentations of different observations should produce dissimilar ones. This is the same principle behind SimCLR and MoCo in computer vision, adapted to the RL setting where the "dataset" is a replay buffer of transitions.
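Written out from the symbol definitions that follow, and assuming a temperature-scaled dot product as the similarity score, the per-query loss takes the standard InfoNCE form:

```latex
\mathcal{L}_q = -\log
  \frac{\exp(q \cdot k_+ / \tau)}
       {\exp(q \cdot k_+ / \tau) + \sum_{j=1}^{K} \exp(q \cdot k_j^- / \tau)}
```

Minimizing it is a $(K+1)$-way classification problem: pick the positive key out of a lineup of negatives.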
$q = f_\theta(\text{aug}_1(o_t))$ — the query: the encoder applied to one random augmentation of observation $o_t$. This is the representation we want to be useful for control.
$k_+ = f_{\bar\theta}(\text{aug}_2(o_t))$ — the positive key: the momentum encoder applied to a different augmentation of the same observation $o_t$. The momentum encoder $f_{\bar\theta}$ is an exponential moving average of $f_\theta$ with coefficient $\alpha = 0.99$ (each update moves $\bar\theta$ 1% of the way toward $\theta$), exactly as in MoCo.
$k_j^-$ — negative keys: representations of augmented views from other observations in the same minibatch. Typically $K = 127$ negatives (batch size 128, one positive per query).
$\tau$ — the temperature. Controls the sharpness of the softmax. $\tau = 0.1$ is standard. Lower temperature makes the loss more sensitive to hard negatives.
The loss forces the encoder to produce representations that are invariant to augmentation (the two views of the same frame map to the same point) and discriminative across time (different frames map to different points). This is exactly the invariance that matters for control: the encoder should not change its output when the camera jiggles by 4 pixels, but should change dramatically when the object moves.
In practice, CURL adds ~10% training overhead (one extra forward pass through the momentum encoder per batch) and provides the largest gains in the low-data regime (DMC-100k: +15% over DrQ on hard tasks like Walker-Walk). At DMC-500k and beyond, the augmentation-only baseline (DrQ-v2) catches up, because with enough data the encoder learns useful features from the RL objective alone.
The CURL momentum encoder is critical. Without it (using the same encoder for both query and key), the contrastive loss collapses: the encoder learns to map everything to a constant vector, which trivially satisfies the contrastive objective. The momentum encoder, updated as $\bar\theta \leftarrow \alpha \bar\theta + (1 - \alpha) \theta$ with $\alpha = 0.99$, provides a slowly-changing reference that prevents this collapse. This is the same stabilization trick used in MoCo and BYOL in self-supervised learning.
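A NumPy sketch of the two pieces just described: the contrastive loss with in-batch negatives, and the momentum update. It assumes unit-norm features and dot-product similarity, uses random vectors in place of encoder outputs, and the function names are hypothetical.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.1):
    """InfoNCE with in-batch negatives: row i of `keys` is the positive for row i of `queries`."""
    logits = queries @ keys.T / temperature                  # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))               # positives sit on the diagonal

def ema_update(target_params, online_params, alpha=0.99):
    """Momentum-encoder update: theta_bar <- alpha * theta_bar + (1 - alpha) * theta."""
    return [alpha * t + (1.0 - alpha) * o for t, o in zip(target_params, online_params)]

def unit_rows(x):
    """Normalize each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = unit_rows(rng.normal(size=(128, 50)))                    # stand-in query features
loss_aligned = info_nce(q, q)                                # keys match their queries
loss_random = info_nce(q, unit_rows(rng.normal(size=(128, 50))))  # keys unrelated
```

With a batch of 128, each query sees one positive and 127 negatives, matching the $K = 127$ figure above. Aligned keys give a loss near zero; unrelated keys give a loss near or above $\log 128$.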
When pixel RL beats state RL — three concrete scenarios
The three cases above, made concrete. In each one, pixels carry strictly more information than a hand-designed state vector:
Scenario 1: Poor state estimation in manipulation. You are building a cloth-folding policy. The "state" vector contains joint positions and a single point-cloud centroid of the cloth. But the cloth's shape is a 50,000-dimensional mesh; the centroid captures almost none of it. A single RGB image captures the cloth's full visible geometry: folds, wrinkles, edges, overlap regions. Pixel RL on an 84×84 RGB image (7,056 pixels, 21,168 values) carries far more task-relevant information than a state vector of roughly ten numbers (joint positions plus a 3-D centroid). Empirical result: pixel RL achieves 72% fold success vs. 35% for state RL on the same task (DeformableRavens benchmark).
Scenario 2: Visual features are the task itself. You are sorting ripe from unripe tomatoes on a conveyor belt. Ripeness is determined by color, texture, and surface blemishes — information that exists only in pixels. The state vector (object position, velocity) tells you where the tomato is, not what it looks like. You would need a separate classification pipeline feeding into the state, at which point the pixel policy is simpler: it learns perception and control jointly from a single reward signal ("pick only red tomatoes"). This is common in food handling, quality inspection, and any task where the decision depends on visual appearance.
Scenario 3: Sim-to-real with visual domain randomization. You train a picking policy in simulation with aggressive visual DR (random textures, lighting, camera perturbation), then deploy on a real robot in a warehouse with lighting conditions and backgrounds never seen in sim. The pixel-based encoder, forced by DR to be invariant to visual distractors, acts as a robust state estimator that works across scenes. A state-based policy trained in sim transfers well only if the state estimation pipeline also works in the real environment. But state estimation (object detection + pose estimation) is its own fragile pipeline that often fails on novel objects and scenes. The pixel policy sidesteps this entirely: the encoder is the state estimator, and DR has made it robust.
Encoder architecture for pixel RL
The standard encoder is surprisingly small: 4 convolutional layers with 3×3 kernels and stride 2 on every layer, with channels [32, 64, 128, 256] in the variant tabulated below. Input: 84×84×9 (3 stacked RGB frames). Output: a 50-dimensional feature vector after a linear projection. Total: ~0.6M parameters. This is much smaller than a ViT-B (86M) or even a ResNet-18 (11M). Why?
RL does millions of gradient updates. Each update uses a tiny batch (256 transitions) compared to supervised learning (thousands of images). A large encoder would overfit catastrophically — it would memorize the replay buffer within a few thousand steps. The small CNN has just enough capacity to extract spatial features (object positions, gripper state) but not enough to memorize textures. This is also why dropout and layer norm matter more in pixel RL than in supervised vision: they are the regularization that prevents the encoder from collapsing.
The architecture in detail, layer by layer:
| Layer | Type | Channels | Kernel | Stride | Output shape | Parameters |
|---|---|---|---|---|---|---|
| Input | - | 9 | - | - | 84 × 84 × 9 | 0 |
| Conv1 | Conv2d + ReLU | 32 | 3 × 3 | 2 | 41 × 41 × 32 | 2,624 |
| Conv2 | Conv2d + ReLU | 64 | 3 × 3 | 2 | 20 × 20 × 64 | 18,496 |
| Conv3 | Conv2d + ReLU | 128 | 3 × 3 | 2 | 9 × 9 × 128 | 73,856 |
| Conv4 | Conv2d + ReLU | 256 | 3 × 3 | 2 | 4 × 4 × 256 | 295,168 |
| Flatten | - | - | - | - | 4,096 | 0 |
| LayerNorm | LayerNorm | - | - | - | 4,096 | 8,192 |
| Linear | Linear + Tanh | - | - | - | 50 | 204,850 |
| Total | | | | | | ~603K |
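The shapes and parameter counts in the table can be re-derived with a few lines of arithmetic. The sketch assumes unpadded convolutions with biases and a LayerNorm with one scale and one shift per feature, which is what the tabulated numbers imply.

```python
def conv_out(size, kernel=3, stride=2):
    """Spatial size after an unpadded convolution."""
    return (size - kernel) // stride + 1

channels = [9, 32, 64, 128, 256]     # input channels, then the four conv widths
size, total = 84, 0
for c_in, c_out in zip(channels[:-1], channels[1:]):
    total += c_in * c_out * 3 * 3 + c_out    # 3x3 weights plus biases
    size = conv_out(size)                    # 84 -> 41 -> 20 -> 9 -> 4

flat = size * size * channels[-1]    # flattened feature count
total += 2 * flat                    # LayerNorm scale and shift
total += flat * 50 + 50              # linear projection to 50 dims
```

The final linear layer holds about a third of all parameters, and Conv4 almost half, so widening the last conv layer or the flattened feature map is what inflates the encoder fastest.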
This is roughly 140× smaller than a ViT-B (86M parameters). The disparity is not a flaw; it is a design constraint imposed by the RL training regime. RL performs 1M+ gradient updates on a replay buffer of ~100K transitions. A ViT-B would memorize the entire replay buffer in under 10K updates and produce Q-values that are perfect on stored transitions but meaningless on new ones. The small CNN is the right inductive bias: local spatial features (edges, objects, gripper position) generalize; global attention patterns do not, at this data scale.
Sample efficiency. The network has to learn perception, value estimation, and control jointly from a single scalar reward. Any of these tasks alone is hard.
Representation collapse. The encoder can converge to features that are temporally smooth but task-irrelevant. The Q-network reports low loss (it predicts the correct bootstrapped target) but the encoder has learned to map all observations to nearly the same point in feature space. The policy then takes the same action everywhere. This is distinct from Q-value overestimation — it is an encoder failure, not a value failure.
Exploration. Random actions in a high-dimensional control space rarely produce useful images; you need either a curiosity bonus, a strong prior, or both.
Non-stationarity. In supervised learning, the dataset is fixed. In RL, the data distribution changes as the policy improves. The encoder must track a moving target: features that were useful for the initial random policy may be useless for the semi-competent policy at 100K steps. This is why replay ratios and target-network update rates matter more in pixel RL than in state RL — they control how fast the representation is allowed to drift.
The representation collapse diagnostic
How do you detect encoder collapse before it ruins a training run? Two cheap signals:
Feature norm variance. Compute the standard deviation of the L2 norm of the encoder output across a batch of 256 observations, and divide it by the mean norm. In a healthy encoder, this ratio is at least 10%. If it drops below 2%, the encoder is collapsing: all observations map to nearly the same feature vector.
Dormant neuron ratio. Count the fraction of ReLU neurons in the encoder that have zero output for > 95% of a batch. If > 30% of neurons are dormant, the encoder is effectively lower-capacity than you designed. This is the DrM diagnostic. The fix: periodically reset dormant neurons (re-initialize their weights) and add layer normalization after each conv layer.
Log both metrics every 10K gradient steps. A healthy training run shows stable feature norm variance (within 50% of its initial value) and dormant neuron ratio below 20% throughout training. If either metric degrades sharply, the encoder is collapsing and training should be restarted with stronger regularization (more aggressive augmentation, layer normalization, or a lower learning rate for the encoder).
These diagnostics cost less than 1% of training time (a single forward pass on a held-out batch every 10K steps) and can save hours of wasted training by catching collapse early.
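Both diagnostics fit in a few lines of NumPy. The sketch assumes features and activations have already been collected as `(batch, units)` arrays (conv feature maps would need flattening first). The function names are hypothetical and the synthetic inputs only illustrate the healthy and collapsed regimes.

```python
import numpy as np

def feature_norm_spread(features):
    """Std of per-sample L2 norms divided by their mean."""
    norms = np.linalg.norm(features, axis=1)
    return float(norms.std() / (norms.mean() + 1e-8))

def dormant_ratio(activations, zero_frac=0.95):
    """Fraction of units whose post-ReLU output is zero on more than `zero_frac` of the batch."""
    frac_zero = (activations <= 0.0).mean(axis=0)
    return float((frac_zero > zero_frac).mean())

rng = np.random.default_rng(0)
# Healthy: feature norms vary from sample to sample.
healthy = rng.normal(size=(256, 50)) * rng.uniform(0.5, 1.5, size=(256, 1))
# Collapsed: every observation maps to almost the same vector.
collapsed = np.ones((256, 50)) + 1e-3 * rng.normal(size=(256, 50))

acts = np.maximum(rng.normal(size=(256, 100)), 0.0)   # synthetic post-ReLU activations
acts[:, :40] = 0.0                                    # force 40% of units dormant
```

On these inputs the healthy batch clears the 10% threshold, the collapsed batch falls well under 2%, and the dormant ratio comes out at 0.4, above the 30% alarm level.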
The augmentation-regularization tradeoff
Stronger augmentation improves generalization but hurts sample efficiency. Random shifts of $\pm 4$ pixels are the sweet spot for most DMC tasks: enough to prevent memorization, not so much that the observation becomes ambiguous (a $\pm 12$ shift can move a small object entirely out of frame in an 84×84 image). For manipulation tasks with larger images (224×224), the shift should be proportionally larger ($\pm 10$–$20$ pixels) to maintain the same effective invariance. Color jitter and random erasing help on tasks with visual complexity (multiple objects, textured backgrounds) but add no benefit on simple tasks (single object, solid background). The diagnostic: if adding augmentation does not improve eval performance, the task is not visually complex enough to benefit.
Beyond random shifts, three augmentation strategies have been validated for pixel RL:
Random convolution. Apply a small random convolutional filter (3×3, random weights) to the observation. This perturbs textures without changing spatial structure. Useful for sim-to-real: the random conv teaches the encoder to ignore texture details that differ between sim and real.
Color jitter. Random brightness, contrast, saturation, and hue shifts. Standard in supervised vision, but must be applied carefully in RL: extreme hue shifts can make a red object look green, confusing color-conditioned tasks. Limit hue to $\pm 10\%$.
Cutout / random erasing. Mask a random rectangular patch (10–30% of the image) with gray. Forces the encoder to use distributed spatial features rather than relying on a single salient region. Particularly useful for manipulation tasks where the encoder might over-attend to the robot arm (large, always present) rather than the small object being manipulated.
The combination matters. DrQ-v2 uses only random shifts. Adding color jitter and cutout on top of random shifts typically adds 5–10% return on visually complex tasks (e.g., robot manipulation with cluttered backgrounds) but adds nothing on simple tasks (e.g., single-object locomotion in DMC). The diagnostic: train with shifts only, then train with all augmentations, and compare performance at 100K environment steps. If the gap is less than 5%, the extra augmentations are not worth the added hyperparameter complexity.
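As an illustration of the least familiar of the three, here is a minimal cutout sketch in NumPy. It assumes images scaled to [0, 1] and a square patch; both are choices made for this example, and `random_cutout` is a hypothetical helper.

```python
import numpy as np

def random_cutout(image, rng, min_frac=0.1, max_frac=0.3, fill=0.5):
    """Mask one random square covering roughly min_frac..max_frac of the image area."""
    h, w = image.shape[:2]
    side = np.sqrt(rng.uniform(min_frac, max_frac))   # side length as a fraction of each axis
    ph = max(1, int(round(h * side)))
    pw = max(1, int(round(w * side)))
    top = rng.integers(0, h - ph + 1)                 # upper bound is exclusive
    left = rng.integers(0, w - pw + 1)
    out = image.copy()                                # leave the input untouched
    out[top:top + ph, left:left + pw] = fill          # gray fill
    return out

rng = np.random.default_rng(0)
image = rng.uniform(size=(84, 84, 3))
masked = random_cutout(image, rng)
```

Applying the same mask to every frame in a stack, rather than a fresh mask per frame, keeps the stacked frames temporally consistent.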
Why random crop outperforms all other augmentations in pixel RL. Consider what each augmentation destroys: color jitter destroys color information, cutout destroys local spatial information, random crop destroys absolute position information. In RL, the Q-function's most common failure mode is memorizing absolute pixel coordinates: "when the red block is at (43, 67), the return is 3.7." Random crop is the only augmentation that directly attacks this failure mode, because it shifts the entire image by up to $\pm 4$ pixels, making absolute coordinates unreliable. Color jitter and cutout attack different failure modes (color memorization, local patch memorization) that are less common in practice. This is why random crop is the single augmentation that works across all pixel-RL benchmarks, while the others provide task-dependent gains.
The pretraining shortcut
Replace the encoder with a frozen visual foundation model (CLIP, DINOv2, R3M). The RL problem becomes "learn a policy on a 768-dim feature vector," which is much closer to state-based RL. This is the dominant pattern in 2026 — pure pixel-RL from scratch is rare; pixel-RL on top of a frozen foundation model is common.
The frozen-encoder approach has a second benefit beyond sample efficiency: it decouples visual generalization from policy learning. A DINOv2 encoder trained on 142M images provides features that generalize across lighting, backgrounds, and object instances. The RL policy on top only needs to learn the mapping from features to actions, which is a low-dimensional regression problem. The result: RL policies built on frozen foundation-model encoders transfer across visual domains (sim-to-real, lab-to-kitchen) without any visual domain randomization. The encoder handles visual generalization; the RL handles motor generalization.
Worked example: pixel RL training budget comparison. Task: DMC Walker-Walk (locomotion from pixels). All methods use the same 84×84×9 observation.
DrQ-v2 (learned encoder): 500K environment steps, ~2M gradient updates, 2 hours on 1 GPU. Final return: 920/1000. Encoder: 600K params, trained end-to-end. The encoder learns basic spatial features (limb positions, ground contact) but is fragile to visual perturbations.
CURL (contrastive encoder): 100K environment steps, ~400K gradient updates, 45 minutes. Final return: 880/1000. Reaches about 96% of DrQ-v2's final return (880 vs. 920) in 5× fewer steps. The contrastive objective provides a better learning signal in the low-data regime. Beyond 200K steps, DrQ-v2 catches up and surpasses CURL.
Frozen DINOv2 + SAC: 100K environment steps, ~400K gradient updates, 30 minutes. Final return: 940/1000. The frozen encoder provides 768-dim features that are already informative; SAC only trains a 2-layer MLP critic and actor (~200K params). Converges faster, generalizes better, and does not suffer from encoder collapse. The downside: the DINOv2 features are not optimized for the specific task, so there is a ceiling on tasks that require perceiving fine-grained task-specific visual cues.
Worked example: DrQ-v2 augmentation. Input image: 84×84×3 (stacked 3 frames = 84×84×9). Random shift: pad the image by 4 pixels on each side (92×92), then randomly crop back to 84×84. This shifts the image by at most ±4 pixels in any direction. The shift simulates small camera calibration errors and forces the encoder to learn features that are robust to exact pixel positions.
At inference, DrQ-v2 uses the unshifted observation, which is the same as the center crop of the padded image (no randomness). The training augmentation acts as a regularizer that prevents the Q-network from memorizing pixel-specific features. Without it, the Q-network memorizes the exact pixel coordinates of objects in the replay buffer and generalizes poorly to new positions.
Computational cost: essentially zero — a pad-and-crop is two lines of PyTorch and adds <0.1ms per batch. This is why random shift is the universal default: it costs nothing and prevents the most common failure mode.
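The pad-and-crop described in the worked example, sketched in NumPy rather than PyTorch so that it runs without a deep-learning framework. Edge-replication padding is an assumption here, and `random_shift` is a hypothetical helper name.

```python
import numpy as np

def random_shift(obs, rng, pad=4):
    """Pad each side by `pad` pixels (edge replication), then crop back to the original size."""
    h, w = obs.shape[:2]
    padded = np.pad(obs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)    # offsets 0..2*pad map to shifts of -pad..+pad
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
obs = rng.integers(0, 256, size=(84, 84, 9), dtype=np.uint8)   # three stacked RGB frames
shifted = random_shift(obs, rng)

# The center crop of the padded image recovers the original observation.
center = np.pad(obs, ((4, 4), (4, 4), (0, 0)), mode="edge")[4:88, 4:88]
```

All nine channels share one offset, so the three stacked frames stay aligned with each other and the motion cues between them survive the augmentation.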
18 World models
Imagine the future, plan inside the imagination, hope the imagination is right.
A world model is a learned dynamics model — a network that predicts $p(s_{t+1} \mid s_t, a_t)$ — plus the apparatus to use it.
The Dreamer family
The recurrent state-space model (RSSM)
The world model factorizes the state into a deterministic component $h_t$ (a GRU's hidden state) and a stochastic component $z_t$ (a categorical or Gaussian latent).
Derivation: the RSSM
Why two components? The deterministic path $h_t$ captures the predictable dynamics — inertia, gravity, trajectory continuation. The stochastic path $z_t$ captures irreducible uncertainty — contact outcomes, object slippage, unobserved state. Together they form a sufficient representation for both prediction and control.
In plain English: the robot dreams about what will happen next. It keeps a "confident prediction" (the deterministic state — like knowing that a thrown ball will keep going up) and an "uncertain guess" (the stochastic state — like not knowing whether the ball will bounce or stick on landing). Together, these two components let the model hallucinate realistic future trajectories, and the policy learns entirely inside this hallucination.
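Collected in one place, and reconstructed from the component definitions that follow, the RSSM consists of four relations:

```latex
\begin{aligned}
\text{Recurrence:} \quad & h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1}) \\
\text{Posterior:}  \quad & z_t \sim q_\phi(z_t \mid h_t, x_t) \\
\text{Prior:}      \quad & \hat z_t \sim p_\theta(\hat z_t \mid h_t) \\
\text{Decoder:}    \quad & \hat x_t \sim p_\theta(\hat x_t \mid h_t, z_t)
\end{aligned}
```

Only the posterior takes the observation $x_t$ as input, which is why imagination rollouts can run on the recurrence and the prior alone.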
$h_t \in \mathbb{R}^{600}$ — the deterministic hidden state, computed by a GRU. Captures the predictable, inertial dynamics (momentum, gravity, trajectory trends). Think of it as "what the model is confident will happen."
$z_t$ — the stochastic latent. In DreamerV3, this is 32 categorical variables each with 32 classes (= 1024-dim one-hot). Captures irreducible uncertainty — contact outcomes, slippage, hidden state not visible in the image.
$f_\theta$ — the recurrence function (a GRU). Takes the previous deterministic state, stochastic state, and action as input. This is the backbone that carries temporal context forward.
$q_\phi(z_t \mid h_t, x_t)$ — the posterior (encoder). Sees the actual observation $x_t$ (camera image) and the deterministic state $h_t$, and outputs a distribution over $z_t$. Used during training only — gives the "correct" stochastic state because it can peek at the real image.
$p_\theta(\hat z_t \mid h_t)$ — the prior (predictor). Must predict $z_t$ from $h_t$ alone, without seeing the image. Used during imagination rollouts at inference. The KL loss trains this to match the posterior.
$x_t$ — the observation (camera image). The decoder $p_\theta(\hat x_t \mid h_t, z_t)$ reconstructs it, ensuring the latent state $(h_t, z_t)$ contains enough information about the scene.
$a_{t-1}$ — the action taken at the previous timestep. The recurrence must know what action caused the current state transition.
The training loss combines three terms: (1) image reconstruction $-\log p_\theta(x_t \mid h_t, z_t)$, (2) reward prediction $-\log p_\theta(r_t \mid h_t, z_t)$, and (3) KL divergence $\mathrm{KL}(q_\phi(z_t \mid h_t, x_t) \| p_\theta(z_t \mid h_t))$ which forces the prior to predict the posterior without seeing the image.
Worked example: RSSM forward pass. At time $t-1$, the model has deterministic state $h_{t-1} \in \mathbb{R}^{600}$, stochastic state $z_{t-1}$ (32 categoricals × 32 classes), and the agent took action $a_{t-1} \in \mathbb{R}^{7}$.
Recurrence: $h_t = \text{GRU}(h_{t-1}, [z_{t-1}; a_{t-1}])$. The GRU takes a concatenated input of $z_{t-1}$ (32×32 = 1024 dim after one-hot) and $a_{t-1}$ (7 dim), total 1031 dim. Output: $h_t \in \mathbb{R}^{600}$.
Prior: $p_\theta(z_t \mid h_t) = \text{MLP}(h_t) \to$ logits for 32 categoricals, each 32 classes. During imagination, we sample from this.
Posterior: $q_\phi(z_t \mid h_t, x_t) = \text{MLP}([h_t; \text{enc}(x_t)]) \to$ same 32×32 categoricals. During training with real observations, we use the posterior (it sees the image).
KL: Between two categorical distributions. For category $j$ with posterior probs $q_j$ and prior probs $p_j$: $\mathrm{KL}_j = \sum_c q_{jc} \log(q_{jc}/p_{jc})$. Summed across the 32 categoricals. DreamerV3 uses KL balancing: $\beta_{\text{prior}} \cdot \mathrm{KL}(\text{sg}[q] \| p) + \beta_{\text{post}} \cdot \mathrm{KL}(q \| \text{sg}[p])$, where $\text{sg}$ = stop gradient. The first term sends gradients only to the prior (pulling it toward the posterior); the second sends gradients only to the posterior (pulling it toward the prior).
Decoder: $p_\theta(x_t \mid h_t, z_t) = \text{ConvTranspose}([h_t; z_t])$, reconstructing the 64×64 image. The reconstruction loss is MSE in pixel space.
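The KL term in the worked example can be checked numerically. The sketch below uses random logits in place of network outputs. In a forward-only computation like this the two balanced terms have the same value; they differ only in which side receives gradients during training.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_kl(post_logits, prior_logits):
    """KL(q || p) summed over independent categoricals; inputs have shape (32, 32)."""
    q, p = softmax(post_logits), softmax(prior_logits)
    return float(np.sum(q * (np.log(q + 1e-8) - np.log(p + 1e-8))))

rng = np.random.default_rng(0)
post_logits = rng.normal(size=(32, 32))     # 32 categoricals x 32 classes
prior_logits = rng.normal(size=(32, 32))

kl = categorical_kl(post_logits, prior_logits)
gru_input_dim = 32 * 32 + 7                 # one-hot z (1024) plus a 7-dim action
```

The KL is zero only when the prior reproduces the posterior exactly, which is the condition under which imagined rollouts match what the model would have inferred from real images.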
Derivation: imagination rollouts
Once trained, the world model enables "dreaming." Starting from a real observation $(h_0, z_0)$:
The actor selects action $a_0 = \pi_\psi(h_0, z_0)$.
The deterministic recurrence gives $h_1 = f_\theta(h_0, z_0, a_0)$.
The prior samples $\hat z_1 \sim p_\theta(\hat z_1 \mid h_1)$ — no real image needed.
The reward predictor estimates $\hat r_1$.
Repeat for 15–50 imagined steps.
The actor and critic are trained purely on these imagined trajectories. The actor maximizes the sum of predicted rewards (with the entropy bonus for DreamerV3); the critic estimates the value function. No real environment interaction is required during this phase. This is the core sample-efficiency mechanism: one real trajectory generates thousands of imagined ones.
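The rollout loop above can be sketched with toy stand-ins for the learned networks. Everything here is an assumption made for illustration: the random linear maps, the tiny dimensions, and a Gaussian prior sample in place of the categorical latents described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
H_DIM, Z_DIM, A_DIM = 8, 4, 2  # toy sizes, far smaller than a real RSSM

# Hypothetical stand-ins for the learned networks (random linear maps).
W_h = rng.normal(scale=0.3, size=(H_DIM, H_DIM + Z_DIM + A_DIM))
W_z = rng.normal(scale=0.3, size=(Z_DIM, H_DIM))
W_a = rng.normal(scale=0.3, size=(A_DIM, H_DIM + Z_DIM))
w_r = rng.normal(scale=0.3, size=H_DIM + Z_DIM)

def actor(h, z):
    # pi_psi(h, z): action from the current latent state.
    return np.tanh(W_a @ np.concatenate([h, z]))

def recurrence(h, z, a):
    # f_theta(h, z, a): deterministic state update.
    return np.tanh(W_h @ np.concatenate([h, z, a]))

def sample_prior(h):
    # z_hat ~ p_theta(z | h); Gaussian here for brevity, no image involved.
    return W_z @ h + 0.1 * rng.normal(size=Z_DIM)

def predict_reward(h, z):
    return float(w_r @ np.concatenate([h, z]))

def imagine(h0, z0, horizon=15):
    """Roll the model forward without touching the environment."""
    h, z = h0, z0
    rewards = []
    for _ in range(horizon):
        a = actor(h, z)
        h = recurrence(h, z, a)
        z = sample_prior(h)
        rewards.append(predict_reward(h, z))
    return rewards

rewards = imagine(np.zeros(H_DIM), np.ones(Z_DIM), horizon=15)
```

The point of the sketch is the data flow: after the starting state, nothing inside `imagine` depends on a real observation.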
Figure legend: deterministic $h$ · posterior (sees image) · KL prior↔posterior · policy training
DreamerV3 details that matter
Symlog transformations on rewards and values: $\text{symlog}(x) = \text{sign}(x)\log(|x|+1)$. Compresses the dynamic range so the same loss works across tasks with rewards in [-1,1] or [0, 1000].
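Two-hot encoding of returns: predict a categorical over a discretized return range, train it against a two-hot target that splits each scalar between its two nearest bins, and decode as the probability-weighted bin value. Stabilizes value learning.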
Categorical latents with straight-through gradients: 32-dim categorical with 32 classes per dim, instead of Gaussian latents. Empirically more stable.
KL balancing: separate scaling for the "make posterior close to prior" and "make prior close to posterior" terms of the KL. Prevents posterior collapse.
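Symlog and two-hot encoding are easy to check by hand. The sketch below is an assumption-laden illustration rather than DreamerV3's code: the bin range (-20 to 20 in symlog space, 41 bins) and the function names are chosen only for this example.

```python
import numpy as np

def symlog(x):
    return np.sign(x) * np.log(np.abs(x) + 1.0)

def symexp(y):
    # Inverse of symlog.
    return np.sign(y) * (np.exp(np.abs(y)) - 1.0)

def two_hot(value, bins):
    """Spread a scalar over its two neighbouring bins.

    The probability-weighted bin average equals the (clipped) value.
    """
    value = float(np.clip(value, bins[0], bins[-1]))
    hi = int(np.searchsorted(bins, value, side="right"))
    hi = min(max(hi, 1), len(bins) - 1)
    lo = hi - 1
    w_hi = (value - bins[lo]) / (bins[hi] - bins[lo])
    target = np.zeros(len(bins))
    target[lo] = 1.0 - w_hi
    target[hi] = w_hi
    return target

bins = np.linspace(-20.0, 20.0, 41)       # assumed bin layout, symlog space
target = two_hot(symlog(1000.0), bins)    # training target for a return of 1000
decoded = symexp(float(target @ bins))    # decode: expected bin, then symexp
```

A return of 1000 lands between the bins at 6 and 7 in symlog space, so the network only ever regresses onto a bounded categorical target, whatever the raw reward scale.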
DayDreamer
Wu et al., 2022. Dreamer applied to four real robots. The headline result was not the algorithm — it was the framing: an A1 quadruped learned to walk in 1 hour from scratch, on real hardware, with no simulator. Dreamer's sample efficiency made real-world RL plausible.
Where world models help
Sample-efficient real-world RL when the dynamics model is easier to learn than the policy.
Transfer: a world model trained on one task can be reused for a related task.
Long-horizon credit assignment: imagined rollouts can be 50+ steps without reset costs.
Where world models struggle
Contact-rich manipulation, where prediction errors compound fast and the model can't track sliding contacts.
Open-ended environments where reconstruction loss spends capacity on irrelevant background pixels.
Interactive: latent imagination rollout (legend: true trajectory; imagined, from model; divergence)
TD-MPC / TD-MPC2: planning in latent space
The Dreamer family learns a world model and then trains a policy inside the dream. TD-MPC (Hansen et al., 2022) takes a different bet: learn a latent dynamics model, but don't reconstruct observations and don't train a policy end-to-end. Instead, plan at test time using Model Predictive Path Integral (MPPI) control directly in the learned latent space.
The key insight: pixel reconstruction wastes model capacity on visual details that are irrelevant to control. Shadows, textures, background clutter — none of it affects the optimal action. TD-MPC's model only needs to be accurate in the policy-relevant part of state space: the part that predicts rewards and value.
Architecture
In plain English: compress the camera image into a compact code, then learn to predict how that code changes when you take an action, without ever trying to reconstruct the image. The model learns a "physics simulator" that operates entirely on learned representations, not pixels. At test time: sample 512 candidate action plans, simulate them all in the learned model, and pick the best one. Total planning time: on the order of 10 ms.
Five learned components, all operating in latent space:
TD-MPC components
$$\begin{aligned}
h &= f_\theta(o) \quad &&\text{encoder}\\
h' &= d_\theta(h, a) \quad &&\text{latent dynamics}\\
\hat r &= R_\theta(h, a) \quad &&\text{reward predictor}\\
\hat V &= v_\theta(h) \quad &&\text{value predictor}\\
a &\sim \pi_\theta(h) \quad &&\text{policy prior}
\end{aligned}$$
$f_\theta(o)$ — the encoder. Maps a raw observation $o$ (image or proprioceptive state) to a compact latent representation $h$. No decoder — the model never reconstructs the observation.
$d_\theta(h, a)$ — the latent dynamics model. Predicts the next latent state $h'$ given the current latent and action. This is where the "world model" lives — but it operates entirely in learned representation space, not pixel space.
$R_\theta(h, a)$ — the reward predictor. Estimates the immediate reward for being in latent state $h$ and taking action $a$. Trained with a standard regression loss against observed rewards.
$v_\theta(h)$ — the value predictor. Estimates the expected discounted return from latent state $h$. This is the terminal value used at the end of planning rollouts — it bootstraps value beyond the planning horizon.
$\pi_\theta(h)$ — the policy prior. A learned policy that provides the initial action distribution for MPPI sampling. Without it, MPPI would sample random actions, which in high-dimensional continuous spaces is catastrophically inefficient. The policy prior focuses the search around good actions; MPPI refines from there.
The training loss combines three objectives: (1) latent dynamics consistency via a joint-embedding loss (the predicted next latent must match the encoded next observation), (2) reward prediction, and (3) temporal-difference value learning. No reconstruction. No KL. No decoder.
MPPI planning at test time
At each real timestep, TD-MPC runs a planning loop entirely inside the learned model:
Encode the current observation: $h_0 = f_\theta(o_t)$.
Sample $N$ action sequences of length $H$ from the policy prior $\pi_\theta$, with added Gaussian noise.
For each candidate sequence, roll forward through $d_\theta$ to get $h_1, h_2, \ldots, h_H$.
Score each trajectory: sum of predicted rewards $\sum_{k=0}^{H-1} \gamma^k R_\theta(h_k, a_k)$ plus the terminal value $\gamma^H v_\theta(h_H)$.
Weight trajectories by exponentiated returns and compute the weighted mean action.
Execute only the first action. Replan from scratch at the next step.
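The six steps above can be sketched end to end on a toy problem. This is a hedged illustration, not TD-MPC's code: the "latent" is a 2-D point, and `dynamics`, `reward`, `value`, and `policy_prior` are hand-written stand-ins for $d_\theta$, $R_\theta$, $v_\theta$, and $\pi_\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
GOAL = np.array([1.0, -0.5])  # toy target in a 2-D "latent" space

def dynamics(h, a):
    # Stand-in for d_theta: a small step in the direction of the action.
    return h + 0.1 * a

def reward(h, a):
    # Stand-in for R_theta: negative squared distance to the goal.
    return -np.sum((h - GOAL) ** 2, axis=-1)

def value(h):
    # Stand-in for v_theta, used as the terminal bootstrap.
    return -np.sum((h - GOAL) ** 2, axis=-1)

def policy_prior(h):
    # Stand-in for pi_theta: head toward the goal, with bounded actions.
    return np.clip(GOAL - h, -1.0, 1.0)

def mppi_plan(h0, n=512, horizon=5, gamma=0.99, lam=0.5, noise_std=0.3):
    h = np.repeat(h0[None, :], n, axis=0)        # n copies of the start latent
    returns = np.zeros(n)
    first_actions = None
    for k in range(horizon):
        a = policy_prior(h) + noise_std * rng.normal(size=h.shape)
        a = np.clip(a, -1.0, 1.0)
        if k == 0:
            first_actions = a
        returns += gamma**k * reward(h, a)
        h = dynamics(h, a)
    returns += gamma**horizon * value(h)         # bootstrap beyond the horizon
    w = np.exp((returns - returns.max()) / lam)  # softmax, shifted for stability
    w /= w.sum()
    return (w[:, None] * first_actions).sum(axis=0)

h = np.zeros(2)
start_dist = np.linalg.norm(h - GOAL)
for _ in range(30):                              # execute first action, replan
    h = dynamics(h, mppi_plan(h))
end_dist = np.linalg.norm(h - GOAL)
```

Only the first action of each plan ever reaches the "environment"; the outer loop replans from the new state every step, which is what keeps model error from accumulating over the episode.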
Derivation: the MPPI update
Why importance-weighted averaging? MPPI is a sampling-based approximation to the optimal control law. Instead of solving a Bellman equation, it approximates the optimal action as a soft maximum over sampled trajectories, weighted by their returns. The temperature $\lambda$ controls how greedy the weighting is: $\lambda \to 0$ converges to the max-return trajectory; large $\lambda$ averages across all candidates.
In plain English: sample N candidate plans, simulate them all in the learned model, score each plan by its predicted total reward, then take a weighted average in which the best plans get exponentially more vote. Execute only the first action of that weighted average, then replan from scratch. This is "planning by committee": the members that scored highest dominate the vote.
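Given $N$ sampled action sequences $\{a^{(i)}_{0:H-1}\}_{i=1}^N$, each producing a return estimate $S^{(i)}$:
$$ w^{(i)} = \frac{\exp\left(S^{(i)}/\lambda\right)}{\sum_{j=1}^{N} \exp\left(S^{(j)}/\lambda\right)}, \qquad a^*_0 = \sum_{i=1}^{N} w^{(i)}\, a^{(i)}_0 $$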
$S^{(i)} = \sum_{k=0}^{H-1} \gamma^k R_\theta(h^{(i)}_k, a^{(i)}_k) + \gamma^H v_\theta(h^{(i)}_H)$ — the return estimate for trajectory $i$. Accumulated predicted rewards plus bootstrapped terminal value.
$w^{(i)}$ — the importance weight. A softmax over returns with temperature $\lambda$. High-return trajectories get exponentially more weight.
$\lambda$ — the temperature. Controls the sharpness of the weighting. $\lambda = 0.5$ is typical. Lower values make MPPI more greedy (closer to "pick the best trajectory"); higher values average more broadly.
$a^*_0$ — the executed action. The importance-weighted mean of the first actions across all $N$ trajectories. Only this single action is executed; the rest of the planned sequence is discarded and replanning happens at $t+1$.
In code: weights = F.softmax(returns / lam, dim=0), then action = (weights.unsqueeze(-1) * first_actions).sum(0). With N=512 candidates and H=5 planning horizon, the full MPPI loop (sample, rollout, weight, average) runs in ~8ms on GPU, fast enough for 50Hz control. The policy prior seeds the samples so most candidates are reasonable; without it, uniform random sampling in a 7D action space spends the vast majority of candidates on trajectories that are nowhere near useful.
Worked example: MPPI with 5 trajectories. Suppose we have planning horizon $H = 4$, discount $\gamma = 0.99$, temperature $\lambda = 0.5$, and 5 sampled trajectories with return estimates:
Trajectory 3 ($S = 4.1$) gets 53% of the weight. Trajectory 2 ($S = 1.8$) gets 0.5%. The exponential weighting is aggressive: mediocre trajectories are effectively ignored.
Step 3. Suppose the first actions were $a^{(1)}_0 = 0.3,\; a^{(2)}_0 = -0.1,\; a^{(3)}_0 = 0.7,\; a^{(4)}_0 = 0.4,\; a^{(5)}_0 = 0.6$ (scalar for simplicity).
$a^*_0 = 0.088 \times 0.3 + 0.005 \times (-0.1) + 0.530 \times 0.7 + 0.022 \times 0.4 + 0.355 \times 0.6 = 0.026 - 0.001 + 0.371 + 0.009 + 0.213 = 0.618$.
The executed action is pulled toward the best trajectories. With more samples ($N = 512$ is typical in TD-MPC2), the weighted mean becomes a lower-variance estimate of the best action under the learned model.
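The arithmetic of this worked example can be replayed in plain Python. One loud assumption: only $S^{(3)} = 4.1$ and $S^{(2)} = 1.8$ are stated in the text, so the other three returns below are back-solved from the quoted weights at $\lambda = 0.5$ and should be read as illustrative.

```python
import math

lam = 0.5
# S(3) = 4.1 and S(2) = 1.8 come from the text. The other three values are
# back-solved from the quoted weights and are an assumption of this sketch.
returns = [3.2, 1.8, 4.1, 2.5, 3.9]
first_actions = [0.3, -0.1, 0.7, 0.4, 0.6]

# Softmax with temperature, shifted by the max for numerical stability.
m = max(returns)
unnorm = [math.exp((s - m) / lam) for s in returns]
total = sum(unnorm)
weights = [u / total for u in unnorm]

# Importance-weighted mean of the first actions.
a_star = sum(w * a for w, a in zip(weights, first_actions))
```

With these returns the weights come out near 0.088, 0.005, 0.530, 0.022, 0.355, and the weighted first action lands at about 0.62. Halving $\lambda$ would push nearly all of the weight onto trajectory 3.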
TD-MPC2: one model, 80+ tasks
TD-MPC2 (Hansen et al., 2024) is the multi-task scaling version: a single 317M-parameter model trained across 80+ continuous control tasks from DeepMind Control, Meta-World, and MyoSuite. It is among the first world models shown to be multi-task at this scale: one set of weights, no task-specific heads, no finetuning.
Key changes from TD-MPC to TD-MPC2:
Task embeddings. A learned task embedding vector $e_\tau$ is concatenated to the latent state everywhere. The dynamics, reward, value, and policy networks all condition on $e_\tau$.
Larger model. Encoder becomes a 5-layer MLP with layer norm. Dynamics model uses a 2-layer MLP with residual connections. 317M parameters total — about 100x larger than TD-MPC.
Normalized latent space. SimNorm (simplex normalization) on the latent state prevents collapse and keeps the representation stable across tasks with wildly different observation scales.
No reconstruction loss ever. The model is trained purely on reward prediction, value prediction, and latent consistency. This is the key architectural commitment: the latent space is shaped entirely by what matters for control.
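SimNorm is simple enough to write out. The sketch below assumes the usual description (split the latent into fixed-size groups and apply a softmax inside each group); the group size of 8 and the function name are assumptions for illustration.

```python
import numpy as np

def simnorm(z, group_size=8):
    """Project each group of latent dimensions onto a probability simplex."""
    *lead, dim = z.shape
    assert dim % group_size == 0, "latent size must be divisible by group size"
    g = z.reshape(*lead, dim // group_size, group_size)
    g = g - g.max(axis=-1, keepdims=True)   # stabilise the exponentials
    e = np.exp(g)
    out = e / e.sum(axis=-1, keepdims=True)
    return out.reshape(*lead, dim)

rng = np.random.default_rng(0)
raw = rng.normal(scale=50.0, size=(4, 64))  # deliberately badly scaled input
latent = simnorm(raw)
group_sums = latent.reshape(4, 8, 8).sum(axis=-1)
```

Whatever the scale of the raw encoder output, every entry of `latent` lies in [0, 1] and every group sums to 1. A bounded latent like this is what keeps the representation stable across tasks with very different observation scales.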
FOWM: finetuning offline world models
FOWM (Feng et al., 2023) asks a natural question: can you pretrain a world model on diverse offline data and then finetune it for a specific task? The answer is yes, and the key insight is that world model pretraining transfers better than policy pretraining. A world model trained on 10 different manipulation tasks captures general physics (object permanence, gravity, contact dynamics) that is reusable. A policy trained on 10 tasks captures 10 specific behavior patterns that may not compose.
The recipe: pretrain the latent dynamics model and reward predictor on a broad offline dataset (mixed quality, mixed tasks), freeze the dynamics, then finetune only the policy and value heads on target-task data. The pretrained dynamics acts as a learned physics simulator.
Diffusion-based world models
The most recent wave treats world modeling as a video generation problem. Instead of learning compact latent dynamics, these models generate full future frames — and planning happens by conditioning the generation on desired outcomes.
UniSim
Yang et al., 2024. A video diffusion model trained to simulate how the world changes in response to actions. Given a frame and an action description (language or discrete control), UniSim generates the next frames. It is a universal simulator in the sense that the same model handles navigation, manipulation, and human-object interaction. The cost: inference is orders of magnitude slower than latent-space models. The benefit: the "world model" generalizes to novel scenes because it inherits the generalization of large-scale video pretraining.
Genie / Genie 2
Bruce et al., 2024. Learned world models from internet video. Genie learns a latent action space from unlabeled video — it discovers that certain latent codes correspond to "move left" or "jump" without ever seeing labels. Genie 2 scales this to photorealistic 3D environments with consistent geometry, generating long-horizon interactive worlds from a single image prompt. The relevance to robotics: if you can build an action-controllable world model from internet-scale data, you have a free simulator for any environment that appears in video.
Genesis
Xian et al., 2024. A generative, open-source physics engine with differentiable simulation. Unlike the learned models above, Genesis combines classical physics solvers (rigid body, soft body, fluid, cloth) with a differentiable rendering pipeline. The world model is not learned — it is engineered — but it is differentiable end-to-end, so you can backpropagate through the simulation to optimize policies or system parameters. It sits at the intersection of classical simulation and learned world models.
World model comparison
| Method | Approach | Planning | Multi-task | Real-robot results |
| --- | --- | --- | --- | --- |
| DreamerV3 | RSSM latent + pixel reconstruction | Learned policy in imagination | Single-task (per model) | DayDreamer: A1 walking in 1hr |
| TD-MPC2 | Latent dynamics, no reconstruction | MPPI in latent space | 80+ tasks, single 317M model | Demonstrated on real WidowX |
| UniSim | Video diffusion as simulator | Conditioning on desired outcomes | Broad (inherits video pretraining) | Not yet; inference too slow |
| FOWM | Pretrained latent dynamics, finetune policy | Learned policy + MPC hybrid | Transfer via pretraining | Demonstrated on real xArm |
The world model landscape is splitting into two camps: compact latent models (Dreamer, TD-MPC2) that are fast enough for real-time control and generative models (UniSim, Genie 2) that inherit the generalization of internet-scale pretraining but are too slow for closed-loop control. The bet for 2026–2027 is whether distillation can bridge the gap — training a fast latent model to match the predictions of a slow generative one.
19 Offline RL
Reinforcement learning when you cannot interact.
Offline RL learns a policy from a fixed dataset of transitions $\{(s, a, r, s')\}$, with no further interaction. It is what you do when you have demonstrations and rewards but no robot to run on. The central failure mode is distributional shift in the value target: the Bellman backup queries Q at out-of-distribution actions, and the network's extrapolation there is unreliable.
The three responses
CQL — Conservative Q-Learning
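Penalize Q-values for OOD actions. The intuition: standard Q-learning overestimates OOD actions because the network extrapolates optimistically. CQL adds a regularizer that pushes Q down everywhere except on actions seen in the data:
$$ \mathcal{L}_{\text{CQL}} = \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_a \exp Q(s,a) - \mathbb{E}_{a \sim \pi_\beta}[Q(s,a)]\right] $$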
$\mathbb{E}_{s \sim \mathcal{D}}$ — expectation over states from the offline dataset. CQL only touches states that were actually visited.
$\log \sum_a \exp Q(s,a)$ — a soft-max over all actions (log-sum-exp). Minimizing this pushes Q-values down everywhere in action space. It is largest when the Q-function assigns high values to many actions — penalizing overestimation broadly.
$\mathbb{E}_{a \sim \pi_\beta}[Q(s,a)]$ — the expected Q-value under the behavior policy $\pi_\beta$ (i.e., the policy that collected the data). Subtracting this term pulls Q-values up for actions seen in the dataset. Net effect: Q is suppressed on out-of-distribution (OOD) actions and maintained on in-distribution ones.
$\pi_\beta$ — the behavior policy: whatever policy generated the offline dataset. In practice, estimated by the empirical action distribution in the data.
In code: penalty = Q(s, random_actions).logsumexp(dim=1).mean() - Q(s, dataset_actions).mean(). The random_actions are sampled uniformly or from the current policy to approximate the log-sum-exp integral. Add alpha * penalty to the standard Bellman loss. Typical $\alpha$: 1.0–10.0. If the policy is too timid (refuses to act), lower $\alpha$. If Q-values diverge, raise it.
The first term pushes Q down everywhere, the second pulls it up on the data distribution. Net effect: Q is suppressed on OOD actions. Works; sometimes over-conservative.
IQL — Implicit Q-Learning
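Avoid evaluating Q at OOD actions entirely. Use expectile regression on the value function and fit the policy by behavior cloning of dataset actions, weighted by their advantage:
$$ \mathcal{L}_V = \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[L_2^\tau\big(Q(s,a) - V(s)\big)\right], \qquad L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\, u^2 $$
$$ \mathcal{L}_\pi = -\mathbb{E}_{(s,a) \sim \mathcal{D}}\left[\exp\big(\beta\,(Q(s,a) - V(s))\big) \log \pi(a \mid s)\right] $$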
$V(s)$ — the state value function. IQL's key idea: learn $V(s)$ using expectile regression instead of taking a max over actions. This avoids ever evaluating $Q$ on out-of-distribution actions.
$L_2^\tau(u)$ — the asymmetric (expectile) squared loss. When $u > 0$ (Q exceeds V), the weight is $\tau$; when $u < 0$ (Q below V), the weight is $1 - \tau$. With $\tau = 0.7$, positive residuals are weighted 2.3× more than negative ones — so $V(s)$ is pulled toward the upper quantiles of $Q(s, a)$, approximating the value of the best in-distribution actions.
$\tau$ — the expectile parameter. $\tau = 0.5$ gives the mean (standard MSE). $\tau = 0.7$ gives roughly the 70th percentile of Q-values. Higher $\tau$ extracts more "optimistic" policies from the data. Typical range: 0.7–0.9.
$\mathbb{1}(u < 0)$ — an indicator function that is 1 when $u < 0$ and 0 otherwise. Implements the asymmetric weighting.
$Q(s,a) - V(s)$ — the advantage. Positive advantage means action $a$ is better than the average action in state $s$.
$\exp(\beta(Q - V))$ — the advantage weight for policy extraction. Actions with high advantage get exponentially more weight. $\beta$ controls temperature: higher $\beta$ concentrates weight on the best actions. Typical $\beta = 3$–$10$.
$\log \pi(a \mid s)$ — log-probability under the learned policy. The loss is advantage-weighted behavior cloning: imitate the dataset, but imitate high-advantage actions more.
In code: The expectile loss is diff = q_val - v_pred; weight = torch.where(diff > 0, tau, 1 - tau); loss_v = (weight * diff.pow(2)).mean(). The policy extraction is advantage-weighted BC: adv = q_val - v_pred; weights = torch.exp(beta * adv); loss_pi = -(weights.detach() * log_pi).mean(). This is the modern default for offline RL — no log-sum-exp trick, no sampling random actions for the penalty, just asymmetric MSE and weighted imitation.
Expectile $\tau \approx 0.7$. The policy is extracted by advantage-weighted behavior cloning — never queries Q on OOD actions. Strong, simple, the modern default.
CQL: deriving the conservative penalty from first principles
The core problem: standard Q-learning computes $y = r + \gamma \max_{a'} Q(s', a')$. The $\max$ queries Q at whatever action maximizes it — which is almost certainly an action not in the dataset. The network's Q-value at that out-of-distribution action is pure extrapolation, and neural networks extrapolate badly. The result: Q is overestimated at OOD actions, the policy chases those phantom values, and the policy diverges.
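CQL's fix: add a penalty that pushes Q down everywhere, then add a counter-term that pushes Q up on actions actually in the data. The net effect: Q is conservative (under-estimated) at OOD actions and accurate at in-distribution actions. The full objective is
$$ \mathcal{L}(Q) = \alpha\, \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_a \exp Q(s,a) - \mathbb{E}_{a \sim \pi_\beta}[Q(s,a)]\right] + \mathcal{L}_{\text{Bellman}}(Q) $$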
The first term, $\log \sum_a \exp Q(s,a)$, is the log-sum-exp over all actions. Minimizing it pushes Q-values down on average across all actions. Actions with the highest Q get pushed down the most (because $\exp(Q)$ is largest for them). The second term, $\mathbb{E}_{a \sim \pi_\beta}[Q(s,a)]$, is the expected Q under the dataset's behavior policy. Maximizing it (subtracting it from the loss) pulls Q up for actions that actually appear in the data. The balance: OOD actions are suppressed, in-distribution actions are preserved.
The hyperparameter $\alpha$ controls how conservative the resulting Q-function is. Too high: the policy becomes overly cautious and never exploits high-value opportunities. Too low: the conservative regularizer is too weak and Q still overestimates at OOD actions. Typical range: $\alpha \in [1, 10]$, tuned on a validation set.
IQL: expectile regression from scratch
Standard regression minimizes $\mathbb{E}[(y - f(x))^2]$ — this targets the mean of $y$. But for policy extraction, we want the maximum of $Q(s,a)$ over in-distribution actions, not the mean. Taking a literal max would require evaluating Q at actions not in the dataset — the exact thing we want to avoid.
Expectile regression targets an upper expectile (the squared-loss analogue of a quantile) using an asymmetric squared loss:
$$ L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\, u^2, \qquad u = Q(s,a) - V(s) $$
When the residual $u = Q(s,a) - V(s)$ is positive (Q exceeds V, meaning this action is better than V predicts), the weight is $\tau$. When $u$ is negative, the weight is $1 - \tau$. With $\tau = 0.9$, positive residuals are weighted 9 times more than negative ones. The function V converges not to the mean of Q but to the 0.9-expectile, which behaves like a high quantile. As $\tau \to 1$ this approaches $\max_a Q(s,a)$ over dataset actions, without ever evaluating Q at an OOD action.
Worked example: expectile regression with 5 Q-values. State $s$ has 5 actions in the dataset with Q-values: $Q = [2.0, 4.5, 3.0, 7.0, 5.5]$. Current $V(s) = 4.0$. Expectile $\tau = 0.9$.
Residuals: $u = Q - V = [-2.0, +0.5, -1.0, +3.0, +1.5]$.
Weights:
$u_1 = -2.0 < 0$: weight $= 1 - \tau = 0.1$. Loss $= 0.1 \times 4.0 = 0.4$.
$u_2 = +0.5 \geq 0$: weight $= \tau = 0.9$. Loss $= 0.9 \times 0.25 = 0.225$.
$u_3 = -1.0 < 0$: weight $= 0.1$. Loss $= 0.1 \times 1.0 = 0.1$.
$u_4 = +3.0 \geq 0$: weight $= 0.9$. Loss $= 0.9 \times 9.0 = 8.1$.
$u_5 = +1.5 \geq 0$: weight $= 0.9$. Loss $= 0.9 \times 2.25 = 2.025$.
Total loss = 10.85. The gradient is dominated by $u_4 = +3.0$ (the best action), which contributes 75% of the loss. The gradient pushes $V(s)$ strongly upward toward $Q = 7.0$. At convergence, $V(s) = 6.0$: once $V$ lies between 5.5 and 7.0, the stationarity condition $0.9\,(7.0 - V) = 0.1\,(4V - 15)$ gives $V = 7.8/1.3 = 6.0$. That is close to the max (7.0) and well above the mean (4.4). This is the IQL trick: $V$ approximates the value of the best available action without ever querying Q at a new action.
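The numbers in this worked example are easy to verify. The sketch below computes the expectile loss at $V = 4.0$ and then finds the value that $V$ converges to, using a fixed-point iteration (the expectile is a weighted mean of the Q-values, with weights that depend on which side of $V$ each one falls). The function names are hypothetical.

```python
def expectile_loss(q_values, v, tau):
    """Sum of asymmetric squared residuals, as in the worked example."""
    total = 0.0
    for q in q_values:
        u = q - v
        weight = tau if u > 0 else 1.0 - tau
        total += weight * u * u
    return total

def expectile(q_values, tau, iters=200):
    """Fixed-point iteration for the tau-expectile.

    At the optimum, V equals the weighted mean of Q, with weight tau on
    values above V and (1 - tau) on values below it.
    """
    v = sum(q_values) / len(q_values)  # start from the plain mean
    for _ in range(iters):
        w = [tau if q > v else 1.0 - tau for q in q_values]
        v = sum(wi * qi for wi, qi in zip(w, q_values)) / sum(w)
    return v

q_values = [2.0, 4.5, 3.0, 7.0, 5.5]
loss_at_4 = expectile_loss(q_values, v=4.0, tau=0.9)
v_star = expectile(q_values, tau=0.9)
```

Here `loss_at_4` reproduces the total of 10.85, and `v_star` settles at 6.0: above every Q-value except the best one, and far above the mean of 4.4.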
When offline RL exceeds demonstration quality: trajectory stitching
This is the key advantage of value-based offline RL over behavior cloning. BC matches the average quality of your demonstrations. If your demos are 60% optimal, BC produces a 60%-optimal policy. Offline RL with a value function can exceed the best single demo by stitching together the best parts of different trajectories.
Concretely: suppose you have two demonstrations for a navigation task. Demo A takes the optimal path through room 1, then gets lost in room 2 (total return: 5). Demo B fumbles through room 1 but navigates room 2 perfectly (total return: 6). The optimal policy would take A's path through room 1 and B's path through room 2 (total return: 11). BC cannot do this — it copies either A or B. IQL can, because it learns per-state values: $V(s_{\text{room1}})$ reflects the best continuation from that state across all demos, and the advantage-weighted policy extraction picks the best action at each state independently.
The stitching requires coverage: the offline dataset must contain transitions near the stitching point. If demo A never visits a state close to where demo B enters room 2, the value function has no basis for connecting them. This is why dataset diversity matters more than dataset quality for offline RL — diverse mediocre data enables stitching, while a single expert trajectory does not.
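The two-room example can be made concrete with a four-transition dataset. This is a toy sketch under stated assumptions: the state and action names are invented, the per-room rewards (5 and 0 for demo A, 0 and 6 for demo B) are one split consistent with the totals in the text, and returns are undiscounted.

```python
# (state, action, reward, next_state); "end" is terminal.
demo_a = [("start", "a_room1", 5.0, "door"), ("door", "a_room2", 0.0, "end")]
demo_b = [("start", "b_room1", 0.0, "door"), ("door", "b_room2", 6.0, "end")]
dataset = demo_a + demo_b

def demo_return(demo):
    return sum(r for _, _, r, _ in demo)

# In-sample Q-iteration: the max ranges only over actions seen in the data,
# so no out-of-distribution action is ever evaluated.
q = {(s, a): 0.0 for s, a, _, _ in dataset}
for _ in range(10):
    for s, a, r, s_next in dataset:
        next_qs = [v for (s2, _), v in q.items() if s2 == s_next]
        q[(s, a)] = r + (max(next_qs) if next_qs else 0.0)

def greedy(state):
    candidates = {a: v for (s, a), v in q.items() if s == state}
    return max(candidates, key=candidates.get)

plan = [greedy("start"), greedy("door")]
stitched_return = q[("start", plan[0])]
```

The greedy plan takes A's action in room 1 and B's action in room 2, for a return of 11, which neither demonstration achieved. The stitch works only because both demos pass through the shared `door` state; remove that overlap and the value of B's room-2 behavior can never propagate back to A's room-1 action.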
AWAC, AWR — Advantage-Weighted regression
The general family: estimate advantage from data, do BC weighted by $\exp(\beta A)$. AWAC adds an explicit Q-function update; AWR doesn't, which makes AWR the simplest member of the family.
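A one-dimensional sketch shows what the $\exp(\beta A)$ weighting does to the regression target. The actions, advantages, and weight cap are made-up values for illustration. For a Gaussian policy with fixed variance, the weighted maximum-likelihood mean is just the weighted average of the dataset actions, which is what the last line computes.

```python
import math

def awr_weights(advantages, beta, max_weight=100.0):
    # Capping the weight is a common stabiliser; the cap value is an assumption.
    return [min(math.exp(beta * a), max_weight) for a in advantages]

actions = [-0.4, 0.1, 0.6]       # dataset actions in one state (toy values)
advantages = [-1.0, 0.0, 0.5]    # estimated advantage of each action
beta = 3.0

w = awr_weights(advantages, beta)
bc_mean = sum(actions) / len(actions)
awr_mean = sum(wi * ai for wi, ai in zip(w, actions)) / sum(w)
```

Plain BC regresses onto the unweighted mean (0.1). The advantage-weighted target sits close to 0.5, dominated by the best action but still a blend of actions that actually appear in the data.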
When offline RL helps over BC
Mixed-quality data. If your demonstrations include some failures or sub-optimal trajectories, BC trains on the average. Offline RL trains toward the best.
Reward-labeled play data. If you have task-agnostic interaction with reward labels, BC has nothing to imitate. Offline RL extracts a task-specific policy.
When offline RL doesn't help
If your dataset is uniformly expert demonstrations, BC matches offline RL and is simpler. If your dataset is small and narrow, offline RL is hard to tune and unreliable. The big breakthroughs in robot learning over the last three years were data, not offline RL.
Worked example: CQL penalty computation, step by step. You have a state $s$ and 5 actions in the offline dataset: $a \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$. The current Q-network assigns Q-values: $Q(s, a) = [2.0, 3.5, 4.0, 3.0, 1.5]$.
Step 1: Log-sum-exp over all actions. This is the "push all Q down" term:
$$ \log \sum_a \exp Q(s,a) = \log(\exp(2.0) + \exp(3.5) + \exp(4.0) + \exp(3.0) + \exp(1.5)) $$
$= \log(7.39 + 33.12 + 54.60 + 20.09 + 4.48) = \log(119.67) = 4.78$.
Step 2: Dataset mean Q-value. This is the "pull dataset Q up" term. Under the empirical behavior policy (uniform over the 5 dataset actions):
$$ \mathbb{E}_{a \sim \pi_\beta}[Q(s,a)] = \frac{2.0 + 3.5 + 4.0 + 3.0 + 1.5}{5} = 2.8 $$
Step 3: CQL penalty.
$$ \mathcal{L}_{\text{CQL}}(s) = 4.78 - 2.8 = 1.98 $$
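Interpretation: The penalty is positive (1.98), which means the loss will push Q-values down overall. It can never be negative: a log-sum-exp is at least the max, which is at least the mean. The log-sum-exp is dominated by the highest Q-values ($Q = 4.0$ contributes $\exp(4.0) = 54.6$, which is 46% of the sum), so the gradient pressure is strongest on the actions with the highest Q. In this toy case every action is in the dataset; in a real problem the log-sum-exp also ranges over actions outside the data, and the inflated Q-values that standard Q-learning produces there are exactly what this term pushes down.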
What happens during training: The CQL penalty is added to the standard Bellman loss with weight $\alpha$. With $\alpha = 5$: the total penalty contribution is $5 \times 1.98 = 9.9$. This is large compared to typical Bellman errors (1–3), so the Q-network is strongly incentivized to keep Q-values low everywhere except on actions that actually appear in the data. The result: if the policy tries to take an action not in $\{0.1, 0.3, 0.5, 0.7, 0.9\}$, the Q-value there has been actively suppressed, and the policy avoids it.
The over-conservatism problem: If $\alpha$ is too large, the Q-values are pushed so far down that even good in-distribution actions look bad. The policy becomes paralyzed — it refuses to take any action confidently. This is why CalQL (Calibrated CQL) was introduced: it sets a floor on Q-values based on the Monte Carlo return of the dataset trajectories, preventing Q from being pushed below the actual observed return.
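The three steps of the CQL worked example, replayed in plain Python with the same five Q-values and $\alpha = 5$. Variable names are chosen for this sketch.

```python
import math

q_values = [2.0, 3.5, 4.0, 3.0, 1.5]
alpha = 5.0

# Step 1: log-sum-exp over all actions (shifted by the max for stability).
m = max(q_values)
logsumexp = m + math.log(sum(math.exp(q - m) for q in q_values))

# Step 2: mean Q under the empirical behavior policy (uniform over the data).
dataset_mean = sum(q_values) / len(q_values)

# Step 3: the penalty, and its contribution once weighted by alpha.
penalty = logsumexp - dataset_mean
weighted_penalty = alpha * penalty

# Share of the exp-sum owned by the largest Q-value.
share_of_top = math.exp(m) / sum(math.exp(q) for q in q_values)
```

The values match the text to rounding: a log-sum-exp of about 4.78, a penalty of about 1.98, and the top Q-value owning roughly 46% of the exponentiated sum.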
Offline RL methods: head-to-head comparison
| Method | Data requirement | Compute | Stitching? | OOD safety | Best use case |
| --- | --- | --- | --- | --- | --- |
| BC | Expert demos only | Low (1 model, MSE loss) | No (copies demos) | No constraint | Uniformly expert data, simple tasks |
| CQL | Mixed-quality OK | Medium (Q-net + penalty) | Yes | Strong (conservative Q) | Mixed-quality data where you want safe exploitation |
| IQL | Mixed-quality OK | Medium (V + Q + π) | Yes | Moderate (implicit) | General offline RL; modern default for simplicity |
| CalQL | Mixed-quality OK | Medium-high | Yes | Strong (calibrated) | When CQL is too conservative; needs MC return estimates |
| TD3+BC | Mixed-quality OK | Low (TD3 + BC term) | Limited | Weak | Quick baseline; few hyperparameters |
| Decision Transformer | Return-labeled trajectories | High (Transformer) | No (sequence model) | No explicit constraint | When you want to condition on desired return |
The stitching column is the key differentiator. BC and Decision Transformer are trajectory-level methods: they reproduce entire demonstrated trajectories. CQL and IQL are state-level methods: they learn per-state values and can compose the best parts of different trajectories into a novel plan that exceeds any single demonstration. If your dataset contains diverse sub-optimal trajectories whose good segments could be combined into a better-than-demonstrated policy, CQL or IQL will find it; BC will not.
The offline RL practitioner's decision tree
Given a fixed dataset and no further robot access, which method should you use? The decision depends on three properties of your dataset:
Is the data uniformly expert? If yes, use BC. It is simpler, faster to train, and will match or slightly beat offline RL on uniformly expert data. Offline RL's advantage is in extracting value from mixed-quality data, which uniformly expert data does not have.
Does the data have reward labels? If no (only demonstration trajectories without per-step rewards), you must either assign rewards retroactively (hindsight relabeling, success/failure labels) or use BC. IQL requires $(s, a, r, s')$ tuples; it cannot train on unlabeled demos.
Does the data cover diverse behaviors? If the dataset contains multiple strategies for the same task (some fast and risky, some slow and careful), IQL can stitch together the best parts of each. If all demonstrations follow the same strategy, stitching has nothing to combine, and the advantage over BC vanishes.
The most common mistake: applying offline RL to a small, narrow, expert-only dataset and expecting it to outperform BC. It will not. Offline RL is a tool for extracting maximal value from large, diverse, mixed-quality datasets. On small expert datasets, BC is the right tool, and the additional complexity of CQL/IQL buys you nothing but tuning headaches.
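The three questions can be written down as a small function. This is one possible encoding of the prose above, not a rule from any library; the function name and return strings are hypothetical.

```python
def choose_method(uniformly_expert: bool, has_rewards: bool, diverse: bool) -> str:
    """Encode the three dataset questions, asked in order."""
    if uniformly_expert:
        # Nothing for offline RL to extract beyond what BC already copies.
        return "BC"
    if not has_rewards:
        # IQL and CQL need (s, a, r, s') tuples.
        return "BC (or relabel rewards first)"
    if diverse:
        # Mixed-quality, reward-labeled, multiple strategies: stitching pays off.
        return "IQL"
    # One strategy only: nothing to stitch, so the edge over BC vanishes.
    return "BC"

expert_case = choose_method(uniformly_expert=True, has_rewards=True, diverse=True)
play_case = choose_method(uniformly_expert=False, has_rewards=True, diverse=True)
```

Note how many branches end in BC: offline RL is the answer only when all three conditions line up.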
Worked example: IQL expectile regression. Suppose we have three data transitions from the same state $s$, with actions $a_1, a_2, a_3$ and Q-values $Q(s, a_1) = 2.0$, $Q(s, a_2) = 5.0$, $Q(s, a_3) = 8.0$. The current value estimate is $V(s) = 4.5$. The expectile loss with $\tau = 0.7$:
For $a_1$: $u = Q(s,a_1) - V(s) = 2.0 - 4.5 = -2.5$. Since $u < 0$: weight = $|\tau - 1| = 0.3$. Loss = $0.3 \times (-2.5)^2 = 1.875$.
For $a_2$: $u = 5.0 - 4.5 = 0.5$. Since $u \geq 0$: weight = $\tau = 0.7$. Loss = $0.7 \times 0.5^2 = 0.175$.
For $a_3$: $u = 8.0 - 4.5 = 3.5$. Since $u \geq 0$: weight = $0.7$. Loss = $0.7 \times 3.5^2 = 8.575$.
Total = 10.625. The gradient pushes $V(s)$ upward (toward the high-Q actions), because the $\tau = 0.7$ weighting penalizes under-estimation 2.3× more than over-estimation. At convergence, $V(s)$ sits at the 0.7-expectile of the Q-values, here $7.7/1.3 \approx 5.92$, which is above the mean (5.0) and biased toward the good actions.
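To see the upward push directly, the sketch below differentiates the expectile loss with respect to $V$ by hand and takes one gradient step. The learning rate is an arbitrary choice for illustration.

```python
tau = 0.7
q_values = [2.0, 5.0, 8.0]
v = 4.5

def weight(u, tau):
    # tau for non-negative residuals, (1 - tau) for negative ones.
    return tau if u >= 0 else 1.0 - tau

loss = sum(weight(q - v, tau) * (q - v) ** 2 for q in q_values)

# d/dV of w * (q - V)^2 is -2 * w * (q - V), treating w as constant.
grad_v = sum(-2.0 * weight(q - v, tau) * (q - v) for q in q_values)

lr = 0.1               # illustrative step size
v_next = v - lr * grad_v
```

The gradient is negative (about -4.1), so gradient descent raises $V$, here from 4.5 to about 4.91 in a single step.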
Reward design for offline RL in robotics
Offline RL requires reward labels, but most robot demonstration datasets do not have per-step rewards. Three practical approaches to retroactive labeling:
Binary success/failure. The simplest: $r = 1$ at the final step if the task was completed, $r = 0$ otherwise. All intermediate steps get $r = 0$. This creates an extremely sparse reward that offline RL can handle (IQL is designed for exactly this regime). The downside: no gradient signal for partial progress, so the value function does not distinguish "almost succeeded" from "never tried."
Distance-to-goal shaping. Compute the distance between the end-effector (or manipulated object) and the goal position at each timestep. $r_t = -\|p_t - p_{\text{goal}}\|$. This gives dense signal but requires knowing the goal position, which may not be available in all datasets. For language-conditioned tasks, "goal position" must be inferred from the instruction — an additional complexity.
Learned reward model. Train a classifier on (observation, language instruction) pairs to predict success probability. Use the classifier's logit as a dense reward: $r_t = \sigma^{-1}(P(\text{success} \mid o_t, \ell))$. This scales to diverse tasks but introduces reward model error, which offline RL can amplify if the Q-function exploits errors in the reward model.
The empirical finding: for robot manipulation with offline RL, binary sparse reward + IQL with high expectile ($\tau = 0.9$) is the simplest recipe that works reliably. Dense shaped rewards help convergence speed but require task-specific engineering that rarely generalizes across tasks.
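The three labeling schemes fit in a few lines each. This is a sketch under assumptions: the function names are hypothetical, positions are plain tuples, and the success probabilities would come from whatever classifier you trained, which is not shown.

```python
import math

def binary_reward(num_steps, success):
    """r = 1 on the final step of a successful episode, 0 everywhere else."""
    rewards = [0.0] * num_steps
    if success and num_steps > 0:
        rewards[-1] = 1.0
    return rewards

def distance_reward(positions, goal):
    """r_t = -||p_t - p_goal|| at every step."""
    return [-math.dist(p, goal) for p in positions]

def logit_reward(success_probs, eps=1e-6):
    """r_t = logit of the classifier's success probability."""
    out = []
    for p in success_probs:
        p = min(max(p, eps), 1.0 - eps)  # keep the logit finite
        out.append(math.log(p / (1.0 - p)))
    return out

sparse = binary_reward(4, success=True)
dense = distance_reward([(0.0, 0.0), (0.3, 0.4), (0.6, 0.8)], goal=(0.6, 0.8))
learned = logit_reward([0.5, 0.9])
```

The clamp in `logit_reward` matters in practice: a classifier that saturates at 0 or 1 would otherwise produce unbounded rewards, and those are the errors a Q-function is most likely to exploit.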
Offline RL hyperparameter sensitivity
Offline RL methods are notoriously sensitive to hyperparameters, and the right settings depend on the dataset. The key hyperparameters and their interaction with dataset properties:
| Hyperparameter | Method | Typical range | Sensitive to |
| --- | --- | --- | --- |
| $\alpha$ (CQL penalty weight) | CQL | 1.0–10.0 | Dataset coverage. Narrow data → higher $\alpha$. Broad data → lower $\alpha$. |
| $\tau$ (expectile) | IQL | 0.7–0.9 | Data quality. Expert data → $\tau = 0.7$. Mixed data → $\tau = 0.9$. |
| $\beta$ (AWR temperature) | IQL policy | 3.0–10.0 | Action space dimensionality. High-dim → lower $\beta$. |
| Discount $\gamma$ | All | 0.99–0.999 | Horizon length. Long tasks → $\gamma$ closer to 1. |
| Q-ensemble size | CQL, SAC | 2–10 | OOD severity. More diverse data → fewer critics needed. |
The practical recipe: start with IQL ($\tau = 0.7$, $\beta = 3.0$, $\gamma = 0.99$) and evaluate. If the policy is too conservative (refuses to attempt the task), increase $\tau$ toward 0.9. If the policy is too aggressive (attempts impossible actions), decrease $\tau$ toward 0.5 or switch to CQL. Tune on a validation set of held-out trajectories, not on real-robot deployment — offline RL hyperparameter sweeps on real hardware are prohibitively expensive.
19·5 RL as sequence modeling
What if reinforcement learning is just next-token prediction on trajectories?
Every RL method we have seen so far — PPO, SAC, CQL, IQL — learns a value function or a policy by solving the Bellman equation in some form. In 2021, three papers asked the same heretical question at nearly the same time: what if we skip the Bellman equation entirely and just train a transformer on trajectory data? The answer turned out to be surprisingly effective, and it created a new paradigm that connects offline RL directly to the language-modeling infrastructure.
Decision Transformer
Decision Transformer (Chen et al., 2021) reframes offline RL as conditional sequence generation. The core idea: a trajectory is a sequence of (return-to-go, state, action) tokens. Train a causal transformer to predict the next action, conditioned on the desired return-to-go. At inference time, set the return-to-go to a high value — and the model generates actions consistent with achieving that return. arXiv:2106.01345
In plain English: feed the desired return as a token — "I want total reward = 10" — and the model outputs actions that achieve it. The transformer has seen thousands of trajectories during training, some good (reward 10) and some bad (reward 2). By conditioning on the desired return, you tell the model "generate actions from the part of the distribution where things went well." It is a quality dial for behavior generation.
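Written out, the token sequence described above interleaves three tokens per timestep. This display is a reconstruction from the symbol definitions in this section, using $\tau$ for the trajectory as the text does later:

$$ \tau = \big(\hat{R}_1, s_1, a_1,\; \hat{R}_2, s_2, a_2,\; \ldots,\; \hat{R}_T, s_T, a_T\big) $$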
$\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ — the return-to-go at timestep $t$. This is the sum of all future rewards from $t$ onward. It is the "steering signal" — it tells the model how much total reward we want from this point forward. High $\hat{R}_t$ → the model generates actions characteristic of high-performing trajectories.
$s_t$ — the state at timestep $t$. In robotics, this is the observation (joint angles, images, proprioception). Each modality is embedded by a separate encoder and projected to the transformer's hidden dimension.
$a_t$ — the action at timestep $t$. This is the prediction target. The transformer outputs a distribution over actions, and we train with cross-entropy (discrete) or MSE (continuous).
$T$ — the episode length. The return-to-go decreases across the trajectory as rewards are collected. At $t = T$, $\hat{R}_T = r_T$.
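The return-to-go labels are computed once per trajectory as suffix sums of the rewards. A minimal sketch (the function name is hypothetical, and indices are 0-based in code where the text is 1-based):

```python
def returns_to_go(rewards):
    """R_hat[t] = r[t] + r[t+1] + ... + r[T-1], computed with one backward pass."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

rtg = returns_to_go([0.0, 0.0, 8.0, 2.0])
```

The last entry equals the final reward, matching $\hat{R}_T = r_T$, and the values never increase along the trajectory when rewards are non-negative.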
Each token type (return, state, action) gets its own learned embedding layer. Timestep information is added via a learned positional embedding shared across the three token types at each step. The transformer is causal — it only attends to tokens at or before the current position, exactly like a language model.
The training objective
DT is trained with supervised learning on offline trajectory data. For continuous action spaces (the standard in robotics):
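Assembled from the symbols defined just below, the continuous-action objective can be written as follows. The $1/T$ normalization is an assumption made here for readability; implementations usually average over whatever tokens fall inside the context window.

$$ \mathcal{L}_{\text{DT}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{D}}\left[\frac{1}{T}\sum_{t=1}^{T} \big\| a_t - f_\theta(\hat{R}_t, s_t, \tau_{<t}) \big\|_2^2\right] $$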
$f_\theta(\hat{R}_t, s_t, \tau_{<t})$ — the transformer's predicted action given the return-to-go, current state, and all prior context. The model outputs a deterministic action vector; the loss is MSE against the ground-truth action from the dataset.
$\tau_{<t}$ — all prior tokens in the trajectory: $(\hat{R}_1, s_1, a_1, \ldots, \hat{R}_{t-1}, s_{t-1}, a_{t-1})$. The context window is typically the last $K$ timesteps (e.g., $K = 20$), not the full trajectory.
$\tau \sim \mathcal{D}$ — a trajectory sampled from the offline dataset. DT trains on the same data as offline RL methods like CQL or IQL — the difference is that DT does not learn a Q-function or solve any Bellman equation.
In code: pred_actions = dt(returns_to_go, states, actions, timesteps); loss = F.mse_loss(pred_actions, gt_actions). At inference, set returns_to_go[0] = max_return and the model generates expert-quality actions. No value function, no policy gradient, no Bellman equation: just supervised learning on trajectory data with a conditioning signal. Training uses the same infrastructure as GPT fine-tuning.
For discrete action spaces (Atari), replace MSE with cross-entropy over action bins. The architecture is identical — only the output head and loss change.
Inference: steering with return-to-go
Worked example: DT inference on a pick-and-place task. The offline dataset contains trajectories with returns ranging from 0 (failure) to 10 (success). We want a successful policy.
Step 1. Set the initial return-to-go to the maximum: $\hat{R}_1 = 10$.
Step 2. Observe the current state $s_1 = [\text{joint angles, gripper open, block at (0.3, 0.2)}]$.
Step 3. Feed $(\hat{R}_1, s_1)$ into the transformer. It predicts $a_1 = [0.02, 0.01, -0.03, \ldots]$ (move toward the block).
Step 4. Execute $a_1$. Receive reward $r_1 = 0$ (no success yet). Update return-to-go: $\hat{R}_2 = \hat{R}_1 - r_1 = 10$.
Step 5. Feed $(\hat{R}_1, s_1, a_1, \hat{R}_2, s_2)$ into the transformer. It predicts $a_2$.
Step 6. Continue until the task completes or the episode times out.
The key insight: by setting $\hat{R}_1 = 10$, we tell the model "generate a trajectory that achieves total reward 10." The model has seen such trajectories in training and generates actions consistent with them. Setting $\hat{R}_1 = 5$ would generate a mediocre trajectory. This is return-conditioned policy extraction — no value function, no policy gradient, just conditional generation.
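The steps above amount to a short loop. The sketch below assumes a hypothetical `policy(rtgs, states, actions)` callable standing in for the trained transformer and a hypothetical `env_step(state, action)` returning `(next_state, reward, done)`; neither is a real library API. The toy policy and environment at the bottom exist only to make the example runnable.

```python
def dt_rollout(policy, env_step, s0, target_return, horizon, context_len=20):
    """Roll out a return-conditioned policy, decrementing return-to-go by observed reward."""
    rtgs, states, actions = [float(target_return)], [s0], []
    total = 0.0
    for _ in range(horizon):
        # The model sees the last `context_len` timesteps; the current step has no action yet.
        n_prev = min(len(actions), context_len - 1)
        a = policy(rtgs[-context_len:], states[-context_len:], actions[len(actions) - n_prev:])
        s_next, r, done = env_step(states[-1], a)
        actions.append(a)
        total += r
        if done:
            break
        rtgs.append(rtgs[-1] - r)   # R_hat_{t+1} = R_hat_t - r_t
        states.append(s_next)
    return actions, rtgs, total

# Toy stand-ins: a 1-D walk that pays reward 10 on reaching state 3.
def toy_policy(rtgs, states, actions):
    return 1

def toy_env(s, a):
    s_next = s + a
    return s_next, (10.0 if s_next == 3 else 0.0), s_next == 3

actions, rtgs, total = dt_rollout(toy_policy, toy_env, 0, 10.0, horizon=10)
```

Note that the only place reward enters is the return-to-go update; the policy itself is queried exactly as it was during training.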
Trajectory Transformer
Trajectory Transformer (Janner et al., 2021) takes the idea further. Instead of predicting only the next action, it discretizes the entire trajectory — states, actions, and rewards — into tokens and predicts everything autoregressively. At inference, it uses beam search to plan: sample multiple candidate trajectory continuations, evaluate them by their predicted returns, and execute the best one. arXiv:2106.02039
The discretization: each dimension of state, action, and reward is binned into $V$ discrete values (typically $V = 100$). A single timestep $(s_t, a_t, r_t)$ with $d_s$-dimensional state and $d_a$-dimensional action becomes $d_s + d_a + 1$ tokens. The vocabulary is $V \cdot (d_s + d_a + 1)$ tokens with unique IDs per dimension.
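A minimal sketch of this per-dimension tokenization, assuming uniform binning over known per-dimension ranges (the function name, the `lows`/`highs` arguments, and the choice of uniform rather than quantile bins are all illustrative assumptions):

```python
def tokenize_timestep(values, lows, highs, vocab_per_dim=100):
    """Map one flattened (s_t, a_t, r_t) vector to d_s + d_a + 1 token IDs."""
    tokens = []
    for dim, (x, lo, hi) in enumerate(zip(values, lows, highs)):
        frac = (x - lo) / (hi - lo)                    # position inside this dimension's range
        b = min(max(int(frac * vocab_per_dim), 0), vocab_per_dim - 1)  # bin index, clamped
        tokens.append(dim * vocab_per_dim + b)         # offset gives each dimension its own ID range
    return tokens

# Three dimensions, all scaled to [0, 1].
tokens = tokenize_timestep([0.0, 0.5, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

The offset `dim * vocab_per_dim` is what produces the $V \cdot (d_s + d_a + 1)$ vocabulary size: the same bin index in two different dimensions maps to two different token IDs.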
Gato: the generalist agent
Gato (Reed et al., 2022) is the logical endpoint: tokenize everything — Atari frames, robot proprioception, text, images — into a single sequence, and train one transformer on all of it. 1.2B parameters. 604 tasks across Atari, robotics, image captioning, and dialogue. arXiv:2205.06175
The tokenization scheme: continuous values (actions, proprioception) are mu-law encoded and discretized into 1024 bins. Images are split into 16×16 patches, and each patch is embedded by a ResNet block into the transformer's embedding space. Text is SentencePiece with a 32K vocabulary. A single context window might contain: [image tokens, proprioception tokens, action tokens, text tokens], and the model predicts the next token regardless of modality.
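The continuous-value path can be sketched as mu-law companding followed by uniform binning. The constants `mu=100` and `m=256` below are assumptions recalled from the Gato paper and should be checked against it before reuse; the function names are hypothetical.

```python
import math

def mu_law_encode(x, mu=100.0, m=256.0):
    """Compress x so that small magnitudes get more resolution than large ones."""
    return math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)

def to_bin(x, n_bins=1024):
    """Clip the companded value to [-1, 1] and map it to one of n_bins uniform bins."""
    y = min(max(mu_law_encode(x), -1.0), 1.0)
    return min(int((y + 1.0) / 2.0 * n_bins), n_bins - 1)

center = to_bin(0.0)
top = to_bin(256.0)
bottom = to_bin(-256.0)
```

Companding before binning spends most of the 1024 bins on small values, which is where joint velocities and torques tend to concentrate.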
Gato proved the architecture is universal. It did not prove that the architecture is competitive — on any single task, a specialist beats Gato. The contribution is the existence proof: a single set of weights can play Atari, control a robot arm, and hold a conversation. The question for 2026's VLAs is whether scale resolves the specialist gap.
Why this paradigm matters
Infrastructure reuse. The entire language-modeling stack — KV-cache, FlashAttention, tensor parallelism, speculative decoding — transfers directly. No custom RL training loop.
Scaling laws. If trajectory prediction obeys the same scaling laws as language, then throwing more compute at a bigger model should improve the policy. Early evidence is mixed but trending positive.
Pre-training. A trajectory transformer can be pre-trained on diverse multi-task data, then fine-tuned on a specific task with minimal data — the same recipe that works for LLMs.
Unified interface. Language instructions, goals, and rewards can all be tokens in the same sequence. No separate conditioning mechanism needed.
The stitching problem
DT's main limitation is trajectory stitching. Classical offline RL (CQL, IQL) can combine the beginning of one trajectory with the end of another to find a better policy than any single trajectory in the dataset. DT cannot — it generates trajectories that look like trajectories in the dataset. If the dataset contains no single trajectory that achieves the maximum reward, DT will not discover the optimal policy by composing partial trajectories.
Concretely: suppose the dataset has two trajectories for a navigation task. Trajectory A reaches a waypoint efficiently but then fails. Trajectory B starts poorly but finishes well. IQL can stitch the efficient start of A with the successful finish of B. DT will reproduce either A or B in full, because it was trained to imitate whole sequences. This is the fundamental trade-off: DT gives up compositional generalization in exchange for stable, simple training.
Worked example: DT inference step by step. Task: pick-and-place. Dataset returns range from 0 (failure) to 10 (perfect). We want the best possible behavior.
Step 0. Set desired return-to-go: $\hat{R}_1 = 10$ (we want a perfect trajectory).
Step 1. Observe $s_1 = [\text{joint angles, gripper open, block at (0.3, 0.2)}]$. Feed $(\hat{R}_1 = 10, s_1)$ into the transformer. Output: $a_1 = [0.02, 0.01, -0.03, 0.0, 0.0, 0.0, 0.0]$ (move toward block).
Step 2. Execute $a_1$. Receive reward $r_1 = 0$ (no success yet). Update: $\hat{R}_2 = \hat{R}_1 - r_1 = 10 - 0 = 10$.
Step 3. Observe $s_2$. Feed $(\hat{R}_1, s_1, a_1, \hat{R}_2 = 10, s_2)$. Output: $a_2$ (continue approaching).
Steps 4–15. The robot approaches, grasps, lifts. Each step: execute, observe reward, update return-to-go. At step 12, the robot places the block: $r_{12} = 8.0$. Now $\hat{R}_{13} = 10 - 0 - 0 - \ldots - 8.0 = 2.0$.
Step 16. The remaining return-to-go is 2.0. The model generates "finishing" actions (release gripper, retract arm) consistent with earning 2.0 more reward.
The key mechanism: the return-to-go acts as a "quality dial." Setting $\hat{R}_1 = 10$ tells the transformer "generate actions from the part of the training distribution where total reward was 10." Setting $\hat{R}_1 = 3$ would generate a mediocre trajectory. The model is never asked to maximize reward at inference. Observed rewards enter only through the return-to-go update, and the model simply conditions on the outcome that remains to be achieved.
Worked example: why DT cannot stitch. Two demonstrations for a block-stacking task:
Demo A: States $[s_1, s_2, s_3, s_4]$, actions $[a_1^A, a_2^A, a_3^A, a_4^A]$, return = 5. Good approach ($s_1 \to s_2$), sloppy grasp ($s_2 \to s_3$), failed placement ($s_3 \to s_4$).
Demo B: States $[s_1, s_5, s_6, s_7]$, actions $[a_1^B, a_5^B, a_6^B, a_7^B]$, return = 8. Slow approach ($s_1 \to s_5$), solid grasp ($s_5 \to s_6$), clean placement ($s_6 \to s_7$).
Optimal stitched policy: $s_1 \xrightarrow{a_1^A} s_2$ (good approach from A) $\to$ somehow transition to $s_6$ (solid grasp from B) $\to s_7$ (clean placement from B). Return could be 13.
Why DT fails: If we condition on $\hat{R}_1 = 13$, the transformer has never seen a trajectory with return 13 starting from $s_1$. It saw return=5 from $s_1 \to s_2 \to \ldots$ and return=8 from $s_1 \to s_5 \to \ldots$. It will either generate A-like or B-like actions, not a hybrid. The context window processes sequences as wholes; it has no mechanism for value-based per-state composition.
Why IQL succeeds: IQL learns $V(s_2) \approx \text{best continuation from } s_2$. If the dataset includes any transition near $(s_2, \cdot, s_6)$ or if value generalization connects them, IQL extracts $\pi(s_2) = a^*$ that leads toward $s_6$. The stitching happens through the value function, not through sequence matching.
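The contrast can be reproduced on a toy graph. The sketch below is a tabular stand-in for value-based stitching, not IQL itself: it runs undiscounted value iteration restricted to transitions present in the data. The per-transition rewards are hypothetical, chosen so that Demo A sums to 5 and Demo B to 8, and the bridging transition from $s_2$ to $s_6$ is the assumed connection mentioned above.

```python
# Hypothetical per-transition rewards: (state, next_state, reward).
demo_a = [("s1", "s2", 5.0), ("s2", "s3", 0.0), ("s3", "s4", 0.0)]
demo_b = [("s1", "s5", 0.0), ("s5", "s6", 3.0), ("s6", "s7", 5.0)]
bridge = [("s2", "s6", 3.0)]   # assumed extra transition linking the two demos

def best_return(transitions, start, sweeps=10):
    """Undiscounted value iteration over the transitions that appear in the dataset."""
    states = {s for s, _, _ in transitions} | {s2 for _, s2, _ in transitions}
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            outgoing = [r + v[s2] for s0, s2, r in transitions if s0 == s]
            if outgoing:
                v[s] = max(outgoing)
    return v[start]

best_single = max(sum(r for _, _, r in demo) for demo in (demo_a, demo_b))
stitched = best_return(demo_a + demo_b + bridge, "s1")
unbridged = best_return(demo_a + demo_b, "s1")
```

Imitating whole sequences tops out at the best single demonstration, while the Bellman backup finds the composite path once any connecting transition exists. Without the bridge, value-based methods have nothing to stitch either.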
Decision Transformer did not beat IQL on benchmarks. That was never the point. The point was proving that a policy can be a language model — and that the infrastructure, scaling laws, and pretraining recipes of LLMs transfer to RL. Every VLA that tokenizes actions is a descendant of this insight.
20 Hybrid: BC + RL
The polishing step that makes specialists out of generalists.
BC gives you a policy that does roughly the right thing. RL gives you a policy that does the right thing reliably.
RL fine-tuning of BC
Train a BC model, initialize an RL run with its weights. Two tricks:
KL constraint against the BC prior: add a $\mathrm{KL}(\pi_\theta \| \pi_{\text{BC}})$ regularizer.
Entropy clipping: bound the policy's stochasticity below the BC's.
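The KL regularizer has a closed form when both policies are diagonal Gaussians. That is an assumption of this sketch (a BC policy with a different output head would need a sampled estimate instead), and the function names and the `kl_weight` value are illustrative.

```python
import math

def kl_diag_gaussians(mu_p, std_p, mu_q, std_q):
    """KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp**2 + (mp - mq) ** 2) / (2.0 * sq**2) - 0.5
    return total

def regularized_loss(rl_loss, mu_pi, std_pi, mu_bc, std_bc, kl_weight=0.1):
    """RL loss plus a penalty for drifting away from the BC prior."""
    return rl_loss + kl_weight * kl_diag_gaussians(mu_pi, std_pi, mu_bc, std_bc)

same = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
loss = regularized_loss(1.0, [1.0], [1.0], [0.0], [1.0], kl_weight=0.1)
```

At initialization the two policies share weights, so the penalty starts at zero and grows only as RL pulls the policy away from the BC behavior.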
Residual RL
Freeze a BC base policy $\pi_{\text{BC}}$. Train a small RL "correction" policy $\pi_\Delta$ that outputs an action delta. The deployed action is $a = \pi_{\text{BC}}(o) + \pi_\Delta(o)$. The RL problem is much easier — the BC prior already does most of the task, and the correction lives in a small action-magnitude box.
The residual formulation in detail
In plain English: the BC policy does the heavy lifting — reaching, approaching, orienting. The RL agent adds a tiny correction on top, like a surgeon fine-tuning a robot arm that a nurse has already roughly positioned. The correction is bounded so the RL can never override the base behavior entirely.
The RL agent sees the observation $o_t$ and outputs the correction $\pi_\Delta(o_t)$ as its action. The environment receives the full action $a_t = \pi_{\text{BC}}(o_t) + \pi_\Delta(o_t)$, but the RL reward function evaluates the composite: $r'(s_t, \pi_{\text{BC}}(o_t) + \pi_\Delta(o_t))$. The RL agent only controls the delta — it cannot override the base policy, only nudge it. The bound $\|\pi_\Delta\| \leq \delta$ (typically $\delta \approx 0.01$ in end-effector space, or 10% of the base policy's action magnitude) prevents the correction from dominating.
Why this is easier than training RL from scratch: the BC policy already solves the gross motion problem. Reaching toward the right object, moving to the right area, orienting the gripper roughly correctly: all of this is handled. The RL correction only needs to learn the fine motion: the last 3 mm of insertion, the force modulation during contact, the precise timing of the gripper close. The action dimensionality is unchanged, but the bounded correction makes the search space far smaller and the reward signal far denser, so RL can converge in orders of magnitude fewer steps.
The initialization matters: $\pi_\Delta$ is initialized with near-zero weights (e.g., the final linear layer scaled by 0.01). At the start of training, the composite policy is essentially pure BC. As RL training progresses, the correction grows from zero, and the base policy's behavior is smoothly refined rather than disrupted.
Worked example: residual RL for PCB insertion. The BC base policy positions the peg above the hole with ~3mm accuracy. The RL residual learns the final 3mm of insertion, including contact-force modulation.
BC action: $\pi_{\text{BC}}(o) = [0.002, -0.001, -0.015, 0.0, 0.0, 0.0, 0.85]$ (slow descent, gripper mostly closed).
RL correction: $\pi_\Delta(o) = [0.0005, 0.001, -0.003, 0.002, -0.001, 0.0, 0.0]$ (small lateral + rotational adjustment).
Deployed action: $a = [0.0025, 0.0, -0.018, 0.002, -0.001, 0.0, 0.85]$.
The correction is bounded: $\|\pi_\Delta\| \leq \delta$ where $\delta = 0.01$ in EE space. This prevents the RL from overriding the base policy and causing unsafe behavior. The result: the BC base gets the robot to within 3mm, and the residual closes the last 3mm with learned compliance — something the BC alone couldn't learn from demonstrations that didn't have consistent force feedback.
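The composition and the bound from this worked example can be checked directly. This sketch assumes the bound is applied to the Euclidean norm of the whole 7-D correction (per-dimension clipping would be an equally plausible reading); the function names are hypothetical.

```python
import math

def clip_norm(delta, bound):
    """Scale delta down so its Euclidean norm does not exceed bound."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= bound or norm == 0.0:
        return list(delta)
    return [d * bound / norm for d in delta]

def residual_action(a_bc, a_delta, bound=0.01):
    """Deployed action: frozen BC output plus the bounded RL correction."""
    corr = clip_norm(a_delta, bound)
    return [b + c for b, c in zip(a_bc, corr)]

a_bc = [0.002, -0.001, -0.015, 0.0, 0.0, 0.0, 0.85]
a_delta = [0.0005, 0.001, -0.003, 0.002, -0.001, 0.0, 0.0]
a = residual_action(a_bc, a_delta)
delta_norm = math.sqrt(sum(d * d for d in a_delta))
```

Here the correction's norm is about 0.0039, inside the 0.01 bound, so it passes through unclipped and the sum matches the deployed action listed above.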
HIL-SERL: the 5-component recipe
Luo et al., 2024. The current state of the art for sample-efficient real-world RL on manipulation. HIL-SERL (Human-in-the-Loop Sample-Efficient RL) is not a single algorithm — it is a carefully assembled pipeline of five components, each essential:
Pre-trained vision encoder (frozen). ResNet-10 or R3M, pretrained on diverse manipulation data. The encoder converts 480x640 RGB images into a 512-dim feature vector. Frozen during RL training — this reduces the RL problem from "learn perception + control" to "learn control on a fixed representation."
Offline pretraining from demos. Collect 20–50 teleoperated demonstrations. Pretrain the Q-network and policy on this data using an offline RL objective (conservative Q-learning or simple BC + Q-regression). This gives the policy a reasonable starting behavior — it can attempt the task, even if imperfectly.
Online RL with SAC. Deploy the pretrained policy on the real robot and fine-tune with SAC. The Q-ensemble (2–10 critics) with high update-to-data ratio (UTD = 20) squeezes maximum learning from every real-world transition.
Human-in-the-loop interventions. A human operator watches the robot via camera feed. When the policy is about to fail (e.g., the gripper is about to drop the object, or the arm is heading toward a collision), the human presses a button and takes over via teleop. The human guides the robot through the difficult part, then releases control back to the policy.
Intervention data goes into the replay buffer. Both the autonomous data (successes and near-failures) and the human intervention data (corrections near failure states) are stored in the SAC replay buffer. The intervention data is crucial: it provides exactly the transitions the policy needs most — how to recover from states near failure. Without interventions, the policy would have to discover recovery behaviors through random exploration, which is dangerous and slow.
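Components 4 and 5 reduce to a small change in the data-collection loop: a human action, when present, overrides the policy, and every transition is stored either way. The sketch below is illustrative only. `policy`, `human`, and `env_step` are hypothetical callables, and the `intervened` flag is an assumption about what one might log, not a description of the published implementation.

```python
def collect_episode(policy, human, env_step, s0, horizon, buffer):
    """Run one episode; a non-None human action overrides the policy's action."""
    s = s0
    for _ in range(horizon):
        a_policy = policy(s)
        a_human = human(s)                 # None means "no intervention"
        intervened = a_human is not None
        a = a_human if intervened else a_policy
        s_next, r, done = env_step(s, a)
        buffer.append((s, a, r, s_next, done, intervened))  # both kinds of data are stored
        s = s_next
        if done:
            break
    return buffer

# Toy stand-ins: the policy goes the wrong way at state 2, and the human corrects it.
toy_policy = lambda s: -1 if s == 2 else 1
toy_human = lambda s: 1 if s == 2 else None
toy_env = lambda s, a: (s + a, 1.0 if s + a == 3 else 0.0, s + a == 3)

buffer = collect_episode(toy_policy, toy_human, toy_env, 0, horizon=10, buffer=[])
```

The stored action is the one actually executed, so the off-policy learner sees the human's recovery as ordinary experience.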
The key insight: human interventions are not just a safety mechanism. They are a data collection strategy. The interventions target exactly the states where the policy is weakest, providing corrective examples near failure boundaries. This is the data RL needs most: transitions at the edge of success and failure, where the Q-function's gradient is steepest.
The result: near-100% success on contact-rich tasks (PCB insertion, Jenga manipulation, connector insertion) within roughly 1 to 2.5 hours of real-world training. This is the only RL recipe in 2026 that is competitive with BC + lots of data on real robots.
When to use which hybrid approach
| Scenario | Recommended | Why |
| --- | --- | --- |
| BC is 75%+ and you have a sim | RL fine-tuning in sim | Cheap, safe, unlimited data. Add KL constraint against BC prior. |
| BC is 75%+ and you need real-world polish | Residual RL | Bounded corrections, safe deployment, fast convergence on fine motion. |
| BC is 50% and the task is contact-rich | HIL-SERL | Human interventions provide recovery data. High UTD compensates for small data. |
| No BC at all, only a simulator | PPO from scratch + sim-to-real | On-policy RL scales with parallel envs. DR + RMA for transfer. |
| VLA foundation model + task-specific deployment | DPPO / RL fine-tuning of generative policy | The VLA is the BC; RL refines it on the specific task and robot. |
| Limited real data, no sim, no robot access | Offline RL (IQL) | Extract the best policy from the fixed dataset without interaction. |
The residual RL initialization trick
The most common failure mode of residual RL is "the RL correction grows too large too fast, overrides the BC base, and the combined policy collapses." The fix is embarrassingly simple: initialize the RL correction network's final layer to output near-zero actions.
Concretely: set the final linear layer's weights to $\mathcal{N}(0, 0.01)$ and biases to 0. At the start of RL training, $\pi_\Delta(o) \approx \mathbf{0}$, so the total action is:
Residual RL with near-zero initialization
$$ a_t = \pi_{\text{BC}}(o_t) + \underbrace{\pi_\Delta(o_t)}_{\approx\, \mathbf{0}\text{ at init}} \approx \pi_{\text{BC}}(o_t) $$
$\pi_{\text{BC}}(o_t)$ — the frozen base policy output. This provides the gross motion: reaching, approaching, orienting. It is never updated by RL gradients.
$\pi_\Delta(o_t)$ — the RL correction network. A small MLP (2–3 layers, 256 hidden units) that takes the same observation as the BC policy. Its output is clipped: $\|\pi_\Delta\| \leq \delta$.
$\delta$ — the correction bound. Typically 5–10% of the BC action magnitude. For a 7-DOF robot with EE-delta actions in the range $[-0.05, 0.05]$ m/step, $\delta \approx 0.005$ m. This prevents the RL from overriding the base policy entirely.
The near-zero initialization means the RL agent starts by executing the BC policy almost exactly, and then gradually discovers which small corrections improve the reward. This is analogous to LoRA in LLM fine-tuning: start from the pretrained model and add small low-rank corrections that are initialized to contribute nothing. The RL agent never "forgets" the BC behavior because it never had to learn it in the first place; the BC policy is frozen and always contributes its full output.
Worked example: HIL-SERL for USB insertion. The task: insert a USB-A connector into a port. The connector must be oriented correctly (no flip) and aligned to ±0.5mm. The initial BC policy, trained on 30 teleoperated demonstrations, achieves 62% success.
Failure analysis of the BC policy (38 failures out of 100 trials):
• 18 failures: connector aligned but slightly too high/low (misses the port opening by ~1mm).
• 12 failures: connector oriented correctly but rotated ~3° around the insertion axis (catches on the shield).
• 8 failures: approach trajectory too fast, overshoots the pre-insertion waypoint.
HIL-SERL setup. A human operator watches the robot through a side-mounted camera. They hold a 6-DOF SpaceMouse. When the robot is about to fail, the human takes over and guides the connector into the port. The takeover typically lasts 2–4 seconds (the fine alignment phase). Both autonomous and intervention trajectories go into the SAC replay buffer.
Minute 0–5: The policy is mostly BC. Success rate: ~62%. The human intervenes on ~40% of trials. Each intervention generates 20–40 transitions at exactly the states where the policy struggles most (near the port opening, at contact). Replay buffer: ~500 transitions.
Minute 5–10: SAC with UTD=20 has already done 10,000 gradient updates on the replay buffer. The Q-ensemble (5 critics) is learning that fine lateral adjustments near the port yield high reward. Success rate climbs to ~75%. Human interventions drop to ~25% of trials.
Minute 10–15: The policy has learned the fine alignment motion from the intervention data. It now self-corrects when the connector catches on the shield (the most common failure mode). Success rate: ~87%. Human interventions: ~12% of trials.
Minute 15–20: The remaining failures are edge cases: unusual USB port orientations, connector wear. The human intervenes on the rare hard cases, providing the exact transitions needed. Final success rate: 94%. Total human interventions: ~50 over 20 minutes (~100 trials total, ~3 seconds per intervention, consistent with the 2–4 second takeovers described above).
The data budget: 20 minutes of real-robot time generated ~3,000 transitions (2,000 autonomous + 1,000 from interventions). SAC with UTD=20 performed ~60,000 gradient updates. The policy went from 62% to 94% — a 32-point improvement — in less time than it takes to collect 30 more teleoperation demonstrations.
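The budget arithmetic is worth writing down once, since the update count is simply transitions times the update-to-data ratio. All numbers below are taken from the worked example; the variable names are illustrative.

```python
utd = 20                          # gradient updates per environment transition
autonomous = 2000                 # transitions from autonomous rollouts
intervention = 1000               # transitions from human takeovers
transitions = autonomous + intervention

gradient_updates = transitions * utd
early_updates = 500 * utd         # after the first five minutes (~500 transitions)
improvement_points = 94 - 62      # success rate change, in percentage points
```

This is why a high UTD matters on real hardware: the robot produces only a few thousand transitions, and the learner reuses each one twenty times.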
The Physical Intelligence π0 → π0.5 → π0.7 recipe
Physical Intelligence's progression from π0 to π0.7 is the most complete public example of the BC → fine-tune → RL pipeline applied at scale. Each stage adds a specific capability:
Stage 1: π0 (foundation model). Pre-train a VLA with flow matching on a large-scale diverse dataset (cross-embodiment, multi-task). The flow matching objective generates continuous action trajectories rather than discrete tokens. The result is a generalist policy that can attempt hundreds of tasks on multiple robots, but does none of them reliably (~40–60% success on most tasks).
Stage 2: π0.5 (task-specific fine-tuning). Fine-tune π0 on 50–200 demonstrations of the target task using LoRA (rank 16–32, applied to the attention layers of the VLA). The fine-tuning takes 2–4 hours on 8 GPUs. LoRA preserves the foundation model's general knowledge while adapting the action distribution to the specific task. Success rate improves to 65–80% on the target task, with some degradation on non-fine-tuned tasks (the catastrophic forgetting is mild because LoRA modifies only ~2% of the parameters).
Stage 3: π0.7 (RL polish with the RL Token). This is the novel contribution. Add a special RL Token to the VLA's input context — a binary token that, when set to 1, activates an exploration mode. During RL fine-tuning:
The RL Token is set to 1. The VLA adds an entropy bonus to its action distribution, encouraging exploration around the fine-tuned behavior.
Online RL (PPO variant) runs for 30–60 minutes per task on the real robot. The reward is sparse (task success/failure) plus shaped sub-rewards (distance to goal, contact events).
Only the LoRA adapters and a small RL head are updated; the foundation model backbone remains frozen.
At deployment, the RL Token is set to 0, and the policy produces deterministic actions with the RL-refined LoRA weights.
The result: π0.7 achieves a 10–25 percentage-point improvement in success rate on dexterous tasks (laundry folding, object reorientation, connector insertion) over π0.5. The improvement is largest on tasks with contact-rich phases where the BC demonstrations were inherently inconsistent (humans demonstrate slightly different force profiles each time).
Why the RL Token works. The fundamental problem with RL fine-tuning of a foundation model is balancing exploration against catastrophic forgetting. Standard RL exploration (entropy bonus on the full action distribution) causes the VLA to produce random, incoherent actions that deviate wildly from the pretrained distribution. The RL Token provides a conditional exploration mechanism: when RL Token = 1, the model adds stochasticity only to the action expert's output, not to the VLM backbone's language and vision processing. The VLM still produces coherent scene understanding and task decomposition; only the low-level motor commands are perturbed. This is analogous to adding noise to the actor in SAC while keeping the critic deterministic — except here the "critic" is the VLM's scene understanding.
At deployment, RL Token = 0 and the action expert runs in its deterministic (mode-seeking) configuration. The RL-refined LoRA weights encode the improved policy; the token simply controls whether exploration noise is added during online learning.
RLHF for robots
Human preference labels over pairs of trajectories train a reward model; the reward model trains a policy with RL. The bottleneck is preference-label collection at scale; the technique is mature, the data isn't.
The hybrid landscape, visualized
| Method | BC data | RL interaction | Best for |
| --- | --- | --- | --- |
| Pure BC | Yes (expert) | None | When data is plentiful and expert-quality |
| BC + RL fine-tune | Yes (initialize) | On-policy (sim or real) | Closing the last 10–20% gap |
| Residual RL | Frozen base | Small correction only | When base is good but needs polish |
| HIL-SERL | Small seed | Real-world + human safety | Contact-rich tasks, production quality |
| Offline RL | Mixed-quality | None (fixed dataset) | When no further interaction is possible |
| VLA + RL Token | Foundation model | Online RL at deploy | Continual improvement post-deployment |
The pattern to notice: every successful hybrid method constrains the RL component. Residual RL bounds the correction magnitude. HIL-SERL adds a human safety net. KL-constrained fine-tuning penalizes deviation from the BC prior. The RL Token restricts exploration to the action expert while keeping the VLM backbone frozen. Unconstrained RL from scratch on a real robot is still impractical in 2026 — the search space is too large, the hardware too expensive, and the failure modes too dangerous. The hybrid recipe is: BC provides the prior, RL provides the polish, and the constraint prevents the polish from becoming sandpaper.
The open question for 2026–2027: how much RL budget does each task need? Contact-rich insertion tasks converge in 20 minutes of HIL-SERL. Open-world navigation may need hours. Deformable manipulation (cloth, rope) remains an open challenge for RL fine-tuning because the reward signal is ambiguous (what does "successfully folded" mean for a wrinkled towel?) and the physics are hard to simulate. The hybrid stack is mature; the reward engineering for complex tasks is not.
The meta-lesson of this section: RL is not competing with BC. RL is the second stage of a pipeline where BC is the first. The two techniques are complements, not substitutes. The field spent 2018–2023 debating "BC vs RL." The field in 2026 uses both, in sequence, constrained, on every task that matters. The only remaining debate is how much RL budget each task needs, not whether to use it at all.
If you have a working BC policy and you want it to be 95% rather than 75%, you do not need a new architecture. You need RL fine-tuning with a small budget of real-world interaction and either a human safety net or a calibrated simulator.
This is the closing argument of 2026's playbook.
20·5 DPPO & RL fine-tuning of generative policies
The missing piece: how to apply policy gradients when your policy generates actions through iterative denoising.
Section 20 showed the general recipe for RL fine-tuning of BC policies. But diffusion and flow-matching policies pose a unique problem: you cannot easily compute $\log \pi_\theta(a \mid o)$. The action is the output of a multi-step denoising chain, not a single forward pass through a network with a tractable density. Without $\log \pi$, the standard methods don't apply directly: PPO needs it for the importance ratio, and SAC needs it for the entropy term. This section covers the growing family of methods that solve this problem.
The core challenge
A Gaussian policy outputs $a \sim \mathcal{N}(\mu_\theta(o), \sigma^2)$; computing $\log \pi_\theta(a \mid o)$ is one line of code. A diffusion policy generates $a$ by iterating $K$ denoising steps from pure noise $a^{(K)} \sim \mathcal{N}(0, I)$ through $a^{(K-1)}, a^{(K-2)}, \ldots, a^{(0)}$. The final action $a = a^{(0)}$ depends on the initial noise, on the $K$ network evaluations, and, for stochastic samplers such as DDPM, on the fresh noise injected at every step. The marginal density $\pi_\theta(a \mid o) = \int p(a^{(K)}) \prod_{k} p_\theta(a^{(k-1)} \mid a^{(k)}, o)\, da^{(K:1)}$ is intractable: it requires marginalizing over all intermediate noisy samples.
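For reference, the tractable case looks like this. A minimal sketch for an isotropic Gaussian policy with a scalar standard deviation (the function name is hypothetical):

```python
import math

def gaussian_log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2 I) for an isotropic Gaussian policy."""
    d = len(a)
    sq_dist = sum((x - m) ** 2 for x, m in zip(a, mu))
    return -0.5 * sq_dist / sigma**2 - d * math.log(sigma) - 0.5 * d * math.log(2.0 * math.pi)

lp_mode = gaussian_log_prob([0.0], [0.0], 1.0)     # density at the mean of a unit Gaussian
lp_off = gaussian_log_prob([1.0], [0.0], 1.0)      # one standard deviation away
```

Everything PPO needs from the policy is this one number per sampled action. The methods in this section are ways of recovering an equivalent quantity for a denoising chain.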
DPPO: Diffusion Policy Policy Optimization
DPPO (Ren et al., 2024) resolves this by reframing the denoising chain as a multi-step MDP. Each denoising step is treated as one decision in an inner MDP. The "state" at an inner step is the current noisy action plus the observation $o$. The "action" is the next, less noisy sample that the denoiser produces from it. The reward is zero for all intermediate steps; the environment reward $r$ arrives only after the final denoised action $a^{(0)}$ is executed. arXiv:2409.00588
This reframing makes each individual denoising step a tractable Gaussian transition — and PPO can be applied to the chain.
The denoising chain as an inner MDP
Formally, DPPO defines:
Inner state $\tilde{s}_k = (a^{(k+1)}, o)$. The noisy action entering denoising step $k$, plus the observation. At the first step, $k = K-1$, this is the pure-noise sample $a^{(K)}$.
Inner action $\tilde{a}_k = a^{(k)}$. The less noisy sample that step $k$ produces, drawn from a Gaussian whose mean is computed from the noise prediction $\epsilon_\theta$.
Inner reward $\tilde{r}_k = 0$ for $k > 0$; $\tilde{r}_0 = r(o, a^{(0)})$. Reward only at the end.
Because each denoising transition is Gaussian (the mean comes from $\epsilon_\theta$, the variance from the noise schedule, optionally with added exploration noise), we can compute $\log \pi_\theta(\tilde{a}_k \mid \tilde{s}_k)$ in closed form at each denoising step. PPO's clipped surrogate objective applies to each step individually.
DPPO's modified PPO objective
In plain English: PPO but applied to each denoising step of the diffusion policy. The diffusion model takes 16 steps to go from pure noise to a clean action. DPPO treats each of those 16 steps as a separate decision point and applies PPO independently to each one. The environment reward only arrives at the end (after the clean action is executed), but GAE propagates the reward signal backward through all 16 steps so every denoising step gets a gradient.
DPPO objective over the denoising chain
$$ \mathcal{L}_{\text{DPPO}} = \sum_{k=0}^{K-1} \mathbb{E}_{\tilde{s}_k, \tilde{a}_k}\left[\min\!\Big(\tilde{r}_k(\theta)\,\hat{A}_k, \;\text{clip}\big(\tilde{r}_k(\theta), 1-\epsilon, 1+\epsilon\big)\,\hat{A}_k \Big)\right]$$
$k$ — the denoising step index, running from $K-1$ (noisiest) to $0$ (cleanest). Each step is treated as a separate "timestep" in the inner MDP.
$\tilde{r}_k(\theta) = \frac{\pi_\theta(\tilde{a}_k \mid \tilde{s}_k)}{\pi_{\theta_\text{old}}(\tilde{a}_k \mid \tilde{s}_k)}$ — the importance ratio at denoising step $k$. Same as standard PPO, but computed for the noise prediction at step $k$, not the final action.
$\hat{A}_k$ — the advantage at denoising step $k$. Because the only reward comes at $k = 0$, the advantage must be propagated backward through the chain. DPPO uses GAE computed over the inner MDP: $\hat{A}_k = \sum_{j=0}^{k} (\gamma \lambda)^j \delta_{k-j}$ where $\delta_k = \tilde{r}_k + \gamma V(\tilde{s}_{k-1}) - V(\tilde{s}_k)$.
$\epsilon$ — the PPO clip range, typically 0.2. Applied independently at each denoising step to prevent any single step from changing too much.
$V(\tilde{s}_k)$ — the inner value function. A learned critic that estimates the expected return from inner state $\tilde{s}_k$. Since all reward comes at $k = 0$, $V(\tilde{s}_k)$ estimates "how good is the partially-denoised action $a^{(k)}$?"
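To make the per-step term concrete, here is a minimal dependency-free sketch for one scalar inner transition. It assumes each denoising step is a Gaussian with a fixed standard deviation; the helper names `gaussian_logp` and `clipped_term` and all numbers are illustrative, not taken from the DPPO codebase.

```python
import math

def gaussian_logp(x, mean, std):
    """Log-density of a scalar Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for one inner transition (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)            # importance ratio at step k
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip(ratio, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)

# Hypothetical step: the old denoiser mean was 0.0, the updated mean is 0.3,
# and the sampled next-step action was 0.5 with a positive advantage.
logp_old = gaussian_logp(0.5, mean=0.0, std=0.5)
logp_new = gaussian_logp(0.5, mean=0.3, std=0.5)
term = clipped_term(logp_new, logp_old, advantage=1.0)
```

Here the ratio exceeds $1 + \epsilon$, so the clip caps the term and the step gets no further incentive to move in that direction. With a negative advantage the unclipped product is the smaller one and is kept.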
Computing advantages for intermediate denoising steps
The tricky part: only the final action receives reward. So how does the advantage propagate to step $k = K-1$ (the first denoising step from pure noise)?
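DPPO treats the denoising chain as a $K$-step episode with a single terminal reward. The value function $V(\tilde{s}_k)$ learns to predict the expected environment reward from inner state $k$. The advantage at step $k$ is:
$$ \hat{A}_k = \sum_{j=0}^{k} (\gamma_{\text{inner}} \lambda)^j \left[\tilde{r}_{k-j} + \gamma_{\text{inner}}\, V(\tilde{s}_{k-j-1}) - V(\tilde{s}_{k-j})\right], \qquad V(\tilde{s}_{-1}) \equiv 0 $$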
$\gamma_{\text{inner}}$ — the discount factor for the inner MDP. Typically set to 1.0 (no discounting within the chain), since the chain is short ($K = 10$–$16$ steps) and we want the reward signal to propagate fully.
$\lambda$ — the GAE parameter for the inner MDP. Controls bias-variance trade-off in advantage estimation. Typically 0.95, same as outer PPO.
$\tilde{r}_{k-j}$ — inner reward. Zero for all $k > 0$; equals the environment reward at $k = 0$.
In practice, the backward pass is cheap: the chain is only $K = 10$–$16$ steps, so GAE over the inner MDP is a simple loop.
DPPO advantage computation for the denoising chain
import torch

def compute_dppo_advantages(
    inner_values,   # inner_values[:, k] = V(s_k); k = 0 is the clean step   shape: (B, K)
    env_reward,     # r(o, a^(0))                                            shape: (B,)
    gamma=1.0,      # inner discount (usually 1.0)
    lam=0.95,       # GAE lambda
):
    B, K = inner_values.shape
    advantages = torch.zeros_like(inner_values)   # (B, K)
    zeros = inner_values.new_zeros(B)             # same device and dtype as the values
    last_gae = zeros
    # Walk backward in time through the denoising chain: k = 0, 1, ..., K-1.
    # k = 0 is the final (clean) step, the only one that receives reward.
    for k in range(K):
        if k == 0:
            # Terminal step: reward comes from environment
            inner_reward = env_reward             # (B,)
            next_value = zeros                    # no step after final action
        else:
            # Intermediate step: no reward
            inner_reward = zeros
            next_value = inner_values[:, k - 1]   # V(s_{k-1})
        delta = inner_reward + gamma * next_value - inner_values[:, k]
        last_gae = delta + gamma * lam * last_gae
        advantages[:, k] = last_gae
    return advantages   # (B, K), one advantage per denoising step
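A quick sanity check on the recursion, written as a dependency-free sketch for a single chain (the function name `inner_gae` and the critic values are made up for illustration). With $\gamma = \lambda = 1$ the sum telescopes, so each step's advantage should equal the terminal reward minus that step's value estimate.

```python
def inner_gae(values, reward, gamma=1.0, lam=0.95):
    """GAE over one denoising chain. values[k] = V(s_k); k = 0 is the clean step."""
    advantages = [0.0] * len(values)
    last_gae = 0.0
    for k in range(len(values)):
        r_k = reward if k == 0 else 0.0             # reward only at k = 0
        next_v = 0.0 if k == 0 else values[k - 1]   # V(s_{k-1}); zero past the end
        delta = r_k + gamma * next_v - values[k]
        last_gae = delta + gamma * lam * last_gae
        advantages[k] = last_gae
    return advantages

values = [0.9, 0.8, 0.7, 0.6]   # hypothetical critic outputs for K = 4
adv = inner_gae(values, reward=1.0, gamma=1.0, lam=1.0)
# With gamma = lam = 1, adv[k] equals reward - values[k] for every k.
```

Lowering $\lambda$ below 1 shrinks the contribution of the terminal reward at the noisiest steps and leans more on the critic, which is the usual bias-variance trade.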
REBEL: reward-conditioned diffusion
An alternative to modifying the RL objective: condition the diffusion model on the desired reward, analogous to classifier-free guidance but for RL returns. During training, add a reward embedding to the denoiser's conditioning input. At inference, set the reward conditioning to the maximum observed reward. The model generates high-reward actions without any policy gradient computation. This is the diffusion-policy analogue of Decision Transformer's return-to-go conditioning.
The advantage: no inner MDP, no modified PPO, no value function over denoising steps. The disadvantage: like Decision Transformer, it cannot stitch trajectories or extrapolate beyond the best behavior in the dataset.
CalQL + diffusion policies
Calibrated Q-Learning (Nakamoto et al., 2023) extends CQL with a calibration step that prevents excessive conservatism. When paired with a diffusion action head, the diffusion model serves as the policy $\pi$ in the actor-critic loop, and the Q-function provides gradients to update the denoiser. The key insight is that diffusion policies produce diverse action samples naturally — they are excellent proposal distributions for the log-sum-exp term in CQL's regularizer. arXiv:2303.05479
RLPD: RL with Prior Data
RLPD (Ball et al., 2023) is not specific to diffusion policies, but it is the default recipe for mixing offline demonstrations with online RL. The idea is simple: maintain a replay buffer that contains both online transitions (from the current policy interacting with the environment) and offline transitions (from the demonstration dataset). Sample mini-batches from both, with a fixed ratio (typically 50/50), and train SAC as usual. arXiv:2302.02948
RLPD works because SAC is off-policy — it can learn from any data regardless of which policy collected it. The demonstrations provide a warm start (the policy sees successful behavior immediately), and the online data provides coverage of the states the policy actually visits. The 50/50 ratio is surprisingly robust; most practitioners do not need to tune it.
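The fixed-ratio sampling described above can be sketched in a few lines. This is only the batch-mixing step, not an RLPD implementation; `sample_mixed_batch` and the tagged tuples are hypothetical names for illustration.

```python
import random

def sample_mixed_batch(online_buffer, offline_buffer, batch_size,
                       offline_frac=0.5, rng=random):
    """Draw a mini-batch with a fixed share of offline (demo) transitions."""
    n_offline = int(round(batch_size * offline_frac))
    n_online = batch_size - n_offline
    # Sample with replacement so a small online buffer early in training still works.
    batch = [rng.choice(offline_buffer) for _ in range(n_offline)]
    batch += [rng.choice(online_buffer) for _ in range(n_online)]
    rng.shuffle(batch)
    return batch

# Hypothetical transitions tagged by source, just to show the split.
demos = [("demo", i) for i in range(1000)]
online = [("online", i) for i in range(10)]
batch = sample_mixed_batch(online, demos, batch_size=256, rng=random.Random(0))
```

Because the share is fixed per batch rather than per buffer, the demonstrations keep half the gradient signal even after the online buffer grows far larger than the demo set.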
The Physical Intelligence recipe: RL Token
The $\pi_{0.5} \to \pi_{0.7}$ progression from Physical Intelligence reveals the production recipe for RL fine-tuning of generative VLA policies. The mechanism: add a special RL Token to the VLA's vocabulary. When the token is present in the input, the model enters "RL mode": the action head is trained with online RL (environment interaction + reward signal) rather than BC. When the token is absent, the model behaves as a standard BC policy.
The RL Token mechanism enables fast online polishing without catastrophic forgetting of the BC prior. The BC data remains in the training mix (the RL Token is absent for those examples), so the model simultaneously learns from demonstrations and from its own experience. Think of it as residual RL (Section 20) but implemented at the token level inside the VLA rather than as a separate correction network.
When to use which
| Method | Policy type | Data regime | Best for |
|---|---|---|---|
| DPPO | Diffusion / flow | Sim rollouts (on-policy) | Sim-trained policies that need RL polish |
| Residual RL | Any (frozen base + correction) | Real-world online | When base is good; need small correction |
| REBEL / reward-conditioned | Diffusion / flow | Offline + reward labels | No online interaction available; have rewards |
| CalQL + diffusion | Diffusion / flow | Offline (fixed dataset) | Large offline datasets with mixed quality |
| RLPD | Any (SAC-based) | Online + offline demos | Real-world with prior demos; sample-efficient |
| RL Token (PI recipe) | VLA with generative head | Online (post-deployment) | Foundation VLAs; continual improvement |
Worked example: DPPO on a sim-trained Diffusion Policy. You have a Diffusion Policy trained via BC on 200 demonstrations for a peg-insertion task. Success rate: 72%. You want to push it to 95% using RL in simulation.
Setup. The policy uses $K = 16$ DDIM denoising steps, action dimension $D = 7$ (relative EE pose + gripper). The environment reward is sparse: $r = 1$ on successful insertion, $r = 0$ otherwise.
Inner MDP. Each denoising step $k = 15, 14, \ldots, 0$ is a "timestep." The inner state is $(a^{(k)}, o)$. The inner action is the noise prediction $\epsilon_\theta(a^{(k)}, k, o)$. You train a small inner value function $V_\psi(\tilde{s}_k)$ with 2 hidden layers.
Rollout. Collect 256 environment episodes. For each episode, record the full denoising chain: 16 inner states, 16 noise predictions, and the terminal reward. This gives $256 \times 16 = 4096$ inner transitions per batch.
PPO update. Compute GAE advantages over the inner MDP (the code above). Run 4 PPO epochs with clip $\epsilon = 0.2$. Update both the denoiser $\theta$ and the inner critic $\psi$.
Result. After 500 outer iterations (~128K episodes), success rate climbs from 72% to 94%. The denoiser has learned to slightly adjust its noise predictions at steps $k = 3$–$5$ (the final refinement steps) to produce actions that are more precisely aligned with the peg hole. The early denoising steps ($k = 15$–$8$) barely change — the coarse trajectory was already correct from BC.
The field spent 2023–2024 building diffusion and flow-matching policies. It is now spending 2025–2026 figuring out how to RL-fine-tune them. DPPO cracked the theoretical barrier; RLPD and the RL Token cracked the practical one. If your BC policy plateaus at 80%, the answer is not more data — it is a few hundred episodes of RL on top of the denoising chain.
21 Loss compendium
Every loss in the modern stack, named, derived, and placed in its bestiary.
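The first three entries that follow (the smooth default, the median-seeking loss, and the banded hybrid) read as the standard L2, L1 and Huber regression losses. Assuming those are the intended forms, and writing the residual as $e_i = y_i - f_\theta(x_i)$ (the symbols $f_\theta$, $e_i$ and $N$ are introduced here for illustration):

$$ \mathcal{L}_{\text{L2}} = \frac{1}{N}\sum_{i=1}^{N} \lVert e_i \rVert_2^2 $$
$$ \mathcal{L}_{\text{L1}} = \frac{1}{N}\sum_{i=1}^{N} \lVert e_i \rVert_1 $$
$$ \mathcal{L}_{\delta}(e) = \begin{cases} \tfrac{1}{2} e^2 & |e| \le \delta \\ \delta\left(|e| - \tfrac{1}{2}\delta\right) & |e| > \delta \end{cases} $$

The Huber form is applied per component of $e_i$ and averaged in the same way as the other two.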
The default. Smooth gradients, well-conditioned. The minimizer is $\mathbb{E}[y \mid x]$ — which is exactly the failure mode for multimodal $y$. Use when the conditional distribution is unimodal or when you've already factored out multimodality with another mechanism (e.g., the noise input to a diffusion model).
The minimizer is the conditional median, which is more robust to label noise. Used in ACT and other policies where teleoperation produces small jittery labels. Slower-converging gradients near zero (the gradient is constant in magnitude), but the resulting model is less prone to over-smoothing fine motions.
L2 inside a band, L1 outside. Robust to outliers without sacrificing convergence near zero. Standard for Q-function regression in DQN and its descendants. The parameter $\delta$ controls the transition: $\delta = 1$ is the standard choice; smaller $\delta$ is more robust but slower to converge.
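The three definitions that follow belong to a mixture-density (Gaussian mixture) negative log-likelihood. A reconstruction assuming the standard form, with $K$ mixture components and the weights written as functions of the input:

$$ \mathcal{L}_{\text{MDN}} = -\sum_i \log \sum_{k=1}^{K} w_k(x_i)\, \mathcal{N}\big(y_i;\, \mu_k(x_i), \Sigma_k(x_i)\big) $$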
$w_k$ — the mixture weight for component $k$. Predicted by the network, satisfying $\sum_k w_k = 1$. Represents the probability that the data was generated by component $k$.
$\mu_k(x_i)$, $\Sigma_k(x_i)$ — the predicted mean and covariance of the $k$-th Gaussian, conditioned on input $x_i$. Covariance is often diagonal for tractability.
$\mathcal{N}(y_i; \mu_k, \Sigma_k)$ — the Gaussian density evaluated at target $y_i$. With $K = 1$, this reduces to a single Gaussian NLL (equivalent to MSE up to constants).
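The next three definitions describe the categorical cross-entropy used for binned actions. Written out in the standard form those symbols imply:

$$ \mathcal{L}_{\text{CE}} = -\sum_i \sum_c y_{ic} \log p_\theta(c \mid x_i) $$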
$y_{ic}$ — the one-hot ground-truth label: 1 if sample $i$ belongs to class $c$, 0 otherwise. For action bins, $c$ is the bin index (0–255).
$p_\theta(c \mid x_i)$ — the predicted probability of class $c$, typically from a softmax over logits.
The inner sum over $c$ collapses to a single term (the true class), so in practice: $\mathcal{L} = -\sum_i \log p_\theta(c_i^* \mid x_i)$. Minimize = maximize probability assigned to the correct bin.
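The next three definitions match the noise-prediction objective used to train diffusion models. Assuming the usual simplified form, with the noise level $k$ sampled uniformly during training:

$$ \mathcal{L}_{\text{diff}} = \mathbb{E}_{x_0,\, \epsilon,\, k}\Big[\big\lVert \epsilon - \epsilon_\theta(x_k, k) \big\rVert^2\Big] $$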
$x_0$ — a clean sample from the data (action chunk or image). $x_k = \sqrt{\bar\alpha_k}\, x_0 + \sqrt{1-\bar\alpha_k}\,\epsilon$ is its noised version at level $k$.
$\epsilon \sim \mathcal{N}(0, I)$ — the true injected noise. The network's job is to predict exactly this noise vector.
$\epsilon_\theta(x_k, k)$ — the denoiser's noise estimate given the noisy input and the noise level $k$. Observation conditioning $o$ is omitted here for brevity.
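The four definitions that follow are the terms of a $\beta$-weighted VAE objective (the negative ELBO when $\beta = 1$). Assuming the standard form:

$$ \mathcal{L}_{\beta\text{-VAE}} = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \beta\, D_{\text{KL}}\big(q_\phi(z \mid x)\,\Vert\, p(z)\big) $$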
$-\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)]$ — the reconstruction loss. How well the decoder recovers the input $x$ from the latent $z$. For Gaussian likelihood this becomes MSE; for Laplace likelihood, L1. Minimizing this = good reconstruction.
$q_\phi(z \mid x)$ — the encoder (variational posterior). Maps input $x$ to a distribution over latent codes $z$. Typically Gaussian: $q_\phi = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x) I)$.
$p(z)$ — the prior over $z$, usually $\mathcal{N}(0, I)$. The KL term keeps the encoder close to this simple distribution.
$\beta$ — the KL weight. $\beta = 1$ gives the standard ELBO. $\beta > 1$ (as in ACT's $\beta = 10$) trades reconstruction quality for a more regular latent space. Higher $\beta$ = more "disentangled" but potentially blurrier outputs.
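The remaining definitions in this list describe an InfoNCE-style contrastive loss over energies. A reconstruction assuming the standard form, with lower energy meaning a better match:

$$ \mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\big(-E_\theta(x, y_+)\big)}{\exp\big(-E_\theta(x, y_+)\big) + \sum_{y_- \in \mathcal{Y}^-} \exp\big(-E_\theta(x, y_-)\big)} $$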
$E_\theta(x, y)$ — the energy function. A scalar-valued network that takes an input pair $(x, y)$ (e.g., observation and action) and outputs a "compatibility score." Lower energy = better match.
$y_+$ — the positive sample: the correct action (or matching pair) from the dataset.
$\mathcal{Y}^-$ — the negative samples: a set of $K$ incorrect actions sampled from a proposal distribution. More negatives = tighter bound on mutual information. Typical $K$: 128–1024.
The fraction inside the $\log$ is a softmax over energies: the probability of correctly identifying the positive among all candidates. Minimizing the loss = making the positive's energy much lower than the negatives'.
Minimize energy on positive pairs and maximize it on a sampled set of negatives. Powers Implicit BC, CLIP, R3M, and many self-supervised vision objectives.
PyTorch
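import torch
import torch.nn.functional as F

# InfoNCE loss (simplified, e.g. for Implicit BC)
# obs: (B, obs_dim), action_pos: (B, act_dim), action_neg: (B, K, act_dim)
B, K = action_neg.shape[:2]
e_pos = energy_net(obs, action_pos)                      # (B,)
# Repeat each observation K times so it pairs with every negative
obs_rep = obs.unsqueeze(1).expand(-1, K, -1)             # (B, K, obs_dim)
e_neg = energy_net(obs_rep, action_neg)                  # (B, K)
logits = torch.cat([e_pos.unsqueeze(1), e_neg], dim=1)   # (B, 1+K)
labels = torch.zeros(B, dtype=torch.long, device=logits.device)  # positive is index 0
# Negate the energies so that low energy maps to high probability
loss = F.cross_entropy(-logits, labels)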
A new loss is rarely the answer. Almost every "novel objective" in robot learning over the last five years is a re-weighting, re-conditioning, or regularization of one of these. Memorize the shapes. The rest is engineering.
22 Training recipes
The unwritten parts of the README that decide whether your run converges.
Optimizer
AdamW with $\beta_1 = 0.9, \beta_2 = 0.95$ for transformers, $\beta_2 = 0.999$ for everything else. Weight decay $\sim 0.05$. Gradient clipping at global norm $1.0$.
Schedule
Linear warmup over the first 1000–5000 steps, then cosine decay to 10% of peak LR over the rest of training. Peak LR depends on architecture: $1\!\times\!10^{-4}$ for from-scratch transformers, $3\!\times\!10^{-5}$ for VLA fine-tuning, $5\!\times\!10^{-4}$ for ResNet-scale BC. Skip warmup and you eat a loss spike in the first hundred steps that the model never fully recovers from.
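The schedule above can be written as a small pure function. This is a sketch under the stated settings (linear warmup, cosine decay to 10% of peak); the name `lr_at` and its defaults are illustrative rather than taken from any particular training codebase.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_steps=1000, final_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay to final_frac * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0
    floor = final_frac * peak_lr
    return floor + (peak_lr - floor) * cosine

# Sample the schedule at a few points of a 100k-step run.
schedule = [lr_at(s, total_steps=100_000) for s in (0, 999, 50_500, 100_000)]
```

In a PyTorch loop the same function can be passed (divided by the peak) to `torch.optim.lr_scheduler.LambdaLR`, which expects a multiplier on the optimizer's base learning rate.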
EMA
Maintain a shadow copy of model weights, updated as $\theta_{\text{EMA}} \leftarrow \tau \theta_{\text{EMA}} + (1-\tau) \theta$ at every step. Use the EMA copy at evaluation. Critical for diffusion ($\tau = 0.9999$) and flow matching policies; helpful for everything else. The intuition is that the loss surface has high-frequency noise that the EMA averages out, and the resulting weights generalize better than any single training step's.
PyTorch
import torch

# EMA update (called every training step)
@torch.no_grad()
def update_ema(ema_model, model, tau=0.9999):
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(tau).add_(p, alpha=1 - tau)

# At inference, use ema_model, not model
actions = ema_model(obs)
Worked example: EMA warmup. Consider a single weight that starts at $\theta_0 = 0$ and follows a noisy gradient path: $\theta_1 = 0.5$, $\theta_2 = 0.3$, $\theta_3 = 0.7$, $\theta_4 = 0.4$. With $\tau = 0.99$:
$\theta_1^{\text{EMA}} = 0.99 \times 0 + 0.01 \times 0.5 = 0.005$.
$\theta_2^{\text{EMA}} = 0.99 \times 0.005 + 0.01 \times 0.3 = 0.008$.
$\theta_3^{\text{EMA}} = 0.99 \times 0.008 + 0.01 \times 0.7 = 0.015$.
$\theta_4^{\text{EMA}} = 0.99 \times 0.015 + 0.01 \times 0.4 = 0.019$.
The EMA barely moves — it takes $\sim 1/(1-\tau) = 100$ steps to "warm up" to the current weight range. For $\tau = 0.9999$ (typical for diffusion), it takes ~10,000 steps. This is why training runs shorter than 10k steps shouldn't use $\tau = 0.9999$ — the EMA copy never reaches the trained weights. Use $\tau = 0.995$ for short runs.
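The four updates in the worked example can be replayed with a short script. It is a sketch for a single scalar weight; `ema_trace` is a hypothetical helper name.

```python
def ema_trace(weights, tau, init=0.0):
    """Return the EMA value after each update, starting from init."""
    ema, trace = init, []
    for w in weights:
        ema = tau * ema + (1.0 - tau) * w
        trace.append(ema)
    return trace

trace = ema_trace([0.5, 0.3, 0.7, 0.4], tau=0.99)
rounded = [round(v, 3) for v in trace]   # [0.005, 0.008, 0.015, 0.019]
```

Rerunning it with `tau=0.9999` shows how much more slowly the average leaves its starting value, which is the practical argument for a smaller $\tau$ on short runs.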
Mixed precision
BF16 weights and activations on Hopper / Ada hardware; FP32 for the optimizer state, the loss, and any normalization statistics. BF16 halves the bytes per activation relative to FP32, and the ~2× memory and ~2× speed savings are too large to leave on the table. Watch for numerical issues in attention softmax and in any explicit $\log$; keep those in FP32.
The difference between "Diffusion Policy" and "Diffusion Policy that works" is a page of hyperparameters that no paper prints in the main text. Here they are.
Learning rate: peak $1 \times 10^{-4}$, cosine decay to $1 \times 10^{-6}$, 1000-step linear warmup.
Batch size: 256 (single GPU), 1024 (multi-GPU).
EMA decay: $\tau = 0.9999$. Use EMA model for evaluation, never the training model.
Training duration: 600 epochs for 100 demos, 100–150 epochs for 1000 demos. More data = fewer epochs (the model sees more unique transitions per epoch).
Noise schedule: 100 diffusion steps during training, 16 DDIM steps at inference.
VLA fine-tuning (7B-class backbone):
Learning rate: $2 \times 10^{-5}$ with cosine decay. Never exceed $5 \times 10^{-5}$ or the pretrained vision encoder destabilizes.
LoRA rank: 32 (if using LoRA). Full fine-tune is better when you have enough data (500+ demos) and compute.
Batch size: 64–128 (limited by the 7B model's memory footprint).
Training duration: 50–100 epochs. VLAs converge fast because the backbone is pretrained.
Freeze vision encoder for the first N steps: common practice. Unfreeze after 10–20% of training.
Training config dataclass — Diffusion Policy
from dataclasses import dataclass

@dataclass
class DiffusionPolicyConfig:
    # Architecture
    obs_dim: int = 512            # ResNet encoder output
    act_dim: int = 7              # 6-DOF EE + gripper
    horizon: int = 16             # prediction chunk H
    n_obs_steps: int = 2          # observation history
    n_action_steps: int = 8       # executed per chunk (K)
    # Diffusion
    num_train_steps: int = 100    # diffusion steps (training)
    num_infer_steps: int = 16     # DDIM steps (inference)
    beta_start: float = 0.0001
    beta_end: float = 0.02
    beta_schedule: str = "squaredcos_cap_v2"
    # Optimizer
    lr: float = 1e-4
    lr_min: float = 1e-6
    weight_decay: float = 0.05
    betas: tuple = (0.9, 0.95)
    warmup_steps: int = 1000
    grad_clip: float = 1.0
    # Training
    batch_size: int = 256
    num_epochs: int = 600         # for 100 demos; 100 for 1000
    ema_decay: float = 0.9999
    mixed_precision: str = "bf16"
Batch construction
Episodes must not be split arbitrarily. Sample chunks within episodes only.
Rebalance multi-task data. Square-root or temperature-weighted sampling per task.
Mix camera views and embodiments within each batch.
Data loading details that matter
Episode-aware sampling. When creating training chunks of size $H$, never cut across episode boundaries. If an episode has 300 steps and $H = 16$, you get valid start indices 0 through 284. Starting at index 290 would cross into the next episode (or into padding), which corrupts the temporal structure. The dataloader must know episode boundaries.
Class-balanced sampling for multi-task. If task A has 10,000 demos and task B has 100, uniform sampling means the policy sees task A 100x more often. Use temperature-weighted sampling: sample each task with probability $p_i \propto n_i^{1/T}$ where $n_i$ is the number of demos and $T$ is the temperature. $T = 2$ (square-root scaling) is the standard. This gives task A ~10x more samples than task B, not 100x.
Chunk-boundary augmentation. For diffusion policies, randomize the chunk start index within each episode. Do not always start chunks at the same positions (e.g., indices 0, 16, 32, ...). Random starts prevent the policy from memorizing temporal position within the episode.
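The first two rules above reduce to a few lines of indexing and arithmetic. The sketch below is dependency-free; `valid_chunk_starts` and `task_sampling_probs` are hypothetical helper names, and the numbers are the ones used in the text.

```python
def valid_chunk_starts(episode_lengths, horizon):
    """List (episode_idx, start) pairs whose chunk stays inside the episode."""
    starts = []
    for ep, length in enumerate(episode_lengths):
        for s in range(length - horizon + 1):   # last valid start: length - horizon
            starts.append((ep, s))
    return starts

def task_sampling_probs(demo_counts, temperature=2.0):
    """p_i proportional to n_i ** (1 / T); T = 2 is square-root scaling."""
    weights = [n ** (1.0 / temperature) for n in demo_counts]
    total = sum(weights)
    return [w / total for w in weights]

starts = valid_chunk_starts([300], horizon=16)   # starts 0 through 284
probs = task_sampling_probs([10_000, 100])       # task A vs. task B
```

Drawing uniformly from `starts` (for example with `random.choice`) also gives the randomized chunk offsets that the third rule asks for, since every in-episode start index is a candidate.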
Augmentation, repeated
For pixel inputs, random shift + color jitter + a small random rotation. For proprioception, no augmentation other than dropout (25% on the proprio token, applied during training only — this prevents causal confusion with the action history). For action labels, never augment — those are your targets.
Regularization that's overrated
Dropout in transformers (other than attention dropout for very small datasets) usually hurts. L2 on activations is rarely needed. The only regularizers that consistently help are weight decay on linears, gradient clipping, and EMA.
Compute budget
An ACT-scale single-task policy fits on one GPU for a day. A Diffusion Policy with a transformer backbone fits on one GPU in a couple of days. A 7B-parameter VLA fine-tune wants 8×H100 for a week. A from-scratch VLA pretraining run is a small-cluster operation — on the order of $10^5$ GPU-hours. Plan accordingly.
Worked example: training budget for a Diffusion Policy. You have a dataset of 1,000 demonstrations, each 300 steps at 10Hz, from two cameras. Action dimension: 7 (6-DoF EE + gripper). Chunk size H=16.
Dataset size: 1,000 episodes × 300 steps = 300k training samples. Minus chunk boundaries: ~284k usable chunks.
Batch size: 256 (typical for single-GPU training on a 40GB A100).
Steps per epoch: 284k / 256 ≈ 1,109 steps/epoch.
Total training: 300 epochs × 1,109 = 332,700 gradient steps. At ~0.15s/step on a CNN U-Net denoiser: ~14 hours. On a transformer denoiser: ~28 hours.
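Memory: ResNet-18 encoder (11M params) + 1D U-Net denoiser (~25M params) ≈ 36M float32 parameters = 144 MB of weights, plus the same again for gradients. Adam keeps two moment buffers per parameter (72M values = 288 MB). Image batch at 480×640×3×2 cameras×256 batch ≈ 0.47 GB as uint8, or ≈ 1.9 GB once cast to float32. Total: roughly 2.5 GB before intermediate activations, which fits comfortably on any modern GPU.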
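The step count and wall-clock estimate in this worked example can be reproduced with plain arithmetic. The variable names are illustrative; the inputs are the figures quoted above, including the assumed 0.15 s per step for the CNN U-Net denoiser.

```python
episodes, steps_per_episode, horizon = 1_000, 300, 16
batch_size, epochs, sec_per_step = 256, 300, 0.15

usable_chunks = episodes * (steps_per_episode - horizon)   # 284,000 (the text's ~284k)
steps_per_epoch = usable_chunks // batch_size              # 1,109
total_steps = epochs * steps_per_epoch                     # 332,700
hours = total_steps * sec_per_step / 3600                  # about 13.9
```

Doubling `sec_per_step` for a transformer denoiser doubles `hours`, which is where the ~28 hour figure comes from.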
22b The tooling layer: LeRobot & data infrastructure
The architecture is not your bottleneck. The data pipeline is.
The data problem is the real problem
Every robot learning paper spends 80% of its pages on the policy architecture and 20% on the data. In practice, the ratio of engineering effort is reversed. The architecture is a PyTorch module you can copy from a reference implementation in an afternoon. The data pipeline — collection, cleaning, storage, loading, normalization, versioning — is the thing that takes months and determines whether the policy works.
Robot learning datasets are harder than vision or NLP datasets for three reasons:
Multi-modal. Each timestep contains images (one or more cameras), proprioception (joint angles, velocities, gripper state), and actions (the labels). These modalities have different shapes, different sampling rates, and different storage requirements. A single 300-step episode with two 480×640 cameras produces ~550 MB of image data but only ~34 KB of proprioception and actions.
Temporal. Episode structure matters. A training sample is not an independent image-label pair — it is a window of consecutive observations and a chunk of future actions. The dataloader must respect episode boundaries, handle observation history, and align modalities to shared timestamps.
Heterogeneous. Different robots have different action spaces (7-DOF joint positions vs. 6-DOF end-effector deltas vs. 2-finger gripper vs. 5-finger hand). Different cameras have different resolutions and intrinsics. A dataset that mixes embodiments must normalize all of this into a common format without losing information.
LeRobot: the ecosystem
LeRobot is an open-source library from HuggingFace that standardizes the entire robot learning data and training pipeline. It is to robot learning what HuggingFace Transformers is to NLP: the reference implementation that everyone forks from. What it provides:
LeRobotDataset. A standardized format that stores episodes as a combination of tabular data (proprioception, actions, timestamps, episode indices), video frames (MP4 or individual images), and metadata (robot description, action space definition, camera parameters). The format is designed for both local training and streaming from the HuggingFace Hub.
Episode-aware batching with delta_timestamps. The key API innovation. Instead of manually computing observation windows and action chunks, you declare temporal offsets: "give me the image from 0.1s ago and the current image as observation, and the next 16 actions at 10Hz as the label." The dataset handles alignment, boundary checking, and padding.
Streaming mode. For large datasets (Open X-Embodiment is ~2 TB), you do not need to download everything. LeRobot streams episodes on demand, caching locally as you train. This makes it feasible to train on datasets that do not fit on your disk.
Pre-built policy implementations. ACT, Diffusion Policy, and SmolVLA ship as reference implementations with configs that reproduce published results. The training loop, evaluation protocol, and checkpointing are standardized.
Data collection tools. Drivers for common teleop hardware (ALOHA leader-follower arms, UMI handheld grippers, keyboard/gamepad control) that record episodes directly into LeRobotDataset format.
Dataset design decisions that matter
Episode boundaries
Never create a training sample that crosses episode boundaries. Episode $k$ ends when the task is complete (or failed) and the robot resets. Episode $k+1$ starts from a new initial state. A chunk that spans the boundary contains a physical discontinuity — the robot teleported from one configuration to another. The policy learns to predict this teleportation as if it were a valid action, and deploys nonsense at test time. Every dataloader must know where episodes begin and end.
Temporal alignment
Images arrive at 30 fps. Proprioception arrives at 200 Hz. Actions are logged at 50 Hz. The dataloader must align all modalities to a common timeline, typically by snapping each modality to the nearest sample at the requested timestamp. Interpolation is sometimes used for proprioception but never for images (interpolated images are physically meaningless).
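Nearest-sample snapping is a binary search over sorted timestamps. The sketch below uses only the standard library; `nearest_index` and the synthetic one-second streams are assumptions for illustration, not part of any dataset API.

```python
import bisect

def nearest_index(timestamps, t):
    """Index of the sample in sorted `timestamps` closest to time t."""
    i = bisect.bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbor is closer; ties go to the earlier sample.
    return i - 1 if t - timestamps[i - 1] <= timestamps[i] - t else i

# Hypothetical one-second streams at the rates quoted above.
image_ts = [k / 30 for k in range(30)]       # 30 fps camera
proprio_ts = [k / 200 for k in range(200)]   # 200 Hz joint encoder
action_ts = [k / 50 for k in range(50)]      # 50 Hz action log

# For each action timestamp, look up the nearest image and proprio sample.
aligned = [(nearest_index(image_ts, t), nearest_index(proprio_ts, t))
           for t in action_ts]
```

A production loader would typically also reject matches whose time gap exceeds a tolerance, so that a dropped camera frame surfaces as an error instead of a silently stale image.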
Normalization
Compute per-feature mean and standard deviation from the training dataset. Normalize observations and actions to approximately zero mean and unit variance. This is not optional — action dimensions can range from $[-\pi, \pi]$ for joint angles to $[0, 1]$ for gripper width. Without normalization, the loss is dominated by the largest-magnitude dimensions and the gripper (the most important dimension for many tasks) is ignored.
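A minimal sketch of per-feature normalization using only the standard library. The helper names and the tiny two-dimensional action set are hypothetical; a real pipeline would compute the statistics once over the training split and store them with the checkpoint.

```python
import statistics

def fit_normalizer(rows, eps=1e-8):
    """Per-feature mean and std from a list of equal-length feature rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [max(statistics.pstdev(c), eps) for c in columns]   # eps guards constant features
    return means, stds

def normalize(row, means, stds):
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

def unnormalize(row, means, stds):
    """Invert normalize(); apply to policy outputs before sending them to the robot."""
    return [x * s + m for x, m, s in zip(row, means, stds)]

# Hypothetical actions: one joint angle in radians, one gripper width in [0, 1].
actions = [[-3.0, 0.0], [0.0, 0.5], [3.0, 1.0]]
means, stds = fit_normalizer(actions)
normed = [normalize(a, means, stds) for a in actions]
```

After normalization both columns have unit variance, so an error on the gripper dimension costs as much in the loss as an equally large relative error on the joint angle.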
Action space standardization
For cross-embodiment training, all action spaces must be mapped to a common representation. The typical choice: end-effector pose deltas in the robot's base frame (6D: $\Delta x, \Delta y, \Delta z, \Delta \text{roll}, \Delta \text{pitch}, \Delta \text{yaw}$) plus a gripper command (1D: open/close). This throws away joint-level information but creates a representation that is (approximately) embodiment-invariant. Policies that operate in this space can transfer between a Franka and a UR5 without retraining.
LeRobot dataset loading with delta_timestamps
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load a dataset from HuggingFace Hub
dataset = LeRobotDataset(
    "lerobot/aloha_sim_transfer_cube_human",
    delta_timestamps={
        # Observation: current frame + 0.1s-ago frame
        "observation.images.top": [-0.1, 0.0],
        # Proprioception: same two timestamps
        "observation.state": [-0.1, 0.0],
        # Action chunk: next 16 steps at 50Hz (0.02s apart)
        "action": [i * 0.02 for i in range(16)],
    }
)

# Each item is a dict with aligned, windowed data
item = dataset[0]
print(item["observation.images.top"].shape)  # (2, 3, 480, 640)
print(item["observation.state"].shape)       # (2, 14)
print(item["action"].shape)                  # (16, 14)

# The dataset automatically:
# - Respects episode boundaries (no cross-episode samples)
# - Aligns timestamps across modalities
# - Returns tensors ready for the policy
The data collection pipeline
From zero to a trained policy, the data pipeline has five stages:
Build or buy a teleop rig
The operator needs to control the robot in real time while demonstrations are recorded. Three dominant setups in 2026:
ALOHA leader-follower: a second (leader) arm that the operator moves by hand; the follower arm mirrors the motion. The most natural control for bimanual tasks. Cost: ~$20K for the leader arms.
UMI handheld gripper: a passive gripper with tracking markers that the operator uses to demonstrate the task in free space. The robot replays the recorded trajectory. Lowest barrier to entry.
VR controller: the operator moves a VR controller and the robot follows in end-effector space. Works well for single-arm tasks. Requires careful calibration of the VR-to-robot transform.
The teleop interface determines data quality. Jerky, unnatural demonstrations produce jerky, unnatural policies.
Record episodes
The operator performs the task while cameras and proprioception are logged at their native rates. Each episode records: all camera streams (synchronized), joint positions and velocities, gripper state, and a task description string. A good episode takes 10–60 seconds; setup and reset add 2–5 minutes per episode.
Budget 5 minutes per demonstration. A 200-demo dataset costs about 17 hours of operator time, or roughly two full working days.
Quality filter
Remove failed episodes (the operator dropped the object, bumped the camera, took a suboptimal path). Trim dead time at the beginning and end of each episode (the operator reaching for the teleop handle is not part of the task). A 20% rejection rate is typical; aggressive filtering to 50% rejection rate improves policy quality measurably.
Bad demonstrations are worse than no demonstrations. The policy learns the failures too.
Annotate
Add task description strings for language-conditioned policies ("pick up the red mug and place it on the coaster"). For multi-task datasets, annotations are mandatory. For single-task datasets, they are optional but future-proof your data for VLA fine-tuning.
Annotation is cheap relative to collection. Do it now or regret it later.
Upload and share
Push the dataset to HuggingFace Hub in LeRobotDataset format. This enables streaming, versioning, and sharing with the community. A well-documented dataset card (robot description, task description, collection protocol, known failure modes) multiplies the value of the data.
The Open X-Embodiment dataset exists because 21 institutions shared their data. Your dataset is more valuable public than private.
Scaling: how much data for what ambition
| Ambition | Data scale | Compute | Training time |
|---|---|---|---|
| Single-task specialist | 50–200 demos | 1 GPU | 4–8 hours |
| Multi-task, same robot | 1K–5K demos | 1–4 GPUs | 1–2 days |
| Cross-embodiment generalist | 100K+ demos | 32+ GPUs | 1–2 weeks |
| Foundation model (π₀-scale) | 10M+ demos | 256+ GPUs | Months |
The jump from "single-task specialist" to "multi-task generalist" is not linear. A 10× increase in data does not produce a 10× increase in task diversity — it produces a qualitative shift in what the policy can represent. Below a threshold, the policy memorizes. Above it, the policy generalizes. The threshold depends on the architecture, but for a Diffusion Policy it is roughly 1,000 demonstrations; for a VLA, roughly 10,000.
Open X-Embodiment
The Open X-Embodiment dataset (Open X-Embodiment Collaboration, 2024) contains 1.4 million robot trajectories from 22 different robot embodiments, collected by 21 research institutions. It is the ImageNet moment for robot learning: the first large-scale demonstration that cross-embodiment transfer is not just possible but beneficial. Policies pre-trained on this dataset and fine-tuned on a target robot outperform policies trained from scratch on the target robot alone.
The limitations are real:
Heterogeneous quality. Some contributing labs have excellent teleop setups; others have noisy, jerky demonstrations. The policy learns from all of them.
Inconsistent action spaces. Joint-space actions for one robot, end-effector deltas for another. The normalization and mapping layer must handle all variants.
Biased task distribution. Pick-and-place dominates. Dexterous manipulation, deformable objects, and tool use are underrepresented.
The cheapest way to a better policy in 2026 is still more demonstrations. Architecture improvements give you 5–15% gains; doubling the dataset gives you 10–30%. If you have a week and must choose between tuning the architecture and collecting more data, collect more data. Every time.
22·C Co-fine-tuning: mixing web and robot data
The training trick that makes VLAs work. Without it, your model forgets what a cup looks like. With it, internet knowledge and motor skill reinforce each other.
The catastrophic forgetting problem
A VLM arrives pre-trained on billions of image-text pairs. It knows what "red cup" means, what "pick up" implies, and what a kitchen looks like. Then you fine-tune it on robot data: 100K trajectories of (image, instruction, action chunk). The robot data teaches the model to predict actions. But here is the catch — the gradient signal from robot data overwrites the VLM's pre-trained weights. After 20K gradient steps of robot-only fine-tuning, the representation of "red" has drifted. After 50K steps, "red cup" no longer activates the right visual features. The model can still grasp objects (it learned that from robot data), but it can no longer distinguish "pick up the red cup" from "pick up the blue cup."
This is catastrophic forgetting — the phenomenon where new learning destroys previously acquired knowledge. In NLP, this manifests as a fine-tuned model forgetting grammar or factual knowledge. In robotics, it is worse: the model forgets the very visual-semantic features that make language-conditioned manipulation possible. A VLA that has forgotten what "red" looks like is not a slightly worse VLA. It is a VLA that cannot follow language instructions at all.
The co-fine-tuning solution
The fix, introduced by RT-2 and refined by every VLA since, is simple in concept: mix robot data with web data during fine-tuning. Every training batch contains both kinds of samples:
Robot samples: (image, language instruction, action chunk) → supervised with the action prediction loss (flow matching, diffusion, or cross-entropy on discrete tokens).
Web samples: (image, question, answer) → supervised with the standard VQA cross-entropy loss.
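The two losses are weighted and summed into a single scalar that the optimizer minimizes:
$$ \mathcal{L} = \lambda_{\text{robot}} \, \mathcal{L}_{\text{robot}} + \lambda_{\text{web}} \, \mathcal{L}_{\text{web}} $$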
$\mathcal{L}_{\text{robot}}$ — the action prediction loss. For flow matching VLAs (π₀, SmolVLA): MSE between predicted and target velocity fields. For discrete-token VLAs (RT-2, OpenVLA): cross-entropy over action bin indices. Computed only on robot samples in the batch.
$\mathcal{L}_{\text{web}}$ — the VQA loss. Standard next-token cross-entropy on image-question-answer triplets from web datasets (VQAv2, GQA, TextVQA, etc.). Computed only on web samples in the batch.
$\lambda_{\text{robot}} = 0.8$, $\lambda_{\text{web}} = 0.2$ — typical loss weights. The ratio controls the tradeoff between learning new motor skills and preserving old semantic knowledge. Higher $\lambda_{\text{web}}$ = less forgetting but slower robot learning.
In plain English: every time you show the model a batch of robot grasping data, also show it a batch of web data that reminds it what objects look like and what words mean. The web data acts as an anchor — it keeps the VLM's semantic features in place while the robot data teaches the action expert to use those features for motor control.
The web data acts as a regularizer that keeps the VLM's semantic features alive. Think of the VLM's weights as encoding two kinds of knowledge: (1) visual-semantic features (what objects look like, what words mean) and (2) task-specific motor mappings (how to translate "pick up" into gripper motion). The web loss penalizes any weight change that degrades visual-semantic features. The robot loss rewards weight changes that improve motor mappings. The optimizer finds a compromise where motor skill improves without sacrificing too much semantic understanding.
This is not metaphorical — it is a direct consequence of the multi-task loss gradient. At every step, the gradient from the web loss points back toward the pre-trained feature space, while the gradient from the robot loss points toward an action-optimized feature space. The weighted sum of these gradients moves the parameters along a direction that satisfies both objectives to the degree allowed by the weighting ratio.
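The weighted sum can be sketched in a few lines of NumPy. This is a minimal illustration, not any VLA's actual training code: the function name `cofinetune_loss`, the toy array shapes, and the 1000-token vocabulary are assumptions made for the example. Only the two loss types and the 0.8/0.2 weights come from the text.

```python
import numpy as np

def cofinetune_loss(pred_velocity, target_velocity, web_logits, web_targets,
                    lambda_robot=0.8, lambda_web=0.2):
    """Weighted sum of a flow-matching MSE (robot) and a token cross-entropy (web)."""
    # Robot loss: MSE between predicted and target velocity fields.
    loss_robot = np.mean((pred_velocity - target_velocity) ** 2)
    # Web loss: cross-entropy over answer tokens via a stable log-softmax.
    shifted = web_logits - web_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    loss_web = -np.mean(log_probs[np.arange(len(web_targets)), web_targets])
    total = lambda_robot * loss_robot + lambda_web * loss_web
    return total, loss_robot, loss_web

rng = np.random.default_rng(0)
pred = rng.normal(size=(200, 16, 7))      # toy robot samples: 16-step chunks of 7-D actions
target = rng.normal(size=(200, 16, 7))
logits = rng.normal(size=(56, 1000))      # toy web tokens over a hypothetical vocabulary
answers = rng.integers(0, 1000, size=56)

total, l_robot, l_web = cofinetune_loss(pred, target, logits, answers)
print(f"robot={l_robot:.3f}  web={l_web:.3f}  total={total:.3f}")
```

In a real trainer each loss would be computed only on its own slice of the batch, and the gradient of `total` is the weighted sum of the two gradients described above.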
The mixing ratio matters
The ratio $\lambda_{\text{robot}} / \lambda_{\text{web}}$ is the single most important hyperparameter in VLA training, and getting it wrong produces one of two failure modes:
Too much web data ($\lambda_{\text{web}} > 0.4$): the model retains its semantic features perfectly but converges slowly on robot tasks. After the same compute budget, the robot-task success rate is 15–25 points lower than the optimal ratio. The web loss dominates the gradient and prevents the action expert from learning efficiently.
Too little web data ($\lambda_{\text{web}} < 0.1$): the model learns robot tasks quickly but forgets visual semantics. After 50K steps, language-conditioned success rate drops by 20–30 points because the model can no longer distinguish objects by name. Open-vocabulary instruction following degrades to near-random.
The sweet spot depends on three factors:
Domain distance. If the robot domain is visually similar to the web domain (tabletop manipulation with common household objects), the features transfer well, and less web data is needed as regularizer. If the robot domain is visually distant (underwater inspection, surgical manipulation), more web data is needed to preserve the relevant features. Typical ranges: 0.15–0.25 for tabletop, 0.25–0.35 for unusual domains.
Robot data volume. With more robot data, the gradient signal from robot samples is more stable and less likely to cause catastrophic forgetting through noise. Labs with 100K+ robot demonstrations can reduce $\lambda_{\text{web}}$ to 0.1 without forgetting. Labs with 1K demonstrations need $\lambda_{\text{web}} \geq 0.25$ to prevent overfitting the action expert to the small dataset.
Backbone freezing strategy. If the VLM backbone is mostly frozen (LoRA adapters only), less web data is needed because the frozen weights cannot be overwritten. If the backbone is fully trainable (as in RT-2's original recipe), more web data is needed.
Worked example: co-fine-tuning a VLA on 1,000 pick-and-place demos
Worked example: co-fine-tuning from scratch. You have 1,000 pick-and-place demonstrations on a Franka Panda. Each demonstration: 2 camera images (224×224), language instruction, 16-step action chunk (7D). You are fine-tuning a PaliGemma-based VLA with a flow matching action expert.
Batch construction. Batch size 256. Each batch: 200 robot samples + 56 web VQA samples. The robot samples come from your 1,000 demonstrations (with random chunk start indices and augmentation). The web samples come from a shuffled stream of VQAv2 + GQA + TextVQA.
Loss computation. Robot loss: flow matching MSE on the 200 robot samples. Web loss: cross-entropy on the 56 VQA samples. Combined: $\mathcal{L} = 0.8 \times \mathcal{L}_{\text{robot}} + 0.2 \times \mathcal{L}_{\text{web}}$.
Training dynamics. At step 0, the robot loss starts high because the action expert is untrained, while the web loss starts at its pre-trained level. The robot loss drops quickly (from ~1.0 to ~0.1 by step 20K) as the action expert learns the basic grasping motion. The web loss stays roughly constant (~2.5, close to the pre-trained level) because the web gradient prevents the backbone from drifting.
By step 100K: robot loss converges to 0.02. Web VQA accuracy: 78% (vs 82% pre-training). The 4-point drop in web accuracy is the cost of robot specialization — acceptable because the semantic features that matter for "red cup" and "on the table" are preserved. The features that degrade are the ones relevant to "What breed of dog is this?" — irrelevant for manipulation.
Comparison: no co-fine-tuning. Same setup, $\lambda_{\text{web}} = 0$ (robot data only). By step 100K: robot loss = 0.015 (slightly better). But web VQA accuracy = 41% (halved). And language-conditioned success rate on novel instructions: 32% (vs 71% with co-fine-tuning). The model forgot what objects look like.
Total compute: 100K steps × ~0.3s/step = ~8 hours on 8×A100. The 56 web samples per batch add ~15% overhead vs robot-only training. That 15% compute cost buys a 39-point improvement in language-conditioned generalization.
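The batch construction from this worked example can be sketched as follows. It is an illustration under stated assumptions: `build_mixed_batch`, the 120-step demo length, and the integer stand-in for the VQA stream are hypothetical. The 200 + 56 split, the 16-step chunk, and the random chunk start come from the example.

```python
import itertools
import random

def build_mixed_batch(demo_lengths, web_stream, n_robot=200, n_web=56,
                      chunk_len=16, rng=None):
    """Sample one co-fine-tuning batch: robot chunks plus web VQA samples."""
    rng = rng or random.Random()
    batch = []
    for _ in range(n_robot):
        demo_id = rng.randrange(len(demo_lengths))
        # Random chunk start, so each demo yields many distinct training samples.
        start = rng.randrange(demo_lengths[demo_id] - chunk_len + 1)
        batch.append(("robot", demo_id, start))
    for _ in range(n_web):
        # Web samples come from a shuffled stream; here a stand-in iterator.
        batch.append(("web", next(web_stream), None))
    rng.shuffle(batch)
    return batch

demo_lengths = [120] * 1000                      # hypothetical: 1,000 demos of 120 steps
web_stream = itertools.cycle(range(10_000))      # stand-in for VQAv2 + GQA + TextVQA
batch = build_mixed_batch(demo_lengths, web_stream, rng=random.Random(0))

n_web = sum(1 for kind, _, _ in batch if kind == "web")
print(len(batch), f"{n_web / len(batch):.1%} web")
```

With these counts the web share of the batch is 56/256, roughly 22%, close to the 0.2 loss weight.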
Comparison: three training strategies
| Strategy | Robot success (trained tasks) | Robot success (novel instructions) | Web VQA accuracy | Training time | Forgetting |
|---|---|---|---|---|---|
| No co-fine-tuning (robot data only) | 89% | 32% | 41% | 7 hours | Severe |
| Co-fine-tuning ($\lambda_\text{web} = 0.2$) | 85% | 71% | 78% | 8 hours | Minimal |
| LoRA-only (backbone frozen) | 82% | 68% | 82% | 4 hours | Zero |
The table tells the story. Robot-only training wins on the specific tasks it was trained on (89% vs 85%) but collapses on novel instructions (32% vs 71%). Co-fine-tuning sacrifices 4 points on trained tasks to gain 39 points on novel instructions — a massive net improvement for any deployment where the task set will grow. LoRA-only preserves web knowledge perfectly (82%) but has slightly less action capacity than the fully updated model.
The emerging best practice in 2026 is a hybrid: co-fine-tune with $\lambda_{\text{web}} = 0.2$ during Stage 2 (when the action expert is learning from scratch), then switch to LoRA-only for Stage 3 (task-specific fine-tuning). This gives the action expert enough gradient signal to learn during the critical early phase, then freezes the backbone to prevent any further drift during per-task specialization.
The schedule within a run
Some teams vary the web ratio over the course of training. The rationale: early in training, the action expert is near-random and the gradients are large and noisy — this is when forgetting risk is highest. Late in training, the gradients are small and the features are stable. So start with more web data and decay:
Steps 0–10K: $\lambda_{\text{web}} = 0.4$ (strong regularization during the noisy phase)
Steps 10K–50K: $\lambda_{\text{web}} = 0.2$ (standard ratio once the action expert has stabilized)
This schedule is not universally adopted — many teams use a fixed ratio throughout — but the teams that use it report 2–4 points of improvement on both robot tasks and web accuracy, at zero extra compute cost. The decay schedule matches the intuition: the model needs the most protection against forgetting when it is learning the most (early training), and the least protection when the features have converged (late training).
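As a sketch, the two-phase schedule is a step function of the training step. The function names are hypothetical, and two choices here are assumptions rather than part of the recipe: the weights are taken to sum to 1 (as in the 0.8/0.2 split), and the 0.2 value is simply held after the last listed phase.

```python
def lambda_web_schedule(step, warm_steps=10_000, warm_value=0.4, base_value=0.2):
    """Web-loss weight: stronger regularization early, standard ratio afterwards."""
    return warm_value if step < warm_steps else base_value

def loss_weights(step):
    """Return (lambda_robot, lambda_web), assuming the two weights sum to 1."""
    lam_web = lambda_web_schedule(step)
    return 1.0 - lam_web, lam_web

for step in (0, 9_999, 10_000, 50_000):
    lam_robot, lam_web = loss_weights(step)
    print(step, round(lam_robot, 2), lam_web)
```

A smooth decay (linear or cosine) between the two values would be an equally reasonable variant; the text only specifies the two plateaus.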
Co-fine-tuning is to VLAs what pre-training was to NLP: not optional. Every successful VLA since RT-2 uses it. The only question is the mixing ratio. If you fine-tune a VLM on robot data without web data in the mix, you will get a model that can grasp but cannot follow instructions. In the worked example above, the ~15% compute overhead for web samples buys a roughly 2× improvement in language-conditioned generalization (71% vs 32% on novel instructions). There is no cheaper lever in the VLA training stack.
23 Inference and deployment
The system around the model is the system you ship.
Receding-horizon control, decoded
The deployment loop for a chunked policy:
Read the latest observation $o_t$ from cameras and proprioception.
Run the policy forward to predict $a_{t:t+H}$.
Push the predicted chunk into a control buffer.
Send actions from the buffer to the robot at the control rate (50–200Hz).
After $K$ control ticks, return to step 1.
The trick is decoupling policy frequency ($1/K$ of the control rate) from control frequency. The policy can be slow; the controller is fast. A 200ms diffusion policy that produces 16 actions executed at 50Hz controls the robot for 320ms — well within budget.
Python (pseudocode)
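# Receding-horizon deployment loop (pseudocode: policy, robot, get_observation,
# safety_filter, task_not_done, K, H, and control_hz come from your own stack)
from collections import deque
from time import sleep

buffer = deque()
ticks_since_plan = K                        # force a plan on the first tick
while task_not_done:
    if ticks_since_plan >= K or not buffer: # K ticks executed: need new chunk
        obs = get_observation()             # cameras + proprio
        chunk = policy.predict(obs)         # H actions, shape (H, 7)
        buffer.clear()                      # discard the stale tail of the old chunk
        buffer.extend(chunk[:H])            # push all H into buffer
        ticks_since_plan = 0
    action = buffer.popleft()               # consume one action
    action = safety_filter(action)          # clip to workspace, vel limits
    robot.send(action)                      # execute at control rate
    ticks_since_plan += 1
    sleep(1 / control_hz)                   # 20ms for 50Hz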
Latency budgets
| Architecture | Inference | Notes |
|---|---|---|
| ACT | 5–10 ms | Single forward, 80M params |
| Diffusion Policy | 30–80 ms | 16 DDIM steps |
| π₀ flow | 40–60 ms | 10 Euler steps |
| π₀-FAST | ~20–40 ms | ~30 FAST tokens |
| GR00T N1 (2.2B) | 63.9 ms | L40 GPU, bf16 |
| OpenVLA 7B | 200–400 ms | INT4 cuts ~2× |
| SmolVLA 450M | ~30–50 ms | Jetson-class |
| One-step distilled | 5–15 ms | 1 sampling step |
[Figure: timeline showing policy inference, the predicted chunk, and controller ticks]
Latency budget: where the milliseconds go
A concrete breakdown for a Diffusion Policy running at 10Hz (100ms budget per policy call):
TensorRT compilation matters 16x for diffusion policies: the denoiser runs 16 times per action chunk. A 1ms improvement per denoiser call saves 16ms per policy call. The compilation pipeline: export the denoiser to ONNX ($\texttt{torch.onnx.export}$), compile to a TensorRT engine ($\texttt{trtexec}$). TensorRT fuses convolution + batch norm + ReLU into a single kernel, eliminates memory round-trips, and can run the denoiser step in ~1.5ms instead of ~3ms. Total savings: 16 × 1.5ms = 24ms per policy call, about a quarter of the 100ms budget, which is the difference between a strained 10Hz loop and a comfortable one.
Action smoothing and safety filters
Velocity / acceleration limits on commanded actions.
Workspace bounding boxes — clip EE poses to the safe operating volume.
Force / torque limits — abort if measured forces exceed safe thresholds.
Watchdog timer on policy inference.
None of these are part of the model. All of them are part of the policy system. Skip them and your first deployment is your last.
Safety monitoring: what to check at runtime
The safety filter is a hard constraint: if the policy commands a dangerous action, the safety filter overrides it with a safe default (hold position, open gripper, or controlled stop). The checks, in order of priority:
Joint position limits. Each joint has hard stops. Clip commands to $[q_{\min} + \epsilon, q_{\max} - \epsilon]$ with a 2-degree margin. Violations damage the robot.
Joint velocity limits. Clip to manufacturer-specified limits (typically 150°/s for Franka joints 1–4, 180°/s for joints 5–7). Sudden accelerations stress the gearbox.
End-effector workspace bounds. Define a convex polytope (box or cylinder) around the work area. If the commanded EE pose falls outside, project it back to the boundary. This prevents the arm from reaching into humans or equipment.
Force/torque monitoring. If the wrist force/torque sensor reads above a threshold (e.g., 30N for a Franka), trigger a compliant stop. The policy may be pushing into an obstacle the perception system missed.
Self-collision check. A fast geometric check (sphere approximation of each link) to ensure the arm does not collide with itself. This is especially important for 7-DOF arms with large workspaces.
Inference timeout. If the policy takes longer than the allocated budget (e.g., 100ms), do not wait. Execute the last safe action (hold position) and log a warning. A safe fallback is better than a blocked control loop.
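A few of the checks above can be sketched as a single per-tick filter. This is a minimal illustration under assumptions, not a certified safety layer: the function names, the joint limits, and the choice to hold the previous pose on a force violation are placeholders, and the self-collision and timeout checks are left out.

```python
import numpy as np

def safety_filter(q_cmd, q_prev, dt, q_min, q_max, qd_max,
                  wrench_norm, force_limit=30.0, margin=np.deg2rad(2.0)):
    """Return (safe joint command, status) for one control tick."""
    # Force/torque monitoring: above the threshold, hold the previous pose.
    if wrench_norm > force_limit:
        return q_prev.copy(), "force_stop"
    # Velocity limit: cap how far each joint may move in one tick.
    max_step = qd_max * dt
    q_safe = q_prev + np.clip(q_cmd - q_prev, -max_step, max_step)
    # Position limit, applied last so it always wins: stay inside the hard stops.
    q_safe = np.clip(q_safe, q_min + margin, q_max - margin)
    status = "ok" if np.allclose(q_safe, q_cmd) else "clipped"
    return q_safe, status

def clip_to_workspace(ee_pos, box_min, box_max):
    """Project a commanded end-effector position onto an axis-aligned box."""
    return np.clip(ee_pos, box_min, box_max)

# Placeholder limits for a 7-joint arm (not taken from any datasheet).
q_min, q_max = np.full(7, -2.8), np.full(7, 2.8)
qd_max = np.full(7, 2.0)                       # rad/s
q_prev = np.zeros(7)
q_cmd = np.array([0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

q_safe, status = safety_filter(q_cmd, q_prev, 0.02, q_min, q_max, qd_max, wrench_norm=5.0)
print(status, q_safe[0])                       # joint 0 limited to 2.0 rad/s * 0.02 s
q_hold, stop = safety_filter(q_cmd, q_prev, 0.02, q_min, q_max, qd_max, wrench_norm=50.0)
print(stop)
```

The position clip runs after the velocity clip on purpose: if the two constraints disagree, the command that leaves the filter still respects the joint limits, which is the highest-priority check in the list.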
For diffusion policies, sampling drops from 100 denoising steps at training time to 16 at inference time via DDIM, with negligible loss. (This step count is unrelated to the $K$ executed actions per chunk used elsewhere in this section.) Consistency models, distilled samplers, and rectified flow further compress this to single-step sampling at small accuracy cost. The 2026 production diffusion policy almost always samples in 4–16 steps, never 100.
Quantization
For VLA-scale policies, INT8 or FP8 weight-only quantization gives ~2× speedup with minimal degradation. Activation quantization is dicier — attention can be sensitive. AWQ and GPTQ work; SmoothQuant works; full INT4 is fragile but viable for the largest models when latency is critical.
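To make "weight-only quantization" concrete, here is a NumPy sketch of plain symmetric per-channel INT8 rounding. It is the simplest baseline, not AWQ, GPTQ, or SmoothQuant, and the function names and matrix shape are assumptions for illustration.

```python
import numpy as np

def quantize_int8_per_channel(w):
    """Symmetric per-output-channel INT8 quantization of a weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)     # guard against all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; activations stay in floating point."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 256)).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
max_err = np.abs(dequantize(q, scale) - w).max()
print(q.dtype, f"max abs error = {max_err:.2e}")
```

The rounding error per weight is bounded by half a quantization step for that row, which is why outlier-heavy rows (and outlier-heavy activations) are where the more careful methods earn their keep.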
Worked example: latency budget for a Diffusion Policy.
Control rate: 50 Hz (20ms per tick).
Chunk size: $H = 16$ actions. Execute $K = 8$, discard 8.
Chunk duration: $K / 50 = 160$ms.
Policy inference: 16 DDIM steps × 3ms/step (CNN U-Net) = 48ms.
Vision encoding: ResNet-18 = 4ms. Total: 52ms.
Timeline: At $t = 0$, observe. At $t = 52$ms, first action ready. Lags 2.6 control ticks behind — but the chunk covers 160ms, so the buffer holds 8 actions that the controller consumes at 20ms intervals. The next observation happens at $t = 160$ms, giving 108ms of slack for the next inference.
Failure mode: If inference takes >160ms, the buffer empties and the robot stalls or coasts on stale commands. This is why fast denoisers matter: switching from a transformer to a CNN denoiser saves ~150ms/call, turning an unsafe 200ms inference into a safe 52ms one.
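The arithmetic of this worked example fits in a small helper. The function name and dictionary keys are assumptions for illustration; the inputs are the numbers from the example.

```python
def chunk_timing(control_hz, executed_actions, denoise_steps, ms_per_step, encoder_ms):
    """Latency budget for a chunked policy that executes K actions per plan."""
    tick_ms = 1000.0 / control_hz
    chunk_ms = executed_actions * tick_ms               # time covered by the K executed actions
    inference_ms = denoise_steps * ms_per_step + encoder_ms
    return {
        "tick_ms": tick_ms,
        "chunk_ms": chunk_ms,
        "inference_ms": inference_ms,
        "lag_ticks": inference_ms / tick_ms,            # control ticks spent waiting on the plan
        "slack_ms": chunk_ms - inference_ms,            # margin before the buffer would empty
        "safe": inference_ms < chunk_ms,
    }

budget = chunk_timing(control_hz=50, executed_actions=8,
                      denoise_steps=16, ms_per_step=3.0, encoder_ms=4.0)
for key, value in budget.items():
    print(f"{key:>13}: {value}")
```

Re-running it with a slower denoiser (for example `ms_per_step=12.0`) shows the failure mode directly: the inference time exceeds the chunk duration and `safe` flips to `False`.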
23b Async inference — decoupling planning from execution
If your robot pauses briefly every half-second, you have sync inference. The fix is to never let the action queue run dry.
The problem
A Diffusion Policy with 16 DDIM steps takes ~50ms to plan. The control loop runs at 30Hz (33ms per tick). If you plan synchronously (exhaust the action chunk, then stop the robot, then compute the next chunk), the robot is blind for 50ms every time it needs a new plan. For a chunk of $H = 16$ actions at 30Hz, the robot re-plans every $16 \times 33.3 \approx 533$ms. That is fine for slow tabletop tasks where a 50ms pause every half-second is invisible. It is catastrophic for fast tasks: a pouring trajectory that pauses mid-pour spills; a handover that pauses mid-reach drops the object; a bimanual assembly that pauses mid-insertion jams.
The deeper problem is latency jitter. GPU inference does not take exactly 50ms every time. It takes 45ms once, 62ms the next (a kernel scheduling hiccup), 48ms the third time. In sync mode, every spike directly extends the blind gap. In a 30Hz control loop, a single 70ms spike means the robot coasts on stale actions for two extra ticks — enough to miss a grasp by centimeters.
Sync vs. async inference
Synchronous: exhaust the entire action chunk, then plan the next one. The control loop blocks during inference. Simple to implement, easy to debug, and sufficient for any task where the chunk duration (533ms at 30Hz with $H = 16$) is longer than the planning latency plus comfortable margin. Most published Diffusion Policy results use sync inference because the benchmark tasks are slow enough.
Asynchronous: decouple planning from execution entirely. The robot executes actions from a queue while a background thread (or a remote GPU server) computes the next chunk. When the queue drops below a threshold, the current observation is sent to the planner. The planner returns a new chunk, which is merged with the remaining queue. The robot never stops moving. The control thread and the planning thread run on independent clocks.
The action queue model
The core data structure is a FIFO queue of remaining actions. At each control tick:
Pop the front action from the queue and execute it.
Check the queue fill level: $\text{fill} = |\text{queue}| / H$.
If $\text{fill} < g$ (the refill threshold) and no planning request is in flight, send the current observation to the planner.
When the planner returns a new chunk of $H$ actions, merge it with the remaining queue. For overlapping timesteps, average the old and new actions (temporal ensembling).
The temporal ensembling in step 4 is important. When the new chunk arrives, the queue still has $|\text{queue}|$ unconsumed actions that were predicted by the previous planning call using an older observation. The new chunk's first few actions overlap with these. Averaging smooths the transition between plans and reduces jerk. This is exactly the same temporal ensembling used in ACT (section 07), applied here at the deployment level rather than the architecture level.
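The four steps above can be sketched as a small queue class. This is one possible reading, not a reference implementation: the class and method names are hypothetical, the overlap is averaged with equal weights as described, and the `elapsed_ticks` argument (dropping the part of the new chunk whose timesteps passed while planning) is an assumption about how the chunks are aligned.

```python
from collections import deque
import numpy as np

class ActionQueue:
    """FIFO action queue with a refill threshold and overlap averaging."""

    def __init__(self, chunk_size, refill_threshold):
        self.H = chunk_size
        self.g = refill_threshold
        self.queue = deque()
        self.request_in_flight = False

    def pop(self):
        # Step 1: pop the front action; None signals a starved queue.
        return self.queue.popleft() if self.queue else None

    def should_request_plan(self):
        # Steps 2-3: trigger planning when fill < g and nothing is in flight.
        fill = len(self.queue) / self.H
        return fill < self.g and not self.request_in_flight

    def merge(self, new_chunk, elapsed_ticks=0):
        # Step 4: align the new chunk with the present, average the overlap, keep the rest.
        fresh = [np.asarray(a, dtype=float) for a in new_chunk][elapsed_ticks:]
        old = list(self.queue)
        overlap = min(len(old), len(fresh))
        merged = [0.5 * (old[i] + fresh[i]) for i in range(overlap)]
        merged += fresh[overlap:] if len(fresh) >= len(old) else old[overlap:]
        self.queue = deque(merged)
        self.request_in_flight = False

aq = ActionQueue(chunk_size=4, refill_threshold=0.5)
aq.merge([[0.0], [1.0], [2.0], [3.0]])         # first plan fills an empty queue
for _ in range(3):
    aq.pop()                                   # execute three ticks; one action left
needs_plan = aq.should_request_plan()          # fill = 0.25 < 0.5
aq.request_in_flight = True
aq.merge([[10.0], [11.0], [12.0], [13.0]])     # new chunk arrives on the same tick
print(needs_plan, [a.tolist() for a in aq.queue])
```

In a threaded deployment the `merge` and `pop` calls would need a lock around the queue, since the planner and the control loop run on independent clocks.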
When does the queue run empty?
The queue runs empty when inference takes longer than the time needed to consume the remaining actions. Let $\ell$ be the inference latency (planning time), $\Delta t$ be the control period, and $H$ be the chunk size. When we trigger planning at fill level $g$, the queue has $g \cdot H$ actions left. These last for $g \cdot H \cdot \Delta t$ seconds. Planning must finish before they run out:
Queue safety condition
$$ g \cdot H \cdot \Delta t > \ell \quad \Longleftrightarrow \quad g > \frac{\ell}{H \cdot \Delta t} $$
$g$ — the refill threshold. The queue fill level, as a fraction of the chunk size $H$, below which a new planning call is triggered. $g = 0.5$ means "plan when the queue is half-empty."
$\ell$ — the inference latency in seconds. For a Diffusion Policy with 16 DDIM steps: ~50ms. For a 7B VLA with INT4: ~150ms.
$H$ — the chunk size. Number of actions predicted per planning call.
$\Delta t$ — the control period. $1/30 \approx 33$ms for 30Hz, $1/50 = 20$ms for 50Hz.
Worked example. Diffusion Policy, $\ell = 50$ms, $H = 16$, 30Hz control ($\Delta t = 33$ms).
Minimum threshold: $g > 50 / (16 \times 33) = 0.095$. Any $g > 0.095$ prevents queue starvation in the average case.
But inference has jitter. If $\ell$ spikes to 80ms, you need $g > 80 / 528 = 0.152$. In practice, $g = 0.5$ provides a comfortable margin: planning starts when 8 actions remain (264ms of runway), which absorbs latency spikes up to 264ms — 5× the typical inference time.
For a 7B VLA with $\ell = 150$ms on the same setup: $g > 150 / 528 = 0.284$. A threshold of $g = 0.5$ still works, but the margin shrinks. With $g = 0.75$ (plan when 12 actions remain, 396ms of runway), you absorb spikes up to 396ms comfortably.
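The safety condition is easy to check numerically. The helper names below are hypothetical, and the optional `network_rtt_s` argument is an assumption that any network round trip simply adds to the planning latency. The helper uses the exact control period $1/30$ s rather than the rounded 33ms, so its outputs differ slightly from the hand calculation above.

```python
def min_refill_threshold(latency_s, chunk_size, control_hz, network_rtt_s=0.0):
    """Lower bound on g from g * H * dt > latency (no jitter margin included)."""
    dt = 1.0 / control_hz
    return (latency_s + network_rtt_s) / (chunk_size * dt)

def runway_s(g, chunk_size, control_hz):
    """Seconds of queued actions left when planning is triggered at fill level g."""
    return g * chunk_size / control_hz

for name, latency in [("Diffusion Policy", 0.050), ("latency spike", 0.080), ("7B VLA", 0.150)]:
    g_min = min_refill_threshold(latency, chunk_size=16, control_hz=30)
    print(f"{name:>16}: g > {g_min:.3f}")

print(f"runway at g = 0.5: {runway_s(0.5, 16, 30) * 1000:.0f} ms")
```

Choosing $g$ is then a matter of picking a multiple of the bound that covers the worst latency spike you have actually measured, not the average.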
Remote inference
The planning thread does not need to run on the robot. The planner can run on a powerful remote GPU server while the robot executes on a lightweight edge device (a Jetson, an Intel NUC, even a Raspberry Pi with a USB camera). The robot client sends observations over the network (a compressed JPEG image + a proprioception vector, ~50 KB total), and receives an action chunk (~1 KB) in return.
This splits the system into two processes:
Robot client (edge). Runs the control loop at 30–50Hz. Captures observations, manages the action queue, executes safety filters. No GPU required. Latency-critical: must never miss a control tick.
Planning server (cloud/on-prem GPU). Runs the policy forward pass. Receives observation, returns action chunk. Can be shared across multiple robots. Not latency-critical in absolute terms, but lower is better because it determines the minimum $g$.
Network latency adds to $\ell$. On a local network (robot and server in the same room), round-trip adds ~2ms. Over a WAN (robot in a warehouse, server in a data center), round-trip can be 20–50ms. This makes $g$ larger but is still feasible for chunk sizes $H \geq 16$. The tradeoff is clear: remote inference lets you deploy a 7B VLA on a $200 edge device, at the cost of requiring network connectivity and absorbing network jitter.
[Interactive figure: sync vs. async action queue. In sync mode, the queue empties before replanning starts, creating a visible idle gap. In async mode, replanning starts early enough that the queue never runs dry.]
If your robot pauses briefly every half-second, you have sync inference. Switch to async and the pauses vanish. The cost is a threading lock and ~20 lines of queue management. The benefit is smooth, continuous motion even with slow planners. Every production deployment in 2026 uses some variant of this pattern.
When to use what
| Scenario | Inference mode | Why |
|---|---|---|
| Benchmarking / research | Sync | Simpler, deterministic, easier to debug. Most sim benchmarks pause physics during inference anyway. |
| Slow tasks (pick-and-place) | Sync | Chunk duration >> inference latency. The pause is invisible. |
| Fast tasks (pouring, handovers) | Async | Any pause disrupts the task. Async keeps motion smooth. |
| Large VLA on edge hardware | Async + remote | On-device inference would exceed one chunk duration, so planning moves to a remote GPU and runs ahead of execution to avoid stalls. |
| Multi-robot deployment | Async + shared server | One GPU server plans for N robots. Each robot runs a lightweight async client. |
24 Evaluation
The hard part is not making the policy work. The hard part is knowing whether it works.
Success rate, with footnotes
How many trials? 10 trials gives a 95% CI of about ±30 points. 50 is the bare minimum; 100+ for comparative claims.
Reset distribution. Identical resets across methods, ideally videoed.
Time limit. A 10-minute success is not a 10-second success.
Statistical significance for robot experiments
Robot experiments are expensive. You cannot run 10,000 trials. This means your confidence intervals are wide, and most "improvements" reported in papers are within noise. Understanding the statistics is not optional.
With $n$ trials, the number of successes $k$ follows a binomial distribution, and the estimated success rate is $\hat{p} = k/n$. The width of the 95% confidence interval depends critically on $n$:
| Trials ($n$) | Success rate | 95% CI | CI width |
|---|---|---|---|
| 20 | 85% (17/20) | [62%, 97%] | 35 points |
| 50 | 84% (42/50) | [71%, 92%] | 21 points |
| 100 | 85% (85/100) | [77%, 91%] | 14 points |
| 200 | 85% (170/200) | [80%, 90%] | 10 points |
The lesson: 20 trials is not enough to compare two methods that are within 10% of each other. If method A scores 85% on 20 trials and method B scores 75% on 20 trials, the confidence intervals overlap massively — you cannot claim A is better. You need 100+ trials to distinguish 85% from 75% with statistical confidence.
Practically, this means most single-task comparisons in robotics papers are underpowered. The community knows this. The remedy is to report confidence intervals, run on multiple tasks, and use aggregate metrics (average across tasks) where the effective sample size is larger.
Confidence interval for binomial success rate
import numpy as np
from scipy import stats

def wilson_ci(successes, trials, confidence=0.95):
    """Wilson score interval for binomial proportion."""
    p_hat = successes / trials
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    margin = z * np.sqrt(
        (p_hat * (1 - p_hat) + z**2 / (4 * trials)) / trials
    ) / denom
    return center - margin, center + margin

# Example: 35 successes out of 50 trials
lo, hi = wilson_ci(35, 50)
print(f"70% success: 95% CI = [{lo:.1%}, {hi:.1%}]")
# Output: 70% success: 95% CI = [56.2%, 80.9%]

# Can we claim method A (84%, 50 trials) beats B (74%, 50 trials)?
ci_a = wilson_ci(42, 50)  # about [71.5%, 91.7%]
ci_b = wilson_ci(37, 50)  # about [60.4%, 84.1%]
# CIs overlap between roughly 71% and 84%. Cannot claim A > B with 50 trials.
Generalization axes
Success rate on the training distribution is necessary but insufficient. The real question is: what changes can the policy tolerate? The generalization axes, in roughly increasing difficulty:
Object pose. Same instance, different starting pose. The easiest generalization. If your policy fails here, your observation space is too narrow or your action space is absolute rather than relative.
Object instance. Same class, different instance (different mug, different bowl). Tests whether the vision encoder has learned class-level features rather than memorizing a single object.
Background and lighting. Same task, new scene. Tests visual robustness. Policies that overfit to table color or lighting conditions fail here. Foundation-model encoders (DINOv2, SigLIP) help enormously.
Distractors. Unrelated objects cluttering the scene. Tests whether the policy attends to the right objects. Language-conditioned policies do better here because the instruction disambiguates.
Language paraphrase. Same task, different phrasing ("pick up the mug" vs. "grab the cup" vs. "take the blue thing"). Tests language understanding beyond keyword matching.
Out-of-distribution objects. Objects the policy has never seen (novel shape, novel material). The hardest visual generalization. Only the largest VLAs (trained on internet-scale data) show meaningful OOD object generalization.
Embodiment transfer. Same task, different robot. Tests whether the learned behavior is embodiment-specific or abstract. Cross-embodiment training (Open X-Embodiment) is the current approach.
A rigorous evaluation reports success rate on each axis separately, not just an aggregate. A policy that scores 90% on seen objects but 30% on unseen objects is not "a 60% policy" — it is a policy with a specific, diagnosable generalization failure.
Long-horizon evaluation
For multi-step tasks, success rate alone is too crude. Useful instead: per-stage success rate, average completion fraction, and median time to first failure. A policy that consistently fails at stage 3 is more diagnostic than one that succeeds 40% of the time without telling you where it falls over.
Worked example: long-horizon eval for a 4-stage task. Task: pick mug, move to coffee machine, place under spout, press button. 100 trials.
Per-stage success: Pick: 95/100 (95%). Move: 88/95 (93%). Place: 62/88 (70%). Press: 55/62 (89%).
Overall success: 55/100 = 55%.
Diagnosis: The bottleneck is placement (70%). The other stages are all at 89% or above. Engineering effort should focus on the placement phase: perhaps the vision encoder loses the spout position, or the gripper-release timing is inconsistent. Without per-stage breakdown, you'd only see "55% success" and not know where to look.
Average completion fraction: each trial scores (stages completed) / 4, so the average is (95 + 88 + 62 + 55) / (4 × 100) = 300 / 400 = 0.75. This single number captures "how far does the policy typically get."
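These metrics are straightforward to compute from the cumulative stage counts. The sketch below uses the counts from the worked example; the function name and return layout are assumptions for illustration.

```python
def long_horizon_metrics(reached, n_trials):
    """reached[i] = number of trials that completed stage i (cumulative counts)."""
    attempts = [n_trials] + reached[:-1]          # trials that got as far as attempting stage i
    per_stage = [done / tried for done, tried in zip(reached, attempts)]
    overall = reached[-1] / n_trials
    # A trial's completion fraction is (stages completed) / (number of stages);
    # summing the cumulative counts gives the total number of stages completed.
    avg_completion = sum(reached) / (len(reached) * n_trials)
    bottleneck = min(range(len(reached)), key=lambda i: per_stage[i])
    return per_stage, overall, avg_completion, bottleneck

stages = ["pick", "move", "place", "press"]
per_stage, overall, avg_completion, bottleneck = long_horizon_metrics([95, 88, 62, 55], 100)

for name, rate in zip(stages, per_stage):
    print(f"{name:>5}: {rate:.0%}")
print(f"overall: {overall:.0%}, average completion: {avg_completion:.2f}, "
      f"bottleneck: {stages[bottleneck]}")
```

For these counts the total number of completed stages is 300 out of a possible 400, so the average completion fraction is 0.75, and the lowest conditional success rate identifies placement as the stage to fix.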
Sim vs real
Sim eval is fast, free, deterministic, and only loosely correlated with real-world success. The standard discipline: track both, report sim-eval as a development signal and real-eval as the metric. A 10-point gap between the two on the same task is normal; a 30-point gap is a sign your sim is mis-specified.
The eval that catches problems early
Closed-loop validation on a held-out subset of trajectories: does the policy reach the same states the demos did?
Action distribution diagnostics: histogram of predicted actions vs demo actions. A skewed histogram is an early warning of mode collapse.
Latency and jitter measurement under deployment conditions. A policy that's fast on the dev box and slow on the cell controller is a deployment-day surprise.
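The action-distribution diagnostic can be made quantitative with a per-dimension histogram distance. The sketch below uses total-variation distance on shared bins; the function name, the bin count, and the synthetic "collapsed" data are assumptions for illustration, and any threshold for raising an alarm would have to be calibrated on your own data.

```python
import numpy as np

def action_histogram_gap(pred_actions, demo_actions, bins=20):
    """Per-dimension total-variation distance between action histograms (0 = same, 1 = disjoint)."""
    pred = np.asarray(pred_actions, dtype=float)
    demo = np.asarray(demo_actions, dtype=float)
    gaps = []
    for d in range(demo.shape[1]):
        lo = min(pred[:, d].min(), demo[:, d].min())
        hi = max(pred[:, d].max(), demo[:, d].max(), lo + 1e-9)
        edges = np.linspace(lo, hi, bins + 1)      # shared bins for both histograms
        p, _ = np.histogram(pred[:, d], bins=edges)
        q, _ = np.histogram(demo[:, d], bins=edges)
        gaps.append(0.5 * np.abs(p / p.sum() - q / q.sum()).sum())
    return np.array(gaps)

rng = np.random.default_rng(0)
demo = rng.normal(size=(5000, 7))                  # synthetic 7-D demo actions
pred = rng.normal(size=(5000, 7))                  # healthy predictions on most dimensions
pred[:, 0] = rng.normal(scale=0.05, size=5000)     # dimension 0 collapsed to a narrow mode

gaps = action_histogram_gap(pred, demo)
print(np.round(gaps, 2))
```

A dimension whose gap is far above the others, as dimension 0 is here, is the kind of skew the bullet above describes.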
Worked example: confidence intervals for success rate. You run 50 trials and get 35 successes (70% success rate). What's the 95% confidence interval?
Using the Wilson score interval (preferred for proportions): $\hat{p} = 35/50 = 0.7$, $n = 50$, $z = 1.96$.
$$CI = \frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$
Numerator center: $0.7 + 0.0384 = 0.7384$. Denominator: $1 + 0.0768 = 1.0768$.
Margin: $1.96 \times \sqrt{0.21/50 + 0.00038} = 1.96 \times 0.0677 = 0.1327$.
CI = $\frac{0.7384 \pm 0.1327}{1.0768} = [0.562, 0.809]$.
Your 70% result has a true success rate somewhere in [56%, 81%] with 95% confidence. If a competing method gets 80% on 50 trials, the confidence intervals overlap, so you cannot claim it is better. You need 100+ trials to distinguish a 70% from an 80% success rate.
Comparing two methods: Fisher's exact test
You ran method A and method B on the same task, same resets. Method A: 17/20 successes. Method B: 13/20 successes. Is A significantly better? Your intuition says yes (85% vs 65%), but your intuition is wrong for small sample sizes.
The right tool is Fisher's exact test for a 2×2 contingency table. Construct the table:
| | Success | Failure | Total |
|---|---|---|---|
| Method A | 17 | 3 | 20 |
| Method B | 13 | 7 | 20 |
Fisher's exact test computes the probability of observing a table this extreme (or more extreme) under the null hypothesis that both methods have the same success rate. The result: a one-sided p-value of about 0.14 (about 0.27 two-sided). At the conventional threshold of $p < 0.05$, this is not significant. You cannot claim A is better than B.
How many trials do you need? To distinguish 85% from 65% with $p < 0.05$ (one-sided) at 80% power, the arcsine approximation gives roughly 55–60 trials per method. To distinguish 85% from 75%, it gives roughly 200 trials per method. This is why most single-task comparisons in robotics papers are underpowered: 20 trials per method can only detect differences of 30+ percentage points with confidence.
Fisher's exact test for comparing two methods
from scipy.stats import fisher_exact, norm
import numpy as np

def compare_methods(successes_a, trials_a, successes_b, trials_b):
    """Fisher's exact test: is method A better than method B?"""
    table = [
        [successes_a, trials_a - successes_a],
        [successes_b, trials_b - successes_b],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative='greater')
    return p_value, odds_ratio

# Example: Method A 17/20, Method B 13/20
p, odds = compare_methods(17, 20, 13, 20)
print(f"p = {p:.3f}, odds ratio = {odds:.2f}")
# p = 0.137, odds ratio = 3.05
# NOT significant at p < 0.05. Cannot claim A > B.

# With more trials: Method A 85/100, Method B 65/100
p2, odds2 = compare_methods(85, 100, 65, 100)
print(f"p = {p2:.6f}, odds ratio = {odds2:.2f}")
# Same odds ratio (3.05), but p is now on the order of 0.001.
# Same effect size, but 5x the trials.

# Minimum trials needed to detect a given difference
def min_trials(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate sample size per group (arcsine method, one-sided)."""
    h = 2 * np.arcsin(np.sqrt(p_a)) - 2 * np.arcsin(np.sqrt(p_b))
    z_a = norm.ppf(1 - alpha)
    z_b = norm.ppf(power)
    # Factor of 2: both groups contribute sampling variance to the difference.
    return int(np.ceil(2 * (z_a + z_b)**2 / h**2))

print(min_trials(0.85, 0.65))  # ~56 trials per method
print(min_trials(0.85, 0.75))  # ~196 trials per method
The generalization matrix
A single success rate number tells you almost nothing about how a policy will perform outside its training distribution. A generalization matrix tells you everything. The idea: create a grid where rows are training conditions and columns are test conditions. Each cell is a success rate. The diagonal is in-distribution performance. Off-diagonal cells reveal generalization — or lack thereof.
Worked example: generalization matrix for a pick-and-place policy. The policy was trained on 3 objects (mug, bowl, bottle) in 3 environments (lab table, kitchen counter, warehouse shelf). Evaluated on 50 trials per cell.
Object generalization (trained objects, novel environments):
| Train → Test | Lab table | Kitchen | Warehouse | Outdoor (OOD) |
|---|---|---|---|---|
| Mug | 92% | 78% | 70% | 45% |
| Bowl | 88% | 82% | 68% | 40% |
| Bottle | 90% | 80% | 72% | 48% |
| Banana (OOD) | 52% | 40% | 35% | 22% |
Diagnosis: The in-distribution cells (trained objects on the lab table, the first column) average 90%, which is good in-distribution performance. For trained objects the degradation is smooth: kitchen (-10 pts) and warehouse (-20 pts) are manageable. But the outdoor column (OOD environment) and the banana row (OOD object) show catastrophic drops: the policy is memorizing, not generalizing.
Actionable insight: The largest gap is environment transfer (lab → warehouse: -20pts). This suggests the vision encoder is overfitting to background/lighting features. Fix: either fine-tune with 10 demos per new environment, or switch to a frozen foundation-model encoder (DINOv2 or SigLIP) that provides environment-invariant features.
A policy with 90% on-diagonal and 35% off-diagonal is a memorizer. A policy with 80% on-diagonal and 70% off-diagonal is a generalizer. The latter is almost always more useful in production.
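One way to read the matrix programmatically is to average each environment column over the trained objects and compare it against the lab-table column. The sketch below uses the numbers from the worked example; the function name `environment_gaps` and the convention that the first column is the in-distribution reference are assumptions made for illustration.

```python
import numpy as np

# Success rates (%) from the worked example.
# Rows: mug, bowl, bottle (trained objects), banana (OOD object).
# Columns: lab table, kitchen, warehouse, outdoor (OOD environment).
matrix = np.array([
    [92, 78, 70, 45],
    [88, 82, 68, 40],
    [90, 80, 72, 48],
    [52, 40, 35, 22],
], dtype=float)
environments = ["lab table", "kitchen", "warehouse", "outdoor"]


def environment_gaps(matrix, n_trained_objects=3):
    """Mean success per environment over trained objects, and gap vs. column 0."""
    trained = matrix[:n_trained_objects]
    col_means = trained.mean(axis=0)
    return col_means, col_means - col_means[0]


col_means, gaps = environment_gaps(matrix)
for name, mean, gap in zip(environments, col_means, gaps):
    print(f"{name:10s} mean = {mean:5.1f}%  gap vs lab = {gap:+6.1f} pts")

# Mean success on the OOD object across all environments.
ood_object_mean = matrix[3].mean()
print(f"banana row mean = {ood_object_mean:.2f}%")
```

Tracking these gaps per checkpoint, rather than a single pooled success rate, makes it visible when a change improves in-distribution numbers at the cost of transfer.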
Treat evaluation infrastructure with the same rigor as the policy. The teams that ship reliable robots are the teams whose eval harness is older and better-tested than their model.
24·B Benchmarks & environments
Every result in robot learning was measured somewhere. The reader needs a map of the somewheres.
The benchmark you choose determines what you measure, and what you measure determines what you optimize. A policy that scores 95% on LIBERO may score only 40% on RLBench, not because it got worse, but because the benchmarks test different things. Knowing the landscape means knowing which numbers to trust, which to compare, and which to ignore.
Simulation benchmarks
LIBERO
Liu et al., 2023. 130 manipulation tasks organized into 5 suites of increasing difficulty. Each task comes with a natural-language instruction. Built on robosuite with a Franka Panda arm. LIBERO is the de facto standard for behavior cloning benchmarks in 2025–2026 because it covers the right axes (language, spatial, long-horizon) and provides standardized train/eval splits.
The 5 suites, in detail:
LIBERO-Spatial (10 tasks) — tests spatial reasoning. Same objects, but placed in different spatial configurations. "Put the bowl on the top shelf" vs. "Put the bowl on the bottom shelf." Tests whether the policy understands spatial language, not just object recognition.
LIBERO-Object (10 tasks) — tests object generalization. Same spatial layout, but different objects. The policy must recognize novel objects from language descriptions.
LIBERO-Goal (10 tasks) — tests goal variation. Same objects and layout, but different goals. "Open the drawer" vs. "Close the drawer" vs. "Put the block in the drawer." Tests whether the policy conditions on the instruction rather than memorizing a single behavior.
LIBERO-Long (10 tasks) — 10-step sequential tasks with long horizons. The hardest suite and the most diagnostic. A policy that scores 70%+ on LIBERO-Long is genuinely capable of multi-step manipulation. Errors compound: if per-step success is 95%, the 10-step chain success is $0.95^{10} = 60\%$.
LIBERO-100 (100 tasks) — the full benchmark. Everything at once. This is where aggregate performance is reported for VLA papers.
The standard evaluation protocol: train on 50 demonstrations per task, evaluate over 20 episodes per task with held-out starting conditions, report mean success rate per suite. The 20-episode evaluation is underpowered for per-task claims (Section 24 shows why) but sufficient for suite-level comparisons when aggregated across 10–100 tasks.
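To see why 20 episodes is thin for a single task but workable for a suite, compare the width of a confidence interval at 20 trials against one at 200 trials (10 tasks pooled). The sketch below uses the Wilson score interval; pooling episodes across tasks assumes they are independent and share roughly one success rate, which is a simplification, and the 16/20 and 160/200 counts are made-up illustrative values.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return center - half, center + half


# One task, 20 episodes, 16 successes (hypothetical).
lo_task, hi_task = wilson_interval(16, 20)
# One suite, 10 tasks x 20 episodes, 160 successes (hypothetical).
lo_suite, hi_suite = wilson_interval(160, 200)

print(f"per-task  80%: [{lo_task:.2f}, {hi_task:.2f}]")
print(f"per-suite 80%: [{lo_suite:.2f}, {hi_suite:.2f}]")
```

The per-task interval spans more than 30 percentage points, while the pooled suite interval spans roughly 11, which is the gap between "cannot rank two methods" and "can rank them if they differ by about 10 points".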
LIBERO suites in detail: what each one actually tests.
LIBERO-Spatial (10 tasks): All tasks use the same objects (bowls, plates, blocks) but require spatial reasoning. Examples: "put the red bowl on the top shelf" vs. "put the red bowl on the bottom shelf" vs. "put the red bowl to the left of the plate." A policy that ignores the language instruction and always places the bowl in the same location will score ~10% (1/10 by chance). This suite catches policies that have learned a visual placing heuristic rather than genuine spatial-language grounding.
LIBERO-Object (10 tasks): Same spatial layout, same instructions, but different objects not seen during training. The policy must generalize from "pick up the blue mug" (trained) to "pick up the green bottle" (novel). This tests the vision encoder's object-level features: does it recognize objects by shape and affordance, or by memorized texture? Policies with frozen SigLIP/DINOv2 encoders consistently outperform those with trained-from-scratch encoders on this suite.
LIBERO-Goal (10 tasks): Same objects, same environment, but different goals. "Open the drawer," "close the drawer," "put the block in the drawer," "take the block out of the drawer." The objects and scene are identical — only the language instruction changes the required behavior. This is the purest test of language conditioning: a policy that ignores language will either always open or always close the drawer, scoring ~25%.
LIBERO-Long (10 tasks): Multi-step tasks requiring 5+ subtasks in sequence. Example: "open the top drawer, pick up the red block, place it in the drawer, close the drawer, push the button." Each subtask must succeed for the overall task to succeed. With per-step success of 95%, the 5-step chain success is $0.95^5 = 77\%$. With 90% per-step: $0.9^5 = 59\%$. This suite ruthlessly exposes policies with marginal per-step reliability. It is the hardest LIBERO suite and the most predictive of real-world deployment performance.
LIBERO-100 (100 tasks): The union of all suites plus 60 additional tasks spanning all axes of variation. This is the headline number for VLA papers. State-of-the-art (early 2026): ~85% average success rate across all 100 tasks. The distribution across suites is typically: Spatial 90%, Object 80%, Goal 88%, Long 65%, Other 87%.
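The compounding arithmetic quoted for LIBERO-Long is easy to check and to invert. The sketch below assumes subtasks succeed independently with the same probability, which real policies only approximate; the helper names are illustrative.

```python
def chain_success(per_step_success, n_steps):
    """Probability that all n_steps succeed, assuming independent steps."""
    return per_step_success ** n_steps


def required_per_step(target_chain_success, n_steps):
    """Per-step success rate needed to reach a target chain success rate."""
    return target_chain_success ** (1.0 / n_steps)


for p in (0.95, 0.90):
    for n in (5, 10):
        print(f"per-step {p:.2f}, {n:2d} steps -> chain {chain_success(p, n):.2f}")

# What per-step reliability does a 70% score on a 5-step task imply?
print(f"needed per-step for 70% over 5 steps: {required_per_step(0.70, 5):.3f}")
```

Inverting the formula is the more useful direction in practice: a 70% target on a 5-step chain already demands about 93% reliability on every individual subtask.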
Calvin
Mees et al., 2022. Long-horizon, language-conditioned manipulation in a tabletop environment. The key metric is average chain length: how many sequential subtasks the policy completes before failing. A 5-task chain ("open drawer, pick up block, put in drawer, close drawer, turn on light") requires all 5 in sequence. Calvin is the standard test for whether your policy can chain behaviors — a capability that single-task benchmarks miss entirely.
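As a rough sketch of how an average-chain-length metric can be computed from rollout logs: assume each rollout is summarized by the number of consecutive subtasks completed before the first failure. The function name and the example counts below are hypothetical, not taken from the Calvin codebase.

```python
def chain_metrics(completed_counts, chain_length=5):
    """Summarize chained rollouts.

    completed_counts[i] is the number of consecutive subtasks finished in
    rollout i before the first failure (0..chain_length).
    """
    n = len(completed_counts)
    # success_at[k-1]: fraction of rollouts that completed at least k subtasks.
    success_at = [
        sum(c >= k for c in completed_counts) / n
        for k in range(1, chain_length + 1)
    ]
    avg_len = sum(completed_counts) / n
    return success_at, avg_len


# Hypothetical results from 8 rollouts of a 5-subtask chain.
counts = [5, 3, 0, 5, 2, 4, 1, 5]
success_at, avg_len = chain_metrics(counts)
print("success at k subtasks:", success_at)
print("average chain length:", avg_len)
```

The average chain length equals the sum of the per-position success rates, so the two views carry the same information; the per-position curve just shows where in the chain the policy tends to break.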
MetaWorld
Yu et al., 2020. 50 distinct manipulation tasks (push, pick-place, drawer open, door close, etc.) designed for meta-learning and multi-task RL. Each task has a parametric family of initial conditions. MetaWorld is the standard RL manipulation benchmark — if your RL method can't solve MetaWorld-ML45, it can't solve manipulation. The limitation: the tasks are simpler than LIBERO's and the observations are state-based (no images), so it tests motor learning, not perception.
RLBench
James et al., 2020. 100 tasks with language descriptions, keyframe-based demonstrations, and multi-view RGB-D observations. Built on CoppeliaSim/PyRep. RLBench is older and more varied than LIBERO but harder to use (CoppeliaSim is finicky). The 3D policy community (3D Diffuser Actor, PerAct, GNFactor) uses RLBench as their primary benchmark because it provides point clouds natively.
Language Table
Lynch et al., 2023. Simple 2D pushing tasks on a flat surface with language instructions ("push the red block to the blue block"). Deliberately minimal — the point is to test language grounding in isolation, without the confound of complex manipulation. Google's internal benchmark; used in RT-1, RT-2, and SayCan evaluations.
SimplerEnv
Li et al., 2024. A standardized simulation environment for evaluating VLAs, specifically designed to correlate with real-world performance. SimplerEnv recreates the exact setups used in real-robot VLA evaluations (Google's kitchen, Bridge V2 tasks) in simulation, so you can run 1000 eval episodes instead of 50.
Why SimplerEnv matters: most VLA researchers do not have access to a real robot. Even those who do cannot afford to run 1000 trials per checkpoint for hyperparameter tuning. SimplerEnv provides a simulated version of the Bridge V2 real-world setup (WidowX arm, kitchen table, the exact objects used in real experiments) and the Google Robot setup. The sim environments are carefully calibrated so that sim success rate correlates with real success rate ($R^2 \approx 0.8$ on the Bridge V2 tasks). This means you can iterate on policy architecture and training in sim, and only do the expensive real-robot evaluation on the final candidate. The key finding: not all sim metrics correlate with real performance. Raw success rate in a visually simple sim does not predict real success rate. But success rate in a sim with calibrated visual appearance and physics does.
RoboCasa
Nasiriany et al., 2024. Kitchen manipulation at scale: 100+ tasks across multiple kitchen layouts with procedurally generated scenes, realistic object assets, and language instructions. Built on robosuite. RoboCasa's contribution is scene diversity — testing whether your policy transfers across counter heights, cabinet configurations, and lighting conditions that vary between kitchens.
Simulation benchmarks at a glance
| Benchmark | Tasks | Modalities | Typical use | Key metric |
|---|---|---|---|---|
| LIBERO | 130 | RGB, language, proprio | BC evaluation, VLA fine-tuning | Success rate (per suite) |
| Calvin | 34 | RGB, language, proprio | Long-horizon chaining | Avg chain length (1–5) |
| MetaWorld | 50 | State (no images) | RL, meta-learning | Success rate |
| RLBench | 100 | RGB-D, language, point cloud | 3D policies, keyframe methods | Success rate |
| Language Table | ~20 | RGB, language | Language grounding | Success rate |
| SimplerEnv | ~15 | RGB, language | VLA sim-to-real correlation | Success rate (sim–real R²) |
| RoboCasa | 100+ | RGB, language, proprio | Kitchen generalization | Success rate |
Real-world datasets
Open X-Embodiment
Open X-Embodiment Collaboration, 2024. The ImageNet of robot learning. Over 1 million trajectories collected across 22 different robot embodiments from 21 institutions. Tasks range from simple pick-and-place to kitchen manipulation to mobile navigation.
What is in OXE: 22 robot types (Franka, WidowX, Google Robot, Kuka, UR5, xArm, and more), standardized in the RLDS (Reinforcement Learning Datasets) format. Each trajectory contains: RGB images (1–3 cameras), proprioceptive state, actions (in the robot's native action space), language instructions (when available), and metadata (robot type, camera intrinsics, action space definition). The heterogeneity is the point: a VLA trained on OXE sees grasping from a 7-DOF Franka, from a 6-DOF WidowX, and from the Google Robot mobile manipulator. The shared visual semantics (all of them are picking up mugs) provide the transfer signal; the action-space differences are handled by per-embodiment action heads or zero-padding.
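The zero-padding option can be sketched in a few lines: pad every action to a shared width and carry a mask so the loss ignores the padded dimensions. The shared width of 8, the function names, and the example action sizes are assumptions for illustration, not the layout used by any particular OXE-trained model.

```python
import numpy as np

MAX_ACTION_DIM = 8  # hypothetical shared action width across embodiments


def pad_action(action, max_dim=MAX_ACTION_DIM):
    """Zero-pad a native action vector to max_dim and return a validity mask."""
    action = np.asarray(action, dtype=np.float32)
    if action.shape[-1] > max_dim:
        raise ValueError("action is wider than the shared action space")
    padded = np.zeros(max_dim, dtype=np.float32)
    mask = np.zeros(max_dim, dtype=bool)
    padded[: action.shape[-1]] = action
    mask[: action.shape[-1]] = True
    return padded, mask


def masked_mse(pred, target, mask):
    """Mean squared error over the valid (unpadded) dimensions only."""
    err = (np.asarray(pred) - np.asarray(target)) ** 2
    return float((err * mask).sum() / mask.sum())


# Two hypothetical embodiments with different native action sizes.
wide_action, wide_mask = pad_action(np.ones(7))    # e.g. 6D pose delta + gripper
narrow_action, narrow_mask = pad_action(np.ones(5))

pred = np.zeros(MAX_ACTION_DIM, dtype=np.float32)
print(masked_mse(pred, wide_action, wide_mask))      # padded dims do not dilute the loss
print(masked_mse(pred, narrow_action, narrow_mask))
```

Without the mask, the model would be rewarded for predicting zeros in dimensions that do not exist for a given robot, which biases the shared head toward small actions.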
The lesson from OXE: dataset breadth matters more than dataset cleanliness for foundation-model pre-training. Models trained on OXE's messy mix outperform models trained on any single clean dataset, even when the eval is on tasks from that single dataset. Diversity is a regularizer.
Dataset size vs. quality tradeoff
A persistent finding: Bridge V2 (60K demos, high quality, single embodiment) often beats subsets of Open X-Embodiment (100K+ demos, noisy quality, mixed embodiments) for single-task performance on the WidowX arm. The reason: quality compounds. A noisy demo teaches the policy to recover from mistakes that should never happen in the first place. A clean demo teaches the policy the right behavior directly. For single-task, single-embodiment deployment, 1,000 clean demos beat 10,000 noisy ones.
For foundation-model pretraining, the tradeoff flips. The noise is tolerable because the model sees millions of transitions and the noise averages out. The diversity is essential because it provides coverage over the space of possible tasks, objects, and environments. The rule of thumb: pretrain on everything, fine-tune on clean data for your specific task.
Dataset quality vs. size: a concrete comparison. You are fine-tuning a VLA for "pick up mug and place on coaster" on a WidowX arm. Two data options:
Option A: Bridge V2 subset. 50 clean demonstrations, all from the same lab with consistent camera angle, lighting, and operator quality. Each demo is 30–45 seconds, with smooth trajectories and no hesitation or correction mid-grasp. Total: 50 demos × ~350 timesteps = 17,500 training samples.
Option B: Open X subset. 200 demonstrations from 4 different labs using WidowX arms. Heterogeneous camera angles (some overhead, some side-mount), variable lighting (some fluorescent, some natural), different operator styles (some fast and aggressive, some cautious with mid-trajectory corrections). ~30% of demos contain sub-optimal behaviors (two grasps, hesitation, near-drops). Total: 200 demos × ~350 timesteps = 70,000 training samples.
Result: Option A (50 clean) achieves 82% success rate. Option B (200 noisy) achieves 74% success rate. 4× more data produced a worse policy. Why? The noisy demos teach the policy to hesitate and self-correct — behaviors that are reasonable for a human teleoperator but catastrophic for an autonomous policy that cannot recover from a near-drop.
Option C: Bridge V2 50 clean + 20 targeted demos of failure recoveries. 84% success rate. The targeted demos (deliberately collecting demonstrations of "pick up mug from awkward orientation") add coverage where the original 50 demos are sparse. Quality + targeted diversity beats quantity.
The rule: For single-task fine-tuning, 50 clean demos > 200 noisy demos. For multi-task pretraining, 200 diverse noisy demos > 50 clean demos of one task. Know which regime you are in.
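If you want to screen a noisy dataset before fine-tuning, a crude heuristic is to flag demos that show the two symptoms named above: repeated grasp attempts and long pauses. The sketch below is one possible filter, not a method from Bridge V2 or Open X-Embodiment; the function name, the 10 Hz timestep, and both thresholds are assumptions you would tune on your own data.

```python
import numpy as np


def demo_quality_flags(ee_positions, gripper_closed, dt=0.1,
                       pause_speed=0.005, max_pause_frac=0.2):
    """Flag a demo with repeated grasps or long pauses (illustrative thresholds).

    ee_positions: (T, 3) end-effector positions in meters.
    gripper_closed: (T,) booleans, True while the gripper is closed.
    """
    ee_positions = np.asarray(ee_positions, dtype=float)
    gripper_closed = np.asarray(gripper_closed, dtype=bool)
    # Speed between consecutive timesteps, in m/s.
    speed = np.linalg.norm(np.diff(ee_positions, axis=0), axis=1) / dt
    pause_frac = float(np.mean(speed < pause_speed))
    # Count open -> closed transitions; more than one suggests a re-grasp.
    grasp_attempts = int(np.sum(~gripper_closed[:-1] & gripper_closed[1:]))
    flagged = grasp_attempts > 1 or pause_frac > max_pause_frac
    return {"pause_frac": pause_frac, "grasp_attempts": grasp_attempts,
            "flagged": bool(flagged)}


# Hypothetical clean demo: steady motion, one grasp.
clean_pos = np.linspace([0.0, 0.0, 0.0], [0.3, 0.0, 0.1], 50)
clean_grip = np.array([False] * 25 + [True] * 25)

# Hypothetical noisy demo: a long pause mid-trajectory and a re-grasp.
a, b, c = np.array([0.0, 0.0, 0.0]), np.array([0.2, 0.0, 0.05]), np.array([0.3, 0.0, 0.1])
noisy_pos = np.vstack([np.linspace(a, b, 20), np.repeat(b[None], 20, axis=0),
                       np.linspace(b, c, 10)])
noisy_grip = np.array([False] * 15 + [True] * 10 + [False] * 10 + [True] * 15)

clean_report = demo_quality_flags(clean_pos, clean_grip)
noisy_report = demo_quality_flags(noisy_pos, noisy_grip)
print(clean_report)
print(noisy_report)
```

A filter like this only ranks candidates for review; watching the flagged videos before discarding them is still the safer workflow, since deliberate recovery demos can trip the same thresholds.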
DROID
Khazatsky et al., 2024. 76,000 demonstrations collected from a distributed network of Franka Panda robots across multiple institutions. The contribution is scale + consistency: same robot, same camera setup, diverse tasks and environments. DROID fills the gap between "lab-scale" datasets (50–500 demos) and the full heterogeneity of OXE. For single-embodiment pre-training on Franka, DROID is the starting point.
Bridge V2
Walke et al., 2023. 60,000 demonstrations on a WidowX-250 robot arm, covering kitchen manipulation tasks with language labels. The small-scale gold standard: cheap hardware ($3K per setup), good task diversity, well-curated labels. Bridge V2 is what academic labs use when they can't afford a fleet of Frankas. Many VLA papers report Bridge V2 fine-tuning results as their primary real-world benchmark.
Simulators
MuJoCo / MJX
DeepMind. The default physics engine for locomotion and manipulation RL. MuJoCo provides accurate contact dynamics, fast simulation (~10M steps/hour on CPU), and is free since 2022. MJX is the JAX-compiled GPU-accelerated variant: 4096 parallel environments on a single GPU, enabling RL training runs that would take days on CPU to complete in hours. MJX is what makes massively parallel PPO (the locomotion recipe) practical.
Isaac Sim / Lab / Gym
NVIDIA. The full NVIDIA simulation ecosystem. Isaac Sim provides photorealistic rendering (Omniverse/USD), Isaac Lab provides the RL training harness, and Isaac Gym provides massively parallel GPU-accelerated physics (up to 65K parallel environments). The NVIDIA stack dominates when you need both visual realism and massive parallelism — which is the sim-to-real recipe for locomotion and dexterous manipulation.
Genesis
Genesis Team, 2025. An open-source physics engine that unifies rigid-body (PhysX backend), soft-body (MPM), and fluid (SPH) simulation in a single Python-native framework. Genesis's pitch is physics breadth: a single simulator that handles rigid grasping, deformable cloth, and fluid pouring, with GPU acceleration throughout. Early but promising for tasks that involve multiple physics regimes.
SAPIEN
Xiang et al., 2020. Specializes in articulated-object manipulation: cabinets, drawers, faucets, and other objects with internal degrees of freedom. SAPIEN provides realistic contact dynamics for articulated objects and integrates with the PartNet-Mobility dataset of 3D articulated assets. The go-to for any task involving opening, closing, or manipulating mechanisms.
Robosuite
Zhu et al., 2020. A manipulation benchmark framework built on MuJoCo. Provides standardized task definitions, data collection pipelines, and baseline implementations. LIBERO and RoboCasa are both built on robosuite. If you're doing BC research on manipulation, robosuite is likely your simulation backend whether you know it or not.
Simulators at a glance
| Simulator | Physics | GPU parallel | Visual realism | Best for |
|---|---|---|---|---|
| MuJoCo / MJX | Rigid + contact (accurate) | MJX: 4096+ envs | Low (functional) | Locomotion RL, fast prototyping |
| Isaac Sim/Lab/Gym | PhysX 5 (rigid + soft) | 65K+ envs | High (ray-traced) | Sim-to-real, dexterous, locomotion |
| Genesis | PhysX + MPM + SPH | Yes (CUDA) | Medium | Multi-physics tasks (cloth, fluid) |
| SAPIEN | PhysX (articulated focus) | Limited | Medium-high | Articulated object manipulation |
| Robosuite | MuJoCo | Via MuJoCo | Low-medium | BC/IL benchmark framework |
The eval crisis
Results do not transfer across benchmarks. A policy that achieves 90% on LIBERO-Spatial may achieve 60% on an equivalent RLBench task and 75% in the real world. The reasons are structural:
Task definitions differ. "Pick up the mug" in LIBERO means something different than in RLBench: different mug models, different table heights, different success thresholds, different camera viewpoints.
Observation spaces differ. LIBERO provides 128×128 RGB; RLBench provides 128×128 RGB-D from 4 cameras; MetaWorld provides state vectors. A policy optimized for one observation format may fail on another even if the underlying task is identical.
Action spaces differ. Joint positions, end-effector deltas, absolute end-effector poses, and discrete bins are all common, and a policy architecture that shines in one may struggle in another.
Evaluation protocols differ. 20 episodes vs 50 vs 100. Fixed seeds vs random. Deterministic resets vs stochastic. These procedural choices affect the measured success rate by 5–15 points.
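The action-space mismatch is concrete enough to show in code. A minimal sketch, assuming position-only end-effector actions (orientation and gripper omitted for brevity), of converting an absolute-pose trajectory into per-step deltas and back; the function names are illustrative.

```python
import numpy as np


def absolute_to_delta(positions):
    """Convert absolute end-effector positions (T, 3) to per-step deltas (T-1, 3)."""
    positions = np.asarray(positions, dtype=float)
    return np.diff(positions, axis=0)


def delta_to_absolute(start, deltas):
    """Reconstruct absolute positions (T, 3) from a start point and deltas."""
    start = np.asarray(start, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    return np.vstack([start, start + np.cumsum(deltas, axis=0)])


# Hypothetical 4-step reach trajectory in meters.
trajectory = np.array([
    [0.40, 0.00, 0.30],
    [0.42, 0.01, 0.28],
    [0.45, 0.02, 0.25],
    [0.45, 0.02, 0.20],
])
deltas = absolute_to_delta(trajectory)
recovered = delta_to_absolute(trajectory[0], deltas)
print(deltas)
```

The round trip is exact on recorded data, but at execution time delta actions accumulate tracking error step by step while absolute targets do not, which is one reason the same policy can score differently under the two conventions.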
What to do about it: report results on at least two benchmarks. Use SimplerEnv to validate sim-to-real correlation before investing in real-robot evaluations. Report confidence intervals (Section 24 shows how). And when reading papers, compare methods that used the same evaluation protocol — not just the same benchmark name.
The eval transfer problem
A policy that achieves 95% on LIBERO-Spatial in simulation may achieve only 60% on the physical equivalent of the same task. The gap is not a bug — it is a structural feature of the sim-to-real boundary that persists even when the policy architecture and training data are identical. SimplerEnv was built specifically to measure and characterize this gap.
SimplerEnv provides simulated reconstructions of two real-world evaluation setups: the Bridge V2 WidowX setup (Berkeley) and the Google Robot setup (Google DeepMind). For each setup, SimplerEnv recreates the exact table geometry, camera placement, object meshes, and lighting conditions used in the real experiments. Researchers then evaluate their policies in both the SimplerEnv simulation and the real setup, producing paired (sim score, real score) data points.
The key finding: the sim-to-real correlation coefficient varies by benchmark and is typically $r = 0.5$–$0.7$. This means sim performance is predictive of real performance but with substantial noise. Specifically:
Bridge V2 tasks (SimplerEnv): $r \approx 0.65$. A policy scoring 90% in sim typically scores 65–80% in real. The gap comes from contact dynamics (sim grasps are more reliable than real grasps) and visual differences (sim lighting is more uniform).
Google Robot tasks (SimplerEnv): $r \approx 0.55$. The larger robot has more actuator dynamics that the sim does not capture perfectly. A policy scoring 85% in sim may score anywhere from 55% to 75% in real.
Visual fidelity matters: SimplerEnv with calibrated textures and lighting ($r \approx 0.65$) correlates better than a generic MuJoCo sim with default textures ($r \approx 0.35$). This confirms that the visual gap, not just the physics gap, drives sim-to-real transfer failure.
The practical implication: use sim eval for relative comparisons (is checkpoint A better than checkpoint B?) but not for absolute performance estimates (will this policy work at 90% in the real world?). A 5-point improvement in SimplerEnv reliably predicts a real-world improvement, even though the absolute numbers do not match. This makes SimplerEnv ideal for hyperparameter tuning and architecture search, where you need to rank candidates — not predict their exact real-world performance.
Worked example: benchmark selection for a VLA project. You're fine-tuning SmolVLA for a kitchen manipulation deployment.
Development eval: LIBERO-Long (10-step tasks, fast iteration, language-conditioned). Run 500+ episodes per checkpoint in simulation. Track per-suite success rate.
Sim-to-real validation: SimplerEnv (correlates with real performance). Run the top-3 checkpoints from LIBERO on SimplerEnv's Google Robot tasks. Pick the one with the highest SimplerEnv score.
Real-robot eval: 50 trials per task, 5 tasks, with confidence intervals. Per-stage success rate for multi-step tasks. Video all trials. This is the number that matters.
Total eval budget: ~50K sim episodes (overnight on 1 GPU) + 250 real-robot trials (~2 days with resets). The sim eval catches 80% of failures; the real eval catches the rest.
The sim-to-real eval correlation problem
SimplerEnv was built to address the single most expensive problem in VLA development: you cannot run thousands of real-robot trials for every hyperparameter sweep. The solution: build a simulator calibrated so that sim performance predicts real performance, then use sim as a proxy for development decisions.
The calibration process: run the same set of policies (10–20 checkpoints spanning a range of training steps, architectures, and hyperparameters) on both the SimplerEnv simulation and the real robot. Plot sim success rate vs. real success rate. If the correlation is high ($r > 0.6$), sim rankings predict real rankings: the checkpoint that is best in sim is likely best in real. The absolute numbers may differ (sim 90% ↔ real 70%), but the ordering transfers.
The correlation is highest for tasks with simple contact dynamics (pick-and-place) and lowest for tasks with complex contacts (articulated manipulation, deformable objects). This matches expectations: the physics gap is smallest for tasks where contact does not dominate the dynamics.
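The calibration check described above reduces to two numbers per benchmark: a Pearson correlation on the paired success rates and a rank correlation that asks whether the ordering transfers. The sketch below uses NumPy only; the six paired scores are invented for illustration and are not SimplerEnv results.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


def ranks(values):
    """Ascending ranks 0..n-1; assumes no ties in the input."""
    order = np.argsort(values)
    out = np.empty(len(values), dtype=float)
    out[order] = np.arange(len(values))
    return out


def spearman_rho(x, y):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical paired success rates for 6 checkpoints.
sim = np.array([0.90, 0.82, 0.75, 0.70, 0.60, 0.45])
real = np.array([0.70, 0.66, 0.52, 0.58, 0.40, 0.30])

r = pearson_r(sim, real)
rho = spearman_rho(sim, real)
same_winner = int(np.argmax(sim)) == int(np.argmax(real))
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}, same best checkpoint: {same_winner}")
```

For checkpoint selection the rank correlation and the "same winner" check matter more than the absolute offset between sim and real, which matches the advice to use sim for relative rather than absolute comparisons.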
A benchmark is not a task. A benchmark is an operationalization of a task — with specific observation formats, action spaces, success criteria, and evaluation protocols. Two benchmarks that look like "the same task" may test entirely different capabilities. Know what your benchmark actually measures before trusting its numbers.
25 The road ahead
A field manual is a snapshot. The map will be different next year.
The picture this manual draws is a rough consensus that did not exist three years ago: imitate at scale with a foundation-model backbone, polish with RL when it pays for itself, evaluate honestly, ship the policy as one piece of a controlled system.
Where the field actually stands, May 2026
Still true: the two-system VLA is the dominant generalist architecture; flow matching has overtaken plain DDPM; cross-embodiment training pays off; co-training with web data is mandatory; HIL-SERL is the only RL recipe competitive with BC + lots of data on real hardware.
New since 2025: autoregressive VLAs caught up with diffusion via FAST tokenization; π₀.₇ added open-world generalization and multi-scale embodied memory; GR00T N1/N1.5 demonstrated synthetic data from video diffusion at scale; Gemini Robotics 1.5 split reasoning from control via tool calls; Helix 02 added a System-0 whole-body motion prior; SmolVLA showed 450M-parameter models can match 7B baselines; 3D and equivariant policies started making sample-efficiency arguments the data-rich camp can no longer dismiss.
Data, still
Every plot of policy success rate versus dataset size is a line that has not yet bent. The cheapest way to a better policy in 2026 is more demonstrations.
Synthetic data and the new bitter lesson
The most interesting development of the past year is that generative models are themselves becoming a data source for robot policies. GR00T N1 trains on neural-generated trajectories from video diffusion; DexMimicGen and MimicGen synthesize new demos from a small seed of real ones; Genesis and Newton push the upper bound on what physics simulators can model. The "bitter lesson 2.0" version of the field's debate is no longer "does scaling work" — it works — but "what is the cheapest source of marginal data?" Increasingly, the answer is a generative model.
The simulator question
Real-world data is rich and expensive. Simulation is fast and lossy. Closing the gap with better physics simulators (Genesis, MuJoCo MJX, NVIDIA Newton), neural simulators (world models trained on real video), digital twins, and synthetic-data pipelines is an open contest. The eventual answer is probably all of the above, layered, not one of them dominating.
Tactile, force, and the contact-rich plateau
Vision-only policies are reaching their plateau on contact-rich tasks. Force-torque sensing helps when wired in correctly. Tactile arrays (DIGIT, GelSight, ReSkin, AnySkin) help even more when there is data. The bottleneck remains large, diverse, well-labeled tactile datasets.
Whole-body humanoid control
Humanoids force the field to confront a problem manipulation policies have ignored: the policy and the locomotion controller are not independent. The architecture has not converged. The unresolved questions are: who owns balance — the controller or the policy? — and how do you train a policy that has to walk, reach, and stay up at the same time?
Continual and lifelong learning
A robot that ships, runs in a customer's facility, and never improves from the data it generates there is leaving most of its potential on the table. The infrastructure to safely fine-tune deployed policies on deployed data — without catastrophic forgetting, without privacy violations, without dangerous regressions — does not yet exist as a productized standard. The π₀.₇ "RL Token" and the inference-time online RL machinery from Physical Intelligence are the closest thing in the open literature. The full version of this is what 2027 looks like.
Read the loss compendium until you can sketch every loss from memory. Pick one architecture and rebuild it from scratch. Run it on a real robot with a real evaluation harness. Don't chase the latest VLA — train your eye to recognize which parts are new and which parts are the same DDPM you already know.
The staff engineer's takeaway
The hard problems are still where they were five years ago: data quality, eval rigor, deployment safety, the gap between a notebook result and a customer-deployed policy. Architecture is not the bottleneck for any team you join. Be the person who makes the dataloader fast, the eval suite honest, the deployment robust, the safety filter trustworthy. That role is undersupplied, undervalued, and load-bearing.
A decision framework
| Situation | Architecture | Action space | Training |
|---|---|---|---|
| Single task, 50-200 demos, known object | ACT or Diffusion Policy | Joint positions | BC, 1 GPU, 1 day |
| Single task, need precision | Diffusion Policy + CLIP encoder | Relative EE 6D | BC, 1 GPU, 2 days |
| Multi-task, language conditioned | Fine-tune OpenVLA or SmolVLA | Discrete tokens / FAST | BC, 8 GPUs, 1 week |
| High-precision contact-rich | Diffusion Policy + RL fine-tune | Relative EE | BC then HIL-SERL, 2 hours real |
| Locomotion | PPO + domain rand | Joint torques | RL, sim only, 4096 envs |
| Humanoid whole-body | Two-system VLA | Hierarchical | BC + motion prior + RL |
| Data-scarce manipulation | 3D Diffusion Policy | Relative EE | BC, 10-50 demos |
A robot policy is a piece of code that closes a feedback loop with the physical world. The architecture is the smallest part of what makes it work. The rest of the work — the data, the evals, the deployment, the trust — is where this field will spend the next decade.