The Robot Learning Stack — A Field Manual

Dimension	Classical control	Learned policies
Accuracy (known dynamics)	Millimeter-precise, provably optimal	Sub-optimal; limited by data and generalization
Generalization	Zero — each system is hand-tuned	Transfers across objects, scenes, embodiments
Model requirement	Full analytical model (URDFs, mass, friction)	None — learns from data
Contact handling	Fragile; requires mode-switching logic	Handles contact implicitly from demonstrations
Task diversity	One controller per task	One policy, many tasks (with enough data)
Safety guarantees	Formal (Lyapunov stability, bounds)	Empirical only — no worst-case guarantees
Data requirement	Physics knowledge (free but expert-intensive)	Demonstrations or sim experience (expensive)

Domain	Training data	Feedback loop	Failure cost
Image classification	Millions of labeled images (cheap)	None — iid data	Wrong label (harmless)
Language modeling	Trillions of tokens (free)	None — iid data	Bad text (harmless)
Game RL (Atari, Go)	Billions of sim steps (free)	Yes, in simulation	Lost game (harmless)
Robot learning	Hundreds of demos ($$$)	Yes, in physics	Broken robot ($10K+)

Component	Rate	Latency budget	What it does
Low-level servo	1–10 kHz	< 1 ms	PD controller that tracks joint position/torque targets. Runs on the robot's internal controller.
Policy inference	5–50 Hz	20–200 ms	Neural network forward pass: observation → action. This is the bottleneck.
Vision encoder	5–30 Hz	10–100 ms	ResNet, ViT, or SigLIP encodes camera images into feature vectors. Often dominates inference time.
Action head	5–50 Hz	1–50 ms	Converts features to actions. MSE head: 1ms. Diffusion head: 10–50ms (iterative denoising).
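The rates and latency budgets in the table above are tied together by simple arithmetic: if the vision encoder and action head run back to back, their summed latency caps the policy rate. A minimal sketch, assuming no pipelining between stages; the function name and the specific latencies are illustrative values picked from inside the table's ranges, not measurements.

```python
def max_policy_rate_hz(vision_ms: float, head_ms: float, overhead_ms: float = 0.0) -> float:
    """Upper bound on the policy rate when the stages run back to back."""
    return 1000.0 / (vision_ms + head_ms + overhead_ms)


# Hypothetical latencies picked from inside the ranges in the table.
mse_rate = max_policy_rate_hz(vision_ms=30.0, head_ms=1.0)
diffusion_rate = max_policy_rate_hz(vision_ms=30.0, head_ms=50.0)
print(f"MSE head: {mse_rate:.1f} Hz, diffusion head: {diffusion_rate:.1f} Hz")
```

With these assumed numbers the diffusion head, not the encoder, sets the ceiling, which is why the step-reduction methods listed under "Latency" matter.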

POMDP element	Concrete instantiation	Dim.
State $s_t$	Joint angles, velocities, object poses, contacts, friction	~50–1000
Observation $o_t$	2 RGB images (480×640×3) + 15 proprioceptive values	1.8M raw; ~1039 encoded
Action $a_t$	7 joint targets or 6D EE delta + gripper	7
Horizon $T$	100–300 steps at 10 Hz = 10–30 seconds	—
Reward $r_t$	+1 at success, 0 otherwise (sparse)	1

Problem	Response	Section
Compounding error	Action chunking (predict $H$ steps at once)	06
Compounding error	DAgger (train on policy's own states)	04
Compounding error	RL fine-tuning (correct via reward signal)	14
Multimodality	Diffusion / flow / VQ action heads	05, 08, 09
Distribution shift	Temporal ensembling, relative actions	02, 06
Reality gap	Domain randomization, sim-to-real	15–16
Latency	DDIM, consistency distillation, flow matching	08, 09
All of the above	Scale (pre-trained VLAs amortize learning)	10–13

Representation	What the policy outputs	Use when
Joint positions	Target $q_t \in \mathbb{R}^n$ for a position controller running at 500–1000Hz underneath	Bimanual tabletop, precise contact (ALOHA / ACT)
Joint velocities	Target $\dot q_t$	Compliant control, when integration drift is acceptable
EE pose, abs.	$T \in SE(3)$ for the end-effector, solved by IK	Cross-embodiment, when the body shouldn't matter
EE pose, rel.	$\Delta T$ relative to current pose	UMI, Diffusion Policy — robust to recovery from drift
Torques	$\tau_t$	Locomotion, rich contact, sim-trained policies

Property	Behavior Cloning	Reinforcement Learning	Model-Based
Data	Expert demos (expensive per sample, cheap per step)	Self-generated rollouts (free per sample, slow per step)	Any experience (demos or rollouts)
Compute	Low — standard supervised learning	High — millions of env steps in sim, or slow real-world rollouts	Medium — model learning + planning
Sim required?	No — train on real data	Usually yes — real-world RL is too sample-hungry without one	Helpful but not required
Failure mode	Compounding error, multimodal averaging, distribution shift	Reward hacking, sparse reward starvation, sim-to-real gap	Model error compounds in planning horizon
Best for	Manipulation with teleop data, rapid prototyping, pre-training VLAs	Locomotion, dexterous manipulation, fine-tuning after BC	Sample-efficient exploration, when dynamics are learnable
Ceiling	Limited by demonstrator skill — cannot exceed expert	Unbounded in principle — can discover superhuman strategies	Limited by model accuracy — compounding model error is the analog of compounding BC error

Head type	Multimodality	Inference	Pros	Cons
MSE (Gaussian)	None — collapses to mean	1 pass (fastest)	Simple, fast, good baseline	Averages modes → invalid actions
GMM	K modes (fixed)	1 pass + sample	Interpretable, moderate expressiveness	Must choose K; mode collapse risk
Discretized	Arbitrary per-dim	1 pass (fast)	Native to LMs, easy tokenization	Loses cross-dim correlations; resolution limited by the bin count B
Energy-based	Arbitrary	50–100 grad steps	Maximally expressive in theory	Slow inference, sampling brittleness
Diffusion	Arbitrary	10–100 denoise steps	Expressive, stable, SOTA fidelity	Multi-step latency
Flow matching	Arbitrary	1–10 ODE steps	Fewer steps, clean objective	Newer, less battle-tested
VQ	Arbitrary (codebook)	AR over codes	Full correlation, discrete sampling	Two-stage pipeline, codebook collapse
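The "collapses to mean" entry for the MSE head can be checked with a toy dataset. The sketch below assumes a one-dimensional action with two equally frequent expert modes (pass an obstacle on the left or on the right); the data and names are made up for illustration.

```python
import statistics

# Two equally good demonstrations: steer left (-1.0) or steer right (+1.0).
demo_actions = [-1.0] * 50 + [1.0] * 50


def mse(pred: float, targets: list[float]) -> float:
    """Mean squared error of a single constant prediction."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


# The constant that minimizes MSE is the sample mean.
mse_optimal = statistics.fmean(demo_actions)

print("MSE-optimal action:", mse_optimal)
print("loss at the mean:", mse(mse_optimal, demo_actions))
print("loss at a real mode:", mse(1.0, demo_actions))
```

The loss-minimizing prediction sits between the two modes, at an action no demonstrator ever took. That is the failure the multimodal heads in the table are designed to avoid.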

System	$H$	$K$	Ensemble	Why
Diffusion Policy	16	8	No	Execute half the chunk. Good balance for general manipulation.
ACT (ALOHA)	100	1	Yes	Re-predict every step, ensemble for smoothness. Fine bimanual tasks.
RT-2	1	1	No	Single-step. Relies on autoregressive tokens for coherence.
$\pi_0$	50	16–25	No	Long chunk + flow matching. Executes 16–25 actions open-loop before re-predicting; no temporal ensembling.

Knob	Default	If you change it
Chunk H	100	Smaller = more reactive, less smooth. Below 20, multimodality issues return.
Latent dim	32	Bigger latents over-fit; KL regularizer barely scales.
KL weight β	10	Higher β → posterior collapse (the latent is ignored); lower β → the latent carries more information but drifts from the prior used at test time.
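Ensemble α	0.01	With weights exp(−α·i) and i = 0 the oldest prediction, larger α favors older predictions (slower to react); smaller α weights all predictions more evenly and incorporates new observations faster.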
Image size	480×640	Wrist cams justify the compute; lower is fine for scene cams.
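The ensemble knob is easier to reason about with the weighting written out. This sketch assumes the convention from the ACT paper, w_i = exp(−α·i) with i = 0 the oldest prediction for the current timestep; implementations that index from the newest prediction reverse the effect of α. The function name is illustrative.

```python
import math


def ensemble_action(predictions_oldest_first: list[float], alpha: float = 0.01) -> float:
    """Weighted average of every prediction made for the current timestep.

    Index 0 is the oldest prediction and receives weight exp(-alpha * 0) = 1.
    """
    weights = [math.exp(-alpha * i) for i in range(len(predictions_oldest_first))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions_oldest_first)) / total


# Two overlapping chunks disagree about the action for this timestep.
old_and_new = [0.0, 1.0]
uniform = ensemble_action(old_and_new, alpha=0.0)    # plain average
default = ensemble_action(old_and_new, alpha=0.01)   # close to the plain average
old_heavy = ensemble_action(old_and_new, alpha=5.0)  # dominated by the oldest prediction
```

With the default α of 0.01 the average is nearly uniform over the overlapping chunks, so the smoothing comes mostly from how many chunks overlap.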

Property	Diffusion (DDIM)	Flow Matching
Path shape	Curved (through noise schedule)	Straight lines (linear interpolant)
Training target	Noise $\epsilon$ or clean $a_0$	Velocity $a_1 - a_0$
Time parameterization	Discrete $k \in \{1, \ldots, K\}$ + schedule	Continuous $t \in [0,1]$ + uniform sampling
Inference sampler	DDIM deterministic reverse	ODE Euler integration
Typical steps	16	5–10
Schedule tuning	Cosine / linear / learned $\beta_k$	None (uniform $t$)
Code complexity	~100 lines for schedule + sampler	~40 lines total
Quality	State-of-the-art	Comparable to slightly better
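The flow-matching column can be illustrated with a toy Euler sampler. In this sketch `oracle_velocity` is a stand-in for a trained network: it returns the closed-form velocity of the straight-line path toward one known target action, which only exists in this toy setting. All names are illustrative.

```python
def oracle_velocity(x: list[float], t: float, target: list[float]) -> list[float]:
    """Velocity of the straight path toward a single known target.

    On the linear interpolant x_t = (1 - t) * a0 + t * a1 this equals a1 - a0.
    A learned network would replace this function in a real policy.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def euler_sample(noise: list[float], target: list[float], n_steps: int = 5) -> list[float]:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = oracle_velocity(x, k * dt, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.2]
target_action = [0.5, 0.1]
sampled = euler_sample(noise, target_action, n_steps=5)
```

Because the path is a straight line, a handful of Euler steps lands on the target here. A learned velocity field is only approximately straight, which is why real systems still use several steps.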

11 UMI and the data shift

The most important paper of 2024 is not about a model. It is about a stick.

Universal Manipulation Interface (Chi et al., 2024) is a handheld parallel-jaw gripper with a GoPro camera, two side mirrors, fingertip fiducial markers, and an optional IMU. A human picks it up and performs the task. Software extracts the 6-DoF gripper trajectory from visual SLAM and the gripper width from the fiducials; the resulting (image, EE-pose, gripper-width) trajectory is used to train a Diffusion Policy. That policy is then transferred to a real robot with the same parallel-jaw end-effector.

UMI is not a new architecture. The policy on top is vanilla Diffusion Policy. The contribution is the data layer — and the contribution is large enough to reshape the field.

Why this works

Embodiment is the gripper, not the arm. If the policy outputs relative EE poses and gripper width, the body that holds the gripper does not need to match between collection and deployment. A human's wrist is a perfectly good "robot arm" for data purposes.
Mirrors give multi-view from one camera. The fisheye GoPro plus side mirrors yields three pseudo-views in a single frame. The policy gets multi-camera robustness from a single sensor.
SLAM gives proprioception. No motion-capture rig, no instrumented environment. The trajectory is recovered from the camera's own motion.
Latency-matched action representation. UMI shifts the predicted action sequence forward in time to compensate for robot actuation delay, so a policy trained on instantaneous-human-motion data still works on a robot with $\sim$200ms latency.

The policy stack

Two-step observation history of (RGB, EE pose, gripper width).
CLIP-pretrained ViT vision encoder; the EE-pose history is an MLP-encoded vector token.
Transformer Diffusion Policy denoiser predicting 16 future relative-EE-pose + gripper-width steps.
Receding horizon: execute first 8, replan.
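The receding-horizon step in the list above (predict 16, execute 8, replan) can be sketched as a loop. The policy and environment here are one-dimensional toys invented for illustration; only the loop structure reflects the text.

```python
def run_receding_horizon(policy, step_env, obs, total_steps, horizon=16, n_execute=8):
    """Predict `horizon` actions, execute the first `n_execute`, then replan."""
    n_policy_calls = 0
    executed = 0
    while executed < total_steps:
        chunk = policy(obs, horizon)
        n_policy_calls += 1
        for action in chunk[:n_execute]:
            if executed >= total_steps:
                break
            obs = step_env(obs, action)
            executed += 1
    return obs, n_policy_calls


def toy_policy(obs: float, horizon: int) -> list[float]:
    """Hypothetical policy: a chunk of identical small steps toward the goal at 0."""
    return [-0.1 * obs] * horizon


def toy_env(obs: float, action: float) -> float:
    """Hypothetical environment: the relative action is added to the state."""
    return obs + action


final_obs, calls = run_receding_horizon(toy_policy, toy_env, obs=1.0, total_steps=150)
print(f"{calls} policy calls for 150 control steps, final state {final_obs:.2e}")
```

Executing 8 of 16 predicted steps means the policy is called once per 8 control ticks, so a 150-step episode needs 19 forward passes rather than 150.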

The hardware, unpacked

The UMI gripper is deliberately low-tech. A 3D-printed parallel-jaw gripper body, a GoPro Hero 10 in a fisheye housing, two planar mirrors angled at ~45° on each side, an ArUco fiducial sticker on each fingertip, and an optional IMU for gravity-aligned orientation. Total hardware cost: under $200. The human picks up this device like a pair of tongs and performs the task naturally — no robot arm, no teleoperation rig, no motion-capture suit.

The key physical insight: the gripper is the embodiment that matters. A parallel-jaw gripper has one degree of freedom (open/close width) plus 6-DoF end-effector pose. Whether that gripper is held by a human hand or mounted on a Franka, a UR5, or a Sawyer does not change the gripper's interaction with the object. By decoupling data collection from the robot, UMI reduces the cost of one demonstration to approximately 15 seconds of human effort, with no marginal hardware cost.

Why relative EE actions are essential

The UMI gripper has no absolute reference frame. It does not know where it is in the room, only where it is relative to where it just was. This forces the entire pipeline to operate in relative end-effector coordinates: the action at time $t$ is $\Delta T_t = T_{t+1} \cdot T_t^{-1}$, a relative SE(3) transform. This is a feature, not a limitation:

Robot-agnostic. The same relative-EE actions deploy on any robot that has a parallel-jaw gripper and an operational-space controller.
Translation-invariant. The policy does not memorize workspace positions. It learns motions relative to the current gripper pose, which generalizes across table heights, object placements, and starting configurations.
No calibration. There is no camera-to-robot-base transform to estimate. The SLAM trajectory is in the camera's own coordinate frame, and relative actions cancel the frame origin.
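The claim that relative actions cancel the frame origin can be checked numerically. The sketch below assumes $T_t$ is the world-to-gripper transform (the extrinsic convention many SLAM systems report); under that assumption $T_{t+1} \cdot T_t^{-1}$ is unchanged when the world frame is moved. If poses are stored gripper-to-world instead, the frame-independent delta is $T_t^{-1} \cdot T_{t+1}$. Helper names and pose values are illustrative.

```python
import math

Mat = list[list[float]]


def matmul(a: Mat, b: Mat) -> Mat:
    """Product of two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def rigid_inverse(T: Mat) -> Mat:
    """Inverse of [R p; 0 1] is [R^T, -R^T p; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    p = [T[i][3] for i in range(3)]
    new_p = [-sum(Rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose_z(yaw: float, x: float, y: float, z: float) -> Mat:
    """Transform with a rotation about z and a translation (enough for a demo)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x], [s, c, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def relative_action(T_t: Mat, T_next: Mat) -> Mat:
    """Delta T_t = T_{t+1} * T_t^{-1}, as written in the text."""
    return matmul(T_next, rigid_inverse(T_t))


T_t = pose_z(0.10, 0.50, 0.00, 0.30)
T_next = pose_z(0.20, 0.52, 0.01, 0.30)

# Re-express both poses after an arbitrary change of world frame G:
# a world-to-gripper transform T becomes T * G^{-1}.
G_inv = rigid_inverse(pose_z(1.0, 2.0, -1.0, 0.4))
delta_original = relative_action(T_t, T_next)
delta_moved = relative_action(matmul(T_t, G_inv), matmul(T_next, G_inv))

max_diff = max(abs(delta_original[i][j] - delta_moved[i][j]) for i in range(4) for j in range(4))
print(f"largest difference between the two deltas: {max_diff:.1e}")
```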

The latency matching trick

A human demonstrating with the UMI gripper reacts instantly — the delay between intention and motion is effectively zero. A robot has actuation latency: commands issued at time $t$ are not executed until $t + \delta$, where $\delta \approx 100$–$300$ms depending on the robot's control pipeline. If you train on human-speed data and deploy on a latency-ridden robot, the policy is always "behind" — it predicts actions that were appropriate $\delta$ milliseconds ago.

UMI's fix is a time shift of the action labels. During data processing, the labels are shifted forward by $\lceil \delta / \Delta t \rceil$ steps (where $\Delta t$ is the control period). At step $t$, the training target is the action the human actually performed at step $t + \lceil \delta / \Delta t \rceil$. At deployment, the robot's actuation delay means the action arrives at the gripper at approximately the right time. This is a simple index shift in the trajectory array, but without it the policy consistently undershoots and lags behind the task.
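The index shift described above can be sketched in a few lines. This is a minimal illustration, not the UMI codebase: the function name is hypothetical, and latency and control period are passed as integer milliseconds so the ceiling is computed without floating-point rounding.

```python
def shift_actions(observations: list, actions: list, latency_ms: int, control_period_ms: int):
    """Pair observation t with the action recorded ceil(latency / period) steps later."""
    shift = -(-latency_ms // control_period_ms)  # integer ceiling division
    pairs = list(zip(observations, actions[shift:]))  # zip drops the unmatched tail
    return pairs, shift


# One 15-second episode at 10 Hz, with placeholder observations and actions.
obs = list(range(150))
acts = [f"a{t}" for t in range(150)]
pairs, shift = shift_actions(obs, acts, latency_ms=200, control_period_ms=100)
print(shift, len(pairs), pairs[0])
```

The last `shift` observations of each episode have no label and are dropped, so the shift costs a few samples per episode.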

Collection cost comparison

System	Hardware cost	Setup time	Time per demo	Expert needed?	Robot needed?
UMI	~$200 (gripper + GoPro)	Minutes	~15s	No	No
ALOHA teleoperation	~$30K (full ALOHA rig)	Hours (calibration)	~20s + reset	Trained teleop	Yes
DexCap (dexterous)	~$500 (glove + cameras)	30 min	~20s	No	No
Kinesthetic teaching	$0 (use robot)	Minutes	~30s + reset	Robot operator	Yes
VR teleoperation	~$1K (Quest headset)	30 min	~25s + reset	Trained teleop	Yes

The cost difference is not incremental — it is structural. UMI removes the robot from the data collection loop entirely. A lab can hand 10 UMI grippers to 10 undergrads and collect 1,000 demonstrations in an afternoon. No scheduling the robot cell, no teleoperator training, no reset scripts. This is the reason UMI's impact exceeds its technical novelty.

The full data pipeline

The UMI data pipeline has five stages, each solving a specific problem that arises from collecting robot data without a robot:

Stage 1: Raw video capture. The GoPro records 4K video at 30fps with a fisheye lens. The wide field of view captures the gripper, the object, and the surrounding scene in every frame. The side mirrors extend the effective FOV to nearly 270°, so the camera can "see" objects approaching from the sides that a standard lens would miss. Raw output: $\sim$450 frames per 15-second trial (15 s × 30 fps) at 4K resolution.

Stage 2: SLAM-based pose estimation. ORB-SLAM3 processes the fisheye video to recover the camera's 6-DoF pose at each frame. Because the camera is rigidly mounted to the gripper, the camera pose determines the gripper pose up to a fixed, known offset. The SLAM system uses visual features (ORB keypoints) to track the camera's motion through 3D space. ArUco fiducial markers on the gripper fingertips provide a secondary measurement of gripper width: the distance between the two fiducials in the camera image, scaled by the known fiducial size, gives the finger separation to $\pm$1mm.

Why SLAM and not AprilTags alone? A common alternative is to place AprilTag fiducials in the workspace and track the gripper relative to them. This requires instrumenting the environment (taping tags to the table, walls, etc.) and limits data collection to that specific workspace. SLAM is environment-agnostic — it builds its own map of the scene on the fly. The UMI gripper can be used in a kitchen, an office, or outdoors without any setup. This is the difference between "robotics-grade data collection" and "anyone can do it anywhere."

Stage 3: Temporal resampling. The raw trajectories are at 30fps (camera rate). The robot policy will run at 10Hz (the standard control frequency for manipulation). Linear interpolation for positions and SLERP (spherical linear interpolation) for rotations downsample each trajectory from 30fps to 10Hz while preserving smooth motion. Each 15-second trial becomes 150 timesteps.
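A minimal sketch of the resampling stage, assuming poses are stored as a position list plus a unit quaternion in (w, x, y, z) order. The function names and the synthetic trajectory are illustrative; a production pipeline would more likely use a library implementation of SLERP.

```python
import math


def lerp(p0, p1, u):
    """Linear interpolation between two positions."""
    return [a + u * (b - a) for a, b in zip(p0, p1)]


def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # q and -q are the same rotation; take the shorter arc
        q1 = tuple(-c for c in q1)
        dot = -dot
    theta = math.acos(min(1.0, dot))
    if theta < 1e-8:  # nearly identical rotations
        return tuple(q0)
    s = math.sin(theta)
    w0 = math.sin((1.0 - u) * theta) / s
    w1 = math.sin(u * theta) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


def resample(times, positions, quats, out_hz, n_out):
    """Sample the trajectory at t = 0, 1/out_hz, 2/out_hz, ... (n_out samples)."""
    out = []
    j = 0
    for k in range(n_out):
        t = k / out_hz
        while j + 2 < len(times) and times[j + 1] <= t:
            j += 1
        u = (t - times[j]) / (times[j + 1] - times[j])
        u = min(max(u, 0.0), 1.0)
        out.append((t, lerp(positions[j], positions[j + 1], u), slerp(quats[j], quats[j + 1], u)))
    return out


# Synthetic 15-second trial at 30 fps: constant velocity along x, slow yaw.
times = [i / 30 for i in range(451)]
positions = [[t, 0.0, 0.0] for t in times]
quats = [(math.cos(0.05 * t), 0.0, 0.0, math.sin(0.05 * t)) for t in times]
trajectory_10hz = resample(times, positions, quats, out_hz=10, n_out=150)

# Halfway between identity and a 90-degree yaw is a 45-degree yaw.
halfway = slerp((1.0, 0.0, 0.0, 0.0), (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)), 0.5)
```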

Stage 4: Action space conversion. Absolute 6-DoF poses are converted to relative actions: $\Delta T_t = T_{t+1} \cdot T_t^{-1}$. The relative transform is decomposed into a position delta $(\Delta x, \Delta y, \Delta z)$ and a rotation delta (three Euler angles or a rotation vector). Combined with gripper width, the action vector is 7-dimensional: $a_t = [\Delta x, \Delta y, \Delta z, \Delta\text{roll}, \Delta\text{pitch}, \Delta\text{yaw}, w_{\text{gripper}}]$.

Stage 5: Latency compensation and normalization. Actions are shifted forward by $\lceil \delta / \Delta t \rceil$ timesteps to account for robot actuation delay $\delta$. All action dimensions are normalized to zero mean, unit variance based on the training set statistics. Images are resized to 224×224 and normalized to match the vision encoder's expected input distribution.
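The normalization half of this stage can be sketched as follows. This follows the zero-mean, unit-variance description in the text; the function names are illustrative, and the guard for constant dimensions is an added assumption to avoid division by zero.

```python
import statistics


def fit_normalizer(actions: list[list[float]]):
    """Per-dimension mean and standard deviation over the training set."""
    dims = list(zip(*actions))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) or 1.0 for d in dims]  # constant dimension: leave scale at 1
    return means, stds


def normalize(action, means, stds):
    return [(a - m) / s for a, m, s in zip(action, means, stds)]


def denormalize(z, means, stds):
    """Inverse map, applied to policy outputs at deployment."""
    return [x * s + m for x, m, s in zip(z, means, stds)]


# Toy 2-D actions; the second dimension never changes.
train_actions = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
means, stds = fit_normalizer(train_actions)
normalized = [normalize(a, means, stds) for a in train_actions]
restored = denormalize(normalized[2], means, stds)
```

The statistics must be saved with the checkpoint, since the same `means` and `stds` are needed to map predicted actions back to metric units on the robot.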

Generalizations of the UMI idea

The UMI principle — decouple data collection from the robot by matching the end-effector — has been extended to other embodiments, each with its own proxy device:

System	End-effector	Proxy device	Pose estimation	Key innovation
UMI	Parallel jaw gripper	Handheld gripper + GoPro	SLAM + ArUco	Original concept; $200 BOM
DexCap	Dexterous hand (16-DoF)	Glove with finger tracking	Multi-camera hand pose	Per-finger retargeting
HumanPlus	Humanoid whole-body	Human body + motion capture	RGB pose estimation	Full-body teleoperation-free demos
AnyTeleop	Various	VR headset + controllers	VR tracking	Universal retargeting across embodiments

The deeper lesson

For two decades the bottleneck on imitation learning was data — specifically, synchronized expert action data, which is expensive because it requires a robot. UMI shows that for parallel-jaw manipulation, much of that data can be collected without a robot at all, by humans acting through a handheld proxy. The implications cascade: cross-embodiment datasets, in-the-wild collection by non-experts, scaled-out pretraining corpora.

The same idea has been generalized: DexCap for dexterous hands, HumanPlus for whole-body humanoids, and a long tail of "make a thing a human can wear or hold to record actions" projects. The common thread is that the action space of the gripper or hand is shared between human and robot; everything else can vary.

Worked example: UMI data pipeline. A human picks up the UMI gripper and performs a "pick up cup and place it on the saucer" task 50 times. Each trial takes ~15 seconds at 30fps = 450 frames. The pipeline:
1. SLAM trajectory extraction: ORB-SLAM3 on the GoPro fisheye camera recovers the 6-DoF gripper pose at each frame. Typical precision: ±2mm position, ±1° rotation. ArUco fiducials on the gripper fingers track gripper width.
2. Resampling: Camera runs at 30fps; the robot policy runs at 10Hz. Resample trajectories to 10Hz using linear interpolation for positions and SLERP for rotations. Each 15s trial → 150 timesteps.
3. Relative action conversion: Convert absolute EE poses to relative: $\Delta T_t = T_{t+1} \cdot T_t^{-1}$. This is the action the robot will predict.
4. Latency compensation: The robot has ~200ms actuation latency. Shift the action sequence forward by $200\text{ms} / 100\text{ms/step} = 2$ timesteps. At time $t$, the action label is what the human did at $t + 2$.
5. Dataset: 50 episodes × 150 steps = 7,500 observation-action pairs. The 2-step observation history costs one step per episode and the 2-step latency shift costs two, leaving 50 × 147 = 7,350 training samples. This is enough to train a Diffusion Policy to ~85% success on this specific task.
Total time: 50 × 15s = 12.5 minutes of human demonstration time. No robot involved in data collection.
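The arithmetic in this worked example can be reproduced directly. The accounting for the 7,350 figure is an assumption: it treats each episode as losing one step to the 2-frame observation history and two steps to the latency shift.

```python
import math

episodes = 50
trial_s = 15
camera_fps = 30
policy_hz = 10
latency_ms = 200
obs_history = 2

frames_per_trial = trial_s * camera_fps                   # raw video frames per trial
steps_per_trial = trial_s * policy_hz                     # timesteps after resampling
shift_steps = math.ceil(latency_ms * policy_hz / 1000)    # latency shift in steps
raw_pairs = episodes * steps_per_trial

# Assumed accounting: (obs_history - 1) steps lack a full history,
# and the last shift_steps steps lack a shifted action label.
usable_per_trial = steps_per_trial - (obs_history - 1) - shift_steps
training_samples = episodes * usable_per_trial
collection_minutes = episodes * trial_s / 60

print(frames_per_trial, steps_per_trial, shift_steps, raw_pairs, training_samples, collection_minutes)
```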

The data pipeline, formally

The full transformation from raw UMI video to training-ready dataset can be expressed as a sequence of six operators:

UMI data processing pipeline:
$$ \mathcal{D} = \text{Norm} \circ \text{Shift}_\delta \circ \text{Rel} \circ \text{Resamp}_{f} \circ \text{SLAM} \circ \text{Video} $$

$\text{Video}$ — raw GoPro video at 30fps, 4K resolution. Contains fisheye RGB frames with side-mirror views.
$\text{SLAM}$ — ORB-SLAM3 extracts 6-DoF camera pose $(R_t, p_t) \in SE(3)$ per frame + ArUco-based gripper width $w_t \in \mathbb{R}$.
$\text{Resamp}_f$ — temporal resampling from 30fps to control frequency $f$ (typically 10Hz). Uses linear interpolation for $p_t$ and SLERP for $R_t$.
$\text{Rel}$ — convert absolute poses to relative actions: $\Delta T_t = T_{t+1} \cdot T_t^{-1} \in SE(3)$, decomposed into $(\Delta x, \Delta y, \Delta z, \Delta \text{r}, \Delta \text{p}, \Delta \text{y})$.
$\text{Shift}_\delta$ — latency compensation. Shift action labels forward by $\lceil \delta / (1/f) \rceil$ timesteps.
$\text{Norm}$ — normalize action dimensions to zero mean, unit variance.

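Each operator is deterministic. $\text{Norm}$ and $\text{Rel}$ can be undone given the stored normalization statistics and the initial pose; $\text{SLAM}$, $\text{Resamp}_f$, and $\text{Shift}_\delta$ discard information (pixels, intermediate frames, and the unmatched steps at the episode boundary). The pipeline takes ~10 minutes per 50 episodes on a laptop CPU, dominated by SLAM processing.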

Worked example: UMI deployment on a new robot. The lab has trained a Diffusion Policy on 50 UMI demonstrations of "pick up cup and place on saucer." They want to deploy it on a Franka Panda with a Robotiq 2F-85 gripper.
Step 1: Hardware matching. The Robotiq 2F-85 is a parallel-jaw gripper with similar finger geometry to the UMI gripper. Mount a fisheye camera (ideally the same GoPro and lens used for collection) at the same relative position as the GoPro on the UMI gripper. The camera-to-gripper transform is measured once with a calibration pattern.
Step 2: Action space mapping. The policy outputs relative EE poses $(\Delta x, \Delta y, \Delta z, \Delta\text{roll}, \Delta\text{pitch}, \Delta\text{yaw})$ and gripper width. The Franka's operational-space controller accepts Cartesian velocity commands and a gripper position target. The mapping is: $v_{\text{EE}} \approx \Delta T / \Delta t$ (the position and rotation deltas divided by the control period), $w_{\text{target}} = w_{\text{gripper}}$ (the predicted width is already an absolute target). The Franka's inverse kinematics solver converts EE velocity to joint velocities internally.
Step 3: Latency recalibration. The Franka + Robotiq system has ~150ms actuation latency (vs 200ms assumed during UMI training). Two options: (a) retrain with the correct latency shift, which requires re-processing the data (10 minutes of compute) and retraining the policy; (b) add a 50ms software delay to match the training assumption. Option (b) is simpler and works in practice.
Step 4: Deploy. Run the trained policy in a receding-horizon loop at 10Hz. The camera captures the scene, the policy predicts 16 relative-EE actions, and the first 8 are sent to the robot. Success rate: ~80% (vs ~85% in the original setup). The 5-point gap comes from minor differences in finger geometry and gripper stiffness.
Total deployment effort: ~2 hours (camera mounting, calibration, latency tuning). No retraining of the policy network.

The UMI ecosystem in 2026

The UMI design has spawned a family of handheld data-collection devices, each targeting a different end-effector morphology. The common principle — decouple data collection from the robot by building a human-holdable proxy of the end-effector — has been validated across grippers, dexterous hands, and even whole-body humanoid motion. The limitation remains end-effectors with high force requirements (industrial grippers, heavy-payload arms) where a human cannot replicate the necessary forces. For these, teleoperation remains necessary.

The UMI codebase is open-source (MIT license), with documented instructions for 3D-printing the gripper body, sourcing the GoPro and ArUco markers, and running the data pipeline. Multiple research groups have independently reproduced the setup and confirmed the reported results. This reproducibility is itself a contribution: UMI is not just a paper — it is a protocol that any lab can adopt in an afternoon.

Scaling the UMI approach

The natural question: if 50 demos cost 12.5 minutes of human time, how far can you scale? The answer is limited not by collection cost but by task diversity. Collecting 10,000 demos of the same pick-and-place task yields diminishing returns after ~500. The leverage comes from collecting across many tasks and environments: 50 demos each of 200 different tasks, collected by 20 different humans in 10 different kitchens. This is the UMI vision: not a data collection tool for one lab, but a data collection protocol for the entire field.

Several groups have begun organized UMI data collection campaigns: multiple research labs sharing a common task protocol, shipping UMI grippers to collaborators, and aggregating the resulting datasets into cross-institution training corpora. The target: 100K diverse demonstrations across 50+ tasks and 10+ environments, collected at a cost that would be impossible with robot teleoperation.

Worked example: UMI from zero to deployed policy in 6 hours. You want a robot that folds dish towels. You have never collected robot data before. Here is exactly what happens.
Hour 0–0.5: Hardware setup. Unbox the UMI gripper kit ($200). Attach the GoPro Hero 10 to the fisheye mount. Stick the ArUco fiducials to the fingertips. Charge the GoPro battery. Download the UMI data-processing codebase. Total prep: 30 minutes, no robotics expertise required.
Hour 0.5–2.5: Data collection. Collect 50 demonstrations of the towel-folding task. Each demo: pick up the UMI gripper, grasp one corner of the towel, fold it in half, release. Each trial takes ~30 seconds of actual manipulation plus ~10 seconds of repositioning the towel. The GoPro records continuously. You accumulate 50 × 30s = 25 minutes of task-relevant video at 30fps → 45,000 frames. A non-expert undergraduate can do this with 5 minutes of verbal instruction.
Hour 2.5–3: Data processing. Run the UMI pipeline on a laptop:
(1) Hand keypoint detection: MediaPipe processes each frame to detect the human's hand, confirming that the gripper is being held and providing coarse hand-pose priors. Runtime: ~5 minutes for 45K frames on a laptop GPU.
(2) 6-DoF wrist pose estimation: ORB-SLAM3 processes the fisheye video to recover the camera (= gripper) trajectory. The ArUco markers on the fingertips give gripper width at each frame. Runtime: ~15 minutes, dominated by SLAM.
(3) Relative EE action extraction: Resample from 30fps to 10Hz, then convert absolute SE(3) poses to relative: $\Delta T_t = T_{t+1} \cdot T_t^{-1}$. Shift forward by 2 timesteps for latency compensation (assuming 200ms robot delay). Runtime: seconds.
(4) Normalization: Compute per-dimension mean and standard deviation across all 50 episodes. Normalize actions to zero mean, unit variance. Resize images to 224×224.
Hour 3–3.5: Dataset verification. Replay the extracted trajectories in a visualizer. Check that the gripper poses track the actual motion. Check that gripper-width labels match the actual open/close events. Discard any demos where SLAM lost tracking (typically 2–5 out of 50). Final dataset: ~45 good episodes × ~300 timesteps each (towel folding is longer than a quick pick) = ~13,500 training samples.
Hour 3.5–6: Training. Train a Diffusion Policy on 1 GPU (RTX 4090). Architecture: CLIP-pretrained ViT encoder (frozen) for images, MLP encoder for EE-pose history, Transformer denoiser predicting 16 future action steps. Batch size 256, 200K gradient steps, ~2.5 hours. The loss curve should plateau by 150K steps.
Deployment: Mount a fisheye camera on the robot's Robotiq gripper at the same relative position as the GoPro on the UMI gripper. Run the trained policy at 10Hz in receding-horizon mode (predict 16, execute 8). Expected first-attempt success rate: 60–75% (towel folding is deformable manipulation, which is harder than rigid pick-and-place). With 30 more targeted demos on failure cases and a 1-hour retrain: 80–85%.
Total wall-clock: 6 hours from unboxing to a deployed towel-folding policy. Total compute cost: ~$2 of GPU time. Total human labor: ~3 hours of active work.

Scaling: how many demos for how hard a task?

The number of demonstrations required depends on the task's complexity along three axes: precision of the required motion, variability of the objects and scene, and number of sequential stages. The following table summarizes empirical findings across multiple UMI deployment campaigns:

Task category	Example	Demos needed	Expected success	Bottleneck
Simple pick-and-place	Pick up a mug, place on coaster	10–20	85–95%	Gripper alignment with handle
Moderate pick-and-place	Stack 3 blocks in a specific order	30–50	75–85%	Sequencing and block-pose variation
Precise insertion	Insert USB connector into port	80–120	70–80%	Sub-millimeter alignment, contact forces
Articulated manipulation	Open a drawer, place object inside	50–80	75–85%	Handle grasp + pull trajectory
Deformable manipulation	Fold a towel, tie a knot	200–500+	60–80%	Fabric state is high-dimensional and stochastic
Tool use	Use a spatula to flip a pancake	100–200	50–70%	Tool-object interaction dynamics
Multi-step kitchen	Pour, stir, plate (5+ stages)	300–500	40–60%	Error compounds across stages

The scaling is sub-linear in task complexity but super-linear in precision requirements. Doubling the number of demos typically adds 5–10 percentage points of success rate until saturation. The saturation point — where more demos stop helping — is determined by the policy architecture's capacity and the irreducible stochasticity of the task (a towel falls differently each time, and no amount of data can eliminate that variance). Beyond saturation, the next lever is environment diversity: collecting the same task in 10 different kitchens with 10 different towels beats collecting 10x more demos in one kitchen with one towel.

Data quality signals: how to spot a bad UMI demo

Not all UMI demos are usable. The SLAM trajectory can fail silently, producing a dataset that looks complete but contains garbage actions. Three quality signals to check before training:

SLAM tracking loss. ORB-SLAM3 reports a "tracking state" per frame. If the tracker enters "lost" state for more than 5 consecutive frames, the recovered pose after re-localization will have a discontinuous jump. Discard any episode with a position jump > 5cm between consecutive timesteps after resampling.
Gripper width consistency. The ArUco-based gripper width should change smoothly (< 2mm/timestep at 10Hz). A spike > 5mm in one timestep means the fiducial detection failed (occlusion, motion blur). Interpolate over short gaps (< 3 frames); discard episodes with long gaps.
Action magnitude distribution. Compute the L2 norm of the relative action at each timestep across all episodes. The distribution should be unimodal with a thin tail. Episodes whose mean action magnitude is > 3 standard deviations from the population mean are likely corrupted (SLAM drift, human fumble, accidental recording). These should be reviewed manually before inclusion.
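The first two signals reduce to threshold checks on the resampled episode. The sketch below is a simplification under stated assumptions: it takes a per-step boolean "lost" flag as input rather than calling any SLAM API, and it rejects on any gripper-width spike instead of interpolating short gaps. Function and argument names are illustrative.

```python
import math


def check_episode(positions_m, widths_mm, lost_flags,
                  max_jump_m=0.05, max_width_spike_mm=5.0, max_lost_run=5):
    """Return (passes, reasons) for one resampled episode."""
    reasons = []
    # Position jump between consecutive timesteps (re-localization artifact).
    if any(math.dist(a, b) > max_jump_m for a, b in zip(positions_m, positions_m[1:])):
        reasons.append("position jump")
    # Gripper-width spike (fiducial detection failure).
    if any(abs(b - a) > max_width_spike_mm for a, b in zip(widths_mm, widths_mm[1:])):
        reasons.append("gripper-width spike")
    # Longest run of consecutive 'lost' tracking states.
    run = longest = 0
    for lost in lost_flags:
        run = run + 1 if lost else 0
        longest = max(longest, run)
    if longest > max_lost_run:
        reasons.append("tracking lost")
    return not reasons, reasons


# A clean toy episode: 1 cm steps, gripper closing 1 mm per step.
good = check_episode(
    positions_m=[[0.01 * i, 0.0, 0.0] for i in range(10)],
    widths_mm=[80.0 - i for i in range(10)],
    lost_flags=[False] * 10,
)
# A toy episode with a 10 cm jump and a long tracking dropout.
bad = check_episode(
    positions_m=[[0.0, 0.0, 0.0], [0.10, 0.0, 0.0]],
    widths_mm=[80.0, 80.0],
    lost_flags=[True] * 6,
)
```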

In a typical UMI data collection campaign, 5–10% of episodes fail quality checks. This is a tolerable loss rate, since it takes 15 seconds to collect another demo. The alternative (no quality filtering) produces a dataset where roughly 5–10% of the training samples have corrupted action labels, which can reduce final policy success rate by 10–15 points.

A simple automated pipeline: after running the SLAM + resampling pipeline on all episodes, compute the per-episode statistics (mean action norm, max gripper-width delta, SLAM tracking loss count). Flag episodes that exceed 2.5 standard deviations from the population mean on any metric. Manually review only the flagged episodes (typically 10–15% of the total) and discard the truly corrupted ones. This takes 5–10 minutes for a 50-episode dataset and catches the most damaging outliers without requiring frame-by-frame inspection of every demo.
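The flagging step can be sketched with per-episode statistics held in a dictionary. The metric names and the synthetic values below are illustrative; only the 2.5-standard-deviation rule comes from the text.

```python
import statistics


def flag_outliers(stats: dict[str, dict[str, float]], threshold: float = 2.5) -> set[str]:
    """Flag episodes whose value on any metric is > threshold std devs from the mean."""
    flagged = set()
    metrics = next(iter(stats.values())).keys()
    for metric in metrics:
        values = [episode[metric] for episode in stats.values()]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        if std == 0.0:  # metric is constant across episodes; nothing to flag
            continue
        for name, episode in stats.items():
            if abs(episode[metric] - mean) / std > threshold:
                flagged.add(name)
    return flagged


# Twenty ordinary episodes plus one with an unusually large mean action norm.
episode_stats = {
    f"ep{i:02d}": {"mean_action_norm": 0.010 + 0.0001 * i, "max_width_delta_mm": 1.0}
    for i in range(20)
}
episode_stats["ep_bad"] = {"mean_action_norm": 0.050, "max_width_delta_mm": 1.0}

to_review = flag_outliers(episode_stats)
print(sorted(to_review))
```

One caveat with z-score flagging on small datasets: a single extreme episode inflates the standard deviation, so a second, milder outlier can hide behind it. Re-running after removing confirmed bad episodes is a cheap mitigation.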

Architectures saturate. Data does not. The single highest-leverage move in modern robot learning is finding a way to collect more demonstrations faster, more cheaply, and from less specialized labor.

12 Vision–Language–Action models

When the policy is a frozen LLM with a different output head — and increasingly, with two heads running at different speeds.

A VLA is a single network that ingests images and natural-language instructions and emits robot actions. The bet behind every VLA is that the abstractions a model learns from internet-scale vision-language data — objects, affordances, spatial relations, intent — transfer to robotics, and that they transfer better than anything you could pretrain on robot data alone. By 2026 the bet has paid off, the architectures have converged, and the open question is no longer "do VLAs work" but "what fraction of the stack should be the VLM versus the action expert, and at what frequencies."

The lineage, in one table

Model	Year	Backbone	Action head	Notable
RT-1	2022	EfficientNet + USE + FiLM	Discrete tokens (256 bins)	First scaled VLA recipe; 35M params; 130k demos.
RT-2	2023	PaLI-X / PaLM-E (12B–55B)	Tokens overloaded into LLM vocab	First true VLA; web + robot co-finetuning.
Octo	2024	Custom transformer (27M / 93M)	Diffusion (continuous)	Open. Goal-image or language; 800k demos.
OpenVLA	2024	Llama-2 7B + DINOv2 + SigLIP	Discrete tokens	Open RT-2 recipe; 970k demos.
RDT-1B	2024	DiT (1B)	Diffusion	Bimanual specialist; 1M+ episodes.
π₀	2024	PaliGemma 3B + 300M expert	Flow matching	50Hz bimanual; cross-embodiment training.
π₀-FAST	2025	Same backbone	Autoregressive on FAST tokens	5× faster training; matches diffusion quality.
π₀.₅	2025	PaliGemma + action expert	Flow matching	Open-world generalization; new kitchens/bedrooms.
π₀.₇	2026	+ MEM, RL Token	Flow + RL fine-tuning	Steerable; multi-scale memory; >10-min tasks.
GR00T N1	2025	Eagle-2 VLM (1.34B) + DiT	Diffusion / flow matching	Humanoid; 2.2B; 63.9ms / 16-action chunk.
Helix	2025	7B VLM at 7–9Hz	Visuomotor at 200Hz	35-DOF upper body; runs on Jetson Orin; <100ms.
SmolVLA	2025	SmolVLM (450M)	Flow matching expert	Compact; matches 10× larger models on benchmarks.

The two-system split, explicitly

The convergent architecture of 2026 has two unequal halves. A large vision-language model — the slow brain — observes the scene at 5–10Hz and emits either a latent plan, a chain-of-thought string, or a sequence of FAST tokens. A small action expert — the fast brain — runs at 50–200Hz, reads the latest observation plus the slow brain's output, and produces continuous joint or end-effector commands. The split is what makes language-conditioned humanoid control viable: a 7B forward pass per control tick is not feasible; a 7B forward pass per plan with a 100M expert per tick is.
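The two-rate structure can be sketched as a single loop in which the slow model is called only on some ticks. This is a synchronous simplification: real systems typically run the two halves asynchronously, and `slow_brain` and `fast_brain` are placeholder stubs, not any model's API.

```python
def slow_brain(tick: int) -> str:
    """Stand-in for the large VLM; returns a label instead of a latent plan."""
    return f"plan@{tick}"


def fast_brain(tick: int, plan: str) -> tuple:
    """Stand-in for the action expert; returns a dummy command."""
    return (tick, plan)


def run_two_rate(n_seconds: int, slow_hz: int = 10, fast_hz: int = 50):
    """Call the fast model every tick and the slow model every fast_hz / slow_hz ticks."""
    if fast_hz % slow_hz != 0:
        raise ValueError("this sketch assumes fast_hz is a multiple of slow_hz")
    ticks_per_plan = fast_hz // slow_hz
    slow_calls = 0
    plan = ""
    commands = []
    for tick in range(n_seconds * fast_hz):
        if tick % ticks_per_plan == 0:
            plan = slow_brain(tick)  # expensive call, amortized over several ticks
            slow_calls += 1
        commands.append(fast_brain(tick, plan))
    return slow_calls, commands


slow_calls, commands = run_two_rate(n_seconds=2)
print(f"{slow_calls} slow calls, {len(commands)} fast calls")
```

Each fast tick reuses the most recent plan, so the action expert is always conditioned on a plan that is at most one slow period old.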

[Figure: two-system architecture diagram. Legend: System 2 (slow), System 1 (fast), bridge signal.]

The five families of bridges

Different VLAs disagree about what the slow brain sends to the fast brain:

Hidden states. π₀ and GR00T pass the VLM's last-layer hidden states through cross-attention into the action expert. Highest bandwidth; tightest coupling; requires joint training.
Discrete tokens. RT-2 / OpenVLA / π₀-FAST emit action tokens from the LLM's own vocabulary, decoded back into actions. Lowest latency for the VLM; throws away cross-dimension structure unless paired with FAST.
Latent plan vectors. Helix-style designs emit a small "plan vector" updated at System-2 frequency that conditions System 1. Loose coupling; allows the two halves to be trained separately.
Natural-language reasoning. Gemini Robotics 1.5 interleaves language reasoning steps with action chunks — "first I'll pick up the cup, then place it in the sink" — making behavior interpretable.
Tool calls. Gemini Robotics-ER 1.5 acts as an orchestrator, calling a separate VLA as a tool. The reasoning model never sees the actuators directly.

Motion Transfer and embodiment soup

A VLA trained on Open X-Embodiment sees seven different arms doing similar tasks with different action spaces. Motion Transfer (Gemini Robotics 1.5) and π₀'s zero-padding to the largest action vector are two answers to the same question: how do you make a single policy reuse motor knowledge across robots? The recipe that works is a shared semantic representation in the VLM, plus an action expert whose output is masked to the active embodiment's true degrees of freedom.
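The padding-and-masking recipe can be sketched in a few lines. This is a minimal illustration, not the implementation used by π₀ or Gemini Robotics: the shared width `MAX_ACTION_DIM = 18` and the helper names are hypothetical, and the loss is a plain masked MSE standing in for whatever objective the action expert actually uses.

```python
import numpy as np

MAX_ACTION_DIM = 18  # hypothetical: widest action vector across all embodiments


def pad_action(action, max_dim=MAX_ACTION_DIM):
    """Zero-pad a robot-specific action to the shared width; return (padded, mask)."""
    action = np.asarray(action, dtype=np.float32)
    padded = np.zeros(max_dim, dtype=np.float32)
    mask = np.zeros(max_dim, dtype=bool)
    padded[: action.shape[0]] = action
    mask[: action.shape[0]] = True
    return padded, mask


def masked_mse(pred, target, mask):
    """Mean squared error over the active degrees of freedom only."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))


def unpad_action(padded, mask):
    """Keep only the dimensions that exist on the active embodiment."""
    return padded[mask]


# A 7-DoF arm sharing one policy with wider embodiments.
target, mask = pad_action(np.linspace(-1.0, 1.0, 7))
pred = target.copy()
pred[7:] = 5.0  # whatever the expert emits in padded slots is ignored
loss = masked_mse(pred, target, mask)
command = unpad_action(pred, mask)
```

The mask does two jobs: it keeps padded slots out of the loss during training, and it trims the output to the robot's true degrees of freedom at deployment.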

Embodied thinking

Gemini Robotics 1.5 added an explicit reasoning trace before action emission — the model writes natural language describing what it is about to do, then emits the action tokens. The trace is conditioned on by the action head, so the reasoning is causally upstream of motion. The cost is latency. The benefit is that "pour the milk before the cereal" requires reasoning the model could not previously do at all.

Inside the VLA: tokenization walkthrough

A VLA ingests three modalities and must convert all of them into a common token format that the transformer backbone can process. Here is the data flow, layer by layer:

Image tokenization. Each camera image (typically 224×224 or 336×336) passes through a ViT encoder (DINOv2, SigLIP, or CLIP). The ViT divides the image into non-overlapping patches (14×14 or 16×16 pixels each), projects each patch to an embedding, and outputs $N$ spatial tokens — typically 256 for a 224/14 grid. For multi-camera setups, each camera produces its own $N$ tokens, which are concatenated into the sequence. A two-camera robot thus starts with $2 \times 256 = 512$ image tokens.

Text tokenization. The language instruction ("pick up the red cup") is tokenized by the VLM's text tokenizer (SentencePiece for the Gemma family and Llama-2; tiktoken-style BPE for Llama-3). A typical instruction becomes 5–20 text tokens. These are prepended or interleaved with the image tokens.

Action prediction. The transformer processes the combined token sequence and must produce actions. Two families:

Discrete action tokens (RT-2, OpenVLA). Each action dimension is quantized into 256 bins. A 7-DoF action becomes 7 tokens, predicted autoregressively. The loss is cross-entropy per token.
Continuous action heads (Diffusion Policy, flow matching expert). The transformer's last hidden states are fed to a separate action expert network that generates continuous actions. The loss is the diffusion or flow matching objective, conditioned on the transformer's hidden representations.

The autoregressive action loss, derived. For discretized actions (RT-2 family), each action dimension $d$ is binned into $B = 256$ buckets. The bin index for dimension $d$ at timestep $t$ is $b_{t,d} = \lfloor (a_{t,d} - a_{\min,d}) / (a_{\max,d} - a_{\min,d}) \times (B-1) \rceil$, where $\lfloor \cdot \rceil$ rounds to the nearest integer. The training loss is cross-entropy over bins: $$\mathcal{L}_{\text{action}} = -\sum_{t=1}^{H} \sum_{d=1}^{D} \log p_\theta(b_{t,d} \mid b_{<(t,d)}, o)$$ where $b_{<(t,d)}$ is all previously predicted bin indices (autoregressive ordering). At inference, the model samples (or argmaxes) one bin per dimension, then converts back: $\hat{a}_{t,d} = a_{\min,d} + b_{t,d} \cdot (a_{\max,d} - a_{\min,d}) / (B-1)$. The resolution bottleneck: with 256 bins over a 0.4m workspace, adjacent bin centers are $0.4/(B-1) = 0.4/255 \approx 1.6$mm apart, so the worst-case rounding error is about 0.8mm. This is adequate for pick-and-place but coarse for insertion tasks requiring sub-millimeter precision. Diffusion and flow matching heads avoid this quantization ceiling entirely.
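The binning and de-binning formulas above translate directly into NumPy. The sketch below assumes a symmetric ±0.2m range on every one of 7 dimensions (so the range is the 0.4m from the text); the function names `tokenize` and `detokenize` are illustrative, not taken from any VLA codebase.

```python
import numpy as np

B = 256  # bins per action dimension, as in the text


def tokenize(a, a_min, a_max, n_bins=B):
    """Map continuous actions to bin indices: round((a - min) / (max - min) * (B - 1))."""
    a = np.clip(a, a_min, a_max)
    scaled = (a - a_min) / (a_max - a_min) * (n_bins - 1)
    return np.rint(scaled).astype(np.int64)


def detokenize(b, a_min, a_max, n_bins=B):
    """Map bin indices back to bin-center actions."""
    return a_min + b * (a_max - a_min) / (n_bins - 1)


# Assumed workspace: 0.4 m of travel per dimension, 7 action dimensions.
a_min = np.full(7, -0.2)
a_max = np.full(7, 0.2)

rng = np.random.default_rng(0)
a = rng.uniform(a_min, a_max)
b = tokenize(a, a_min, a_max)
a_hat = detokenize(b, a_min, a_max)

bin_width = (a_max - a_min) / (B - 1)   # about 1.57 mm, which the text rounds to 1.6 mm
max_err = np.max(np.abs(a - a_hat))     # bounded by half a bin width
```

The round trip can never be more accurate than half a bin width, which is the quantization ceiling the paragraph describes.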

The two-system split: a concrete example

Worked example: two-system execution of "pick up the red cup." The robot is a bimanual humanoid with 35 DOF upper body, a scene camera, and wrist cameras.

System 2 (slow brain, 7B VLM, 7Hz). At $t = 0$: the VLM receives the scene image + "pick up the red cup." It outputs a latent plan vector $z_{\text{plan}} \in \mathbb{R}^{512}$ encoding "reach toward red cup with right hand, grasp." This forward pass takes ~140ms.

System 1 (fast brain, 200M action expert, 200Hz). Between $t = 0$ and $t = 143$ms (the next System 2 tick), System 1 runs ~28 control steps. Each step: read the latest joint positions $q_t$, wrist camera image token, and the cached $z_{\text{plan}}$ from System 2. Output: 35-dimensional joint velocity target. Each forward pass: ~4ms.

At $t = 143$ms: System 2 re-observes the scene. The hand is now closer to the cup. It updates $z_{\text{plan}}$ to encode "close fingers around cup." System 1 seamlessly transitions to grasping motions using the updated plan.

At $t = 286$ms: System 2 sees the cup is grasped. Updates plan: "lift cup." System 1 executes the lift.

Total task time: ~2 seconds. System 2 ran ~14 times. System 1 ran ~400 times. The VLM provided semantic understanding; the action expert provided fast motor control. Neither alone could have done the task: the VLM is too slow for 200Hz control, and the action expert has no concept of "red cup."
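The tick counts in this example follow from the two rates alone. Here is a minimal scheduling sketch, with `slow_brain` and `fast_brain` as hypothetical stand-ins for the VLM and the action expert. It simplifies one thing: the plan is treated as available the instant System 2 ticks, whereas a real system would use a plan that is up to one forward pass (~140ms) stale.

```python
S1_HZ, S2_HZ = 200, 7   # rates from the worked example
TASK_SECONDS = 2


def slow_brain(tick):
    """Hypothetical stand-in for the VLM: returns a plan label."""
    return f"plan-{tick}"


def fast_brain(plan, step):
    """Hypothetical stand-in for the action expert: returns a dummy command."""
    return (plan, step)


plan, last_tick = None, -1
s1_calls = s2_calls = 0
steps_per_plan = []

for step in range(S1_HZ * TASK_SECONDS):
    # Index of the System 2 period this control step falls in (integer math avoids drift).
    tick = (step * S2_HZ) // S1_HZ
    if tick != last_tick:
        plan = slow_brain(tick)      # refresh the cached plan
        last_tick = tick
        s2_calls += 1
        steps_per_plan.append(0)
    fast_brain(plan, step)           # every control step reuses the cached plan
    s1_calls += 1
    steps_per_plan[-1] += 1
```

Because 200/7 is not an integer, each plan is reused for either 28 or 29 control steps, which is where the "~28" in the text comes from.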

The data scaling equation

Training a VLA requires data from three very different sources, at very different scales:

Data source	Scale	What it teaches	Example
Internet text	Trillions of tokens	Language understanding, common-sense reasoning, world knowledge	"Cups go on saucers, not in sinks"
Internet images/video	Billions of images	Visual recognition, object categories, spatial relations, physics intuition	"This is a red cup; it's on a table"
Robot demonstrations	Millions of trajectories	Motor skills, contact physics, action-observation mapping	"To grasp this cup, close fingers at this pose"

The ratio matters. RT-2 used ~4:1 web:robot data; $\pi_0$ used comparable ratios. Too much robot data and the model overfits to the robot distribution, losing web knowledge. Too little robot data and the model knows what a cup is but cannot grasp one. The emerging consensus: pretrain on web data (text + images), then fine-tune on robot data with a low learning rate and frozen early layers. This is why Knowledge Insulating ($\pi_0$, 2025) — freezing most VLM weights during robot fine-tuning — works: it preserves the internet knowledge structurally while adapting only the action-relevant layers.

Co-training on web data

RT-2 introduced and every successor confirmed: continue training on web vision-language data while fine-tuning on robot data. Otherwise the model loses its world knowledge — it can grab the green block, but ask it to "grab the dinosaur" and it doesn't know what a dinosaur looks like anymore. Mix ratios run 1:1 to 4:1 web:robot. π₀ + Knowledge Insulating (2025) takes this further: freeze most VLM weights through fine-tuning so internet knowledge is preserved structurally, not just statistically.

The VLA architecture in detail

A modern VLA has four distinct components, each with different computational profiles:

1. Vision encoder ($\sim$300M params, frozen). Typically a ViT-L/14 (DINOv2 or SigLIP). Processes each camera image into spatial tokens. For a 224×224 image with 14×14 patches: 256 tokens per image, each $\in \mathbb{R}^{1024}$. Forward pass: ~8ms on an A100. This is the most expensive per-image operation, but it only runs once per observation (not per denoising step).

2. Language encoder ($\sim$0 additional params, shared). In most VLAs, the language encoder is the same transformer backbone that processes the combined sequence. The text tokens are embedded by the VLM's standard tokenizer and processed alongside the image tokens. In $\pi_0$, the language is processed by PaliGemma's text encoder, which shares parameters with the vision pathway.

3. Transformer backbone ($\sim$1B–7B params, frozen or LoRA). The core of the VLA. Processes the concatenated sequence of [text tokens, image tokens, proprioception tokens, (optional) action tokens]. Self-attention allows every token to attend to every other token, enabling cross-modal reasoning: the model can correlate the word "red" with the red-colored image patches and the proprioceptive state indicating the arm is near a red object.

4. Action head ($\sim$100M–300M params, trained). Converts the backbone's output into robot actions. This is where the architectural diversity lives. The action head might be a diffusion denoiser (conditioned on backbone hidden states), a flow matching expert, an autoregressive token predictor, or a simple MLP. The head is almost always trained from scratch — unlike the backbone, it has no useful pretrained initialization for robot actions.

Worked example: token sequence for one VLA forward pass. An OpenVLA-7B processing a language-conditioned pick task.

Input sequence construction:
1. Language: "pick up the red block" → SentencePiece → 7 text tokens, each $\in \mathbb{R}^{4096}$
2. Image: 224×224 RGB from scene camera → DINOv2 ViT-L/14 → 256 spatial tokens, projected to $\mathbb{R}^{4096}$
3. Image: 224×224 RGB from wrist camera → same encoder → 256 spatial tokens
4. Proprioception: [joint angles (7); gripper width (1); EE pose (6)] = 14D → MLP → 1 token $\in \mathbb{R}^{4096}$
Total input: 7 + 256 + 256 + 1 = 520 tokens.

Backbone forward pass: 520 tokens through 32 transformer layers × 4096 dim = ~120ms on A100.

Action prediction: The backbone's last hidden state at the proprioception token position is passed to the action head. For discrete tokens: predict 7 bin indices autoregressively (7 × ~3ms = 21ms). For flow matching: 10 Euler steps × ~5ms = 50ms.

Total latency: vision encoding (2 × 8ms = 16ms) + backbone (120ms) + action head (21–50ms) = ~160–190ms. At 5Hz control (200ms budget), this fits, with little headroom at the upper end. At 10Hz (100ms budget), the full pass does not fit; only the action head can run repeatedly while the backbone is amortized across multiple control steps, which is exactly the two-system split.
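The arithmetic in this worked example is easy to keep in a small budget script, so the numbers can be re-run when a camera or head type changes. All timings below are the approximate figures quoted in the example, not measurements; the variable names are illustrative.

```python
def vit_tokens(image_size, patch_size):
    """Number of patch tokens for a square image split into square patches."""
    side = image_size // patch_size
    return side * side


# Sequence length: text + two cameras + one proprioception token.
text_tokens = 7
image_tokens = 2 * vit_tokens(224, 14)
proprio_tokens = 1
seq_len = text_tokens + image_tokens + proprio_tokens

# Latency budget in milliseconds (approximate figures from the example).
vision_ms = 2 * 8            # one ViT pass per camera
backbone_ms = 120
head_ms = {"discrete": 7 * 3, "flow": 10 * 5}

total_ms = {name: vision_ms + backbone_ms + ms for name, ms in head_ms.items()}
max_hz = {name: 1000 / ms for name, ms in total_ms.items()}
```

Under these figures the full pipeline tops out between roughly 5 and 6.5Hz, which is why anything faster has to amortize the backbone.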

Synthetic data is the new Open X-Embodiment

GR00T N1's training mix is real-robot trajectories, human videos, and entire neural-generated trajectories from video diffusion models. The shift is significant: when image and video generation are themselves at foundation-model scale, the cheapest source of robot training data may be a generative model rather than a teleoperator.

The VLA training pipeline

Training a VLA from a pretrained VLM checkpoint follows a consistent recipe across most published models:

Stage 1: Pretrain on web data (already done). The VLM backbone arrives pretrained on internet text and images. This is the most expensive stage (thousands of GPU-hours) and is done by the model provider, not the robotics lab.

Stage 2: Co-finetune on web + robot data. Interleave batches of web VQA data with robot demonstration data. The web data preserves the VLM's general knowledge; the robot data teaches action prediction. Typical mix ratio: 1:1 to 4:1 web:robot. Learning rate: 1e-5 to 5e-5. Duration: 50K–200K gradient steps on 8–64 GPUs.

Stage 3: Task-specific fine-tuning (LoRA). Freeze the backbone, attach LoRA adapters, and fine-tune on the target robot's demonstrations. This is the step most practitioners perform. Learning rate: 2e-5. Duration: 10–30 epochs on a single GPU. Trainable parameters: 0.2–2% of total.

The key engineering decision in Stage 2 is what to supervise. For discrete action tokens (RT-2 family), the loss is next-token cross-entropy on both web tokens and action tokens — the same loss function for both modalities. For continuous action heads ($\pi_0$ family), the web data is supervised with the VLM's original loss (captioning, VQA) while the robot data is supervised with the action head's loss (flow matching, diffusion). The two loss terms are weighted and summed.
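A minimal sketch of the two Stage 2 mechanics described above: interleaving batches at a fixed web:robot ratio, and summing the two loss terms with weights. The deterministic round-robin schedule and the weight values are assumptions for illustration; published recipes may sample batches randomly or mix sources within a batch.

```python
from itertools import cycle, islice


def batch_schedule(n_steps, web_per_robot):
    """Round-robin schedule with `web_per_robot` web batches per robot batch."""
    pattern = ["web"] * web_per_robot + ["robot"]
    return list(islice(cycle(pattern), n_steps))


def combined_loss(vlm_loss, action_loss, w_vlm=1.0, w_action=1.0):
    """Weighted sum of the VLM objective and the action-head objective."""
    return w_vlm * vlm_loss + w_action * action_loss


schedule = batch_schedule(n_steps=1000, web_per_robot=4)   # 4:1 web:robot
web_fraction = schedule.count("web") / len(schedule)

# Hypothetical loss values and weights, just to show the combination.
example_loss = combined_loss(vlm_loss=2.0, action_loss=0.5, w_vlm=0.5, w_action=1.0)
```

For the discrete-token family the two terms are the same cross-entropy, so the weights mostly rebalance data sources; for continuous heads they also reconcile losses on different scales.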

The knowledge insulation problem. When you fine-tune a VLM on robot data, the model's internet knowledge degrades. This is called catastrophic forgetting. The model learns to predict actions but forgets what a dinosaur looks like. Two solutions:

1. Co-training (RT-2 approach): keep web data in the training loop. The model sees both web and robot data at every step. This works but requires maintaining a large web dataset during robot training.
2. Weight freezing ($\pi_0$ approach): freeze most VLM weights. Only the action expert and a few adapter layers are trainable. The internet knowledge is preserved by construction, because the weights that encode it cannot change. This is simpler and increasingly preferred.

The emerging best practice: freeze the VLM backbone entirely, use LoRA adapters for the minimal adaptation needed, and train the action head from scratch. This gives the best of both worlds: internet knowledge preservation + task-specific motor skill learning.
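To see why LoRA keeps the trainable fraction so small, count the adapter parameters. The sketch assumes a 7B backbone with 32 layers of width 4096 (the OpenVLA-scale figures used elsewhere in this section); the rank of 32 and the choice to adapt four attention projections per layer are assumptions, not a published configuration.

```python
def lora_params(d_in, d_out, rank):
    """A LoRA adapter adds two low-rank factors: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


n_layers = 32
d_model = 4096
rank = 32                 # assumed adapter rank
adapted_per_layer = 4     # assumed: query, key, value, and output projections

trainable = n_layers * adapted_per_layer * lora_params(d_model, d_model, rank)
total = 7_000_000_000     # nominal backbone size
fraction = trainable / total
```

With these assumptions the adapters add about 33.6M parameters, roughly 0.5% of the backbone, inside the 0.2–2% range quoted for Stage 3.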

The generalist vs specialist tradeoff, quantified

The empirical data on when generalist VLAs beat specialist policies is now clear enough to state as a rough rule:

The crossover analysis. Consider a deployment with $N$ distinct tasks, each with $D$ demonstrations.

Specialist approach: train $N$ separate Diffusion Policies. Each achieves $\sim$90% success with 50+ demos. Total training: $N$ models × 2 hours = $2N$ GPU-hours. Deployment: one model per task, hot-swap at task boundaries.

Generalist VLA approach: fine-tune one OpenVLA with LoRA on all $N \times D$ demonstrations. Success rate: $\sim$80% on average (worse than specialist on any single task, but covers all tasks with one model). Total training: 1 model × 8 hours = 8 GPU-hours. Deployment: one model, task selected by language instruction.

Crossover at $N \approx 10$–$15$. On training compute alone the break-even comes much earlier: $2N < 8$ only when $N < 4$, so below about 4 tasks the specialist is both better (90% vs 80%) and cheaper. Between roughly 4 and 10 tasks the specialist still wins on success rate, but it already costs more GPU-hours and adds one model per task to maintain. Above 15 tasks, the VLA wins on engineering cost: one model to maintain, one inference pipeline, no task-switching logic. The success rate gap narrows as the VLA sees more diverse data. At $N = 50$ tasks, the VLA often matches or beats the specialist because cross-task transfer improves the shared representations.
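The compute side of this comparison is simple enough to tabulate. The sketch uses only the figures from the analysis (2 GPU-hours per specialist, 8 GPU-hours for the generalist); it deliberately leaves out success rate and maintenance cost, which are what push the practical crossover up toward 10–15 tasks.

```python
SPECIALIST_HOURS_PER_TASK = 2   # GPU-hours per Diffusion Policy, from the text
GENERALIST_HOURS = 8            # GPU-hours for one LoRA fine-tune, from the text


def training_cost(n_tasks):
    """GPU-hours for each approach at a given number of tasks."""
    return {
        "specialist": SPECIALIST_HOURS_PER_TASK * n_tasks,
        "generalist": GENERALIST_HOURS,
    }


# Number of tasks at which the two training bills are equal.
breakeven_tasks = GENERALIST_HOURS / SPECIALIST_HOURS_PER_TASK

costs = {n: training_cost(n) for n in (2, 4, 10, 50)}
for n, c in costs.items():
    cheaper = min(c, key=c.get)
    print(f"N={n:>2}: specialist {c['specialist']:>3} h, generalist {c['generalist']} h -> {cheaper}")
```

At exactly four tasks the bills tie; past that point the specialist's remaining advantage is the success-rate gap, not the GPU budget.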

Where VLAs are weak

Raw latency. A 7B forward pass dominates the control budget. Two-system splits, FAST tokenization, INT4 quantization, and speculative decoding are the four levers.
Fine motor control. A generalist policy still underperforms a specialist on its specialty by 5–15 points. RL fine-tuning closes most of the gap.
Out-of-distribution physics. A VLA that never saw deformable cloth does not learn cloth physics from a few demos.

A VLA is not a robot policy that happened to use a language model. It is a language model that happens to have a robot as an output device. The implications of that framing — for data, architecture, evaluation, and team structure — are still being worked out.

12·5 3D representations and equivariance

When the input is a point cloud, the symmetries of physics start paying for themselves.

2D image policies are the dominant paradigm for one reason: 2D images are easy to collect, easy to encode, and have ImageNet-scale priors available. They are also geometrically lossy. A policy trained on RGB images alone has no built-in notion of where things are in 3D space; it has to learn that from data, every time. A small but rapidly growing corner of the field argues that the right move is to give the policy 3D structure directly — and, while you're there, to bake the physical symmetries of 3D space into the architecture.

Why 3D helps

Spatial generalization for free. A policy that sees raw RGB has to learn that an object 30cm to the left looks similar to one straight ahead. A policy that operates on 3D points has the translation built into the input geometry.
Camera invariance. 3D point clouds aggregated from RGB-D or stereo cameras are indifferent to camera placement.
Sample efficiency. 3D Diffusion Policy needs ~10× fewer demos than 2D Diffusion Policy on contact-rich tasks.

The 3D policy zoo

Model	Input	Architecture	Notable
3D Diffusion Policy	Sparse point cloud (~512 pts)	1D embedding + diffusion	Cheap; strong on data-scarce tasks.
3D Diffuser Actor	Multi-view RGB-D → 3D scene tokens	Relative-position 3D attention	Translation equivariant; SOTA on RLBench.
EquiBot	Point cloud	Sim(3)-equivariant network	Scale-equivariant; data efficient.
Spherical Diffusion Policy	Point cloud	SE(3)-equivariant in spherical Fourier space	Full 3D rotational equivariance.

The symmetry argument

If you rotate the entire scene by some $R \in SO(3)$, the correct robot action rotates by the same $R$. A policy that doesn't know this has to learn it from data, separately for every angle. A policy that has it baked in is, by construction, correct for every angle the moment it works for one. This is the same argument that made convolutional networks beat MLPs on images: a network that respects translation symmetry learns a pattern once and recognizes it wherever the object appears. 3D policies extend the argument from $\mathbb{R}^2$ translations to $SE(3)$ rigid motions.

SE(3) equivariance, formally

In plain English: if you rotate and shift the entire scene — the table, the cup, the robot's coordinate frame — the robot's planned motion rotates and shifts by the exact same amount. The policy "understands" 3D geometry well enough that its answer transforms correctly with the world, rather than memorizing specific positions.

A policy $\pi$ is SE(3)-equivariant if for any rigid transform $g = (R, t) \in SE(3)$ (a rotation $R \in SO(3)$ and translation $t \in \mathbb{R}^3$):

SE(3) equivariance constraint $$ \pi(g \cdot o) = g \cdot \pi(o) $$

$g \cdot o$ — the transformed observation. If $o$ is a point cloud, $g \cdot o$ rotates and translates every point. If $o$ includes images, $g$ transforms the 3D scene that the images depict.
$\pi(o)$ — the policy output (an action, typically in SE(3) end-effector space). Given the original scene, the policy predicts this action.
$g \cdot \pi(o)$ — the transformed action. If the scene rotates by $R$ and translates by $t$, the correct action rotates and translates by the same $R$ and $t$.

What this means concretely: if you rotate the entire scene 90° clockwise (the table, the cup, the robot's coordinate frame), the predicted end-effector target rotates 90° clockwise too. The policy does not need to learn rotational equivariance from data; it is built into the architecture via equivariant layers that preserve the group structure through every computation.
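The constraint $\pi(g \cdot o) = g \cdot \pi(o)$ can be checked numerically for any candidate policy. The sketch below uses two toy policies invented for illustration: one that targets the point cloud's centroid (equivariant by construction) and one that adds a hard-coded world-frame offset (which breaks rotation equivariance). Neither is a real manipulation policy.

```python
import numpy as np


def transform(points, R, t):
    """Apply g = (R, t) to every row of an (N, 3) point array."""
    return points @ R.T + t


def centroid_policy(points):
    """Toy policy: aim the end-effector at the cloud's centroid."""
    return points.mean(axis=0)


def fixed_offset_policy(points):
    """Toy policy: centroid plus a hard-coded 10 cm offset along world z."""
    return points.mean(axis=0) + np.array([0.0, 0.0, 0.1])


def equivariance_error(policy, points, R, t):
    """Norm of pi(g . o) - g . pi(o); zero for an SE(3)-equivariant policy."""
    lhs = policy(transform(points, R, t))
    rhs = R @ policy(points) + t
    return float(np.linalg.norm(lhs - rhs))


# A 90-degree rotation about the x-axis plus a small translation.
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
t = np.array([0.2, -0.1, 0.05])
points = np.random.default_rng(0).normal(size=(512, 3))

err_centroid = equivariance_error(centroid_policy, points, R, t)
err_fixed = equivariance_error(fixed_offset_policy, points, R, t)
```

The same check, run over random transforms, is a useful unit test for a hand-built equivariant layer: the error should sit at floating-point noise, not merely be small.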

What this means for your system: building equivariant layers requires libraries like e3nn or escnn. The upside is 6–10× data efficiency on orientation-diverse tasks. The downside is 2–5× slower inference (tensor products in spherical harmonic space) and 3–4 weeks of implementation time versus 1 week for a standard policy. If you have >100 demos or your objects appear in fixed orientations, skip equivariance and spend the engineering time collecting more data.

Why equivariance gives sample efficiency. An SE(3)-equivariant policy with $N$ demonstrations effectively has $N \times |SE(3)|$ demonstrations — which is infinite, because $SE(3)$ is a continuous group. For any training scene, the equivariance constraint guarantees correct behavior for every possible rigid transform of that scene, without ever seeing those transforms in the data. This is not data augmentation (which is approximate and finite); it is an exact architectural constraint that provides infinite augmentation for free. The practical consequence: 10 demos with equivariance match or beat 100+ demos without it on contact-rich manipulation tasks.

The 3D policy zoo, expanded

3D Diffusion Policy (Ze et al., 2024). Takes a sparse point cloud (~512 points from depth cameras), encodes it with PointNet++, and uses the resulting feature vector to condition a standard diffusion action head. The 3D structure is in the input representation, not the network architecture — the diffusion process itself operates in flat action space. Cheap to implement; gives the sample efficiency of 3D input without requiring equivariant network layers.

EquiBot (Yang et al., 2024). A Sim(3)-equivariant network that handles not just rotations and translations but also scaling. The architecture builds on Vector Neuron features (vector-valued channels that rotate with the input), extended so that the output transforms correctly under any similarity transform of the input. This means the same policy that picks up a small cup can pick up a large bowl without retraining: the scale equivariance handles the geometric adaptation.

Equivariant Diffusion Policy (Wang et al., 2024). Combines SE(3)-equivariant networks with diffusion action generation. The key architectural choice: the denoiser operates in the group's irreducible representations (irreps), so the denoising process itself respects the symmetry. This is harder to implement than 3D Diffusion Policy (which only uses 3D input, not 3D-equivariant layers) but gives stronger generalization guarantees.

The engineering cost: when is equivariance worth it?

Equivariant architectures carry real costs that must be weighed against their sample efficiency benefits:

Implementation complexity. Libraries like e3nn and escnn provide equivariant layers, but they are far less mature than standard PyTorch modules. Debugging requires understanding representation theory (irreps, Wigner-D matrices, Clebsch-Gordan decomposition). A team that takes one week to implement a standard diffusion policy will take 3–4 weeks for an equivariant one.
Inference speed. Equivariant layers involve tensor products in spherical harmonic space, which are 2–5× slower than standard linear layers at comparable parameter counts. For a 200Hz control loop, this overhead can push inference time over budget.
Incompatibility with 2D priors. Foundation model vision encoders (CLIP, DINOv2) produce 2D features. Feeding these into an equivariant 3D network requires lifting — projecting 2D features into 3D space — which loses some of the pretraining benefit.

The decision heuristic: use equivariant architectures when (a) you have fewer than 100 demonstrations, (b) the task involves diverse orientations (e.g., grasping objects in arbitrary poses), and (c) you do not need to leverage 2D foundation model features. If any of these conditions is false, the standard 2D pipeline is likely the better bet.

The downside is engineering. Equivariant networks are harder to write, harder to debug, and harder to compose with foundation-model priors. Spherical-Fourier and steerable-CNN libraries exist but are far less mature than PyTorch's standard transformer. Most of the field is still betting on data + 2D + flexible architectures over symmetry-baked 3D — but the 3D camp's sample efficiency numbers keep getting harder to ignore.

The 3D input representations

Before any equivariance can be applied, the raw sensor data must be converted to a 3D representation. Three options are in common use:

Point clouds. The simplest representation. An RGB-D camera produces a depth image; back-projecting each pixel using the camera intrinsics gives a 3D point $(x, y, z)$ with an associated color $(r, g, b)$. Multiple cameras are merged by transforming each point cloud into a common world frame. The result: $N$ points $\in \mathbb{R}^{N \times 6}$ (XYZ + RGB). Typical $N = 512$–$4096$ after downsampling. Encoded by PointNet++ or DGCNN.
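Back-projection with a pinhole model is a few lines of NumPy. The sketch below assumes metric depth, zero depth meaning "no return", and intrinsics `fx, fy, cx, cy` in pixels; the function names and the tiny synthetic image are illustrative only.

```python
import numpy as np


def backproject(depth, rgb, fx, fy, cx, cy):
    """Turn an (H, W) depth image and (H, W, 3) uint8 RGB image into an (N, 6) XYZRGB cloud."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # pixel row (v) and column (u) indices
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    xyz = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3) / 255.0
    valid = xyz[:, 2] > 0              # drop pixels with no depth reading
    return np.hstack([xyz[valid], colors[valid]])


def to_world(cloud, T_world_cam):
    """Move the XYZ part of a cloud into the world frame with a 4x4 camera pose."""
    xyz = cloud[:, :3] @ T_world_cam[:3, :3].T + T_world_cam[:3, 3]
    return np.hstack([xyz, cloud[:, 3:]])


# Tiny synthetic frame: 4x4 pixels, everything 1 m away, one missing reading.
depth = np.ones((4, 4))
depth[0, 0] = 0.0
rgb = np.full((4, 4, 3), 255, dtype=np.uint8)
cloud = backproject(depth, rgb, fx=2.0, fy=2.0, cx=1.5, cy=1.5)

T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 0.5]             # camera sits 0.5 m along world z
cloud_world = to_world(cloud, T)
```

Merging cameras is then a matter of calling `to_world` with each camera's extrinsics and stacking the results before downsampling.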

Voxel grids. Discretize the workspace into a 3D grid of voxels, each containing occupancy + features. Resolution is the bottleneck: a 1cm grid over a 0.5m × 0.5m × 0.5m workspace requires $50^3 = 125{,}000$ voxels. Sparse voxel representations (MinkowskiEngine) make this tractable. The advantage: 3D convolutions are well-understood and fast on GPUs. The disadvantage: fixed resolution trades off detail vs memory.
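A dense occupancy grid shows both the voxel count and why sparsity matters. The sketch assumes a cubic workspace 0.5m on a side with 1cm voxels and a randomly generated 2048-point cloud; it is a plain NumPy illustration, not the MinkowskiEngine API.

```python
import numpy as np


def voxelize(points, origin, voxel_size, grid_shape):
    """Mark every voxel that contains at least one point; points outside the grid are dropped."""
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid = np.zeros(grid_shape, dtype=bool)
    grid[tuple(idx[inside].T)] = True
    return grid


side_m, voxel_m = 0.5, 0.01
n_per_side = int(round(side_m / voxel_m))
grid_shape = (n_per_side,) * 3

rng = np.random.default_rng(0)
points = rng.uniform(0.0, side_m, size=(2048, 3))   # synthetic cloud filling the workspace
grid = voxelize(points, origin=np.zeros(3), voxel_size=voxel_m, grid_shape=grid_shape)

occupancy = grid.sum() / grid.size
```

Even with the cloud spread over the whole workspace, under 2% of voxels are occupied, which is the waste that sparse voxel libraries avoid.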

Lifted 2D features. Extract per-pixel features from a ViT (DINOv2 or CLIP), then lift each feature to 3D using the depth map. The result: a 3D feature field where each point has a high-dimensional feature vector instead of just RGB. This is the best of both worlds: 2D pretraining priors + 3D spatial structure. Used by 3D Diffuser Actor and PolarNet.

Worked example: SE(3) equivariance in action. A policy must pick a mug from a table. The mug can appear in any orientation.

Without equivariance (2D policy, 50 demos): The policy sees the mug handle pointing right in 40/50 demos and pointing left in 10/50. At test time, the mug handle points toward the camera (never seen). The policy hesitates, predicts an averaged grasp between the "right-handle" and "left-handle" strategies, and misses the handle entirely. Success: 35%.

With SE(3) equivariance (3D policy, 50 demos): The equivariant architecture guarantees that if the policy can grasp a mug with the handle at 0°, it can grasp it at any angle. The 50 demos teach the concept of handle grasping; the equivariance constraint generalizes it to all orientations. At test time, the policy behaves as if the mug at a novel orientation were rotated to a familiar orientation, the familiar grasp predicted, and the output rotated back. Success: 88%.

The sample efficiency ratio: the 2D policy would need ~300 demos (covering diverse orientations) to match the equivariant policy's 50-demo performance. That is a 300/50 = 6× data efficiency multiplier for this task.

Point cloud encoding architectures

The choice of point cloud encoder determines both the quality of 3D features and whether equivariance is possible:

PointNet++ (Qi et al., 2017). The workhorse. Processes raw $(x, y, z, r, g, b)$ points through set abstraction layers: each layer samples a subset of points, groups nearby points, and applies a shared MLP to produce per-group features. Not equivariant — the MLP operates on raw coordinates, so the features change under rotation. But it is fast (~3ms for 512 points on GPU) and well-understood.

DGCNN (Wang et al., 2019). Constructs a $k$-nearest-neighbor graph in feature space at each layer and applies edge convolutions. Slightly better than PointNet++ for shape classification, comparable for manipulation. Also not equivariant.

Vector Neurons (Deng et al., 2021). Replaces scalar features with 3D vector features that rotate with the input. Each "neuron" outputs a vector in $\mathbb{R}^3$ instead of a scalar, and the network operations (linear layers, nonlinearities) are designed to be SO(3)-equivariant. Used in EquiBot.
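The core trick of a Vector Neuron linear layer fits in one line: mix channels with a weight matrix, never mix the x/y/z components. The sketch below checks that property numerically and contrasts it with an ordinary dense layer on flattened coordinates. It covers only the linear layer; the equivariant nonlinearities and pooling from the paper are omitted, and the layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)


def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def vn_linear(V, W):
    """Vector Neuron linear layer: V is (C_in, 3), W is (C_out, C_in); only channels are mixed."""
    return W @ V


C_in, C_out = 8, 16
V = rng.normal(size=(C_in, 3))        # 8 vector-valued features
W = rng.normal(size=(C_out, C_in))
R = rot_z(0.7) @ rot_x(1.1)           # arbitrary rotation

# Rotating features then applying the layer equals applying the layer then rotating.
vn_error = np.max(np.abs(vn_linear(V @ R.T, W) - vn_linear(V, W) @ R.T))

# An ordinary dense layer on the flattened 24 numbers mixes x, y, z and loses the property.
W_flat = rng.normal(size=(C_out * 3, C_in * 3))


def dense(V):
    return (W_flat @ V.reshape(-1)).reshape(C_out, 3)


dense_error = np.max(np.abs(dense(V @ R.T) - dense(V) @ R.T))
```

The equivariance falls out of associativity: `W @ (V @ R.T)` and `(W @ V) @ R.T` are the same product, so no training is needed to obtain it.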

Spherical CNNs / e3nn (Geiger et al., 2022). Operate in the basis of spherical harmonics, using tensor products of irreducible representations to ensure exact SE(3) equivariance. The most principled approach but also the slowest: tensor products are computationally expensive, and the library ecosystem is immature compared to standard PyTorch.

Encoder	Equivariant?	Speed (512 pts)	Implementation difficulty	Best for
PointNet++	No	~3ms	Easy (PyTorch Geometric)	General-purpose 3D policies
DGCNN	No	~4ms	Easy	Shape-sensitive tasks
Vector Neurons	SO(3)	~8ms	Moderate	Rotation-diverse tasks
e3nn / Spherical CNN	SE(3)	~15ms	Hard	Maximum sample efficiency

Hybrid strategies

Lift, don't replace. Keep the ViT backbone. Use it to extract per-pixel features, then lift those features into 3D via the camera intrinsics + depth. The downstream policy operates on 3D feature points. You get 3D structure without losing the 2D pretraining.
Canonicalize the input. Before feeding a point cloud into a policy, rotate it to a canonical orientation. The policy itself is not equivariant; the preprocessor handles symmetry.
Use 3D only at the contact phase. Run a 2D VLA for high-level reasoning and reaching, switch to a 3D contact-aware policy for the final approach. The slow-fast split, in 3D form.

Worked example: sample efficiency gain from equivariance. Consider a pick task where the object can appear at any of 12 orientations on a table. A non-equivariant 2D policy needs demonstrations at each orientation — 12×50 = 600 demos minimum. An SE(3)-equivariant 3D policy needs demos at one orientation — 50 demos total — because the equivariance constraint guarantees generalization to all orientations. At test time, the object appears at a 13th unseen orientation. The 2D policy has never seen it and must interpolate. The 3D equivariant policy handles it by construction, because the mapping $f(R \cdot \text{input}) = R \cdot f(\text{input})$ is baked into the architecture.

Equivariance vs invariance, precisely. These two properties are often confused. An invariant function satisfies $f(g \cdot x) = f(x)$ — the output does not change when the input is transformed. An equivariant function satisfies $f(g \cdot x) = g \cdot f(x)$ — the output transforms in the same way as the input. For robot policies, equivariance is the correct constraint, not invariance. If you rotate the scene 90°, the correct action rotates 90° too (equivariance). An invariant policy would predict the same action regardless of rotation — which is wrong. The distinction matters architecturally: equivariant layers (e3nn, escnn) propagate group actions through the network. Invariant layers (max-pooling over orientations, rotation-invariant features) discard group information. Using invariant features and then trying to predict orientation-dependent actions is fundamentally ill-posed.
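A small numeric example makes the ill-posedness visible. The "mug" below is a synthetic blob with one far-away handle point; `invariant_features` and `handle_direction` are hypothetical functions chosen to illustrate the two properties, not components of any published policy.

```python
import numpy as np


def invariant_features(points):
    """Sorted distances to the centroid: unchanged by any rotation or translation."""
    centered = points - points.mean(axis=0)
    return np.sort(np.linalg.norm(centered, axis=1))


def handle_direction(points):
    """Vector from the centroid to the farthest point: rotates with the scene."""
    centered = points - points.mean(axis=0)
    return centered[np.argmax(np.linalg.norm(centered, axis=1))]


rng = np.random.default_rng(0)
body = rng.uniform(-0.05, 0.05, size=(200, 3))
handle = np.array([[0.5, 0.0, 0.0]])
mug = np.vstack([body, handle])

# Rotate the whole scene 90 degrees about the z-axis.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]])
mug_rotated = mug @ Rz.T

feat_a, feat_b = invariant_features(mug), invariant_features(mug_rotated)
dir_a, dir_b = handle_direction(mug), handle_direction(mug_rotated)
```

The two scenes produce identical invariant features, so any head reading only those features must output the same grasp for both, even though the handle has moved from the x-axis to the y-axis.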

The canonicalization alternative

If equivariant architectures are too expensive to implement, there is a simpler alternative: canonicalize the input. Before feeding a point cloud to a standard (non-equivariant) policy, rotate it to a canonical frame. The policy only ever sees point clouds in the canonical orientation, so it only needs to learn one orientation.

Two canonicalization strategies:

PCA-based. Compute the principal axes of the point cloud and rotate so the first principal axis aligns with the $x$-axis. This is fast (~1ms) and deterministic, but fails for symmetric objects (a sphere has no principal axis).
Learned canonicalization. Train a small network to predict the canonical rotation from the point cloud. This handles arbitrary objects but requires training data with canonical annotations. Some 3D Diffusion Policy variants use this approach.

The tradeoff: canonicalization is a preprocessing step that provides approximate equivariance without requiring equivariant network layers. It is much easier to implement than true equivariance, but it introduces errors when the canonicalization is imperfect (which it always is for novel objects). True equivariance is exact by construction but costs 2–5× in implementation effort and inference time.
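A minimal PCA canonicalizer, as a sketch of the preprocessing route. The sign convention (flip each axis so the projected third moment is positive) is one common heuristic chosen here for illustration; it fails when the cloud is symmetric along an axis, which is the failure mode noted above. The synthetic cloud is deliberately skewed and anisotropic so the canonical frame is well defined.

```python
import numpy as np


def pca_canonicalize(points):
    """Return (canonical_points, axes, centroid); columns of `axes` are the canonical frame."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered / len(points)
    _, eigvecs = np.linalg.eigh(cov)            # eigenvalues come back in ascending order
    a1, a2 = eigvecs[:, 2].copy(), eigvecs[:, 1].copy()
    for axis in (a1, a2):
        # Resolve the sign ambiguity of each eigenvector with the third moment (heuristic).
        if np.sum((centered @ axis) ** 3) < 0:
            axis *= -1.0
    a3 = np.cross(a1, a2)                       # guarantees a right-handed frame
    axes = np.stack([a1, a2, a3], axis=1)
    return centered @ axes, axes, centroid


def to_world(action_xyz, axes, centroid):
    """Map a position predicted in the canonical frame back to the sensor frame."""
    return axes @ action_xyz + centroid


rng = np.random.default_rng(0)
cloud_a = rng.exponential(size=(500, 3)) * np.array([3.0, 2.0, 1.0])

theta, phi = 0.7, 1.1
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Rx = np.array([[1.0, 0.0, 0.0],
               [0.0, np.cos(phi), -np.sin(phi)],
               [0.0, np.sin(phi), np.cos(phi)]])
R, t = Rz @ Rx, np.array([0.3, -0.2, 0.5])
cloud_b = cloud_a @ R.T + t                     # same object, different pose

canon_a, axes_a, centroid_a = pca_canonicalize(cloud_a)
canon_b, axes_b, centroid_b = pca_canonicalize(cloud_b)
```

Both poses map to the same canonical cloud, so a non-equivariant policy downstream sees a single orientation; `to_world` carries its prediction back to the sensor frame.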

Depth sensor considerations

Every 3D policy depends on a depth sensor to produce point clouds. The choice of sensor has a direct impact on policy performance:

Active IR stereo (Intel RealSense D435). Often loosely called structured light: the camera projects an IR dot pattern and triangulates depth between two IR imagers. Depth noise: $\sim$1% of range (5mm at 50cm). Fails on shiny, transparent, and black surfaces (IR is absorbed or reflected specularly). The most common sensor in manipulation research. ~$300.

Time-of-flight (Azure Kinect, L515). Measures photon travel time. More robust to surface material than structured light, but higher noise at close range (10mm at 50cm). Better for scenes with mixed materials. ~$400.

Stereo (ZED 2). Triangulates from two RGB cameras. No active illumination, so it works outdoors and in bright light. But depth accuracy depends on texture — featureless surfaces (white walls, smooth plastic) produce noisy depth. ~$450.

The practical advice: use Intel RealSense D435 for tabletop manipulation (cheap, good enough), Azure Kinect for scenes with transparent objects (the ToF sensor handles glass), and ZED for outdoor or mobile applications. Always filter point clouds with statistical outlier removal before feeding them to the policy — depth sensors produce spurious points that can corrupt the 3D features.
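Statistical outlier removal can be sketched as follows. This is a brute-force NumPy version for small clouds; the function name and the defaults `k=8` and `std_ratio=2.0` are illustrative assumptions, not tuned values. For real point clouds a KD-tree-based implementation (Open3D ships one, for example) avoids the quadratic memory cost.

```python
import numpy as np

def remove_statistical_outliers(points, k=8, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours is unusually large."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)             # (N, N) pairwise distances
    knn = np.sort(dist, axis=1)[:, 1:k + 1]          # skip column 0 (distance to self)
    mean_knn = knn.mean(axis=1)                      # one score per point
    threshold = mean_knn.mean() + std_ratio * mean_knn.std()
    keep = mean_knn <= threshold                     # boolean mask of inliers
    return points[keep], keep
```

A point floating far from the surface has a large mean neighbour distance and falls above the threshold. Points on a densely sampled surface stay below it.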

When 3D is not worth the engineering cost

Despite the sample efficiency arguments, most deployed robot policies in 2026 still use 2D images. Three reasons:

Foundation model priors are 2D. CLIP, DINOv2, SigLIP, and every large vision encoder are trained on 2D images. No 3D foundation model exists at comparable scale. Using 3D inputs means giving up the most powerful visual priors available.
Depth sensors are noisy. RGB-D cameras (Intel RealSense, Azure Kinect) have depth noise of 1–5mm at close range and 10–30mm at 1m. Shiny, transparent, and dark surfaces produce depth holes. Point clouds derived from noisy depth are themselves noisy — the 3D input is less clean than the 2D input in practice.
Most tasks don't need 3D equivariance. For pick-and-place with a top-down camera and a fixed set of objects, a 2D policy with 200 demos works fine. The equivariance advantage only manifests when objects appear in diverse orientations — which is common in research benchmarks but less common in structured industrial settings.

The practical decision: use 3D only when the task involves diverse 3D orientations (bin picking, random object poses on a table) AND you have fewer than 100 demos AND you do not need language conditioning. Otherwise, bet on 2D + more data.

3D is to robot policies what convolution was to vision: a re-parameterization that does not give you new capabilities, but lets the network learn the capabilities it was always supposed to learn from a fraction of the data.

12·7 DVA — Direct Video Action

Imagine success, then figure out the motor commands. A two-model architecture that uses video prediction as an intermediate representation for robot control.

Every policy we have discussed so far maps observations directly to actions. Direct Video Action (DVA) takes a detour: first predict what the future looks like, then predict what actions would produce that future. The bet is that video prediction — trained on internet-scale data — captures richer physics, geometry, and task understanding than any robot-only dataset can provide.

The two-model architecture

DVA decomposes the policy into two learned components:

Video model $\mathcal{V}$. A causal video diffusion (or autoregressive) model that takes the current observation frame(s) $o_t$ and a task specification (language instruction $\ell$ or goal image $g$) and generates $N$ future frames: $\hat{I}_{t+1}, \hat{I}_{t+2}, \ldots, \hat{I}_{t+N}$. This is the "imagination" — it predicts what success looks like, without knowing anything about joints or motors.
Inverse dynamics model (IDM) $\phi$. A small network that takes two consecutive frames $(I_t, I_{t+1})$ and predicts the action $a_t$ that would move the robot from the scene in $I_t$ to the scene in $I_{t+1}$. This is the "execution" — it converts visual plans into motor commands.

At inference, the full pipeline is: observe $o_t$ → video model generates $N$ future frames → IDM converts each adjacent pair to an action → execute the first $K$ actions → re-observe and replan. The receding-horizon structure is identical to Diffusion Policy's action chunking, except the "chunk" is derived from imagined video rather than directly predicted.

Why video prediction as an intermediate representation?

Internet-scale pretraining. Billions of video frames exist on the internet. None of them have action labels, but all of them teach physics: objects fall, liquids pour, hands grasp. A video model pretrained on this data has priors that no robot dataset can match.
Task-agnostic planning. The video model doesn't need to know what a robot arm is. It learns "if you see a hand approaching a cup and the instruction says 'pick up the cup,' the next frames show the hand grasping the cup." The IDM handles the embodiment-specific translation.
Visual reasoning for free. Complex tasks that require spatial reasoning (stacking, insertion, tool use) are hard to express in action space but natural in pixel space. The video model can "see" the solution before the IDM computes the trajectory.

Training pipeline

The two models are trained separately, on different data:

Component	Training data	Loss	Scale
Video model $\mathcal{V}$	Internet video + robot video	Diffusion / AR next-frame prediction	Billions of frames
IDM $\phi$	Robot demonstrations only	MSE on predicted actions	Thousands of trajectories

The video model is pretrained on internet video (no actions needed), then optionally fine-tuned on robot video to improve visual realism in the robot's workspace. The IDM is trained only on robot demonstrations where ground-truth actions are available. This separation is the key economic insight: you can scale the video model with cheap, abundant internet data while the IDM stays small and robot-specific.

Conditioning the video model

The video model conditions on the current frame(s) plus a task specification. Two conditioning modes:

Language conditioning. A text encoder (CLIP, T5) embeds the instruction $\ell$ into tokens that cross-attend into the video diffusion process. "Pick up the red cup and place it on the saucer" → the model generates frames showing exactly that.
Goal-image conditioning. The model is given a goal frame $g$ showing the desired end state. It generates intermediate frames that connect the current observation to the goal. This is visual planning in the literal sense: the model fills in the trajectory between "here" and "there."

The IDM loss

In plain English: show the model two consecutive photos of the workspace. Ask it: "what did the robot do between these two snapshots?" The model guesses a 7D action vector, and you penalize it for how far off it was from the action the robot actually took. That is the entire training signal.

The inverse dynamics model is trained to predict the action that transitions between two consecutive frames. Given a frame pair $(I_t, I_{t+1})$ from a robot trajectory and the ground-truth action $a_t^*$ (the action the robot actually executed between those frames):

Inverse dynamics objective $$ \mathcal{L}_{\text{IDM}} = \mathbb{E}_{(I_t, a_t^*, I_{t+1}) \sim \mathcal{D}}\Big[\big\| a_t^* - \phi(I_t, I_{t+1}) \big\|^2 \Big] $$

$(I_t, a_t^*, I_{t+1})$ — a transition tuple from the robot demonstration dataset $\mathcal{D}$. $I_t$ is the observation image at time $t$, $a_t^*$ is the action the robot executed, and $I_{t+1}$ is the resulting observation.
$\phi(I_t, I_{t+1})$ — the IDM's predicted action. Takes two consecutive images and outputs the action that would transition the scene from $I_t$ to $I_{t+1}$. Architecturally, this is typically a ResNet or ViT that encodes both frames, concatenates the features, and passes them through an MLP.
$\| \cdot \|^2$ — squared L2 norm (MSE). Works well because the IDM predicts a single deterministic action per frame pair. Multimodality is not an issue here — given two specific frames, there is essentially one correct action.
$a_t^*$ — the ground-truth robot action. Typically a 7D vector: $[\Delta x, \Delta y, \Delta z, \Delta \text{roll}, \Delta \text{pitch}, \Delta \text{yaw}, \text{gripper}]$ in end-effector space.

In code: loss = F.mse_loss(idm(frame_t, frame_t1), action_t) — that is literally it. The IDM is a small encoder-MLP that takes two 224×224 images and outputs a 7D action vector. Training takes 2–4 hours on a single GPU. The failure mode to watch: if the two frames look nearly identical (slow motion phases), the predicted action is ill-conditioned and noisy.

The IDM is small (10–50M parameters), fast to train (a few hours on a single GPU), and does not need internet data. It only needs to learn the mapping from "visual change" to "motor command" for a specific robot embodiment.
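To make the one-line loss concrete, here is a minimal PyTorch sketch of an IDM and a single training step. The class name `TinyIDM`, the layer sizes, and the small convolutional encoder are assumptions for illustration; in practice the encoder would be a pretrained ResNet or ViT as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyIDM(nn.Module):
    """Shared encoder applied to both frames, then an MLP to a 7D action."""
    def __init__(self, feat_dim=128, action_dim=7):
        super().__init__()
        # Stand-in encoder (assumption); swap in a pretrained ResNet/ViT in practice.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=8), nn.ReLU(),
            nn.Conv2d(32, feat_dim, kernel_size=4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        # Encode each frame with the same weights and concatenate the features.
        z = torch.cat([self.encoder(frame_t), self.encoder(frame_t1)], dim=-1)
        return self.mlp(z)

def idm_training_step(idm, optimizer, frame_t, frame_t1, action_t):
    """One gradient step on the inverse dynamics objective; returns the loss value."""
    loss = F.mse_loss(idm(frame_t, frame_t1), action_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Inputs are batches of shape (B, 3, 224, 224) and the target is (B, 7). Note that `F.mse_loss` averages over all elements by default, so the reported number is the squared L2 norm divided by the action dimension (and batch size).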

Video model conditioning: the mechanics

The video model is typically a causal video diffusion transformer (similar to Sora's architecture). It conditions on two signals simultaneously:

Language conditioning via cross-attention. The text instruction $\ell$ is encoded by a frozen text encoder (CLIP or T5) into a sequence of text tokens $z_\ell \in \mathbb{R}^{M \times d}$, where $M$ is the number of text tokens and $d$ is the embedding dimension. At every attention layer of the video diffusion model, the video tokens cross-attend to these text tokens: $\text{Attn}(Q_{\text{video}}, K_{\text{text}}, V_{\text{text}})$. This is the same mechanism that lets Stable Diffusion condition on text prompts — the video model learns to align its generated frames with the semantic content of the instruction.

Current-frame conditioning via concatenation. The current observation frame $o_t$ is typically concatenated as the first frame of the sequence that the video model generates. The model is trained to predict frames $\hat{I}_{t+1}, \ldots, \hat{I}_{t+N}$ conditioned on $I_t$ being real. This grounds the generation: the model cannot hallucinate an entirely different scene, because the first frame is anchored to reality.
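Both conditioning paths reduce to simple tensor operations. The NumPy sketch below shows single-head cross-attention from video tokens to text tokens, and the concatenation of the real current frame in front of the noisy future latents. The projection matrices, shapes, and function names are illustrative assumptions; a real model uses multi-head attention inside each transformer block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video_tokens, text_tokens, W_q, W_k, W_v):
    """Attn(Q_video, K_text, V_text): every video token reads from the text tokens."""
    Q = video_tokens @ W_q                     # (P, d) queries from video latents
    K = text_tokens @ W_k                      # (M, d) keys from the text tokens
    V = text_tokens @ W_v                      # (M, d) values from the text tokens
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (P, M), each row sums to 1
    return weights @ V, weights

def build_sequence(current_latent, noisy_future_latents):
    """Current-frame conditioning: prepend the clean real frame to the noisy futures."""
    return np.concatenate([current_latent[None], noisy_future_latents], axis=0)
```

The attention weights have shape (P, M): one distribution over the $M$ text tokens for each of the P video tokens. The sequence returned by `build_sequence` has one more frame than the number of futures, and only the future frames are denoised.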

The IDM generalization gap

The IDM is trained on real frame pairs from robot demonstrations. At inference, it must process imagined frame pairs from the video model. These imagined frames look slightly different from real frames — subtly blurred textures, imperfect lighting, occasionally physically impossible configurations. The IDM must generalize across this domain gap.

Three mitigation strategies:

Train the IDM on augmented frame pairs. Apply color jitter, Gaussian blur, random crops, and compression artifacts to the real training frames. This makes the IDM robust to the kinds of imperfections the video model produces.
Fine-tune the video model on robot data. After internet-scale pretraining, fine-tune the video model on the robot's actual camera feed. This closes the visual domain gap between imagined and real frames, making the IDM's job easier.
Use a discriminator to filter impossible frames. Train a small classifier to distinguish physically plausible frames from implausible ones (hand passing through table, objects floating). Reject imagined frame sequences that fail the plausibility check and re-sample. This adds inference cost but prevents the IDM from receiving garbage inputs.
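The third strategy is a rejection-sampling loop around the video model. A minimal sketch, with hypothetical callables `generate` (the video model) and `is_plausible` (the trained classifier) and an assumed retry limit:

```python
def sample_plausible_video(generate, is_plausible, max_tries=4):
    """Resample imagined frame sequences until every frame passes the check."""
    for attempt in range(1, max_tries + 1):
        frames = generate()
        if all(is_plausible(frame) for frame in frames):
            return frames, attempt
    return None, max_tries      # caller should fall back, e.g. stop and re-observe

# Toy stand-ins: a "frame" is just the gripper height in metres, and a frame is
# implausible when the gripper is below the table surface (height < 0).
candidates = iter([[0.05, 0.02, -0.01], [0.05, 0.03, 0.01]])
frames, tries = sample_plausible_video(lambda: next(candidates), lambda h: h >= 0.0)
```

Each retry costs one full video generation, so the retry limit trades inference latency against the chance of handing the IDM a physically impossible frame.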

The error propagation problem. DVA chains two learned models in series. If the video model imagines frame $\hat{I}_{t+3}$ where the gripper has passed through the table, the IDM dutifully predicts the action that would achieve this impossible configuration — which, when executed on the real robot, produces a collision or a wild trajectory. This cascading failure mode is DVA's Achilles heel. The receding-horizon replanning (re-observe reality every $K$ steps) limits the blast radius, but does not eliminate it. A single bad imagined frame within the executed window can cause a real-world failure before the system has a chance to replan.

Worked example: IDM predicts $\Delta$pose from two 224×224 frames. Two consecutive frames from a Franka robot picking a block. Frame $I_t$ shows the gripper 5cm above the block. Frame $I_{t+1}$ shows it 3cm above.
Encoding. Both frames pass through a frozen DINOv2 ViT-B/14, producing 257 tokens each (256 patch + 1 CLS). We take the CLS tokens: $z_t \in \mathbb{R}^{768}$, $z_{t+1} \in \mathbb{R}^{768}$.
Feature fusion. Concatenate: $[z_t; z_{t+1}] \in \mathbb{R}^{1536}$. Pass through a 3-layer MLP: $1536 \to 512 \to 256 \to 7$.
Prediction. $\hat{a}_t = \phi(I_t, I_{t+1}) = [0.001, -0.002, -0.020, 0.003, -0.001, 0.000, 0.85]$. The dominant component is $\Delta z = -0.020$ (2cm downward motion), matching the visual change between frames. The gripper value 0.85 means "mostly closed": the robot is about to grasp.
Ground truth. $a_t^* = [0.000, -0.001, -0.022, 0.002, 0.000, 0.001, 0.85]$.
Loss. The squared error is $\|a_t^* - \hat{a}_t\|^2 = 0.001^2 + 0.001^2 + 0.002^2 + 0.001^2 + 0.001^2 + 0.001^2 + 0.0^2 = 9 \times 10^{-6}$, or about $1.3 \times 10^{-6}$ when averaged over the 7 action dimensions (which is what a mean-reduced MSE reports). Tiny: the IDM has learned this mapping well.
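The arithmetic in this worked example can be checked directly in plain Python. The only point to watch is the reduction: the squared L2 norm sums the per-dimension errors, while a mean-reduced MSE divides that sum by the 7 action dimensions.

```python
a_true = [0.000, -0.001, -0.022, 0.002, 0.000, 0.001, 0.85]
a_pred = [0.001, -0.002, -0.020, 0.003, -0.001, 0.000, 0.85]

# Per-dimension squared errors between prediction and ground truth.
sq_err = [(p - t) ** 2 for p, t in zip(a_pred, a_true)]

sum_sq = sum(sq_err)               # squared L2 norm, about 9e-6
mean_sq = sum_sq / len(sq_err)     # mean over 7 dimensions, about 1.3e-6

print(f"sum of squares: {sum_sq:.2e}, mean: {mean_sq:.2e}")
```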

IDM architecture choices

The inverse dynamics model is small and fast but its design still matters. Two architectures dominate:

Siamese encoder + MLP. Both frames pass through the same frozen encoder (shared weights). The CLS tokens or pooled features are concatenated and fed to a 3-layer MLP that predicts the action. This is the simplest and most common design. Pros: fast, easy to implement, leverages pretrained features. Cons: the concatenated CLS tokens lose spatial information — fine-grained motion (rotation, small displacements) is harder to predict.

Feature-difference encoder. Both frames are encoded, then the difference of their feature maps (or spatial tokens) is computed: $\Delta z = z_{t+1} - z_t$. This difference map is processed by a small CNN or transformer to predict the action. Pros: the subtraction highlights what changed between frames, suppressing static background. Cons: requires spatial features (not just CLS tokens), and the subtraction is sensitive to encoder alignment.

The IDM's accuracy is bounded by the visual resolution of the encoder and the magnitude of the actions. Small actions ($<$1mm displacement between frames) produce nearly identical frames, making the prediction ill-conditioned. This is why DVA typically uses longer frame gaps ($\hat{I}_t$ vs $\hat{I}_{t+2}$ instead of $\hat{I}_{t+1}$) for slow-motion phases of the task.
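One way to choose the frame gap is to widen it until the two frames differ by a visible amount. The sketch below is an assumption about how this could be done, not a documented DVA procedure; the function name, the mean-absolute-difference measure, and the `min_change` threshold are all illustrative.

```python
import numpy as np

def pick_frame_gap(frames, start, min_change=0.01, max_gap=4):
    """Smallest gap g such that frames[start + g] differs visibly from frames[start]."""
    for gap in range(1, max_gap + 1):
        if start + gap >= len(frames):
            break
        # Mean absolute pixel difference as a crude measure of visual change.
        change = np.abs(frames[start + gap] - frames[start]).mean()
        if change >= min_change:
            return gap
    # Nothing changed enough: fall back to the widest gap that is available.
    return min(max_gap, len(frames) - 1 - start)
```

An action predicted over a gap of `g` frames covers `g` control steps, so the controller has to execute it as one larger motion or split it into `g` smaller ones.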

Inference: the full loop

1. Observe. Capture current frame $o_t$ and receive language instruction $\ell$.
2. Imagine. Video model generates $N$ future frames: $\hat{I}_{t+1}, \ldots, \hat{I}_{t+N}$. Typical $N = 8$–$16$.
3. Translate. IDM converts each adjacent pair to an action: $\hat{a}_{t+i} = \phi(\hat{I}_{t+i}, \hat{I}_{t+i+1})$ for $i = 0, \ldots, N-1$, where $\hat{I}_t = o_t$ is the real current frame. This yields $N$ actions.
4. Execute. Send the first $K$ actions ($K \leq N$) to the robot. Typical $K = 4$–$8$.
5. Replan. After $K$ steps, re-observe and repeat from step 1.

The replanning loop is essential. Video predictions degrade over long horizons — small errors compound frame by frame. By re-observing reality every $K$ steps, DVA corrects for drift. This is the same receding-horizon principle as Diffusion Policy's action chunking, just applied to imagined frames instead of directly predicted actions.
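The effect of the replanning interval can be seen in a toy one-dimensional version of the loop. Everything here is a stand-in chosen for illustration: a "frame" is just the gripper height in metres, `video_model` is an ideal 1cm-per-step descent plus an assumed error that grows quadratically with the horizon, and `idm` is the exact height difference.

```python
def video_model(frame, n_frames, step=-0.01, drift=0.0005):
    """Toy imagination: ideal descent plus an error that grows with the horizon."""
    return [frame + step * i + drift * i * i for i in range(1, n_frames + 1)]

def idm(frame_a, frame_b):
    """Toy inverse dynamics: the action is the height change between two frames."""
    return frame_b - frame_a

def run_dva(start, total_steps, n_frames, k_execute):
    height, t = start, 0
    while t < total_steps:                           # each pass = observe + replan
        frames = [height] + video_model(height, n_frames)   # real frame anchors the plan
        actions = [idm(a, b) for a, b in zip(frames, frames[1:])]
        for action in actions[:k_execute]:           # execute only the first K actions
            height += action
            t += 1
            if t == total_steps:
                break
    return height

ideal = 0.10 - 8 * 0.01                              # 0.02 m after 8 perfect steps
long_chunk = run_dva(0.10, total_steps=8, n_frames=9, k_execute=8)
short_chunk = run_dva(0.10, total_steps=8, n_frames=9, k_execute=2)
```

With $K = 8$ the robot ends about 3.2cm away from the ideal height. With $K = 2$ the imagination restarts from reality four times and the final error shrinks to about 0.8cm. Real video models do not drift this cleanly, but the qualitative point carries over: shorter execution windows leave less time for prediction errors to accumulate.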

Worked example: error propagation in DVA. The video model generates 8 future frames for a "pick up cup" task. Frames 1–5 are visually realistic: the gripper approaches the cup from above. Frame 6 has a subtle error: the gripper's shadow is missing, and the cup appears slightly transparent. Frame 7: the gripper appears to pass through the cup's rim (physically impossible). Frame 8: the cup is "grasped" but the fingers are in the wrong position. IDM on frames 5→6: predicts $\Delta z = -0.015$m (continue descending). Reasonable — the visual change is subtle. IDM on frames 6→7: predicts $\Delta z = -0.030$m (aggressive descent through the cup). The IDM has never seen a gripper pass through a solid object in training, so it predicts the best-fit action for the impossible visual transition. This action will cause a collision on the real robot. IDM on frames 7→8: predicts grasp closure. But the gripper is in the wrong position from the previous bad action. With replanning ($K = 4$): only frames 1–4 are executed. The robot re-observes after frame 4. The bad frames (6–8) are never executed. The replanning catches the error before it matters. This is why short execution horizons ($K = 4$–$6$) are critical for DVA — they limit the time window during which video prediction errors can accumulate.

Computational cost of DVA

The elephant in the room: video generation is expensive. Generating 8 frames at 256×256 resolution with a video diffusion model (50 denoising steps) takes 2–5 seconds on an A100 GPU. For a 10Hz control loop with $K = 4$ executed actions (0.4s between replans), the video model must generate in under 400ms — which requires either a heavily distilled model, fewer denoising steps (10–15 with quality loss), or a latent-space video model that generates at lower resolution and upsamples.

This computational constraint is the primary reason DVA has not replaced direct action prediction. A Diffusion Policy generates a chunk of 7-dimensional actions in ~30ms. A video model encodes the same plan in $256 \times 256 \times 3 \times 8 = 1.57$M pixel values, at roughly 100× the cost. The efficiency argument is clear: predicting in action space is vastly cheaper than predicting in pixel space, unless the pixel-space predictions carry internet-scale priors that the action-space model cannot access.
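The budget arithmetic from this section, written out in plain Python. The 8-step action chunk used for the size comparison is an assumption for illustration; the other numbers come from the text.

```python
control_hz = 10
k_executed = 4
replan_budget_s = k_executed / control_hz            # 0.4 s between replans

gen_time_s = (2.0, 5.0)                              # 50-step video diffusion on an A100
speedup_needed = tuple(t / replan_budget_s for t in gen_time_s)   # roughly 5x to 12.5x

video_values = 256 * 256 * 3 * 8                     # 1,572,864 pixel values
action_values = 7 * 8                                # assumed 8-step chunk of 7D actions
size_ratio = video_values / action_values            # about 28,000x more outputs

print(replan_budget_s, speedup_needed, video_values, round(size_ratio))
```

The video model therefore needs a 5–12× speedup just to fit the replanning window, before leaving any time for the IDM or the rest of the control stack.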

Video model architectures for DVA

The video generation component in DVA pipelines uses one of three architectures, each with different tradeoffs:

Video diffusion transformers (UniPi, UniSim). The dominant architecture. A causal transformer operates on latent tokens (compressed from pixels by a VAE encoder). The diffusion process adds and removes noise from the latent sequence. Conditioning on text and current frame is via cross-attention. Typical model size: 1–3B parameters. Generation cost: 2–8 seconds for 8 frames at 256×256.

Autoregressive video transformers (Genie). Predict the next frame token-by-token, conditioned on previous frames. Faster per-frame generation but lower visual quality than diffusion. Action-conditioned variants (Genie 2) can also generate video given actions, inverting the DVA pipeline — useful for building world models.

Subgoal image generators (SuSIE). Instead of generating a full video, generate a single goal image showing the desired outcome. A low-level policy then navigates from the current observation to the goal image. This dramatically reduces generation cost (one image vs eight) and avoids temporal consistency issues, at the expense of losing intermediate trajectory information.

Approach	Output	Generation cost	Planning quality	Error accumulation
Full video diffusion	$N$ future frames	High (2–8s)	High (dense trajectory)	High (per-frame errors compound)
Autoregressive video	$N$ future frames	Medium (1–3s)	Medium	High
Single subgoal image	1 goal frame	Low (0.3–1s)	Low (no trajectory)	Low (no cascading)
Trajectory sketch	2D overlay on current frame	Very low (<0.5s)	Medium (2D path only)	Low

What DVA gains and loses

	DVA (video + IDM)	Diffusion Policy / ACT (direct)
Internet pretraining	Yes — video model uses billions of internet frames	No — robot data only
Visual reasoning	Strong — video model plans in pixel space	Implicit only — reasoning must emerge in action space
Cross-embodiment	Video model transfers; only IDM is embodiment-specific	Entire policy is embodiment-specific
Error propagation	Two models in series — video errors compound through IDM	Single model — no cascading errors
Inference cost	High — video generation is expensive (diffusion over pixels)	Low — diffusion over action vectors (7D vs 224×224×3)
Action precision	Limited by video resolution and IDM accuracy	Direct — sub-millimeter possible

When to use DVA vs direct action prediction

The decision between DVA and direct action prediction (Diffusion Policy, VLA) comes down to three factors:

1. Data availability. DVA shines when you have abundant internet video but limited robot demonstrations. The video model can be pretrained on billions of internet frames (no actions needed), and only the small IDM requires robot data. If you have <50 robot demonstrations but the task involves common objects, DVA's internet priors give it a significant edge.

2. Task complexity. DVA's video prediction excels at tasks requiring spatial reasoning: stacking, insertion, tool use, arrangement. The video model can "see" the solution in pixel space before the IDM computes the trajectory. Direct action prediction struggles with these tasks because reasoning about spatial outcomes in action space is harder than reasoning in pixel space.

3. Latency requirements. DVA's inference cost is dominated by video generation (2–8 seconds). Direct action prediction runs in <50ms. If the task requires reactive control (catching objects, responding to perturbations), DVA is not viable. If the task is slow enough to tolerate 2-second planning pauses between execution phases, DVA is competitive.

The DVA paradigm is most compelling not as a replacement for direct action prediction, but as a pretraining strategy. Train a video model on internet data, use it to generate synthetic robot trajectories (via an IDM), and use those synthetic trajectories to pretrain a direct action prediction model. This pipeline — internet video → imagined robot trajectories → pretrained policy — combines DVA's data advantage with direct prediction's inference speed.

The family tree

DVA is not one paper but a paradigm. The key members:

UniPi (Du et al., 2023). The founding paper. Text-conditioned video diffusion model + IDM. Demonstrated the idea on simulated tasks. arXiv:2302.00111
SuSIE (Black et al., 2023). Generates a single subgoal image rather than a full video, then uses a low-level policy to reach it. Cheaper and more robust than full video generation. The subgoal is generated by an image-editing model conditioned on the instruction. arXiv:2310.10639
RT-Trajectory (Gu et al., 2023). Draws a coarse trajectory sketch over the current image rather than predicting future frames. The policy conditions on this 2D trajectory overlay. The "video" is simplified to a single annotated frame. arXiv:2311.01977
Genie / Genie 2 (Bruce et al., 2024). Learned world models that can generate interactive video from actions. Genie operates in the opposite direction — actions → video — but the shared infrastructure (causal video transformers, latent dynamics) is the same. arXiv:2402.15391
UniSim (Yang et al., 2023). A universal simulator that generates realistic video conditioned on diverse actions (robot commands, human actions, camera motion). Can serve as both the video model in a DVA pipeline and a training simulator. arXiv:2310.06114

The deeper lesson

DVA is a bet on representation. The claim is that pixels — not actions, not latent vectors, not language — are the natural intermediate representation for robot planning, because pixels are what internet-scale pretraining understands best. The counterargument is that pixels are wasteful: you generate 224×224×3×N values only to extract 7×N action values from them. The answer, for now, is that the waste is worth it when the pretraining priors are strong enough. As action-space foundation models (VLAs) improve, the balance may shift — but in 2026, the video-prediction camp remains the only group that can leverage truly internet-scale data for robot control.

DVA in the broader context

DVA is best understood as one instance of a broader pattern: using a foundation model's representation space as an intermediate layer for robot control. The foundation model provides priors that no robot dataset can match; the robot-specific component provides the embodiment mapping. The variants differ in which foundation model and which intermediate representation:

Paradigm	Foundation model	Intermediate representation	Robot-specific component
DVA (video)	Video diffusion model	Imagined future frames	Inverse dynamics model (IDM)
VLA (language)	Vision-language model	Hidden states / tokens	Action head (diffusion/flow/discrete)
Language planning	LLM	Text plans / code	Low-level policy per primitive
Value-map planning	LLM + VLM	3D scalar fields	Motion planner (MPC/optimization)

The trend is toward unifying these approaches: a single VLM that can reason in language, predict in video, and act in continuous space simultaneously. Gemini Robotics 1.5 is the first model to attempt all three. Whether the unified approach outperforms specialized decompositions remains an open empirical question as of 2026.

DVA decomposes robot policy learning into "what should the world look like next?" and "what motor commands make that happen?" The first question has internet-scale training signal. The second has a simple, well-posed answer. The decomposition is the insight.

12·D The VLA zoo

The full landscape of vision-language-action models — the ones that shipped, the ones that scaled down, and the ones that proved cross-embodiment is real.

Section 12 gave the lineage table. This section opens each row and looks inside. The field moved fast enough in 2024–2025 that "VLA" now covers at least four distinct architectural bets: cross-embodiment pre-training, small-VLA distillation, bimanual specialists, and open-source generalists. Knowing which bet each model makes is the difference between picking the right starting point and wasting a quarter.

Cross-embodiment pre-training

HPT — Heterogeneous Pre-trained Transformers

Wang et al., 2024. The core idea: different robots have different observation and action spaces, but the task semantics are shared. HPT handles heterogeneity by giving each embodiment its own lightweight "stem" encoder — a small MLP or CNN that projects that robot's observations into a shared token space — and a shared transformer trunk that processes the tokens regardless of where they came from. Actions are decoded by per-embodiment "head" MLPs.

The architecture is deliberately modular. Adding a new robot means training a new stem and head while keeping the trunk frozen. This is the "plug-and-play embodiment" idea: the trunk learns task-level abstractions (approach, grasp, place) and the stems/heads handle the geometry of each particular arm.

In plain English: different robots plug different adapters into the same brain. A Franka arm and a UR5 have different cameras and different joints, but the concept of "reach toward the red cup" is the same. Each robot gets its own small translator (the stem) that converts its sensors into a common language the shared brain speaks, and another small translator (the head) that converts the brain's output back into that robot's joint commands.

HPT forward pass $$ z_t = \text{Stem}_e(o_t^e), \qquad h = \text{Trunk}(z_{1:T}), \qquad a_t = \text{Head}_e(h_t) $$

$o_t^e$ — the observation for embodiment $e$ at time $t$. Different embodiments have different observation shapes (number of cameras, proprioception dimensions).
$\text{Stem}_e$ — the per-embodiment encoder. Projects $o_t^e$ into a fixed-dimensional token $z_t$. Typically a 2–3 layer MLP for proprioception and a small CNN or frozen ViT for images.
$\text{Trunk}$ — the shared transformer. Processes tokens from all embodiments identically. This is where cross-embodiment transfer happens.
$\text{Head}_e$ — the per-embodiment action decoder. Maps the trunk's output back to the action space of embodiment $e$.

In code: z = stem_franka(obs); h = trunk(z); a = head_franka(h) — three calls, three modules. To add a new robot, you write a new stem and head (tiny MLPs, ~2M params each), freeze the 300M-param trunk, and fine-tune on 50 demos. Training: ~1 hour on a single GPU.
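The stem/trunk/head plumbing can be sketched with plain NumPy matrices. This is a shape-level illustration only: the observation and action sizes are made up, and the "trunk" is a two-layer per-token MLP standing in for the shared transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                                         # shared token width (assumed)

def linear(in_dim, out_dim):
    """Random weight matrix standing in for a trained layer."""
    return rng.normal(0.0, in_dim ** -0.5, size=(in_dim, out_dim))

# Per-embodiment stems and heads; the dimensions are illustrative only.
stems = {"franka": linear(15, D), "ur5": linear(12, D)}
heads = {"franka": linear(D, 7), "ur5": linear(D, 6)}
trunk = [linear(D, D), linear(D, D)]           # shared across every robot

def hpt_forward(embodiment, obs_seq):
    z = obs_seq @ stems[embodiment]            # (T, obs_dim) -> (T, D) shared tokens
    h = z
    for W in trunk:
        h = np.tanh(h @ W)                     # same trunk weights for every embodiment
    return h @ heads[embodiment]               # (T, D) -> (T, action_dim)

# Adding a robot only adds a stem and a head; the trunk is left untouched.
stems["xarm"], heads["xarm"] = linear(9, D), linear(D, 7)
```

Calling `hpt_forward("franka", obs)` and `hpt_forward("ur5", obs)` routes different observation shapes through the same trunk and returns actions of different widths, which is the point of the modular design.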

HPT was pre-trained on data from over 50 robot embodiments. The result: attaching a new stem and head to the pre-trained trunk and fine-tuning on a novel robot with just 50 demos beats training from scratch with 200. The trunk is genuinely learning transferable motor abstractions, not just averaging.

CrossFormer

Doshi et al., 2024. Same thesis as HPT — cross-embodiment pre-training with heterogeneous inputs — but different architectural choices. CrossFormer uses a single transformer that ingests all modalities (images, proprioception, language) as tokens, with learned "embodiment embeddings" added to each token to tell the model which robot produced it. No separate stems; the transformer does the alignment internally.

The tradeoff: CrossFormer is simpler to implement (one model, one forward pass) but harder to extend to new embodiments without retraining. HPT's modular stems make adding a new embodiment cleaner (train a stem and head, leave the trunk alone); CrossFormer's monolithic design makes within-distribution performance slightly higher.

The small-VLA wave

TinyVLA

Wen et al., 2024. The first serious attempt to ask: how small can a VLA be and still work? TinyVLA uses a 1B-parameter backbone (a distilled VLM) and shows that with careful LoRA fine-tuning and aggressive data curation, a 1B model matches the 7B OpenVLA on standard benchmarks. The insight is that most of the 7B parameters are dedicated to language understanding that robot control does not exercise — a smaller model with the same visual and motor capacity suffices.

SmolVLA

HuggingFace, 2025. Pushed the frontier further: 450M parameters, flow-matching action expert, and performance matching models 10× its size on LIBERO and SimplerEnv. The recipe: SmolVLM as the vision-language backbone (itself a distillation of larger VLMs), a lightweight flow-matching expert for actions, and LoRA adapters for task-specific fine-tuning. SmolVLA runs on a single consumer GPU at inference — important for labs that don't have an A100 per robot cell.

The small-VLA wave is not about compression for its own sake. It is about deployment economics. A 7B VLA needs a $10K GPU per robot. A 450M VLA runs on a $2K Jetson. When you're deploying 100 robots, the difference is $800K.

Bimanual and mobile specialists

Mobile ALOHA

Fu et al., 2024. A mobile base with two ALOHA arms and whole-body teleoperation. The architecture is ACT (a conditional VAE + transformer), not a VLA — but the training trick is the headline: co-training. Mixing mobile ALOHA trajectories with static ALOHA trajectories in a single dataset improves success rates on both setups, even though the embodiments are different. The shared representation of bimanual coordination transfers across the mobile/static divide.

The co-training result is counterintuitive. A policy trained on mobile+static data outperforms one trained on mobile data alone, even on mobile tasks. The explanation: static data provides more diverse manipulation examples that the shared bimanual trunk can leverage, and the mobile base trajectories provide context that the static policy never sees. Both benefit.

ALOHA 2

Aldaco et al., 2024. Hardware iteration, not an architecture paper. Better teleoperation (lower friction, wider range of motion), better cameras (higher resolution, wider FOV), and a systematic study of data quality vs quantity. The headline result: 50 high-quality demos (smooth, consistent, no hesitation) outperform 200 mediocre demos (jittery, varied strategy). The lesson generalizes beyond ALOHA: if you're collecting data, train your teleoperators.

RDT-1B

Liu et al., 2024. A 1.2B-parameter Diffusion Transformer built specifically for bimanual manipulation. The architecture is a standard DiT (the same backbone used in image generation) adapted for action sequences: noised action chunks as input tokens, cross-attention to image and language tokens, iterative denoising at inference. Trained on over 1M bimanual episodes from multiple robot platforms.

RDT-1B's significance is proving that the DiT architecture — which was designed for image generation — works for action generation at scale. The denoising formulation handles bimanual multimodality naturally (two arms have highly multimodal coordination patterns), and the 1B parameter count is large enough to absorb cross-embodiment variation without the modular stem/head design of HPT.

The open-source generalist

Octo

Octo Team, 2024. The first serious open-source generalist robot policy. Transformer backbone (27M or 93M parameters), trained on 800K episodes from the Open X-Embodiment dataset. Supports language conditioning, goal-image conditioning, or both. Action head is a diffusion model (continuous actions) or a discrete tokenizer, selectable at fine-tuning time.

Octo's design philosophy is flexibility over performance. It is not the best policy on any single benchmark, but it is the only open model that can be fine-tuned to a new robot with 50 demos in an afternoon. The provided fine-tuning scripts, data loaders, and evaluation harness make it the practical starting point for most academic labs in 2025.

Octo: the practical starting point

Octo deserves special attention because it is the model most academic labs will actually use. The architecture is deliberately simple: a standard transformer backbone (27M or 93M parameters) with separate tokenizers for image, language, and proprioception inputs. The action head is pluggable — you can choose a diffusion head (continuous actions) or a discrete tokenizer at fine-tuning time.

The fine-tuning workflow is designed for accessibility:

Record 50–200 demonstrations on your robot using any teleoperation method.
Convert to the RLDS format (a standard data format for robot learning datasets).
Run the provided fine-tuning script: python finetune.py --data your_data --model octo-base. Typical time: 2–4 hours on a single A100.
Deploy with the inference script. The model runs at ~10Hz on an RTX 3090.

Octo's value lies less in benchmark numbers than in the complete pipeline it ships, from data conversion to deployment, with documentation, scripts, and community support. For a lab that wants to get a generalist policy running on its robot in a week rather than a quarter, Octo is the answer.

The full comparison

Model	Params	Action head	Embodiments	Data scale	Open
HPT	~300M trunk	Per-embodiment MLP	50+ (stem per embodiment)	150K+ episodes	Yes
CrossFormer	~130M	Diffusion / discrete	10+ (embodiment embeddings)	900K+ episodes	Yes
TinyVLA	1B	Discrete tokens	Single (fine-tune per robot)	Fine-tune from OpenVLA data	Yes
SmolVLA	450M	Flow matching expert	Single (LoRA per task)	Fine-tune, LIBERO/SimplerEnv	Yes
Mobile ALOHA	~80M (ACT)	CVAE + transformer	Mobile + static bimanual	~800 co-trained episodes	Yes
ALOHA 2	~80M (ACT)	CVAE + transformer	Static bimanual	50–200 per task	Yes
RDT-1B	1.2B	Diffusion (DiT)	Multiple bimanual platforms	1M+ episodes	Yes
Octo	27M / 93M	Diffusion / discrete	22 robots (OXE)	800K episodes	Yes
OpenVLA	7B	Discrete tokens	Single (fine-tune)	970K episodes	Yes
π₀	3.3B	Flow matching	7+ embodiments	10K+ hours	Partial
GR00T N1	2.2B	Diffusion / flow	Humanoid-focused	Real + synthetic + video	Partial

The practical recipe: LoRA fine-tuning a VLA

The standard workflow for deploying a VLA on a new robot in 2026 is not training from scratch but LoRA fine-tuning a pretrained checkpoint. Low-Rank Adaptation (LoRA) freezes the pretrained weights and injects small trainable low-rank matrices into the attention layers. The result: you update a small fraction of the parameters (typically 0.2–2%) while preserving the pretrained knowledge.

LoRA fine-tuning recipe for a VLA

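import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Load pretrained VLA checkpoint. OpenVLA ships custom modeling code,
# so trust_remote_code=True is needed.
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Configure LoRA: inject low-rank adapters into attention
lora_config = LoraConfig(
    r=32,                      # rank of the adapter matrices
    lora_alpha=32,             # scaling factor (alpha/r = 1.0)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total counts (well under 1% here)

# Fine-tune on your robot's demonstration data
# Typical: 50-200 demos, 2-8 hours on a single A100
# robot_demos is assumed to be your own DataLoader (batch size 8) yielding dicts
# with input_ids, attention_mask, pixel_values, and labels, already on the
# model's device. The 5% learning-rate warmup is omitted for brevity.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=2e-5,                   # low LR to preserve priors
)
model.train()
for epoch in range(20):
    for batch in robot_demos:
        loss = model(**batch).loss   # next-token loss on the action tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()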

The key hyperparameters: rank $r = 16$–$64$ (higher rank = more expressivity, more parameters), learning rate $2 \times 10^{-5}$ (a conservative choice that keeps the adapted model close to the pretrained one; LoRA recipes often tolerate higher rates because the base weights stay frozen), and 10–30 epochs on small datasets. The total trainable parameter count is typically 0.2–2% of the full model. The same pattern carries over to SmolVLA and other PyTorch VLA checkpoints, though the loading code and the module names passed to target_modules differ per model.
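To make the "frozen weights plus low-rank update" idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. The sizes are toy values chosen for illustration, not those of any real VLA, and the zero initialization of B follows the common LoRA convention rather than anything specific to this recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 64, 64, 8, 8          # toy sizes, for illustration only

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                      # trainable, zero init


def lora_forward(x, W, A, B, alpha, r):
    """Frozen path plus scaled low-rank update: W x + (alpha / r) * B (A x)."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.standard_normal((4, d_in))            # batch of 4 inputs
y_base = x @ W.T
y_lora = lora_forward(x, W, A, B, alpha, r)

# With B = 0 the adapter is a no-op, so fine-tuning starts from the pretrained function.
assert np.allclose(y_base, y_lora)

trainable = A.size + B.size                   # r * (d_in + d_out)
frozen = W.size                               # d_in * d_out
print(f"trainable={trainable}, frozen={frozen}")
```

The trainable count grows linearly in the rank, r × (d_in + d_out) per adapted matrix, which is why raising r buys expressivity at a modest parameter cost.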

Worked example: HPT stem architecture. A new lab has a Stretch RE-1 robot (one arm, 6-DoF, wrist camera + base camera, joint positions as proprioception). The pretrained HPT trunk was trained on 50+ embodiments but never saw a Stretch.
Step 1: Design the stem. The Stretch's observation space is: wrist image (224×224 RGB), base image (224×224 RGB), joint positions (6D), gripper width (1D). The stem must map all of these to the trunk's token format (512-dim tokens).
Image stem: frozen ViT-B/14 encodes each image → take CLS token (768D) → linear projection to 512D. Two images → 2 tokens.
Proprio stem: concatenate [joints(6); gripper(1)] = 7D → 2-layer MLP (7 → 256 → 512) → 1 token.
Total input to trunk: 3 tokens per timestep.
Step 2: Design the head. The Stretch's action space is: $\Delta$joint positions (6D) + gripper (1D) = 7D. Head: trunk output token (512D) → 2-layer MLP (512 → 256 → 7).
Step 3: Fine-tune. Freeze the trunk. Train only the stem + head on 50 Stretch demonstrations. Trainable parameters: ~1M (two 768 → 512 projections ≈ 0.79M, proprio MLP ≈ 0.13M, head ≈ 0.13M) vs. 300M in the trunk. Training time: ~1 hour on a single GPU. The trunk's pretrained knowledge of "reach, grasp, place" transfers; the stems/heads adapt it to the Stretch's specific geometry.
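The shapes in this worked example can be checked with a small NumPy sketch. It is an illustration only: the weights are random stand-ins for trained layers, the CLS features are placeholders for the frozen image encoder's output, and the frozen trunk is replaced by a simple mean over tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
D_TOKEN = 512


def linear(d_in, d_out):
    """Random weights standing in for a trained layer."""
    return rng.standard_normal((d_in, d_out)) * 0.02, np.zeros(d_out)


def mlp2(x, l1, l2):
    """Two-layer MLP with a ReLU in between."""
    (w1, b1), (w2, b2) = l1, l2
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


# Stems: one projection per camera, one MLP for proprioception.
wrist_proj = linear(768, D_TOKEN)
base_proj = linear(768, D_TOKEN)
proprio_stem = (linear(7, 256), linear(256, D_TOKEN))
# Head: trunk output token -> 7D action.
head = (linear(D_TOKEN, 256), linear(256, 7))

# Placeholder inputs (hypothetical values, not real robot data).
cls_wrist = rng.standard_normal(768)
cls_base = rng.standard_normal(768)
proprio = rng.standard_normal(7)              # 6 joints + gripper width

tokens = np.stack([
    cls_wrist @ wrist_proj[0] + wrist_proj[1],
    cls_base @ base_proj[0] + base_proj[1],
    mlp2(proprio, *proprio_stem),
])                                            # (3, 512): input to the trunk

trunk_out = tokens.mean(axis=0)               # stand-in for the frozen trunk
action = mlp2(trunk_out, *head)               # (7,): 6 joint deltas + gripper
print(tokens.shape, action.shape)
```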

Worked example: Mobile ALOHA co-training. You have 200 mobile ALOHA episodes (bimanual tasks on a wheeled base) and 3,000 static ALOHA episodes (bimanual tasks on a fixed table).
Training on mobile-only: 62% success.
Training on static-only: N/A for mobile tasks.
Co-training on both: 78% success.
The trick: combine both datasets into a single training set. For static episodes, the base velocity action dimensions are masked to zero. The ACT transformer processes both data sources identically: the bimanual arm tokens are the same; only the base tokens differ.
Why it helps: the static dataset provides 15× more manipulation examples, and more diverse ones (different objects, grasps, placements). The transformer trunk learns general bimanual coordination primitives from this larger dataset. The mobile base trajectories provide the navigational context. Both halves share the manipulation representations.
Result: co-training achieves 78% success on mobile tasks, a 16-point improvement from a dataset that contains zero mobile demonstrations of its own. The static ALOHA data acted as a force multiplier for the mobile policy.
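A minimal sketch of how the two datasets might be merged, assuming a 14-D arm action (two arms × 7 values) and a 2-D base velocity; both dimensions, the chunk length, and the uniform sampling scheme are assumptions made for illustration, and the arrays are random stand-ins for real episodes.

```python
import numpy as np

rng = np.random.default_rng(0)

ARM_DIM, BASE_DIM = 14, 2      # assumed: 2 arms x 7 values, plus linear/angular base velocity
H = 10                         # short toy chunk length

# Toy stand-ins for the two datasets (one action chunk per episode).
mobile = rng.standard_normal((200, H, ARM_DIM + BASE_DIM))
static = rng.standard_normal((3000, H, ARM_DIM))

# Pad static actions with zero base velocity so both sources share one action space.
static_padded = np.concatenate(
    [static, np.zeros((len(static), H, BASE_DIM))], axis=-1
)

combined = np.concatenate([mobile, static_padded], axis=0)
is_mobile = np.concatenate(
    [np.ones(len(mobile), dtype=bool), np.zeros(len(static), dtype=bool)]
)

# Uniform sampling over the combined set: batches are dominated by static data.
idx = rng.integers(0, len(combined), size=64)
batch, batch_is_mobile = combined[idx], is_mobile[idx]
print(batch.shape, int(batch_is_mobile.sum()), "mobile samples in this batch")
```

With uniform sampling only about 1 in 16 samples comes from the mobile set; reweighting the two sources is a design choice worth tuning.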

The foundation model vs specialist debate

When does scale help? The data says: scale helps when the task distribution is broad, when the robot will encounter novel objects, and when language conditioning is needed. A 7B VLA fine-tuned on 50 demos of "pick up the red block" will lose to a 3M-parameter Diffusion Policy trained on 50 demos of the same task. The VLA's advantage emerges when you ask it to pick up the blue block tomorrow, or the dinosaur next week, or to fold a shirt it has never seen.

The practical heuristic: if your deployment involves fewer than 5 distinct tasks with known objects, use a specialist. If it involves open-vocabulary instructions, novel objects, or a task set that will grow over time, use a VLA. In between, the crossover sits at roughly 10–20 tasks: below it, per-task specialists tend to win on accuracy; above it, the VLA wins on engineering cost.

The deployment decision tree

Given the zoo of available models, how do you choose? The decision is primarily about your task distribution, data budget, and compute budget:

Worked example: choosing the right VLA for your deployment.
Scenario A: Single task, 50 demos, one robot. Use a specialist policy (Diffusion Policy or ACT), not a VLA. The VLA's internet knowledge adds nothing when the task is fixed. Train time: 2 hours. Inference: <50ms.
Scenario B: 5 tasks, 100 demos each, one robot. Use Octo or SmolVLA with LoRA fine-tuning per task. The shared backbone amortizes the visual representation across tasks. Train time: 4 hours total. Inference: ~100ms.
Scenario C: Open-vocabulary instructions, 500 demos, one robot. Use OpenVLA-7B with LoRA. You need the language grounding from the VLM backbone. Train time: 8 hours on an A100. Inference: ~200ms (need two-system split for 10Hz).
Scenario D: 3 different robot platforms, 200 demos each, shared tasks. Use HPT or CrossFormer for cross-embodiment transfer. Train per-embodiment stems on each robot's data, share the trunk. Train time: 12 hours total (stems train fast, trunk is pretrained).
Scenario E: Humanoid, 35 DOF, bimanual, language instructions. Use $\pi_0$-class architecture: VLM backbone + flow matching action expert. The two-system split is mandatory at this DOF count and control frequency. Budget: A100 or H100 per robot.
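The five scenarios can be condensed into a rule of thumb. The function below is a hypothetical helper, not part of any library; its argument names and the order of the checks are one reasonable reading of the scenarios, offered as a sketch.

```python
def recommend_policy(n_tasks, n_embodiments=1, open_vocabulary=False, humanoid=False):
    """Rule-of-thumb model choice, checked from most to least demanding requirement."""
    if humanoid:
        return "pi0-class: VLM backbone + flow matching action expert"
    if n_embodiments > 1:
        return "HPT or CrossFormer: shared trunk, per-embodiment stems"
    if open_vocabulary:
        return "OpenVLA-7B with LoRA"
    if n_tasks > 1:
        return "Octo or SmolVLA, fine-tuned per task"
    return "specialist: Diffusion Policy or ACT"


scenarios = {
    "A": recommend_policy(n_tasks=1),
    "B": recommend_policy(n_tasks=5),
    "C": recommend_policy(n_tasks=1, open_vocabulary=True),
    "D": recommend_policy(n_tasks=4, n_embodiments=3),
    "E": recommend_policy(n_tasks=10, open_vocabulary=True, humanoid=True),
}
for name, choice in scenarios.items():
    print(name, "->", choice)
```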

The VLA zoo is converging on a common body plan: frozen VLM backbone, lightweight action expert, LoRA or adapter fine-tuning. The variation is in the action head (discrete vs diffusion vs flow), the bridge (tokens vs hidden states vs latent plan), and the training data. If you're starting a new project, pick the smallest model that covers your task distribution and fine-tune it. Don't train from scratch unless you have a novel embodiment and 100K+ demos.

12·E π₀ — the flow-matching VLA, in full

The most important VLA architecture of 2024–2025 deserves its own teardown. Here is every layer, every loss, and every training trick.

Section 12 introduced π₀ as one row in the VLA lineage table. Section 12·D placed it in the zoo. This section opens the hood. π₀ is not just another VLA — it is the architecture that proved three things simultaneously: (1) flow matching beats diffusion for action generation, (2) a dedicated action expert outperforms discrete tokenization, and (3) cross-embodiment training across seven robot platforms produces a single policy that generalizes to new tasks with minimal fine-tuning. Understanding π₀ in detail is understanding where the field is going.

Architecture: the two-headed transformer

The core insight behind π₀ is a separation of concerns inside a single transformer. A VLM backbone — based on PaliGemma, a 3B-parameter vision-language model — handles perception and language understanding. It processes camera images and language instructions, producing rich hidden-state representations that encode "what is in the scene" and "what the instruction means." But the VLM backbone never predicts actions directly. Instead, its hidden states are fed to a separate action expert: a dedicated set of transformer layers that speak continuous distributions instead of discrete tokens.

The action expert shares the backbone's attention layers — it participates in the same self-attention computation over image, language, and proprioception tokens — but has its own MLP weights. This is a mixture-of-experts (MoE) design: during the forward pass, each token is routed through either the VLM's MLPs or the action expert's MLPs, depending on whether it is a "perception" token or an "action" token. The attention layers are shared because cross-modal attention is what allows the action expert to "see" the scene and "hear" the instruction. The MLPs are separate because the computation needed to predict continuous action distributions is fundamentally different from the computation needed to answer VQA questions.

Why not just discretize? RT-2 and OpenVLA discretize each action dimension into 256 bins and predict bin indices autoregressively. This works, but it imposes a resolution ceiling: 256 bins over a 0.4m workspace gives ~1.6mm per bin. For insertion tasks, that is too coarse. It also makes cross-dimension correlations awkward to capture: each dimension is a separate token, so the fact that joints move together in coordinated patterns has to be recovered one token at a time. Flow matching avoids both problems by modeling the full continuous joint distribution in one shot.
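The resolution ceiling is easy to verify numerically. The sketch below assumes uniform bins over a 0.4 m range and decoding to bin centers; real tokenizers may choose bin edges differently (for example from data quantiles), so treat the numbers as illustrative.

```python
import numpy as np


def discretize(a, low, high, n_bins=256):
    """Map continuous values to bin indices in [0, n_bins - 1]."""
    a = np.clip(a, low, high)
    idx = np.floor((a - low) / (high - low) * n_bins).astype(int)
    return np.minimum(idx, n_bins - 1)


def undiscretize(idx, low, high, n_bins=256):
    """Map bin indices back to bin centers."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width


low, high = 0.0, 0.4                          # workspace extent in meters
bin_width_mm = (high - low) / 256 * 1000      # 1.5625 mm per bin

targets = np.linspace(low, high, 10_001)
recovered = undiscretize(discretize(targets, low, high), low, high)
max_error_mm = np.abs(recovered - targets).max() * 1000

print(f"bin width: {bin_width_mm:.4f} mm, worst-case round-trip error: {max_error_mm:.4f} mm")
```

The worst-case round-trip error is half a bin (about 0.78 mm here), which is already a sizeable share of the clearance in a tight insertion.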

The action expert: flow matching over action chunks

The action expert generates actions by learning a velocity field that transports noise to data. At training time, it learns to predict the straight-line velocity between a noise sample $x_0 \sim \mathcal{N}(0, I)$ and a ground-truth action chunk $x_1 = a_{1:H}$. At inference time, it starts from pure noise and integrates the learned velocity field forward to produce a clean action chunk.

π₀ action expert loss (conditional flow matching) $$ \mathcal{L}_{\text{action}} = \mathbb{E}_{t, x_0, x_1}\Big[\big\| v_\theta\big((1-t)x_0 + t\,x_1,\; t,\; h_{\text{vlm}}\big) - (x_1 - x_0) \big\|_2^2 \Big] $$

$t \sim U(0, 1)$ — flow time, sampled uniformly. At $t = 0$ the input is pure noise; at $t = 1$ it is pure data.
$x_0 \sim \mathcal{N}(0, I)$ — noise sample, same shape as the action chunk: $(H \times D)$ where $H$ is the chunk length and $D$ is the action dimension.
$x_1 = a_{1:H}$ — ground-truth action chunk from the demonstration. For a 7-DoF arm with $H = 50$, this is a $(50 \times 7)$ tensor.
$(1-t)x_0 + t\,x_1$ — the linear interpolant between noise and data. The input to the velocity network at flow time $t$.
$v_\theta(\cdot, t, h_{\text{vlm}})$ — the action expert's velocity prediction, conditioned on the VLM's hidden states $h_{\text{vlm}}$. The expert "sees" the scene and instruction through $h_{\text{vlm}}$.
$(x_1 - x_0)$ — the target velocity: the straight-line direction from noise to data. This is what the network must learn to predict.

In plain English: take a real action chunk from a demonstration, mix it with random noise at a random ratio, and ask the network to predict which direction leads from noise toward the real actions. Do this millions of times, and the network learns a vector field that can turn any noise sample into a plausible action chunk, conditioned on what the robot sees and hears.

PyTorch — π₀ action expert training step

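import torch
import torch.nn.functional as F

# action_chunk: (B, H, D) ground-truth actions
# h_vlm: (B, N, d_model) hidden states from VLM backbone
# action_expert: the velocity network v_theta(x_t, t, h_vlm)
B = action_chunk.shape[0]

t = torch.rand(B, 1, 1, device=action_chunk.device)  # (B, 1, 1) flow time
x0 = torch.randn_like(action_chunk)                  # (B, H, D) noise
x1 = action_chunk                                    # (B, H, D) data
xt = (1 - t) * x0 + t * x1                           # interpolant
# view(B) keeps the batch dimension even when B == 1 (squeeze() would drop it)
v_pred = action_expert(xt, t.view(B), h_vlm)         # (B, H, D)
loss = F.mse_loss(v_pred, x1 - x0)                   # flow matching loss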

Why flow matching, not diffusion?

Diffusion policies (Section 08) denoise iteratively through $K$ noise levels, typically $K = 100$ steps with DDPM or $K = 10$–$20$ with DDIM. Each step requires a full forward pass of the denoiser. Flow matching uses straight-line paths between noise and data, which means fewer integration steps are needed to produce high-quality samples. In practice, π₀ uses 10 Euler steps at inference: 10× fewer than a 100-step DDPM sampler and at the low end of DDIM's 10–20, while achieving equal or better sample quality. Fewer steps = lower latency = higher control frequency.

The second advantage is natural multimodality. When the demonstration data contains multiple valid strategies for the same observation (approach from the left or the right), the flow matching velocity field smoothly routes different noise samples toward different modes. Diffusion can do this too, but flow matching's linear interpolation makes mode separation more stable during training — the velocity targets $(x_1 - x_0)$ are always well-defined, even when $x_1$ is multimodal.

Training recipe: three stages

π₀'s training is a three-stage pipeline, each stage building on the previous:

Stage 1: VLM pre-training (inherited)

The PaliGemma backbone arrives pre-trained on internet-scale vision-language data. This is the most expensive stage — thousands of GPU-hours on billions of image-text pairs — and is done by the model provider, not the robotics lab. The backbone already understands "red cup," "wooden table," "pick up," and thousands of other visual-semantic concepts.
Provides the visual-semantic foundation that robot data alone cannot teach

Stage 2: Co-fine-tuning on web + robot data

The critical stage. The backbone is fine-tuned on a mixture of web VQA data and robot demonstration data. Every training batch contains both: ~70–80% robot samples supervised with the flow matching action loss, and ~20–30% web samples supervised with the standard VQA cross-entropy loss. The two losses are summed with a weighting ratio.

The web data prevents catastrophic forgetting. Without it, the backbone forgets what "red cup" looks like after a few thousand robot-only gradient steps. With it, the visual-semantic features stay alive while the action expert learns to use them for motor control. The ratio matters: too much web data slows convergence on robot tasks; too little and forgetting sets in. The sweet spot is empirically ~75% robot, ~25% web.
Teaches action prediction while preserving internet knowledge

Stage 3: Task-specific LoRA fine-tuning

Freeze the backbone and action expert weights. Attach LoRA adapters (rank 32, ~20M trainable parameters) to the attention layers. Fine-tune on 50–200 demonstrations of the target task on the target robot. This stage takes 1–2 hours on 8 GPUs and adapts the general-purpose policy to the specific geometry, objects, and task requirements of the deployment.
Specializes the generalist policy without destroying it

Co-fine-tuning: why mixing web and robot data helps

The intuition is worth deriving carefully. During web pre-training, the VLM backbone learns a representation where the token for "red" is close to the visual features of red objects. During robot training, the action expert learns that "pick up" means a specific motion trajectory. Co-fine-tuning lets both signals reinforce each other: the backbone maintains its semantic features (red = this visual pattern) while the action expert learns to use those features for motor control (red object at position X = reach to position X).

Without web data in the mix, the backbone's weights drift. After 10K robot-only gradient steps, the representation of "red" has been overwritten by features more useful for predicting the flow matching velocity field but less useful for distinguishing "red cup" from "blue cup." The model can still grasp objects, but it can no longer follow the instruction "pick up the red cup" because "red" no longer activates the correct visual features. This is catastrophic forgetting in action.

Without robot data, the backbone understands language and vision perfectly but has no idea how to translate that understanding into motor commands. It knows what a cup is but cannot close the gripper around one. The action expert starts from random weights and needs robot demonstrations to learn the mapping from VLM hidden states to velocity fields.

The mixing ratio acts as a regularizer strength. More web data = stronger regularization against forgetting = slower convergence on robot tasks. The practical schedule: start with a higher web ratio (40%) in the first 10K steps when the action expert is learning from scratch and the forgetting risk is highest, then decay to 20% for the remainder of training.
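The schedule described above can be sketched in a few lines. This version uses a hard step from 40% to 20% web data at 10K steps; the text is equally consistent with a gradual decay, and the function names, batch size, and loss weight are all hypothetical.

```python
import random


def web_ratio(step, warm_steps=10_000, start=0.40, end=0.20):
    """Fraction of each batch drawn from web VQA data (step schedule)."""
    return start if step < warm_steps else end


def batch_sources(step, batch_size, rng):
    """Label each slot in the batch as 'web' or 'robot'."""
    n_web = round(web_ratio(step) * batch_size)
    sources = ["web"] * n_web + ["robot"] * (batch_size - n_web)
    rng.shuffle(sources)
    return sources


def total_loss(action_loss, vqa_loss, vqa_weight=1.0):
    """Weighted sum of the two objectives; vqa_weight is a free hyperparameter."""
    return action_loss + vqa_weight * vqa_loss


rng = random.Random(0)
early = batch_sources(step=0, batch_size=40, rng=rng)
late = batch_sources(step=50_000, batch_size=40, rng=rng)
print(early.count("web"), "web samples early;", late.count("web"), "web samples late")
```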

The π₀ evolution: 0 → 0.5 → 0.7

π₀ (2024): the base architecture

PaliGemma backbone + flow matching action expert. Trained on 10M+ demonstration steps across 7 robot platforms (Franka, UR5, ALOHA, Sawyer, xArm, Google Robot, mobile platforms). 3.3B parameters total (3B PaliGemma backbone + 300M action expert). Proved that a single policy can control multiple robots with different morphologies, sensors, and action spaces.

π₀.5 (2025): task-specific LoRA

Added the third training stage: LoRA fine-tuning for specific tasks and environments. The key finding was that 50–200 demonstrations plus LoRA fine-tuning on a new task achieves higher success rates than the base π₀ policy with 10× more demonstrations of that task in the pre-training mix. The per-task adapters are tiny (20M parameters each) and can be hot-swapped at inference: load the "fold towel" adapter for towel folding, the "pour water" adapter for pouring. π₀.5 also demonstrated open-world generalization to new kitchens and bedrooms.

π₀.7 (2026): the RL Token

π₀.7 introduced two innovations. First, multi-scale memory (MEM): a hierarchical context window that stores recent observations at full resolution and older observations at compressed resolution, enabling tasks longer than 10 minutes without exceeding the transformer's context length. Second, the RL Token.

The RL Token is a special conditioning token prepended to the input sequence during online RL fine-tuning. When this token is absent, the policy behaves normally — it executes the most likely action for the given observation and instruction. When the RL Token is present, the flow matching sampling process adds an entropy bonus: instead of integrating the velocity field to its most likely endpoint, the sampler adds controlled noise at each Euler step, producing more diverse and exploratory action trajectories.

This is the mechanism that enables online RL polishing: the RL Token turns the deterministic-at-inference policy into a stochastic one, providing the exploration needed for policy gradient methods to discover better-than-demonstration behaviors. Once RL fine-tuning converges, the RL Token is removed, and the policy returns to its high-precision, low-variance mode. The beauty is that the same model weights serve both purposes — no separate exploration policy is needed.

Scale numbers

Metric	Value	Context
Total parameters	3.3B	3B VLM backbone + 300M action expert
Training data	10M+ demo steps	Across 7 robot platforms, 100+ tasks
Robot platforms	7	Franka, UR5, ALOHA, Sawyer, xArm, Google Robot, mobile
Action chunk size	50 steps	At 50Hz = 1 second of motion per chunk
Inference latency	~100ms	10 Euler steps on A100; backbone amortized across chunks
LoRA fine-tuning	1–2 hours	8 GPUs, 50–200 demonstrations
LoRA adapter size	~20M params	0.6% of total; hot-swappable per task

Worked example: one inference step, traced end to end

Worked example: π₀ inference for "pick up the red cup." The robot is a Franka Panda with a wrist camera and a scene camera.
Step 1: Image encoding. Scene camera (224×224×3) → PaliGemma ViT encoder → 256 image tokens $\in \mathbb{R}^{2048}$. Wrist camera → same encoder → 256 tokens. Total: 512 image tokens. Latency: ~12ms.
Step 2: Language encoding. "Pick up the red cup" → PaliGemma tokenizer → 8 text tokens $\in \mathbb{R}^{2048}$. Latency: negligible (embedding lookup).
Step 3: Proprioception encoding. Joint positions (7D) + gripper width (1D) + EE pose (6D) = 14D → MLP → 1 token $\in \mathbb{R}^{2048}$. Latency: negligible.
Step 4: VLM backbone forward pass. Concatenated sequence: [8 text + 512 image + 1 proprio] = 521 tokens. Processed through 18 transformer layers (the depth of PaliGemma's Gemma-2B language model). Each layer: shared attention across all tokens, then VLM-specific MLPs for the text/image/proprio tokens. Output: $h_{\text{vlm}} \in \mathbb{R}^{521 \times 2048}$. Latency: ~40ms on A100.
Step 5: Action expert, flow matching generation. Initialize $x_0 \sim \mathcal{N}(0, I)$ with shape $(50, 7)$ (50-step chunk, 7-DoF action). Flatten to 350D, project to action tokens, append to the sequence. Run 10 Euler steps:
For $k = 0, 1, \ldots, 9$:
$t_k = k / 10$
Feed $[h_{\text{vlm}};\; \text{action\_tokens}(x_{t_k})]$ through shared attention + action expert MLPs
$v_k = \text{expert\_output}$ (predicted velocity, shape $(50, 7)$)
$x_{t_{k+1}} = x_{t_k} + (1/10) \cdot v_k$ (Euler integration step)
Final: $x_1 = \hat{a}_{1:50}$, the predicted 50-step action chunk. Latency: 10 × ~5ms = ~50ms.
Step 6: Execute. Send the first 10 actions to the Franka's operational-space controller at 50Hz. After 200ms (10 steps), re-observe and replan, or continue executing the remaining 40 steps if confidence is high.
Total latency: 12ms (vision) + 40ms (backbone) + 50ms (flow matching) = ~102ms. Well within a 5Hz replan budget.
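The Euler loop in Step 5 can be exercised on a toy problem. In the sketch below the learned network is replaced by a hand-written velocity field that points straight at a single known target chunk, an assumption that makes the result checkable; a real action expert would also take the VLM hidden states as input.

```python
import numpy as np


def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
H, D = 50, 7
target = rng.uniform(-1.0, 1.0, size=(H, D))   # pretend "ground-truth" chunk


def toy_velocity(x, t):
    """Exact velocity for a single-target flow: points from x straight at the target."""
    return (target - x) / (1.0 - t)


x0 = rng.standard_normal((H, D))               # start from pure noise
chunk = euler_sample(toy_velocity, x0)

assert np.allclose(chunk, target)              # straight paths: Euler lands on the target
first_actions = chunk[:10]                     # execute these, then re-observe and replan
print(chunk.shape, first_actions.shape)
```

Because the path from noise to data is a straight line in this toy case, ten Euler steps recover the target exactly; curved learned fields are where step count starts to matter.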

The MoE architecture, precisely

The mixture-of-experts design is worth examining at the layer level. In a standard transformer, each layer has two sub-modules: multi-head self-attention and a feed-forward MLP. In π₀, the attention sub-module is shared across all token types: image tokens, text tokens, proprio tokens, and action tokens all attend to each other in the same attention computation. But the MLP sub-module is split: image/text/proprio tokens are routed through the VLM's original MLP weights (initialized from PaliGemma pre-training), while action tokens are routed through the action expert's MLP weights (trained from scratch on robot data).

This design has three consequences. First, the action expert can "read" the scene and instruction through attention without any information bottleneck: it has full access to every image patch and every word. Second, action tokens never pass through the VLM's MLPs, so action-specific computation stays out of the weights that carry internet knowledge; together with the web data mixed in during Stage 2, this is what limits forgetting. Third, the action expert's MLPs are free to learn action-specific computations (velocity field prediction, temporal correlations across the chunk) without being constrained by the VLM's pre-trained MLP structure.

The π₀ architecture is not "a VLM with an action head bolted on." It is a shared attention backbone with two specialist MLP tracks. The attention layers are the common language; the MLPs are the specialized dialects. This is why it works better than either a pure VLM (which wastes MLP capacity on language tasks irrelevant to robotics) or a pure action model (which lacks the visual-semantic understanding that internet pre-training provides).
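A single layer of this "shared attention, split MLPs" design can be sketched in NumPy. This is a deliberately stripped-down illustration: one attention head, no layer norm, no attention mask, toy width, and random weights. None of these details are taken from the π₀ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                                            # toy model width


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def make_mlp():
    return rng.standard_normal((d, 4 * d)) * 0.1, rng.standard_normal((4 * d, d)) * 0.1


def mlp(x, params):
    w1, w2 = params
    return np.maximum(x @ w1, 0.0) @ w2


def layer(x, is_action, attn_w, vlm_mlp, action_mlp):
    """One block: shared attention over all tokens, then a per-token-type MLP."""
    wq, wk, wv = attn_w
    scores = (x @ wq) @ (x @ wk).T / np.sqrt(d)
    h = x + softmax(scores) @ (x @ wv)            # every token attends to every token
    out = h.copy()
    out[~is_action] += mlp(h[~is_action], vlm_mlp)      # perception tokens
    out[is_action] += mlp(h[is_action], action_mlp)     # action tokens
    return out


attn_w = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
vlm_mlp, action_mlp = make_mlp(), make_mlp()

x = rng.standard_normal((12, d))                  # 8 perception tokens + 4 action tokens
is_action = np.array([False] * 8 + [True] * 4)

y = layer(x, is_action, attn_w, vlm_mlp, action_mlp)
y_swapped = layer(x, is_action, attn_w, vlm_mlp, make_mlp())  # different action-expert MLP
print(y.shape)
```

Swapping the action expert's MLP changes only the action-token outputs of the layer, which is the routing property the text describes.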

12·F SmolVLA — the small-VLA revolution

The counter-movement to scaling: efficient VLAs that run on consumer hardware and still match the giants on most benchmarks.

The VLA story through 2024 was dominated by multi-billion-parameter models: RT-2 at 55B, OpenVLA at 7B, π₀ at 3.3B. Each assumed that a large backbone was the price of generalization. Then 2025 arrived, and a 450M-parameter model matched the 7B one on the benchmarks that mattered. The bottleneck was never model size. It was data quality and action head design.

SmolVLA: the architecture

Shukor et al., 2025 (HuggingFace). SmolVLA achieves comparable performance to OpenVLA-7B with only 450M parameters. Four architectural choices make this possible:

Efficient VLM backbone: SmolVLM. Instead of a 7B Llama, SmolVLA uses SmolVLM, a compact vision-language model with better token efficiency. SmolVLM was designed for edge deployment from the start, with aggressive knowledge distillation from larger VLMs. The 450M-parameter count includes both the VLM backbone and the action expert.
Flow matching action expert. Borrowed directly from π₀. Instead of discretizing actions into 256 bins (which wastes capacity on the tokenization overhead), SmolVLA routes actions through a continuous flow matching expert. This is more expressive per parameter — the network's capacity goes toward modeling the action distribution, not toward learning a binning scheme.
Aggressive image token pooling. Standard VLAs produce 256 image tokens per camera (from a 224/14 ViT). SmolVLA pools these down to 64 tokens via spatial average pooling before feeding them to the backbone. Fewer tokens = quadratically less attention computation. The information loss is minimal for manipulation tasks, where fine-grained spatial detail matters less than object identity and relative position.
LoRA for task adaptation. Only ~2% of parameters are task-specific. The base model is frozen; per-task LoRA adapters (rank 16, ~9M params) handle specialization. This means the "model" is actually a 450M frozen core plus a library of 9M-parameter adapters — one per task.
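The token-pooling step and its effect on attention cost can be sketched as follows. The 2×2 average pooling over a 16×16 patch grid matches the 256 → 64 reduction described above; the feature width and the random inputs are placeholders, and the real model's pooling operator may differ in detail.

```python
import numpy as np


def pool_tokens(tokens, grid=16, factor=2):
    """Average-pool a (grid*grid, d) token sequence over factor x factor patches."""
    d = tokens.shape[-1]
    g = grid // factor
    x = tokens.reshape(g, factor, g, factor, d)   # split rows and columns into blocks
    return x.mean(axis=(1, 3)).reshape(g * g, d)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 64))           # 16 x 16 patches, toy feature width
pooled = pool_tokens(tokens)                      # (64, 64)

# Self-attention computes one score per token pair, so cost scales with n^2.
attention_cost_ratio = (256 ** 2) / (64 ** 2)
print(pooled.shape, attention_cost_ratio)
```

Cutting the token count by 4× reduces the pairwise attention scores per camera by 16×, which is where most of the latency saving comes from.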

Why smaller works for robotics

A 7B language model can write poetry, solve calculus, and debate philosophy. A robot manipulator needs to understand "red cup," "pick up," and "to the left of." These are a tiny fraction of the capabilities a 7B model encodes. Most of a large LLM's capacity is dedicated to linguistic and reasoning abilities that are never exercised during manipulation.

SmolVLA's 450M parameters are sufficient because tabletop manipulation requires:

Object recognition at the category level (cup, block, bowl) — not the fine-grained distinctions (Labrador vs Golden Retriever) that large vision models excel at.
Spatial reasoning over a small workspace (~1m³) — not the global reasoning (which country is this?) that large models encode.
Instruction following for simple verb-noun commands ("pick up," "place on") — not the compositional language understanding (nested clauses, sarcasm, metaphor) that large models handle.
Action prediction via the flow matching expert — which is a 100–200M parameter network regardless of backbone size.

The 450M model covers all four requirements. The remaining 6.5B parameters in a 7B VLA are paying for capabilities the robot never uses.

Deployment economics

This is where the small-VLA thesis becomes a business argument, not just a research one:

Metric	OpenVLA-7B	π₀ 3.3B	SmolVLA 450M
GPU required	A100 (80GB)	A100 (40GB)	RTX 3090 (24GB)
GPU cost	~$10,000	~$10,000	~$1,000
Inference latency	~200ms	~100ms	~60ms
Cloud inference?	Required for most setups	Required	Optional — runs on-robot
INT8 quantized?	Still needs A10G+	Fits on RTX 4090	Fits on Jetson Orin
Network dependency	Yes (cloud GPU)	Yes	No — on-robot is feasible
Cost per 100 robots	~$1M in GPUs	~$1M	~$100K

The implications cascade. On-robot inference means no network round-trip latency (saving 10–50ms per step depending on the cloud setup). It means the robot works when WiFi drops. It means deployment in warehouses, factories, and homes where cloud connectivity is unreliable or forbidden by policy. SmolVLA does not just save money — it unlocks deployment scenarios that 7B models physically cannot reach.

Training recipe

SmolVLA's training follows the same three-stage pipeline as π₀, scaled down:

Pre-train SmolVLM on web data. Vision-language pre-training on image-text pairs. This is done by the SmolVLM team, not the robotics lab. The result is a compact VLM with strong visual-semantic features.
Co-fine-tune with flow matching action expert on Open X-Embodiment + DROID. Mixed batches of web VQA and robot demonstration data. The flow matching action expert is trained from scratch; the SmolVLM backbone is updated with a low learning rate. Duration: ~48 hours on 8 GPUs.
LoRA fine-tune for target task. Freeze everything, attach rank-16 LoRA adapters, train on 50–200 task-specific demonstrations. Duration: 30–60 minutes on a single RTX 3090.

Performance comparison

Benchmark	Metric	OpenVLA-7B	π₀ 3.3B	SmolVLA 450M
LIBERO-Long	Success %	53.3	68.4	62.1
LIBERO-Spatial	Success %	78.9	85.2	81.0
SimplerEnv (visual matching)	Success %	26.1	41.7	38.8
Bridge real-world	Success %	72.0	81.0	74.5
Params	—	7,000M	3,300M	450M
Inference GPU	—	A100	A100	RTX 3090

SmolVLA at 450M matches or exceeds OpenVLA-7B on every benchmark while running on a GPU that costs 1/10th as much. It trails π₀ by roughly 3–7 points (about 5 on average), a gap that comes primarily from π₀'s larger backbone and more diverse pre-training data, not from the action head design (both use flow matching).

The small-VLA zoo

SmolVLA is not alone. A wave of efficient VLA models appeared in 2024–2025, each making a different bet on how to shrink the model without losing capability:

Model	Params	Backbone	Action head	Key insight
TinyVLA	1B	Distilled VLM	Discrete tokens	Most of 7B is wasted on non-robot capabilities
SmolVLA	450M	SmolVLM	Flow matching expert	Flow matching is more parameter-efficient than discrete tokens
RDT-1B	1.2B	DiT	Diffusion	DiT architecture works for actions, not just images
Octo	27M / 93M	Custom transformer	Diffusion / discrete	Smallest viable generalist; best fine-tuning UX

The trend is clear: the next generation of VLAs will be measured not by how large they are but by how much performance they deliver per parameter and per dollar of inference hardware.

When to go small vs large

The decision tree is simpler than it looks:

Fewer than 10 tasks on one robot, known objects: SmolVLA or TinyVLA with LoRA. The 450M model handles this with room to spare. Runs on consumer hardware, fine-tunes in under an hour.
10–50 tasks, single or dual embodiment, some novel objects: SmolVLA or π₀ depending on compute budget. If you have an A100, π₀ will give you a few extra points. If you're deploying to edge hardware, SmolVLA.
Cross-embodiment generalization across 50+ tasks and 5+ robot platforms: π₀ or GR00T N1. The larger backbone's capacity is justified by the diversity of the task distribution. You need the extra parameters to encode the motor knowledge for multiple embodiments.
Real-time on cheap hardware (Jetson, consumer GPU, no cloud): SmolVLA with INT8 quantization. No alternative exists at this price point.

The 2025 plot twist: a 450M-parameter model matches a 7B model on most benchmarks. For these tasks, the bottleneck was data quality and action head design more than model size. In the comparisons above, flow matching over continuous actions is more parameter-efficient than discrete tokenization. If you are starting a new VLA project and do not have a cluster of A100s, start with SmolVLA. You can always scale up later if the task distribution demands it.

13 Vision encoders

The eyes of the robot. Where most policies still leave performance on the table.

The vision encoder converts pixels into tokens or feature vectors that the policy consumes. The choice of encoder is a major lever — both for sample efficiency (a good prior cuts demonstrations needed by 3–10×) and for generalization (the encoder is what determines whether "red mug" and "blue mug" share a representation).

The encoders worth knowing

Encoder	Training	Why it's used
ResNet-18	ImageNet supervised	Cheap, fast, enough for single-task BC. The ACT default.
CLIP (ViT-B/16)	Image–text contrastive on 400M pairs	Language-aligned features. Standard for VLAs and UMI.
DINOv2 (ViT-L/14)	Self-supervised distillation, 142M images	Best raw visual features. Used in OpenVLA alongside SigLIP.
SigLIP	Sigmoid contrastive image–text	Stronger language alignment than CLIP at scale.
R3M	Time-contrastive + language alignment on Ego4D	Manipulation-aligned. Strong with little data.
VC-1	MAE on Ego4D + ImageNet	Robust low-shot performance.

Three eras

ImageNet-pretrained ResNet (until ~2022). Standard ResNet-18 or ResNet-50, frozen or fine-tuned. Cheap, good enough, the backbone of ACT and most pre-VLA work.
Self-supervised on robot or egocentric video (2022–2023). R3M, VC-1, MVP. Trained on Ego4D and similar; the priors are closer to manipulation distributions than ImageNet's.
Frontier vision foundation models (2023–present). DINOv2, SigLIP, CLIP. Either used directly or distilled.

DINOv2 vs CLIP: why self-supervised beats language-aligned for manipulation

DINOv2 (Oquab et al., 2023) is a vision transformer trained purely on images via self-supervised distillation — no text, no language, no captions. It learns to produce features where visually similar regions have similar embeddings. The result: DINOv2 features are spatially discriminative — they distinguish "the left edge of the cup handle" from "the right edge of the cup handle" at the patch level.

CLIP (Radford et al., 2021) is trained via image-text contrastive learning. It learns features that align images with their captions. This makes CLIP excellent at semantic understanding ("this is a mug," "this is a sponge") but mediocre at spatial discrimination. CLIP's features are optimized to match an entire image to a sentence, not to distinguish sub-centimeter spatial differences within an image.

For manipulation, spatial discrimination is paramount. The policy needs to know where exactly on the object to place the fingers, not just what the object is. This is why frozen DINOv2 often outperforms fine-tuned CLIP for manipulation: DINOv2's self-supervised objective produces spatially richer features that the downstream policy can exploit for precise positioning.

When CLIP wins anyway. If the task requires language grounding — "pick up the red cup, not the blue one" — CLIP's language alignment becomes essential. The optimal choice for VLAs is often to use both: DINOv2 for spatial features and SigLIP/CLIP for language alignment, concatenated into a dual-encoder. This is exactly what OpenVLA does (DINOv2 + SigLIP).

Encoder comparison: the full picture

Encoder	Pretraining data	Output type	Params	Best for	Typical use
ResNet-18	ImageNet (1.3M images)	Global (avg pool)	11M	Single-task BC, speed-critical	ACT, legacy BC
CLIP ViT-B/16	400M image-text pairs	Global (CLS) + spatial (patch)	86M	Language-conditioned tasks	UMI, RT-2 family
SigLIP ViT-L	3B image-text pairs	Global + spatial	304M	Stronger language alignment at scale	OpenVLA, PaliGemma
DINOv2 ViT-L/14	142M images (self-supervised)	Spatial (per-patch)	304M	Spatial discrimination, manipulation	OpenVLA (paired), 3D policies
R3M	Ego4D + language	Global	~50M	Low-data manipulation	Small-lab BC
VC-1	Ego4D + ImageNet (MAE)	Spatial	~300M	Robust low-shot	Academic benchmarks

Freeze vs fine-tune: the decision boundary

The question of whether to freeze or fine-tune the vision encoder is primarily a function of dataset size, and the transition is sharper than most practitioners realize:

<100 demonstrations. Freeze everything. A frozen foundation model encoder with a linear probe on top. Fine-tuning any part of the encoder will overfit catastrophically — you have orders of magnitude fewer samples than the encoder has parameters.
100–1,000 demonstrations. Freeze the encoder, train a small adapter (2–3 transformer layers, ~8–16M parameters) on top. This is the sweet spot for most manipulation research. The adapter learns task-specific feature combinations without destroying the pretrained representations.
1,000–10,000 demonstrations. You can begin fine-tuning the last 2–4 layers of the encoder with a low learning rate (10× lower than the adapter). The earlier layers stay frozen — they contain low-level features (edges, textures) that are universal.
>10,000 demonstrations. Full fine-tuning is viable and often beneficial. At this scale, even ViT-L encoders improve from task-specific adaptation. But monitor for overfitting: track validation loss per epoch and stop early.

Frozen or fine-tuned?

The dominant practice in 2026 is frozen encoder + small adapter for foundation models, and full fine-tune for ResNet-scale encoders. The reasons:

Fine-tuning a 300M+ parameter ViT on a few thousand robot demonstrations destroys the pretraining priors. The robot data is too narrow to support the fine-tune.
A frozen encoder + a learnable linear probe or small transformer adapter preserves the priors and trains in hours.
For ResNet-18-scale encoders, the prior is weak enough that fine-tuning helps — and the data is abundant enough to support it.

Worked example: encoder choice decision tree.
Q1: How many demos do you have?
<100: Use a frozen foundation model (CLIP or DINOv2). No fine-tuning. Linear probe or small MLP on top.
100–1,000: Frozen ViT + small transformer adapter (8–16M learnable params). This is the sweet spot for most manipulation.
1,000–10,000: You can fine-tune ResNet-18 end-to-end, or use a frozen ViT with a larger adapter.
>10,000: Fine-tune everything. At this scale, even ViTs benefit.
Q2: Do you need language conditioning?
Yes: Use CLIP or SigLIP (language-aligned). DINOv2 has no language alignment.
No: DINOv2 gives the best raw visual features. Pair with CLIP only if you need text later.
Q3: How fast does inference need to be?
<10ms: ResNet-18 (single forward: ~2ms).
10–50ms: ViT-B/16 (~8ms frozen, batch-1 GPU).
>50ms: ViT-L/14 (~20ms). Only viable with the slow-fast split.
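The dataset-size thresholds from the freeze-vs-fine-tune list can be written as a small lookup. This is a minimal sketch: the function names are hypothetical, and assigning the boundary values (exactly 1,000 or 10,000 demos) to the larger regime is an assumption, since the text does not say which side they fall on.

```python
def encoder_training_regime(num_demos: int) -> str:
    """Map dataset size to a training regime, following the thresholds in the text.

    Boundary handling (e.g. exactly 1,000 demos) is an assumption.
    """
    if num_demos < 100:
        return "frozen encoder + linear probe"
    if num_demos < 1_000:
        return "frozen encoder + small transformer adapter"
    if num_demos < 10_000:
        return "fine-tune last 2-4 encoder layers + adapter"
    return "full fine-tune with early stopping"


def encoder_learning_rate(adapter_lr: float) -> float:
    """Partially unfrozen encoder layers use a 10x lower learning rate than the adapter."""
    return adapter_lr / 10.0


for n in (50, 200, 5_000, 20_000):
    print(n, "->", encoder_training_regime(n))
```

Treat the output as a starting point. The thresholds are rules of thumb, and validation loss remains the deciding signal.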

What makes DINOv2 special for manipulation

DINOv2 deserves special attention because its design choices align unusually well with manipulation requirements. Three properties matter:

Patch-level spatial features. Unlike CLIP, which is optimized for image-level classification (matching an image to a caption), DINOv2's self-supervised objective (self-distillation with no labels) forces every patch token to be informative about its local region. The result: DINOv2's 256 patch tokens form a spatial map where nearby patches in the image have nearby representations in feature space. For a manipulation policy, this means the encoder preserves the fine-grained spatial structure needed to distinguish "top of the cup" from "side of the cup" at the feature level.

Robustness to viewpoint changes. DINOv2's training includes aggressive multi-crop augmentation: the student network sees small crops (covering 5–20% of the image) and must match the teacher's representation of the full image. This forces the features to be robust to scale and viewpoint changes — a property that transfers directly to manipulation, where the wrist camera's view of the object changes dramatically as the gripper approaches.

No language bias. CLIP features are biased toward the kinds of visual distinctions that language describes well ("red" vs "blue", "cat" vs "dog") and de-emphasize distinctions that language ignores (spatial layout, fine texture, sub-object structure). DINOv2 has no such bias — it treats all visual information equally. For manipulation, the spatially fine-grained information that DINOv2 preserves (edge geometry, surface normals implied by shading, grasp affordance cues) is exactly what CLIP discards.

The practical recommendation. For language-conditioned policies (VLAs), use DINOv2 + SigLIP (or CLIP) as a dual encoder — DINOv2 for spatial features, SigLIP for language alignment. For single-task BC without language, use DINOv2 alone. For speed-critical deployments where inference latency matters more than feature quality, use ResNet-18 fine-tuned end-to-end. Never use CLIP alone for manipulation unless language grounding is the primary requirement.

The adapter architecture

When using a frozen encoder, the adapter that sits between the encoder and the policy is the only learnable visual component. Two designs dominate:

Linear probe. A single linear layer from encoder dimension to policy input dimension. The simplest adapter: $z_{\text{policy}} = W z_{\text{encoder}} + b$ where $W \in \mathbb{R}^{d_{\text{policy}} \times d_{\text{encoder}}}$. Trainable parameters: $d_{\text{policy}} \times d_{\text{encoder}} \approx 200K$. Works surprisingly well with <100 demos. The linear probe is a good diagnostic: if it performs poorly, the encoder's features are not suited for the task.
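To make the parameter count concrete, here is a minimal sketch of a linear probe in plain Python. The encoder width of 768 is the ViT-B hidden size; the policy width of 256 is an assumed value chosen for illustration, and the helper names are hypothetical.

```python
def linear_probe_params(d_encoder: int, d_policy: int, bias: bool = True) -> int:
    """Trainable parameters of z_policy = W z_encoder + b."""
    weights = d_policy * d_encoder
    return weights + (d_policy if bias else 0)


def apply_probe(W, b, z_encoder):
    """Apply the probe: one dot product per output dimension, plus bias."""
    return [sum(w * z for w, z in zip(row, z_encoder)) + b_i for row, b_i in zip(W, b)]


# ViT-B features (768-d) projected to an assumed 256-d policy input.
n_params = linear_probe_params(d_encoder=768, d_policy=256)
print(n_params)  # 196,608 weights + 256 biases

# Tiny 2x2 example of the forward pass.
print(apply_probe([[1.0, 0.0], [0.5, 0.5]], [0.0, 1.0], [2.0, 4.0]))
```

With these widths the weight matrix alone has 196,608 entries, which is where the "about 200K" figure comes from.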

Small transformer adapter. 2–4 transformer layers that process the encoder's spatial tokens and output a fixed number of "policy tokens" via cross-attention. Trainable parameters: 2–16M. This adapter can learn non-linear feature combinations and spatial aggregation patterns that a linear probe cannot. The cross-attention allows the adapter to dynamically focus on task-relevant regions of the image — attending to the gripper and object during contact, the broader scene during navigation.

Multi-camera fusion

Two strategies. Late fusion: encode each camera independently, concatenate or attention-fuse the resulting tokens before the policy. This is the standard. Early fusion: stitch images side-by-side or stack channels. Cheap but throws away camera identity.

Cross-attention works better than concatenation when one camera dominates (e.g., the wrist cam during contact). The policy can route attention to the camera that matters at each timestep.

Depth as an input channel

An alternative to full 3D point clouds is to add a depth channel to the 2D image: feed the encoder a 4-channel RGBD image instead of 3-channel RGB. This preserves the 2D pipeline (ViT, CNN) while giving the policy access to depth information. Two approaches:

Naive concatenation. Stack the depth channel alongside RGB to get a 4-channel input. The encoder's first layer (the first convolution in a CNN, the patch embedding in a ViT) must be modified to accept 4 channels, for example by copying the red channel's weights or averaging the RGB weights to initialize the depth channel. The rest of the network is unchanged. This is the simplest approach and works well with from-scratch training, but frozen pretrained ViTs cannot easily accept a 4th channel without re-training.

Separate depth encoder. Process RGB and depth with separate encoders, then fuse the features (concatenation or cross-attention). This preserves the pretrained RGB encoder while adding depth information. The depth encoder can be small (ResNet-18) because depth is lower-dimensional than RGB. The fusion point matters: early fusion (before the policy) gives the policy more information; late fusion (inside the policy) is more flexible. Note that "early" and "late" here describe where RGB and depth features are merged relative to the policy, not the multi-camera fusion strategies described earlier.

The empirical finding: depth helps most for tasks where occlusion is the bottleneck (objects behind other objects, cluttered scenes) and least for tasks where appearance is the bottleneck (color-based sorting, texture-based grasping). For standard tabletop manipulation with an overhead camera, depth provides a 3–8% success rate improvement over RGB alone.

The wrist camera question

One of the most impactful architectural decisions in robot vision is whether to include a wrist-mounted camera. The wrist camera sees the object from the gripper's perspective — close up, during contact, at the moment when precision matters most. A scene camera 60cm above the workspace sees the broad layout but loses fine detail at the contact point.

The empirical evidence is clear: wrist cameras improve success rates by 10–25% on contact-rich tasks (insertion, precision pick, tool use) and have little effect on large-motion tasks (reaching, navigation). The reason is resolution: at a distance of 5cm, a 224×224 wrist camera image covers a ~4cm×4cm area at ~0.2mm/pixel resolution. The same object viewed by a scene camera 60cm away covers ~12×12 pixels — far too coarse for sub-millimeter positioning.
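The resolution argument is simple arithmetic. The sketch below assumes a pinhole camera whose footprint grows linearly with distance, and assumes the scene camera has the same field of view and 224×224 output as the wrist camera. Both are assumptions; a wider scene lens would put even fewer pixels on the object. Function names are illustrative.

```python
def mm_per_pixel(footprint_cm: float, image_px: int) -> float:
    """Physical size of one pixel when footprint_cm is imaged across image_px pixels."""
    return footprint_cm * 10.0 / image_px


def footprint_cm_at(distance_cm: float, ref_distance_cm: float, ref_footprint_cm: float) -> float:
    """Fixed field of view: the imaged footprint scales linearly with distance."""
    return ref_footprint_cm * distance_cm / ref_distance_cm


IMAGE_PX = 224

# Wrist camera: a ~4 cm footprint at 5 cm.
wrist_res = mm_per_pixel(4.0, IMAGE_PX)

# Scene camera at 60 cm, same field of view (assumption).
scene_footprint = footprint_cm_at(60.0, ref_distance_cm=5.0, ref_footprint_cm=4.0)
scene_res = mm_per_pixel(scene_footprint, IMAGE_PX)
object_px = 4.0 / scene_footprint * IMAGE_PX  # pixels spanned by the same 4 cm region

print(f"wrist: {wrist_res:.2f} mm/px")
print(f"scene: {scene_res:.2f} mm/px, 4 cm region spans {object_px:.1f} px")
```

Under these assumptions the wrist camera resolves about 0.18 mm per pixel, while the scene camera resolves about 2.1 mm per pixel and sees the same region in under 20 pixels per side.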

The cost is engineering: the wrist camera adds a cable to the robot's arm (risk of snagging), requires calibrating the camera-to-gripper transform, and doubles the vision encoder's compute budget. For mobile robots, cable routing is particularly painful. The practical compromise: use a wrist camera for manipulation, omit it for navigation-only tasks.

Augmentation

Three augmentations earn their seat:

Random shifts ($\pm$4 pixels): simulates camera calibration error. Reduces the sim-to-real gap.
Color jitter: mild brightness, contrast, saturation. Critical for any policy that will see different lighting at deploy time.
Random crops at test time (DrQ-v2's trick): sample multiple crops at inference and average the resulting features (or the Q-values, in an RL setting).

Augmentations that don't earn their seat: heavy cutout, MixUp, anything that changes the geometry between the wrist camera and the gripper. The policy is not invariant to that geometry; it depends on it.

Image preprocessing for robot vision

The standard image preprocessing pipeline for robot policies has subtle but important differences from the standard vision pipeline:

Resolution. 224×224 is the default (matches ViT-B pretraining). 336×336 for tasks requiring fine detail (insertion, threading). Never go below 128×128 — the policy loses critical spatial information.
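Normalization. Match the encoder's pretraining statistics, which are not always ImageNet's. For CLIP: the CLIP-specific mean/std shipped with the model's preprocessor (close to ImageNet's but not identical). For SigLIP: mean 0.5 and std 0.5 per channel in the standard checkpoints. For DINOv2: ImageNet mean/std. For a from-scratch ResNet: compute mean/std from your robot data. When in doubt, read the values from the checkpoint's preprocessor config rather than hard-coding them.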
Crop vs resize. Center-crop to square before resizing. Do not stretch: aspect ratio distortion confuses spatial reasoning. If the camera has a 4:3 aspect ratio (e.g., 640×480), crop the left/right edges to make it 1:1.
Color space. Always RGB, never BGR. OpenCV defaults to BGR; failing to convert is a silent bug that degrades performance by 5–10% (the encoder's features become misaligned).
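The crop, color-order, and normalization steps can be sketched without any imaging library. The helper names below are hypothetical; the mean/std constants are the standard ImageNet statistics, used here only as an example of encoder-specific values.

```python
def center_crop_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def bgr_to_rgb(pixel):
    """Reorder one OpenCV-style (B, G, R) pixel into (R, G, B)."""
    b, g, r = pixel
    return (r, g, b)


def normalize(pixel_rgb, mean, std):
    """Scale 0-255 values to [0, 1], then standardize per channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(pixel_rgb, mean, std))


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# A 640x480 (4:3) frame loses 80 columns on each side and no rows.
print(center_crop_box(640, 480))
print(normalize(bgr_to_rgb((255, 255, 255)), IMAGENET_MEAN, IMAGENET_STD))
```

Resizing the cropped square to the encoder's input resolution is left to whichever image library the pipeline already uses.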

Worked example: vision encoder ablation. Task: pick up a randomly placed object on a table. 200 demonstrations. Franka robot with one scene camera and one wrist camera.
ResNet-18, fine-tuned end-to-end: 72% success. Fast inference (4ms). Overfits slightly at 200 demos but still serviceable. The encoder's limited capacity forces the policy to learn simple spatial features.
CLIP ViT-B/16, frozen + linear probe: 68% success. Surprisingly, worse than ResNet-18. CLIP's language alignment pulls features toward semantic similarity rather than spatial precision. The linear probe cannot compensate for the lost spatial information.
DINOv2 ViT-B/14, frozen + 2-layer adapter: 81% success. The best result. DINOv2's patch-level features preserve fine spatial structure. The adapter learns to focus on contact-relevant patches. 8ms encoder + 2ms adapter = 10ms total.
DINOv2 + SigLIP dual encoder, frozen + adapter: 80% success (no language in this task, so SigLIP adds nothing). But when language conditioning is added ("pick up the RED object"), this dual encoder reaches 85% while DINOv2-only drops to 65% (its features encode color, but nothing aligns them with the word "red", so the instruction cannot be grounded).
Takeaway: the encoder choice depends on the task. For spatial precision: DINOv2. For language grounding: SigLIP/CLIP. For both: dual encoder. For minimum cost: ResNet-18 fine-tuned.

The three augmentations that matter. Extensive ablation studies across manipulation benchmarks consistently find that three augmentations earn their seat. Their effects are additive and their computational cost is negligible (<1ms per image):

Random shifts. Translate the image by $\pm$4 pixels in each direction (pad edges with border pixels). This simulates camera calibration error: between sessions, a robot's cameras shift by a few pixels due to thermal expansion, vibration, or accidental bumps. Without shift augmentation, a policy trained on Monday fails on Tuesday because the pixels moved. With it, the policy is robust to $\pm$4-pixel camera motion. The cost of not using this augmentation is typically 10–15% success rate degradation in real-world deployment.
Color jitter. Randomly perturb brightness (±20%), contrast (±20%), and saturation (±10%). Different rooms have different lighting; the same task under fluorescent lights vs. natural light looks dramatically different to a pixel-level policy. Color jitter forces the encoder to focus on shapes and spatial structure rather than absolute color values.
Random crop at test time (DrQ-v2 trick). At inference, take $M = 2$ random crops of the input image, run the encoder on each, and average the resulting features. This is a test-time augmentation that smooths out the policy's sensitivity to exact crop position. The compute cost is $M\times$ the encoder cost, so it is only viable with fast encoders (ResNet-18) or when the encoder is already cached.
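The shift and brightness augmentations are short enough to sketch in plain Python. The version below works on a single-channel image stored as a list of rows, which is a simplification; a real pipeline would apply the same logic per channel with an array library. Function names are hypothetical.

```python
import random


def shift(image, dx, dy):
    """Translate by (dx, dy) pixels, filling exposed edges with the nearest border pixel."""
    h, w = len(image), len(image[0])

    def clamp(v, hi):
        return max(0, min(v, hi))

    return [[image[clamp(y - dy, h - 1)][clamp(x - dx, w - 1)] for x in range(w)]
            for y in range(h)]


def random_shift(image, max_shift=4, rng=random):
    """Sample an integer shift in [-max_shift, max_shift] on each axis."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return shift(image, dx, dy)


def jitter_brightness(image, max_delta=0.2, rng=random):
    """Scale all pixels by a factor in [1 - max_delta, 1 + max_delta], clipped to 0-255."""
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    return [[min(255.0, max(0.0, p * factor)) for p in row] for row in image]


img = [[1, 2, 3], [4, 5, 6]]
print(shift(img, 1, 0))  # content moves one pixel right, left column repeated
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which helps when debugging a training run.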

Feature caching and batch inference

A practical optimization that saves 30–50% of inference time: because the vision encoder is frozen, its outputs can be cached and reused. If the camera image has not changed significantly between control steps (which is common at 50Hz+ control rates), the encoder features from the previous step can be reused without recomputing. A simple L2 distance threshold on the raw image determines whether to recompute: if $\|I_t - I_{t-1}\|_2 / N_{\text{pixels}} < \epsilon$ (typical $\epsilon = 0.01$), reuse the cached features.
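A minimal sketch of the threshold test follows, assuming the image is a flat list of floats in [0, 1] and the encoder is any callable. One deliberate difference from the formula: the comparison is made against the frame that produced the cached features rather than the immediately preceding frame, so slow drift cannot accumulate unnoticed. That choice, and the class name, are assumptions of this sketch.

```python
import math


class FeatureCache:
    """Reuse frozen-encoder features while the image stays within epsilon of the cached frame."""

    def __init__(self, encoder, epsilon=0.01):
        self.encoder = encoder
        self.epsilon = epsilon
        self._cached_image = None
        self._cached_features = None
        self.encoder_calls = 0

    def _change(self, image):
        # L2 norm of the difference, divided by the number of pixels.
        sq = sum((a - b) ** 2 for a, b in zip(image, self._cached_image))
        return math.sqrt(sq) / len(image)

    def __call__(self, image):
        if self._cached_image is not None and self._change(image) < self.epsilon:
            return self._cached_features
        self._cached_image = list(image)
        self._cached_features = self.encoder(image)
        self.encoder_calls += 1
        return self._cached_features


cache = FeatureCache(encoder=lambda img: sum(img) / len(img))
cache([0.5] * 100)            # first frame: encoder runs
cache([0.5] * 99 + [0.6])     # tiny change: cached features reused
cache([1.0] * 100)            # large change: encoder runs again
print(cache.encoder_calls)
```

Note that dividing an L2 norm by the pixel count makes the threshold depend on image size, so ε should be tuned per resolution.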

For training, the optimization is even more dramatic: pre-compute all encoder features for the entire dataset before training begins. Store them as tensors on disk. The training loop reads features directly, bypassing the encoder entirely. This cuts training time by 2–3× for large ViT encoders and reduces GPU memory (no encoder in the training graph). The only requirement is that the encoder is truly frozen — if any gradient flows through the encoder, pre-computation is invalid.

13·B Language-conditioned planning

The bridge between LLMs and robot actions — using language models as high-level planners that decompose tasks into primitives a low-level policy can execute.

A VLA puts language understanding and motor control inside a single forward pass. Language-conditioned planning takes the opposite approach: a large language model plans, and a separate low-level policy acts. The LLM never touches the joints. It proposes a sequence of subgoals or primitive calls, and a pre-trained controller executes each one. The appeal is compositionality: a robot that knows 20 primitives can, in principle, solve any task that decomposes into a sequence of those 20 primitives — without any new training data.

The hierarchy runs three levels deep: a VLM or LLM for "what to do" (task decomposition and common-sense reasoning), a planner for "how to do it" (sequencing, constraint satisfaction, spatial reasoning), and a low-level policy for "muscle memory" (the actual motor commands). Each level operates at a different frequency and a different level of abstraction. The open question is where to draw the boundaries.

SayCan — affordance grounding

Ahn et al., 2022. The founding paper of the paradigm. The setup: a large language model (PaLM) proposes candidate next actions in natural language ("pick up the sponge", "go to the sink", "wipe the counter"). For each candidate, a pre-trained value function scores the probability that the robot can actually execute it right now, given its current state. The selected action is the one that maximizes the product of language usefulness (from the LLM) and physical feasibility (from the value function).

In plain English: the language model plays the role of a chef calling out orders — it knows what dish to make but has no idea which ingredients are within arm's reach. The value function plays the role of the cook at the station — it knows exactly what it can grab right now but has no idea about the recipe. Multiply the two scores together and the highest-scoring action is both useful for the task and physically doable.

SayCan scoring $$ a^* = \arg\max_{a_i \in \mathcal{A}} \; \underbrace{p_{\text{LLM}}(a_i \mid \text{instruction}, \text{history})}_{\text{useful?}} \;\times\; \underbrace{V(s, a_i)}_{\text{feasible?}} $$

$\mathcal{A}$ — the set of available primitives. Each is a short natural-language description paired with a pre-trained low-level policy. Typical set: 50–100 primitives covering navigation, picking, placing, opening, closing.
$p_{\text{LLM}}(a_i \mid \text{instruction}, \text{history})$ — the language model's score for how useful action $a_i$ is toward completing the instruction, given what has already been done. This is the LLM's next-token probability for the action string.
$V(s, a_i)$ — the affordance score. A value function trained via RL or BC that estimates the probability of successfully executing primitive $a_i$ from the current state $s$. This is the robot's self-knowledge: "I can pick up the sponge from here, but I can't reach the shelf."

In code: score = llm_prob * affordance_value for each candidate skill, then best = skills[scores.argmax()]. The LLM probabilities come from the model's token log-probabilities over the skill name strings. The affordance values come from pre-trained per-skill value functions evaluated on the current observation. Total inference: LLM scoring of the N skill strings (which can be batched into a single pass) + N value-function forward passes (N = number of skills, typically 50–100).

The product is elegant: the LLM says what's useful, the robot says what's possible. Neither alone is sufficient — the LLM doesn't know the robot's reach, and the value function doesn't know what task the human wants. Together they ground language in physical reality.

Worked example: SayCan in action. Instruction: "I spilled my drink, can you help?" The robot is in a kitchen.
Step 1. LLM proposes candidates: "pick up sponge" (0.35), "go to table" (0.20), "find a towel" (0.25), "open fridge" (0.05), "pick up cup" (0.15).
Step 2. Value function scores feasibility from current state (near counter): "pick up sponge" (0.92, sponge is visible), "go to table" (0.85), "find a towel" (0.30, no towel in view), "open fridge" (0.90), "pick up cup" (0.70).
Step 3. Products: sponge = 0.35 × 0.92 = 0.322, table = 0.20 × 0.85 = 0.170, towel = 0.25 × 0.30 = 0.075, fridge = 0.05 × 0.90 = 0.045, cup = 0.15 × 0.70 = 0.105.
Selected: "pick up sponge" (0.322). The robot executes that primitive, then replans. The LLM, now conditioned on "picked up sponge", scores "go to spill" highest.

SayCan scoring: the math in detail

Worked numerical example: SayCan scoring with 3 skills. The robot has 3 available skills: [pick(cup), place(table), pour(cup)]. The instruction is "fill the cup with water."
LLM scores (probability that each skill is the useful next step): pick(cup) = 0.8, place(table) = 0.1, pour(cup) = 0.1.
Value function scores (probability of successful execution from current state): pick(cup) = 0.9 (cup is visible and reachable), place(table) = 0.7 (table is clear), pour(cup) = 0.2 (robot is not holding anything, so it can't pour).
Combined scores ($p_{\text{LLM}} \times V$): pick(cup) = $0.8 \times 0.9 = 0.72$, place(table) = $0.1 \times 0.7 = 0.07$, pour(cup) = $0.1 \times 0.2 = 0.02$.
Selected action: pick(cup) with score 0.72. The LLM wanted to pick the cup, and the value function confirmed the robot can do it.
After picking the cup, the LLM is re-queried with the updated history. Now pour(cup) gets a high LLM score (0.7) and a high value score (0.85, since the robot is now holding the cup near the faucet). The multiplicative scoring automatically sequences the task: first pick (because you can't pour without holding), then pour (because the instruction says "fill").
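The three-skill example maps directly onto a few lines of Python. This is a sketch of the scoring rule only: the probabilities are typed in by hand, whereas a real system would obtain them from an LLM and from learned value functions. The function name is hypothetical.

```python
def saycan_select(llm_probs: dict[str, float],
                  affordances: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Score each skill as p_LLM * V and return the best skill with all scores."""
    scores = {skill: llm_probs[skill] * affordances[skill] for skill in llm_probs}
    best = max(scores, key=scores.get)
    return best, scores


# Numbers from the worked example: instruction "fill the cup with water."
llm_probs = {"pick(cup)": 0.8, "place(table)": 0.1, "pour(cup)": 0.1}
affordances = {"pick(cup)": 0.9, "place(table)": 0.7, "pour(cup)": 0.2}

best, scores = saycan_select(llm_probs, affordances)
for skill, score in scores.items():
    print(f"{skill:13s} {score:.2f}")
print("selected:", best)
```

After the selected primitive runs, the same function would be called again with probabilities conditioned on the updated history.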

Inner Monologue — closed-loop language feedback

Huang et al., 2022. SayCan is open-loop with respect to execution outcomes: the LLM picks one primitive at a time, but its history records only what was attempted, so if something goes wrong mid-execution, it doesn't know. Inner Monologue closes the loop. After each primitive execution, the robot generates a text description of what it observes (via an image captioner or object detector), and that description is appended to the LLM's context. If the primitive failed (for example, "the sponge was not picked up, it is still on the counter"), the LLM replans.

The feedback sources are heterogeneous: success/failure detectors, scene descriptions from a VLM, human corrections typed into a chat interface. The LLM treats them all as text. This is both the strength (any sensor can contribute) and the weakness (text is a lossy representation of the world state).

Code as Policies — the policy is the program

Liang et al., 2023. Instead of scoring a fixed set of primitives, the LLM writes Python code that calls perception APIs and motion primitives directly. The "policy" is the generated program. Give the LLM a prompt with API documentation — get_obj_pos("mug"), move_to(x, y, z), grasp() — and a natural-language instruction, and it produces executable code.

Code as Policies — LLM output

# Instruction: "put the red block on top of the blue block"
# Generated by LLM:
red_pos = get_obj_pos("red block")
blue_pos = get_obj_pos("blue block")
move_to(red_pos[0], red_pos[1], red_pos[2] + 0.05)  # approach from above
grasp()
move_to(blue_pos[0], blue_pos[1], blue_pos[2] + 0.08)  # place above blue
release()

The power is combinatorial: the LLM can compose primitives in arbitrary ways, use loops and conditionals, and call perception mid-execution. The fragility is also combinatorial: one wrong coordinate, one hallucinated API name, one off-by-one in a loop, and the robot does something wrong or dangerous. There is no learned recovery — the code either works or it doesn't.
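One cheap guard against hallucinated API names is to parse the generated program before running it and compare the functions it calls against the documented API. The sketch below uses Python's standard `ast` module. The allow-list and the `lift_object` call are made up for illustration, and this check is not described as part of Code as Policies itself.

```python
import ast

# Documented robot API (taken from the prompt example above).
ALLOWED_API = {"get_obj_pos", "move_to", "grasp", "release"}


def undefined_calls(source: str, allowed: set[str]) -> set[str]:
    """Return names called as plain functions in `source` that are not in `allowed`.

    Raises SyntaxError if the generated code does not parse.
    """
    tree = ast.parse(source)
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return called - allowed


generated = '''
red_pos = get_obj_pos("red block")
move_to(red_pos[0], red_pos[1], red_pos[2] + 0.05)
grasp()
lift_object()  # not part of the documented API
'''

print(undefined_calls(generated, ALLOWED_API))
```

This catches unknown function names only. Builtins such as `range` would need to be added to the allow-list, and a wrong coordinate or an off-by-one still passes the check.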

Worked example: Code as Policies for a multi-step task. Instruction: "sort the fruits into the bowls by color: red fruits in the left bowl, green fruits in the right bowl." The LLM receives the API documentation and generates:

Code as Policies — LLM-generated sorting program

# API: get_obj_pos(name), get_obj_color(name), get_obj_category(name),
#      move_to(x,y,z), grasp(), release(), get_objects_on("table")

left_bowl = get_obj_pos("left bowl")
right_bowl = get_obj_pos("right bowl")

for obj in get_objects_on("table"):
    if "fruit" not in get_obj_category(obj):
        continue
    pos = get_obj_pos(obj)
    color = get_obj_color(obj)
    # Approach from above
    move_to(pos[0], pos[1], pos[2] + 0.05)
    grasp()
    # Route to correct bowl
    if color == "red":
        target = left_bowl
    else:
        target = right_bowl
    move_to(target[0], target[1], target[2] + 0.10)
    release()

This program composes perception (color detection, object enumeration), control flow (loops, conditionals), and motion primitives into a behavior that no fixed primitive set could express. The LLM effectively serves as a program synthesizer that translates natural language into executable robot code. The failure mode is also visible: if get_obj_color returns "orange" for a tomato, it goes in the wrong bowl. There is no graceful degradation.

VoxPoser — 3D value maps from language

Huang et al., 2023. A different interface between language and action. Instead of generating code that calls motion primitives, the LLM generates code that writes 3D voxel maps: a value map (where the end-effector should go) and a constraint map (where it should not go). A classical motion planner then optimizes a trajectory through the voxel space.

The key insight: 3D value maps are a natural interface between language-level semantics ("put it on the shelf") and motion-planner-level geometry ("the end-effector should converge to coordinates [0.3, 0.5, 0.8] while avoiding the obstacle at [0.3, 0.3, 0.6]"). The LLM provides the semantics; the voxel map provides the geometry; the planner provides the dynamics.

Worked example: VoxPoser 3D value map. Instruction: "move to the cup." The workspace is discretized into a 40×40×40 voxel grid (a 0.5m cube, ~1.25cm resolution).

Step 1: LLM generates code. The LLM receives the instruction and a list of detected objects with their 3D positions. It generates Python code that writes a scalar field:

value_map[cup_x-5:cup_x+5, cup_y-5:cup_y+5, cup_z:cup_z+10] = 1.0

This creates a "hot zone" of high value (1.0) in the voxels near and above the cup. All other voxels remain at 0.0.

Step 2: LLM generates constraint map. Obstacle avoidance: constraint_map[:, :, table_surface_z-2:table_surface_z+2] = -1.0. Voxels at the table surface have negative value (repulsive).

Step 3: Motion planner. A gradient-based planner (MPC or trajectory optimization) finds the end-effector path that maximizes cumulative value while respecting the constraint map. The robot's end-effector follows the gradient of the value map, ascending toward the cup from its current position while avoiding the table surface.

The 3D value map serves as a "potential field" that the motion planner navigates. The LLM never generates a trajectory directly; it generates the landscape, and the planner finds the path.
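A value map and a constraint map are just two scalar fields over the same grid. The sketch below builds both with nested lists for a 40-voxel-per-axis workspace. The grid size, the cup and table coordinates, the `[x][y][z]` index order, and the helper names are all illustrative assumptions, and the motion planner itself is omitted.

```python
GRID = 40  # voxels per axis (illustrative)


def make_map(fill=0.0):
    """Allocate a GRID x GRID x GRID scalar field indexed as [x][y][z]."""
    return [[[fill] * GRID for _ in range(GRID)] for _ in range(GRID)]


def fill_box(vmap, xs, ys, zs, value):
    """Set vmap to `value` inside the half-open box xs x ys x zs, clipped to the grid."""
    for x in range(max(xs[0], 0), min(xs[1], GRID)):
        for y in range(max(ys[0], 0), min(ys[1], GRID)):
            for z in range(max(zs[0], 0), min(zs[1], GRID)):
                vmap[x][y][z] = value


cup = (20, 20, 10)   # hypothetical cup voxel
table_z = 8          # hypothetical table-surface height in voxels

# Attractive region: voxels near and above the cup.
value_map = make_map()
fill_box(value_map, (cup[0] - 5, cup[0] + 5), (cup[1] - 5, cup[1] + 5), (cup[2], cup[2] + 10), 1.0)

# Repulsive region: a slab around the table surface.
constraint_map = make_map()
fill_box(constraint_map, (0, GRID), (0, GRID), (table_z - 2, table_z + 2), -1.0)


def landscape(x, y, z):
    """Combined field a planner would ascend: value plus (negative) constraint."""
    return value_map[x][y][z] + constraint_map[x][y][z]


print(landscape(20, 20, 12), landscape(20, 20, 7), landscape(0, 0, 30))
```

A binary field like this has zero gradient away from the hot zone, so a practical planner would typically smooth the maps or plan toward the highest-value region rather than follow a local gradient.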

ReKep — relational keypoint constraints

Huang et al., 2024. The LLM specifies constraints not in voxel space but as relational keypoint constraints: "keypoint A (the cup handle) should be within 2cm of keypoint B (the hook), and keypoint C (the cup bottom) should be above keypoint D (the shelf surface)." A numerical optimizer finds a trajectory satisfying all constraints simultaneously.

In plain English: the LLM says "keep the gripper above the cup rim" and "bring the spout close to the mug opening." The optimizer then finds a smooth arm trajectory that satisfies all of these spatial relationships simultaneously — without the LLM ever touching a joint angle. The LLM writes the rules of the game; the optimizer plays it.

ReKep constraint optimization $$ \tau^* = \arg\min_\tau \sum_i \lambda_i \, c_i(\tau) \quad \text{s.t.} \quad c_j(\tau) \leq 0 \;\; \forall j \in \mathcal{C}_{\text{hard}} $$

$\tau$ — the robot trajectory. A sequence of end-effector poses over time.
$c_i(\tau)$ — a soft constraint cost. Penalizes violations of relational keypoint constraints specified by the LLM. Example: $\| p_A(\tau_T) - p_B \|_2^2$ (keypoint A should reach keypoint B at the final timestep).
$\mathcal{C}_{\text{hard}}$ — the set of hard constraints: collision avoidance, joint limits, stability.
$\lambda_i$ — priority weights for soft constraints, also specified by the LLM. "The cup must not spill" gets a higher weight than "approach from the left."

What this means for your system: the LLM generates Python cost functions like lambda traj: (traj[-1].ee_pos - hook_pos).norm(), and the optimizer (typically scipy.optimize.minimize or a shooting method) minimizes the weighted sum. Latency is dominated by the optimizer: 50–200ms per replan, which is acceptable for 1–2Hz subgoal planning but too slow for servo-rate control. Keypoint detection accuracy is the primary failure mode — a 2cm localization error on "the cup handle" propagates directly into a 2cm placement error.
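The structure of the objective, a weighted sum of soft costs subject to hard constraints, can be illustrated without an optimizer. The sketch below evaluates three hand-written candidate trajectories and keeps the best feasible one. Enumerating candidates stands in for the numerical optimizer, and the keypoint position, weights, and table-height constraint are all made-up values.

```python
import math

hook_pos = (0.3, 0.5, 0.8)   # hypothetical keypoint B
TABLE_Z = 0.1                # hypothetical table height for the hard constraint

# Soft costs, in the style of LLM-written lambdas over a list of (x, y, z) poses.
soft_costs = [
    lambda traj: math.dist(traj[-1], hook_pos) ** 2,                           # reach the hook
    lambda traj: sum(math.dist(a, b) ** 2 for a, b in zip(traj, traj[1:])),   # short, smooth path
]
weights = [10.0, 1.0]

# Hard constraints, each written so that c(traj) <= 0 means "satisfied".
hard_constraints = [
    lambda traj: max(TABLE_Z - p[2] for p in traj),  # never go below the table
]


def objective(traj):
    return sum(w * c(traj) for c, w in zip(soft_costs, weights))


def feasible(traj):
    return all(c(traj) <= 0 for c in hard_constraints)


candidates = {
    "direct":  [(0.0, 0.0, 0.5), (0.15, 0.25, 0.65), (0.3, 0.5, 0.8)],
    "dips":    [(0.0, 0.0, 0.5), (0.15, 0.25, 0.05), (0.3, 0.5, 0.8)],
    "shortfall": [(0.0, 0.0, 0.5), (0.1, 0.2, 0.6), (0.2, 0.4, 0.7)],
}

feasible_names = [name for name, traj in candidates.items() if feasible(traj)]
best = min(feasible_names, key=lambda name: objective(candidates[name]))
print(feasible_names, best)
```

The "dips" trajectory reaches the hook but is rejected by the hard constraint, while "shortfall" is feasible but pays the heavily weighted reach cost.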

ReKep's advantage over VoxPoser: keypoint constraints are more interpretable, easier for the LLM to specify correctly, and more sample-efficient for the optimizer. The cost is that the keypoints must be detected in the scene — which requires a vision model that can localize "the cup handle" and "the hook" from a language description.

CoPa and SpatialVLM — VLM-based spatial planners

The newest wave replaces the LLM + separate perception pipeline with a single VLM that can reason about 3D space directly. CoPa (Huang et al., 2024) uses a VLM to generate manipulation plans by reasoning over object contact points and post-contact trajectories. SpatialVLM (Chen et al., 2024) trains a VLM on spatial reasoning data so it can answer quantitative questions ("how far is the mug from the edge?") and use those answers to parameterize actions. The direction is clear: collapse the LLM + perception stack into a single model that sees and reasons simultaneously.

The comparison

Method	Interface	Closed-loop?	Spatial reasoning	Failure mode
SayCan	Score fixed primitive set	Open-loop (per step)	Via value functions only	Missing primitive = stuck
Inner Monologue	Score primitives + text feedback	Yes (text observations)	Via captioner	Captioner error → bad replan
Code as Policies	Generated Python code	Optional (re-call LLM)	Via perception APIs	Bad code = bad action
VoxPoser	3D voxel value/constraint maps	Replan per phase	Explicit 3D voxel grid	Coarse voxels = imprecise
ReKep	Relational keypoint constraints	Replan on failure	Keypoint coordinates	Bad keypoint detection = wrong target
CoPa / SpatialVLM	VLM-generated contact plans	Yes (VLM observes)	Native VLM spatial reasoning	VLM hallucination

The latency problem

Every language-conditioned planning method shares a fundamental timing constraint: LLM inference is slow. A single planning call to a 7B model (processing the prompt and decoding the response) takes 500ms–2s, depending on hardware and prompt length. For a 10Hz control loop (100ms per step), you cannot call the LLM at every timestep: one call alone overruns the step budget by 5–20×.

The standard solution is a hierarchical frequency split:

High-level planner (LLM): 0.1–1 Hz. Called once per primitive (every 2–10 seconds). Selects the next subgoal or writes the next code snippet. The 500ms–2s latency is acceptable because the planner is not on the inner control loop.
Low-level policy: 10–50 Hz. Executes the selected primitive reactively. Does not involve the LLM at all. Runs a pre-trained BC or RL policy conditioned on the subgoal.

This is the same System 2 / System 1 split as the VLA architecture (Section 12), but with the boundary drawn at the language-action interface rather than inside a single model. The LLM does high-level reasoning at human decision speed; the policy does motor control at the 10–50 Hz control rate, and the robot's internal servo loop tracks the policy's targets at kHz rates.
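The frequency split can be sketched as two nested loops. This is a toy simulation under stated assumptions: `plan_next_subgoal` and `policy_step` are hypothetical stand-ins for the LLM call and the policy forward pass, and every primitive is assumed to take the same fixed duration.

```python
def plan_next_subgoal(remaining):
    """Hypothetical stand-in for the slow LLM call: pops the next primitive."""
    return remaining.pop(0)

def policy_step(subgoal, step):
    """Hypothetical stand-in for one fast policy forward pass."""
    return (subgoal, step)

def run_hierarchical(subgoals, control_hz=10, seconds_per_primitive=4.0):
    """Run the two-rate loop and return (planner_calls, policy_steps)."""
    remaining = list(subgoals)
    planner_calls, policy_steps = 0, 0
    while remaining:
        subgoal = plan_next_subgoal(remaining)   # slow path: once per primitive
        planner_calls += 1
        for step in range(int(control_hz * seconds_per_primitive)):
            policy_step(subgoal, step)           # fast path: no LLM involved
            policy_steps += 1
    return planner_calls, policy_steps
```

With three primitives at the default settings, the planner is called 3 times while the policy runs 120 steps, so the slow model sits outside the inner loop entirely.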

Worked example: latency budget for SayCan. The robot must execute "bring me a coke from the fridge." Decomposition (LLM, called 6 times):
1. "Navigate to fridge": LLM scores + selects (1.2s). Low-level nav policy executes (8s).
2. "Open fridge door": LLM scores + selects (1.0s). Low-level policy executes (3s).
3. "Pick up coke can": LLM scores + selects (1.1s). Low-level policy executes (4s).
4. "Close fridge door": LLM scores + selects (0.9s). Low-level policy executes (3s).
5. "Navigate to user": LLM scores + selects (1.0s). Low-level policy executes (6s).
6. "Hand over coke": LLM scores + selects (0.8s). Low-level policy executes (2s).
Total LLM time: 1.2 + 1.0 + 1.1 + 0.9 + 1.0 + 0.8 = 6.0s. Total execution time: 8 + 3 + 4 + 3 + 6 + 2 = 26s. Total task time: 32s. The LLM accounts for ~19% of the task time (6s of 32s). If the LLM were instead called at every 10Hz control step (260 calls over the 26s of execution, at ~1s each), it would add ~260s of latency, ten times the execution time itself.
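The budget arithmetic is easy to re-run for a different task. A minimal sketch using the per-primitive timings from the example; the choice to count control steps only while a primitive is executing (not while the LLM is thinking) is an assumption of this sketch.

```python
# (primitive, LLM seconds, execution seconds), taken from the worked example.
PRIMITIVES = [
    ("navigate to fridge", 1.2, 8.0),
    ("open fridge door",   1.0, 3.0),
    ("pick up coke can",   1.1, 4.0),
    ("close fridge door",  0.9, 3.0),
    ("navigate to user",   1.0, 6.0),
    ("hand over coke",     0.8, 2.0),
]
CONTROL_HZ = 10

llm_total = sum(llm for _, llm, _ in PRIMITIVES)
exec_total = sum(ex for _, _, ex in PRIMITIVES)
task_total = llm_total + exec_total
llm_share = llm_total / task_total            # fraction of wall-clock spent planning

# Hypothetical alternative: one LLM call per control step during execution.
control_steps = round(exec_total * CONTROL_HZ)
mean_call = llm_total / len(PRIMITIVES)
per_step_latency = control_steps * mean_call  # seconds of LLM time if called every step

print(f"LLM {llm_total:.1f}s + execution {exec_total:.1f}s = {task_total:.1f}s "
      f"({llm_share:.1%} planning)")
print(f"Per-step calling: {control_steps} calls, about {per_step_latency:.0f}s of LLM time")
```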

Limitations

Language-conditioned planning is powerful for long-horizon multi-step tasks where the planning horizon exceeds what a single policy can handle. But the limitations are real and should inform when you reach for this approach vs. an end-to-end VLA:

Latency. An LLM call takes 500ms–2s. If you're replanning once per primitive (every 2–10 seconds), this is fine. If you need reactive control at 10Hz, it's not. The two-system VLA split solves this by amortizing the LLM call over many control steps.
Brittleness. One bad code generation, one hallucinated coordinate, one misidentified object, and the entire plan fails. There is no graceful degradation — the failure mode is binary. End-to-end VLAs degrade more smoothly because the policy is a continuous function, not a program.
The "last mile" problem. Language is too coarse for fine manipulation. "Pick up the needle" does not tell the policy how to orient the fingers, how hard to squeeze, or how to compensate for the needle's flex. The low-level policy must handle all of this, and the planner cannot help.
Primitive coverage. SayCan-style methods are limited to the primitives that have been pre-trained. If the task requires a motion the primitive set doesn't cover, the system is stuck. Code-based methods (Code as Policies, VoxPoser) are more flexible but shift the coverage problem to the API surface.

System integration: the full stack in 2026

The convergent architecture uses language-conditioned planning as one layer in a multi-layer system:

Layer	Frequency	Component	Input	Output
Task decomposition	0.1–0.5 Hz	LLM / VLM	Language instruction + scene image	Sequence of subgoals
Subgoal planning	1–2 Hz	VoxPoser / ReKep / Code-as-Policies	Current subgoal + scene	Spatial targets or constraint maps
Motor execution	10–200 Hz	Diffusion Policy / VLA action expert	Spatial target + observation	Joint or EE commands

Each layer operates at a different timescale and abstraction level. The task decomposition layer runs once per task or per major phase transition. The subgoal planning layer runs when a new subgoal starts and replans at 1–2 Hz while that subgoal is active. The motor execution layer runs at the policy's control rate, with the robot's internal servo loop tracking its commands underneath. Information flows downward (higher layers condition lower layers); feedback flows upward (motor failures and scene changes trigger replanning at higher layers).

When to use what

Use language-conditioned planning when: the task has 5+ sequential stages, requires common-sense reasoning ("the cup goes in the dishwasher, not the trash"), or must generalize to instructions the policy has never seen. Use an end-to-end VLA when: the task is short-horizon, requires precise manipulation, or latency matters. Use both when: the VLM plans at 1Hz and the VLA executes at 50Hz — which is, increasingly, the dominant architecture.

The future: unified VLM planners

The separation between "planning" (LLM-based, discrete, symbolic) and "execution" (policy-based, continuous, learned) is an artifact of the current technology stack. The convergent architecture is already collapsing these layers: Gemini Robotics 1.5 interleaves reasoning tokens with action tokens in a single model. The model does not "plan" and then "hand off"; it thinks and acts in the same token stream. The reasoning is causally upstream of the actions, providing the same compositional benefits as a separate planner but without the latency and integration overhead of a two-model system.

This trend — from separate LLM + policy to unified VLM that reasons and acts — is likely to make the SayCan/Code-as-Policies paradigm obsolete within 2–3 years. But the concepts (affordance grounding, spatial value maps, relational constraints) will survive as inductive biases or training objectives for the unified models. Understanding them now is essential for understanding what comes next.

Language-conditioned planning is not a competitor to VLAs. It is a layer above them. The convergent architecture of 2026 uses a VLM for task decomposition, a language-conditioned planner for sequencing, and a VLA or low-level policy for execution. The debate is not "which one" but "where to draw the boundaries."

14 PPO

The locomotion workhorse. The reason simulator-trained quadrupeds walk.

Proximal Policy Optimization (Schulman et al., 2017) is an on-policy actor-critic algorithm that became the dominant RL method for robotics-in-simulation.

The objective

In plain English: imagine you have a room full of robot arms, and each one tried a slightly different approach to the task. Some did well, some did badly. PPO says: for the ones that did well, make that behavior a little more likely next time, but not too much more, or you'll overcommit to a fluke. For the ones that did badly, make that behavior a little less likely, again only up to a limit per update. The "clipping" is the speed limit on how fast you can change your mind.

PPO clipped surrogate $$ \mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\!\big( r_t(\theta)\, \hat A_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat A_t \big)\Big]$$ $$ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \qquad \hat A_t = \text{GAE}(\lambda)$$

$r_t(\theta)$ — the importance ratio. Measures how much more (or less) likely the current policy $\pi_\theta$ is to take action $a_t$ compared to the old policy $\pi_{\theta_\text{old}}$ that actually collected the data. $r_t = 1$ means no change; $r_t = 2$ means the action is now twice as likely.
$\pi_\theta(a_t \mid s_t)$ — the current policy's probability of taking action $a_t$ in state $s_t$. This changes as we update $\theta$.
$\pi_{\theta_\text{old}}(a_t \mid s_t)$ — the old policy's probability (frozen snapshot from before the current update). The data was collected under this policy.
$\hat A_t$ — the estimated advantage of action $a_t$ in state $s_t$. Positive means "better than average," negative means "worse than average." Computed via Generalized Advantage Estimation (GAE), which blends 1-step, 2-step, ..., $n$-step TD errors with exponential weighting $\lambda \approx 0.95$.
$\epsilon$ — the clip range, typically 0.2. Limits the policy ratio to $[0.8, 1.2]$, preventing any single update from changing the policy too dramatically. This is PPO's replacement for TRPO's hard KL constraint.
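$\min(\cdot, \cdot)$ — the pessimistic bound. Takes the lower of the clipped and unclipped objective. When advantage is positive (good action), this prevents the policy from increasing the action's probability too aggressively. When advantage is negative (bad action), it likewise stops rewarding further decreases once $r_t$ falls below $1-\epsilon$. The unclipped term wins only when the ratio has moved in the wrong direction, so a mistaken update can always be corrected.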

In code: ratio = torch.exp(log_prob_new - log_prob_old), then loss = -torch.min(ratio * adv, torch.clamp(ratio, 1-eps, 1+eps) * adv).mean(). Two lines. The ratio is computed in log-space for numerical stability. The advantage adv comes from a backward GAE loop over the trajectory (see the worked example below). In Isaac Gym with 4096 parallel environments, one PPO update takes ~200ms and processes ~65K transitions (4096 environments × 16 steps).

Three pieces:

Importance ratio $r_t$ — corrects for the fact that the data was collected by the old policy.
Advantage $\hat A_t$ — typically generalized advantage estimation, a $\lambda$-weighted blend of $n$-step TD errors. $\lambda \approx 0.95$ is standard.
Clipping — when $r_t$ moves outside $[1-\epsilon, 1+\epsilon]$ (typically $\epsilon = 0.2$) in the direction the advantage favors, the surrogate flattens and the gradient vanishes.

Derivation: the policy gradient theorem

In plain English: try random things, observe what happens, and do more of what worked. The policy gradient says: if an action led to a high reward, nudge the policy to make that action more likely. If it led to a low reward, nudge the other way. The beauty is that you never need to know the physics of the environment — you only need to know which actions you took and how much reward you got.

Goal. Show that $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot Q^{\pi_\theta}(s, a)\right]$.

Step 1. The objective is $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ where $R(\tau) = \sum_t \gamma^t r_t$. The trajectory distribution is $p_\theta(\tau) = \mu(s_0) \prod_t \pi_\theta(a_t \mid s_t) p(s_{t+1} \mid s_t, a_t)$.

Step 2. Take the gradient. The dynamics $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$, so:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[R(\tau) \cdot \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] $$

This is the REINFORCE estimator. It has high variance because $R(\tau)$ includes rewards from before and after action $a_t$.

Step 3. Apply the "future-only" simplification: rewards before time $t$ don't depend on $a_t$, so they add zero-mean noise. Replace $R(\tau)$ with $Q^{\pi}(s_t, a_t) = \mathbb{E}[\sum_{k \geq t} \gamma^{k-t} r_k \mid s_t, a_t]$. Subtracting a baseline $V(s_t)$ gives the advantage $A_t = Q(s_t, a_t) - V(s_t)$, which further reduces variance without changing the expectation.
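The identity can be checked exactly on a problem small enough to enumerate. The sketch below uses a one-step softmax bandit (an assumption chosen for illustration, not a robotics setup), computes the score-function gradient as an exact expectation over actions, and compares it with a finite-difference gradient of $J(\theta)$. It also shows that subtracting a constant baseline leaves the expectation unchanged.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_return(theta, rewards):
    """J(theta) = sum_a pi(a) r(a) for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def score_function_gradient(theta, rewards, baseline=0.0):
    """E_a[ grad_theta log pi(a) * (r(a) - baseline) ], computed exactly."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, r_a) in enumerate(zip(pi, rewards)):
        for k in range(len(theta)):
            dlogp = (1.0 if k == a else 0.0) - pi[k]  # d log pi(a) / d theta_k
            grad[k] += p_a * dlogp * (r_a - baseline)
    return grad

def finite_difference_gradient(theta, rewards, h=1e-6):
    grad = []
    for k in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[k] += h
        dn[k] -= h
        grad.append((expected_return(up, rewards) - expected_return(dn, rewards)) / (2 * h))
    return grad

theta, rewards = [0.2, -0.1, 0.5], [1.0, 0.0, 2.0]
g_score = score_function_gradient(theta, rewards)
g_baseline = score_function_gradient(theta, rewards, baseline=expected_return(theta, rewards))
g_fd = finite_difference_gradient(theta, rewards)
```

In a sampled estimator the baseline changes the variance but not the mean; here, with the expectation taken exactly, the two gradients coincide up to floating-point error.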

Derivation: why clipping works

The issue with vanilla policy gradient: a single gradient step can change the policy drastically, especially when the advantage is large. TRPO solved this with a hard KL constraint, which requires second-order optimization. PPO replaces this with a simpler mechanism.

The surrogate objective without clipping is $\mathbb{E}[r_t \hat A_t]$ where $r_t = \pi_\theta / \pi_{\theta_{\text{old}}}$. This objective has the same gradient as the true policy gradient at $r_t = 1$ (i.e., when the new policy equals the old), but far from $r_t = 1$ it can be misleading.

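Clipping removes the incentive to move $r_t$ far from 1 in the direction the advantage favors. When $\hat A_t > 0$ (good action), the objective reduces to $\min(r_t, 1+\epsilon)\,\hat A_t$, which caps the benefit of increasing $r_t$ above $1+\epsilon$. When $\hat A_t < 0$ (bad action), it reduces to $\max(r_t, 1-\epsilon)\,\hat A_t$, which caps the benefit of decreasing $r_t$ below $1-\epsilon$. In both cases the surrogate goes flat and the gradient vanishes once the policy has moved far enough in the improving direction. The opposite side is deliberately left unclipped: if $r_t > 1+\epsilon$ while $\hat A_t < 0$ (the policy has made a bad action more likely), or $r_t < 1-\epsilon$ while $\hat A_t > 0$, the $\min$ selects the unclipped term and the full gradient pushes $r_t$ back. This asymmetry is the key: the policy can always undo a move in the wrong direction, but it cannot rush further in the right one.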
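The four sign-and-side combinations can be checked numerically. A minimal sketch that evaluates the clipped objective for a single sample and estimates its slope with respect to the ratio by central differences; the specific ratios 0.5 and 1.5 are arbitrary points outside the clip range.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def slope_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central-difference estimate of d(objective) / d(ratio)."""
    return (ppo_clip_objective(ratio + h, advantage, eps)
            - ppo_clip_objective(ratio - h, advantage, eps)) / (2 * h)

cases = {
    "A>0, r=1.5 (already raised)":  slope_wrt_ratio(1.5, +1.0),
    "A>0, r=0.5 (wrongly lowered)": slope_wrt_ratio(0.5, +1.0),
    "A<0, r=0.5 (already lowered)": slope_wrt_ratio(0.5, -1.0),
    "A<0, r=1.5 (wrongly raised)":  slope_wrt_ratio(1.5, -1.0),
}
for name, slope in cases.items():
    print(f"{name}: slope = {slope:+.1f}")
```

A zero slope means the sample contributes no gradient; a nonzero slope means the update can still move the ratio back toward 1.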

[Figure: interactive PPO clipping visualization. Curves: the unclipped surrogate $r \cdot A$ and the PPO objective $L^{\text{CLIP}}$, with the clip boundary and the flat region marked. Default settings: advantage $A = 1.5$, $\epsilon = 0.20$.]

The full PPO loss adds a value function regression term and an entropy bonus:

Full PPO loss $$ \mathcal{L} = -\mathcal{L}^{\text{CLIP}} + c_1 \cdot \mathcal{L}^{\text{VF}} - c_2 \cdot \mathcal{H}[\pi_\theta(\cdot \mid s_t)]$$

$-\mathcal{L}^{\text{CLIP}}$ — the negated clipped surrogate. Negated because we minimize the total loss but want to maximize the clipped objective (i.e., increase probability of good actions).
$\mathcal{L}^{\text{VF}} = (V_\theta(s_t) - V_t^{\text{target}})^2$ — the value function loss. Trains the critic to predict expected returns. $V_t^{\text{target}}$ is typically the GAE-based return estimate. $c_1 = 0.5$ is standard.
$\mathcal{H}[\pi_\theta(\cdot \mid s_t)]$ — the entropy of the policy's action distribution at state $s_t$. For a Gaussian policy, $\mathcal{H} = \frac{1}{2}\log(2\pi e \sigma^2)$ per dimension.
$c_2$ — the entropy coefficient, typically 0.0 to 0.01 for continuous control. The $-c_2 \mathcal{H}$ term (note the minus sign) rewards exploration by penalizing overly deterministic policies. Higher $c_2$ = more exploration, slower convergence.

In code: loss = ppo_clip_loss + 0.5 * F.mse_loss(v_pred, v_target) - 0.01 * entropy.mean(), where ppo_clip_loss is the already-negated clipped surrogate. The three terms are computed on the same batch and summed. The GAE advantages are computed in a backward loop over the trajectory before the PPO update, not inside it. Typical training: 4096 parallel environments in Isaac Gym, 24 steps per rollout = ~98K transitions per update, each reused for 4 PPO epochs (~393K sample passes per update).

Worked example: GAE advantage computation. Consider 4 timesteps with discount $\gamma = 0.99$ and GAE $\lambda = 0.95$. Rewards: $r_0 = 0.1$, $r_1 = 0.0$, $r_2 = 0.5$, $r_3 = 1.0$. Value estimates: $V(s_0) = 2.0$, $V(s_1) = 2.1$, $V(s_2) = 2.5$, $V(s_3) = 3.0$, $V(s_4) = 0$ (terminal). TD errors: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. $\delta_0 = 0.1 + 0.99 \times 2.1 - 2.0 = 0.1 + 2.079 - 2.0 = 0.179$. $\delta_1 = 0.0 + 0.99 \times 2.5 - 2.1 = 2.475 - 2.1 = 0.375$. $\delta_2 = 0.5 + 0.99 \times 3.0 - 2.5 = 0.5 + 2.97 - 2.5 = 0.97$. $\delta_3 = 1.0 + 0.99 \times 0 - 3.0 = -2.0$. GAE advantages (computed backwards): $\hat A_3 = \delta_3 = -2.0$. $\hat A_2 = \delta_2 + \gamma\lambda \hat A_3 = 0.97 + 0.99 \times 0.95 \times (-2.0) = 0.97 - 1.881 = -0.911$. $\hat A_1 = \delta_1 + \gamma\lambda \hat A_2 = 0.375 + 0.9405 \times (-0.911) = 0.375 - 0.857 = -0.482$. $\hat A_0 = \delta_0 + \gamma\lambda \hat A_1 = 0.179 + 0.9405 \times (-0.482) = 0.179 - 0.453 = -0.274$. Interpretation: the early actions ($t=0, 1$) have mildly negative advantage — they led to a trajectory that ended poorly ($r_3 = 1.0$ but $V(s_3) = 3.0$, so the terminal outcome was much worse than expected). PPO will decrease the probability of these actions.
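The backward recursion in the worked example can be written as a short loop. This is a minimal sketch for a single trajectory that ends at a terminal state; it assumes `values` carries one extra bootstrap entry and does not handle episode boundaries inside the rollout, which a batched implementation would need.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    `values` has len(rewards) + 1 entries; the last is V(s_T), which is 0
    when the final state is terminal.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

rewards = [0.1, 0.0, 0.5, 1.0]
values = [2.0, 2.1, 2.5, 3.0, 0.0]
advantages = compute_gae(rewards, values)
# Regression targets for the critic are the advantages plus the value estimates.
value_targets = [a + v for a, v in zip(advantages, values[:-1])]
print([round(a, 3) for a in advantages])
```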

Why PPO and not policy gradient

Vanilla policy gradient (REINFORCE) is high-variance. TRPO (the predecessor) is correct but uses second-order optimization that's painful to implement at scale. PPO replaces the trust region with a clipped first-order objective that recovers most of the stability benefit at a fraction of the engineering cost.

Where PPO shines

Locomotion in simulation. Isaac Gym, MuJoCo MJX, Brax. With 4096+ parallel environments, billions of timesteps cost an afternoon.
Sim-to-real with domain randomization.

Where PPO loses

Real-world robots. On-policy means each batch of rollouts is discarded after a few gradient epochs, which is unaffordable when every transition costs real robot time.
Sparse rewards. PPO needs reward signal; without shaping, it doesn't explore well.

15 SAC and the off-policy family

Maximum-entropy reinforcement learning. The default for sample-efficient RL.

Soft Actor-Critic (Haarnoja et al., 2018) is an off-policy actor-critic that adds an entropy bonus to the reward.

The maximum-entropy objective

In plain English: be good at the task, but also keep your options open. A standard RL agent that finds one way to grasp the cup commits to it fully — and breaks when the cup moves 2cm. A max-entropy agent maintains several viable grasping strategies simultaneously. The entropy bonus is the mathematical expression of "don't put all your eggs in one basket."

Max-entropy RL objective $$ J(\pi) = \mathbb{E}\Big[\sum_t \gamma^t \big( r_t + \alpha \cdot \mathcal{H}[\pi(\cdot \mid s_t)] \big)\Big]$$

$J(\pi)$ — the objective to maximize. Standard RL maximizes cumulative reward; max-entropy RL adds a bonus for being "random in a useful way."
$\gamma$ — the discount factor, typically 0.99. Weights future rewards: a reward $k$ steps from now is worth $\gamma^k$ as much as an immediate reward. $\gamma = 0.99$ means the agent cares about roughly the next 100 steps.
$r_t$ — the reward received at timestep $t$. Defined by the task (e.g., +1 for grasping the object, -0.01 per step as a time penalty).
$\alpha$ — the temperature parameter. Controls the exploration-exploitation trade-off. High $\alpha$ = more exploration (the agent values entropy nearly as much as reward). Low $\alpha$ = near-greedy (standard RL). Can be learned automatically (see below).
$\mathcal{H}[\pi(\cdot \mid s_t)] = -\mathbb{E}_\pi[\log \pi(a \mid s_t)]$ — the entropy of the policy at state $s_t$. High entropy means the policy spreads probability over many actions (exploratory); low entropy means it concentrates on one action (exploitative). The entropy bonus prevents premature collapse to a deterministic policy.

Derivation: the soft Bellman equation

Goal. Show that the optimal soft Q-function satisfies a modified Bellman equation with an entropy term.

In plain English: the robot dreams about what will happen next, considers all possible next actions (not just the single best one), and values states where it has many good options more than states where it has only one. The "soft" Bellman equation is the standard one but with a blurry maximum that keeps multiple strategies alive.

In standard RL, the Bellman equation is $Q^*(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}[\max_{a'} Q^*(s', a')]$. The max-entropy version replaces the hard max with a soft max (log-sum-exp):

Soft Bellman equation $$ Q_{\text{soft}}^*(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s'}\left[\alpha \log \sum_{a'} \exp\!\left(\frac{Q_{\text{soft}}^*(s', a')}{\alpha}\right)\right] $$

$Q_{\text{soft}}^*(s,a)$ — the optimal soft Q-function. Like the standard Q-function, it represents the value of taking action $a$ in state $s$ and acting optimally thereafter — but "optimally" now includes the entropy bonus.
$r(s,a)$ — the immediate reward for taking action $a$ in state $s$.
$\mathbb{E}_{s'}[\cdot]$ — expectation over the next state $s'$, drawn from the environment dynamics $p(s' \mid s, a)$.
$\alpha \log \sum_{a'} \exp(Q/\alpha)$ — the soft maximum (log-sum-exp). This is a smooth, differentiable approximation to $\max_{a'} Q(s', a')$. When $\alpha \to 0$, it becomes a hard max (standard Bellman). When $\alpha$ is larger, it "softens" the max, encouraging the agent to keep multiple near-optimal actions viable rather than committing to one.

The optimal policy is the Boltzmann distribution: $\pi^*(a \mid s) \propto \exp(Q_{\text{soft}}^*(s,a) / \alpha)$. With this policy, we can write the soft value as $V_{\text{soft}}(s) = \alpha \log \sum_a \exp(Q(s,a)/\alpha) = \mathbb{E}_\pi[Q(s,a) - \alpha \log \pi(a \mid s)]$.
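Both claims, that the soft max approaches the hard max as $\alpha \to 0$ and that $V_{\text{soft}} = \mathbb{E}_\pi[Q - \alpha \log \pi]$ under the Boltzmann policy, can be verified numerically for a discrete action set. A minimal sketch; the Q-values are arbitrary illustrative numbers.

```python
import math

def soft_value(q_values, alpha):
    """alpha * log sum_a exp(Q(a) / alpha), with the max subtracted for stability."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def boltzmann_policy(q_values, alpha):
    """pi(a) = exp((Q(a) - V_soft) / alpha), which is already normalized."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

def expected_q_plus_entropy(q_values, alpha):
    """E_pi[Q(a) - alpha * log pi(a)] under the Boltzmann policy."""
    pi = boltzmann_policy(q_values, alpha)
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(pi, q_values))

q = [1.0, 2.0, 3.0]
for alpha in (1.0, 0.5, 0.01):
    print(f"alpha={alpha}: soft max = {soft_value(q, alpha):.4f} (hard max = {max(q)})")
```

The soft value always sits at or above the hard max; the gap is the entropy bonus the agent earns for keeping several actions viable.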

Substituting back gives SAC's practical Bellman target: $y_t = r_t + \gamma \mathbb{E}_{a' \sim \pi}[\min_j \bar Q_j(s', a') - \alpha \log \pi(a' \mid s')]$. The $\min$ over two Q-networks prevents overestimation; the $-\alpha \log \pi$ term is the entropy bonus.

Derivation: automatic $\alpha$ tuning

The temperature $\alpha$ controls the exploration-exploitation trade-off. Setting it manually is fragile. The automatic tuning formulation solves a constrained optimization: maximize the expected return subject to $\mathcal{H}[\pi(\cdot \mid s_t)] \geq \bar{\mathcal{H}}$ for all $s_t$. By duality, this gives the loss:

α auto-tuning $$ \mathcal{L}(\alpha) = -\alpha \cdot \big(\log \pi(a \mid s) + \bar{\mathcal{H}}\big), \qquad \bar{\mathcal{H}} = -\dim(\mathcal{A})$$

$\alpha$ — the learnable temperature. Treated as an optimization variable with its own loss and learning rate. Updated alongside the actor and critic.
$\log \pi(a \mid s)$ — the log-probability (log-density) of the sampled action under the current policy. Very negative values mean the policy is spread out (high entropy); large values, which can be positive for a continuous density, mean it is concentrated (low entropy).
$\bar{\mathcal{H}} = -\dim(\mathcal{A})$ — the target entropy. For a 7-DoF action space, $\bar{\mathcal{H}} = -7$. This is a heuristic: $-1$ nat per dimension is the differential entropy of a Gaussian with $\sigma \approx 0.09$, a fairly narrow but not deterministic policy. If the policy's entropy drops below this, $\alpha$ increases to encourage more exploration.
$-\alpha \cdot (\log \pi + \bar{\mathcal{H}})$ — the loss pushes $\alpha$ up when $\log \pi + \bar{\mathcal{H}} > 0$ (entropy below target: policy too deterministic) and down when $\log \pi + \bar{\mathcal{H}} < 0$ (policy too stochastic). At equilibrium, $\mathbb{E}[-\log \pi] = \bar{\mathcal{H}}$.

When the policy is too deterministic ($\mathcal{H} < \bar{\mathcal{H}}$, meaning $\log \pi$ is large), the loss drives $\alpha$ up, increasing the entropy bonus. When the policy is too stochastic, $\alpha$ decreases. The target entropy $-\dim(\mathcal{A})$ is heuristic but works: it corresponds to a uniform distribution of width $1/e \approx 0.37$ in each action dimension.

What this means for your system: you never have to hand-tune the exploration rate. Set target_entropy = -action_dim and let $\alpha$ adjust itself. In practice, $\alpha$ starts high (0.2–0.5) when the policy is random at initialization, drops as the policy improves, and stabilizes around 0.01–0.05 in late training. If $\alpha$ stays high throughout training, the reward signal is too weak and the policy cannot find a good exploitation strategy.
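The direction of the temperature update follows from the sign of one gradient. A minimal sketch of a single plain gradient step on $\mathcal{L}(\alpha) = -\alpha(\log \pi + \bar{\mathcal{H}})$; the function name and learning rate are illustrative assumptions, and practical implementations usually optimize $\log \alpha$ instead so that $\alpha$ stays positive.

```python
def alpha_gradient_step(alpha, mean_log_prob, target_entropy, lr=0.01):
    """One gradient-descent step on L(alpha) = -alpha * (log_pi + target_entropy)."""
    grad = -(mean_log_prob + target_entropy)  # dL/dalpha
    return alpha - lr * grad

target_entropy = -7.0  # -dim(A) for a 7-DoF action space

# Entropy -10 is below the target, so E[log pi] = +10 and alpha should rise.
alpha_up = alpha_gradient_step(0.15, mean_log_prob=10.0, target_entropy=target_entropy)
# Entropy -4 is above the target, so E[log pi] = +4 and alpha should fall.
alpha_down = alpha_gradient_step(0.15, mean_log_prob=4.0, target_entropy=target_entropy)
# Entropy exactly at the target: no change.
alpha_same = alpha_gradient_step(0.15, mean_log_prob=7.0, target_entropy=target_entropy)
```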

Twin Q and target networks

Two stabilization tricks, both shared with TD3 (EMA target networks date back to DDPG):

Twin Q-networks with the $\min$ operator combat overestimation bias in the Q-learning target.
Target networks updated as a slow EMA of the online networks ($\tau \approx 0.005$) stabilize the bootstrap target.

Why a single Q-network overestimates

The Bellman target for Q-learning is $y = r + \gamma \max_{a'} Q(s', a')$. The $\max$ is the problem. Suppose the true Q-values for four actions in state $s'$ are all exactly 5.0, but the network's estimates have zero-mean noise: $\hat{Q} = [4.7, 5.3, 4.9, 5.1]$. The true max is 5.0, but $\max(\hat{Q}) = 5.3$. This is Jensen's inequality applied to the max operator: $\mathbb{E}[\max_a \hat{Q}(s',a)] \geq \max_a \mathbb{E}[\hat{Q}(s',a)]$. The max of noisy estimates is a biased-upward estimate of the max of the true values.

This is not a one-time error. The overestimated target is used to update the Q-network, which produces even more overestimated values at the next step, which produces an even more biased target. The feedback loop compounds: $Q$ values drift upward, the policy chases phantom high-Q actions that don't actually lead to high reward, and the whole system diverges. This is the overestimation spiral that killed early Q-learning methods (DDPG was notoriously unstable for this reason).

The twin-Q fix

Train two Q-networks $Q_{\phi_1}$ and $Q_{\phi_2}$ with independent initializations. Both are regressed onto the same target on the same mini-batches; the different initializations are what keep their errors from being perfectly correlated. Use the minimum for the Bellman target:

Twin-Q target $$ y = r + \gamma \left( \min(Q_{\bar\phi_1}(s', a'), Q_{\bar\phi_2}(s', a')) - \alpha \log \pi(a' \mid s') \right), \quad a' \sim \pi(\cdot \mid s') $$

Why does $\min$ help? Each network has independent estimation error. The probability that both networks simultaneously overestimate the same action is much lower than the probability that one does. Taking the minimum is conservative: it slightly underestimates on average (the symmetric counterpart of the overestimation bias). But underestimation is benign — the policy becomes slightly cautious, which is safe. Overestimation is catastrophic — the policy becomes delusional, which is not.

Worked example: twin-Q vs. single-Q overestimation. True Q-values for 3 actions in state $s'$: $[5.0, 5.0, 5.0]$. Network noise: $\sigma = 0.5$. Single Q-network: estimates $\hat{Q}_1 = [4.6, 5.4, 5.1]$. Target uses $\max = 5.4$. Overestimation: $+0.4$. Twin Q-networks: $\hat{Q}_1 = [4.6, 5.4, 5.1]$, $\hat{Q}_2 = [5.3, 4.8, 5.2]$. For the actor's sampled action $a'$ (say $a_2$): $\min(5.4, 4.8) = 4.8$. Underestimation: $-0.2$. The policy learns a slightly conservative value, but does not spiral. Over 100k updates, single-Q drifts $Q$-values to $\sim 50$ (when the true max is 5). Twin-Q stays within $[4.5, 5.5]$. This is the difference between a training run that converges and one that diverges.
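The two biases can be estimated by Monte Carlo. This sketch assumes the same idealized setting as the example (all true Q-values equal, independent zero-mean Gaussian noise on every estimate) and measures only the one-step bias of the target, not the compounding drift over training.

```python
import random

def estimate_biases(true_q=5.0, n_actions=4, sigma=0.5, trials=20000, seed=0):
    """Return (mean bias of max over one critic, mean bias of min over two critics)."""
    rng = random.Random(seed)
    single_sum, twin_sum = 0.0, 0.0
    for _ in range(trials):
        q1 = [true_q + rng.gauss(0.0, sigma) for _ in range(n_actions)]
        q2 = [true_q + rng.gauss(0.0, sigma) for _ in range(n_actions)]
        single_sum += max(q1) - true_q          # Q-learning style max over one noisy critic
        a = rng.randrange(n_actions)            # stand-in for the actor's sampled action
        twin_sum += min(q1[a], q2[a]) - true_q  # min over two critics for that action
    return single_sum / trials, twin_sum / trials

single_bias, twin_bias = estimate_biases()
print(f"single-Q max bias: {single_bias:+.3f}, twin-Q min bias: {twin_bias:+.3f}")
```

The max over one critic is biased upward and the min over two is biased downward by a smaller amount, which is the conservative direction the text argues for.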

Worked example: $\alpha$ auto-tuning dynamics. 7-DOF arm, so target entropy $\bar{\mathcal{H}} = -\dim(\mathcal{A}) = -7$. Current $\alpha = 0.15$. Recall that entropy is $\mathcal{H}(\pi) = -\mathbb{E}[\log \pi]$ and that the loss gradient is $\partial \mathcal{L}/\partial \alpha = -(\log \pi + \bar{\mathcal{H}})$. Scenario 1: policy too deterministic. Current entropy $\mathcal{H}(\pi) \approx -10$ (below the target; the policy concentrates on a narrow action range), so the average log-probability of sampled actions is $\mathbb{E}[\log \pi(a \mid s)] \approx +10$. Inner term: $\log \pi + \bar{\mathcal{H}} = 10 + (-7) = 3$. The $\alpha$ loss: $\mathcal{L}(\alpha) = -0.15 \times 3 = -0.45$. Since $3 > 0$, the gradient is $\partial \mathcal{L}/\partial \alpha = -3$. Minimizing: $\alpha \leftarrow \alpha - \eta \times (-3) = \alpha + 3\eta$. $\alpha$ increases. Higher $\alpha$ means a stronger entropy bonus, pushing the policy to explore more. Correct: the policy was too deterministic. Scenario 2: policy too stochastic. Current entropy $\mathcal{H}(\pi) \approx -4$ (above the target), so $\mathbb{E}[\log \pi] \approx +4$. Inner term: $4 + (-7) = -3 < 0$. Gradient: $-(-3) = +3$. $\alpha \leftarrow \alpha - 3\eta$. $\alpha$ decreases. Less entropy bonus, policy sharpens. Correct. At equilibrium: $\mathbb{E}[\log \pi] = -\bar{\mathcal{H}} = +7$, i.e. $\mathcal{H}(\pi) = \bar{\mathcal{H}} = -7$. The policy holds its entropy at the target of $-1$ nat per action dimension.

Full SAC update step — PyTorch

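import torch
import torch.nn.functional as F

def sac_update(batch, actor, critic1, critic2,
              target_critic1, target_critic2,
              log_alpha, target_entropy,
              actor_opt, critic_opt, alpha_opt,
              gamma=0.99, tau=0.005):
    # One SAC gradient step: critics, actor, temperature, then target networks.
    # batch holds tensors (s, a, r, s_next, done); r and done must broadcast
    # against the critic outputs (e.g. all shaped [B, 1]).
    s, a, r, s_next, done = batch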
    alpha = log_alpha.exp().detach()

    # ---- Critic update ----
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)  # reparameterized
        q1_targ = target_critic1(s_next, a_next)
        q2_targ = target_critic2(s_next, a_next)
        q_targ = torch.min(q1_targ, q2_targ) - alpha * logp_next
        y = r + gamma * (1 - done) * q_targ

    loss_q1 = F.mse_loss(critic1(s, a), y)
    loss_q2 = F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad()
    (loss_q1 + loss_q2).backward()
    critic_opt.step()

    # ---- Actor update ----
    a_new, logp_new = actor.sample(s)
    q1_new = critic1(s, a_new)
    q2_new = critic2(s, a_new)
    q_new = torch.min(q1_new, q2_new)
    loss_actor = (alpha * logp_new - q_new).mean()
    actor_opt.zero_grad()
    loss_actor.backward()
    actor_opt.step()

    # ---- Alpha (temperature) update ----
    loss_alpha = -(log_alpha * (logp_new.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    loss_alpha.backward()
    alpha_opt.step()

    # ---- Soft target update ----
    for tp, p in zip(target_critic1.parameters(), critic1.parameters()):
        tp.data.mul_(1 - tau).add_(p.data, alpha=tau)
    for tp, p in zip(target_critic2.parameters(), critic2.parameters()):
        tp.data.mul_(1 - tau).add_(p.data, alpha=tau)

    return {'q1': loss_q1.item(), 'q2': loss_q2.item(),
            'actor': loss_actor.item(), 'alpha': alpha.item()}

The off-policy family

SAC sits alongside its cousins:

DDPG — deterministic policy gradient. Predecessor; less stable than SAC because the policy isn't stochastic.
TD3 — DDPG with twin Q, delayed actor updates, target policy smoothing. Strong baseline.
REDQ — large Q-ensemble (10 critics), high update-to-data ratio (UTD = 20). Vastly more sample-efficient at the cost of compute.
DroQ — REDQ with dropout and layer normalization in the critics instead of a large ensemble. Comparable sample efficiency with only two critics.

The losses

SAC critic loss $$ \mathcal{L}_Q(\phi_i) = \mathbb{E}\Big[\big( Q_{\phi_i}(s_t, a_t) - y_t \big)^2\Big]$$ $$ y_t = r_t + \gamma \, \mathbb{E}_{a' \sim \pi}\Big[ \min_{j} \bar Q_{\phi_j}(s_{t+1}, a') - \alpha \log \pi(a' \mid s_{t+1}) \Big]$$

$Q_{\phi_i}(s_t, a_t)$ — the $i$-th Q-network's prediction of the value of action $a_t$ in state $s_t$. SAC trains two Q-networks ($i \in \{1, 2\}$) to combat overestimation.
$y_t$ — the Bellman target. What the Q-value "should be" according to the one-step bootstrap: the immediate reward plus the discounted future value.
$\min_j \bar Q_{\phi_j}(s_{t+1}, a')$ — the pessimistic target Q-value. Takes the minimum of the two target Q-networks (slow EMA copies, denoted by the bar) to prevent optimistic extrapolation.
$-\alpha \log \pi(a' \mid s_{t+1})$ — the entropy bonus in the target. Actions with lower probability (more exploratory) get a bonus, baked into the Q-target itself.
$a' \sim \pi$ — next action sampled from the current policy. This is what makes SAC off-policy: the data $(s_t, a_t, r_t, s_{t+1})$ can be old, but $a'$ is always fresh from the latest policy.

SAC actor loss $$ \mathcal{L}_\pi(\theta) = \mathbb{E}\Big[ \alpha \log \pi_\theta(a \mid s) - \min_j Q_{\phi_j}(s, a) \Big]$$

$\alpha \log \pi_\theta(a \mid s)$ — the entropy penalty. Pushes the policy to stay stochastic by penalizing high-probability (low-entropy) actions. Without this term, the actor would collapse to a deterministic greedy policy.
$-\min_j Q_{\phi_j}(s, a)$ — the negated Q-value (negated because we minimize the loss, so maximizing Q-value means minimizing its negation). Pushes the policy toward high-value actions.
$a$ is sampled via the reparameterization trick: $a = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \epsilon)$ with $\epsilon \sim \mathcal{N}(0, I)$. The $\tanh$ squashes actions to $[-1, 1]$; this requires a log-det-Jacobian correction in $\log \pi$.

Worked example: SAC actor update. State $s$, action dimension $d = 7$. The policy $\pi_\theta$ is a squashed Gaussian: sample $\epsilon \sim \mathcal{N}(0, I)$, compute $a_{\text{raw}} = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon$, then $a = \tanh(a_{\text{raw}})$. Suppose $\mu_\theta(s) = [0.3, -0.1, 0.5, 0.0, 0.2, -0.4, 0.1]$, $\sigma_\theta(s) = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]$, $\epsilon = [0.5, -0.3, 0.8, 0.1, -0.6, 0.2, -0.4]$. $a_{\text{raw}} = [0.4, -0.16, 0.66, 0.02, 0.08, -0.36, 0.02]$. $a = \tanh(a_{\text{raw}}) = [0.380, -0.159, 0.578, 0.020, 0.080, -0.345, 0.020]$. $\log \pi$: sum of log-Gaussian density of $a_{\text{raw}}$ minus the tanh correction $\sum_i \log(1 - \tanh^2(a_{\text{raw},i}))$, where the correction is evaluated at the pre-squash value. $\min(Q_1(s, a), Q_2(s, a)) = 3.45$. Current $\alpha = 0.2$. Actor loss = $\alpha \cdot \log \pi(a \mid s) - 3.45$. The gradient pushes the policy toward actions with high Q-values while maintaining entropy (the $\alpha \log \pi$ term penalizes overly deterministic policies).
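The log-probability left symbolic in the example can be evaluated directly. A minimal pure-Python sketch using the example's numbers; the function name is hypothetical, and a real actor would compute the same quantities on batched tensors with gradients attached.

```python
import math

def squashed_gaussian_sample(mu, sigma, eps):
    """Return (action, log_prob) for a = tanh(mu + sigma * eps)."""
    a_raw = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    action = [math.tanh(x) for x in a_raw]
    # Diagonal Gaussian log-density of a_raw; note (a_raw - mu) / sigma == eps.
    log_gauss = sum(-0.5 * e * e - math.log(s) - 0.5 * math.log(2.0 * math.pi)
                    for s, e in zip(sigma, eps))
    # Change of variables for tanh: subtract sum_i log(1 - tanh(a_raw_i)^2).
    log_det = sum(math.log(1.0 - a * a) for a in action)
    return action, log_gauss - log_det

mu = [0.3, -0.1, 0.5, 0.0, 0.2, -0.4, 0.1]
sigma = [0.2] * 7
eps = [0.5, -0.3, 0.8, 0.1, -0.6, 0.2, -0.4]

action, log_prob = squashed_gaussian_sample(mu, sigma, eps)
actor_loss = 0.2 * log_prob - 3.45  # alpha * log pi - min(Q1, Q2)
print([round(a, 3) for a in action], round(log_prob, 3), round(actor_loss, 3))
```

With $\sigma = 0.2$ the density is concentrated enough that the log-probability comes out positive, a reminder that log-densities of continuous policies are not bounded above by zero.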

If you have ten thousand environment steps to spend, SAC is your default. If you have one thousand, you want REDQ or DroQ. If you have a hundred, you want offline pretraining first.

16 Sim-to-real

The reality gap is the central engineering problem of pure-RL robotics.

Training in simulation is fast, free, and produces policies that fail spectacularly when deployed on a real robot — unless you do specific things to close the gap.

The interventions cluster into four categories: domain randomization, system identification, real-to-sim, and online adaptation.

Domain randomization

The dominant technique. At each simulator reset, sample physical and visual parameters from a wide distribution: friction, mass, motor gains, latency, observation noise, lighting, textures, camera pose, gravity. The policy is forced to learn a control law robust across the distribution; the real world is treated as one more sample from it.

Three regimes:

Static randomization. Fixed ranges, sampled once per episode. Simple, works for many tasks.
Adversarial randomization. Sample parameters that the policy currently fails on. Faster to converge, requires more infrastructure.
Automatic Domain Randomization (ADR). Start narrow, widen the range when success rate exceeds a threshold. Introduced in OpenAI's Rubik's cube work. Gives a curriculum for free.
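
The ADR rule ("start narrow, widen when success exceeds a threshold") can be sketched for a single parameter as follows. The `AdrRange` helper, the 0.8 threshold, the step size, and the success rates are all hypothetical choices for illustration; the published method is more elaborate than this.

```python
import random
from dataclasses import dataclass


@dataclass
class AdrRange:
    """One randomized parameter whose range widens as the policy improves."""
    lo: float
    hi: float
    lo_limit: float   # hard bounds the range may never exceed
    hi_limit: float
    step: float       # how much to widen each side per successful update

    def sample(self, rng: random.Random) -> float:
        return rng.uniform(self.lo, self.hi)

    def update(self, success_rate: float, threshold: float = 0.8) -> None:
        # Widen only when the policy is succeeding on the current range
        if success_rate >= threshold:
            self.lo = max(self.lo_limit, self.lo - self.step)
            self.hi = min(self.hi_limit, self.hi + self.step)


friction = AdrRange(lo=0.65, hi=0.75, lo_limit=0.3, hi_limit=1.5, step=0.05)
rng = random.Random(0)
for success_rate in [0.5, 0.85, 0.9, 0.6, 0.95]:  # made-up evaluation results
    friction.update(success_rate)

sampled = friction.sample(rng)
print(round(friction.lo, 2), round(friction.hi, 2), round(sampled, 3))
```

Three of the five evaluations clear the threshold, so the range widens three times. A fuller version would also narrow the range when success drops.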

The randomization that matters

Not all parameters are equal. Empirical priorities, in rough order:

Motor / actuator dynamics. Latency, PD gains, torque limits, deadbands.
Mass and inertia. Especially for objects being manipulated.
Friction. Both ground and contact.
Observation noise and latency. A policy trained on perfect proprioception fails on a real robot with 5ms IMU latency and quantization.
Visuals. For pixel-based policies, lighting and texture randomization are mandatory. Include random camera FOV, exposure, and white balance.

Why domain randomization works: the robustness argument. Let $\xi$ denote the physical parameters (friction, mass, latency, etc.) and let $\pi(a \mid o; \xi)$ be the optimal policy for a specific parameter setting. Domain randomization trains a single policy $\pi_\theta(a \mid o)$ that minimizes the expected loss over the parameter distribution: $$ \min_\theta \mathbb{E}_{\xi \sim p(\xi)}\left[\mathcal{L}(\pi_\theta, \xi)\right] $$ The resulting policy is not optimal for any single $\xi$, but it performs well on average across the distribution. If the real world $\xi_{\text{real}}$ falls within the support of $p(\xi)$, the policy has been trained on conditions like it. Note that this is average-case robustness: the objective is an expectation over $p(\xi)$, whereas distributionally robust optimization would instead optimize the worst case over a set of distributions. The key engineering insight: wider randomization generally helps robustness but hurts peak performance. A policy trained on friction $\mu \in [0.5, 0.6]$ outperforms one trained on $\mu \in [0.1, 1.5]$ when the real friction is $\mu = 0.55$. But the narrow policy fails catastrophically at $\mu = 0.3$, while the wide policy degrades gracefully. Since you rarely know the real parameters precisely, wider is usually the right bet.

System identification: the alternative to DR

Instead of training over a wide distribution, measure the real-world parameters and set the simulator to match. Drop a calibration object on the table, record the bounce, fit the coefficient of restitution. Run a motor sweep, fit the actuator model. System identification is cheaper than DR when it works — one simulation with accurate parameters is better than ten thousand with random ones. But it doesn't cover unknown unknowns: the cable that catches on the table edge, the motor backlash that varies with temperature, the slight camera misalignment. DR covers these because they fall somewhere in the randomized range even if you never modeled them explicitly.

In practice, the two techniques combine. System identification narrows the randomization ranges to a plausible neighborhood. DR widens them enough to absorb the residual uncertainty. The hybrid usually beats either technique alone.

A practical example: for a Franka arm grasping task, you can system-identify the PD gains by running a step-response test (command a joint to move to a target and measure the rise time and overshoot). This pins the nominal $K_p$ to 620 and $K_d$ to 48. Then you randomize $K_p \in [527, 713]$ and $K_d \in [41, 55]$, a $\pm$15% range centered on the measured values, rather than the $\pm$30% range you would need without system identification. Narrower ranges mean the policy can learn a tighter, more precise control law while still being robust to the remaining uncertainty.
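
Deriving the ranges from measured nominals is simple arithmetic. The helper below is a hypothetical convenience function, not part of any simulator API; it prints the exact ±15% and ±30% bounds around the measured gains so they can be rounded as needed.

```python
def centered_range(nominal: float, frac: float) -> tuple[float, float]:
    """Return (low, high) = nominal * (1 - frac), nominal * (1 + frac)."""
    return nominal * (1.0 - frac), nominal * (1.0 + frac)


kp_measured, kd_measured = 620.0, 48.0   # from the step-response test
for name, value in [("kp", kp_measured), ("kd", kd_measured)]:
    lo15, hi15 = centered_range(value, 0.15)
    lo30, hi30 = centered_range(value, 0.30)
    print(f"{name}: +-15% -> [{lo15:.1f}, {hi15:.1f}], "
          f"+-30% -> [{lo30:.1f}, {hi30:.1f}]")
```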

RMA: Rapid Motor Adaptation

RMA (Kumar et al., 2021) is the bridge between system identification and domain randomization. The idea: train two modules in simulation.

Privileged policy. Train a policy $\pi(a \mid o, e)$ where $e$ is a vector of ground-truth environment parameters (friction, mass, terrain slope, motor strength, etc.) that the simulator provides but the real world doesn't. The policy learns to adapt its behavior to these parameters — walk differently on ice vs. gravel, push harder when the object is heavy.
Adaptation module. Train a separate network $\hat{e} = g_\phi(o_{t-H:t})$ that infers $e$ from a short history ($H \approx 50$ timesteps) of proprioceptive observations (joint positions, velocities, torques). This module learns to identify the environment from its effects: if the robot slips, friction is low; if the robot is slow to respond, there is latency.

At deployment, the privileged parameters $e$ are unavailable. The adaptation module fills in: $\pi(a \mid o, g_\phi(o_{t-H:t}))$. Within 50 timesteps (~0.5s at 100Hz), the adaptation module converges on an estimate of the real-world parameters, and the policy adapts accordingly. This is why RMA-trained quadrupeds can walk on grass, gravel, and stairs without retraining — the adaptation module identifies the terrain in real time.
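
The deployment-time wiring $\pi(a \mid o, g_\phi(o_{t-H:t}))$ can be sketched with a fixed-length history buffer. Everything below is a stand-in: `RmaController`, the toy adaptation function, and the toy policy are hypothetical, and padding the buffer by repeating the oldest observation is an assumption of this sketch rather than something the text specifies.

```python
from collections import deque
from typing import Callable, Sequence


class RmaController:
    """Deployment-time wiring: observation history -> e_hat -> policy(o, e_hat)."""

    def __init__(self, policy: Callable, adapt: Callable, horizon: int = 50):
        self.policy = policy          # pi(a | o, e), trained with privileged e
        self.adapt = adapt            # g_phi: observation history -> e_hat
        self.history = deque(maxlen=horizon)

    def act(self, obs: Sequence[float]):
        self.history.append(list(obs))
        # Until the buffer fills, pad by repeating the oldest observation
        pad = [self.history[0]] * (self.history.maxlen - len(self.history))
        e_hat = self.adapt(pad + list(self.history))
        return self.policy(obs, e_hat)


# Toy stand-ins: e_hat is the mean of the first observation channel,
# and the policy scales its command by that estimate.
def toy_adapt(hist):
    return sum(o[0] for o in hist) / len(hist)


def toy_policy(obs, e_hat):
    return [e_hat * x for x in obs]


ctrl = RmaController(toy_policy, toy_adapt, horizon=50)
for t in range(60):
    action = ctrl.act([0.5, 1.0])
print(len(ctrl.history), action)
```

In a real system `adapt` and `policy` would be the trained networks, and the history would hold proprioceptive observations (and typically recent actions) rather than toy vectors.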

The sim-to-real gap checklist

When your sim-trained policy fails on the real robot, the cause is almost always one of these five. Diagnose in this order:

Contact dynamics. Simulators approximate contacts with penalty forces or LCP solvers. Neither models deformation, stiction, or surface micro-texture. Symptom: the policy grasps perfectly in sim, drops objects in real. Fix: increase contact friction randomization; add stochastic slip events; test with softer and harder objects than the sim default.
Observation gap. Sim images look different from real images — different lighting, textures, no reflections, no motion blur. Symptom: a pixel-based policy that works in sim but freezes or oscillates in real. Fix: visual domain randomization (random textures, lighting, camera noise); or use a frozen foundation-model encoder (DINOv2, CLIP) that generalizes across the gap.
Action delay. Real motors have latency (5–25ms between command and execution). Communication buses add more. If the sim policy assumes instant execution, the real robot overshoots. Symptom: jerky, oscillatory motion. Fix: randomize action delay in sim (uniform in [1, 3] control steps); add action history to the observation.
State estimation error. Sim knows exact joint positions. Real robots have encoder quantization, drift, and cable routing that adds compliance. Symptom: gradual drift or occasional large errors. Fix: add Gaussian noise + bias to joint observations in sim; include IMU latency.
Unmodeled dynamics. Cables, hoses, table compliance, air resistance on large objects. Sim ignores them; the real world does not. Symptom: the policy fails on a task that should be easy. Fix: widen DR ranges; add random external forces; consider a real-to-sim calibration step.
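
The action-delay fix from the checklist (randomize the delay in control steps) can be sketched as a small queue placed between the policy and the simulated motors. `ActionDelay` is a hypothetical helper, and applying a default action until the first delayed action arrives is an assumption of this sketch.

```python
import random
from collections import deque


class ActionDelay:
    """Delays actions by a fixed number of control steps, resampled per episode."""

    def __init__(self, min_steps: int, max_steps: int, default_action, seed=None):
        self.min_steps, self.max_steps = min_steps, max_steps
        self.default_action = default_action   # applied until the first action arrives
        self.rng = random.Random(seed)
        self.reset()

    def reset(self) -> int:
        # Sample this episode's delay and pre-fill the queue with the default action
        self.delay = self.rng.randint(self.min_steps, self.max_steps)
        self.queue = deque([self.default_action] * self.delay)
        return self.delay

    def step(self, action):
        # Push the policy's new action, pop the one that reaches the motors now
        self.queue.append(action)
        return self.queue.popleft()


delay = ActionDelay(min_steps=1, max_steps=3, default_action=0.0, seed=0)
d = delay.reset()
applied = [delay.step(float(t)) for t in range(1, 6)]
print(d, applied)
```

Call `reset()` at every environment reset so each episode sees a different delay. To follow the checklist's second suggestion, the most recent commanded actions can also be appended to the observation.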

Domain randomization config — Isaac Lab

# Domain randomization for manipulation — Isaac Lab (NVIDIA)
# Applied at each environment reset

domain_rand = {
    # Physics parameters
    "friction_range":        [0.5, 1.5],      # unitless, nominal ~0.8
    "object_mass_range":     [0.05, 2.0],     # kg
    "object_com_offset":     [-0.02, 0.02],   # m, center-of-mass shift
    "gravity_noise":         [-0.5, 0.5],     # m/s^2, added to 9.81
    "restitution_range":     [0.0, 0.3],      # bounciness

    # Actuator parameters
    "motor_strength_range":  [0.7, 1.3],      # fraction of nominal torque
    "kp_range":              [420, 780],       # PD position gain (nom 600)
    "kd_range":              [35, 65],         # PD velocity gain (nom 50)
    "action_delay_steps":    [0, 3],           # control steps of latency

    # Observation noise
    "joint_pos_noise":       0.005,           # rad, Gaussian std
    "joint_vel_noise":       0.1,             # rad/s
    "proprio_latency_steps": [0, 2],           # steps of observation delay

    # Visual randomization (pixel-based policies)
    "camera_pos_noise":      0.02,            # m, Gaussian std per axis
    "camera_rot_noise":      5.0,             # degrees
    "light_color_temp":      [3000, 6500],    # Kelvin
    "light_intensity_range": [0.5, 2.0],      # multiplier
    "texture_randomize":     True,            # random colors/patterns
    "table_color_hsv": [(0,0,0.2), (360,0.3,0.8)],
}
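
A config like the one above still needs a sampler that turns ranges into one concrete parameter set per reset. The sketch below is one possible convention, not the simulator's own API: two-element numeric lists are treated as uniform ranges (integers are sampled as integers, which suits step counts), and everything else is passed through unchanged. The function name and the small `cfg` subset are illustrative.

```python
import random


def sample_randomization(cfg: dict, rng: random.Random) -> dict:
    """Draw one concrete parameter set from a range-style config.

    Assumed convention: [low, high] lists of numbers are sampled uniformly;
    everything else (noise std values, booleans, tuples) is passed through.
    """
    sample = {}
    for key, spec in cfg.items():
        is_range = (
            isinstance(spec, list) and len(spec) == 2
            and all(isinstance(v, (int, float)) and not isinstance(v, bool)
                    for v in spec)
        )
        if is_range and all(isinstance(v, int) for v in spec):
            sample[key] = rng.randint(spec[0], spec[1])   # e.g. delay in steps
        elif is_range:
            sample[key] = rng.uniform(spec[0], spec[1])
        else:
            sample[key] = spec
    return sample


cfg = {  # a small subset of the config above
    "friction_range": [0.5, 1.5],
    "action_delay_steps": [0, 3],
    "joint_pos_noise": 0.005,
    "texture_randomize": True,
}
params = sample_randomization(cfg, random.Random(0))
print(params)
```

Passing the random generator in explicitly keeps each environment's randomization reproducible from its seed.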

Real-to-sim and digital twins

Build a simulator that matches your specific real environment — calibrated geometry, measured friction, characterized actuators. Useful when one specific deployment matters more than generality. Less helpful when the goal is broad generalization.

Online adaptation

Fine-tune in the real world after sim training. Sometimes via RL (slow, dangerous), sometimes via fast supervised correction signals (preferred). The unifying lesson: sim training gets you to 80%, and the last 20% is real data.

Worked example: domain randomization ranges for a Franka arm. A typical sim-to-real setup for a Franka Panda manipulator randomizes:

Motor parameters. PD gains: ±30% from nominal (K_p = 600 ± 180, K_d = 50 ± 15). Actuation latency: uniform in [5ms, 25ms]. Torque limits: 87 Nm ± 10%.
Physics. Object mass: 0.1–2.0 kg (vs nominal 0.5 kg). Friction: μ = 0.3–1.2 (table), 0.5–1.5 (object). Gravity: 9.81 ± 0.5 m/s².
Observation noise. Joint position: ±0.005 rad. Joint velocity: ±0.1 rad/s. Camera: ±4 pixel shift, ±5° rotation, brightness ±30%.

The real robot is one sample from this distribution. If the ranges are too narrow, the policy fails on the real robot. If too wide, the policy is too conservative. ADR (automatic domain randomization) tunes these ranges automatically during training.

Domain randomization ranges: what, how much, and why

Not all DR parameters are created equal. Some dominate transfer performance; others are noise. The following table consolidates empirical ranges from published sim-to-real work (OpenAI Rubik's cube, Legged Gym, HORA dexterous, IsaacGym manipulation). Each range is chosen to bracket the plausible real-world variation with margin:

Parameter	Randomization range	Nominal value	Why this range
Surface friction ($\mu$)	0.3–1.5	~0.7	Real surfaces range from smooth laminate (~0.3) to high-grip rubber (~1.5). Under-randomizing friction is the #1 cause of sim-trained grasps that slip in the real world.
Object mass	±50% of nominal	Task-dependent	A "500g object" might actually weigh 350g (hollow) or 750g (wet, or different material). Mass affects both inertia during transport and required grip force.
Camera pose (translation)	±3cm per axis	Calibrated position	Camera mounts shift due to vibration, bumps during operation, or imprecise installation. A 3cm error in extrinsics can shift pixel coordinates by 50+ pixels at working distance.
Camera pose (rotation)	±5° per axis	Calibrated orientation	Mounting imprecision and bracket flex. Even 2° of tilt shifts object centroids in the image by 10–20 pixels, enough to throw off a pixel-based policy.
Lighting color temperature	3000–7000 K	~5000K (daylight)	Real environments range from warm tungsten (2700K) to cool fluorescent (6500K) to daylight. Policies trained on one lighting fail under another if the vision encoder is not robust.
Lighting intensity	±40% of nominal	Task-dependent	Overhead lights turn on/off, windows let in variable sunlight, shadows from people passing by. Intensity shifts change image brightness, contrast, and shadow patterns.
Action delay	0–3 control steps	1 step	Communication bus latency (USB: 1–5ms, EtherCAT: <1ms), motor controller delays, OS scheduling jitter. At 50Hz control, 3 steps = 60ms — enough to overshoot by several mm.
Joint damping	±30% of nominal	Motor-specific	Damping varies with temperature (cold motors are stiffer), wear (lubricant degradation), and cable routing (added compliance). Under-damped joints oscillate; over-damped joints are sluggish.
Joint position noise	±0.005 rad (Gaussian σ)	0	Encoder quantization on a 13-bit encoder is ~0.0008 rad, but cable routing, gear backlash, and flex add 3–5× more effective noise.
Center-of-mass offset	±2cm per axis	Geometric center	Real objects have non-uniform density (a mug's handle shifts its CoM). A 2cm CoM error changes the torque required to hold the object stable by ~20%.
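
Several of the figures quoted in the table are quick to sanity-check. The sketch below reproduces the encoder quantization and action-delay numbers, and estimates the pixel shift from a 2° camera tilt under a pinhole model. The focal length of 500 px is an assumed value for illustration, not one given in the table.

```python
import math

# Encoder quantization: one count of an N-bit encoder over a full turn
bits = 13
quant_rad = 2 * math.pi / (2 ** bits)

# Action delay: control steps converted to milliseconds
control_hz, delay_steps = 50, 3
delay_ms = 1000.0 * delay_steps / control_hz

# Pixel shift near the image center from a small camera rotation (pinhole model).
# focal_px = 500 is an assumed focal length in pixels.
focal_px, tilt_deg = 500.0, 2.0
shift_px = focal_px * math.tan(math.radians(tilt_deg))

print(f"{quant_rad:.5f} rad, {delay_ms:.0f} ms, {shift_px:.1f} px")
```

With the assumed focal length, the tilt-induced shift lands inside the 10–20 pixel band quoted in the table; a longer lens would shift it further.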

The ranges above are starting points. The right process: start with these ranges, train a policy, deploy it on the real robot, identify the failure mode, and then widen the range for the parameter that explains the failure. If the gripper slips: widen friction. If the arm overshoots: widen action delay and damping. If the image-based policy freezes: widen visual randomization. This iterative, failure-driven widening is how you converge on the minimal DR set for your specific task.

A common failure mode: randomizing too many parameters simultaneously. If you randomize 15 parameters each with wide ranges, the resulting distribution is so broad that the policy cannot find a control law that works across all samples. The training loss plateaus at a high value and the policy is mediocre everywhere. The fix is progressive widening: start with narrow ranges (or zero randomization), train until convergence, then widen the parameter that the real-world deployment suggests is the bottleneck. This is a manual counterpart to ADR: less elegant than the automatic version but more debuggable and faster for a single task.

Real2Sim2Real: the iterative calibration loop

Pure domain randomization treats the real world as a black box: randomize everything and hope reality falls within the distribution. Real2Sim2Real is a more targeted approach: use real-world failures to improve the simulator, then retrain in the improved simulator. The loop has six steps:

Worked example: Real2Sim2Real for a peg-insertion task. The goal: insert a cylindrical peg (8mm diameter) into a hole (8.5mm diameter) on a PCB fixture. Tolerance: 0.25mm lateral, 2° angular.

Iteration 1: naive sim training. Train PPO in Isaac Gym with default physics parameters and moderate DR. Deploy on the real Franka. Result: 45% success rate. Failure analysis from 55 failure videos: 30 failures are "peg catches on hole edge and jams" (contact dynamics mismatch), 15 are "peg approaches at wrong angle" (camera pose error), 10 are "arm oscillates near contact" (action delay mismatch).

Iteration 2: tune sim to reproduce failures. Collect 100 real-world failure trajectories with full state logging (joint positions, torques, end-effector wrench). Replay these trajectories in simulation with different physics parameters. Use Bayesian optimization to find the sim parameters that best reproduce the real contact forces during jamming:
• Real contact stiffness: ~5000 N/m. Default sim: 2000 N/m. Fix: set sim stiffness to 4500–5500 N/m.
• Real friction during peg-hole contact: ~0.4. Default sim: 0.8. Fix: set friction range to 0.3–0.6.
• Real action delay: ~40ms (2 control steps at 50Hz). Default sim: 0 steps. Fix: randomize 1–3 steps.
Retrain with calibrated sim parameters. Result: 62% success rate. Improvement: +17 points.

Iteration 3: add failure scenarios to training. The remaining 38% failures cluster into two modes: (a) the peg approaches at a slight angle and the chamfer doesn't correct it, (b) the robot hesitates at contact and the PD controller oscillates. Add to training:
• Adversarial initial peg orientations: tilt 0–5° from vertical (vs. 0–2° previously).
• Contact event curriculum: 20% of training episodes start with the peg already touching the hole edge, forcing the policy to learn recovery.
• PD gain randomization widened from ±10% to ±30%.
Retrain. Result: 78% success rate. Improvement: +16 points.
Total improvement: 45% → 62% → 78% across three iterations. Each iteration required ~4 hours of sim training + 2 hours of real-robot data collection + 2 hours of analysis. The remaining 22% failure rate is dominated by edge cases (unusual peg orientations, fixture wear) that would require either more DR breadth or a small amount of real-world RL fine-tuning to close.

The Real2Sim2Real loop converges when the sim failures match the real failures in distribution: if the policy fails at the same rate and in the same ways in sim and real, the simulator is well-calibrated and further sim training will transfer. The practical signal: track the failure mode distribution (not just the success rate) across sim and real. When the pie charts match, you are done calibrating.
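
"When the pie charts match" can be made quantitative with a distance between the two failure-mode distributions. The sketch below uses total variation distance, which is one reasonable choice among several. The real counts are the iteration-1 failure analysis from the worked example; the sim counts and the failure-mode labels are made up for illustration.

```python
def failure_distribution(counts: dict) -> dict:
    """Normalize raw failure counts into a probability distribution."""
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()}


def total_variation(p: dict, q: dict) -> float:
    """Total variation distance: 0 = identical, 1 = disjoint failure modes."""
    modes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in modes)


# Real counts come from the worked example; sim counts are hypothetical.
real = {"jam_on_edge": 30, "wrong_angle": 15, "oscillation": 10}
sim = {"jam_on_edge": 5, "wrong_angle": 20, "oscillation": 2, "missed_hole": 13}

tv = total_variation(failure_distribution(real), failure_distribution(sim))
print(f"TV distance = {tv:.2f}")
```

A large distance, as in this made-up case, says the simulator fails in different ways than the real robot and calibration should continue. Where to set the "close enough" threshold is a judgment call that depends on how many failures were logged.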

Domain randomization config — YAML format

# dr_config.yaml — Domain randomization for peg insertion
# Each parameter: [low, high] for uniform sampling per episode

physics:
  friction_range: [0.3, 0.6]        # calibrated from real contact data
  contact_stiffness: [4500, 5500]  # N/m, from force-replay matching
  object_mass_range: [0.008, 0.015] # kg, peg mass variation
  restitution: [0.0, 0.1]          # metal-on-metal, low bounce

actuator:
  kp_range: [420, 780]              # PD position gain (nominal 600)
  kd_range: [35, 65]                # PD velocity gain (nominal 50)
  action_delay_steps: [1, 3]        # from real latency measurement
  torque_limit_frac: [0.85, 1.0]   # fraction of nominal max torque

observation:
  joint_pos_noise_std: 0.003       # rad, Gaussian
  joint_vel_noise_std: 0.08        # rad/s
  ee_force_noise_std: 0.5          # N, wrist F/T sensor noise

visual:
  camera_pos_noise_std: 0.03       # m per axis, from mount calibration
  camera_rot_noise_std: 5.0        # degrees per axis
  light_color_temp: [3000, 7000]   # Kelvin
  light_intensity_frac: [0.6, 1.4] # multiplier on nominal
  texture_randomize: True          # random table/object colors

curriculum:
  initial_tilt_range: [0, 5]       # degrees, peg approach angle
  contact_start_frac: 0.2         # 20% of episodes start at contact

When to stop iterating the Real2Sim2Real loop. The loop has diminishing returns after 3–4 iterations. The first iteration (naive sim → real deployment) sets the baseline and reveals the largest gaps. The first correction (calibrated physics) typically yields a 15–20 point improvement; it gave +17 in the peg-insertion example above. The next (failure-scenario augmentation) typically yields 10–15 points, with the example doing slightly better at +16. A further round yields 5–10 points. After that, the remaining failures are either irreducible stochasticity (the peg slips in a way no sim can model) or require real-world RL to close. The practical stopping criterion: when the failure mode distribution in sim matches the failure mode distribution in real (same failure types in similar proportions), your simulator is well-calibrated and further sim-only improvements will transfer. If the failure modes diverge (sim fails on X, real fails on Y), you have an unmodeled gap that calibration cannot fix. In that case, switch to residual RL or HIL-SERL for the final push.

17 Pixel-based RL

Learning from images is harder than learning from state, in ways that are now well-understood.

State-based RL — where the agent observes a low-dimensional state vector — has been a solved problem in many simulated benchmarks since 2018. Pixel-based RL — observing only RGB frames — was a much harder problem until DrQ (Kostrikov et al., 2020) demonstrated that aggressive data augmentation closes most of the gap.

The DrQ family

DrQ

Augment image observations with random shifts ($\pm 4$ pixels), then run SAC. Average $K$ Q-values per state-action pair, computed on $K$ different augmentations. Shockingly, this single change closed the gap to state-based RL on DeepMind Control benchmarks.

DrQ-v2

Replaces SAC with DDPG (deterministic policy + exploration noise schedule), drops the ensembling. Faster, simpler, better on DeepMind Control. The standard pixel-RL baseline since 2021. Key ingredients: n-step returns (3-step), large replay buffer (1M), exploration noise schedule linearly decayed from 1.0 to 0.1 over 500k steps.

DrM

Adds a dormant-neuron reset and a layer-norm tweak. Marginal gains but the right diagnostic frame for why pixel-RL is unstable: large fractions of the network become inactive during training and stop contributing.
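
Two of the DrQ-v2 ingredients listed above, the linearly decayed exploration noise and the n-step return, are easy to write down. The sketch below is a plain-Python illustration; the function names are hypothetical, and the discount $\gamma = 0.99$ is an assumed value that the text does not state.

```python
def linear_schedule(step: int, start: float, end: float, duration: int) -> float:
    """Linearly interpolate from start to end over `duration` steps, then hold."""
    frac = min(max(step / duration, 0.0), 1.0)
    return start + frac * (end - start)


def n_step_target(rewards, bootstrap_value: float, gamma: float = 0.99) -> float:
    """n-step TD target: sum_k gamma^k * r_{t+k} + gamma^n * Q_target(s_{t+n}, a')."""
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_value


# Exploration noise std decayed from 1.0 to 0.1 over 500k steps
for step in (0, 250_000, 500_000, 1_000_000):
    print(step, round(linear_schedule(step, 1.0, 0.1, 500_000), 3))

# 3-step return with made-up rewards and a made-up bootstrap Q-value
print(round(n_step_target([1.0, 0.0, 2.0], bootstrap_value=10.0), 4))
```

The 3-step target propagates reward information three transitions per update instead of one, which speeds up early learning at the cost of some off-policy bias.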

The augmentation insight

Augmentations work in pixel-RL for the same reason they work in supervised vision: they enforce a useful invariance and act as a regularizer. But the deeper reason is that RL targets are noisy; without augmentation, the network overfits to whatever artifacts the noise produces. Random shifts push the encoder toward invariance to small translations and starve the network of the specific pixel-coordinate features it would otherwise memorize.

Why random crop works for RL specifically (the DrQ insight)

In supervised learning, augmentation creates more training data. In RL, the mechanism is different and more fundamental. The Q-function is a regression target, and that target is already noisy (it is a one-step bootstrap estimate, not a ground-truth label). Without augmentation, the Q-network overfits to the noise: it memorizes "when the red block is at pixel (43, 67), the Q-value is 3.7." This is a spurious correlation — the Q-value should be the same whether the block is at pixel (43, 67) or pixel (47, 63).

Random crops enforce exactly this invariance. Two augmented views of the same observation produce slightly different pixel features but must map to the same Q-value. This acts as a consistency regularizer on the encoder: features that are stable under shifts (object positions relative to each other, shapes, colors) survive training, while features sensitive to absolute pixel position (memorized coordinates) do not.

Representation learning for pixel RL

Random crop is the minimum viable augmentation. A family of methods goes further, learning visual representations explicitly useful for control:

CURL (Srinivas et al., 2020) — contrastive learning. Two augmented views of the same frame should be close in embedding space; views from different frames should be far apart. Uses a momentum encoder and InfoNCE loss, exactly like MoCo.
SPR (Schwarzer et al., 2021) — self-predictive representations. The encoder must predict its own future representations: $f(o_{t+k}) \approx g(f(o_t), a_t, \ldots, a_{t+k-1})$. Forces the encoder to capture dynamics-relevant features.
ATC (Stooke et al., 2021) — augmented temporal contrast. Combines contrastive learning with a temporal component: the encoder must match representations of temporally adjacent frames under different augmentations.

All three share the same principle: give the encoder an auxiliary objective that forces it to learn features useful for predicting the future, rather than features useful for memorizing the past. On standard benchmarks (DMC-100k, DMC-500k), these methods provide diminishing returns over plain DrQ-v2 when the augmentation is strong. Their value shows on harder tasks: sparse rewards, visual complexity, long horizons.

When pixel RL beats state RL

Counterintuitive — why would learning from images ever be better than learning from ground-truth state? Three cases:

Poor state estimation. If the "state" is noisy or incomplete (no object tracking, no contact sensing), pixels carry strictly more information. A camera sees the object; the state vector might not include its position.
Visual features matter for the task. Sorting by color, quality inspection (detect defects), food handling (ripe vs. unripe). The state vector would need a classifier; the pixel policy learns one implicitly.
Sim-to-real with visual DR. Pixel policies trained with visual domain randomization generalize to new environments. The image encoder serves as a de facto state estimator that works across scenes.

The CURL objective, unpacked

CURL (Contrastive Unsupervised Representations for Reinforcement Learning) applies the InfoNCE contrastive loss directly to the RL encoder. The idea: two different random augmentations of the same observation should produce similar representations, while augmentations of different observations should produce dissimilar ones. This is the same principle behind SimCLR and MoCo in computer vision, adapted to the RL setting where the "dataset" is a replay buffer of transitions.

CURL contrastive loss (InfoNCE) $$ \mathcal{L}_{\text{CURL}} = -\log \frac{\exp(q^\top k_+ / \tau)}{\exp(q^\top k_+ / \tau) + \sum_{j=1}^{K} \exp(q^\top k_j^- / \tau)} $$

$q = f_\theta(\text{aug}_1(o_t))$ — the query: the encoder applied to one random augmentation of observation $o_t$. This is the representation we want to be useful for control.
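$k_+ = f_{\bar\theta}(\text{aug}_2(o_t))$ — the positive key: the momentum encoder applied to a different augmentation of the same observation $o_t$. The momentum encoder $f_{\bar\theta}$ is an exponential moving average of $f_\theta$, updated as $\bar\theta \leftarrow \alpha \bar\theta + (1 - \alpha)\theta$ with momentum $\alpha = 0.99$ (so 1% of the online weights is mixed in per update), as in MoCo. This $\alpha$ is unrelated to the SAC entropy temperature.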
$k_j^-$ — negative keys: representations of augmented views from other observations in the same minibatch. Typically $K = 127$ negatives (batch size 128, one positive per query).
$\tau$ — the temperature. Controls the sharpness of the softmax. $\tau = 0.1$ is standard. Lower temperature makes the loss more sensitive to hard negatives.

The loss forces the encoder to produce representations that are invariant to augmentation (the two views of the same frame map to the same point) and discriminative across time (different frames map to different points). This is exactly the invariance that matters for control: the encoder should not change its output when the camera jiggles by 4 pixels, but should change dramatically when the object moves.
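
The InfoNCE loss above can be computed directly for a small batch. The sketch below follows the dot-product-with-temperature form of the equation, uses the other keys in the batch as negatives, and runs on made-up 2-D unit-norm embeddings; it is an illustration of the formula rather than a reproduction of any reference implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries, keys, tau: float = 0.1) -> float:
    """Mean InfoNCE loss; keys[i] is the positive for queries[i],
    and the other keys in the batch serve as its negatives."""
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax of the positive logit
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (made up): each key nearly matches its own query
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
keys_aligned = [[0.98, 0.2], [0.2, 0.98], [-0.98, -0.2]]
keys_shuffled = keys_aligned[1:] + keys_aligned[:1]

loss_aligned = info_nce(queries, keys_aligned)
loss_shuffled = info_nce(queries, keys_shuffled)
print(round(loss_aligned, 4), round(loss_shuffled, 4))
```

The loss is small when each query is closest to its own key and large when the pairing is scrambled, which is exactly the pressure that makes two augmentations of the same frame land near each other.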

In practice, CURL adds ~10% training overhead (one extra forward pass through the momentum encoder per batch) and provides the largest gains in the low-data regime (DMC-100k: +15% over DrQ on hard tasks like Walker-Walk). At DMC-500k and beyond, the augmentation-only baseline (DrQ-v2) catches up, because with enough data the encoder learns useful features from the RL objective alone.

The CURL momentum encoder matters for stability. Without it (using the same, rapidly changing encoder for both query and key), the keys shift with every gradient step, so the contrastive targets are inconsistent from one update to the next. The momentum encoder, updated as $\bar\theta \leftarrow \alpha \bar\theta + (1 - \alpha) \theta$ with $\alpha = 0.99$, provides a slowly-changing reference. Note that mapping everything to a constant vector does not minimize the InfoNCE loss, because the negatives would then score exactly as high as the positive; outright collapse is the failure mode of negative-free methods such as BYOL, which rely on the same momentum trick to avoid it. This is the same stabilization used in MoCo and BYOL in self-supervised learning.
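
The momentum update itself is one line. The sketch below applies it to plain lists standing in for the encoder parameters (a hypothetical simplification; a real implementation would loop over the parameter tensors of both networks).

```python
def ema_update(target, online, alpha: float = 0.99):
    """New momentum-encoder weights: alpha * target + (1 - alpha) * online."""
    return [alpha * t + (1.0 - alpha) * o for t, o in zip(target, online)]


online = [1.0, -2.0, 0.5]     # stand-in for the query encoder's parameters
target = [0.0, 0.0, 0.0]      # stand-in for the momentum encoder's parameters

for _ in range(100):          # 100 updates with the online weights held fixed
    target = ema_update(target, online)

# After n updates the target has closed a fraction 1 - alpha**n of the gap
print([round(t, 3) for t in target], round(1 - 0.99 ** 100, 3))
```

With $\alpha = 0.99$ the key encoder needs on the order of a hundred updates to move most of the way toward the query encoder, which is what makes it a slowly-changing reference.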

When pixel RL beats state RL — three concrete scenarios

The three cases listed earlier, made concrete. In each, pixels carry more information than a hand-designed state vector:

Scenario 1: Poor state estimation in manipulation. You are building a cloth-folding policy. The "state" vector contains joint positions and a single point-cloud centroid of the cloth. But the cloth's shape is a 50,000-dimensional mesh, and the centroid captures almost none of it. A single RGB image captures the cloth's full visible geometry: folds, wrinkles, edges, overlap regions. Pixel RL on an 84×84 RGB image (7,056 pixels, 21,168 values) carries more task-relevant information than the low-dimensional state vector. Empirical result: pixel RL achieves 72% fold success vs. 35% for state RL on the same task (DeformableRavens benchmark).
Scenario 2: Visual features are the task itself. You are sorting ripe from unripe tomatoes on a conveyor belt. Ripeness is determined by color, texture, and surface blemishes — information that exists only in pixels. The state vector (object position, velocity) tells you where the tomato is, not what it looks like. You would need a separate classification pipeline feeding into the state, at which point the pixel policy is simpler: it learns perception and control jointly from a single reward signal ("pick only red tomatoes"). This is common in food handling, quality inspection, and any task where the decision depends on visual appearance.
Scenario 3: Sim-to-real with visual domain randomization. You train a picking policy in simulation with aggressive visual DR (random textures, lighting, camera perturbation). Deploy on a real robot in a warehouse with lighting conditions and backgrounds never seen in sim. The pixel-based encoder, forced to be invariant to visual distractors by DR, acts as a robust state estimator that works across scenes. A state-based policy trained in sim can also transfer well, but only if the state estimation pipeline works in the real environment, and state estimation (object detection + pose estimation) is its own fragile pipeline that often fails on novel objects and scenes. The pixel policy sidesteps this entirely: the encoder is the state estimator, and DR has made it robust.

Encoder architecture for pixel RL

The standard encoder is surprisingly small: 4 convolutional layers, channels [32, 64, 128, 256], kernel size 3×3 with stride 2 on every layer. Input: 84×84×9 (3 stacked RGB frames). Output: a 50-dimensional feature vector after a linear projection. Total: ~0.6M parameters. This is much smaller than a ViT-B (86M) or even a ResNet-18 (11M). Why?

RL does millions of gradient updates. Each update uses a tiny batch (256 transitions) compared to supervised learning (thousands of images). A large encoder would overfit catastrophically — it would memorize the replay buffer within a few thousand steps. The small CNN has just enough capacity to extract spatial features (object positions, gripper state) but not enough to memorize textures. This is also why dropout and layer norm matter more in pixel RL than in supervised vision: they are the regularization that prevents the encoder from collapsing.

The architecture in detail, layer by layer:

Layer	Type	Channels	Kernel	Stride	Output shape	Parameters
Input	—	9	—	—	84 × 84 × 9	0
Conv1	Conv2d + ReLU	32	3 × 3	2	41 × 41 × 32	2,624
Conv2	Conv2d + ReLU	64	3 × 3	2	20 × 20 × 64	18,496
Conv3	Conv2d + ReLU	128	3 × 3	2	9 × 9 × 128	73,856
Conv4	Conv2d + ReLU	256	3 × 3	2	4 × 4 × 256	295,168
Flatten	—	—	—	—	4,096	0
LayerNorm	LayerNorm	—	—	—	4,096	8,192
Linear	Linear + Tanh	—	—	—	50	204,850
Total						~603K

This is roughly 140× smaller than a ViT-B (86M parameters). The disparity is not a flaw; it is a design constraint imposed by the RL training regime. RL performs 1M+ gradient updates on a replay buffer of ~100K transitions. A ViT-B would memorize the entire replay buffer in under 10K updates and produce Q-values that are perfect on stored transitions but meaningless on new ones. The small CNN is the right inductive bias: local spatial features (edges, objects, gripper position) generalize; global attention patterns do not, at this data scale.
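
The output shapes and parameter counts in the table can be reproduced with the standard convolution arithmetic. The sketch below assumes no padding and floor division for the output size, which is what the table's shapes imply; the helper names are illustrative.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 0) -> int:
    """Spatial output size of a convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights plus biases of a 2D convolution layer."""
    return c_in * c_out * kernel * kernel + c_out


size, c_in, total = 84, 9, 0
for c_out in (32, 64, 128, 256):
    size = conv_out(size)
    n = conv_params(c_in, c_out)
    total += n
    print(f"conv -> {size} x {size} x {c_out}, {n:,} params")
    c_in = c_out

flat = size * size * c_in
layer_norm = 2 * flat            # scale and shift per feature
linear = flat * 50 + 50          # weights plus biases
total += layer_norm + linear
print(f"flatten {flat}, total {total:,} params")
```

The same two helpers make it easy to see how the count changes with a different channel schedule or stride pattern, since the final linear layer's size depends on the flattened feature map.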

DrQ-v2 random shift augmentation

import torch
import torch.nn.functional as F

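def random_shift(imgs, pad=4):
    """Randomly shift each image by up to +-pad pixels (pad, then crop)."""
    # imgs: float tensor (B, C, H, W), e.g. (256, 9, 84, 84)
    B, C, H, W = imgs.shape
    # Replicate-pad the borders (as DrQ / DrQ-v2 do) so shifted-in pixels
    # repeat the image edge instead of introducing black bars.
    imgs = F.pad(imgs, (pad,) * 4, mode='replicate')
    # Independent top-left crop offset per image, each in [0, 2*pad]
    h0 = torch.randint(0, 2 * pad + 1, (B,)).tolist()
    w0 = torch.randint(0, 2 * pad + 1, (B,)).tolist()
    out = torch.stack([
        imgs[i, :, h0[i]:h0[i] + H, w0[i]:w0[i] + W]
        for i in range(B)
    ])
    return out  # (B, C, H, W), each image shifted by up to +-pad px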

[Interactive figure: random shift augmentation. Compares the original 84×84 crop with the shifted crop and marks the object position; shift shown: 4 px.]

Why pixel-RL is hard, structurally

Sample efficiency. The network has to learn perception, value estimation, and control jointly from a single scalar reward. Any of these tasks alone is hard.
Representation collapse. The encoder can converge to features that are temporally smooth but task-irrelevant. The Q-network reports low loss (it predicts the correct bootstrapped target) but the encoder has learned to map all observations to nearly the same point in feature space. The policy then takes the same action everywhere. This is distinct from Q-value overestimation — it is an encoder failure, not a value failure.
Exploration. Random actions in a high-dimensional control space rarely produce useful images; you need either a curiosity bonus, a strong prior, or both.
Non-stationarity. In supervised learning, the dataset is fixed. In RL, the data distribution changes as the policy improves. The encoder must track a moving target: features that were useful for the initial random policy may be useless for the semi-competent policy at 100K steps. This is why replay ratios and target-network update rates matter more in pixel RL than in state RL — they control how fast the representation is allowed to drift.

The representation collapse diagnostic

How do you detect encoder collapse before it ruins a training run? Two cheap signals:

Feature norm variance. Compute the standard deviation of the L2 norm of the encoder output across a batch of 256 observations. In a healthy encoder, this standard deviation is at least 10% of the mean norm. If it drops below 2%, the encoder is collapsing: all observations map to nearly the same feature vector.
Dormant neuron ratio. Count the fraction of ReLU neurons in the encoder that have zero output for > 95% of a batch. If > 30% of neurons are dormant, the encoder is effectively lower-capacity than you designed. This is the DrM diagnostic. The fix: periodically reset dormant neurons (re-initialize their weights) and add layer normalization after each conv layer.

Log both metrics every 10K gradient steps. A healthy training run shows stable feature norm variance (within 50% of its initial value) and dormant neuron ratio below 20% throughout training. If either metric degrades sharply, the encoder is collapsing and training should be restarted with stronger regularization (more aggressive augmentation, layer normalization, or a lower learning rate for the encoder).

These diagnostics cost less than 1% of training time (a single forward pass on a held-out batch every 10K steps) and can save hours of wasted training by catching collapse early.
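Both diagnostics reduce to a few lines of arithmetic on one batch of encoder outputs. The sketch below uses plain Python lists so it runs anywhere; in a training loop the same computation would be done on tensors. The function names and the toy batches are illustrative assumptions, not part of DrQ-v2 or DrM.

```python
import math


def feature_norm_ratio(features):
    """Standard deviation of per-sample L2 norms, divided by their mean."""
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    mean = sum(norms) / len(norms)
    var = sum((n - mean) ** 2 for n in norms) / len(norms)
    return math.sqrt(var) / mean


def dormant_ratio(activations, zero_frac=0.95):
    """Fraction of neurons whose post-ReLU output is zero on more than zero_frac of the batch."""
    batch = len(activations)
    n_neurons = len(activations[0])
    dormant = 0
    for j in range(n_neurons):
        zeros = sum(1 for row in activations if row[j] == 0.0)
        if zeros / batch > zero_frac:
            dormant += 1
    return dormant / n_neurons


# Toy batches (4 observations, 3 features each); real batches would hold 256 rows.
healthy = [[1.0, 0.0, 2.0], [0.5, 0.0, 0.1], [3.0, 0.0, 1.0], [0.2, 0.0, 0.4]]
collapsed = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.01], [1.0, 0.0, 0.99], [1.0, 0.0, 1.0]]

print("healthy norm ratio:  ", feature_norm_ratio(healthy))    # well above 0.10
print("collapsed norm ratio:", feature_norm_ratio(collapsed))  # below 0.02
print("dormant ratio:       ", dormant_ratio(healthy))         # middle neuron never fires
```

Logging the two returned numbers alongside the loss makes the thresholds from the text (10% and 2% for the norm ratio, 20 to 30% for dormant neurons) directly comparable across runs.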

The augmentation-regularization tradeoff

Stronger augmentation improves generalization but hurts sample efficiency. Random shifts of $\pm 4$ pixels are the sweet spot for most DMC tasks: enough to prevent memorization, not so much that the observation becomes ambiguous (a $\pm 12$ shift can move a small object entirely out of frame in an 84×84 image). For manipulation tasks with larger images (224×224), the shift should be proportionally larger ($\pm 10$–$20$ pixels) to maintain the same effective invariance. Color jitter and random erasing help on tasks with visual complexity (multiple objects, textured backgrounds) but add no benefit on simple tasks (single object, solid background). The diagnostic: if adding augmentation does not improve eval performance, the task is not visually complex enough to benefit.

Beyond random shifts, three augmentation strategies have been validated for pixel RL:

Random convolution. Apply a small random convolutional filter (3×3, random weights) to the observation. This perturbs textures without changing spatial structure. Useful for sim-to-real: the random conv teaches the encoder to ignore texture details that differ between sim and real.
Color jitter. Random brightness, contrast, saturation, and hue shifts. Standard in supervised vision, but must be applied carefully in RL: extreme hue shifts can make a red object look green, confusing color-conditioned tasks. Limit hue to $\pm 10\%$.
Cutout / random erasing. Mask a random rectangular patch (10–30% of the image) with gray. Forces the encoder to use distributed spatial features rather than relying on a single salient region. Particularly useful for manipulation tasks where the encoder might over-attend to the robot arm (large, always present) rather than the small object being manipulated.

The combination matters. DrQ-v2 uses only random shifts. Adding color jitter and cutout on top of random shifts typically adds 5–10% return on visually complex tasks (e.g., robot manipulation with cluttered backgrounds) but adds nothing on simple tasks (e.g., single-object locomotion in DMC). The diagnostic: train with shifts only, then train with all augmentations, and compare performance at 100K environment steps. If the gap is less than 5%, the extra augmentations are not worth the added hyperparameter complexity.

Why random crop outperforms all other augmentations in pixel RL. Consider what each augmentation destroys: color jitter destroys color information, cutout destroys local spatial information, random crop destroys absolute position information. In RL, the Q-function's most common failure mode is memorizing absolute pixel coordinates: "when the red block is at (43, 67), the return is 3.7." Random crop is the only augmentation that directly attacks this failure mode, because it shifts the entire image by up to $\pm 4$ pixels, making absolute coordinates unreliable. Color jitter and cutout attack different failure modes (color memorization, local patch memorization) that are less common in practice. This is why random crop is the single augmentation that works across all pixel-RL benchmarks, while the others provide task-dependent gains.

The pretraining shortcut

Replace the encoder with a frozen visual foundation model (CLIP, DINOv2, R3M). The RL problem becomes "learn a policy on a 768-dim feature vector," which is much closer to state-based RL. This is the dominant pattern in 2026 — pure pixel-RL from scratch is rare; pixel-RL on top of a frozen foundation model is common.

The frozen-encoder approach has a second benefit beyond sample efficiency: it decouples visual generalization from policy learning. A DINOv2 encoder trained on 142M images provides features that generalize across lighting, backgrounds, and object instances. The RL policy on top only needs to learn the mapping from features to actions, which is a low-dimensional regression problem. The result: RL policies built on frozen foundation-model encoders transfer across visual domains (sim-to-real, lab-to-kitchen) without any visual domain randomization. The encoder handles visual generalization; the RL handles motor generalization.

Worked example: pixel RL training budget comparison. Task: DMC Walker-Walk (locomotion from pixels). All methods use the same 84×84×9 observation.

DrQ-v2 (learned encoder). 500K environment steps, ~2M gradient updates, 2 hours on 1 GPU. Final return: 920/1000. Encoder: 600K params, trained end-to-end. The encoder learns basic spatial features (limb positions, ground contact) but is fragile to visual perturbations.
CURL (contrastive encoder). 100K environment steps, ~400K gradient updates, 45 minutes. Final return: 880/1000. Reaches about 96% of DrQ-v2's final return with 5× fewer environment steps. The contrastive objective provides a better learning signal in the low-data regime. Beyond 200K steps, DrQ-v2 catches up and surpasses CURL.
Frozen DINOv2 + SAC. 100K environment steps, ~400K gradient updates, 30 minutes. Final return: 940/1000. The frozen encoder provides 768-dim features that are already informative; SAC only trains a 2-layer MLP critic and actor (~200K params). Converges faster, generalizes better, and does not suffer from encoder collapse. The downside: the DINOv2 features are not optimized for the specific task, so there is a ceiling on tasks that require perceiving fine-grained task-specific visual cues.

Worked example: DrQ-v2 augmentation. Input image: 84×84×3 (stacked 3 frames = 84×84×9). Random shift: pad the image by 4 pixels on each side (92×92), then randomly crop back to 84×84. This shifts the image by at most ±4 pixels in any direction. The shift simulates small camera calibration errors and forces the encoder to learn features that are robust to exact pixel positions. At inference, DrQ-v2 takes a single center crop (no randomness). The training augmentation acts as a regularizer that prevents the Q-network from memorizing pixel-specific features. Without it, the Q-network memorizes the exact pixel coordinates of objects in the replay buffer and generalizes poorly to new positions. Computational cost: essentially zero — a pad-and-crop is two lines of PyTorch and adds <0.1ms per batch. This is why random shift is the universal default: it costs nothing and prevents the most common failure mode.

18 World models

Imagine the future, plan inside the imagination, hope the imagination is right.

A world model is a learned dynamics model — a network that predicts $p(s_{t+1} \mid s_t, a_t)$ — plus the apparatus to use it.

The Dreamer family

The recurrent state-space model (RSSM)

The world model factorizes the state into a deterministic component $h_t$ (a GRU's hidden state) and a stochastic component $z_t$ (a categorical or Gaussian latent).

Derivation: the RSSM

Why two components? The deterministic path $h_t$ captures the predictable dynamics — inertia, gravity, trajectory continuation. The stochastic path $z_t$ captures irreducible uncertainty — contact outcomes, object slippage, unobserved state. Together they form a sufficient representation for both prediction and control.

In plain English: the robot dreams about what will happen next. It keeps a "confident prediction" (the deterministic state — like knowing that a thrown ball will keep going up) and an "uncertain guess" (the stochastic state — like not knowing whether the ball will bounce or stick on landing). Together, these two components let the model hallucinate realistic future trajectories, and the policy learns entirely inside this hallucination.

The RSSM defines four distributions:

RSSM (Dreamer) $$\begin{aligned} h_t &= f_\theta(h_{t-1}, z_{t-1}, a_{t-1}) \quad &&\text{deterministic recurrence}\\ z_t &\sim q_\phi(z_t \mid h_t, x_t) \quad &&\text{posterior (encoder)}\\ \hat z_t &\sim p_\theta(\hat z_t \mid h_t) \quad &&\text{prior (predicted)}\\ \hat x_t &\sim p_\theta(\hat x_t \mid h_t, z_t) \quad &&\text{decoder} \end{aligned}$$

$h_t \in \mathbb{R}^{600}$ — the deterministic hidden state, computed by a GRU. Captures the predictable, inertial dynamics (momentum, gravity, trajectory trends). Think of it as "what the model is confident will happen."
$z_t$ — the stochastic latent. In DreamerV3, this is 32 categorical variables each with 32 classes (= 1024-dim one-hot). Captures irreducible uncertainty — contact outcomes, slippage, hidden state not visible in the image.
$f_\theta$ — the recurrence function (a GRU). Takes the previous deterministic state, stochastic state, and action as input. This is the backbone that carries temporal context forward.
$q_\phi(z_t \mid h_t, x_t)$ — the posterior (encoder). Sees the actual observation $x_t$ (camera image) and the deterministic state $h_t$, and outputs a distribution over $z_t$. Used during training only — gives the "correct" stochastic state because it can peek at the real image.
$p_\theta(\hat z_t \mid h_t)$ — the prior (predictor). Must predict $z_t$ from $h_t$ alone, without seeing the image. Used during imagination rollouts at inference. The KL loss trains this to match the posterior.
$x_t$ — the observation (camera image). The decoder $p_\theta(\hat x_t \mid h_t, z_t)$ reconstructs it, ensuring the latent state $(h_t, z_t)$ contains enough information about the scene.
$a_{t-1}$ — the action taken at the previous timestep. The recurrence must know what action caused the current state transition.

The training loss combines three terms: (1) image reconstruction $-\log p_\theta(x_t \mid h_t, z_t)$, (2) reward prediction $-\log p_\theta(r_t \mid h_t, z_t)$, and (3) KL divergence $\mathrm{KL}(q_\phi(z_t \mid h_t, x_t) \| p_\theta(z_t \mid h_t))$ which forces the prior to predict the posterior without seeing the image.

Worked example: RSSM forward pass. At time $t-1$, the model has deterministic state $h_{t-1} \in \mathbb{R}^{600}$, stochastic state $z_{t-1}$ (32 categoricals × 32 classes), and the agent took action $a_{t-1} \in \mathbb{R}^{7}$.

Recurrence: $h_t = \text{GRU}(h_{t-1}, [z_{t-1}; a_{t-1}])$. The GRU takes a concatenated input of $z_{t-1}$ (32×32 = 1024 dim after one-hot) and $a_{t-1}$ (7 dim), total 1031 dim. Output: $h_t \in \mathbb{R}^{600}$.
Prior: $p_\theta(z_t \mid h_t) = \text{MLP}(h_t) \to$ logits for 32 categoricals, each 32 classes. During imagination, we sample from this.
Posterior: $q_\phi(z_t \mid h_t, x_t) = \text{MLP}([h_t; \text{enc}(x_t)]) \to$ same 32×32 categoricals. During training with real observations, we use the posterior (it sees the image).
KL: Between two categorical distributions. For category $j$ with posterior probs $q_j$ and prior probs $p_j$: $\mathrm{KL}_j = \sum_c q_{jc} \log(q_{jc}/p_{jc})$. Summed across 32 categories. DreamerV3 uses KL balancing: $\beta_{\text{prior}} \cdot \mathrm{KL}(\text{sg}[q] \| p) + \beta_{\text{post}} \cdot \mathrm{KL}(q \| \text{sg}[p])$ where $\text{sg}$ = stop gradient. The first term sends gradient only to the prior (pulling it toward the posterior); the second sends gradient only to the posterior (regularizing it toward the prior).
Decoder: $p_\theta(x_t \mid h_t, z_t) = \text{ConvTranspose}([h_t; z_t])$, reconstructing the 64×64 image. The reconstruction loss is MSE in pixel space.
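The per-category KL in the worked example is easy to check numerically. The sketch below computes $\sum_j \mathrm{KL}_j$ for a toy latent with 2 categorical variables of 4 classes each (the 32×32 case is the same loop with more entries). The function names and the probabilities are made up for illustration. The two KL-balancing terms share this numerical value and differ only in which argument is wrapped in a stop-gradient, so plain Python can show the value but not the gradient routing.

```python
import math


def categorical_kl(q, p, eps=1e-8):
    """KL(q || p) for one categorical variable given as probability lists."""
    return sum(qc * math.log((qc + eps) / (pc + eps)) for qc, pc in zip(q, p) if qc > 0)


def latent_kl(q_groups, p_groups):
    """Sum of per-variable KLs across all categorical variables of the latent."""
    return sum(categorical_kl(q, p) for q, p in zip(q_groups, p_groups))


# Toy latent: 2 categorical variables with 4 classes each (hypothetical numbers).
posterior = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
prior = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

kl = latent_kl(posterior, prior)
print(round(kl, 4))  # only the first variable contributes; the second matches its prior
```

A KL of zero for a variable means the prior already predicts that part of the posterior without seeing the image, which is exactly what imagination rollouts rely on.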

Derivation: imagination rollouts

Once trained, the world model enables "dreaming." Starting from a latent state $(h_0, z_0)$ inferred from a real observation:

The actor selects action $a_0 = \pi_\psi(h_0, z_0)$.
The deterministic recurrence gives $h_1 = f_\theta(h_0, z_0, a_0)$.
The prior samples $\hat z_1 \sim p_\theta(\hat z_1 \mid h_1)$ — no real image needed.
The reward predictor estimates $\hat r_1$.
Repeat for 15–50 imagined steps.

The actor and critic are trained purely on these imagined trajectories. The actor maximizes the sum of predicted rewards (plus an entropy bonus in DreamerV3); the critic estimates the value function. No real environment interaction is required during this phase. This is the core sample-efficiency mechanism: one real trajectory generates thousands of imagined ones.
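The rollout steps above amount to a short loop. The following is a minimal sketch with scalar stand-ins for the learned networks: `toy_actor`, `toy_recur`, `toy_prior`, and `toy_reward` are hypothetical placeholders, not Dreamer components, and a real implementation would batch thousands of rollouts on tensors.

```python
import random


def imagine(h, z, actor, recur, prior_sample, reward, horizon=15):
    """Roll the model forward without observations; return the imagined rewards."""
    rewards = []
    for _ in range(horizon):
        a = actor(h, z)        # a_t = pi(h_t, z_t)
        h = recur(h, z, a)     # h_{t+1} = f(h_t, z_t, a_t)
        z = prior_sample(h)    # z_{t+1} ~ p(z | h_{t+1}); no image is used
        rewards.append(reward(h, z))
    return rewards


# Hypothetical scalar stand-ins for the learned networks, for illustration only.
rng = random.Random(0)
toy_actor = lambda h, z: -0.5 * h
toy_recur = lambda h, z, a: h + a + 0.1 * z
toy_prior = lambda h: rng.gauss(0.0, 0.1)
toy_reward = lambda h, z: -abs(h)

rewards = imagine(1.0, 0.0, toy_actor, toy_recur, toy_prior, toy_reward, horizon=15)
imagined_return = sum(rewards)
print(len(rewards), imagined_return)
```

The posterior never appears in the loop: once the starting state is fixed, every later $z$ comes from the prior, which is why prior quality bounds how long a rollout stays useful.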

Figure: RSSM training and imagination. Labels: deterministic h, posterior (sees image), KL between prior and posterior, policy training.

DreamerV3 details that matter

Symlog transformations on rewards and values: $\text{symlog}(x) = \text{sign}(x)\log(|x|+1)$. Compresses the dynamic range so the same loss works across tasks with rewards in [-1,1] or [0, 1000].
Two-hot encoding of returns: predict a categorical over a discretized return range and decode with a soft target. Stabilizes value learning.
Categorical latents with straight-through gradients: 32-dim categorical with 32 classes per dim, instead of Gaussian latents. Empirically more stable.
KL balancing: separate scaling for the "make posterior close to prior" and "make prior close to posterior" terms of the KL. Prevents posterior collapse.
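The first two items are small enough to write out. This sketch implements symlog from the formula above, its inverse (here called `symexp`), and a two-hot target over a hypothetical three-bin grid. The bin values are assumptions chosen for readability, not DreamerV3's actual bin layout.

```python
import math


def symlog(x):
    """sign(x) * log(|x| + 1)"""
    return math.copysign(math.log(abs(x) + 1.0), x)


def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)"""
    return math.copysign(math.exp(abs(y)) - 1.0, y)


def two_hot(value, bins):
    """Spread `value` over its two neighbouring bins so the weighted bin sum equals it."""
    value = min(max(value, bins[0]), bins[-1])   # clamp to the covered range
    target = [0.0] * len(bins)
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i + 1]
        if lo <= value <= hi:
            w_hi = (value - lo) / (hi - lo)
            target[i] = 1.0 - w_hi
            target[i + 1] = w_hi
            break
    return target


bins = [-1.0, 0.0, 1.0]                          # hypothetical bin centres
target = two_hot(0.25, bins)
decoded = sum(t * b for t, b in zip(target, bins))
print(symlog(1000.0), symexp(symlog(1000.0)), target, decoded)
```

A reward of 1000 maps to roughly 6.9 under symlog, which is what lets one loss scale cover tasks with very different reward magnitudes.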

DayDreamer

Wu et al., 2022. Dreamer applied to four real robots. The headline result was not the algorithm — it was the framing: an A1 quadruped learned to walk in 1 hour from scratch, on real hardware, with no simulator. Dreamer's sample efficiency made real-world RL plausible.

Where world models help

Sample-efficient real-world RL when the dynamics model is easier to learn than the policy.
Transfer: a world model trained on one task can be reused for a related task.
Long-horizon credit assignment: imagined rollouts can be 50+ steps without reset costs.

Where world models struggle

Contact-rich manipulation, where prediction errors compound fast and the model can't track sliding contacts.
Open-ended environments where reconstruction loss spends capacity on irrelevant background pixels.

Interactive figure: latent imagination rollout. Curves: true trajectory, imagined trajectory (from model), divergence. Controls: imagination horizon (default 15), model error (default 0.03).

TD-MPC / TD-MPC2: planning in latent space

The Dreamer family learns a world model and then trains a policy inside the dream. TD-MPC (Hansen et al., 2022) takes a different bet: learn a latent dynamics model, but don't reconstruct observations and don't train a policy end-to-end. Instead, plan at test time using Model Predictive Path Integral (MPPI) control directly in the learned latent space.

The key insight: pixel reconstruction wastes model capacity on visual details that are irrelevant to control. Shadows, textures, background clutter — none of it affects the optimal action. TD-MPC's model only needs to be accurate in the policy-relevant part of state space: the part that predicts rewards and value.

Architecture

In plain English: compress the camera image into a compact code, then learn to predict how that code changes when you take an action — without ever trying to reconstruct the image. The model learns a "physics simulator" that operates entirely on learned representations, not pixels. At test time: sample 512 candidate action plans, simulate them all in the learned model, and pick the best one. Total planning time: ~10ms.

Five learned components, all operating in latent space:

TD-MPC components $$\begin{aligned} h &= f_\theta(o) \quad &&\text{encoder}\\ h' &= d_\theta(h, a) \quad &&\text{latent dynamics}\\ \hat r &= R_\theta(h, a) \quad &&\text{reward predictor}\\ \hat V &= v_\theta(h) \quad &&\text{value predictor}\\ a &\sim \pi_\theta(h) \quad &&\text{policy prior} \end{aligned}$$

$f_\theta(o)$ — the encoder. Maps a raw observation $o$ (image or proprioceptive state) to a compact latent representation $h$. No decoder — the model never reconstructs the observation.
$d_\theta(h, a)$ — the latent dynamics model. Predicts the next latent state $h'$ given the current latent and action. This is where the "world model" lives — but it operates entirely in learned representation space, not pixel space.
$R_\theta(h, a)$ — the reward predictor. Estimates the immediate reward for being in latent state $h$ and taking action $a$. Trained with a standard regression loss against observed rewards.
$v_\theta(h)$ — the value predictor. Estimates the expected discounted return from latent state $h$. This is the terminal value used at the end of planning rollouts — it bootstraps value beyond the planning horizon.
$\pi_\theta(h)$ — the policy prior. A learned policy that provides the initial action distribution for MPPI sampling. Without it, MPPI would sample random actions, which in high-dimensional continuous spaces is catastrophically inefficient. The policy prior focuses the search around good actions; MPPI refines from there.

The training loss combines three objectives: (1) latent dynamics consistency via a joint-embedding loss (the predicted next latent must match the encoded next observation), (2) reward prediction, and (3) temporal-difference value learning. No reconstruction. No KL. No decoder.

MPPI planning at test time

At each real timestep, TD-MPC runs a planning loop entirely inside the learned model:

Encode the current observation: $h_0 = f_\theta(o_t)$.
Sample $N$ action sequences of length $H$ from the policy prior $\pi_\theta$, with added Gaussian noise.
For each candidate sequence, roll forward through $d_\theta$ to get $h_1, h_2, \ldots, h_H$.
Score each trajectory: sum of predicted rewards $\sum_{k=0}^{H-1} \gamma^k R_\theta(h_k, a_k)$ plus the terminal value $\gamma^H v_\theta(h_H)$.
Weight trajectories by exponentiated returns and compute the weighted mean action.
Execute only the first action. Replan from scratch at the next step.
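The six steps can be sketched as one function. This is a toy version under stated assumptions: the latent is a single scalar, and the `dynamics`, `reward`, `value`, and `prior` callables are hypothetical stand-ins for the learned networks. Actions are drawn from the prior at each imagined latent, with Gaussian noise of scale `noise` added, and weights are exponentiated returns with a temperature `lam`.

```python
import math
import random


def plan(h0, dynamics, reward, value, prior, n=64, horizon=5, gamma=0.99,
         lam=0.5, noise=0.3, seed=0):
    """One planning step: returns the weighted mean of the candidates' first actions."""
    rng = random.Random(seed)
    first_actions, returns = [], []
    for _ in range(n):
        h, total = h0, 0.0
        for k in range(horizon):
            a = prior(h) + rng.gauss(0.0, noise)   # policy prior plus Gaussian noise
            if k == 0:
                first_actions.append(a)
            total += (gamma ** k) * reward(h, a)   # discounted predicted reward
            h = dynamics(h, a)                     # roll forward in latent space
        total += (gamma ** horizon) * value(h)     # bootstrap beyond the horizon
        returns.append(total)
    top = max(returns)                             # subtract the max for numerical stability
    weights = [math.exp((s - top) / lam) for s in returns]
    norm = sum(weights)
    return sum(w * a for w, a in zip(weights, first_actions)) / norm


# Toy problem: drive a scalar latent from 1.0 toward 0.0 (all functions hypothetical).
action = plan(
    h0=1.0,
    dynamics=lambda h, a: h + a,
    reward=lambda h, a: -(h * h) - 0.1 * a * a,
    value=lambda h: -(h * h),
    prior=lambda h: -0.5 * h,
)
print(action)  # negative: the plan pushes the latent toward zero
```

Only `action` is executed; at the next timestep the function is called again from the newly encoded observation.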

Derivation: the MPPI update

Why importance-weighted averaging? MPPI is a sampling-based approximation to the optimal control law. Instead of solving a Bellman equation, it approximates the optimal action as a soft maximum over sampled trajectories, weighted by their returns. The temperature $\lambda$ controls how greedy the weighting is: $\lambda \to 0$ converges to the max-return trajectory; large $\lambda$ averages across all candidates.

In plain English: sample N candidate plans, simulate them all in the learned model, score each plan by its predicted total reward, then take a weighted average where the best plans get exponentially more vote. Execute only the first action of the winning plan, then replan from scratch. This is "planning by committee" where the committee members that scored highest dominate the vote.

Given $N$ sampled action sequences $\{a^{(i)}_{0:H-1}\}_{i=1}^N$, each producing a return estimate $S^{(i)}$:

MPPI update $$w^{(i)} = \frac{\exp\!\big(\frac{1}{\lambda}\, S^{(i)}\big)}{\sum_{j=1}^N \exp\!\big(\frac{1}{\lambda}\, S^{(j)}\big)}, \qquad a^*_0 = \sum_{i=1}^N w^{(i)}\, a^{(i)}_0$$

$S^{(i)} = \sum_{k=0}^{H-1} \gamma^k R_\theta(h^{(i)}_k, a^{(i)}_k) + \gamma^H v_\theta(h^{(i)}_H)$ — the return estimate for trajectory $i$. Accumulated predicted rewards plus bootstrapped terminal value.
$w^{(i)}$ — the importance weight. A softmax over returns with temperature $\lambda$. High-return trajectories get exponentially more weight.
$\lambda$ — the temperature. Controls the sharpness of the weighting. $\lambda = 0.5$ is typical. Lower values make MPPI more greedy (closer to "pick the best trajectory"); higher values average more broadly.
$a^*_0$ — the executed action. The importance-weighted mean of the first actions across all $N$ trajectories. Only this single action is executed; the rest of the planned sequence is discarded and replanning happens at $t+1$.

In code: weights = F.softmax(returns / lam, dim=0), then action = (weights.unsqueeze(-1) * first_actions).sum(0). With N=512 candidates and H=5 planning horizon, the full MPPI loop (sample, rollout, weight, average) runs in ~8ms on GPU — fast enough for 50Hz control. The policy prior seeds the samples so most candidates are reasonable; without it, random sampling in 7D action space wastes 99% of candidates on garbage trajectories.

Worked example: MPPI with 5 trajectories. Suppose we have planning horizon $H = 4$, discount $\gamma = 0.99$, temperature $\lambda = 0.5$, and 5 sampled trajectories with return estimates:

$S^{(1)} = 3.2,\; S^{(2)} = 1.8,\; S^{(3)} = 4.1,\; S^{(4)} = 2.5,\; S^{(5)} = 3.9$

Step 1. Compute unnormalized weights: $\exp(S^{(i)}/\lambda)$ with $\lambda = 0.5$:

$\exp(3.2 / 0.5) = \exp(6.4) = 602$
$\exp(1.8 / 0.5) = \exp(3.6) = 37$
$\exp(4.1 / 0.5) = \exp(8.2) = 3641$
$\exp(2.5 / 0.5) = \exp(5.0) = 148$
$\exp(3.9 / 0.5) = \exp(7.8) = 2441$

Step 2. Normalize: sum $= 602 + 37 + 3641 + 148 + 2441 = 6869$.

$w^{(1)} = 602 / 6869 = 0.088$
$w^{(2)} = 37 / 6869 = 0.005$
$w^{(3)} = 3641 / 6869 = 0.530$
$w^{(4)} = 148 / 6869 = 0.022$
$w^{(5)} = 2441 / 6869 = 0.355$

Trajectory 3 ($S = 4.1$) gets 53% of the weight. Trajectory 2 ($S = 1.8$) gets 0.5%. The exponential weighting is aggressive: mediocre trajectories are effectively ignored.

Step 3. Suppose the first actions were $a^{(1)}_0 = 0.3,\; a^{(2)}_0 = -0.1,\; a^{(3)}_0 = 0.7,\; a^{(4)}_0 = 0.4,\; a^{(5)}_0 = 0.6$ (scalar for simplicity). The weighted mean is:

$a^*_0 = 0.088 \times 0.3 + 0.005 \times (-0.1) + 0.530 \times 0.7 + 0.022 \times 0.4 + 0.355 \times 0.6 = 0.026 - 0.001 + 0.371 + 0.009 + 0.213 = 0.618$

The executed action is pulled toward the best trajectories. With more samples ($N = 512$ is typical in TD-MPC2), the estimate concentrates tightly around the optimum.
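The arithmetic in this worked example, and the effect of the temperature, can be reproduced in a few lines. The sketch below reuses the five returns and first actions from the example. The extra temperatures 0.05 and 5.0 are illustrative choices, not recommended settings.

```python
import math


def mppi_weights(returns, lam):
    """Softmax over returns with temperature lam (max subtracted for stability)."""
    top = max(returns)
    unnorm = [math.exp((s - top) / lam) for s in returns]
    total = sum(unnorm)
    return [u / total for u in unnorm]


returns = [3.2, 1.8, 4.1, 2.5, 3.9]
first_actions = [0.3, -0.1, 0.7, 0.4, 0.6]

weights = mppi_weights(returns, lam=0.5)
action = sum(w * a for w, a in zip(weights, first_actions))

greedy = mppi_weights(returns, lam=0.05)   # nearly all weight on the best trajectory
broad = mppi_weights(returns, lam=5.0)     # close to a plain average

print([round(w, 3) for w in weights], round(action, 3))
print(round(greedy[2], 3), round(broad[2], 3))
```

Subtracting the maximum return before exponentiating leaves the weights unchanged but avoids overflow when returns are large, which matters once returns are in the hundreds.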

TD-MPC2: one model, 80+ tasks

TD-MPC2 (Hansen et al., 2024) is the multi-task scaling version. A single 317M-parameter model trained across 80+ continuous control tasks from DeepMind Control, Meta-World, and MyoSuite. The first world model that is truly multi-task — one set of weights, no task-specific heads, no finetuning.

Key changes from TD-MPC to TD-MPC2:

Task embeddings. A learned task embedding vector $e_\tau$ is concatenated to the latent state everywhere. The dynamics, reward, value, and policy networks all condition on $e_\tau$.
Larger model. Encoder becomes a 5-layer MLP with layer norm. Dynamics model uses a 2-layer MLP with residual connections. 317M parameters total — about 100x larger than TD-MPC.
Normalized latent space. SimNorm (simplex normalization) on the latent state prevents collapse and keeps the representation stable across tasks with wildly different observation scales.
No reconstruction loss ever. The model is trained purely on reward prediction, value prediction, and latent consistency. This is the key architectural commitment: the latent space is shaped entirely by what matters for control.

FOWM: finetuning offline world models

FOWM (Feng et al., 2023) asks a natural question: can you pretrain a world model on diverse offline data and then finetune it for a specific task? The answer is yes, and the key insight is that world model pretraining transfers better than policy pretraining. A world model trained on 10 different manipulation tasks captures general physics (object permanence, gravity, contact dynamics) that is reusable. A policy trained on 10 tasks captures 10 specific behavior patterns that may not compose.

The recipe: pretrain the latent dynamics model and reward predictor on a broad offline dataset (mixed quality, mixed tasks), freeze the dynamics, then finetune only the policy and value heads on target-task data. The pretrained dynamics acts as a learned physics simulator.

Diffusion-based world models

The most recent wave treats world modeling as a video generation problem. Instead of learning compact latent dynamics, these models generate full future frames — and planning happens by conditioning the generation on desired outcomes.

UniSim

Yang et al., 2024. A video diffusion model trained to simulate how the world changes in response to actions. Given a frame and an action description (language or discrete control), UniSim generates the next frames. It is a universal simulator in the sense that the same model handles navigation, manipulation, and human-object interaction. The cost: inference is orders of magnitude slower than latent-space models. The benefit: the "world model" generalizes to novel scenes because it inherits the generalization of large-scale video pretraining.

Genie / Genie 2

Bruce et al., 2024. Learned world models from internet video. Genie learns a latent action space from unlabeled video — it discovers that certain latent codes correspond to "move left" or "jump" without ever seeing labels. Genie 2 scales this to photorealistic 3D environments with consistent geometry, generating long-horizon interactive worlds from a single image prompt. The relevance to robotics: if you can build an action-controllable world model from internet-scale data, you have a free simulator for any environment that appears in video.

Genesis

Xian et al., 2024. A generative, open-source physics engine with differentiable simulation. Unlike the learned models above, Genesis combines classical physics solvers (rigid body, soft body, fluid, cloth) with a differentiable rendering pipeline. The world model is not learned — it is engineered — but it is differentiable end-to-end, so you can backpropagate through the simulation to optimize policies or system parameters. It sits at the intersection of classical simulation and learned world models.

World model comparison

Method	Approach	Planning	Multi-task	Real-robot results
DreamerV3	RSSM latent + pixel reconstruction	Learned policy in imagination	Single-task (per model)	DayDreamer: A1 walking in 1hr
TD-MPC2	Latent dynamics, no reconstruction	MPPI in latent space	80+ tasks, single 317M model	Demonstrated on real WidowX
UniSim	Video diffusion as simulator	Conditioning on desired outcomes	Broad (inherits video pretraining)	Not yet; inference too slow
FOWM	Pretrained latent dynamics, finetune policy	Learned policy + MPC hybrid	Transfer via pretraining	Demonstrated on real xArm

The world model landscape is splitting into two camps: compact latent models (Dreamer, TD-MPC2) that are fast enough for real-time control, and generative models (UniSim, Genie 2) that inherit the generalization of internet-scale pretraining but are too slow for closed-loop control. The open question for 2026–2027 is whether distillation can bridge the gap by training a fast latent model to match the predictions of a slow generative one.

19 Offline RL

Reinforcement learning when you cannot interact.

Offline RL learns a policy from a fixed dataset of transitions $\{(s, a, r, s')\}$, with no further interaction. It is what you do when you have demonstrations and rewards but no robot to run on. The central failure mode is distributional shift in the value target: the Bellman backup queries Q at out-of-distribution actions, and the network's extrapolation there is unreliable.

The three responses

CQL — Conservative Q-Learning

Penalize Q-values for OOD actions. The intuition: standard Q-learning overestimates OOD actions because the network extrapolates optimistically. CQL adds a regularizer that pushes Q down everywhere except on actions seen in the data:

CQL regularizer $$ \mathcal{L}_{\text{CQL}} = \mathbb{E}_{s \sim \mathcal{D}}\Big[\log \sum_a \exp Q(s,a) - \mathbb{E}_{a \sim \pi_\beta}[Q(s,a)] \Big]$$

$\mathbb{E}_{s \sim \mathcal{D}}$ — expectation over states from the offline dataset. CQL only touches states that were actually visited.
$\log \sum_a \exp Q(s,a)$ — a soft-max over all actions (log-sum-exp). Minimizing this pushes Q-values down everywhere in action space. It is largest when the Q-function assigns high values to many actions — penalizing overestimation broadly.
$\mathbb{E}_{a \sim \pi_\beta}[Q(s,a)]$ — the expected Q-value under the behavior policy $\pi_\beta$ (i.e., the policy that collected the data). Subtracting this term pulls Q-values up for actions seen in the dataset. Net effect: Q is suppressed on out-of-distribution (OOD) actions and maintained on in-distribution ones.
$\pi_\beta$ — the behavior policy: whatever policy generated the offline dataset. In practice, estimated by the empirical action distribution in the data.

In code: penalty = Q(s, random_actions).logsumexp(dim=1).mean() - Q(s, dataset_actions).mean(). The random_actions are sampled uniformly or from the current policy to approximate the log-sum-exp integral. Add alpha * penalty to the standard Bellman loss. Typical $\alpha$: 1.0–10.0. If the policy is too timid (refuses to act), lower $\alpha$. If Q-values diverge, raise it.

CQL works in practice, but it is sometimes over-conservative: with a large $\alpha$, Q is suppressed so strongly that the policy barely improves on the behavior data.
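To see the regularizer's behavior on concrete numbers, here is a minimal sketch for a toy problem with three discrete actions, where the log-sum-exp can be computed exactly. Discrete actions are an assumption made for illustration; with continuous actions the sum is approximated with sampled actions as described above. The Q-tables are made up.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def cql_penalty(q_rows, data_actions):
    """Mean over states of logsumexp_a Q(s, a) minus Q(s, a_data)."""
    terms = [logsumexp(q) - q[a] for q, a in zip(q_rows, data_actions)]
    return sum(terms) / len(terms)


# Two states, three actions each; the dataset contains action 0 in state 0
# and action 2 in state 1 (hypothetical numbers).
data_actions = [0, 2]
q_inflated_ood = [[1.0, 5.0, 4.0], [0.5, 6.0, 1.0]]      # high Q on unseen actions
q_conservative = [[1.0, -2.0, -3.0], [-2.5, -3.0, 1.0]]  # unseen actions pushed down

print(cql_penalty(q_inflated_ood, data_actions))   # large penalty
print(cql_penalty(q_conservative, data_actions))   # close to zero
```

The penalty is never negative, because the log-sum-exp is at least as large as any single Q-value, and it approaches zero only when the dataset action dominates every other action.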

IQL — Implicit Q-Learning

Avoid evaluating Q at OOD actions entirely. Use expectile regression on the value function and fit the policy to weighted behavior cloning of dataset actions, weighted by their advantage.

IQL value regression and policy extraction $$ \mathcal{L}_V = \mathbb{E}\big[ L_2^\tau(Q(s,a) - V(s)) \big], \quad L_2^\tau(u) = |\tau - \mathbb{1}(u<0)|\, u^2 $$ $$ \mathcal{L}_\pi = -\mathbb{E}\big[ \exp(\beta(Q(s,a) - V(s))) \log \pi(a \mid s) \big]$$

$V(s)$ — the state value function. IQL's key idea: learn $V(s)$ using expectile regression instead of taking a max over actions. This avoids ever evaluating $Q$ on out-of-distribution actions.
$L_2^\tau(u)$ — the asymmetric (expectile) squared loss. When $u > 0$ (Q exceeds V), the weight is $\tau$; when $u < 0$ (Q below V), the weight is $1 - \tau$. With $\tau = 0.7$, positive residuals are weighted 2.3× more than negative ones — so $V(s)$ is pulled toward the upper quantiles of $Q(s, a)$, approximating the value of the best in-distribution actions.
$\tau$ — the expectile parameter. $\tau = 0.5$ gives the mean (standard MSE). $\tau = 0.7$ gives roughly the 70th percentile of Q-values. Higher $\tau$ extracts more "optimistic" policies from the data. Typical range: 0.7–0.9.
$\mathbb{1}(u < 0)$ — an indicator function that is 1 when $u < 0$ and 0 otherwise. Implements the asymmetric weighting.
$Q(s,a) - V(s)$ — the advantage. Positive advantage means action $a$ is better than the average action in state $s$.
$\exp(\beta(Q - V))$ — the advantage weight for policy extraction. Actions with high advantage get exponentially more weight. $\beta$ controls temperature: higher $\beta$ concentrates weight on the best actions. Typical $\beta = 3$–$10$.
$\log \pi(a \mid s)$ — log-probability under the learned policy. The loss is advantage-weighted behavior cloning: imitate the dataset, but imitate high-advantage actions more.

In code: The expectile loss is diff = q_val - v_pred; weight = torch.where(diff > 0, tau, 1 - tau); loss_v = (weight * diff.pow(2)).mean(). The policy extraction is advantage-weighted BC: adv = q_val - v_pred; weights = torch.exp(beta * adv); loss_pi = -(weights.detach() * log_pi).mean(). This is the modern default for offline RL — no log-sum-exp trick, no sampling random actions for the penalty, just asymmetric MSE and weighted imitation.

Expectile $\tau \approx 0.7$. The policy is extracted by advantage-weighted behavior cloning — never queries Q on OOD actions. Strong, simple, the modern default.

CQL: deriving the conservative penalty from first principles

The core problem: standard Q-learning computes $y = r + \gamma \max_{a'} Q(s', a')$. The $\max$ queries Q at whatever action maximizes it, which is almost certainly an action not in the dataset. The network's Q-value at that out-of-distribution action is pure extrapolation, and neural networks extrapolate badly. The result: Q is overestimated at OOD actions, the policy chases those phantom values, and the overestimates compound through bootstrapping until training diverges.

CQL's fix: add a penalty that pushes Q down everywhere, then add a counter-term that pushes Q up on actions actually in the data. The net effect: Q is conservative (under-estimated) at OOD actions and accurate at in-distribution actions.

CQL penalty (expanded) $$ \mathcal{L}_{\text{CQL}} = \alpha \cdot \mathbb{E}_{s}\Big[\underbrace{\log \sum_a \exp Q(s,a)}_{\text{push all Q down}} - \underbrace{\mathbb{E}_{a \sim \pi_\beta}[Q(s,a)]}_{\text{pull dataset Q up}} \Big] + \frac{1}{2}\,\mathbb{E}_{s,a,s'}\big[(Q(s,a) - \hat{\mathcal{B}}^{\pi} Q(s,a))^2\big]$$

The first term, $\log \sum_a \exp Q(s,a)$, is the log-sum-exp over all actions. Minimizing it pushes Q-values down on average across all actions. Actions with the highest Q get pushed down the most (because $\exp(Q)$ is largest for them). The second term, $\mathbb{E}_{a \sim \pi_\beta}[Q(s,a)]$, is the expected Q under the dataset's behavior policy. Maximizing it (subtracting it from the loss) pulls Q up for actions that actually appear in the data. The balance: OOD actions are suppressed, in-distribution actions are preserved.

The hyperparameter $\alpha$ controls how conservative the resulting Q-function is. Too high: the policy becomes overly cautious and never exploits high-value opportunities. Too low: the conservative regularizer is too weak and Q still overestimates at OOD actions. Typical range: $\alpha \in [1, 10]$, tuned on a validation set.

IQL: expectile regression from scratch

Standard regression minimizes $\mathbb{E}[(y - f(x))^2]$ — this targets the mean of $y$. But for policy extraction, we want the maximum of $Q(s,a)$ over in-distribution actions, not the mean. Taking a literal max would require evaluating Q at actions not in the dataset — the exact thing we want to avoid.

Expectile regression targets an expectile, the squared-loss analogue of a quantile, using an asymmetric squared loss:

Expectile loss $$ L_2^\tau(u) = |\tau - \mathbf{1}(u < 0)| \cdot u^2 $$

When the residual $u = Q(s,a) - V(s)$ is positive (Q exceeds V, meaning this action is better than V predicts), the weight is $\tau$. When $u$ is negative, the weight is $1 - \tau$. With $\tau = 0.9$, positive residuals are weighted 9 times more than negative ones. The function V converges not to the mean of Q but to a high quantile — approximately the 90th percentile. This is close to $\max_a Q(s,a)$ without ever evaluating Q at an OOD action.

Worked example: expectile regression with 5 Q-values. State $s$ has 5 actions in the dataset with Q-values: $Q = [2.0, 4.5, 3.0, 7.0, 5.5]$. Current $V(s) = 4.0$. Expectile $\tau = 0.9$. Residuals: $u = Q - V = [-2.0, +0.5, -1.0, +3.0, +1.5]$. Weights: $u_1 = -2.0 < 0$: weight $= 1 - \tau = 0.1$. Loss $= 0.1 \times 4.0 = 0.4$. $u_2 = +0.5 \geq 0$: weight $= \tau = 0.9$. Loss $= 0.9 \times 0.25 = 0.225$. $u_3 = -1.0 < 0$: weight $= 0.1$. Loss $= 0.1 \times 1.0 = 0.1$. $u_4 = +3.0 \geq 0$: weight $= 0.9$. Loss $= 0.9 \times 9.0 = 8.1$. $u_5 = +1.5 \geq 0$: weight $= 0.9$. Loss $= 0.9 \times 2.25 = 2.025$. Total loss = 10.85. The gradient is dominated by $u_4 = +3.0$ (the best action), which contributes 75% of the loss. The gradient pushes $V(s)$ strongly upward toward $Q = 7.0$. At convergence, $V(s) = 6.0$, the point where the weighted residuals balance: $0.9 \times (7.0 - 6.0) = 0.1 \times (4.0 + 1.5 + 3.0 + 0.5) = 0.9$. That is close to the max (7.0) and well above the mean (4.4). This is the IQL trick: $V$ approximates the value of the best available action without ever querying Q at a new action.

When offline RL exceeds demonstration quality: trajectory stitching

This is the key advantage of value-based offline RL over behavior cloning. BC matches the average quality of your demonstrations. If your demos are 60% optimal, BC produces a 60%-optimal policy. Offline RL with a value function can exceed the best single demo by stitching together the best parts of different trajectories.

Concretely: suppose you have two demonstrations for a navigation task. Demo A takes the optimal path through room 1, then gets lost in room 2 (total return: 5). Demo B fumbles through room 1 but navigates room 2 perfectly (total return: 6). The optimal policy would take A's path through room 1 and B's path through room 2 (total return: 11). BC cannot do this — it copies either A or B. IQL can, because it learns per-state values: $V(s_{\text{room1}})$ reflects the best continuation from that state across all demos, and the advantage-weighted policy extraction picks the best action at each state independently.

The stitching requires coverage: the offline dataset must contain transitions near the stitching point. If demo A never visits a state close to where demo B enters room 2, the value function has no basis for connecting them. This is why dataset diversity matters more than dataset quality for offline RL — diverse mediocre data enables stitching, while a single expert trajectory does not.
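The two-room example can be checked with a tiny tabular backup. This is a sketch under stated assumptions, not IQL itself: it uses a hard max restricted to actions that appear in the data as a stand-in for the expectile, the state and action names are hypothetical, and the per-segment rewards are chosen so that demo A totals 5 and demo B totals 6. Both demos share a "door" state, which is the coverage the paragraph above requires.

```python
# Hypothetical two-demo dataset: (state, action, reward, next_state, done).
# Both demos pass through the same "door" state, which makes stitching possible.
dataset = [
    ("start", "A_room1", 5.0, "door", False),  # demo A: optimal path through room 1
    ("door",  "A_room2", 0.0, "end",  True),   # demo A: gets lost in room 2
    ("start", "B_room1", 0.0, "door", False),  # demo B: fumbles through room 1
    ("door",  "B_room2", 6.0, "end",  True),   # demo B: navigates room 2 perfectly
]


def in_sample_q_iteration(data, gamma=1.0, sweeps=10):
    """Tabular Q backup whose max ranges only over (state, action) pairs in the data."""
    q = {(s, a): 0.0 for s, a, _, _, _ in data}
    for _ in range(sweeps):
        for s, a, r, s_next, done in data:
            next_qs = [v for (s2, _), v in q.items() if s2 == s_next]
            bootstrap = 0.0 if done or not next_qs else max(next_qs)
            q[(s, a)] = r + gamma * bootstrap
    return q


def greedy_in_sample(q, state):
    """Pick the dataset action with the highest Q in the given state."""
    candidates = {a: v for (s, a), v in q.items() if s == state}
    return max(candidates, key=candidates.get)


q = in_sample_q_iteration(dataset)
plan = [greedy_in_sample(q, "start"), greedy_in_sample(q, "door")]
print(plan, q[("start", "A_room1")])
```

The greedy plan takes demo A's first action and demo B's second, and the backed-up value at the start equals the stitched return of 11. Removing the shared "door" state (for example, giving each demo its own intermediate state) leaves nothing to connect, and the plan collapses back to the better single demo.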

AWAC, AWR — Advantage-Weighted regression

The general family: estimate advantage from data, then do BC weighted by $\exp(\beta A)$. AWAC adds an explicit Q-function update; AWR does not, which makes AWR the simplest member of the family.
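A minimal sketch of the weighting step shared by this family, in plain Python. The function names and the clipping value are assumptions for illustration; capping the exponentiated advantage is a common implementation safeguard rather than part of the definition above.

```python
import math


def advantage_weights(advantages, beta=3.0, max_weight=100.0):
    """exp(beta * A) per sample, clipped so one outlier cannot dominate the batch."""
    return [min(math.exp(beta * a), max_weight) for a in advantages]


def weighted_bc_loss(log_probs, advantages, beta=3.0):
    """Advantage-weighted negative log-likelihood, averaged over the batch."""
    weights = advantage_weights(advantages, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


advantages = [-1.0, 0.0, 0.5, 2.0]
weights = advantage_weights(advantages)
for a, w in zip(advantages, weights):
    print(f"A = {a:+.1f} -> weight {w:.3f}")
```

A zero-advantage action keeps weight 1, negative advantages are exponentially down-weighted, and the largest advantage here hits the cap. In a real training loop the weights would be treated as constants (no gradient flows through them).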

When offline RL helps over BC

Mixed-quality data. If your demonstrations include some failures or sub-optimal trajectories, BC trains on the average. Offline RL trains toward the best.
Reward-labeled play data. If you have task-agnostic interaction with reward labels, BC has nothing to imitate. Offline RL extracts a task-specific policy.

When offline RL doesn't help

If your dataset is uniformly expert demonstrations, BC matches offline RL and is simpler. If your dataset is small and narrow, offline RL is hard to tune and unreliable. The big breakthroughs in robot learning over the last three years were data, not offline RL.

Worked example: CQL penalty computation, step by step. You have a state $s$ and 5 actions in the offline dataset: $a \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$. The current Q-network assigns Q-values: $Q(s, a) = [2.0, 3.5, 4.0, 3.0, 1.5]$. To keep the arithmetic small, treat these five actions as the whole candidate set; a real implementation also includes sampled actions that are not in the data. Step 1: Log-sum-exp over all actions. This is the "push all Q down" term: $$ \log \sum_a \exp Q(s,a) = \log(\exp(2.0) + \exp(3.5) + \exp(4.0) + \exp(3.0) + \exp(1.5)) $$ $= \log(7.39 + 33.12 + 54.60 + 20.09 + 4.48) \approx \log(119.67) \approx 4.78$. Step 2: Dataset mean Q-value. This is the "pull dataset Q up" term. Under the empirical behavior policy (uniform over the 5 dataset actions): $$ \mathbb{E}_{a \sim \pi_\beta}[Q(s,a)] = \frac{2.0 + 3.5 + 4.0 + 3.0 + 1.5}{5} = 2.8 $$ Step 3: CQL penalty. $$ \mathcal{L}_{\text{CQL}}(s) = 4.78 - 2.8 = 1.98 $$ Interpretation: The penalty is positive (1.98), which means the loss will push Q-values down overall. The log-sum-exp is dominated by the highest Q-values ($Q = 4.0$ contributes $\exp(4.0) = 54.6$, which is 46% of the sum). This means the gradient pressure is strongest on the actions with the highest Q. Once sampled non-dataset actions are included in the sum, that pressure lands on exactly the OOD actions that standard Q-learning would overestimate. What happens during training: The CQL penalty is added to the standard Bellman loss with weight $\alpha$. With $\alpha = 5$: the total penalty contribution is $5 \times 1.98 = 9.9$. This is large compared to typical Bellman errors (1–3), so the Q-network is strongly incentivized to keep Q-values low everywhere except on actions that actually appear in the data. The result: if the policy tries to take an action not in $\{0.1, 0.3, 0.5, 0.7, 0.9\}$, the Q-value there has been actively suppressed, and the policy avoids it. The over-conservatism problem: If $\alpha$ is too large, the Q-values are pushed so far down that even good in-distribution actions look bad. The policy becomes paralyzed and refuses to take any action confidently. 
This is why CalQL (Calibrated CQL) was introduced: it sets a floor on Q-values based on the Monte Carlo return of the dataset trajectories, preventing Q from being pushed below the actual observed return.
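The CQL penalty arithmetic from the worked example can be reproduced in a few lines of plain Python. This is a sketch for a single state: the helper names are hypothetical, and the five dataset actions double as the full candidate set, which is a simplification (a real implementation would also include sampled actions in the log-sum-exp).

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_candidate_actions, q_dataset_actions):
    """Log-sum-exp over candidate actions minus the mean Q over dataset actions."""
    mean_dataset_q = sum(q_dataset_actions) / len(q_dataset_actions)
    return logsumexp(q_candidate_actions) - mean_dataset_q


q_values = [2.0, 3.5, 4.0, 3.0, 1.5]
penalty = cql_penalty(q_values, q_values)

# Share of the exponentiated mass carried by the highest-Q action.
top_share = math.exp(max(q_values)) / sum(math.exp(q) for q in q_values)

print(f"logsumexp = {logsumexp(q_values):.2f}")
print(f"penalty   = {penalty:.2f}")
print(f"top share = {top_share:.0%}")
```

Subtracting the maximum before exponentiating does not change the result but avoids overflow when Q-values grow large, which matters once the penalty is computed over a real network's outputs.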

Offline RL methods: head-to-head comparison

Method	Data requirement	Compute	Stitching?	OOD safety	Best use case
BC	Expert demos only	Low (1 model, MSE loss)	No — copies demos	No constraint	Uniformly expert data, simple tasks
CQL	Mixed-quality OK	Medium (Q-net + penalty)	Yes	Strong (conservative Q)	Mixed-quality data where you want safe exploitation
IQL	Mixed-quality OK	Medium (V + Q + π)	Yes	Moderate (implicit)	General offline RL; modern default for simplicity
CalQL	Mixed-quality OK	Medium-high	Yes	Strong (calibrated)	When CQL is too conservative; needs MC return estimates
TD3+BC	Mixed-quality OK	Low (TD3 + BC term)	Limited	Weak	Quick baseline; few hyperparameters
Decision Transformer	Return-labeled trajectories	High (Transformer)	No — sequence model	No explicit constraint	When you want to condition on desired return

The stitching column is the key differentiator. BC and Decision Transformer are trajectory-level methods: they reproduce entire demonstrated trajectories. CQL and IQL are state-level methods: they learn per-state values and can compose the best parts of different trajectories into a novel plan that exceeds any single demonstration. If your dataset contains diverse sub-optimal trajectories whose good segments could be combined into a better-than-demonstrated policy, CQL or IQL will find it; BC will not.

The offline RL practitioner's decision tree

Given a fixed dataset and no further robot access, which method should you use? The decision depends on three properties of your dataset:

Is the data uniformly expert? If yes, use BC. It is simpler, faster to train, and will match or slightly beat offline RL on uniformly expert data. Offline RL's advantage is in extracting value from mixed-quality data, which uniformly expert data does not have.
Does the data have reward labels? If no (only demonstration trajectories without per-step rewards), you must either assign rewards retroactively (hindsight relabeling, success/failure labels) or use BC. IQL requires $(s, a, r, s')$ tuples; it cannot train on unlabeled demos.
Does the data cover diverse behaviors? If the dataset contains multiple strategies for the same task (some fast and risky, some slow and careful), IQL can stitch together the best parts of each. If all demonstrations follow the same strategy, stitching has nothing to combine, and the advantage over BC vanishes.

The most common mistake: applying offline RL to a small, narrow, expert-only dataset and expecting it to outperform BC. It will not. Offline RL is a tool for extracting maximal value from large, diverse, mixed-quality datasets. On small expert datasets, BC is the right tool, and the additional complexity of CQL/IQL buys you nothing but tuning headaches.
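The three questions above can be written down as a small function. This is only a restatement of the decision tree in code; the function name, argument names, and return strings are hypothetical, and real datasets rarely answer each question with a clean yes or no.

```python
def choose_offline_method(uniformly_expert, has_rewards, diverse_behaviors):
    """Map the three dataset properties to a suggested starting method."""
    if uniformly_expert:
        # Nothing for offline RL to extract beyond what BC already copies.
        return "BC"
    if not has_rewards:
        # Offline RL needs (s, a, r, s') tuples; relabel rewards or fall back to BC.
        return "BC (or relabel rewards first)"
    if diverse_behaviors:
        # Mixed quality, rewards available, multiple strategies: stitching can help.
        return "IQL"
    # Rewards exist but every demo follows one strategy: little to stitch.
    return "BC"


for flags in [(True, True, True), (False, False, True), (False, True, True), (False, True, False)]:
    print(flags, "->", choose_offline_method(*flags))
```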

Worked example: IQL expectile regression. Suppose we have three data transitions from the same state $s$, with actions $a_1, a_2, a_3$ and Q-values $Q(s, a_1) = 2.0$, $Q(s, a_2) = 5.0$, $Q(s, a_3) = 8.0$. The current value estimate is $V(s) = 4.5$. The expectile loss with $\tau = 0.7$: For $a_1$: $u = Q(s,a_1) - V(s) = 2.0 - 4.5 = -2.5$. Since $u < 0$: weight = $|\tau - 1| = 0.3$. Loss = $0.3 \times (-2.5)^2 = 1.875$. For $a_2$: $u = 5.0 - 4.5 = 0.5$. Since $u \geq 0$: weight = $\tau = 0.7$. Loss = $0.7 \times 0.5^2 = 0.175$. For $a_3$: $u = 8.0 - 4.5 = 3.5$. Since $u \geq 0$: weight = $0.7$. Loss = $0.7 \times 3.5^2 = 8.575$. Total = 10.625. The gradient pushes $V(s)$ upward (toward the high-Q actions), because the $\tau = 0.7$ weighting penalizes under-estimation 2.3× more than over-estimation. At convergence, $V(s)$ approximates the 70th percentile of the Q-distribution — biased toward the good actions.
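The three-action example can be run to convergence with a few lines of plain Python. This is a sketch for a single state with a scalar $V$ in place of a value network; the function names, learning rate, and step count are illustrative assumptions.

```python
def expectile_loss(q_values, v, tau):
    """Sum of |tau - 1(u < 0)| * u^2 over residuals u = q - v."""
    total = 0.0
    for q in q_values:
        u = q - v
        weight = (1.0 - tau) if u < 0 else tau
        total += weight * u * u
    return total


def fit_expectile(q_values, tau, lr=0.05, steps=2000):
    """Gradient descent on a scalar V; a stand-in for the V-network update."""
    v = sum(q_values) / len(q_values)  # start from the mean (the tau = 0.5 solution)
    for _ in range(steps):
        grad = 0.0
        for q in q_values:
            u = q - v
            weight = (1.0 - tau) if u < 0 else tau
            grad += -2.0 * weight * u  # derivative of weight * (q - v)^2 w.r.t. v
        v -= lr * grad / len(q_values)
    return v


q_values = [2.0, 5.0, 8.0]
loss_at_start = expectile_loss(q_values, v=4.5, tau=0.7)
v_star = fit_expectile(q_values, tau=0.7)
print(f"loss at V = 4.5: {loss_at_start:.3f}")
print(f"converged V:     {v_star:.3f}")
```

With these numbers the loss at $V = 4.5$ matches the hand computation, and $V$ settles near 5.9: above the mean of 5.0, below the max of 8.0. Raising `tau` toward 1 moves the fixed point closer to the max, and `tau=0.5` leaves it at the mean.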

Reward design for offline RL in robotics

Offline RL requires reward labels, but most robot demonstration datasets do not have per-step rewards. Three practical approaches to retroactive labeling:

Binary success/failure. The simplest: $r = 1$ at the final step if the task was completed, $r = 0$ otherwise. All intermediate steps get $r = 0$. This creates an extremely sparse reward, which offline RL can still handle (IQL in particular copes well with this regime). The downside: no gradient signal for partial progress, so the value function does not distinguish "almost succeeded" from "never tried."
Distance-to-goal shaping. Compute the distance between the end-effector (or manipulated object) and the goal position at each timestep. $r_t = -\|p_t - p_{\text{goal}}\|$. This gives dense signal but requires knowing the goal position, which may not be available in all datasets. For language-conditioned tasks, "goal position" must be inferred from the instruction — an additional complexity.
Learned reward model. Train a classifier on (observation, language instruction) pairs to predict success probability. Use the classifier's logit as a dense reward: $r_t = \sigma^{-1}(P(\text{success} \mid o_t, \ell))$. This scales to diverse tasks but introduces reward model error, which offline RL can amplify if the Q-function exploits errors in the reward model.

The empirical finding: for robot manipulation with offline RL, binary sparse reward + IQL with high expectile ($\tau = 0.9$) is the simplest recipe that works reliably. Dense shaped rewards help convergence speed but require task-specific engineering that rarely generalizes across tasks.
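The three labeling schemes above can be sketched as plain functions over a recorded trajectory. The function names and the clamping constant are assumptions for illustration; in practice the success probabilities would come from a trained classifier, which is not shown here.

```python
import math


def binary_sparse_rewards(num_steps, success):
    """r = 1 on the final step of a successful episode, 0 everywhere else."""
    rewards = [0.0] * num_steps
    if success and num_steps > 0:
        rewards[-1] = 1.0
    return rewards


def distance_rewards(positions, goal):
    """r_t = -||p_t - p_goal|| for each recorded position."""
    return [-math.dist(p, goal) for p in positions]


def logit_rewards(success_probs, eps=1e-6):
    """r_t = logit(P(success)); probabilities are clamped to keep the logit finite."""
    rewards = []
    for p in success_probs:
        p = min(max(p, eps), 1.0 - eps)
        rewards.append(math.log(p / (1.0 - p)))
    return rewards


print(binary_sparse_rewards(4, success=True))
print(distance_rewards([(0.3, 0.4, 0.0), (0.0, 0.0, 0.0)], goal=(0.0, 0.0, 0.0)))
print(logit_rewards([0.1, 0.5, 0.9]))
```

The clamp in `logit_rewards` matters: a classifier that outputs exactly 0 or 1 would otherwise produce an infinite reward, which is one concrete way reward-model error gets amplified.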

Offline RL hyperparameter sensitivity

Offline RL methods are notoriously sensitive to hyperparameters, and the right settings depend on the dataset. The key hyperparameters and their interaction with dataset properties:

Hyperparameter	Method	Typical range	Sensitive to
$\alpha$ (CQL penalty weight)	CQL	1.0–10.0	Dataset coverage. Narrow data → higher $\alpha$. Broad data → lower $\alpha$.
$\tau$ (expectile)	IQL	0.7–0.9	Data quality. Expert data → $\tau = 0.7$. Mixed data → $\tau = 0.9$.
$\beta$ (AWR temperature)	IQL policy	3.0–10.0	Action space dimensionality. High-dim → lower $\beta$.
Discount $\gamma$	All	0.99–0.999	Horizon length. Long tasks → $\gamma$ closer to 1.
Q-ensemble size	CQL, SAC	2–10	OOD severity. More diverse data → fewer critics needed.

The practical recipe: start with IQL ($\tau = 0.7$, $\beta = 3.0$, $\gamma = 0.99$) and evaluate. If the policy is too conservative (refuses to attempt the task), increase $\tau$ toward 0.9. If the policy is too aggressive (attempts impossible actions), decrease $\tau$ toward 0.5 or switch to CQL. Tune on a validation set of held-out trajectories, not on real-robot deployment — offline RL hyperparameter sweeps on real hardware are prohibitively expensive.

19·5 RL as sequence modeling

What if reinforcement learning is just next-token prediction on trajectories?

Every RL method we have seen so far (PPO, SAC, CQL, IQL) learns a value function or a policy by solving the Bellman equation in some form. In 2021, two papers (Decision Transformer and Trajectory Transformer) asked the same heretical question within days of each other: what if we skip the Bellman equation entirely and just train a transformer on trajectory data? The answer turned out to be surprisingly effective, and it created a new paradigm that connects offline RL directly to the language-modeling infrastructure.

Decision Transformer

Decision Transformer (Chen et al., 2021) reframes offline RL as conditional sequence generation. The core idea: a trajectory is a sequence of (return-to-go, state, action) tokens. Train a causal transformer to predict the next action, conditioned on the desired return-to-go. At inference time, set the return-to-go to a high value — and the model generates actions consistent with achieving that return. arXiv:2106.01345

In plain English: feed the desired return as a token — "I want total reward = 10" — and the model outputs actions that achieve it. The transformer has seen thousands of trajectories during training, some good (reward 10) and some bad (reward 2). By conditioning on the desired return, you tell the model "generate actions from the part of the distribution where things went well." It is a quality dial for behavior generation.

The input sequence at each timestep $t$ is:

DT input sequence $$ \tau = \big(\hat{R}_1, s_1, a_1, \;\hat{R}_2, s_2, a_2, \;\ldots,\; \hat{R}_t, s_t \big)$$

$\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ — the return-to-go at timestep $t$. This is the sum of all future rewards from $t$ onward. It is the "steering signal" — it tells the model how much total reward we want from this point forward. High $\hat{R}_t$ → the model generates actions characteristic of high-performing trajectories.
$s_t$ — the state at timestep $t$. In robotics, this is the observation (joint angles, images, proprioception). Each modality is embedded by a separate encoder and projected to the transformer's hidden dimension.
$a_t$ — the action at timestep $t$. This is the prediction target. For discrete actions the transformer outputs a distribution and is trained with cross-entropy; for continuous actions it outputs a deterministic action vector and is trained with MSE.
$T$ — the episode length. The return-to-go decreases across the trajectory as rewards are collected. At $t = T$, $\hat{R}_T = r_T$.

Each token type (return, state, action) gets its own learned embedding layer. Timestep information is added via a learned positional embedding shared across the three token types at each step. The transformer is causal — it only attends to tokens at or before the current position, exactly like a language model.
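Computing the return-to-go tokens from a logged reward sequence is a single reverse pass. A minimal sketch in plain Python (the function name is an assumption; no discounting is applied, matching the definition of $\hat{R}_t$ above):

```python
def returns_to_go(rewards):
    """R_hat[t] = sum of rewards from step t to the end of the episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


rewards = [0.0, 0.0, 8.0, 2.0]
print(returns_to_go(rewards))
```

The first entry is the total episode return, and the last entry equals the final reward, as stated for $t = T$.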

The training objective

DT is trained with supervised learning on offline trajectory data. For continuous action spaces (the standard in robotics):

DT training objective (continuous actions) $$ \mathcal{L}_{\text{DT}} = \mathbb{E}_{\tau \sim \mathcal{D}}\left[\sum_{t=1}^{T} \big\| a_t - f_\theta(\hat{R}_t, s_t, \tau_{<t}) \big\|^2\right]$$

$f_\theta(\hat{R}_t, s_t, \tau_{<t})$ — the transformer's predicted action given the return-to-go, current state, and all prior context. The model outputs a deterministic action vector; the loss is MSE against the ground-truth action from the dataset.
$\tau_{<t}$ — all prior tokens in the trajectory: $(\hat{R}_1, s_1, a_1, \ldots, \hat{R}_{t-1}, s_{t-1}, a_{t-1})$. The context window is typically the last $K$ timesteps (e.g., $K = 20$), not the full trajectory.
$\tau \sim \mathcal{D}$ — a trajectory sampled from the offline dataset. DT trains on the same data as offline RL methods like CQL or IQL — the difference is that DT does not learn a Q-function or solve any Bellman equation.

In code: pred_actions = dt(returns_to_go, states, actions, timesteps); loss = F.mse_loss(pred_actions, gt_actions). At inference, set returns_to_go[0] = max_return so the model generates actions consistent with the highest-return trajectories in its training data. No value function, no policy gradient, no Bellman equation: just supervised learning on trajectory data with a conditioning signal. Training uses the same infrastructure as GPT fine-tuning.

For discrete action spaces (Atari), replace MSE with cross-entropy over action bins. The architecture is identical — only the output head and loss change.

Inference: steering with return-to-go

Worked example: DT inference on a pick-and-place task. The offline dataset contains trajectories with returns ranging from 0 (failure) to 10 (success). We want a successful policy. Step 1. Set the initial return-to-go to the maximum: $\hat{R}_1 = 10$. Step 2. Observe the current state $s_1 = [\text{joint angles, gripper open, block at (0.3, 0.2)}]$. Step 3. Feed $(\hat{R}_1, s_1)$ into the transformer. It predicts $a_1 = [0.02, 0.01, -0.03, \ldots]$ (move toward the block). Step 4. Execute $a_1$. Receive reward $r_1 = 0$ (no success yet). Update return-to-go: $\hat{R}_2 = \hat{R}_1 - r_1 = 10$. Step 5. Feed $(\hat{R}_1, s_1, a_1, \hat{R}_2, s_2)$ into the transformer. It predicts $a_2$. Step 6. Continue until the task completes or the episode times out. The key insight: by setting $\hat{R}_1 = 10$, we tell the model "generate a trajectory that achieves total reward 10." The model has seen such trajectories in training and generates actions consistent with them. Setting $\hat{R}_1 = 5$ would generate a mediocre trajectory. This is return-conditioned policy extraction — no value function, no policy gradient, just conditional generation.
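The inference loop in the worked example can be sketched independently of any particular model. Everything here is a stand-in: `predict_action` plays the role of the trained transformer, the toy environment and stub policy are hypothetical, and the context length of 20 simply mirrors the $K = 20$ window mentioned earlier.

```python
from collections import deque


def run_dt_episode(predict_action, env_reset, env_step, target_return,
                   context_len=20, max_steps=100):
    """Roll out a return-conditioned policy, decrementing return-to-go by observed reward."""
    state = env_reset()
    rtg = target_return
    # Only the last `context_len` (return-to-go, state, action) triples are kept.
    context = deque(maxlen=context_len)
    total = 0.0
    for _ in range(max_steps):
        action = predict_action(list(context), rtg, state)
        next_state, reward, done = env_step(action)
        context.append((rtg, state, action))
        rtg -= reward  # R_hat[t+1] = R_hat[t] - r[t]
        total += reward
        state = next_state
        if done:
            break
    return total, rtg


def make_toy_env(goal=3):
    """1-D toy task: reward 10 when the position reaches `goal`."""
    pos = {"x": 0}

    def reset():
        pos["x"] = 0
        return pos["x"]

    def step(action):
        pos["x"] += action
        done = pos["x"] >= goal
        return pos["x"], (10.0 if done else 0.0), done

    return reset, step


def stub_policy(context, rtg, state):
    """Placeholder for the transformer: move forward while return is still owed."""
    return 1 if rtg > 0 else 0


reset, step = make_toy_env()
total, remaining = run_dt_episode(stub_policy, reset, step, target_return=10.0)
print(total, remaining)
```

The only RL-specific bookkeeping is the `rtg -= reward` line. Everything else is ordinary autoregressive generation with a sliding context window.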

Trajectory Transformer

Trajectory Transformer (Janner et al., 2021) takes the idea further. Instead of predicting only the next action, it discretizes the entire trajectory — states, actions, and rewards — into tokens and predicts everything autoregressively. At inference, it uses beam search to plan: sample multiple candidate trajectory continuations, evaluate them by their predicted returns, and execute the best one. arXiv:2106.02039

The discretization: each dimension of state, action, and reward is binned into $V$ discrete values (typically $V = 100$). A single timestep $(s_t, a_t, r_t)$ with $d_s$-dimensional state and $d_a$-dimensional action becomes $d_s + d_a + 1$ tokens. The vocabulary is $V \cdot (d_s + d_a + 1)$ tokens with unique IDs per dimension.
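A minimal sketch of the per-dimension tokenization described above. Uniform binning and the explicit range arguments are assumptions made for illustration; the function name is hypothetical.

```python
def tokenize_step(values, lows, highs, num_bins=100):
    """Map one timestep's (state, action, reward) vector to per-dimension token IDs."""
    tokens = []
    for dim, (x, lo, hi) in enumerate(zip(values, lows, highs)):
        frac = (x - lo) / (hi - lo)
        # Clamp so out-of-range values fall into the first or last bin.
        bin_idx = min(max(int(frac * num_bins), 0), num_bins - 1)
        # The offset gives every dimension its own block of IDs.
        tokens.append(dim * num_bins + bin_idx)
    return tokens


# Two state dimensions, one action dimension, one reward: 2 + 1 + 1 = 4 tokens.
values = [0.0, 0.5, -1.0, 1.0]
lows = [-1.0, -1.0, -1.0, 0.0]
highs = [1.0, 1.0, 1.0, 1.0]
tokens = tokenize_step(values, lows, highs)
vocab_size = 100 * len(values)
print(tokens, vocab_size)
```

Because each dimension owns a disjoint ID range, the same bin index in two different dimensions maps to two different tokens, which is what "unique IDs per dimension" means in practice.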

Gato: the generalist agent

Gato (Reed et al., 2022) is the logical endpoint: tokenize everything — Atari frames, robot proprioception, text, images — into a single sequence, and train one transformer on all of it. 1.2B parameters. 604 tasks across Atari, robotics, image captioning, and dialogue. arXiv:2205.06175

The tokenization scheme: continuous values (actions, proprioception) are mu-law encoded and discretized into 1024 bins. Images are split into 16×16 patches, and each patch is embedded by a ResNet block into the transformer's embedding space. Text is SentencePiece with a 32K vocabulary. A single context window might contain [image tokens, proprioception tokens, action tokens, text tokens], and the model predicts the next token regardless of modality.
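A sketch of the continuous-value path: mu-law companding followed by uniform binning into 1024 bins. The constants `mu=100` and `m=256` are assumptions chosen for illustration and should be checked against the paper before reuse; the function names are hypothetical.

```python
import math


def mu_law_encode(x, mu=100.0, m=256.0):
    """Compand x: sign(x) * log(|x| * mu + 1) / log(m * mu + 1)."""
    magnitude = math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0)
    return math.copysign(magnitude, x)


def discretize(y, num_bins=1024):
    """Clip y to [-1, 1] and map it to an integer bin in [0, num_bins - 1]."""
    y = min(max(y, -1.0), 1.0)
    idx = int((y + 1.0) / 2.0 * num_bins)
    return min(idx, num_bins - 1)


for x in [-1.0, -0.01, 0.0, 0.01, 1.0]:
    print(f"x = {x:+.2f} -> bin {discretize(mu_law_encode(x))}")
```

The logarithm spends more bins on small magnitudes than a uniform grid would, which suits joint deltas and velocities that cluster near zero.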

Gato proved the architecture is universal. It did not prove that the architecture is competitive — on any single task, a specialist beats Gato. The contribution is the existence proof: a single set of weights can play Atari, control a robot arm, and hold a conversation. The question for 2026's VLAs is whether scale resolves the specialist gap.

Why this paradigm matters

Infrastructure reuse. The entire language-modeling stack — KV-cache, FlashAttention, tensor parallelism, speculative decoding — transfers directly. No custom RL training loop.
Scaling laws. If trajectory prediction obeys the same scaling laws as language, then throwing more compute at a bigger model should improve the policy. Early evidence is mixed but trending positive.
Pre-training. A trajectory transformer can be pre-trained on diverse multi-task data, then fine-tuned on a specific task with minimal data — the same recipe that works for LLMs.
Unified interface. Language instructions, goals, and rewards can all be tokens in the same sequence. No separate conditioning mechanism needed.

The stitching problem

DT's main limitation is trajectory stitching. Classical offline RL (CQL, IQL) can combine the beginning of one trajectory with the end of another to find a better policy than any single trajectory in the dataset. DT cannot — it generates trajectories that look like trajectories in the dataset. If the dataset contains no single trajectory that achieves the maximum reward, DT will not discover the optimal policy by composing partial trajectories.

Concretely: suppose the dataset has two trajectories for a navigation task. Trajectory A reaches a waypoint efficiently but then fails. Trajectory B starts poorly but finishes well. IQL can stitch the efficient start of A with the successful finish of B. DT will reproduce either A or B in full, because it was trained to imitate whole sequences. This is the fundamental trade-off: DT gives up compositional generalization in exchange for stable, simple training.

Worked example: DT inference step by step. Task: pick-and-place. Dataset returns range from 0 (failure) to 10 (perfect). We want the best possible behavior. Step 0. Set desired return-to-go: $\hat{R}_1 = 10$ (we want a perfect trajectory). Step 1. Observe $s_1 = [\text{joint angles, gripper open, block at (0.3, 0.2)}]$. Feed $(\hat{R}_1 = 10, s_1)$ into the transformer. Output: $a_1 = [0.02, 0.01, -0.03, 0.0, 0.0, 0.0, 0.0]$ (move toward block). Step 2. Execute $a_1$. Receive reward $r_1 = 0$ (no success yet). Update: $\hat{R}_2 = \hat{R}_1 - r_1 = 10 - 0 = 10$. Step 3. Observe $s_2$. Feed $(\hat{R}_1, s_1, a_1, \hat{R}_2 = 10, s_2)$. Output: $a_2$ (continue approaching). Steps 4–15. The robot approaches, grasps, lifts. Each step: execute, observe reward, update return-to-go. At step 12, the robot places the block: $r_{12} = 8.0$. Now $\hat{R}_{13} = 10 - 0 - 0 - \ldots - 8.0 = 2.0$. Step 16. The remaining return-to-go is 2.0. The model generates "finishing" actions (release gripper, retract arm) consistent with earning 2.0 more reward. The key mechanism: the return-to-go acts as a "quality dial." Setting $\hat{R}_1 = 10$ tells the transformer "generate actions from the part of the training distribution where total reward was 10." Setting $\hat{R}_1 = 3$ would generate a mediocre trajectory. The model never sees an explicit reward signal during inference — it just conditions on the desired outcome.

Worked example: why DT cannot stitch. Two demonstrations for a block-stacking task: Demo A: States $[s_1, s_2, s_3, s_4]$, actions $[a_1^A, a_2^A, a_3^A, a_4^A]$, return = 5. Good approach ($s_1 \to s_2$), sloppy grasp ($s_2 \to s_3$), failed placement ($s_3 \to s_4$). Demo B: States $[s_1, s_5, s_6, s_7]$, actions $[a_1^B, a_5^B, a_6^B, a_7^B]$, return = 8. Slow approach ($s_1 \to s_5$), solid grasp ($s_5 \to s_6$), clean placement ($s_6 \to s_7$). Optimal stitched policy: $s_1 \xrightarrow{a_1^A} s_2$ (good approach from A) $\to$ somehow transition to $s_6$ (solid grasp from B) $\to s_7$ (clean placement from B). Return could be 13. Why DT fails: If we condition on $\hat{R}_1 = 13$, the transformer has never seen a trajectory with return 13 starting from $s_1$. It saw return=5 from $s_1 \to s_2 \to \ldots$ and return=8 from $s_1 \to s_5 \to \ldots$. It will either generate A-like or B-like actions, not a hybrid. The context window processes sequences as wholes; it has no mechanism for value-based per-state composition. Why IQL succeeds: IQL learns $V(s_2) \approx \text{best continuation from } s_2$. If the dataset includes any transition near $(s_2, \cdot, s_6)$ or if value generalization connects them, IQL extracts $\pi(s_2) = a^*$ that leads toward $s_6$. The stitching happens through the value function, not through sequence matching.

Decision Transformer forward pass (simplified)

import torch
import torch.nn as nn

class DecisionTransformer(nn.Module):
    def __init__(self, state_dim, act_dim, hidden=256, n_layers=4, n_heads=4, max_len=20):
        super().__init__()
        # Separate embeddings for each token type
        self.embed_return = nn.Linear(1, hidden)
        self.embed_state  = nn.Linear(state_dim, hidden)
        self.embed_action = nn.Linear(act_dim, hidden)
        # Timestep indices passed to forward() must lie in [0, max_len)
        self.embed_timestep = nn.Embedding(max_len, hidden)
        # Causal transformer (GPT-style): encoder layers plus a causal mask
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads, batch_first=True),
            num_layers=n_layers
        )
        self.predict_action = nn.Linear(hidden, act_dim)

    def forward(self, returns, states, actions, timesteps):
        # returns: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        # timesteps: (B, T) integer (long) indices
        B, T, _ = states.shape
        pos = self.embed_timestep(timesteps)  # (B, T, hidden)
        # Embed each modality + add shared positional encoding
        r_emb = self.embed_return(returns) + pos   # (B, T, hidden)
        s_emb = self.embed_state(states) + pos     # (B, T, hidden)
        a_emb = self.embed_action(actions) + pos    # (B, T, hidden)
        # Interleave: [R1, s1, a1, R2, s2, a2, ...]
        tokens = torch.stack([r_emb, s_emb, a_emb], dim=2)  # (B, T, 3, hidden)
        tokens = tokens.reshape(B, 3*T, -1)                 # (B, 3T, hidden)
        # Causal mask (True = blocked): each token attends only to itself and prior tokens
        mask = torch.triu(
            torch.ones(3*T, 3*T, dtype=torch.bool, device=tokens.device), diagonal=1
        )
        # nn.TransformerEncoder takes the attention mask as `mask`, not `src_mask`
        out = self.transformer(tokens, mask=mask)  # (B, 3T, hidden)
        # Extract state token positions (indices 1, 4, 7, ...)
        state_out = out[:, 1::3, :]  # (B, T, hidden)
        return self.predict_action(state_out)  # (B, T, act_dim)

Comparison: sequence RL vs. classical offline RL

Method	Bellman equation?	Trajectory stitching?	Training	Best for
Decision Transformer	No	No	Supervised (MSE / CE)	High-quality demos, simple training
IQL	Yes (expectile)	Yes	Offline RL	Mixed-quality data, needs stitching
CQL	Yes (conservative)	Yes	Offline RL	Large datasets, risk-averse deployment
BC (vanilla)	No	No	Supervised (MSE / CE)	Expert-only data
Trajectory Transformer	No	Partial (beam search)	Supervised (CE on tokens)	Planning-heavy tasks
Gato	No	No	Supervised (CE on tokens)	Multi-task generalist

Decision Transformer did not beat IQL on benchmarks. That was never the point. The point was proving that a policy can be a language model — and that the infrastructure, scaling laws, and pretraining recipes of LLMs transfer to RL. Every VLA that tokenizes actions is a descendant of this insight.

20 Hybrid: BC + RL

The polishing step that makes specialists out of generalists.

BC gives you a policy that does roughly the right thing. RL gives you a policy that does the right thing reliably.

RL fine-tuning of BC

Train a BC model, initialize an RL run with its weights. Two tricks:

KL constraint against the BC prior: add a $\mathrm{KL}(\pi_\theta \| \pi_{\text{BC}})$ regularizer.
Entropy clipping: cap the policy's entropy at the BC policy's level, so RL cannot make the policy more stochastic than the prior.
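The KL regularizer has a closed form when both policies are diagonal Gaussians. The sketch below assumes that parameterization (the text does not fix one), and the function name and coefficient value are hypothetical.

```python
import math


def kl_diag_gaussians(mu_p, std_p, mu_q, std_q):
    """KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp * sp + (mp - mq) ** 2) / (2.0 * sq * sq) - 0.5
    return total


# Fine-tuned policy (p) versus frozen BC prior (q) for a 2-D action.
kl = kl_diag_gaussians(mu_p=[0.10, -0.05], std_p=[0.2, 0.2],
                       mu_q=[0.00, 0.00], std_q=[0.2, 0.2])
kl_coef = 0.1  # assumed weight; trades task reward against staying near the prior
rl_loss = 1.0  # placeholder for the RL objective's loss value
regularized_loss = rl_loss + kl_coef * kl
print(f"KL = {kl:.4f}, regularized loss = {regularized_loss:.4f}")
```

The penalty is zero when the fine-tuned policy matches the prior exactly and grows quadratically with the mean shift, so small refinements are cheap and large departures are expensive.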

Residual RL

Freeze a BC base policy $\pi_{\text{BC}}$. Train a small RL "correction" policy $\pi_\Delta$ that outputs an action delta. The deployed action is $a = \pi_{\text{BC}}(o) + \pi_\Delta(o)$. The RL problem is much easier — the BC prior already does most of the task, and the correction lives in a small action-magnitude box.

The residual formulation in detail

In plain English: the BC policy does the heavy lifting — reaching, approaching, orienting. The RL agent adds a tiny correction on top, like a surgeon fine-tuning a robot arm that a nurse has already roughly positioned. The correction is bounded so the RL can never override the base behavior entirely.

The total action is:

Residual RL action $$ a_t = \pi_{\text{BC}}(o_t) + \underbrace{\pi_\Delta(o_t)}_{\text{RL correction}}, \qquad \|\pi_\Delta\| \leq \delta$$

The RL agent sees the observation $o_t$ and outputs the correction $\pi_\Delta(o_t)$ as its action. The environment receives the full action $a_t = \pi_{\text{BC}}(o_t) + \pi_\Delta(o_t)$, and the reward is evaluated on that composite: $r'(s_t, \pi_{\text{BC}}(o_t) + \pi_\Delta(o_t))$. The RL agent only controls the delta: it cannot override the base policy, only nudge it. The bound $\|\pi_\Delta\| \leq \delta$ (typically $\delta \approx 0.01$ in end-effector space, or 10% of the base policy's action magnitude) prevents the correction from dominating.

Why this is easier than training RL from scratch: the BC policy already solves the gross motion problem. Reaching toward the right object, moving to the right area, orienting the gripper roughly correctly: all of this is handled. The RL correction only needs to learn the fine motion: the last 3mm of insertion, the force modulation during contact, the precise timing of the gripper close. The action dimensionality is unchanged, but the search is confined to a box of size $\delta$ around a policy that already reaches near-success states. Exploration therefore hits reward far more often, and RL typically converges in far fewer steps than it would from scratch.

The initialization matters: $\pi_\Delta$ is initialized with near-zero weights (e.g., the final linear layer scaled by 0.01). At the start of training, the composite policy is essentially pure BC. As RL training progresses, the correction grows from zero, and the base policy's behavior is smoothly refined rather than disrupted.

Worked example: residual RL for PCB insertion. The BC base policy positions the peg above the hole with ~3mm accuracy. The RL residual learns the final 3mm of insertion, including contact-force modulation. BC action: $\pi_{\text{BC}}(o) = [0.002, -0.001, -0.015, 0.0, 0.0, 0.0, 0.85]$ (slow descent, gripper mostly closed). RL correction: $\pi_\Delta(o) = [0.0005, 0.001, -0.003, 0.002, -0.001, 0.0, 0.0]$ (small lateral + rotational adjustment). Deployed action: $a = [0.0025, 0.0, -0.018, 0.002, -0.001, 0.0, 0.85]$. The correction is bounded: $\|\pi_\Delta\| \leq \delta$ where $\delta = 0.01$ in EE space. This prevents the RL from overriding the base policy and causing unsafe behavior. The result: the BC base gets the robot to within 3mm, and the residual closes the last 3mm with learned compliance — something the BC alone couldn't learn from demonstrations that didn't have consistent force feedback.
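
The arithmetic in the worked example can be checked in a few lines. This is a sketch that assumes the bound is enforced on the Euclidean norm of the whole correction vector; the helper names `l2_norm`, `project_to_ball`, and `residual_action` are hypothetical.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def project_to_ball(delta, delta_max):
    """Rescale delta so that its Euclidean norm does not exceed delta_max."""
    norm = l2_norm(delta)
    if norm <= delta_max:
        return list(delta)
    scale = delta_max / norm
    return [d * scale for d in delta]

def residual_action(a_bc, delta, delta_max=0.01):
    """Deployed action: frozen BC output plus the bounded RL correction."""
    bounded = project_to_ball(delta, delta_max)
    return [b + d for b, d in zip(a_bc, bounded)]

a_bc = [0.002, -0.001, -0.015, 0.0, 0.0, 0.0, 0.85]
delta = [0.0005, 0.001, -0.003, 0.002, -0.001, 0.0, 0.0]
action = residual_action(a_bc, delta)
correction_norm = l2_norm(delta)  # about 0.0039, inside the 0.01 bound
```

Here the correction norm is about 0.0039, so the projection leaves it untouched and `action` matches the deployed action quoted above. A correction larger than $\delta$ would be scaled back onto the ball, keeping its direction.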

HIL-SERL: the 5-component recipe

Luo et al., 2024. The current state of the art for sample-efficient real-world RL on manipulation. HIL-SERL (Human-in-the-Loop Sample-Efficient RL) is not a single algorithm — it is a carefully assembled pipeline of five components, each essential:

Pre-trained vision encoder (frozen). ResNet-10 or R3M, pretrained on diverse manipulation data. The encoder converts 480x640 RGB images into a 512-dim feature vector. Frozen during RL training — this reduces the RL problem from "learn perception + control" to "learn control on a fixed representation."
Offline pretraining from demos. Collect 20–50 teleoperated demonstrations. Pretrain the Q-network and policy on this data using an offline RL objective (conservative Q-learning or simple BC + Q-regression). This gives the policy a reasonable starting behavior — it can attempt the task, even if imperfectly.
Online RL with SAC. Deploy the pretrained policy on the real robot and fine-tune with SAC. The Q-ensemble (2–10 critics) with high update-to-data ratio (UTD = 20) squeezes maximum learning from every real-world transition.
Human-in-the-loop interventions. A human operator watches the robot via camera feed. When the policy is about to fail (e.g., the gripper is about to drop the object, or the arm is heading toward a collision), the human presses a button and takes over via teleop. The human guides the robot through the difficult part, then releases control back to the policy.
Intervention data goes into the replay buffer. Both the autonomous data (successes and near-failures) and the human intervention data (corrections near failure states) are stored in the SAC replay buffer. The intervention data is crucial: it provides exactly the transitions the policy needs most — how to recover from states near failure. Without interventions, the policy would have to discover recovery behaviors through random exploration, which is dangerous and slow.
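
Components 4 and 5 can be sketched as a rollout loop in which a human override, when present, replaces the policy's action and the transition is stored with an intervention flag. Everything here is a toy stand-in: the 1-D environment, `weak_policy`, `toy_human`, and the `Transition` and `ReplayBuffer` classes are hypothetical and are not the HIL-SERL codebase's API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Transition:
    obs: float
    action: float
    reward: float
    next_obs: float
    intervened: bool  # True when the human supplied the action

@dataclass
class ReplayBuffer:
    data: List[Transition] = field(default_factory=list)

    def add(self, transition: Transition) -> None:
        self.data.append(transition)

def run_episode(env_step, policy, human, obs, buffer, max_steps=50):
    """Roll out one episode; human(obs) returns an action, or None to stay hands-off."""
    for _ in range(max_steps):
        override = human(obs)
        action = policy(obs) if override is None else override
        next_obs, reward, done = env_step(obs, action)
        # Autonomous and human transitions share one buffer.
        buffer.add(Transition(obs, action, reward, next_obs, override is not None))
        obs = next_obs
        if done:
            break
    return obs

# Toy 1-D stand-in: obs is the remaining alignment error.
def toy_env_step(obs, action):
    next_obs = obs - action
    done = abs(next_obs) < 0.01
    return next_obs, (1.0 if done else 0.0), done

def weak_policy(obs):
    return 0.1 * obs  # closes only 10% of the error per step

def toy_human(obs):
    # Takes over during the fine-alignment phase only.
    return obs if abs(obs) < 0.05 else None

buffer = ReplayBuffer()
final_error = run_episode(toy_env_step, weak_policy, toy_human, 1.0, buffer)
```

In this toy run the policy handles the coarse approach and the single human takeover supplies the rewarded transition. That is the kind of sample the off-policy learner would otherwise have to find by exploration.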

The key insight: human interventions are not just a safety mechanism. They are a data collection strategy. The interventions target exactly the states where the policy is weakest, providing corrective examples near failure boundaries. This is the data RL needs most: transitions at the edge of success and failure, where the Q-function's gradient is steepest.

The result: near-perfect success rates on contact-rich tasks (PCB insertion, Jenga manipulation, connector insertion) within roughly 1 to 2.5 hours of real-world training. This is the only RL recipe in 2026 that is competitive with BC + lots of data on real robots.

When to use which hybrid approach

Scenario	Recommended	Why
BC is 75%+ and you have a sim	RL fine-tuning in sim	Cheap, safe, unlimited data. Add KL constraint against BC prior.
BC is 75%+ and you need real-world polish	Residual RL	Bounded corrections, safe deployment, fast convergence on fine motion.
BC is 50% and the task is contact-rich	HIL-SERL	Human interventions provide recovery data. High UTD compensates for small data.
No BC at all, only a simulator	PPO from scratch + sim-to-real	On-policy RL scales with parallel envs. DR + RMA for transfer.
VLA foundation model + task-specific deployment	DPPO / RL fine-tuning of generative policy	The VLA is the BC; RL refines it on the specific task and robot.
Limited real data, no sim, no robot access	Offline RL (IQL)	Extract the best policy from the fixed dataset without interaction.

The residual RL initialization trick

The most common failure mode of residual RL is "the RL correction grows too large too fast, overrides the BC base, and the combined policy collapses." The fix is embarrassingly simple: initialize the RL correction network's final layer to output near-zero actions.

Concretely: draw the final linear layer's weights from a zero-mean Gaussian with standard deviation 0.01 and set its biases to 0. At the start of RL training, $\pi_\Delta(o) \approx \mathbf{0}$, so the total action is:

Residual RL with near-zero initialization $$ a_t = \pi_{\text{BC}}(o_t) + \underbrace{\pi_\Delta(o_t)}_{\approx\, \mathbf{0}\text{ at init}} \approx \pi_{\text{BC}}(o_t) $$

$\pi_{\text{BC}}(o_t)$ — the frozen base policy output. This provides the gross motion: reaching, approaching, orienting. It is never updated by RL gradients.
$\pi_\Delta(o_t)$ — the RL correction network. A small MLP (2–3 layers, 256 hidden units) that takes the same observation as the BC policy. Its output is clipped: $\|\pi_\Delta\| \leq \delta$.
$\delta$ — the correction bound. Typically 5–10% of the BC action magnitude. For a 7-DOF robot with EE-delta actions in the range $[-0.05, 0.05]$ m/step, $\delta \approx 0.005$ m. This prevents the RL from overriding the base policy entirely.

The near-zero initialization means the RL agent starts by executing the BC policy almost exactly, and then gradually discovers which small corrections improve the reward. This is analogous to LoRA in LLM fine-tuning: start from the frozen pretrained model and add a small low-rank correction that is zero at initialization (LoRA initializes one of its two factors to zero). The RL agent never "forgets" the BC behavior because it never had to learn it in the first place: the BC policy is frozen and always contributes its full output.

ResidualPolicy — PyTorch

import torch
import torch.nn as nn

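class ResidualPolicy(nn.Module):
    """Frozen BC base policy plus a small, bounded, learned correction."""

    def __init__(self, bc_policy, obs_dim, act_dim, delta_max=0.005):
        super().__init__()
        self.bc = bc_policy
        for p in self.bc.parameters():
            p.requires_grad = False  # freeze BC: RL gradients never reach it
        self.bc.eval()  # no dropout / batch-norm updates in the base
        self.delta_net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )
        # Near-zero init: final layer outputs ~0 at start
        nn.init.normal_(self.delta_net[-1].weight, std=0.01)
        nn.init.zeros_(self.delta_net[-1].bias)
        self.delta_max = delta_max

    def train(self, mode=True):
        # Keep the frozen base in eval mode even when the wrapper is training
        super().train(mode)
        self.bc.eval()
        return self

    def forward(self, obs):
        with torch.no_grad():
            a_bc = self.bc(obs)
        delta = self.delta_net(obs)
        # Per-dimension clamp: bounds every component by delta_max
        # (an infinity-norm bound, slightly looser than an L2-norm bound)
        delta = torch.clamp(delta, -self.delta_max, self.delta_max)
        return a_bc + delta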

Worked example: HIL-SERL for USB insertion. The task: insert a USB-A connector into a port. The connector must be oriented correctly (no flip) and aligned to ±0.5mm. The initial BC policy, trained on 30 teleoperated demonstrations, achieves 62% success.

Failure analysis of the BC policy (38 failures out of 100 trials):

18 failures: connector aligned but slightly too high/low (misses the port opening by ~1mm).
12 failures: connector positioned correctly but rotated ~3° around the insertion axis (catches on the shield).
8 failures: approach trajectory too fast, overshoots the pre-insertion waypoint.

HIL-SERL setup. A human operator watches the robot through a side-mounted camera. They hold a 6-DOF SpaceMouse. When the robot is about to fail, the human takes over and guides the connector into the port. The takeover typically lasts 2–4 seconds (the fine alignment phase). Both autonomous and intervention trajectories go into the SAC replay buffer.

Minute 0–5: The policy is mostly BC. Success rate: ~62%. The human intervenes on ~40% of trials. Each intervention generates 20–40 transitions at exactly the states where the policy struggles most (near the port opening, at contact). Replay buffer: ~500 transitions.
Minute 5–10: SAC with UTD=20 has already done 10,000 gradient updates on the replay buffer. The Q-ensemble (5 critics) is learning that fine lateral adjustments near the port yield high reward. Success rate climbs to ~75%. Human interventions drop to ~25% of trials.
Minute 10–15: The policy has learned the fine alignment motion from the intervention data. It now self-corrects when the connector catches on the shield (the most common failure mode). Success rate: ~87%. Human interventions: ~12% of trials.
Minute 15–20: The remaining failures are edge cases: unusual USB port orientations, connector wear. The human intervenes on the rare hard cases, providing the exact transitions needed. Final success rate: 94%.

Total human interventions: ~50 over 20 minutes (~100 trials total, 2–4 seconds per takeover). The data budget: 20 minutes of real-robot time generated ~3,000 transitions (2,000 autonomous + 1,000 from interventions). SAC with UTD=20 performed ~60,000 gradient updates. The policy went from 62% to 94%, a 32-point improvement, in less time than it takes to collect 30 more teleoperation demonstrations.
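
The update counts quoted in this example follow directly from the update-to-data ratio: every stored transition triggers `utd` gradient updates. A minimal sketch of that loop, with hypothetical `collect_fn` and `update_fn` callbacks standing in for the robot step and the SAC update:

```python
def run_with_utd(num_env_steps, utd, collect_fn, update_fn):
    """Interleave data collection with `utd` gradient updates per environment step."""
    updates = 0
    for _ in range(num_env_steps):
        collect_fn()           # one real-world transition into the buffer
        for _ in range(utd):
            update_fn()        # one critic/actor update on a sampled mini-batch
            updates += 1
    return updates

def noop():
    pass

updates_after_5_min = run_with_utd(500, 20, noop, noop)    # 10,000
updates_total = run_with_utd(3000, 20, noop, noop)         # 60,000
```

With UTD = 20, the ~500 transitions collected in the first five minutes already account for 10,000 updates, and the full ~3,000 transitions account for 60,000.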

The Physical Intelligence π₀ → π_0.5 → π_0.7 recipe

Physical Intelligence's progression from π₀ to π_0.7 is the most complete public example of the BC → fine-tune → RL pipeline applied at scale. Each stage adds a specific capability:

Stage 1: π₀ (foundation model). Pre-train a VLA with flow matching on a large-scale diverse dataset (cross-embodiment, multi-task). The flow matching objective generates continuous action trajectories rather than discrete tokens. The result is a generalist policy that can attempt hundreds of tasks on multiple robots, but does none of them reliably (∼40–60% success on most tasks).

Stage 2: π_0.5 (task-specific fine-tuning). Fine-tune π₀ on 50–200 demonstrations of the target task using LoRA (rank 16–32, applied to the attention layers of the VLA). The fine-tuning takes 2–4 hours on 8 GPUs. LoRA preserves the foundation model's general knowledge while adapting the action distribution to the specific task. Success rate improves to 65–80% on the target task, with some degradation on non-fine-tuned tasks (the catastrophic forgetting is mild because LoRA modifies only ~2% of the parameters).

Stage 3: π_0.7 (RL polish with the RL Token). This is the novel contribution. Add a special RL Token to the VLA's input context — a binary token that, when set to 1, activates an exploration mode. During RL fine-tuning:

The RL Token is set to 1. The VLA adds an entropy bonus to its action distribution, encouraging exploration around the fine-tuned behavior.
Online RL (PPO variant) runs for 30–60 minutes per task on the real robot. The reward is sparse (task success/failure) plus shaped sub-rewards (distance to goal, contact events).
Only the LoRA adapters and a small RL head are updated; the foundation model backbone remains frozen.
At deployment, the RL Token is set to 0, and the policy produces deterministic actions with the RL-refined LoRA weights.

The result: π_0.7 achieves 10–25% absolute improvement on dexterous tasks (laundry folding, object reorientation, connector insertion) over π_0.5. The improvement is largest on tasks with contact-rich phases where the BC demonstrations were inherently inconsistent (humans demonstrate slightly different force profiles each time).

Why the RL Token works. The fundamental problem with RL fine-tuning of a foundation model is balancing exploration against catastrophic forgetting. Standard RL exploration (entropy bonus on the full action distribution) causes the VLA to produce random, incoherent actions that deviate wildly from the pretrained distribution. The RL Token provides a conditional exploration mechanism: when RL Token = 1, the model adds stochasticity only to the action expert's output, not to the VLM backbone's language and vision processing. The VLM still produces coherent scene understanding and task decomposition; only the low-level motor commands are perturbed. This is analogous to adding noise to the actor in SAC while keeping the critic deterministic — except here the "critic" is the VLM's scene understanding. At deployment, RL Token = 0 and the action expert runs in its deterministic (mode-seeking) configuration. The RL-refined LoRA weights encode the improved policy; the token simply controls whether exploration noise is added during online learning.

RLHF for robots

Human preference labels over pairs of trajectories train a reward model; the reward model trains a policy with RL. The bottleneck is preference-label collection at scale; the technique is mature, the data isn't.
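
The reward-model step is usually trained with a Bradley-Terry likelihood over trajectory pairs: the probability that trajectory A is preferred is the sigmoid of the difference in predicted returns. A minimal sketch, assuming a per-step reward function summed over the trajectory; `trajectory_return`, `preference_loss`, and the toy reward are hypothetical names for illustration.

```python
import math

def trajectory_return(reward_fn, trajectory):
    """Sum of predicted per-step rewards over (obs, action) pairs."""
    return sum(reward_fn(obs, act) for obs, act in trajectory)

def preference_loss(reward_fn, traj_a, traj_b, a_preferred):
    """Negative log-likelihood under Bradley-Terry: P(a > b) = sigmoid(R_a - R_b)."""
    diff = trajectory_return(reward_fn, traj_a) - trajectory_return(reward_fn, traj_b)
    if not a_preferred:
        diff = -diff
    # -log(sigmoid(diff)) in a numerically stable form
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# Toy reward: being closer to the goal (obs near 0) is better.
def toy_reward(obs, act):
    return -abs(obs)

good = [(0.1, 0.0), (0.0, 0.0)]
bad = [(1.0, 0.0), (0.9, 0.0)]
loss_agree = preference_loss(toy_reward, good, bad, a_preferred=True)
loss_disagree = preference_loss(toy_reward, good, bad, a_preferred=False)
```

The loss is small when the reward model ranks the pair the way the human did and large when it disagrees. Minimizing it over many labeled pairs is what fits the reward model before any RL happens.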

The hybrid landscape, visualized

Method	BC data	RL interaction	Best for
Pure BC	Yes (expert)	None	When data is plentiful and expert-quality
BC + RL fine-tune	Yes (initialize)	On-policy (sim or real)	Closing the last 10-20% gap
Residual RL	Frozen base	Small correction only	When base is good but needs polish
HIL-SERL	Small seed	Real-world + human safety	Contact-rich tasks, production quality
Offline RL	Mixed-quality	None (fixed dataset)	When no further interaction is possible
VLA + RL Token	Foundation model	Online RL at deploy	Continual improvement post-deployment

The pattern to notice: every successful hybrid method constrains the RL component. Residual RL bounds the correction magnitude. HIL-SERL adds a human safety net. KL-constrained fine-tuning penalizes deviation from the BC prior. The RL Token restricts exploration to the action expert while keeping the VLM backbone frozen. Unconstrained RL from scratch on a real robot is still impractical in 2026 — the search space is too large, the hardware too expensive, and the failure modes too dangerous. The hybrid recipe is: BC provides the prior, RL provides the polish, and the constraint prevents the polish from becoming sandpaper.

The open question for 2026–2027: how much RL budget does each task need? Contact-rich insertion tasks converge in 20 minutes of HIL-SERL. Open-world navigation may need hours. Deformable manipulation (cloth, rope) remains an open challenge for RL fine-tuning because the reward signal is ambiguous (what does "successfully folded" mean for a wrinkled towel?) and the physics are hard to simulate. The hybrid stack is mature; the reward engineering for complex tasks is not.

The meta-lesson of this section: RL is not competing with BC. RL is the second stage of a pipeline where BC is the first. The two techniques are complements, not substitutes. The field spent 2018–2023 debating "BC vs RL." The field in 2026 uses both, in sequence, constrained, on every task that matters. The only remaining debate is how much RL budget each task needs, not whether to use it at all.

If you have a working BC policy and you want it to be 95% rather than 75%, you do not need a new architecture. You need RL fine-tuning with a small budget of real-world interaction and either a human safety net or a calibrated simulator. This is the closing argument of 2026's playbook.

20·5DPPO & RL fine-tuning of generative policies

The missing piece: how to apply policy gradients when your policy generates actions through iterative denoising.

Section 20 showed the general recipe for RL fine-tuning of BC policies. But diffusion and flow-matching policies pose a unique problem: you cannot easily compute $\log \pi_\theta(a \mid o)$. The action is the output of a multi-step denoising chain, not a single forward pass through a network with a tractable density. Without $\log \pi$, the standard likelihood-based updates (PPO's importance ratio, SAC's entropy term) don't apply directly. This section covers the growing family of methods that solve this problem.

The core challenge

A Gaussian policy outputs $a \sim \mathcal{N}(\mu_\theta(o), \sigma^2)$; computing $\log \pi_\theta(a \mid o)$ is one line of code. A diffusion policy generates $a$ by iterating $K$ denoising steps from pure noise $a^{(K)} \sim \mathcal{N}(0, I)$ through $a^{(K-1)}, a^{(K-2)}, \ldots, a^{(0)}$. The final action $a = a^{(0)}$ depends on the initial noise, the $K$ network evaluations, and, for a stochastic sampler such as DDPM, the fresh noise injected at every step. The marginal density $\pi_\theta(a^{(0)} \mid o) = \int p(a^{(K)}) \prod_{k=1}^{K} p_\theta(a^{(k-1)} \mid a^{(k)}, o)\, da^{(1:K)}$ is intractable: it requires marginalizing over all intermediate noisy actions $a^{(1)}, \ldots, a^{(K)}$.

DPPO: Diffusion Policy Policy Optimization

DPPO (Ren et al., 2024) resolves this by reframing the denoising chain as a multi-step MDP. Each denoising step $k \to k-1$ is treated as an "action" in an inner MDP. The "state" at inner step $k$ is the current noisy action $a^{(k)}$ plus the observation $o$. The "action" at inner step $k$ is the denoiser's output that produces $a^{(k-1)}$. The reward is zero for all intermediate steps; the environment reward $r$ arrives only after the final denoised action $a^{(0)}$ is executed. arXiv:2409.00588

This reframing makes each individual denoising step a tractable Gaussian transition — and PPO can be applied to the chain.

The denoising chain as an inner MDP

Formally, DPPO defines:

Inner state $\tilde{s}_k = (a^{(k)}, o)$. The noisy action at denoising step $k$, plus the observation.
Inner action $\tilde{a}_k = \epsilon_\theta(a^{(k)}, k, o)$. The noise prediction at step $k$.
Inner transition $a^{(k-1)} = f(a^{(k)}, \tilde{a}_k, k)$. The DDPM/DDIM update rule.
Inner reward $\tilde{r}_k = 0$ for $k > 0$; $\tilde{r}_0 = r(o, a^{(0)})$. Reward only at the end.

Because each stochastic denoising update is Gaussian (the noise prediction $\epsilon_\theta$ sets the mean, and the sampler's injected noise, or added exploration noise for a deterministic sampler such as DDIM, sets the variance), we can compute $\log \pi_\theta(\tilde{a}_k \mid \tilde{s}_k)$ at each denoising step. PPO's clipped surrogate objective applies to each step individually.
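
The per-step log-probability is the ordinary Gaussian log-density of the next sample around the mean that the denoiser implies. A minimal sketch, assuming an isotropic variance per step; `gaussian_log_prob` and `chain_log_prob` are hypothetical helper names, and the means are passed in directly rather than computed from a real denoiser.

```python
import math

def gaussian_log_prob(x, mean, std):
    """log N(x; mean, std^2 I) for an isotropic Gaussian."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * sq / (std * std) - d * math.log(std) - 0.5 * d * math.log(2.0 * math.pi)

def chain_log_prob(chain, means, stds):
    """Joint log-prob of the sampled chain given its start.

    chain[0] is the initial noise; chain[i + 1] was drawn from
    N(means[i], stds[i]^2 I), where means[i] comes from the denoiser.
    """
    return sum(
        gaussian_log_prob(chain[i + 1], means[i], stds[i])
        for i in range(len(means))
    )

# Two denoising steps on a 2-D action.
chain = [[1.0, -1.0], [0.5, -0.5], [0.1, -0.1]]
means = [[0.5, -0.5], [0.1, -0.1]]
stds = [0.1, 0.05]
log_prob = chain_log_prob(chain, means, stds)
```

Each term is tractable because it conditions on the sampled intermediate actions. What stays intractable is the marginal over all such chains, which is why the policy gradient is taken per step rather than on the final action.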

DPPO's modified PPO objective

In plain English: PPO but applied to each denoising step of the diffusion policy. The diffusion model takes 16 steps to go from pure noise to a clean action. DPPO treats each of those 16 steps as a separate decision point and applies PPO independently to each one. The environment reward only arrives at the end (after the clean action is executed), but GAE propagates the reward signal backward through all 16 steps so every denoising step gets a gradient.

DPPO objective over the denoising chain $$ \mathcal{L}_{\text{DPPO}} = \sum_{k=0}^{K-1} \mathbb{E}_{\tilde{s}_k, \tilde{a}_k}\left[\min\!\Big(\tilde{r}_k(\theta)\,\hat{A}_k, \;\text{clip}\big(\tilde{r}_k(\theta), 1-\epsilon, 1+\epsilon\big)\,\hat{A}_k \Big)\right]$$

$k$ — the denoising step index, running from $K-1$ (noisiest) to $0$ (cleanest). Each step is treated as a separate "timestep" in the inner MDP.
$\tilde{r}_k(\theta) = \frac{\pi_\theta(\tilde{a}_k \mid \tilde{s}_k)}{\pi_{\theta_\text{old}}(\tilde{a}_k \mid \tilde{s}_k)}$ — the importance ratio at denoising step $k$. Same as standard PPO, but computed for the noise prediction at step $k$, not the final action.
$\hat{A}_k$ — the advantage at denoising step $k$. Because the only reward comes at $k = 0$, the advantage must be propagated backward through the chain. DPPO uses GAE computed over the inner MDP: $\hat{A}_k = \sum_{j=0}^{k} (\gamma \lambda)^j \delta_{k-j}$ where $\delta_k = \tilde{r}_k + \gamma V(\tilde{s}_{k-1}) - V(\tilde{s}_k)$.
$\epsilon$ — the PPO clip range, typically 0.2. Applied independently at each denoising step to prevent any single step from changing too much.
$V(\tilde{s}_k)$ — the inner value function. A learned critic that estimates the expected return from inner state $\tilde{s}_k$. Since all reward comes at $k = 0$, $V(\tilde{s}_k)$ estimates "how good is the partially-denoised action $a^{(k)}$?"
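
The objective above can be sketched directly from per-step log-probabilities and advantages. This version averages over all (sample, denoising step) pairs, which differs from the sum over $k$ in the formula only by a constant factor. The function name and the nested-list `(B, K)` layout are illustrative assumptions, and plain Python stands in for tensors.

```python
import math

def dppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate averaged over B samples and K denoising steps.

    All three inputs are nested lists of shape (B, K). The value returned
    is an objective to maximize; negate it to obtain a loss.
    """
    total, count = 0.0, 0
    for new_row, old_row, adv_row in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old, adv in zip(new_row, old_row, adv_row):
            ratio = math.exp(lp_new - lp_old)  # importance ratio at this step
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            total += min(ratio * adv, clipped * adv)
            count += 1
    return total / count

unchanged = dppo_clipped_objective([[0.0, 0.0]], [[0.0, 0.0]], [[1.0, 3.0]])
capped_gain = dppo_clipped_objective([[0.5]], [[0.0]], [[1.0]])
uncapped_penalty = dppo_clipped_objective([[0.5]], [[0.0]], [[-1.0]])
```

With a ratio of $e^{0.5} \approx 1.65$, a positive advantage is capped at $1.2 \times \hat{A}$, while a negative advantage is charged at the full ratio. That is the usual pessimistic asymmetry of PPO, applied here per denoising step.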

Computing advantages for intermediate denoising steps

The tricky part: only the final action receives reward. So how does the advantage propagate to step $k = K-1$ (the first denoising step from pure noise)?

DPPO treats the denoising chain as a $K$-step episode with a single terminal reward. The value function $V(\tilde{s}_k)$ learns to predict the expected environment reward from inner state $k$. The advantage at step $k$ is:

Inner advantage via GAE $$ \hat{A}_k = \sum_{j=0}^{k} (\gamma_{\text{inner}} \lambda)^j \Big[\tilde{r}_{k-j} + \gamma_{\text{inner}} V(\tilde{s}_{k-j-1}) - V(\tilde{s}_{k-j})\Big]$$

$\gamma_{\text{inner}}$ — the discount factor for the inner MDP. Typically set to 1.0 (no discounting within the chain), since the chain is short ($K = 10$–$16$ steps) and we want the reward signal to propagate fully.
$\lambda$ — the GAE parameter for the inner MDP. Controls bias-variance trade-off in advantage estimation. Typically 0.95, same as outer PPO.
$\tilde{r}_{k-j}$ — inner reward. Zero for all $k > 0$; equals the environment reward at $k = 0$.

In practice, the backward pass is cheap: the chain is only $K = 10$–$16$ steps, so GAE over the inner MDP is a simple loop.
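
A quick sanity check on the inner GAE formula, worked under two assumptions: the limiting case $\gamma_{\text{inner}} = \lambda = 1$, and the convention $V(\tilde{s}_{-1}) = 0$ (nothing follows the executed action). In that case the sum telescopes and every denoising step receives the same learning signal, "terminal reward minus what the critic expected from here":

```latex
\begin{aligned}
\hat{A}_k
  &= \sum_{j=0}^{k} \Big[\tilde{r}_{k-j} + V(\tilde{s}_{k-j-1}) - V(\tilde{s}_{k-j})\Big] \\
  &= \underbrace{\sum_{i=0}^{k} \tilde{r}_i}_{=\,\tilde{r}_0\,=\,r(o,\,a^{(0)})}
     \;+\; \underbrace{\sum_{i=0}^{k} \Big[V(\tilde{s}_{i-1}) - V(\tilde{s}_i)\Big]}_{=\,V(\tilde{s}_{-1}) - V(\tilde{s}_k)} \\
  &= r(o, a^{(0)}) - V(\tilde{s}_k).
\end{aligned}
```

With $\lambda < 1$ the same terms are reweighted by $(\gamma_{\text{inner}}\lambda)^j$, so steps far from the clean action lean more on the critic and less on the raw terminal reward.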

DPPO advantage computation for the denoising chain

import torch

def compute_dppo_advantages(
    inner_values,   # column k holds V(s_k); k = 0 is the clean step   shape: (B, K)
    env_reward,     # r(o, a^(0))                                      shape: (B,)
    gamma=1.0,      # inner discount (usually 1.0)
    lam=0.95        # GAE lambda
):
    B, K = inner_values.shape
    advantages = torch.zeros_like(inner_values)  # (B, K)
    # Allocate helpers on the same device and dtype as the inputs
    zeros = torch.zeros_like(env_reward)         # (B,)
    last_gae = torch.zeros_like(env_reward)      # (B,)

    # In inner-MDP time the chain runs k = K-1 -> 0, so iterating
    # k = 0, 1, ..., K-1 walks it backward, as GAE requires.
    # k=0 is the final (clean) step that receives reward.
    for k in range(K):
        if k == 0:
            # Terminal step: reward comes from environment
            inner_reward = env_reward             # (B,)
            next_value = zeros                    # no step after final action
        else:
            # Intermediate step: no reward
            inner_reward = zeros
            next_value = inner_values[:, k - 1]   # V(s_{k-1})

        delta = inner_reward + gamma * next_value - inner_values[:, k]
        last_gae = delta + gamma * lam * last_gae
        advantages[:, k] = last_gae

    return advantages  # (B, K): column k is the advantage at denoising step k

REBEL: reward-conditioned diffusion

An alternative to modifying the RL objective: condition the diffusion model on the desired reward, analogous to classifier-free guidance but for RL returns. During training, add a reward embedding to the denoiser's conditioning input. At inference, set the reward conditioning to the maximum observed reward. The model generates high-reward actions without any policy gradient computation. This is the diffusion-policy analogue of Decision Transformer's return-to-go conditioning.

The advantage: no inner MDP, no modified PPO, no value function over denoising steps. The disadvantage: like Decision Transformer, it cannot stitch trajectories or extrapolate beyond the best behavior in the dataset.

CalQL + diffusion policies

Calibrated Q-Learning (Nakamoto et al., 2023) extends CQL with a calibration step that prevents excessive conservatism. When paired with a diffusion action head, the diffusion model serves as the policy $\pi$ in the actor-critic loop, and the Q-function provides gradients to update the denoiser. The key insight is that diffusion policies produce diverse action samples naturally — they are excellent proposal distributions for the log-sum-exp term in CQL's regularizer. arXiv:2303.05479

RLPD: RL with Prior Data

RLPD (Ball et al., 2023) is not specific to diffusion policies, but it is the default recipe for mixing offline demonstrations with online RL. The idea is simple: maintain a replay buffer that contains both online transitions (from the current policy interacting with the environment) and offline transitions (from the demonstration dataset). Sample mini-batches from both, with a fixed ratio (typically 50/50), and train SAC as usual. arXiv:2302.02948

RLPD works because SAC is off-policy — it can learn from any data regardless of which policy collected it. The demonstrations provide a warm start (the policy sees successful behavior immediately), and the online data provides coverage of the states the policy actually visits. The 50/50 ratio is surprisingly robust; most practitioners do not need to tune it.
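
The 50/50 mix amounts to drawing half of every mini-batch from each buffer. A minimal sketch with plain Python lists as buffers; `sample_symmetric` is a hypothetical helper, not the RLPD reference implementation.

```python
import random

def sample_symmetric(online_buffer, offline_buffer, batch_size, rng=random):
    """Draw half the batch from demonstrations and half from online data."""
    half = batch_size // 2
    batch = [rng.choice(offline_buffer) for _ in range(half)]
    batch += [rng.choice(online_buffer) for _ in range(batch_size - half)]
    rng.shuffle(batch)  # avoid any ordering effects inside the batch
    return batch

# 30 demo transitions versus 500 online transitions: the mix stays 50/50.
offline = [("demo", i) for i in range(30)]
online = [("online", i) for i in range(500)]
batch = sample_symmetric(online, offline, 256, rng=random.Random(0))
num_demo = sum(1 for source, _ in batch if source == "demo")
```

Because the ratio is fixed per batch rather than per buffer size, the demonstrations keep half of the gradient signal even after the online buffer has grown much larger than the demo set.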

The Physical Intelligence recipe: RL Token

The $\pi_{0.5} \to \pi_{0.7}$ progression from Physical Intelligence reveals the production recipe for RL fine-tuning of generative VLA policies. The mechanism: add a special RL Token to the VLA's vocabulary. When the token is present in the input, the model enters "RL mode": the action head is trained with online RL (environment interaction + reward signal) rather than BC. When the token is absent, the model behaves as a standard BC policy.

The RL Token mechanism enables fast online polishing without catastrophic forgetting of the BC prior. The BC data remains in the training mix (the RL Token is absent for those examples), so the model simultaneously learns from demonstrations and from its own experience. Think of it as residual RL (Section 20) but implemented at the token level inside the VLA rather than as a separate correction network.

When to use which

Method	Policy type	Data regime	Best for
DPPO	Diffusion / flow	Sim rollouts (on-policy)	Sim-trained policies that need RL polish
Residual RL	Any (frozen base + correction)	Real-world online	When base is good; need small correction
REBEL / reward-conditioned	Diffusion / flow	Offline + reward labels	No online interaction available; have rewards
CalQL + diffusion	Diffusion / flow	Offline (fixed dataset)	Large offline datasets with mixed quality
RLPD	Any (SAC-based)	Online + offline demos	Real-world with prior demos; sample-efficient
RL Token (PI recipe)	VLA with generative head	Online (post-deployment)	Foundation VLAs; continual improvement

Worked example: DPPO on a sim-trained Diffusion Policy. You have a Diffusion Policy trained via BC on 200 demonstrations for a peg-insertion task. Success rate: 72%. You want to push it to 95% using RL in simulation.

Setup. The policy uses $K = 16$ DDIM denoising steps, action dimension $D = 7$ (relative EE pose + gripper). The environment reward is sparse: $r = 1$ on successful insertion, $r = 0$ otherwise.

Inner MDP. Each denoising step $k = 15, 14, \ldots, 0$ is a "timestep." The inner state is $(a^{(k)}, o)$. The inner action is the noise prediction $\epsilon_\theta(a^{(k)}, k, o)$. You train a small inner value function $V_\psi(\tilde{s}_k)$ with 2 hidden layers.

Rollout. Collect 256 environment episodes. For each executed action, record the full denoising chain: 16 inner states, 16 noise predictions, and the terminal reward. Counting one denoising chain per episode, this gives $256 \times 16 = 4096$ inner transitions per batch; an episode that executes $H$ actions (or action chunks) contributes $H$ chains, so the count scales by $H$.

PPO update. Compute GAE advantages over the inner MDP (the code above). Run 4 PPO epochs with clip $\epsilon = 0.2$. Update both the denoiser $\theta$ and the inner critic $\psi$.

Result. After 500 outer iterations (~128K episodes), success rate climbs from 72% to 94%. The denoiser has learned to slightly adjust its noise predictions at steps $k = 3$–$5$ (the final refinement steps) to produce actions that are more precisely aligned with the peg hole. The early denoising steps ($k = 15$–$8$) barely change, because the coarse trajectory was already correct from BC.

The field spent 2023–2024 building diffusion and flow-matching policies. It is now spending 2025–2026 figuring out how to RL-fine-tune them. DPPO cracked the theoretical barrier; RLPD and the RL Token cracked the practical one. If your BC policy plateaus at 80%, the answer is not more data — it is a few hundred episodes of RL on top of the denoising chain.

Loss	Used by	Shape
PG / REINFORCE	vanilla PG	$-\mathbb{E}[\log \pi(a \mid s) \cdot A]$
PPO clip	PPO	$-\mathbb{E}[\min(rA, \mathrm{clip}(r, 1\!\pm\!\epsilon)A)]$
DQN / Bellman	DQN, SAC critic	$\mathbb{E}[(Q - (r + \gamma \bar Q'))^2]$
SAC actor	SAC	$\mathbb{E}[\alpha \log \pi - Q]$
CQL extra	offline CQL	$\log \sum_a e^{Q} - \mathbb{E}_{\pi_\beta}[Q]$
IQL expectile	offline IQL	$\mathbb{E}[L^\tau_2(Q - V)]$
AWR / AWAC	offline / hybrid	$-\mathbb{E}[e^{\beta A} \log \pi]$

Ambition	Data scale	Compute	Training time
Single-task specialist	50–200 demos	1 GPU	4–8 hours
Multi-task, same robot	1K–5K demos	1–4 GPUs	1–2 days
Cross-embodiment generalist	100K+ demos	32+ GPUs	1–2 weeks
Foundation model (π₀-scale)	10M+ demos	256+ GPUs	Months

Strategy	Robot success (trained tasks)	Robot success (novel instructions)	Web VQA accuracy	Training time	Forgetting
No co-fine-tuning (robot data only)	89%	32%	41%	7 hours	Severe
Co-fine-tuning ($\lambda_\text{web} = 0.2$)	85%	71%	78%	8 hours	Minimal
LoRA-only (backbone frozen)	82%	68%	82%	4 hours	Zero

Architecture	Inference	Notes
ACT	5–10 ms	Single forward, 80M params
Diffusion Policy	30–80 ms	16 DDIM steps
π₀ flow	40–60 ms	10 Euler steps
π₀-FAST	~20–40 ms	~30 FAST tokens
GR00T N1 (2.2B)	63.9 ms	L40 GPU, bf16
OpenVLA 7B	200–400 ms	INT4 cuts ~2×
SmolVLA 450M	~30–50 ms	Jetson-class
One-step distilled	5–15 ms	1 sampling step

Stage	Time	Notes
Camera capture	5 ms	USB3 camera at 30fps. Async: grab latest frame.
Image preprocessing	3 ms	Resize, normalize, stack history frames.
Vision encoder	8 ms	ResNet-18 forward pass on GPU (bf16).
Denoiser (16 DDIM steps)	48 ms	16 x 3ms per step (CNN U-Net). The bottleneck.
Action postprocessing	2 ms	Unnormalize, convert to joint/EE commands.
Safety filter	5 ms	Workspace bounds, velocity limits, collision check.
Communication	3 ms	Send to robot controller over Ethernet/USB.
Margin	26 ms	Absorbs jitter. Never run at 100% utilization.
Total	100 ms	10Hz policy, 50Hz control (chunk fills the gap).

Scenario	Inference mode	Why
Benchmarking / research	Sync	Simpler, deterministic, easier to debug. Most sim benchmarks pause physics during inference anyway.
Slow tasks (pick-and-place)	Sync	Chunk duration >> inference latency. The pause is invisible.
Fast tasks (pouring, handovers)	Async	Any pause disrupts the task. Async keeps motion smooth.
Large VLA on edge hardware	Async + remote	Inference exceeds one chunk duration. Must plan ahead to avoid stalls.
Multi-robot deployment	Async + shared server	One GPU server plans for N robots. Each robot runs a lightweight async client.

Trials ($n$)	Success rate	95% CI	CI width
20	85% (17/20)	[62%, 97%]	35 points
50	~85% (42–43/50)	[72%, 93%]	21 points
100	85% (85/100)	[77%, 91%]	14 points
200	85% (170/200)	[80%, 90%]	10 points

Train → Test	Lab table	Kitchen	Warehouse	Outdoor (OOD)
Mug	92%	78%	70%	45%
Bowl	88%	82%	68%	40%
Bottle	90%	80%	72%	48%
Banana (OOD)	52%	40%	35%	22%

Benchmark	Tasks	Modalities	Typical use	Key metric
LIBERO	130	RGB, language, proprio	BC evaluation, VLA fine-tuning	Success rate (per suite)
Calvin	34	RGB, language, proprio	Long-horizon chaining	Avg chain length (1–5)
MetaWorld	50	State (no images)	RL, meta-learning	Success rate
RLBench	100	RGB-D, language, point cloud	3D policies, keyframe methods	Success rate
Language Table	~20	RGB, language	Language grounding	Success rate
SimplerEnv	~15	RGB, language	VLA sim-to-real correlation	Success rate (sim–real R²)
RoboCasa	100+	RGB, language, proprio	Kitchen generalization	Success rate

Simulator	Physics	GPU parallel	Visual realism	Best for
MuJoCo / MJX	Rigid + contact (accurate)	MJX: 4096+ envs	Low (functional)	Locomotion RL, fast prototyping
Isaac Sim/Lab/Gym	PhysX 5 (rigid + soft)	65K+ envs	High (ray-traced)	Sim-to-real, dexterous, locomotion
Genesis	Custom rigid-body solver + MPM + SPH	Yes (CUDA)	Medium	Multi-physics tasks (cloth, fluid)
SAPIEN	PhysX (articulated focus)	Limited	Medium-high	Articulated object manipulation
Robosuite	MuJoCo	Via MuJoCo	Low-medium	BC/IL benchmark framework

Setup	Task family	Sim-real correlation ($r$)	Typical absolute gap
Bridge V2 (WidowX)	Pick-and-place	0.65–0.72	10–20 points
Bridge V2 (WidowX)	Articulated (drawer)	0.55–0.65	15–25 points
Google Robot	Pick-and-place	0.50–0.60	15–30 points
Google Robot	Drawer manipulation	0.45–0.55	20–35 points

Situation	Architecture	Action space	Training
Single task, 50–200 demos, known object	ACT or Diffusion Policy	Joint positions	BC, 1 GPU, 1 day
Single task, need precision	Diffusion Policy + CLIP encoder	Relative EE 6D	BC, 1 GPU, 2 days
Multi-task, language conditioned	Fine-tune OpenVLA or SmolVLA	Discrete tokens / FAST	BC, 8 GPUs, 1 week
High-precision contact-rich	Diffusion Policy + RL fine-tune	Relative EE	BC then HIL-SERL, 2 hours real
Locomotion	PPO + domain rand	Joint torques	RL, sim only, 4096 envs
Humanoid whole-body	Two-system VLA	Hierarchical	BC + motion prior + RL
Data-scarce manipulation	3D Diffusion Policy	Relative EE	BC, 10–50 demos

00 Classical robotics — why we need learning

The robot as a dynamical system

PID control — the classical workhorse

Where classical control hits a wall

Interactive: PID tuning

01 The problem

Unpacking the POMDP

Scale of the challenge

Derivation: the $O(\epsilon T^2)$ compounding-error bound

Code: compounding error in 1D

The control loop: where policy meets physics

The POMDP in numbers

Why this is not supervised learning

Roadmap: a taxonomy of responses

Interactive: error growth curves

02 Spaces of action and observation

Action representations

Forward and inverse kinematics

Rotation parameterization

Why Euler angles fail: gimbal lock

Gram-Schmidt orthogonalization: step by step

Observations

The observation stack that works

03 Three paradigms

Imitation

Reinforcement

Model-based

Derivation: BC as maximum likelihood

Derivation: the Bellman equation

The hybrid spectrum

Paradigm comparison

Interactive: the paradigm triangle

04 Why naive BC fails

Compounding error

Multimodality

Derivation: why MSE yields the conditional mean

Interactive: bimodal action distribution

Causal confusion

DAgger: fixing distribution shift

Train initial policy

Roll out the learned policy

Query the expert

Aggregate the dataset

Retrain
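
The five DAgger steps above can be sketched on a toy 1D problem. Everything here is a hypothetical stand-in chosen to keep the example short: a saturating expert, a linear policy fitted by least squares, and additive dynamics `x ← x + a + noise`. It shows the loop structure only, not a manipulation-scale implementation.

```python
import random

def expert_action(x: float) -> float:
    """Hypothetical expert: push the state toward 0 with a saturated command."""
    return -0.5 * max(-1.0, min(1.0, x))

def fit_policy(dataset) -> float:
    """Least-squares fit of a linear policy a = w * x (through the origin)."""
    num = sum(x * a for x, a in dataset)
    den = sum(x * x for x, _ in dataset)
    return num / den

def rollout(w: float, x0: float, steps: int, rng: random.Random):
    """Run the learned policy and return the states it visits."""
    xs, x = [], x0
    for _ in range(steps):
        xs.append(x)
        x = x + w * x + rng.gauss(0.0, 0.05)  # toy dynamics with noise
    return xs

def dagger(iterations: int = 5, steps: int = 20, seed: int = 0):
    rng = random.Random(seed)
    # 1. Train initial policy on expert-labeled states near the origin.
    dataset = [(x, expert_action(x))
               for x in (rng.uniform(-1, 1) for _ in range(steps))]
    w = fit_policy(dataset)
    for _ in range(iterations):
        # 2. Roll out the learned policy, starting outside the demo range.
        visited = rollout(w, x0=rng.uniform(-3, 3), steps=steps, rng=rng)
        # 3. Query the expert for the correct action in each visited state.
        labels = [(x, expert_action(x)) for x in visited]
        # 4. Aggregate the new labels into the dataset.
        dataset.extend(labels)
        # 5. Retrain on everything collected so far.
        w = fit_policy(dataset)
    return w, dataset

w, dataset = dagger()
print(f"final gain w = {w:.3f}, dataset size = {len(dataset)}")
```

The key design point is step 3: labels are collected on states the learner visits, not on states the expert would have visited, so the training distribution tracks the rollout distribution.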

DAgger in practice: limitations and variants

The fixes

05 The multimodality problem

Gaussian mixture heads

Deriving the GMM loss

Discretized / categorical actions

Implicit / energy-based

Diffusion

Flow matching

Vector-quantized

Action head comparison

Practical decision tree

06 Action chunking

Why it works

Derivation: why chunking reduces the compounding bound

Receding horizon control

Temporal ensembling

Code: temporal ensembling

Interactive: temporal ensembling

Choosing H and K in practice

The information-theoretic view

07 ACT, in full

The CVAE wrapping

The loss

Derivation: the ELBO that gives ACT's loss

Inference

What ACT gets right

What ACT doesn't do

Hyperparameters that matter

08 Diffusion Policy, in full

The forward and reverse processes

Derivation: DDPM forward process closed form

Two backbone variants

CNN-based: 1D temporal U-Net

Transformer-based

Inference: DDIM