Introduction
Imagine you want a robot arm to pick up a coffee mug. In classical robotics, you would write equations describing the arm's geometry, build a perception pipeline to locate the mug, compute a collision-free trajectory, design a grasp planner, tune PID controllers for each joint, and pray that your analytical model of friction is accurate enough. Each component is hand-engineered, each interface is brittle, and the whole system shatters the moment you swap the mug for a wine glass.
Now imagine a different approach: you teleoperate the robot through a hundred mug-grasping demonstrations, then train a neural network to map camera images directly to motor commands. No kinematics derivation, no perception pipeline, no grasp planner. The robot learns the entire behavior end-to-end from data. This is the shift from model-based to data-driven robotics, and it is the central story of this tutorial.
This article is based on "Robot Learning: A Tutorial" by Capuano, Pascal, Zouitine, Wolf, and Aractingi from HuggingFace (arXiv:2510.12403). Their tutorial is remarkable for two reasons: it covers the entire stack from forward kinematics to vision-language-action models, and every algorithm comes with runnable code built on LeRobot — HuggingFace's open-source robot learning library.
We start with the classical robotics that learning-based methods aim to replace: forward/inverse kinematics, Jacobians, and feedback control. Then we build up through reinforcement learning, behavioral cloning, generative action models (VAE, diffusion, flow matching), and the specific architectures that dominate modern robot learning: ACT and Diffusion Policy. We finish with generalist robot policies (VLAs) — models like π0 and SmolVLA that can control multiple robot embodiments from natural language instructions.
The progression tells a story of increasing generality. Classical robotics works for one task on one robot with perfect models. RL works for one task on one robot without models but needs millions of interactions. Behavioral cloning learns from demonstrations but struggles with multimodal actions. Generative policies fix the multimodality problem. And VLAs scale to many tasks, many robots, and language-conditioned behavior. Each step addresses a limitation of the previous one.
Classical Robotics
Before we can appreciate why learning-based methods matter, we need to understand what they replace. Classical robotics gives us exact, analytical tools for controlling robot motion — and also shows us exactly where those tools break down.
We will work with a 2-DOF planar manipulator — the simplest possible robot arm. Two links of equal length l, connected by revolute joints with angles θ1 and θ2. This is a simplified version of the SO-100 arm used in the LeRobot ecosystem, but it captures all the essential ideas of kinematics.
Forward Kinematics
Forward kinematics (FK) answers the question: "Given joint angles, where is the end-effector?" For our 2-DOF arm, the answer is pure trigonometry. The first link takes us from the base to the elbow:
The second link extends from the elbow to the end-effector, but its angle is measured relative to the first link. So the total position is:
x = l · cos(θ1) + l · cos(θ1 + θ2)
y = l · sin(θ1) + l · sin(θ1 + θ2)
where q = [θ1, θ2] ∈ [-π, +π]2 is the configuration (joint-angle vector). The workspace — the set of all reachable positions — is a disk of radius 2l centered at the base. Points at distance exactly 2l require the arm to be fully extended (θ2 = 0); points near the base require it to fold back on itself.
import numpy as np
def forward_kinematics(theta1, theta2, l=1.0):
"""Compute end-effector position for a 2-DOF planar arm."""
x = l * np.cos(theta1) + l * np.cos(theta1 + theta2)
y = l * np.sin(theta1) + l * np.sin(theta1 + theta2)
return np.array([x, y])
# Example: both joints at 45 degrees
pos = forward_kinematics(np.pi/4, np.pi/4)
print(f"End-effector at ({pos[0]:.3f}, {pos[1]:.3f})")
# End-effector at (0.707, 1.707)
Inverse Kinematics
Inverse kinematics (IK) is the reverse problem: "Given a target position p*, what joint angles q reach it?" This is harder because FK is many-to-one — multiple configurations can produce the same end-effector position (think "elbow up" vs "elbow down"). We formulate IK as an optimization problem:
For our 2-DOF arm, this has a closed-form solution using the law of cosines. But for real robots with 6+ joints, closed-form solutions rarely exist. Instead, we use iterative numerical methods.
Differential Inverse Kinematics
The most practical approach is differential IK, which works in velocity space. Differentiate the FK equation with respect to time:
where J(q) = ∂fFK(q) / ∂q is the Jacobian matrix. For our 2-DOF arm, this is a 2×2 matrix. We want to find joint velocities q̇ that produce a desired end-effector velocity ṗ*. The solution uses the Moore-Penrose pseudoinverse:
In practice, we add a proportional feedback term to correct for drift:
where Δp = p* - p(q) is the position error and kp is a proportional gain. This feedback term acts as a spring pulling the end-effector toward the target — without it, numerical integration errors accumulate and the arm drifts off course.
The Jacobian J(q) maps small changes in joint angles to small changes in end-effector position. Its pseudoinverse J(q)+ does the reverse. When J becomes singular (the arm is fully extended or folded), the pseudoinverse blows up — these are kinematic singularities where the arm loses a degree of freedom. Near singularities, tiny end-effector motions require enormous joint velocities.
def jacobian(theta1, theta2, l=1.0):
"""Compute the 2x2 Jacobian for a 2-DOF planar arm."""
j11 = -l * np.sin(theta1) - l * np.sin(theta1 + theta2)
j12 = -l * np.sin(theta1 + theta2)
j21 = l * np.cos(theta1) + l * np.cos(theta1 + theta2)
j22 = l * np.cos(theta1 + theta2)
return np.array([[j11, j12], [j21, j22]])
def diff_ik_step(theta1, theta2, target, kp=5.0, dt=0.02, l=1.0):
"""One step of differential IK with proportional feedback."""
q = np.array([theta1, theta2])
current_pos = forward_kinematics(theta1, theta2, l)
error = target - current_pos
J = jacobian(theta1, theta2, l)
J_pinv = np.linalg.pinv(J)
q_dot = J_pinv @ (kp * error)
q_new = q + q_dot * dt
return q_new[0], q_new[1]
Drag the orange target point. The arm solves inverse kinematics in real-time using differential IK with Jacobian pseudoinverse. Watch how the arm configuration changes — and what happens near singularities (fully extended or fully folded).
Why Learning?
The kinematics we just derived work beautifully for a 2-DOF arm reaching a known position. But real robot tasks — grasping a deformable object, pouring a liquid, folding a towel — expose four fundamental limitations of dynamics-based methods:
1. Brittle multi-component pipelines. A real manipulation system chains together perception (object detection, pose estimation), planning (trajectory optimization, grasp planning), and control (PID, impedance control). Each component is independently designed, and the interfaces between them are fragile. A 5mm error in perception propagates through planning into a failed grasp. There is no end-to-end gradient to optimize the whole pipeline jointly.
2. Poor scaling to multimodal data. Each new sensor modality — a wrist camera, a force-torque sensor, a tactile skin — requires custom processing pipelines. You cannot easily fuse RGB images, depth maps, and proprioceptive state into a unified representation with classical methods.
3. Simplified analytical models. Our FK equation assumed rigid links, frictionless joints, and perfect actuators. Real robots have backlash, cable stretch, temperature-dependent friction, and compliant elements. Analytical models either ignore these effects or become intractably complex.
4. Cannot leverage growing data. The explosion of robot demonstration data — from teleoperation, simulation, and even YouTube videos — is useless to a hand-engineered controller. A PID gain tuned for one scenario does not improve when you collect a thousand more demonstrations.
Learning-based methods replace the entire perception-planning-control stack with a single neural network: observation in, action out. This is not just engineering convenience — it means the perception system is optimized for the downstream control task, not for some proxy objective like "detect the mug." The network learns to extract exactly the visual features that matter for manipulation, ignoring everything else.
The paper frames the transition elegantly: learning-based techniques use monolithic perception-to-action pipelines that directly map sensory inputs to predicted actions. This streamlines control by removing the need to interface multiple hand-designed components. It enables incorporation of diverse modalities through a single architecture. And it naturally positions the system to leverage growing data — more demonstrations mean better policies, with no additional engineering effort.
# Classical approach: multiple hand-designed components
def classical_pick(rgb_image, depth_image, joint_state):
# Step 1: Detect object (custom CV pipeline)
mask = segment_object(rgb_image)
bbox = compute_bounding_box(mask)
# Step 2: Estimate pose (another custom pipeline)
obj_pose = estimate_6dof_pose(depth_image, mask)
# Step 3: Plan grasp (grasp planning library)
grasp = plan_antipodal_grasp(obj_pose, gripper_model)
# Step 4: Plan trajectory (motion planning library)
trajectory = plan_rrt_star(joint_state, grasp, collision_model)
# Step 5: Execute with PID (hand-tuned gains)
return pid_controller(joint_state, trajectory[0], kp=50, kd=10)
# Learning approach: one neural network
def learned_pick(rgb_image, joint_state, policy_network):
observation = torch.cat([
encode_image(rgb_image), # Image encoder (learned)
torch.tensor(joint_state) # Proprioception
])
action = policy_network(observation) # Direct output: joint commands
return action
The rest of this article is about how to train that policy_network — starting with
the simplest approach (reinforcement learning with reward signals), progressing through imitation
learning from demonstrations, and arriving at generalist models that can follow language instructions
across different robot embodiments.
Reinforcement Learning
Reinforcement learning is the most general framework for learning robot behavior: the agent interacts with its environment, receives reward signals, and optimizes a policy to maximize cumulative reward. No demonstrations required — the robot discovers effective behavior through trial and error.
MDP Formulation
We formalize the robot's interaction with its environment as a Markov Decision Process (MDP), defined by the tuple:
where:
- S is the state space — joint angles, velocities, object poses, etc.
- A is the action space — motor torques, desired joint positions, or velocity commands
- D(st, at, st+1) = P(st+1 | st, at) are the dynamics — the physics of the environment
- r : S × A × S → R is the reward function
- γ ∈ [0, 1) is the discount factor — how much we value future rewards
- ρ is the initial state distribution
- T is the horizon — maximum episode length
A trajectory is a sequence of state-action-reward tuples:
The discounted return of a trajectory is:
The RL objective is to find a policy πθ(a|s) that maximizes the expected return:
where the trajectory distribution factors as:
Setting γ = 0.99 vs γ = 0.95 has a dramatic effect on robot behavior. With γ = 0.99, a reward 100 steps in the future is worth 0.99100 ≈ 0.37 of its face value — the robot plans far ahead. With γ = 0.95, that same future reward is worth 0.95100 ≈ 0.006 — essentially zero. Higher discount factors encourage long-horizon planning (useful for multi-step manipulation) but make learning slower and less stable.
Value Functions
Two value functions are central to RL algorithms:
The state-value function Vπ(s) tells us the expected return starting from state s and following policy π thereafter:
The state-action value function Qπ(s, a) tells us the expected return of taking action a in state s, then following π:
These are connected by the Bellman consistency equations:
The Bellman equations are recursive — they express the value of a state in terms of the value of successor states. This recursion is the foundation of every value-based RL algorithm.
Deep RL Methods
Deep Q-Learning (DQN) learns Qθ(s, a) with a neural network, using the temporal difference (TD) loss:
where the TD target is:
DQN works well for discrete action spaces (Atari games), but robot actions are continuous (joint torques, end-effector velocities). We cannot take a max over a continuous action space.
Deep Deterministic Policy Gradient (DDPG) solves this by learning a separate deterministic policy μφ(s) that outputs the action maximizing Q. The policy gradient is:
Soft Actor-Critic (SAC) adds an entropy bonus to encourage exploration:
where α is the temperature parameter controlling the exploration-exploitation tradeoff. SAC is the most popular off-policy RL algorithm for continuous control because the entropy term prevents premature convergence to suboptimal deterministic policies.
Pure RL is sample-inefficient — millions of interactions just to learn basic manipulation. Human-in-the-Loop Sample-Efficient RL (HIL-SERL) changes the game: a human operator intervenes during training to correct dangerous or unproductive behavior, providing both a safety mechanism and a curriculum. With HIL-SERL, real-world RL agents learn policies with near-perfect success rates on challenging manipulation tasks in just 1-2 hours of real interaction.
# Simplified SAC training loop for robot control
import torch
import torch.nn as nn
import torch.optim as optim
class SACPolicy(nn.Module):
def __init__(self, state_dim, action_dim, hidden=256):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim, hidden), nn.ReLU(),
nn.Linear(hidden, hidden), nn.ReLU(),
)
self.mean_head = nn.Linear(hidden, action_dim)
self.log_std_head = nn.Linear(hidden, action_dim)
def forward(self, state):
h = self.net(state)
mean = self.mean_head(h)
log_std = self.log_std_head(h).clamp(-20, 2)
return mean, log_std
def sample(self, state):
mean, log_std = self.forward(state)
std = log_std.exp()
normal = torch.distributions.Normal(mean, std)
z = normal.rsample() # Reparameterization trick
action = torch.tanh(z)
# Log-prob with tanh correction
log_prob = normal.log_prob(z) - torch.log(1 - action.pow(2) + 1e-6)
return action, log_prob.sum(-1)
class QNetwork(nn.Module):
def __init__(self, state_dim, action_dim, hidden=256):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
nn.Linear(hidden, hidden), nn.ReLU(),
nn.Linear(hidden, 1),
)
def forward(self, state, action):
return self.net(torch.cat([state, action], dim=-1))
# SAC soft TD target:
# y = r + gamma * (Q_target(s', a') - alpha * log_pi(a'|s'))
# where a' ~ pi(.|s')
Behavioral Cloning
Reinforcement learning requires a reward function, and designing good reward functions for manipulation tasks is notoriously difficult. How do you assign a scalar reward to "fold a towel neatly"? Behavioral cloning (BC) sidesteps this problem entirely: instead of learning from rewards, we learn from demonstrations.
Given a dataset of expert demonstrations D = {(oi, ai)}i=1..N — pairs of observations and the actions the expert took — BC reduces robot learning to supervised regression:
This is just mean squared error. Train a neural network fθ to predict the expert's action given the observation. Simple, stable, and fast to train. But two fundamental problems make naive BC unreliable for robotics:
Problem 1: Compounding errors (covariate shift). During training, the network sees observations from the expert's state distribution. During deployment, the robot executes its own (imperfect) actions, drifting to states the expert never visited. Small prediction errors compound — the robot enters unfamiliar territory, makes larger errors, enters even more unfamiliar territory, and the policy diverges catastrophically. A tiny angular error at step 1 becomes a 10cm position error by step 50.
Problem 2: Mode averaging. Consider a robot that needs to go around an obstacle. Half the demonstrations go left, half go right. Both are valid. But the MSE loss, which averages over these modes, predicts the mean action — which goes straight into the obstacle. When the demonstration distribution is multimodal (multiple valid behaviors for the same observation), regression to the mean produces behavior that matches none of the modes.
Compounding errors can be mitigated with more data and DAgger-style online correction. But mode averaging is a fundamental limitation of the MSE loss — it cannot represent multimodal action distributions. This is what motivates the entire next section on generative models. We need policies that can represent and sample from multi-modal distributions, not just predict the mean.
import torch
import torch.nn as nn
class BCPolicy(nn.Module):
"""Naive behavioral cloning policy — MSE regression."""
def __init__(self, obs_dim, action_dim, hidden=256):
super().__init__()
self.net = nn.Sequential(
nn.Linear(obs_dim, hidden), nn.ReLU(),
nn.Linear(hidden, hidden), nn.ReLU(),
nn.Linear(hidden, action_dim),
)
def forward(self, obs):
return self.net(obs)
# Training loop
policy = BCPolicy(obs_dim=64, action_dim=6)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
for batch in dataloader:
obs, expert_action = batch["observation"], batch["action"]
predicted_action = policy(obs)
loss = nn.functional.mse_loss(predicted_action, expert_action)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Problem: if demonstrations contain both "go left" and "go right"
# for the same observation, the policy learns "go straight" (average)
# which may be catastrophically wrong.
Both problems point in the same direction: we need action predictions that are not single points but rather distributions. If the policy can represent "go left with 50% probability, go right with 50% probability," mode averaging disappears. And if the policy outputs a chunk of future actions rather than a single timestep, compounding errors are reduced. These insights lead us to generative models.
Generative Models for Actions
The mode averaging problem demands policies that can represent multimodal distributions. Three families of generative models have been adapted for action prediction: Variational Autoencoders (VAEs), Diffusion Models, and Flow Matching. Each offers a different mechanism for transforming simple noise into structured, multimodal action distributions.
Variational Autoencoders & ELBO
A VAE learns to generate data by introducing a latent variable z. Instead of directly modeling p(x), we model p(x|z) and p(z), then marginalize:
This integral is intractable, so we introduce an approximate posterior qφ(z|x) and derive the Evidence Lower Bound (ELBO):
The first term is the reconstruction loss — how well can we reconstruct x from z? The second term is the KL regularization — how close is our learned posterior to the prior? We choose p(z) = N(0, I) as a simple Gaussian prior, and qφ(z|x) = N(μφ(x), σφ(x)) as a diagonal Gaussian parameterized by an encoder network.
Training maximizes the ELBO, which is equivalent to minimizing:
For robot learning, x is an action (or action sequence), and we condition everything on the observation o. The decoder becomes pθ(a|z, o) — given a latent code and an observation, generate an action. The latent variable z captures the "mode" of the demonstration: left vs right, fast vs slow, precise vs forceful.
class ActionVAE(nn.Module):
"""VAE for action generation, conditioned on observations."""
def __init__(self, obs_dim, action_dim, latent_dim=32, hidden=256):
super().__init__()
# Encoder: q(z | o, a) — only used during training
self.encoder = nn.Sequential(
nn.Linear(obs_dim + action_dim, hidden), nn.ReLU(),
nn.Linear(hidden, hidden), nn.ReLU(),
)
self.mu_head = nn.Linear(hidden, latent_dim)
self.logvar_head = nn.Linear(hidden, latent_dim)
# Decoder: p(a | z, o) — used during training and inference
self.decoder = nn.Sequential(
nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
nn.Linear(hidden, hidden), nn.ReLU(),
nn.Linear(hidden, action_dim),
)
def encode(self, obs, action):
h = self.encoder(torch.cat([obs, action], dim=-1))
return self.mu_head(h), self.logvar_head(h)
def reparameterize(self, mu, logvar):
std = (0.5 * logvar).exp()
eps = torch.randn_like(std)
return mu + eps * std
def decode(self, obs, z):
return self.decoder(torch.cat([obs, z], dim=-1))
def loss(self, obs, action):
mu, logvar = self.encode(obs, action)
z = self.reparameterize(mu, logvar)
recon = self.decode(obs, z)
recon_loss = nn.functional.mse_loss(recon, action)
kl_loss = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
return recon_loss + kl_loss
Diffusion Models
Diffusion models take a completely different approach to generation. Instead of learning an explicit latent space, they learn to reverse a gradual noising process.
Forward process. Starting from clean data z0, we add Gaussian noise over T steps according to a fixed schedule β1, ..., βT:
At each step, we scale down the signal by √(1 - βt) and add noise with variance βt. After enough steps, zT ≈ N(0, I) — pure noise. Defining αt = 1 - βt and āt = ∏s=1t αs, we can jump directly to any timestep:
Reverse process. Generation starts from zT ~ N(0, I) and iteratively denoises. We train a neural network εθ(zt, t) to predict the noise that was added at each step. The DDPM training loss simplifies beautifully to:
where ε ~ N(0, I) is the actual noise, zt = √āt · z0 + √(1-āt) · ε, and t is uniformly sampled from {1, ..., T}. We are simply training a denoiser — given a noisy sample and the noise level, predict the noise.
A diffusion model does not predict a single action — it iteratively sculpts a noise sample into an action. Different noise initializations lead to different modes of the action distribution. The network never needs to "average" between modes because it is always denoising toward the nearest clean data point. This is fundamentally different from MSE regression, which cannot represent multimodality.
Watch the forward process gradually corrupt a 2D action distribution into noise, then the reverse process recover structured data. Click "Forward" to add noise step by step, "Reverse" to denoise. The bimodal distribution (two clusters) is preserved through denoising but would be averaged by MSE.
import torch
def ddpm_training_step(model, x_0, T=1000):
"""One DDPM training step: sample t, add noise, predict noise."""
batch_size = x_0.shape[0]
# Sample random timesteps
t = torch.randint(0, T, (batch_size,), device=x_0.device)
# Precomputed noise schedule
alpha_bar = cosine_alpha_bar_schedule(T) # [T]
alpha_bar_t = alpha_bar[t].unsqueeze(-1) # [B, 1]
# Sample noise and create noisy data
epsilon = torch.randn_like(x_0)
z_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * epsilon
# Predict the noise
epsilon_pred = model(z_t, t)
# MSE loss between true and predicted noise
loss = torch.nn.functional.mse_loss(epsilon_pred, epsilon)
return loss
@torch.no_grad()
def ddpm_sample(model, shape, T=1000):
"""Generate a sample by iterative denoising."""
z = torch.randn(shape) # Start from pure noise
alpha_bar = cosine_alpha_bar_schedule(T)
for t in reversed(range(T)):
epsilon_pred = model(z, torch.tensor([t]))
# Reverse step: remove predicted noise, add a bit of fresh noise
alpha_t = 1 - (alpha_bar[t] / alpha_bar[t-1] if t > 0 else alpha_bar[t])
z = (1 / torch.sqrt(1 - alpha_t)) * (
z - alpha_t / torch.sqrt(1 - alpha_bar[t]) * epsilon_pred
)
if t > 0:
z += torch.sqrt(alpha_t) * torch.randn_like(z)
return z
Flow Matching
Flow matching offers a simpler alternative to diffusion. Instead of a discrete Markov chain with a fixed noise schedule, we define a continuous-time flow that transports noise to data along straight paths.
Given a noise sample x0 ~ N(0, I) and a data sample x1, the conditional optimal transport (OT) flow interpolates linearly:
The velocity field along this path is simply:
We train a neural network vθ(xt, t) to predict this velocity:
At inference, we start from x0 ~ N(0, I) and integrate the learned velocity field using an ODE solver (e.g., Euler steps):
Diffusion models use a stochastic process (SDE) with a fixed noise schedule — you must choose βt carefully, and inference requires many steps because the path from noise to data is curved. Flow matching uses a deterministic ODE with straight paths — the trajectory from noise to data is a straight line in expectation, so fewer integration steps are needed. This makes flow matching faster at inference time, which is critical for real-time robot control.
def flow_matching_training_step(model, x_1):
"""One flow matching training step."""
batch_size = x_1.shape[0]
# Sample noise (source distribution)
x_0 = torch.randn_like(x_1)
# Sample time uniformly
t = torch.rand(batch_size, 1, device=x_1.device)
# Interpolate: conditional OT path
x_t = (1 - t) * x_0 + t * x_1
# Target velocity: direction from noise to data
u_t = x_1 - x_0
# Predict velocity
v_pred = model(x_t, t.squeeze(-1))
# MSE loss
loss = torch.nn.functional.mse_loss(v_pred, u_t)
return loss
@torch.no_grad()
def flow_matching_sample(model, shape, num_steps=10):
"""Generate a sample by integrating the learned velocity field."""
x = torch.randn(shape)
dt = 1.0 / num_steps
for i in range(num_steps):
t = torch.tensor([i / num_steps])
v = model(x, t)
x = x + v * dt # Euler step
return x
ACT: Action Chunking with Transformers
ACT (Action Chunking with Transformers) by Zhao et al. (2023) combines two key ideas: a CVAE for multimodal action generation, and action chunking — predicting a sequence of k future actions at once rather than a single timestep. Together, these address both the mode averaging and compounding error problems of naive behavioral cloning.
Architecture: CVAE with Transformer backbone. ACT uses a Conditional VAE where both the encoder and decoder are Transformers:
Encoder (training only): Takes the full observation o and the ground-truth action sequence at:t+k as input. Produces a latent code z ~ qφ(z|o, a) that captures "which mode" this demonstration belongs to — the style of the trajectory.
Decoder (training and inference): Takes the observation o and latent code z (sampled from the encoder during training, from the prior N(0, I) during inference), and predicts the full action chunk at:t+k. This is a Transformer decoder with k action query tokens that cross-attend to the observation.
The training loss is the CVAE ELBO:
where β is a KL weight (typically 10) that controls the regularization strength. A low β lets the encoder encode more information into z (more faithful reconstructions); a high β forces z to stay close to the prior (better generalization but blurrier actions).
Predicting k actions at once provides temporal consistency — the predicted trajectory is smooth because all k actions were generated jointly. It also reduces the effective decision frequency by a factor of k: a 50Hz control loop with k=20 chunks makes a new decision only every 0.4 seconds, giving the policy more time to "think" and reducing compounding errors by a factor of 20. The tradeoff is reduced reactivity — the robot commits to a plan and cannot react to unexpected events within a chunk.
Temporal ensembling. In practice, ACT does not simply execute one chunk and then plan a new one. Instead, at each timestep, it generates a new chunk that overlaps with previous chunks. The executed action is an exponentially weighted average of all active chunks' predictions for that timestep, with recent chunks weighted more heavily. This provides smooth transitions between plans.
Each colored trajectory is one action chunk (k actions predicted simultaneously). Overlapping chunks are averaged with exponential weighting to produce the smooth executed trajectory (white line). Click "Step" to advance time and generate a new chunk.
# LeRobot: Training ACT on your own data
from lerobot.configs.policies import PreTrainedConfig
from lerobot.policies.act.modeling_act import ACTPolicy
# ACT configuration
config = PreTrainedConfig(
policy_type="act",
chunk_size=100, # Predict 100 future actions at once
n_action_steps=100, # Execute all 100 before re-planning
input_shapes={
"observation.images.top": [3, 480, 640],
"observation.state": [14], # 7 joint angles x 2 arms
},
output_shapes={"action": [14]},
latent_dim=32, # CVAE latent dimension
kl_weight=10.0, # Beta for KL regularization
n_obs_steps=1, # Single-frame observation
)
policy = ACTPolicy(config)
# Training step
batch = next(iter(dataloader))
output = policy.forward(batch)
# output.loss = reconstruction_loss + kl_weight * kl_loss
# Inference: sample z from prior, decode action chunk
policy.eval()
with torch.no_grad():
action_chunk = policy.select_action(observation)
# Returns: [chunk_size, action_dim] = [100, 14]
Diffusion Policy
Diffusion Policy (Chi et al., 2024) applies DDPM to action prediction. Instead of generating images, the diffusion model generates action chunks — sequences of robot commands — conditioned on observations. The key architectural choices determine how observations and noise interact.
Observation conditioning. The observation is encoded into a feature vector and injected into the denoising network via cross-attention (for Transformer backbones) or FiLM conditioning (for UNet backbones). Critically, the observation is not noised — only the action is corrupted during the forward process. The observation provides a clean conditioning signal at every denoising step.
1D temporal UNet. The primary architecture treats the action chunk as a 1D signal: the "spatial" dimension is time (action steps), and the "channel" dimension is the action vector. 1D convolutions along the time axis capture temporal patterns. This is computationally efficient and works well for proprioceptive observations.
2D image-conditioned UNet. When the observation includes images, a visual encoder (e.g., ResNet-18) processes the image into a feature map, which is concatenated with the noisy action along the channel dimension. The UNet then operates on this combined representation.
Standard DDPM requires T=1000 denoising steps, far too slow for 50Hz robot control. DDIM (Denoising Diffusion Implicit Models) uses a deterministic update rule that allows skipping steps, reducing inference to 10-50 steps with minimal quality loss. For Diffusion Policy, DDIM with 10 steps is standard — each action chunk takes ~50ms to generate on a GPU, just barely fast enough for real-time control. This is still a bottleneck that motivates the async inference pipeline discussed in the next section.
The training procedure mirrors standard DDPM. Given an observation o and ground-truth action chunk a0 = at:t+k:
- Sample a diffusion timestep t ~ Uniform(1, ..., T)
- Sample noise ε ~ N(0, I)
- Create the noisy action: at = √āt · a0 + √(1 - āt) · ε
- Predict the noise: ε̂ = εθ(at, t, o)
- Loss: ||ε - ε̂||2
At inference, we start from aT ~ N(0, I) and iteratively denoise, conditioning on the current observation at every step. The result is an action chunk a0 that can be executed directly.
# LeRobot: Training Diffusion Policy
from lerobot.configs.policies import PreTrainedConfig
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
config = PreTrainedConfig(
policy_type="diffusion",
chunk_size=16, # Predict 16 future actions
n_action_steps=8, # Execute 8, then re-plan (overlap)
input_shapes={
"observation.images.top": [3, 96, 96],
"observation.state": [2],
},
output_shapes={"action": [2]},
num_inference_steps=10, # DDIM steps (down from 1000)
noise_scheduler_type="DDIM",
down_dims=[256, 512, 1024], # 1D UNet channel progression
n_obs_steps=2, # 2-frame observation history
)
policy = DiffusionPolicy(config)
# Training: standard DDPM objective
batch = next(iter(dataloader))
output = policy.forward(batch)
# output.loss = MSE(epsilon, epsilon_pred)
# Inference: iterative denoising conditioned on observation
policy.eval()
with torch.no_grad():
action_chunk = policy.select_action(observation)
# Returns: [chunk_size, action_dim] = [16, 2]
# Execute first n_action_steps=8, then re-plan
| Feature | ACT | Diffusion Policy |
|---|---|---|
| Generative model | CVAE | DDPM / DDIM |
| Backbone | Transformer | 1D UNet or Transformer |
| Latent space | Explicit (z ~ N(0,I)) | Implicit (iterative denoising) |
| Inference steps | 1 (single forward pass) | 10-50 (iterative denoising) |
| Training signal | Reconstruction + KL | Noise prediction MSE |
| Multimodality | Via latent variable z | Via different noise initializations |
| Inference speed | Fast (single pass) | Slower (multiple passes) |
| Typical chunk size | 100 | 16 |
Optimized Inference
There is a fundamental tension in robot learning: generative policies (especially diffusion-based ones) produce high-quality, multimodal action predictions, but they are slow. A single DDIM denoising pass with 10 steps on a 1D UNet takes 30-80ms on a modern GPU. Meanwhile, the robot's control loop runs at 50Hz (20ms per step) or faster. If the policy cannot generate actions faster than the control loop consumes them, the robot stutters or freezes.
LeRobot solves this with an asynchronous inference pipeline that decouples action planning from action execution using two threads:
Planning thread. Runs the neural network policy at whatever speed it can manage. Each time it finishes generating an action chunk, it pushes it into a shared queue and immediately starts generating the next chunk using the latest observation.
Execution thread. Runs at the robot's control frequency (e.g., 50Hz). At each control step, it pops the next action from the queue and sends it to the robot. If no new chunk is available yet, it continues executing actions from the previous chunk.
The async pipeline does more than just hide latency — it enables a form of temporal consistency that is impossible with synchronous execution. Because the planning thread always starts from the most recent observation, the action queue always contains the freshest possible plan. And because the execution thread can continue executing a previous chunk while a new one is being computed, there are no gaps in robot motion. The action queue also enables natural chunk interpolation: when a new chunk arrives, the execution thread can blend between the old and new plans rather than switching abruptly.
import threading
import queue
import time
class AsyncInferencePipeline:
"""Decoupled planning and execution for real-time robot control."""
def __init__(self, policy, robot, control_freq=50):
self.policy = policy
self.robot = robot
self.control_freq = control_freq
self.action_queue = queue.Queue(maxsize=1)
self.latest_obs = None
self.current_chunk = None
self.chunk_index = 0
self.running = False
def planning_loop(self):
"""Runs in a background thread. Generates action chunks."""
while self.running:
if self.latest_obs is not None:
obs = self.latest_obs
with torch.no_grad():
chunk = self.policy.select_action(obs)
# Replace whatever is in the queue
try:
self.action_queue.get_nowait()
except queue.Empty:
pass
self.action_queue.put(chunk)
def execution_loop(self):
"""Runs in the main thread at control frequency."""
dt = 1.0 / self.control_freq
while self.running:
start = time.time()
# Get latest observation
self.latest_obs = self.robot.get_observation()
# Check for new action chunk
try:
self.current_chunk = self.action_queue.get_nowait()
self.chunk_index = 0
except queue.Empty:
pass
# Execute current action
if self.current_chunk is not None:
if self.chunk_index < len(self.current_chunk):
action = self.current_chunk[self.chunk_index]
self.robot.send_action(action)
self.chunk_index += 1
# Maintain control frequency
elapsed = time.time() - start
time.sleep(max(0, dt - elapsed))
def run(self):
self.running = True
planner = threading.Thread(target=self.planning_loop, daemon=True)
planner.start()
self.execution_loop()
This pattern — separating slow inference from fast control — is a recurring theme in deployed robot systems. It appears in autonomous driving (perception runs at 10Hz, control at 100Hz), in drone flight (planning at 5Hz, attitude control at 500Hz), and now in manipulation. The key insight is that action chunking makes this separation natural: the policy produces a plan (the chunk), and the execution thread follows the plan until a new one arrives.
Generalist Robot Policies (VLAs)
Everything we have discussed so far trains one policy for one task on one robot. But the ultimate vision of robot learning is a generalist policy — a single model that can control multiple robot embodiments, perform many tasks, and accept natural language instructions. Vision-Language-Action models (VLAs) pursue this vision by building on the success of large Vision-Language Models (VLMs).
The core idea is simple: take a pretrained VLM that already understands images and language, and fine-tune it to also output robot actions. The VLM provides a rich visual and semantic representation that transfers across tasks and embodiments. The action output head maps this representation to the specific robot's action space.
π0: VLM + Flow Matching Action Expert
π0 (pi-zero, Black et al. 2024) is the most capable VLA to date. Its architecture consists of two components:
1. VLM backbone. A pretrained vision-language model (PaLI-Gemma 2B) that processes images and language instructions. It produces a sequence of tokens representing the visual scene and the task description.
2. Flow matching action expert. Instead of discretizing actions into tokens (which loses precision), π0 uses a continuous flow matching head. The VLM features condition a flow matching model that generates action chunks. This is the key innovation — combining the semantic understanding of VLMs with the precision of continuous generative models.
π0 was trained on over 10,000 hours of robot demonstration data across 7 different robot embodiments, including single-arm manipulators, bimanual setups, and mobile manipulators. The same model weights control all of them — the model learns to adapt its output to the action space and dynamics of each embodiment.
π0 uses flow matching instead of diffusion for a practical reason: flow matching requires fewer denoising steps (typically 5-10 vs 50-100 for DDPM). When your model backbone is a 2B+ parameter VLM, every inference step is expensive. Flow matching's straight-line trajectories from noise to data mean the ODE can be integrated with fewer Euler steps while maintaining action quality. This is the difference between 100ms and 500ms per action chunk — real-time vs unusable for manipulation.
SmolVLA: Efficiency Without Compromise
π0 is powerful but large — it requires significant GPU resources for both training and inference. SmolVLA (Shukor et al., 2025) asks: can we get comparable performance with a much smaller model that runs on consumer hardware?
SmolVLA uses a 250M parameter architecture built on SmolVLM (a small VLM) with a flow matching action head. Despite being roughly 10x smaller than π0, SmolVLA achieves competitive performance on standard manipulation benchmarks.
Key design decisions:
- Efficient visual encoding: Fewer image tokens via aggressive spatial pooling
- Shared backbone: The same Transformer processes both language tokens and action queries
- Knowledge distillation: Training on data generated by larger models
- Hardware target: Runs inference on a single consumer GPU (e.g., RTX 3090) or even a laptop with an M-series chip
| Feature | π0 | SmolVLA |
|---|---|---|
| Parameters | ~3B | ~250M |
| VLM backbone | PaLI-Gemma 2B | SmolVLM |
| Action head | Flow matching | Flow matching |
| Training data | 10K+ hours, 7 embodiments | Smaller, curated datasets |
| Inference hardware | Server GPU (A100/H100) | Consumer GPU (RTX 3090) or Mac |
| Language conditioning | Yes | Yes |
| Cross-embodiment | Yes (7 robots) | Yes (fewer embodiments) |
# LeRobot: Using a pretrained VLA for inference
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.configs.policies import PreTrainedConfig
# Load pretrained SmolVLA
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()
# Observation includes image + state + language instruction
observation = {
"observation.images.top": camera_image, # [3, 480, 640]
"observation.state": joint_state, # [6]
"task": "Pick up the red block and place it in the bowl",
}
# Generate action chunk
with torch.no_grad():
action_chunk = policy.select_action(observation)
# Returns: [chunk_size, action_dim]
# The same model can control different robots —
# it adapts to the action space via the config
print(f"Action shape: {action_chunk.shape}")
# SO-100: [16, 6] (6 joints)
# ALOHA: [16, 14] (7 joints x 2 arms)
VLAs bet on a hypothesis borrowed from NLP: scaling data and model size will produce emergent generalization. Just as GPT-3 learned to perform tasks never seen during training by scaling language data, VLAs aim to learn manipulation skills that generalize to new objects, scenes, and instructions by scaling robot data. π0's results with 10K+ hours of data are encouraging but not conclusive — we may need 100K+ hours (comparable to the text data scale shift from GPT-2 to GPT-3) to see true emergent manipulation capabilities.
Practical Guide
The LeRobot ecosystem provides everything needed to go from a physical robot to a deployed learned policy. This section covers the data pipeline, dataset format, and the full collect-train-deploy workflow.
LeRobotDataset format. All data is stored in a structured format with three components:
- Tabular data: Joint states and actions in memory-mapped Parquet files, concatenated across episodes
- Visual data: MP4 video files grouped by camera and episode
- Metadata: JSON files describing the schema, frame rates, normalization statistics, and task descriptions
The metadata directory contains:
meta/info.json— Complete dataset schema: feature definitions, FPS, data pathsmeta/stats.json— Aggregated statistics (mean, std, min, max) for input normalizationmeta/tasks.jsonl— Natural language task descriptions mapped to integer indicesmeta/episodes/*— Per-episode metadata: length, task, data pointers
LeRobotDataset supports a delta_timestamps parameter that enables temporal windowing.
When you index into the dataset with dataset[100], instead of getting a single frame,
you can specify offsets like [-0.2, -0.1, 0.0] to get the current frame plus two
frames from the past (200ms and 100ms ago). This is how policies access observation history
without explicit sequence padding — the dataset handles alignment automatically.
# Complete pipeline: Load data, train a policy, deploy
from lerobot.datasets.lerobot_dataset import LeRobotDataset
import torch
# 1. Load dataset with temporal windowing
delta_timestamps = {
"observation.images.wrist_camera": [-0.2, -0.1, 0.0],
"observation.state": [0.0],
"action": [0.0, 0.04, 0.08, 0.12], # Future action chunk
}
dataset = LeRobotDataset(
"lerobot/svla_so101_pickplace",
delta_timestamps=delta_timestamps,
)
# 2. Inspect a sample
sample = dataset[100]
print(f"Image shape: {sample['observation.images.wrist_camera'].shape}")
# [3, 3, 480, 640] — 3 time steps, 3 channels, H, W
print(f"State shape: {sample['observation.state'].shape}")
# [1, 6] — 1 time step, 6 joints
print(f"Action shape: {sample['action'].shape}")
# [4, 6] — 4 future steps, 6 joints
# 3. Create DataLoader
dataloader = torch.utils.data.DataLoader(
dataset, batch_size=16, shuffle=True, num_workers=4
)
# 4. Training loop (simplified)
for epoch in range(100):
for batch in dataloader:
device = "cuda" if torch.cuda.is_available() else "cpu"
observations = batch["observation.state"].to(device)
images = batch["observation.images.wrist_camera"].to(device)
actions = batch["action"].to(device)
output = policy.forward(batch)
output.loss.backward()
optimizer.step()
optimizer.zero_grad()
Data collection with teleoperation. The most common way to collect robot demonstrations is teleoperation — a human operator controls the robot through a leader-follower setup. The SO-100 leader arm mirrors the operator's movements on the SO-100 follower arm, while cameras record the scene.
# Data collection with LeRobot teleoperation
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.robots.so100_follower import SO100Follower, SO100FollowerConfig
from lerobot.teleoperators.so100_leader import SO100Leader, SO100LeaderConfig
from lerobot.scripts.lerobot_record import record_loop
# Hardware configuration
FPS = 30
camera_config = {
"front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS),
}
robot_config = SO100FollowerConfig(
port="/dev/ttyUSB0", id="follower", cameras=camera_config,
)
teleop_config = SO100LeaderConfig(port="/dev/ttyUSB1", id="leader")
# Initialize hardware
robot = SO100Follower(robot_config)
teleop = SO100Leader(teleop_config)
robot.connect()
teleop.connect()
# Create dataset
dataset = LeRobotDataset.create(
repo_id="my_user/pick_and_place_v1",
fps=FPS,
robot_type=robot.name,
use_videos=True,
)
# Record 10 episodes of 60 seconds each
for episode in range(10):
print(f"Recording episode {episode + 1}/10...")
record_loop(
robot=robot,
teleop=teleop,
dataset=dataset,
control_time_s=60,
single_task="Pick up the red cube and place it on the plate",
)
dataset.save_episode()
# Push to HuggingFace Hub
dataset.push_to_hub()
Hardware options. The LeRobot ecosystem supports a range of hardware:
| Robot | Cost | DOF | Use Case |
|---|---|---|---|
| SO-100 | ~$300 | 6 | Single-arm tabletop manipulation |
| SO-101 | ~$350 | 6 | Improved SO-100 with better actuators |
| ALOHA 2 | ~$20K | 14 (7+7) | Bimanual manipulation |
| Humanoids | $50K+ | 20+ | Full-body mobile manipulation |
The beauty of the LeRobot abstraction is that the same training code works across all of these platforms. You change the robot configuration and action dimensions, but the policy architecture, training loop, and data format remain identical. A Diffusion Policy trained on SO-100 demonstrations has the same code structure as one trained on ALOHA data — only the observation and action shapes differ.
The SO-100 costs about the same as a textbook. With a $300 arm, a $50 webcam, and a laptop, you have everything needed to collect demonstrations, train ACT or Diffusion Policy, and deploy a working learned manipulation system. This is the real contribution of LeRobot — not just the algorithms, but the democratization of robot learning research. Five years ago, this required a $100K+ lab setup.
References
- Capuano, Pascal, Zouitine, Wolf & Aractingi. "Robot Learning: A Tutorial." arXiv, 2025. arXiv:2510.12403
- Zhao et al. "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." RSS, 2023. arXiv:2304.13705
- Chi et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." RSS, 2024. arXiv:2303.04137
- Black et al. "π0: A Vision-Language-Action Flow Model for General Robot Control." arXiv, 2024. arXiv:2410.24164
- Shukor et al. "SmolVLA: A Small Vision-Language-Action Model for Robot Learning." arXiv, 2025. arXiv:2506.01844
- Ho, Jain & Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020. arXiv:2006.11239
- Song, Meng & Ermon. "Denoising Diffusion Implicit Models." ICLR, 2021. arXiv:2010.02502
- Lipman et al. "Flow Matching for Generative Modeling." ICLR, 2023. arXiv:2210.02747
- Haarnoja et al. "Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor." ICML, 2018. arXiv:1801.01290
- Luo et al. "Serl: A Software Suite for Sample-Efficient Robotic Reinforcement Learning." arXiv, 2024. arXiv:2401.16013
- Kingma & Welling. "Auto-Encoding Variational Bayes." ICLR, 2014. arXiv:1312.6114
- Hansen et al. "Temporal Difference Learning for Model Predictive Control." ICML, 2022. arXiv:2203.04955