Robot Learning Tutorial — Engineermaxxing

Introduction

Imagine you want a robot arm to pick up a coffee mug. In classical robotics, you would write equations describing the arm's geometry, build a perception pipeline to locate the mug, compute a collision-free trajectory, design a grasp planner, tune PID controllers for each joint, and pray that your analytical model of friction is accurate enough. Each component is hand-engineered, each interface is brittle, and the whole system shatters the moment you swap the mug for a wine glass.

Now imagine a different approach: you teleoperate the robot through a hundred mug-grasping demonstrations, then train a neural network to map camera images directly to motor commands. No kinematics derivation, no perception pipeline, no grasp planner. The robot learns the entire behavior end-to-end from data. This is the shift from model-based to data-driven robotics, and it is the central story of this tutorial.

This article is based on "Robot Learning: A Tutorial" by Capuano, Pascal, Zouitine, Wolf, and Aractingi from HuggingFace (arXiv:2510.12403). Their tutorial is remarkable for two reasons: it covers the entire stack from forward kinematics to vision-language-action models, and every algorithm comes with runnable code built on LeRobot — HuggingFace's open-source robot learning library.

ℹ What this article covers

We start with the classical robotics that learning-based methods aim to replace: forward/inverse kinematics, Jacobians, and feedback control. Then we build up through reinforcement learning, behavioral cloning, generative action models (VAE, diffusion, flow matching), and the specific architectures that dominate modern robot learning: ACT and Diffusion Policy. We finish with generalist robot policies (VLAs) — models like π0 and SmolVLA that can control multiple robot embodiments from natural language instructions.

The progression tells a story of increasing generality. Classical robotics works for one task on one robot with perfect models. RL works for one task on one robot without models but needs millions of interactions. Behavioral cloning learns from demonstrations but struggles with multimodal actions. Generative policies fix the multimodality problem. And VLAs scale to many tasks, many robots, and language-conditioned behavior. Each step addresses a limitation of the previous one.

Classical Robotics

Before we can appreciate why learning-based methods matter, we need to understand what they replace. Classical robotics gives us exact, analytical tools for controlling robot motion — and also shows us exactly where those tools break down.

We will work with a 2-DOF planar manipulator — the simplest possible robot arm. Two links of equal length l, connected by revolute joints with angles θ₁ and θ₂. This is a simplified version of the SO-100 arm used in the LeRobot ecosystem, but it captures all the essential ideas of kinematics.

Forward Kinematics

Forward kinematics (FK) answers the question: "Given joint angles, where is the end-effector?" For our 2-DOF arm, the answer is pure trigonometry. The first link takes us from the base to the elbow:

p elbow = (l \cdot cos(θ 1), l \cdot sin(θ 1))

The second link extends from the elbow to the end-effector, but its angle is measured relative to the first link. So the total position is:

p(q) = x = l \cdot cos(θ 1) + l \cdot cos(θ 1 + θ 2) y = l \cdot sin(θ 1) + l \cdot sin(θ 1 + θ 2)

where q = [θ₁, θ₂] ∈ [-π, +π]² is the configuration (joint-angle vector). The workspace — the set of all reachable positions — is a disk of radius 2l centered at the base. Points at distance exactly 2l require the arm to be fully extended (θ₂ = 0); points near the base require it to fold back on itself.

python

import numpy as np

def forward_kinematics(theta1, theta2, l=1.0):
    """Compute end-effector position for a 2-DOF planar arm."""
    x = l * np.cos(theta1) + l * np.cos(theta1 + theta2)
    y = l * np.sin(theta1) + l * np.sin(theta1 + theta2)
    return np.array([x, y])

# Example: both joints at 45 degrees
pos = forward_kinematics(np.pi/4, np.pi/4)
print(f"End-effector at ({pos[0]:.3f}, {pos[1]:.3f})")
# End-effector at (0.707, 1.707)

Inverse Kinematics

Inverse kinematics (IK) is the reverse problem: "Given a target position p*, what joint angles q reach it?" This is harder because FK is many-to-one — multiple configurations can produce the same end-effector position (think "elbow up" vs "elbow down"). We formulate IK as an optimization problem:

min q \in Q ||p(q) - p*|| 22

For our 2-DOF arm, this has a closed-form solution using the law of cosines. But for real robots with 6+ joints, closed-form solutions rarely exist. Instead, we use iterative numerical methods.

Differential Inverse Kinematics

The most practical approach is differential IK, which works in velocity space. Differentiate the FK equation with respect to time:

ṗ = J(q) \cdot q̇

where J(q) = ∂f_FK(q) / ∂q is the Jacobian matrix. For our 2-DOF arm, this is a 2×2 matrix. We want to find joint velocities q̇ that produce a desired end-effector velocity ṗ*. The solution uses the Moore-Penrose pseudoinverse:

q̇ = J(q) + \cdot ṗ*

In practice, we add a proportional feedback term to correct for drift:

q̇ = J(q) + \cdot (ṗ* + k p \cdot Δp)

where Δp = p* - p(q) is the position error and k_p is a proportional gain. This feedback term acts as a spring pulling the end-effector toward the target — without it, numerical integration errors accumulate and the arm drifts off course.

💡 The Jacobian is the bridge between joint space and task space

The Jacobian J(q) maps small changes in joint angles to small changes in end-effector position. Its pseudoinverse J(q)⁺ does the reverse. When J becomes singular (the arm is fully extended or folded), the pseudoinverse blows up — these are kinematic singularities where the arm loses a degree of freedom. Near singularities, tiny end-effector motions require enormous joint velocities.

python

def jacobian(theta1, theta2, l=1.0):
    """Compute the 2x2 Jacobian for a 2-DOF planar arm."""
    j11 = -l * np.sin(theta1) - l * np.sin(theta1 + theta2)
    j12 = -l * np.sin(theta1 + theta2)
    j21 = l * np.cos(theta1) + l * np.cos(theta1 + theta2)
    j22 = l * np.cos(theta1 + theta2)
    return np.array([[j11, j12], [j21, j22]])

def diff_ik_step(theta1, theta2, target, kp=5.0, dt=0.02, l=1.0):
    """One step of differential IK with proportional feedback."""
    q = np.array([theta1, theta2])
    current_pos = forward_kinematics(theta1, theta2, l)
    error = target - current_pos
    J = jacobian(theta1, theta2, l)
    J_pinv = np.linalg.pinv(J)
    q_dot = J_pinv @ (kp * error)
    q_new = q + q_dot * dt
    return q_new[0], q_new[1]

Interactive 2-DOF Planar Arm Interactive

Drag the orange target point. The arm solves inverse kinematics in real-time using differential IK with Jacobian pseudoinverse. Watch how the arm configuration changes — and what happens near singularities (fully extended or fully folded).

Drag the target to move the arm

Why Learning?

The kinematics we just derived work beautifully for a 2-DOF arm reaching a known position. But real robot tasks — grasping a deformable object, pouring a liquid, folding a towel — expose four fundamental limitations of dynamics-based methods:

1. Brittle multi-component pipelines. A real manipulation system chains together perception (object detection, pose estimation), planning (trajectory optimization, grasp planning), and control (PID, impedance control). Each component is independently designed, and the interfaces between them are fragile. A 5mm error in perception propagates through planning into a failed grasp. There is no end-to-end gradient to optimize the whole pipeline jointly.

2. Poor scaling to multimodal data. Each new sensor modality — a wrist camera, a force-torque sensor, a tactile skin — requires custom processing pipelines. You cannot easily fuse RGB images, depth maps, and proprioceptive state into a unified representation with classical methods.

3. Simplified analytical models. Our FK equation assumed rigid links, frictionless joints, and perfect actuators. Real robots have backlash, cable stretch, temperature-dependent friction, and compliant elements. Analytical models either ignore these effects or become intractably complex.

4. Cannot leverage growing data. The explosion of robot demonstration data — from teleoperation, simulation, and even YouTube videos — is useless to a hand-engineered controller. A PID gain tuned for one scenario does not improve when you collect a thousand more demonstrations.

💡 The monolithic pipeline advantage

Learning-based methods replace the entire perception-planning-control stack with a single neural network: observation in, action out. This is not just engineering convenience — it means the perception system is optimized for the downstream control task, not for some proxy objective like "detect the mug." The network learns to extract exactly the visual features that matter for manipulation, ignoring everything else.

The paper frames the transition elegantly: learning-based techniques use monolithic perception-to-action pipelines that directly map sensory inputs to predicted actions. This streamlines control by removing the need to interface multiple hand-designed components. It enables incorporation of diverse modalities through a single architecture. And it naturally positions the system to leverage growing data — more demonstrations mean better policies, with no additional engineering effort.

python

# Classical approach: multiple hand-designed components
def classical_pick(rgb_image, depth_image, joint_state):
    # Step 1: Detect object (custom CV pipeline)
    mask = segment_object(rgb_image)
    bbox = compute_bounding_box(mask)

    # Step 2: Estimate pose (another custom pipeline)
    obj_pose = estimate_6dof_pose(depth_image, mask)

    # Step 3: Plan grasp (grasp planning library)
    grasp = plan_antipodal_grasp(obj_pose, gripper_model)

    # Step 4: Plan trajectory (motion planning library)
    trajectory = plan_rrt_star(joint_state, grasp, collision_model)

    # Step 5: Execute with PID (hand-tuned gains)
    return pid_controller(joint_state, trajectory[0], kp=50, kd=10)

# Learning approach: one neural network
def learned_pick(rgb_image, joint_state, policy_network):
    observation = torch.cat([
        encode_image(rgb_image),    # Image encoder (learned)
        torch.tensor(joint_state)   # Proprioception
    ])
    action = policy_network(observation)  # Direct output: joint commands
    return action

The rest of this article is about how to train that policy_network — starting with the simplest approach (reinforcement learning with reward signals), progressing through imitation learning from demonstrations, and arriving at generalist models that can follow language instructions across different robot embodiments.

Reinforcement Learning

Reinforcement learning is the most general framework for learning robot behavior: the agent interacts with its environment, receives reward signals, and optimizes a policy to maximize cumulative reward. No demonstrations required — the robot discovers effective behavior through trial and error.

MDP Formulation

We formalize the robot's interaction with its environment as a Markov Decision Process (MDP), defined by the tuple:

M = ⟨S, A, D, r, γ, ρ, T⟩

where:

S is the state space — joint angles, velocities, object poses, etc.
A is the action space — motor torques, desired joint positions, or velocity commands
D(s_t, a_t, s_t+1) = P(s_t+1 | s_t, a_t) are the dynamics — the physics of the environment
r : S × A × S → R is the reward function
γ ∈ [0, 1) is the discount factor — how much we value future rewards
ρ is the initial state distribution
T is the horizon — maximum episode length

A trajectory is a sequence of state-action-reward tuples:

τ = (s 0, a 0, r 0, s 1, a 1, r 1, ..., s T)

The discounted return of a trajectory is:

G(τ) = \sum t=0 T-1 γ t r t

The RL objective is to find a policy π_θ(a|s) that maximizes the expected return:

J(π θ) = E τ ~ P θ;D [G(τ)]

where the trajectory distribution factors as:

P θ;D (τ) = ρ(s 0) \cdot \prod t=0 T-1 D(s t, a t, s t+1) \cdot π θ (a t | s t)

ℹ Why the discount factor matters for robots

Setting γ = 0.99 vs γ = 0.95 has a dramatic effect on robot behavior. With γ = 0.99, a reward 100 steps in the future is worth 0.99¹⁰⁰ ≈ 0.37 of its face value — the robot plans far ahead. With γ = 0.95, that same future reward is worth 0.95¹⁰⁰ ≈ 0.006 — essentially zero. Higher discount factors encourage long-horizon planning (useful for multi-step manipulation) but make learning slower and less stable.

Value Functions

Two value functions are central to RL algorithms:

The state-value function V_π(s) tells us the expected return starting from state s and following policy π thereafter:

V π (s) = E τ ~ π [G(τ) | s 0 = s]

The state-action value function Q_π(s, a) tells us the expected return of taking action a in state s, then following π:

Q π (s, a) = E τ ~ π [G(τ) | s 0 = s, a 0 = a]

These are connected by the Bellman consistency equations:

Q π (s t, a t) = E s t+1 [r t + γ \cdot V π (s t+1)]

V π (s t) = E a t ~ π [Q π (s t, a t)]

The Bellman equations are recursive — they express the value of a state in terms of the value of successor states. This recursion is the foundation of every value-based RL algorithm.

Deep RL Methods

Deep Q-Learning (DQN) learns Q_θ(s, a) with a neural network, using the temporal difference (TD) loss:

L(θ i) = E (s t, a t) [(y i - Q θ i (s t, a t)) 2]

where the TD target is:

y i = r t + γ \cdot max a' Q θ i-1 (s t+1, a')

DQN works well for discrete action spaces (Atari games), but robot actions are continuous (joint torques, end-effector velocities). We cannot take a max over a continuous action space.

Deep Deterministic Policy Gradient (DDPG) solves this by learning a separate deterministic policy μ_φ(s) that outputs the action maximizing Q. The policy gradient is:

\nabla φ J = E s [\nabla a Q(s, a)| a=μ φ (s) \cdot \nabla φ μ φ (s)]

Soft Actor-Critic (SAC) adds an entropy bonus to encourage exploration:

J(π) = \sum t=0 T E (s t, a t) [r t + α \cdot H(π(\cdot|s t))]

where α is the temperature parameter controlling the exploration-exploitation tradeoff. SAC is the most popular off-policy RL algorithm for continuous control because the entropy term prevents premature convergence to suboptimal deterministic policies.

💡 HIL-SERL: Making real-world RL practical

Pure RL is sample-inefficient — millions of interactions just to learn basic manipulation. Human-in-the-Loop Sample-Efficient RL (HIL-SERL) changes the game: a human operator intervenes during training to correct dangerous or unproductive behavior, providing both a safety mechanism and a curriculum. With HIL-SERL, real-world RL agents learn policies with near-perfect success rates on challenging manipulation tasks in just 1-2 hours of real interaction.

python

# Simplified SAC training loop for robot control
import torch
import torch.nn as nn
import torch.optim as optim

class SACPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std_head = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.net(state)
        mean = self.mean_head(h)
        log_std = self.log_std_head(h).clamp(-20, 2)
        return mean, log_std

    def sample(self, state):
        mean, log_std = self.forward(state)
        std = log_std.exp()
        normal = torch.distributions.Normal(mean, std)
        z = normal.rsample()  # Reparameterization trick
        action = torch.tanh(z)
        # Log-prob with tanh correction
        log_prob = normal.log_prob(z) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1)

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# SAC soft TD target:
# y = r + gamma * (Q_target(s', a') - alpha * log_pi(a'|s'))
# where a' ~ pi(.|s')

Behavioral Cloning

Reinforcement learning requires a reward function, and designing good reward functions for manipulation tasks is notoriously difficult. How do you assign a scalar reward to "fold a towel neatly"? Behavioral cloning (BC) sidesteps this problem entirely: instead of learning from rewards, we learn from demonstrations.

Given a dataset of expert demonstrations D = {(o_i, a_i)}_i=1..N — pairs of observations and the actions the expert took — BC reduces robot learning to supervised regression:

min θ E (o, a) ~ D [||f θ (o) - a|| 2]

This is just mean squared error. Train a neural network f_θ to predict the expert's action given the observation. Simple, stable, and fast to train. But two fundamental problems make naive BC unreliable for robotics:

Problem 1: Compounding errors (covariate shift). During training, the network sees observations from the expert's state distribution. During deployment, the robot executes its own (imperfect) actions, drifting to states the expert never visited. Small prediction errors compound — the robot enters unfamiliar territory, makes larger errors, enters even more unfamiliar territory, and the policy diverges catastrophically. A tiny angular error at step 1 becomes a 10cm position error by step 50.

Problem 2: Mode averaging. Consider a robot that needs to go around an obstacle. Half the demonstrations go left, half go right. Both are valid. But the MSE loss, which averages over these modes, predicts the mean action — which goes straight into the obstacle. When the demonstration distribution is multimodal (multiple valid behaviors for the same observation), regression to the mean produces behavior that matches none of the modes.

💡 Mode averaging is the killer problem

Compounding errors can be mitigated with more data and DAgger-style online correction. But mode averaging is a fundamental limitation of the MSE loss — it cannot represent multimodal action distributions. This is what motivates the entire next section on generative models. We need policies that can represent and sample from multi-modal distributions, not just predict the mean.

python

import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Naive behavioral cloning policy — MSE regression."""
    def __init__(self, obs_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs):
        return self.net(obs)

# Training loop
policy = BCPolicy(obs_dim=64, action_dim=6)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

for batch in dataloader:
    obs, expert_action = batch["observation"], batch["action"]
    predicted_action = policy(obs)
    loss = nn.functional.mse_loss(predicted_action, expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Problem: if demonstrations contain both "go left" and "go right"
# for the same observation, the policy learns "go straight" (average)
# which may be catastrophically wrong.

Both problems point in the same direction: we need action predictions that are not single points but rather distributions. If the policy can represent "go left with 50% probability, go right with 50% probability," mode averaging disappears. And if the policy outputs a chunk of future actions rather than a single timestep, compounding errors are reduced. These insights lead us to generative models.

Generative Models for Actions

The mode averaging problem demands policies that can represent multimodal distributions. Three families of generative models have been adapted for action prediction: Variational Autoencoders (VAEs), Diffusion Models, and Flow Matching. Each offers a different mechanism for transforming simple noise into structured, multimodal action distributions.

Variational Autoencoders & ELBO

A VAE learns to generate data by introducing a latent variable z. Instead of directly modeling p(x), we model p(x|z) and p(z), then marginalize:

p(x) = \int p(x|z) p(z) dz

This integral is intractable, so we introduce an approximate posterior q_φ(z|x) and derive the Evidence Lower Bound (ELBO):

log p(x) \geq E q(z|x) [log p(x|z)] - KL(q φ (z|x) || p(z))

The first term is the reconstruction loss — how well can we reconstruct x from z? The second term is the KL regularization — how close is our learned posterior to the prior? We choose p(z) = N(0, I) as a simple Gaussian prior, and q_φ(z|x) = N(μ_φ(x), σ_φ(x)) as a diagonal Gaussian parameterized by an encoder network.

Training maximizes the ELBO, which is equivalent to minimizing:

L VAE = -E q(z|x) [log p θ (x|z)] + KL(q φ (z|x) || N(0, I))

For robot learning, x is an action (or action sequence), and we condition everything on the observation o. The decoder becomes p_θ(a|z, o) — given a latent code and an observation, generate an action. The latent variable z captures the "mode" of the demonstration: left vs right, fast vs slow, precise vs forceful.

python

class ActionVAE(nn.Module):
    """VAE for action generation, conditioned on observations."""
    def __init__(self, obs_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        # Encoder: q(z | o, a) — only used during training
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)

        # Decoder: p(a | z, o) — used during training and inference
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def encode(self, obs, action):
        h = self.encoder(torch.cat([obs, action], dim=-1))
        return self.mu_head(h), self.logvar_head(h)

    def reparameterize(self, mu, logvar):
        std = (0.5 * logvar).exp()
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, obs, z):
        return self.decoder(torch.cat([obs, z], dim=-1))

    def loss(self, obs, action):
        mu, logvar = self.encode(obs, action)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(obs, z)
        recon_loss = nn.functional.mse_loss(recon, action)
        kl_loss = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon_loss + kl_loss

Diffusion Models

Diffusion models take a completely different approach to generation. Instead of learning an explicit latent space, they learn to reverse a gradual noising process.

Forward process. Starting from clean data z₀, we add Gaussian noise over T steps according to a fixed schedule β₁, ..., β_T:

q(z t | z t-1) = N(\sqrt(1 - β t) \cdot z t-1, β t \cdot I)

At each step, we scale down the signal by √(1 - β_t) and add noise with variance β_t. After enough steps, z_T ≈ N(0, I) — pure noise. Defining α_t = 1 - β_t and ā_t = ∏_s=1^t α_s, we can jump directly to any timestep:

q(z t | z 0) = N(\sqrtā t \cdot z 0, (1 - ā t) \cdot I)

Reverse process. Generation starts from z_T ~ N(0, I) and iteratively denoises. We train a neural network ε_θ(z_t, t) to predict the noise that was added at each step. The DDPM training loss simplifies beautifully to:

L DDPM = E t, z 0, ε [||ε - ε θ (z t, t)|| 2]

where ε ~ N(0, I) is the actual noise, z_t = √ā_t · z₀ + √(1-ā_t) · ε, and t is uniformly sampled from {1, ..., T}. We are simply training a denoiser — given a noisy sample and the noise level, predict the noise.

💡 Why diffusion models excel at multimodal distributions

A diffusion model does not predict a single action — it iteratively sculpts a noise sample into an action. Different noise initializations lead to different modes of the action distribution. The network never needs to "average" between modes because it is always denoising toward the nearest clean data point. This is fundamentally different from MSE regression, which cannot represent multimodality.

Diffusion Forward & Reverse Process Interactive

Watch the forward process gradually corrupt a 2D action distribution into noise, then the reverse process recover structured data. Click "Forward" to add noise step by step, "Reverse" to denoise. The bimodal distribution (two clusters) is preserved through denoising but would be averaged by MSE.

t = 0 (clean data)

python

import torch

def ddpm_training_step(model, x_0, T=1000):
    """One DDPM training step: sample t, add noise, predict noise."""
    batch_size = x_0.shape[0]

    # Sample random timesteps
    t = torch.randint(0, T, (batch_size,), device=x_0.device)

    # Precomputed noise schedule
    alpha_bar = cosine_alpha_bar_schedule(T)  # [T]
    alpha_bar_t = alpha_bar[t].unsqueeze(-1)  # [B, 1]

    # Sample noise and create noisy data
    epsilon = torch.randn_like(x_0)
    z_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * epsilon

    # Predict the noise
    epsilon_pred = model(z_t, t)

    # MSE loss between true and predicted noise
    loss = torch.nn.functional.mse_loss(epsilon_pred, epsilon)
    return loss

@torch.no_grad()
def ddpm_sample(model, shape, T=1000):
    """Generate a sample by iterative denoising."""
    z = torch.randn(shape)  # Start from pure noise
    alpha_bar = cosine_alpha_bar_schedule(T)

    for t in reversed(range(T)):
        epsilon_pred = model(z, torch.tensor([t]))
        # Reverse step: remove predicted noise, add a bit of fresh noise
        alpha_t = 1 - (alpha_bar[t] / alpha_bar[t-1] if t > 0 else alpha_bar[t])
        z = (1 / torch.sqrt(1 - alpha_t)) * (
            z - alpha_t / torch.sqrt(1 - alpha_bar[t]) * epsilon_pred
        )
        if t > 0:
            z += torch.sqrt(alpha_t) * torch.randn_like(z)
    return z

Flow Matching

Flow matching offers a simpler alternative to diffusion. Instead of a discrete Markov chain with a fixed noise schedule, we define a continuous-time flow that transports noise to data along straight paths.

Given a noise sample x₀ ~ N(0, I) and a data sample x₁, the conditional optimal transport (OT) flow interpolates linearly:

x t = (1 - t) \cdot x 0 + t \cdot x 1, t \in [0, 1]

The velocity field along this path is simply:

u t (x | x 1) = x 1 - x 0

We train a neural network v_θ(x_t, t) to predict this velocity:

L FM = E t, x 0, x 1 [||v θ (x t, t) - u t (x t | x 1)|| 2]

At inference, we start from x₀ ~ N(0, I) and integrate the learned velocity field using an ODE solver (e.g., Euler steps):

x t+Δt = x t + v θ (x t, t) \cdot Δt

ℹ Flow matching vs diffusion: the key difference

Diffusion models use a stochastic process (SDE) with a fixed noise schedule — you must choose β_t carefully, and inference requires many steps because the path from noise to data is curved. Flow matching uses a deterministic ODE with straight paths — the trajectory from noise to data is a straight line in expectation, so fewer integration steps are needed. This makes flow matching faster at inference time, which is critical for real-time robot control.

python

def flow_matching_training_step(model, x_1):
    """One flow matching training step."""
    batch_size = x_1.shape[0]

    # Sample noise (source distribution)
    x_0 = torch.randn_like(x_1)

    # Sample time uniformly
    t = torch.rand(batch_size, 1, device=x_1.device)

    # Interpolate: conditional OT path
    x_t = (1 - t) * x_0 + t * x_1

    # Target velocity: direction from noise to data
    u_t = x_1 - x_0

    # Predict velocity
    v_pred = model(x_t, t.squeeze(-1))

    # MSE loss
    loss = torch.nn.functional.mse_loss(v_pred, u_t)
    return loss

@torch.no_grad()
def flow_matching_sample(model, shape, num_steps=10):
    """Generate a sample by integrating the learned velocity field."""
    x = torch.randn(shape)
    dt = 1.0 / num_steps

    for i in range(num_steps):
        t = torch.tensor([i / num_steps])
        v = model(x, t)
        x = x + v * dt  # Euler step

    return x

ACT: Action Chunking with Transformers

ACT (Action Chunking with Transformers) by Zhao et al. (2023) combines two key ideas: a CVAE for multimodal action generation, and action chunking — predicting a sequence of k future actions at once rather than a single timestep. Together, these address both the mode averaging and compounding error problems of naive behavioral cloning.

Architecture: CVAE with Transformer backbone. ACT uses a Conditional VAE where both the encoder and decoder are Transformers:

Encoder (training only): Takes the full observation o and the ground-truth action sequence a_t:t+k as input. Produces a latent code z ~ q_φ(z|o, a) that captures "which mode" this demonstration belongs to — the style of the trajectory.

Decoder (training and inference): Takes the observation o and latent code z (sampled from the encoder during training, from the prior N(0, I) during inference), and predicts the full action chunk a_t:t+k. This is a Transformer decoder with k action query tokens that cross-attend to the observation.

The training loss is the CVAE ELBO:

L ACT = E z ~ q(z|o,a) [||f θ (o, z) - a t:t+k || 2] + β \cdot KL(q φ (z|o, a) || N(0, I))

where β is a KL weight (typically 10) that controls the regularization strength. A low β lets the encoder encode more information into z (more faithful reconstructions); a high β forces z to stay close to the prior (better generalization but blurrier actions).

💡 Why action chunking works

Predicting k actions at once provides temporal consistency — the predicted trajectory is smooth because all k actions were generated jointly. It also reduces the effective decision frequency by a factor of k: a 50Hz control loop with k=20 chunks makes a new decision only every 0.4 seconds, giving the policy more time to "think" and reducing compounding errors by a factor of 20. The tradeoff is reduced reactivity — the robot commits to a plan and cannot react to unexpected events within a chunk.

Temporal ensembling. In practice, ACT does not simply execute one chunk and then plan a new one. Instead, at each timestep, it generates a new chunk that overlaps with previous chunks. The executed action is an exponentially weighted average of all active chunks' predictions for that timestep, with recent chunks weighted more heavily. This provides smooth transitions between plans.

Action Chunking & Temporal Ensembling Interactive

Each colored trajectory is one action chunk (k actions predicted simultaneously). Overlapping chunks are averaged with exponential weighting to produce the smooth executed trajectory (white line). Click "Step" to advance time and generate a new chunk.

t = 0 | Chunks: 0

python

# LeRobot: Training ACT on your own data
from lerobot.configs.policies import PreTrainedConfig
from lerobot.policies.act.modeling_act import ACTPolicy

# ACT configuration
config = PreTrainedConfig(
    policy_type="act",
    chunk_size=100,         # Predict 100 future actions at once
    n_action_steps=100,     # Execute all 100 before re-planning
    input_shapes={
        "observation.images.top": [3, 480, 640],
        "observation.state": [14],  # 7 joint angles x 2 arms
    },
    output_shapes={"action": [14]},
    latent_dim=32,          # CVAE latent dimension
    kl_weight=10.0,         # Beta for KL regularization
    n_obs_steps=1,          # Single-frame observation
)

policy = ACTPolicy(config)

# Training step
batch = next(iter(dataloader))
output = policy.forward(batch)
# output.loss = reconstruction_loss + kl_weight * kl_loss

# Inference: sample z from prior, decode action chunk
policy.eval()
with torch.no_grad():
    action_chunk = policy.select_action(observation)
    # Returns: [chunk_size, action_dim] = [100, 14]

Diffusion Policy

Diffusion Policy (Chi et al., 2024) applies DDPM to action prediction. Instead of generating images, the diffusion model generates action chunks — sequences of robot commands — conditioned on observations. The key architectural choices determine how observations and noise interact.

Observation conditioning. The observation is encoded into a feature vector and injected into the denoising network via cross-attention (for Transformer backbones) or FiLM conditioning (for UNet backbones). Critically, the observation is not noised — only the action is corrupted during the forward process. The observation provides a clean conditioning signal at every denoising step.

1D temporal UNet. The primary architecture treats the action chunk as a 1D signal: the "spatial" dimension is time (action steps), and the "channel" dimension is the action vector. 1D convolutions along the time axis capture temporal patterns. This is computationally efficient and works well for proprioceptive observations.

2D image-conditioned UNet. When the observation includes images, a visual encoder (e.g., ResNet-18) processes the image into a feature map, which is concatenated with the noisy action along the channel dimension. The UNet then operates on this combined representation.

ℹ DDIM: faster inference for real-time control

Standard DDPM requires T=1000 denoising steps, far too slow for 50Hz robot control. DDIM (Denoising Diffusion Implicit Models) uses a deterministic update rule that allows skipping steps, reducing inference to 10-50 steps with minimal quality loss. For Diffusion Policy, DDIM with 10 steps is standard — each action chunk takes ~50ms to generate on a GPU, just barely fast enough for real-time control. This is still a bottleneck that motivates the async inference pipeline discussed in the next section.

The training procedure mirrors standard DDPM. Given an observation o and ground-truth action chunk a₀ = a_t:t+k:

Sample a diffusion timestep t ~ Uniform(1, ..., T)
Sample noise ε ~ N(0, I)
Create the noisy action: a_t = √ā_t · a₀ + √(1 - ā_t) · ε
Predict the noise: ε̂ = ε_θ(a_t, t, o)
Loss: ||ε - ε̂||²

At inference, we start from a_T ~ N(0, I) and iteratively denoise, conditioning on the current observation at every step. The result is an action chunk a₀ that can be executed directly.

python

# LeRobot: Training Diffusion Policy
from lerobot.configs.policies import PreTrainedConfig
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

config = PreTrainedConfig(
    policy_type="diffusion",
    chunk_size=16,              # Predict 16 future actions
    n_action_steps=8,           # Execute 8, then re-plan (overlap)
    input_shapes={
        "observation.images.top": [3, 96, 96],
        "observation.state": [2],
    },
    output_shapes={"action": [2]},
    num_inference_steps=10,     # DDIM steps (down from 1000)
    noise_scheduler_type="DDIM",
    down_dims=[256, 512, 1024], # 1D UNet channel progression
    n_obs_steps=2,              # 2-frame observation history
)

policy = DiffusionPolicy(config)

# Training: standard DDPM objective
batch = next(iter(dataloader))
output = policy.forward(batch)
# output.loss = MSE(epsilon, epsilon_pred)

# Inference: iterative denoising conditioned on observation
policy.eval()
with torch.no_grad():
    action_chunk = policy.select_action(observation)
    # Returns: [chunk_size, action_dim] = [16, 2]
    # Execute first n_action_steps=8, then re-plan

Feature	ACT	Diffusion Policy
Generative model	CVAE	DDPM / DDIM
Backbone	Transformer	1D UNet or Transformer
Latent space	Explicit (z ~ N(0,I))	Implicit (iterative denoising)
Inference steps	1 (single forward pass)	10-50 (iterative denoising)
Training signal	Reconstruction + KL	Noise prediction MSE
Multimodality	Via latent variable z	Via different noise initializations
Inference speed	Fast (single pass)	Slower (multiple passes)
Typical chunk size	100	16

Optimized Inference

There is a fundamental tension in robot learning: generative policies (especially diffusion-based ones) produce high-quality, multimodal action predictions, but they are slow. A single DDIM denoising pass with 10 steps on a 1D UNet takes 30-80ms on a modern GPU. Meanwhile, the robot's control loop runs at 50Hz (20ms per step) or faster. If the policy cannot generate actions faster than the control loop consumes them, the robot stutters or freezes.

LeRobot solves this with an asynchronous inference pipeline that decouples action planning from action execution using two threads:

Planning thread. Runs the neural network policy at whatever speed it can manage. Each time it finishes generating an action chunk, it pushes it into a shared queue and immediately starts generating the next chunk using the latest observation.

Execution thread. Runs at the robot's control frequency (e.g., 50Hz). At each control step, it pops the next action from the queue and sends it to the robot. If no new chunk is available yet, it continues executing actions from the previous chunk.

💡 Why async inference enables temporal consistency

The async pipeline does more than just hide latency — it enables a form of temporal consistency that is impossible with synchronous execution. Because the planning thread always starts from the most recent observation, the action queue always contains the freshest possible plan. And because the execution thread can continue executing a previous chunk while a new one is being computed, there are no gaps in robot motion. The action queue also enables natural chunk interpolation: when a new chunk arrives, the execution thread can blend between the old and new plans rather than switching abruptly.

python

import threading
import queue
import time

class AsyncInferencePipeline:
    """Decoupled planning and execution for real-time robot control."""

    def __init__(self, policy, robot, control_freq=50):
        self.policy = policy
        self.robot = robot
        self.control_freq = control_freq
        self.action_queue = queue.Queue(maxsize=1)
        self.latest_obs = None
        self.current_chunk = None
        self.chunk_index = 0
        self.running = False

    def planning_loop(self):
        """Runs in a background thread. Generates action chunks."""
        while self.running:
            if self.latest_obs is not None:
                obs = self.latest_obs
                with torch.no_grad():
                    chunk = self.policy.select_action(obs)
                # Replace whatever is in the queue
                try:
                    self.action_queue.get_nowait()
                except queue.Empty:
                    pass
                self.action_queue.put(chunk)

    def execution_loop(self):
        """Runs in the main thread at control frequency."""
        dt = 1.0 / self.control_freq
        while self.running:
            start = time.time()

            # Get latest observation
            self.latest_obs = self.robot.get_observation()

            # Check for new action chunk
            try:
                self.current_chunk = self.action_queue.get_nowait()
                self.chunk_index = 0
            except queue.Empty:
                pass

            # Execute current action
            if self.current_chunk is not None:
                if self.chunk_index < len(self.current_chunk):
                    action = self.current_chunk[self.chunk_index]
                    self.robot.send_action(action)
                    self.chunk_index += 1

            # Maintain control frequency
            elapsed = time.time() - start
            time.sleep(max(0, dt - elapsed))

    def run(self):
        self.running = True
        planner = threading.Thread(target=self.planning_loop, daemon=True)
        planner.start()
        self.execution_loop()

This pattern — separating slow inference from fast control — is a recurring theme in deployed robot systems. It appears in autonomous driving (perception runs at 10Hz, control at 100Hz), in drone flight (planning at 5Hz, attitude control at 500Hz), and now in manipulation. The key insight is that action chunking makes this separation natural: the policy produces a plan (the chunk), and the execution thread follows the plan until a new one arrives.

Generalist Robot Policies (VLAs)

Everything we have discussed so far trains one policy for one task on one robot. But the ultimate vision of robot learning is a generalist policy — a single model that can control multiple robot embodiments, perform many tasks, and accept natural language instructions. Vision-Language-Action models (VLAs) pursue this vision by building on the success of large Vision-Language Models (VLMs).

The core idea is simple: take a pretrained VLM that already understands images and language, and fine-tune it to also output robot actions. The VLM provides a rich visual and semantic representation that transfers across tasks and embodiments. The action output head maps this representation to the specific robot's action space.

π0: VLM + Flow Matching Action Expert

π0 (pi-zero, Black et al. 2024) is the most capable VLA to date. Its architecture consists of two components:

1. VLM backbone. A pretrained vision-language model (PaLI-Gemma 2B) that processes images and language instructions. It produces a sequence of tokens representing the visual scene and the task description.

2. Flow matching action expert. Instead of discretizing actions into tokens (which loses precision), π0 uses a continuous flow matching head. The VLM features condition a flow matching model that generates action chunks. This is the key innovation — combining the semantic understanding of VLMs with the precision of continuous generative models.

π0 was trained on over 10,000 hours of robot demonstration data across 7 different robot embodiments, including single-arm manipulators, bimanual setups, and mobile manipulators. The same model weights control all of them — the model learns to adapt its output to the action space and dynamics of each embodiment.

💡 Why flow matching, not diffusion, for VLAs

π0 uses flow matching instead of diffusion for a practical reason: flow matching requires fewer denoising steps (typically 5-10 vs 50-100 for DDPM). When your model backbone is a 2B+ parameter VLM, every inference step is expensive. Flow matching's straight-line trajectories from noise to data mean the ODE can be integrated with fewer Euler steps while maintaining action quality. This is the difference between 100ms and 500ms per action chunk — real-time vs unusable for manipulation.

SmolVLA: Efficiency Without Compromise

π0 is powerful but large — it requires significant GPU resources for both training and inference. SmolVLA (Shukor et al., 2025) asks: can we get comparable performance with a much smaller model that runs on consumer hardware?

SmolVLA uses a 250M parameter architecture built on SmolVLM (a small VLM) with a flow matching action head. Despite being roughly 10x smaller than π0, SmolVLA achieves competitive performance on standard manipulation benchmarks.

Key design decisions:

Efficient visual encoding: Fewer image tokens via aggressive spatial pooling
Shared backbone: The same Transformer processes both language tokens and action queries
Knowledge distillation: Training on data generated by larger models
Hardware target: Runs inference on a single consumer GPU (e.g., RTX 3090) or even a laptop with an M-series chip

Feature	π0	SmolVLA
Parameters	~3B	~250M
VLM backbone	PaLI-Gemma 2B	SmolVLM
Action head	Flow matching	Flow matching
Training data	10K+ hours, 7 embodiments	Smaller, curated datasets
Inference hardware	Server GPU (A100/H100)	Consumer GPU (RTX 3090) or Mac
Language conditioning	Yes	Yes
Cross-embodiment	Yes (7 robots)	Yes (fewer embodiments)

python

# LeRobot: Using a pretrained VLA for inference
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.configs.policies import PreTrainedConfig

# Load pretrained SmolVLA
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()

# Observation includes image + state + language instruction
observation = {
    "observation.images.top": camera_image,          # [3, 480, 640]
    "observation.state": joint_state,                # [6]
    "task": "Pick up the red block and place it in the bowl",
}

# Generate action chunk
with torch.no_grad():
    action_chunk = policy.select_action(observation)
    # Returns: [chunk_size, action_dim]

# The same model can control different robots —
# it adapts to the action space via the config
print(f"Action shape: {action_chunk.shape}")
# SO-100: [16, 6] (6 joints)
# ALOHA:  [16, 14] (7 joints x 2 arms)

ℹ The VLA scaling hypothesis

VLAs bet on a hypothesis borrowed from NLP: scaling data and model size will produce emergent generalization. Just as GPT-3 learned to perform tasks never seen during training by scaling language data, VLAs aim to learn manipulation skills that generalize to new objects, scenes, and instructions by scaling robot data. π0's results with 10K+ hours of data are encouraging but not conclusive — we may need 100K+ hours (comparable to the text data scale shift from GPT-2 to GPT-3) to see true emergent manipulation capabilities.

Practical Guide

The LeRobot ecosystem provides everything needed to go from a physical robot to a deployed learned policy. This section covers the data pipeline, dataset format, and the full collect-train-deploy workflow.

LeRobotDataset format. All data is stored in a structured format with three components:

Tabular data: Joint states and actions in memory-mapped Parquet files, concatenated across episodes
Visual data: MP4 video files grouped by camera and episode
Metadata: JSON files describing the schema, frame rates, normalization statistics, and task descriptions

The metadata directory contains:

meta/info.json — Complete dataset schema: feature definitions, FPS, data paths
meta/stats.json — Aggregated statistics (mean, std, min, max) for input normalization
meta/tasks.jsonl — Natural language task descriptions mapped to integer indices
meta/episodes/* — Per-episode metadata: length, task, data pointers

ℹ The windowing operation: temporal context from a single index

LeRobotDataset supports a delta_timestamps parameter that enables temporal windowing. When you index into the dataset with dataset[100], instead of getting a single frame, you can specify offsets like [-0.2, -0.1, 0.0] to get the current frame plus two frames from the past (200ms and 100ms ago). This is how policies access observation history without explicit sequence padding — the dataset handles alignment automatically.

python

# Complete pipeline: Load data, train a policy, deploy
from lerobot.datasets.lerobot_dataset import LeRobotDataset
import torch

# 1. Load dataset with temporal windowing
delta_timestamps = {
    "observation.images.wrist_camera": [-0.2, -0.1, 0.0],
    "observation.state": [0.0],
    "action": [0.0, 0.04, 0.08, 0.12],  # Future action chunk
}

dataset = LeRobotDataset(
    "lerobot/svla_so101_pickplace",
    delta_timestamps=delta_timestamps,
)

# 2. Inspect a sample
sample = dataset[100]
print(f"Image shape: {sample['observation.images.wrist_camera'].shape}")
# [3, 3, 480, 640] — 3 time steps, 3 channels, H, W
print(f"State shape: {sample['observation.state'].shape}")
# [1, 6] — 1 time step, 6 joints
print(f"Action shape: {sample['action'].shape}")
# [4, 6] — 4 future steps, 6 joints

# 3. Create DataLoader
dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=16, shuffle=True, num_workers=4
)

# 4. Training loop (simplified)
for epoch in range(100):
    for batch in dataloader:
        device = "cuda" if torch.cuda.is_available() else "cpu"
        observations = batch["observation.state"].to(device)
        images = batch["observation.images.wrist_camera"].to(device)
        actions = batch["action"].to(device)

        output = policy.forward(batch)
        output.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Data collection with teleoperation. The most common way to collect robot demonstrations is teleoperation — a human operator controls the robot through a leader-follower setup. The SO-100 leader arm mirrors the operator's movements on the SO-100 follower arm, while cameras record the scene.

python

# Data collection with LeRobot teleoperation
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.robots.so100_follower import SO100Follower, SO100FollowerConfig
from lerobot.teleoperators.so100_leader import SO100Leader, SO100LeaderConfig
from lerobot.scripts.lerobot_record import record_loop

# Hardware configuration
FPS = 30
camera_config = {
    "front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS),
}
robot_config = SO100FollowerConfig(
    port="/dev/ttyUSB0", id="follower", cameras=camera_config,
)
teleop_config = SO100LeaderConfig(port="/dev/ttyUSB1", id="leader")

# Initialize hardware
robot = SO100Follower(robot_config)
teleop = SO100Leader(teleop_config)
robot.connect()
teleop.connect()

# Create dataset
dataset = LeRobotDataset.create(
    repo_id="my_user/pick_and_place_v1",
    fps=FPS,
    robot_type=robot.name,
    use_videos=True,
)

# Record 10 episodes of 60 seconds each
for episode in range(10):
    print(f"Recording episode {episode + 1}/10...")
    record_loop(
        robot=robot,
        teleop=teleop,
        dataset=dataset,
        control_time_s=60,
        single_task="Pick up the red cube and place it on the plate",
    )
    dataset.save_episode()

# Push to HuggingFace Hub
dataset.push_to_hub()

Hardware options. The LeRobot ecosystem supports a range of hardware:

Robot	Cost	DOF	Use Case
SO-100	~$300	6	Single-arm tabletop manipulation
SO-101	~$350	6	Improved SO-100 with better actuators
ALOHA 2	~$20K	14 (7+7)	Bimanual manipulation
Humanoids	$50K+	20+	Full-body mobile manipulation

The beauty of the LeRobot abstraction is that the same training code works across all of these platforms. You change the robot configuration and action dimensions, but the policy architecture, training loop, and data format remain identical. A Diffusion Policy trained on SO-100 demonstrations has the same code structure as one trained on ALOHA data — only the observation and action shapes differ.

💡 The $300 robot learning lab

The SO-100 costs about the same as a textbook. With a $300 arm, a $50 webcam, and a laptop, you have everything needed to collect demonstrations, train ACT or Diffusion Policy, and deploy a working learned manipulation system. This is the real contribution of LeRobot — not just the algorithms, but the democratization of robot learning research. Five years ago, this required a $100K+ lab setup.

References

Capuano, Pascal, Zouitine, Wolf & Aractingi. "Robot Learning: A Tutorial." arXiv, 2025. arXiv:2510.12403
Zhao et al. "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." RSS, 2023. arXiv:2304.13705
Chi et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." RSS, 2024. arXiv:2303.04137
Black et al. "π0: A Vision-Language-Action Flow Model for General Robot Control." arXiv, 2024. arXiv:2410.24164
Shukor et al. "SmolVLA: A Small Vision-Language-Action Model for Robot Learning." arXiv, 2025. arXiv:2506.01844
Ho, Jain & Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020. arXiv:2006.11239
Song, Meng & Ermon. "Denoising Diffusion Implicit Models." ICLR, 2021. arXiv:2010.02502
Lipman et al. "Flow Matching for Generative Modeling." ICLR, 2023. arXiv:2210.02747
Haarnoja et al. "Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor." ICML, 2018. arXiv:1801.01290
Luo et al. "Serl: A Software Suite for Sample-Efficient Robotic Reinforcement Learning." arXiv, 2024. arXiv:2401.16013
Kingma & Welling. "Auto-Encoding Variational Bayes." ICLR, 2014. arXiv:1312.6114
Hansen et al. "Temporal Difference Learning for Model Predictive Control." ICML, 2022. arXiv:2203.04955