Introduction
Reinforcement learning promises a general recipe for robot intelligence: define a reward function, let the agent explore, and wait. In principle, RL can discover optimal policies for any task. In practice, RL for real-world robotics has a crippling problem: it requires the robot to try things — including catastrophically wrong things — millions of times before converging on useful behavior.
Imitation learning offers an alternative path. Instead of learning from trial and error, the robot learns from demonstrations — examples of a human (or another capable agent) performing the task. A human teleoperates the robot to pick up a cup 100 times, and the robot learns a policy that can pick up cups in new configurations. No reward function needed. No dangerous exploration. No millions of episodes.
The simplest form of imitation learning is behavioral cloning (BC): treat the demonstration data as a supervised learning dataset, where observations are inputs and actions are labels, and train a neural network to predict actions from observations. This is exactly what VLA models do — they are behavioral cloning at scale, using foundation model architectures trained on massive demonstration datasets.
This article covers the theory and practice of imitation learning, with a focus on behavioral cloning as the foundation for VLA research. We'll see why BC is both deceptively simple and surprisingly effective, understand its fundamental failure mode (distribution shift), and explore the techniques that modern VLAs use to overcome these limitations.
In this article:
- Why RL alone isn't enough for real-world robotics
- The behavioral cloning objective and its loss functions
- The distribution shift problem and compounding errors
- DAgger and interactive imitation learning
- Modern action representations (action chunking, diffusion policies)
- Data collection via teleoperation
- Scaling laws for behavioral cloning
Why Not Just Use RL?
Before diving into imitation learning, it's worth understanding why reinforcement learning — the theoretically more general approach — has struggled with real-world robot learning.
The reward design problem
RL requires a scalar reward signal at each timestep. For many robotics tasks, designing this signal is deceptively hard. Consider "pick up the cup and place it on the shelf":
- Sparse reward (success/failure only): R = +1 if cup is on shelf, 0 otherwise. The robot must discover the entire pick-place-release sequence through random exploration before receiving any positive signal. For a 7-DOF arm at 10 Hz, a 10-second task has 100 action steps. If each of the 7 action dimensions is discretized into 256 values, there are 256^7 ≈ 7 × 10^16 possible actions per step — and the space of 100-step trajectories is vaster still. Random exploration will almost never succeed.
- Dense reward (shaped to guide behavior): R = -dist(gripper, cup) before grasp, -dist(cup, shelf) after grasp. This provides a gradient toward the goal but introduces new problems: the reward function must be carefully tuned, it can create local optima (the gripper goes to the cup but doesn't grasp), and it encodes the designer's assumptions about how the task should be solved, not just what should be achieved.
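To make the contrast concrete, here is a toy sketch of the two reward styles for the cup-on-shelf task. The function names, the positions-as-vectors representation, and the 5 cm success tolerance are illustrative assumptions, not from any particular environment:

```python
import numpy as np

def sparse_reward(cup_pos: np.ndarray, shelf_pos: np.ndarray,
                  tol: float = 0.05) -> float:
    """+1 only when the cup is within `tol` meters of the shelf target."""
    return 1.0 if np.linalg.norm(cup_pos - shelf_pos) < tol else 0.0

def dense_reward(gripper_pos: np.ndarray, cup_pos: np.ndarray,
                 shelf_pos: np.ndarray, grasped: bool) -> float:
    """Shaped reward: approach the cup before grasping, the shelf after.

    Note how this encodes a *strategy* (go to cup, then go to shelf),
    not just the goal -- the designer's assumptions leak into the signal.
    """
    if not grasped:
        return -float(np.linalg.norm(gripper_pos - cup_pos))
    return -float(np.linalg.norm(cup_pos - shelf_pos))
```

The sparse version gives no gradient until the task is essentially solved; the dense version gives a gradient everywhere but can be exploited (e.g., hovering at the cup without grasping still scores well before the grasp).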
Reward misspecification is a persistent source of failure. Amodei et al. (2016) documented numerous cases where RL agents exploit reward functions in unexpected ways. A robot rewarded for "moving the cup toward the shelf" might learn to bat the cup in the right direction rather than grasping it. Reward design effectively requires programming the behavior — which defeats much of the purpose of learning.
Sample efficiency
Even with well-designed rewards, RL is sample-hungry. PPO and SAC typically require millions of environment interactions for moderately complex manipulation tasks. In simulation this is merely slow; on a real robot it is impossible. A single episode of a pick-and-place task takes 10–30 seconds in real time. One million episodes would take over 300 days of continuous operation — with constant human supervision to reset the environment between episodes.
Imitation learning sidesteps both problems. The "reward" is implicit in the demonstrations — a successful demonstration shows the task being completed, providing supervision at every timestep. And sample efficiency is dramatically better: many tasks can be learned from 50–200 demonstrations, each collected in under a minute.
Imitation learning and RL make complementary trade-offs. IL requires high-quality demonstrations but needs no reward function and few samples. RL requires no demonstrations but needs a reward function and millions of samples. In practice, the two are often combined: IL provides a good initial policy, and RL fine-tunes it with a task reward. This hybrid approach — analogous to RLHF in language models — is increasingly common in robotics, though pure behavioral cloning with sufficient data (the VLA approach) has proven surprisingly competitive.
Behavioral Cloning
Behavioral cloning (Pomerleau, 1989) is the most direct form of imitation learning. The idea is disarmingly simple: collect a dataset of (observation, action) pairs from expert demonstrations, and train a policy network to predict actions from observations using supervised learning.
Dean Pomerleau's ALVINN system (Autonomous Land Vehicle In a Neural Network) demonstrated this in 1989: a neural network trained on images of a road and corresponding steering angles from a human driver could autonomously steer a vehicle at up to 55 mph on highways. With a single hidden layer of 29 units, ALVINN showed that behavioral cloning could work — if the data was right and the distribution stayed close to training.
The BC objective
Given a dataset D = {(o1, a1), (o2, a2), ..., (oN, aN)} of observation-action pairs from expert demonstrations, behavioral cloning learns a policy πθ(a | o) that minimizes:

θ* = argminθ E(o,a)~D [ℓ(πθ(o), a)]
where ℓ is a loss function measuring how well the policy's prediction matches the expert's action. This is standard supervised learning. The observation o is the input, the expert action a is the label, and the policy πθ is the model being trained.
For robotics, the observation o typically includes a camera image (and optionally proprioception), and the action a is a continuous vector specifying the desired end-effector motion. The policy architecture can be a CNN processing the image, a ViT, or — in the VLA case — a full vision-language model that also conditions on a language instruction.
Loss functions for continuous actions
The choice of loss function ℓ has important implications for what the policy can express:
ℓMSE(o, a) = ||πθ(o) − a||²
Mean squared error. The simplest choice: the policy predicts a single action vector, and the loss penalizes squared deviation from the expert. Assumes a unimodal Gaussian action distribution. Fails when multiple valid actions exist for the same observation — the policy averages them, potentially producing an invalid action. Widely used for simple tasks.
ℓGMM(o, a) = −log Σk wk(o) N(a; μk(o), Σk(o))
Gaussian Mixture Model (or Mixture Density Network). The policy predicts K mixture components, each with a weight, mean, and covariance. Handles multimodal action distributions naturally. Used in robomimic's BC-RNN and other BC baselines. Training can be unstable with many components.
ℓCE(o, a) = −Σd log pθ(ad | o)
When actions are discretized into bins (as in RT-2 and other VLAs), each action dimension becomes a classification problem. The loss is the sum of cross-entropy losses across action dimensions. Naturally handles multimodality because the output is a full probability distribution over bins.
ℓdiffusion = Eε,t ||εθ(at, o, t) − ε||²
Diffusion Policy (Chi et al., 2023) treats action prediction as a denoising problem. The policy learns to iteratively denoise a random action vector into the correct action, conditioned on the observation. Handles arbitrary multimodal distributions and produces smooth, coherent action sequences. State-of-the-art for many manipulation tasks.
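Of these losses, the Gaussian-mixture negative log-likelihood is the least standard to implement by hand. A minimal PyTorch sketch might look like the following; the tensor shapes and the diagonal-covariance simplification are assumptions for illustration, not any particular library's API:

```python
import math
import torch

def gmm_nll(a: torch.Tensor, logits: torch.Tensor,
            means: torch.Tensor, log_stds: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of expert action `a` under a diagonal
    Gaussian mixture predicted by the policy.

    a:        (B, D)     expert action
    logits:   (B, K)     unnormalized mixture weights
    means:    (B, K, D)  component means
    log_stds: (B, K, D)  component log standard deviations
    """
    a = a.unsqueeze(1)                                   # (B, 1, D)
    var = (2 * log_stds).exp()                           # (B, K, D)
    # Per-component diagonal Gaussian log-density, summed over action dims
    log_prob = -0.5 * (((a - means) ** 2) / var
                       + 2 * log_stds + math.log(2 * math.pi)).sum(-1)  # (B, K)
    # Mixture log-likelihood via logsumexp for numerical stability
    log_mix = torch.log_softmax(logits, dim=-1) + log_prob
    return -torch.logsumexp(log_mix, dim=-1).mean()
```

The logsumexp over components is what lets the policy keep probability mass on several distinct actions instead of averaging them.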
The multimodality problem deserves emphasis. When a human demonstrator approaches an object, they might reach from the left or from the right. Both are valid actions for the same observation. An MSE-trained policy averages these two trajectories and reaches straight ahead — potentially missing the object entirely. This is not a failure of training; it's a fundamental limitation of unimodal action prediction. Discretized actions and diffusion policies handle this naturally, which is one reason they dominate modern VLA architectures.
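The mode-averaging failure is easy to reproduce in a few lines. In this toy sketch, half the "demonstrations" reach left (action −1) and half reach right (+1) from the same observation; minimizing MSE drives the prediction to their mean, an action that matches neither mode:

```python
import torch

# Toy bimodal data: same observation, two valid expert actions.
actions = torch.tensor([[-1.0], [1.0]] * 50)   # (100, 1)

# A single learnable action prediction for that observation.
pred = torch.tensor([0.8], requires_grad=True)
opt = torch.optim.SGD([pred], lr=0.1)
for _ in range(500):
    loss = ((pred - actions) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(pred.item())  # converges to ~0.0 -- the mean of the two modes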
Interactive demo: a BC policy fitting expert demonstrations under successive gradient steps. The green path shows the expert; the purple path shows the learned policy.
Distribution Shift
Behavioral cloning has a fundamental weakness that makes it fragile in practice. The problem is called distribution shift (also known as covariate shift), and it is arguably the single most important concept in understanding the limitations of naive BC and the motivations behind more sophisticated imitation learning methods.
Compounding errors
The core issue: the policy is trained on states visited by the expert, but at test time it visits states generated by its own actions. Any small error at timestep t shifts the robot into a state the expert would never have visited. From this unfamiliar state, the policy is less likely to produce a correct action, which shifts the state further from the expert's distribution. Errors compound exponentially.
Ross and Bagnell (2010) proved a formal bound: for a policy with per-step error ε, the total cost over a horizon T grows as O(εT²):

J(π) ≤ J(π*) + εT²

where J(π) is the cost of the learned policy, J(π*) is the cost of the expert, ε is the per-step classification error, and T is the task horizon. This quadratic dependence on horizon length means that even tiny per-step errors become catastrophic for long tasks: doubling the task length quadruples the performance gap.
Consider a simple driving example. A human driver keeps the car centered in the lane. A BC policy trained on this data predicts steering angles for centered driving states. But suppose at timestep 10, the policy makes a small error and the car drifts slightly right. The expert data never includes "slightly right of center" states, so the policy has no training signal for recovery. It may continue predicting straight-ahead steering, causing the car to drift further. By timestep 50, the car is off the road.
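The quadratic growth is visible even in a deterministic toy model. Suppose an uncorrected error of ε per step pushes the state off the expert path, and the per-step cost is the resulting deviation; the total cost is then ε(1 + 2 + ... + T) = εT(T + 1)/2, i.e. O(εT²):

```python
def total_cost(eps: float, T: int) -> float:
    """Toy compounding-error model: a constant per-step error `eps` that
    is never corrected, with per-step cost equal to the deviation from
    the expert trajectory."""
    x, cost = 0.0, 0.0
    for _ in range(T):
        x += eps          # error accumulates, never corrected
        cost += abs(x)    # cost = distance from the expert path
    return cost

# Doubling the horizon roughly quadruples the total cost:
print(total_cost(0.01, 100), total_cost(0.01, 200))  # ~50.5, ~201.0
```

This is exactly the driving example in miniature: the per-step error is tiny, but the cost of being off-distribution is paid at every subsequent step.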
Covariate shift formalized
In supervised learning terms, the training distribution pexpert(o) and the test distribution pπ(o) are different. The policy is trained to minimize:

Eo~pexpert [ℓ(πθ(o), π*(o))]

but is evaluated on:

Eo~pπ [ℓ(πθ(o), π*(o))]
These two expectations are over different distributions. Standard supervised learning guarantees (generalization bounds, convergence theorems) assume that training and test distributions match. When they don't, all bets are off. The policy may have zero training loss yet perform terribly at test time because it encounters states far from any training example.
Modern VLAs are remarkably robust to distribution shift despite being trained with pure behavioral cloning. Three factors explain this. First, scale: training on 100,000+ diverse demonstrations covers a much broader state distribution, including many "recovery" states that appear naturally in non-expert demonstrations. Second, visual generalization: pretrained vision encoders (like SigLIP or DINOv2) provide robust features that transfer across visual variations, reducing the effective distribution shift in feature space even when pixel-level observations differ. Third, action chunking: predicting chunks of future actions rather than single-step actions reduces the compounding error problem by lowering the effective horizon length.
Interactive demo: small per-step errors compounding over time. The green path is the expert; the red path is the BC policy. Higher per-step error rates make the drift accumulate faster.
DAgger
DAgger (Dataset Aggregation), introduced by Ross, Gordon, and Bagnell in 2011, is the foundational algorithm for addressing distribution shift in imitation learning. The key insight is simple: instead of training only on the expert's state distribution, collect additional labels for the states the policy actually visits.
The algorithm
DAgger works as follows:
- Initialize: Collect an initial dataset D1 by running the expert policy π*. Train an initial policy π1 on D1.
- Execute: Run the current policy πi in the environment to collect new observations {o1, o2, ..., oT}.
- Query expert: For each observation ot collected by πi, ask the expert for the correct action a*t = π*(ot).
- Aggregate: Add the new (ot, a*t) pairs to the dataset: Di+1 = Di ∪ {(ot, a*t)}.
- Retrain: Train a new policy πi+1 on Di+1.
- Repeat from step 2.
The theoretical guarantee: DAgger's cost bound is O(εT) instead of O(εT²) — linear rather than quadratic in the horizon. After N iterations, the policy has been trained on states from both the expert's distribution and its own distribution, closing the gap between training and test conditions.
DAgger in practice
The practical difficulty with DAgger is step 3: querying the expert for labels on policy-visited states requires an expert who can provide optimal actions for arbitrary states on demand. For a human expert, this means watching the robot execute its (potentially poor) policy and continuously providing corrective labels — which is cognitively demanding and time-consuming.
Several variants address this:
- HG-DAgger (Human-Gated DAgger): The human takes over control whenever the robot deviates too far, providing both corrective trajectories and implicit error detection. This is more natural than labeling individual states.
- ThriftyDAgger: Uses uncertainty estimation to decide when to query the expert, reducing the labeling burden by only asking for help in uncertain states.
- IWR (Intervention Weighted Regression): Re-weights the training data to emphasize samples collected during human interventions, treating each intervention as a signal for the states where correction matters most.
In the context of VLAs, DAgger-style methods are less common because they require interactive access to an expert during training. Instead, VLAs rely on large, diverse datasets that naturally cover a broader state distribution, reducing the need for on-policy corrections. However, the DAgger principle — that training data should cover the policy's actual state distribution — remains a guiding insight.
Interactive demo: BC (one round of data) vs DAgger (iterative rounds of data collection).
Action Representations
How you represent actions has as much impact on policy quality as the choice of network architecture or training algorithm. Modern manipulation research has converged on two dominant approaches that address the limitations of naive single-step action prediction.
Action chunking
Instead of predicting a single action per observation, action chunking predicts a sequence of future actions [at, at+1, ..., at+k] from each observation. The robot executes the first few actions in the chunk, then queries the policy again for a new chunk. This approach was formalized in ACT (Action Chunking with Transformers, Zhao et al., 2023) and has become standard in VLA research.
Action chunking provides three key benefits:
- Reduces compounding errors. With a chunk size of k=20 and a control frequency of 10 Hz, each policy query covers 2 seconds of execution. A 30-second task requires only 15 policy queries instead of 300 single-step decisions. The effective horizon drops from T to T/k, and Ross & Bagnell's quadratic bound becomes O(ε(T/k)²).
- Temporal consistency. Single-step policies can produce jerky, inconsistent motions because each action is predicted independently. Chunked actions are jointly predicted and form smooth trajectories. Adjacent chunks are blended using exponential smoothing to avoid discontinuities at chunk boundaries.
- Handles pauses and timing. Demonstrations often include natural pauses (waiting for an object to settle, hovering before placing). A single-step policy may oscillate during pauses because the action for "stay still" is context-dependent. A chunked policy can explicitly predict a sequence of near-zero actions.
ACT uses a CVAE (Conditional Variational Autoencoder) architecture where the encoder observes the entire future action sequence during training (providing a latent code that captures the mode of the demonstration) and the decoder generates action chunks conditioned on this latent code and the current observation. At test time, the latent is set to the prior mean (zero), yielding deterministic, mode-consistent action chunks.
Diffusion policies
Diffusion Policy (Chi et al., 2023) applies denoising diffusion to action generation. Instead of directly predicting actions, the policy learns to iteratively denoise a random noise vector into a coherent action chunk, conditioned on the current observation.
The training objective is the standard DDPM loss: given an action chunk a from the dataset and a noise level t, add noise ε to get at = αt a + σt ε, and train the network to predict the noise: ℓ = Eε,t ||εθ(at, o, t) − ε||².
At inference time, the policy starts with pure noise and applies K denoising steps (typically K=10–100 using DDIM) to produce a clean action chunk. The key advantage over other generative approaches is that diffusion models can represent any multimodal distribution without requiring explicit mixture components or mode-switching logic.
Diffusion Policy achieves state-of-the-art performance on most manipulation benchmarks (robomimic, Push-T, etc.) and has been integrated into several VLA architectures. The main downside is inference speed: multiple denoising steps are required per action prediction, which can limit the control frequency. DDIM with 10 steps on a consumer GPU takes roughly 50–100ms, allowing 10–20 Hz control.
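The training side of this objective fits in a few lines. Below is a minimal sketch that stands in a flattened-MLP denoiser for Diffusion Policy's actual U-Net/transformer backbone; all names, shapes, and the noise schedule are illustrative assumptions:

```python
import torch
import torch.nn as nn

class NoisePredNet(nn.Module):
    """Toy epsilon-predictor conditioned on observation, noisy action
    chunk, and noise level t (a stand-in for the real denoiser)."""
    def __init__(self, obs_dim: int, act_dim: int, chunk: int, hidden: int = 128):
        super().__init__()
        self.chunk, self.act_dim = chunk, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + chunk * act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, chunk * act_dim),
        )

    def forward(self, obs, noisy_actions, t):
        x = torch.cat([obs, noisy_actions.flatten(1),
                       t.float().unsqueeze(1)], dim=1)
        return self.net(x).view(-1, self.chunk, self.act_dim)

def ddpm_loss(net, obs, actions, alphas_bar):
    """One DDPM training step: noise a clean chunk, predict the noise.

    a_t = sqrt(abar_t) * a + sqrt(1 - abar_t) * eps
    """
    B = actions.shape[0]
    t = torch.randint(0, len(alphas_bar), (B,))        # random noise level
    abar = alphas_bar[t].view(B, 1, 1)
    eps = torch.randn_like(actions)
    noisy = abar.sqrt() * actions + (1 - abar).sqrt() * eps
    return ((net(obs, noisy, t) - eps) ** 2).mean()
```

Inference then runs this denoiser iteratively (DDPM/DDIM sampling) starting from pure noise, which is where the per-prediction latency cost comes from.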
Flow matching (Lipman et al., 2023) is emerging as a faster alternative to diffusion for action generation. Instead of iterative denoising, flow matching learns a continuous-time ODE that transports noise to actions in a single forward pass. π0 (Physical Intelligence, 2024) uses flow matching as its action head, achieving strong manipulation performance with lower inference latency than diffusion-based approaches.
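A sketch of the flow-matching objective and its Euler-integration sampler, under the common linear-interpolation ("rectified flow") formulation. This illustrates the idea, not π0's actual implementation; vel_net is any assumed network mapping (observation, noisy action, time) to a velocity:

```python
import torch

def flow_matching_loss(vel_net, obs, actions):
    """Conditional flow matching with a linear noise-to-data path:
    a_t = (1 - t) * noise + t * a, so the target velocity is (a - noise)."""
    B = actions.shape[0]
    noise = torch.randn_like(actions)
    t = torch.rand(B, 1)                     # random time in [0, 1)
    a_t = (1 - t) * noise + t * actions
    target = actions - noise                 # constant along the path
    return ((vel_net(obs, a_t, t) - target) ** 2).mean()

def sample(vel_net, obs, act_dim, steps=10):
    """Integrate the learned ODE from noise to an action with Euler steps.
    Unlike DDPM sampling, even a handful of steps (or one, in the limit)
    can suffice, which is the latency advantage."""
    a = torch.randn(obs.shape[0], act_dim)
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((obs.shape[0], 1), k * dt)
        a = a + dt * vel_net(obs, a, t)
    return a
```

Actions are flat vectors here for brevity; a chunked policy would flatten the chunk the same way the diffusion sketch above does.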
Interactive demo: how different action prediction methods handle a bimodal action distribution (two valid grasping approaches for the same object).
Data Collection
The quality of a behavioral cloning policy is bounded by the quality of its demonstrations. Unlike language data (abundant on the internet) or image data (web-scraped at scale), robot demonstration data must be collected through physical interaction. This is the single biggest bottleneck in robot learning.
Teleoperation methods
Teleoperation — controlling the robot remotely — is the primary method for collecting demonstrations. Several teleoperation interfaces have been developed:
- Kinesthetic teaching (hand-guiding): The human physically moves the robot's arm through the desired motion while joint encoders record the trajectory. Natural and intuitive, but limited to compliant robots (like the Franka Panda in gravity-compensation mode) and influenced by the robot's physical dynamics.
- Leader-follower (puppet control): A matched pair of robot arms where the "leader" is hand-moved by the human and the "follower" replicates the motion. Used by ALOHA (Zhao et al., 2023), which pairs two WidowX 250 leader arms with two ViperX 300 followers. Enables bimanual manipulation demonstrations and is more ergonomic than kinesthetic teaching.
- VR controllers: The human uses a VR headset and hand controllers to teleoperate the robot from a distance. Provides an intuitive 6-DOF interface but requires additional hardware and calibration. Used by several data collection efforts.
- 3D SpaceMouse: A 6-DOF input device that controls end-effector velocity. Compact and inexpensive (~$150) but has a learning curve and lower bandwidth than leader-follower systems.
Data quality considerations
Not all demonstrations are equally useful. Several factors affect the quality of demonstration data:
Demonstrator skill. Expert demonstrators produce clean, efficient trajectories with consistent strategies. Novice demonstrators produce noisy, variable trajectories with suboptimal strategies. Perhaps counterintuitively, a mix of skill levels can be beneficial: novice demonstrations naturally include recovery behaviors (correcting for mistakes) that provide supervision for exactly the kind of states the learned policy might visit during distribution shift.
Diversity. Demonstrations should cover the full range of object positions, orientations, lighting conditions, and task variations that the policy will encounter at deployment. 50 demonstrations of picking up the same cup from the same position are less useful than 50 demonstrations of picking up different objects from varied positions.
Action quality. The recorded actions should accurately reflect the demonstrator's intent. Teleoperation latency, calibration errors, and control noise can corrupt the action labels. Action smoothing and filtering are commonly applied as post-processing steps.
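As one concrete (and assumed, not universal) post-processing choice, an exponential moving average over the recorded action trajectory suppresses teleoperation jitter at the cost of some responsiveness:

```python
import numpy as np

def ema_smooth(actions: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Exponential moving average over a (T, act_dim) action trajectory.

    Smaller alpha = smoother but laggier labels; this is a filtering
    trade-off, and over-smoothing can distort fast, intentional motions.
    """
    smoothed = np.empty_like(actions, dtype=float)
    smoothed[0] = actions[0]
    for t in range(1, len(actions)):
        smoothed[t] = alpha * actions[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed
```

A constant trajectory passes through unchanged, while high-frequency noise is attenuated.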
As of 2024, the largest robot demonstration datasets contain roughly 1–2 million episodes (e.g., Open X-Embodiment with ~1M episodes from 22 robot types). This is orders of magnitude smaller than language datasets (trillions of tokens) or vision datasets (billions of images). Closing this gap is one of the most active areas of research in embodied AI, with approaches including simulation-based data generation, video-to-action models, and cross-embodiment transfer. We'll explore these in detail in Article 06.
Scaling Behavioral Cloning
The central bet of VLA research is that behavioral cloning scales — that with enough data, a large enough model, and the right architecture, BC can produce robot policies that generalize broadly. This bet is inspired by the success of scaling in language (GPT series) and vision-language (LLaVA, Flamingo) models.
Evidence supporting this bet:
- RT-1 (Brohan et al., 2022) demonstrated that a transformer trained on 130,000 real robot episodes across 700+ tasks could generalize to novel object combinations and task instructions. Success rate improved log-linearly with data scale.
- Octo (Ghosh et al., 2024) showed that pretraining a generalist policy on 800,000 episodes from the Open X-Embodiment dataset and fine-tuning on target tasks improved performance over training from scratch, even when the pretraining data came from different robots.
- OpenVLA (Kim et al., 2024) trained a 7B-parameter VLA on 970,000 episodes from Open X-Embodiment and demonstrated strong zero-shot generalization and efficient fine-tuning on new tasks with as few as 10–50 demonstrations.
However, the scaling trajectory for BC in robotics is not as smooth as for language modeling. Robot data is more heterogeneous (different robots, cameras, tasks, environments), lower quality (teleoperation noise, calibration errors), and more sparsely distributed in the space of possible situations. The field is still early in understanding the scaling laws for robot behavioral cloning.
| Model | Training Data | Model Size | Action Repr. | Key Result |
|---|---|---|---|---|
| ACT | 50 demos/task | ~10M | Chunk (CVAE) | Bimanual manipulation with 80%+ success |
| Diffusion Policy | 100–200 demos | ~100M | Diffusion chunk | SOTA on robomimic, Push-T benchmarks |
| RT-1 | 130K episodes | 35M | Discrete tokens | 700+ tasks, 97% success on seen tasks |
| Octo | 800K episodes | 93M | Diffusion chunk | Cross-embodiment pretraining works |
| OpenVLA | 970K episodes | 7B | Discrete tokens | Zero-shot generalization, efficient fine-tuning |
| π0 | 10K+ hours | 3B | Flow matching | Dexterous manipulation, multi-task |
Code Examples
Let's implement behavioral cloning from scratch to see exactly how simple the training loop is, then implement action chunking and compare.
Minimal behavioral cloning
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np
class DemonstrationDataset(Dataset):
    """Dataset of (observation, action) pairs from expert demonstrations."""

    def __init__(self, demo_paths: list[str]):
        self.observations = []
        self.actions = []
        for path in demo_paths:
            demo = np.load(path)
            self.observations.append(demo['observations'])  # (T, obs_dim)
            self.actions.append(demo['actions'])            # (T, act_dim)
        self.observations = np.concatenate(self.observations, axis=0)
        self.actions = np.concatenate(self.actions, axis=0)

    def __len__(self):
        return len(self.observations)

    def __getitem__(self, idx):
        return (
            torch.FloatTensor(self.observations[idx]),
            torch.FloatTensor(self.actions[idx]),
        )
class BCPolicy(nn.Module):
    """Simple MLP policy for behavioral cloning."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)
# ── Training loop ──
def train_bc(policy, dataset, epochs=100, lr=1e-3, batch_size=256):
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for epoch in range(epochs):
        total_loss = 0.0
        for obs, actions in loader:
            pred_actions = policy(obs)
            loss = nn.functional.mse_loss(pred_actions, actions)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        if (epoch + 1) % 10 == 0:
            avg_loss = total_loss / len(loader)
            print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.6f}")
    return policy
# ── Evaluation ──
def evaluate_bc(policy, env, num_episodes=20):
    successes = 0
    for ep in range(num_episodes):
        obs, _ = env.reset()
        for step in range(200):
            with torch.no_grad():
                action = policy(torch.FloatTensor(obs)).numpy()
            obs, reward, done, truncated, info = env.step(action)
            if done or truncated:
                if info.get('success', False):
                    successes += 1
                break
    return successes / num_episodes
Action chunking with temporal ensemble
class ActionChunkingPolicy(nn.Module):
    """Predicts a chunk of K future actions from current observation."""

    def __init__(self, obs_dim: int, act_dim: int, chunk_size: int = 20,
                 hidden: int = 256):
        super().__init__()
        self.chunk_size = chunk_size
        self.act_dim = act_dim
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Predict entire chunk at once: (chunk_size * act_dim) outputs
        self.action_head = nn.Linear(hidden, chunk_size * act_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        """Returns (batch, chunk_size, act_dim)."""
        h = self.encoder(obs)
        chunks = self.action_head(h)
        return chunks.view(-1, self.chunk_size, self.act_dim)
class TemporalEnsemble:
    """Blends overlapping action chunks with exponential weighting."""

    def __init__(self, chunk_size: int, act_dim: int, decay: float = 0.01):
        self.chunk_size = chunk_size
        self.act_dim = act_dim
        self.decay = decay
        self.buffer = []  # list of {'actions': chunk, 'index': next step to use}

    def add_chunk(self, action_chunk: np.ndarray):
        """Add a new predicted action chunk."""
        self.buffer.append({
            'actions': action_chunk,  # (chunk_size, act_dim)
            'index': 0,
        })

    def get_action(self) -> np.ndarray:
        """Get blended action from all active chunks."""
        if not self.buffer:
            raise ValueError("No action chunks available")
        weighted_sum = np.zeros(self.act_dim)
        weight_sum = 0.0
        for i, chunk_info in enumerate(self.buffer):
            idx = chunk_info['index']
            if idx < len(chunk_info['actions']):
                # More recent chunks get higher weight
                w = np.exp(-self.decay * (len(self.buffer) - 1 - i))
                weighted_sum += w * chunk_info['actions'][idx]
                weight_sum += w
                chunk_info['index'] += 1
        # Remove exhausted chunks
        self.buffer = [c for c in self.buffer if c['index'] < len(c['actions'])]
        return weighted_sum / weight_sum
# ── Evaluation with action chunking ──
def evaluate_chunked(policy, env, chunk_size=20, query_freq=10, act_dim=7):
    """Run policy with action chunking and temporal ensembling."""
    ensemble = TemporalEnsemble(chunk_size, act_dim=act_dim)
    obs, _ = env.reset()
    for step in range(200):
        # Query policy every query_freq steps
        if step % query_freq == 0:
            with torch.no_grad():
                chunk = policy(torch.FloatTensor(obs)).numpy()[0]
            ensemble.add_chunk(chunk)
        action = ensemble.get_action()
        obs, reward, done, truncated, info = env.step(action)
        if done or truncated:
            return info.get('success', False)
    return False
DAgger implementation
def dagger(env, expert_policy, learner, num_rounds=10,
           episodes_per_round=20, train_epochs=50):
    """DAgger: iteratively aggregate on-policy data with expert labels."""
    dataset_obs, dataset_act = [], []
    for round_idx in range(num_rounds):
        new_obs, new_act = [], []
        for ep in range(episodes_per_round):
            obs, _ = env.reset()
            for step in range(200):
                # Execute the LEARNER's policy (the expert on round 0)
                if round_idx == 0:
                    action = expert_policy(obs)
                else:
                    with torch.no_grad():
                        action = learner(torch.FloatTensor(obs)).numpy()
                # Query the EXPERT for the correct action at this state
                expert_action = expert_policy(obs)
                new_obs.append(obs.copy())
                new_act.append(expert_action.copy())
                obs, reward, done, truncated, info = env.step(action)
                if done or truncated:
                    break
        # Aggregate datasets
        dataset_obs.extend(new_obs)
        dataset_act.extend(new_act)
        # Retrain on aggregated dataset
        obs_tensor = torch.FloatTensor(np.array(dataset_obs))
        act_tensor = torch.FloatTensor(np.array(dataset_act))
        dataset = torch.utils.data.TensorDataset(obs_tensor, act_tensor)
        train_bc(learner, dataset, epochs=train_epochs)
        # Evaluate
        success_rate = evaluate_bc(learner, env)
        print(f"DAgger round {round_idx+1}: {len(dataset_obs)} samples, "
              f"success rate = {success_rate:.1%}")
    return learner
References
Seminal papers and key works referenced in this article.
- Pomerleau. "ALVINN: An Autonomous Land Vehicle in a Neural Network." NeurIPS, 1989.
- Ross et al. "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning." AISTATS, 2011. arXiv
- Chi et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." RSS, 2023. arXiv
- Zhao et al. "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." RSS, 2023. arXiv
- Florence et al. "Implicit Behavioral Cloning." CoRL, 2022. arXiv