Kochenderfer, Wheeler & Wray — Chapter 18

Imitation Learning

Teaching agents from expert demonstrations: from supervised copying, through iterative expert queries, to inferring hidden reward functions and adversarial imitation.

Prerequisites: Chapter 17 (Model-Free Methods), supervised learning basics, policy gradient concepts. That's it.
10 chapters · 6 simulations · 10 quizzes

Chapter 0: Learning from Demonstrations

Imagine teaching a child to ride a bike by writing down a reward function. "You get +10 for moving forward, −5 for wobbling, −100 for falling..." You'd spend days arguing over the exact numbers, only to watch the child optimize your metric in ways you never intended — leaning against a wall to avoid falling while never actually moving.

Now imagine instead sitting on the back of their bike for a few laps and pedaling for them. That communicates the intent effortlessly. Expert demonstration is often vastly more natural than reward specification.

Imitation learning is the family of methods that learn from expert demonstrations — recorded (state, action) pairs from someone (or some system) that already knows how to do the task. No reward function required.

The core promise: Instead of designing R(s,a) — which requires knowing what you want, formalized exactly — you collect a dataset D = {(s1,a1), (s2,a2), …} from an expert, and infer both the behavior and the objective from it. This works even when the reward is hard to articulate.

This chapter covers Chapter 18 of Kochenderfer et al. (pp. 355–373). There are six main approaches, organized by what they assume about expert access and what they try to learn:

Behavioral Cloning
Batch demos only. Treat it as supervised learning: π(a|s) ← expert demos. Simple, fast, fragile.
↓ needs more
DAgger / SMILe
Interactive expert. Run learned policy, query expert at visited states. Fixes distributional shift.
↓ needs even more
Inverse RL / GAIL
Recover the expert's hidden reward, then use it to train an optimal policy from scratch.

The methods form a spectrum. As you move down, you get more powerful solutions to deeper problems — but at the cost of more expert access, more computation, or stronger assumptions.

Why not just clone? The expert's demonstrations only cover the states the expert visits. As soon as a learner makes even a tiny error, it can drift into unfamiliar territory. Without expert coverage of those states, the policy has nothing to fall back on — and errors compound. This distributional mismatch is the central challenge the whole chapter is solving.
What is the core challenge that distinguishes imitation learning from standard supervised learning?

Chapter 1: Behavioral Cloning

You have a dataset D = {(s^(1), a^(1)), …, (s^(m), a^(m))} from an expert. The simplest thing imaginable: train a supervised classifier to predict the expert's action from the state.

For a discrete action space, the policy is a conditional distribution. The likelihood of the expert data under policy πθ is:

L(θ) = ∏_{i=1}^{m} π_θ(a^(i) | s^(i))

Maximize this (equivalently, minimize cross-entropy loss) using gradient ascent. For a tabular policy with no function approximation, this reduces to counting: π(a|s) = N(s,a) / ∑_{a'} N(s,a'). For a neural network policy, it's standard backprop.
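As a quick sanity check of the counting claim, the sketch below compares the log-likelihood of the count-based tabular policy with that of a uniform policy. The function names and the four-pair dataset are made up for illustration; they are not from the textbook.

```python
import math
from collections import Counter

def count_policy(demos):
    """Tabular maximum likelihood: pi(a|s) = N(s,a) / sum_a' N(s,a')."""
    pair_counts = Counter(demos)
    state_counts = Counter(s for s, _ in demos)
    return {(s, a): n / state_counts[s] for (s, a), n in pair_counts.items()}

def log_likelihood(policy, demos):
    """log L = sum_i log pi(a_i | s_i)."""
    return sum(math.log(policy[(s, a)]) for s, a in demos)

# Hypothetical expert data: (state, action) pairs, two actions per state
demos = [(0, 1), (0, 1), (0, 0), (1, 0)]
pi_mle = count_policy(demos)
pi_uniform = {(s, a): 0.5 for s in (0, 1) for a in (0, 1)}

ll_mle = log_likelihood(pi_mle, demos)
ll_uniform = log_likelihood(pi_uniform, demos)
```

The count-based policy scores at least as high as any other tabular policy on the demonstrations, which is all that "maximum likelihood" promises. It says nothing about states that never appear in `demos`.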

The mountain car example (textbook p. 357): With 10 expert rollouts on mountain car, the policy learns correct (velocity, position) → action mappings in the region the expert visits. But the expert always starts near center with moderate velocity — in regions far from those, the policy assigns uniform random probability to actions. The agent looks fine on the training distribution but fails as soon as it wanders.

The resulting policy can be tested by rolling out the learned policy from the initial state. On step 1, it follows the expert closely. On step 2, it's slightly off. By step 20, it's in a state the expert never visited. The policy guesses. Guess wrong, drift further. Error cascades.

python
# Behavioral cloning: pure supervised learning on expert demos
import numpy as np

def behavioral_clone_tabular(demos, n_states, n_actions):
    """
    demos: list of (state, action) tuples from expert.
    Returns: policy pi[s][a] = probability of action a in state s.
    """
    counts = np.ones((n_states, n_actions))  # Laplace smoothing
    for s, a in demos:
        counts[s, a] += 1
    policy = counts / counts.sum(axis=1, keepdims=True)
    return policy  # shape (n_states, n_actions)

# Neural network variant: just replace with torch.nn.CrossEntropyLoss
# loss = F.cross_entropy(pi_theta(states), expert_actions)
# optimizer.zero_grad(); loss.backward(); optimizer.step()

def rollout_bc_policy(policy, env, max_steps=200):
    """Roll out learned policy, collecting trajectory."""
    s = env.reset()
    traj = []
    for _ in range(max_steps):
        a = np.argmax(policy[s])   # greedy w.r.t. cloned policy
        sp, r, done = env.step(a)
        traj.append((s, a, r))
        if done: break
        s = sp
    return traj
Behavioral Cloning — Data Coverage vs Error

The expert (teal curve) drives through a 1D track. The clone (orange) copies it — but only has data near the teal regions. Drag the slider to add more expert demonstrations and watch coverage improve. Then press "Rollout" to see how the clone performs from its own starting point.

In behavioral cloning, what determines whether the policy is reliable in a given state?

Chapter 2: DAgger — Dataset Aggregation

Behavioral cloning fails because the learned policy visits states the expert never showed us. The fix is surgical: go get data from those states. Run the learned policy, observe where it goes, ask the expert what they'd do there, add that to the dataset, retrain. Repeat.

This is DAgger (Ross et al., 2011) — Dataset Aggregation. At iteration N, the dataset DN contains expert labels for every state the learned policy has ever visited across all N iterations. A new policy is trained on this growing aggregate.

D_{N+1} = D_N ∪ { (s, π*(s)) : s ∈ rollout(π_N) }

Where π*(s) is the action the expert would take in state s. The next policy π_{N+1} is trained on all of D_{N+1}.

Why this fixes distributional shift: After enough iterations, D contains expert labels for every state the policy is likely to visit — because those states were visited, and labeled, in previous iterations. The policy's training distribution converges to match its test distribution (the states it actually visits).

DAgger's key theoretical result (Ross et al., 2011): with a no-regret learner, the best policy found after N iterations has expected cost bounded by the expert's cost plus a term that shrinks roughly as 1/N, plus whatever error the policy class cannot avoid. Behavioral cloning's error bound grows quadratically with the horizon T, as O(εT^2) for per-step error ε. DAgger's bound is linear, O(εT). For long-horizon tasks, this is a qualitative improvement.
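The quadratic-versus-linear gap can be made concrete with a stylized calculation. The sketch assumes a deliberately pessimistic cloning model (the policy slips with probability eps per step and, once off the expert's path, every later step is also wrong) and an idealized learner that can recover, so each step errs independently. Both models and both function names are illustrative assumptions, not the bounds' actual proofs.

```python
def bc_expected_mistakes(eps, T):
    """Non-recovering learner: step t is wrong if any slip happened by step t."""
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, T + 1))

def recoverable_expected_mistakes(eps, T):
    """Learner that knows how to recover: each step errs independently."""
    return eps * T

# Compare the two models as the horizon grows
for T in (10, 100, 1000):
    print(T, bc_expected_mistakes(0.01, T), recoverable_expected_mistakes(0.01, T))
```

For small eps·T the first quantity behaves like eps·T²/2, and it never exceeds eps·T(T+1)/2; the second is exactly eps·T.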

The price of DAgger: It requires an interactive expert — someone (or something) that can be queried on arbitrary states during training, not just during data collection. For human experts this can be expensive. For experts that are themselves learned models or simulators, it's cheap. Much of modern robot learning uses DAgger with a teleoperated reference policy.

The algorithm is conceptually simple but the implementation detail matters: you collect states visited by πN and label them with expert actions, not the policy's own actions. This is critical — you're asking "what should have been done here," not "what did we do."
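To see the "expert labels, learner states" detail in action, here is a self-contained toy, assuming a hypothetical 1D corridor with a deterministic expert and a tabular clone. The environment, the default action for unlabeled states, and all names are illustrative choices, not part of the textbook's algorithm.

```python
from collections import Counter, defaultdict

GOAL, N_STATES = 5, 11          # corridor 0..10; the expert wants to sit at 5

def expert(s):
    """Expert action in {-1, 0, +1}: step toward GOAL."""
    return (s < GOAL) - (s > GOAL)

def step(s, a):
    return min(N_STATES - 1, max(0, s + a))

def fit(D):
    """Tabular behavioral cloning: majority expert label per state."""
    votes = defaultdict(Counter)
    for s, a in D:
        votes[s][a] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, +1)   # unlabeled states: arbitrary default

def rollout(policy, s0, horizon=12):
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = step(s, policy(s))
    return states

def dagger(n_iters, demo_start=2, learner_start=8):
    D = [(s, expert(s)) for s in rollout(expert, demo_start)]
    pi = fit(D)                                  # iteration 0: plain BC
    for _ in range(n_iters):
        visited = rollout(pi, learner_start)     # the learner picks the actions
        D += [(s, expert(s)) for s in visited]   # the expert supplies the labels
        pi = fit(D)
    return pi

final_bc = rollout(dagger(0), 8)[-1]
final_dagger = rollout(dagger(3), 8)[-1]
```

In this toy the plain clone, started outside the demonstrated region, runs into the wall and stays there, while three aggregation rounds label exactly the states the learner wandered into and bring it to the goal.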

python
def dagger(env, expert_policy, n_iters=10, n_rollout_steps=200):
    """
    DAgger: Dataset Aggregation.
    expert_policy: callable s -> a (the oracle we can query anywhere)
    Returns: final trained policy after n_iters rounds.
    """
    D = []  # aggregate dataset

    # Iteration 0: behavioral clone on expert demos
    D += collect_expert_rollout(expert_policy, env, n_rollout_steps)
    pi = behavioral_clone(D)

    for N in range(1, n_iters):
        # Step 1: roll out the CURRENT learned policy
        visited_states = []
        s = env.reset()
        for _ in range(n_rollout_steps):
            a = pi.act(s)              # learned policy chooses action
            visited_states.append(s)
            sp, _, done = env.step(a)
            if done: break
            s = sp

        # Step 2: query expert for EACH visited state
        new_labels = [(s, expert_policy(s)) for s in visited_states]

        # Step 3: aggregate and retrain
        D += new_labels            # D grows every iteration
        pi = behavioral_clone(D)  # retrain on full aggregate

    return pi
DAgger Showcase — Interactive Dataset Aggregation

A 1D agent (orange dot) tries to follow the expert's sinusoidal target trajectory (teal). The gray region shows states covered by training data. Press DAgger Step to run one iteration: the learned policy rolls out (red path), the expert labels those states (new gray coverage), and the policy improves. Watch how coverage expands and trajectories converge.

Teal = expert trajectory    Red = current policy rollout    Gray bands = expert-labeled states
Convergence guarantee: After N DAgger iterations, the policy's expected cost satisfies: E[C(πN)] ≤ C(π*) + O(1/N). This holds even if the environment dynamics are stochastic. The policy's cost converges to the expert's cost as iterations increase.
What does DAgger add to the training dataset at each iteration?

Chapter 3: SMILe — Stochastic Mixing Iterative Learning

DAgger retrains from scratch on a growing dataset each iteration. SMILe (Ross & Bagnell, 2010) takes a different path: instead of aggregating data, it aggregates policies.

At each iteration k, SMILe trains a component policy πk on data collected from a mixture of the expert and all previously trained policies. The final policy is a mixture of all component policies, weighted so that older policies contribute more (they were trained on the most representative states).

Concretely, define the mixing parameter β ∈ (0,1). At iteration k, the data-collection policy is:

π_mix^(k) = (1−β)^k · π* + [1 − (1−β)^k] · π_{k−1}

Here π_{k−1} stands for the mixture of the components trained so far. The weight on the expert, (1−β)^k, decays toward zero. Early on, most data comes from the expert. Later, most comes from the learned policy. This is a gradual handoff.
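The two weight schedules are easy to tabulate. This sketch assumes iterations are numbered from k = 0 and that component k receives mixture weight proportional to (1−β)^k, which is one way to realize "older policies contribute more"; the function names are hypothetical.

```python
def expert_weight(beta, k):
    """Probability that the expert acts during data collection at iteration k."""
    return (1.0 - beta) ** k

def component_weights(beta, n_components):
    """Normalized mixture weights; component 0 is the oldest."""
    raw = [(1.0 - beta) ** k for k in range(n_components)]
    total = sum(raw)
    return [w / total for w in raw]

schedule = [expert_weight(0.5, k) for k in range(4)]
weights = component_weights(0.5, 3)
```

With β = 0.5 the expert's share halves every iteration, and the oldest component is sampled four times as often as the third one.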

SMILe vs DAgger — two paths to the same destination:

DAgger: One policy at a time, retrained on everything seen so far. Simple, practical, widely used.

SMILe: Mixture of policies, each trained on progressively less expert-guided data. More principled convergence analysis, but harder to implement. The final SMILe policy is a stochastic mixture: at each step, it flips a (biased) coin to decide which component policy to follow.

Both achieve O(T) error scaling vs BC's O(T^2). DAgger is usually preferred in practice for its simplicity.
python
import numpy as np

def smile(env, expert_policy, n_iters=10, beta=0.5, steps_per_iter=200):
    """
    SMILe: Stochastic Mixing Iterative Learning.
    beta: mixing decay parameter (0 < beta < 1).
    Returns: list of (weight, component_policy) pairs.
    """
    components = []     # (weight, policy) list

    for k in range(n_iters):
        expert_weight = (1 - beta) ** k   # decays to 0
        learned_weight = 1 - expert_weight

        # Build data-collection mixture policy
        def mix_policy(s):
            if np.random.random() < expert_weight:
                return expert_policy(s)
            elif components:
                # sample from previous mixture
                w, p = zip(*components)
                w = np.array(w) / sum(w)
                chosen = np.random.choice(len(components), p=w)
                return components[chosen][1](s)
            else:
                return expert_policy(s)  # fallback

        # Collect data under mix_policy
        data = rollout_with_expert_labels(mix_policy, expert_policy,
                                           env, steps_per_iter)
        # Train new component on this data
        new_component = behavioral_clone(data)
        # Older components get the larger mixture weight, proportional to (1 - beta)^k
        components.append(((1 - beta) ** k, new_component))

    return components  # sample from this at test time
SMILe Policy Mixing — Expert Weight Decay

Adjust β to see how quickly the expert's influence fades. Lower β = expert stays relevant longer. Higher β = learned policy takes over faster. The right plot shows the per-iteration expert weight (1−β)^k.

In SMILe, what happens to the expert's influence on data collection as iterations increase?

Chapter 4: Max-Margin IRL

DAgger and SMILe require an interactive expert. What if you only have a batch of recordings — and you want to understand why the expert does what it does, not just copy the behavior?

Inverse reinforcement learning (IRL) assumes the expert was optimizing some unknown reward function Rφ. Given expert demonstrations, recover φ. Then train an agent to optimize Rφ — it will generalize even to states the expert never visited.

The textbook presents the maximum margin formulation. Assume a linear reward:

R_φ(s, a) = φ^T β(s, a)

where β(s,a) ∈ {0,1}^k is a binary feature vector indicating which features are "active" in this (state, action) pair, and φ ∈ ℝ^k with ||φ||_2 ≤ 1 are the unknown reward weights. The discounted expected feature count under policy π is:

μ_π = E_{τ∼π}[ ∑_{t≥0} γ^t β(s_t, a_t) ]

The expert's empirical feature count μE is estimated from the demonstrations by averaging across trajectories. The maximum margin objective finds weights φ such that the expert's expected reward exceeds every other policy's reward by the largest possible margin:

maximize_φ min_π [ φ^T(μ_E − μ_π) ]
Reading the objective: φ^T μ_E is the expert's expected reward under weights φ, and φ^T μ_π is any other policy's expected reward. We want the expert to be best by as large a margin as possible. The "min over π" finds the competitor policy that is hardest to beat. This is a max-min optimization, a classic game between reward weights and competitor policies.
Solving it iteratively: The algorithm alternates: (1) given current φ, find the strongest competitor π, the policy that is optimal for R_φ, by solving the MDP; (2) given the competitors found so far, update φ to increase the margin. This is a form of constraint generation, similar to boosting. The textbook (Algorithm 18.4) adds slack variables to allow small violations when perfect separation is impossible.
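The two quantities in the objective can be computed directly for small problems. The sketch below estimates the expert's feature counts by averaging over trajectories and evaluates the margin against a list of competitors. The feature function, the trajectories, and the competitor vectors are invented for illustration.

```python
import numpy as np

def feature_expectations(trajectories, feature_fn, gamma):
    """Average discounted feature count over trajectories of (s, a) pairs."""
    total = sum(
        sum((gamma ** t) * feature_fn(s, a) for t, (s, a) in enumerate(traj))
        for traj in trajectories
    )
    return total / len(trajectories)

def margin(phi, mu_E, competitor_mus):
    """Smallest gap phi^T (mu_E - mu_pi) and the index of the hardest competitor."""
    gaps = [float(phi @ (mu_E - mu)) for mu in competitor_mus]
    hardest = int(np.argmin(gaps))
    return gaps[hardest], hardest

# Hypothetical binary features: [state is 0, action is 1]
feature_fn = lambda s, a: np.array([float(s == 0), float(a == 1)])
expert_trajs = [[(0, 1), (1, 1)], [(0, 0)]]
mu_E = feature_expectations(expert_trajs, feature_fn, gamma=0.5)

competitors = [np.array([1.0, 0.0]), np.array([0.0, 0.5])]
gap, hardest = margin(np.array([0.0, 1.0]), mu_E, competitors)
```

A positive `gap` means the chosen weights rank the expert above every competitor in the list; the outer maximization searches for the φ that makes this gap as large as possible.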

The maximum margin formulation reformulates as a quadratic program. Denoting the set of competing policies seen so far as {π1, …, πn}:

minimize_φ ||φ||_2^2 + C ∑_i ξ_i   subject to: φ^T(μ_E − μ_{π_i}) ≥ 1 − ξ_i,  ξ_i ≥ 0  for i = 1, …, n

This is exactly a support vector machine formulation, with feature count differences playing the role of feature vectors and the expert-versus-policy gap playing the role of the class margin.

python
from scipy.optimize import minimize
import numpy as np

def max_margin_irl(mu_E, solver, n_features, C=1.0, n_iters=10):
    """
    Max-margin IRL via iterative constraint generation.
    mu_E: expert feature counts (shape: n_features,)
    solver: callable phi -> (policy, mu_pi) -- solves MDP and returns feature counts
    C: slack penalty
    """
    phi = np.zeros(n_features)
    policies = []   # accumulated competitor policies

    for _ in range(n_iters):
        # Step 1: find best competitor policy under current phi
        pi_i, mu_pi = solver(phi)
        policies.append(mu_pi)

        # Step 2: solve the QP for new phi
        # minimize 0.5*||phi||^2 + C*sum(xi) s.t. phi^T(mu_E - mu_pi) >= 1 - xi
        def objective(x):
            phi_ = x[:n_features]
            xi_  = x[n_features:]
            return 0.5 * np.dot(phi_, phi_) + C * xi_.sum()

        def constraints(x):
            phi_, xi_ = x[:n_features], x[n_features:]
            return [np.dot(phi_, mu_E - mu_p) + xi_[i] - 1
                    for i, mu_p in enumerate(policies)]

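        x0 = np.zeros(n_features + len(policies))
        # phi is unconstrained; every slack xi_i must stay non-negative
        bounds = [(None, None)] * n_features + [(0.0, None)] * len(policies)
        res = minimize(objective, x0, method='SLSQP', bounds=bounds,
                       constraints=[{'type':'ineq','fun':constraints}])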
        phi = res.x[:n_features]
        phi /= max(1.0, np.linalg.norm(phi))  # project onto unit ball

    return phi  # use phi to construct final reward and retrain
What is the "maximum margin" in max-margin IRL?

Chapter 5: MaxEnt IRL

Max-margin IRL has an uncomfortable property: many reward functions can explain the same demonstrations equally well. If the expert always turns left at the intersection, does it prefer left turns? Avoid right turns? Minimize travel time? All three reward functions rank the expert's behavior first.

Maximum entropy IRL (Ziebart et al., 2008) resolves this ambiguity with an elegant principle: among all probability distributions over trajectories that match the observed feature expectations, choose the one with maximum entropy. This is the least committal choice — you're not inventing preferences the expert didn't reveal.

The maximum entropy trajectory distribution subject to feature matching constraints has a closed form:

P(τ ; φ) = (1 / Z(φ)) exp( φ^T μ(τ) )

where μ(τ) = ∑_t γ^t β(s_t, a_t) is the trajectory's feature count and Z(φ) = ∑_τ exp(φ^T μ(τ)) is the partition function. Higher-reward trajectories are exponentially more likely. This is a Boltzmann distribution over trajectories.

Feature matching, precisely: The maximum entropy solution satisfies EP[μ(τ)] = μE. The expected feature counts under the learned distribution exactly match the expert's empirical feature counts. This is the constraint — MaxEnt picks the highest-entropy distribution that satisfies it. The reward weights φ are Lagrange multipliers for these constraints.

To learn φ, maximize the log-likelihood of expert trajectories under this model:

maximize_φ ∑_{τ∈D} log P(τ; φ) = ∑_{τ∈D} [ φ^T μ(τ) − log Z(φ) ]

The gradient, per demonstration, is exactly the feature matching condition: ∇_φ L = μ_E − E_P[μ(τ)]. Computing E_P[μ(τ)] requires the partition function, which in turn requires a forward pass through the MDP (value iteration or dynamic programming). This is the expensive part.
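When the set of trajectories is small enough to enumerate, the whole procedure fits in a few lines. The sketch assumes three hypothetical trajectories described only by their feature counts (the rows of `M`) and an expert feature expectation `mu_E` chosen inside their convex hull; it then follows the gradient μ_E − E_P[μ(τ)].

```python
import numpy as np

def trajectory_distribution(phi, M):
    """P(tau; phi) proportional to exp(phi^T mu(tau)); rows of M are mu(tau)."""
    logits = M @ phi
    logits = logits - logits.max()      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def fit_maxent(M, mu_E, lr=0.5, n_iters=2000):
    """Gradient ascent on the log-likelihood over an enumerated trajectory set."""
    phi = np.zeros(M.shape[1])
    for _ in range(n_iters):
        p = trajectory_distribution(phi, M)
        phi += lr * (mu_E - p @ M)      # expert counts minus model counts
    return phi

M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # hypothetical feature counts
mu_E = np.array([0.7, 0.5])                           # hypothetical expert average
phi = fit_maxent(M, mu_E)
p = trajectory_distribution(phi, M)
```

At convergence `p @ M` reproduces `mu_E`, which is the feature matching condition. Real problems have far too many trajectories to enumerate, which is why the dynamic-programming forward pass is needed.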

Soft value iteration: The MaxEnt forward pass uses "soft" Bellman equations where the max over actions is replaced by log-sum-exp: Q_soft(s,a) = r(s,a) + γ ∑_{s'} T(s'|s,a) V_soft(s') and V_soft(s) = log ∑_a exp Q_soft(s,a). This computes the log-partition function via DP at the same per-sweep cost as standard value iteration.
python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(R, T, gamma=0.99, n_iters=50):
    """
    Compute soft values for MaxEnt IRL.
    R: (n_states, n_actions) reward array
    T: (n_states, n_actions, n_states) transition tensor
    Returns: V (n_states,), Q (n_states, n_actions)
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Q_soft(s,a) = R(s,a) + gamma * E_T[V(s')]
        Q = R + gamma * np.einsum('ijk,k->ij', T, V)
        # V_soft(s) = log sum_a exp(Q(s,a))  [soft max over actions]
        V = logsumexp(Q, axis=1)
    return V, Q

def maxent_irl(demos, T, beta_fn, gamma=0.99, lr=0.01, n_iters=100):
    """
    Maximum entropy IRL via gradient ascent on log-likelihood.
    demos: list of trajectories (each: list of (s,a) pairs)
    T: transition model
    beta_fn: feature function mapping (s,a) -> feature vector
    """
    n_states, n_actions = T.shape[:2]
    n_features = beta_fn(0, 0).shape[0]
    phi = np.zeros(n_features)

    # Expert feature expectations (empirical mean over demos)
    mu_E = np.zeros(n_features)
    for traj in demos:
        for t, (s, a) in enumerate(traj):
            mu_E += (gamma**t) * beta_fn(s, a)
    mu_E /= len(demos)

    for _ in range(n_iters):
        # Build reward from current phi
        R = np.array([[np.dot(phi, beta_fn(s, a))
                        for a in range(n_actions)]
                       for s in range(n_states)])
        # Soft value iteration -> expected feature counts
        V, Q = soft_value_iteration(R, T, gamma)
        pi_soft = np.exp(Q - V[:, None])  # Boltzmann policy
        # Compute expected feature counts via state visitation
        mu_pi = compute_feature_expectations(pi_soft, T, beta_fn, gamma)
        # Gradient = expert feature counts - model feature counts
        phi += lr * (mu_E - mu_pi)
    return phi
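The listing above calls `compute_feature_expectations` without defining it. One possible version is sketched here, assuming a finite horizon, an initial state distribution `p0` (uniform by default), and the `(n_states, n_actions, n_states)` transition layout used above. The extra `p0` and `horizon` parameters are additions of this sketch.

```python
import numpy as np

def compute_feature_expectations(pi, T, beta_fn, gamma, p0=None, horizon=200):
    """Discounted expected feature counts of policy pi via forward propagation."""
    n_states, n_actions = pi.shape
    d = np.full(n_states, 1.0 / n_states) if p0 is None else np.asarray(p0, float)
    # F[s, a] is the feature vector of the pair (s, a)
    F = np.array([[beta_fn(s, a) for a in range(n_actions)]
                  for s in range(n_states)])
    mu = np.zeros(F.shape[-1])
    for t in range(horizon):
        sa = d[:, None] * pi                      # P(s_t = s, a_t = a)
        mu += (gamma ** t) * np.einsum('sa,sak->k', sa, F)
        d = np.einsum('sa,sap->p', sa, T)         # next-step state distribution
    return mu

# Hypothetical 2-state MDP: action a moves deterministically to state a
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0
pi = np.array([[0.0, 1.0], [0.0, 1.0]])           # always choose action 1
mu = compute_feature_expectations(pi, T, lambda s, a: np.eye(2)[s],
                                  gamma=0.5, p0=[1.0, 0.0], horizon=50)
```

Truncating at `horizon` drops a tail of size at most gamma^horizon / (1 − gamma) per unit of feature magnitude, so the cutoff should grow as gamma approaches 1.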
MaxEnt IRL — Reward Heatmap Recovery

Expert trajectories (teal lines) cross a 6×6 grid. Press Run MaxEnt IRL to infer the reward function. The heatmap shows recovered reward — bright = high reward. Expert trajectories should align with high-reward states.

What does the maximum entropy principle guarantee about the recovered reward distribution?

Chapter 6: GAIL — Generative Adversarial Imitation Learning

IRL is principled but expensive: every gradient step requires solving a full MDP with the current reward weights. Can we skip the reward-recovery step entirely and learn the policy directly from demonstrations?

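GAIL (Ho & Ermon, 2016) says yes. It frames imitation as a game between two networks, borrowing the GAN architecture: a policy π_θ that plays the generator and produces state-action pairs, and a discriminator D_φ that tries to tell the learner's pairs from the expert's.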

The minimax objective:

min_θ max_φ E_{π_θ}[log D_φ(s,a)] + E_{π*}[log(1 − D_φ(s,a))]
Where does the reward come from? In this objective D_φ(s,a) is the discriminator's probability that a pair came from the learner, and it implicitly defines a reward signal: r̃(s,a) = −log D_φ(s,a). When D confidently flags a pair as the learner's (D ≈ 1), the reward is close to zero. When the pair passes for the expert's (D ≈ 0, fooled), the reward is large. The policy maximizes this implicit reward via standard on-policy RL (e.g., TRPO, PPO).

The training loop alternates:

Collect rollouts
Run πθ in the environment for N steps
Update discriminator
Maximize log D(πθ) + log(1−D(π*)) via gradient ascent
Update policy
Run TRPO/PPO with reward r̃(s,a) = −log Dφ(s,a)
↻ until D outputs 0.5 everywhere (indistinguishable)
Connection to IRL: GAIL approximates the occupancy measure matching objective from IRL without explicitly solving an MDP at each step. The discriminator implicitly computes the density ratio between expert and learner state-action distributions. When both match, the discriminator is 50/50 and the implicit reward is the same constant (log 2) for every pair, so it no longer provides a learning signal and training is done. This is equivalent to finding the policy whose occupancy measure most closely matches the expert's.
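The density-ratio reading can be checked on a small discrete example. The sketch assumes D(s,a) is the probability that a pair came from the learner (the convention of the minimax objective above), for which the best possible discriminator is ρ_π / (ρ_π + ρ_E). The occupancy vectors are made up.

```python
import numpy as np

def optimal_discriminator(rho_learner, rho_expert):
    """D*(s,a) = rho_pi / (rho_pi + rho_E): probability the pair is the learner's."""
    return rho_learner / (rho_learner + rho_expert)

def implicit_reward(D, eps=1e-8):
    """Reward handed to the policy: large where the pair looks like the expert's."""
    return -np.log(D + eps)

# Hypothetical occupancy measures over three (s, a) pairs
rho_expert = np.array([0.5, 0.3, 0.2])
rho_learner = np.array([0.2, 0.3, 0.5])

D = optimal_discriminator(rho_learner, rho_expert)
r = implicit_reward(D)
D_matched = optimal_discriminator(rho_expert, rho_expert)
```

The reward is highest on the pair the learner visits less often than the expert and lowest on the pair it over-visits, which pushes the learner's occupancy toward the expert's. Once the two match, `D_matched` is 0.5 everywhere and every pair earns the same reward.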
GAIL on trajectories: The textbook notes that passing entire trajectories (not just single (s,a) pairs) through the discriminator lets it capture temporal features — like whether lane changes are smooth, not just whether a steering angle is plausible. In the driving domain this produces noticeably more natural-looking behavior.
python
# GAIL training loop (simplified, using PPO for policy update)
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
            nn.Sigmoid()   # output in (0,1): 1 = learner, 0 = expert
        )

    def forward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        return self.net(x).squeeze(-1)   # shape: (batch,)

def gail_step(disc, pi, expert_sa, policy_sa, disc_opt, pi_optimizer):
    # --- Discriminator update ---
    s_e, a_e = expert_sa               # expert (state, action)
    s_p, a_p = policy_sa              # learner (state, action)

    D_expert = disc(s_e, a_e)         # should be close to 0
    D_policy = disc(s_p, a_p)         # should be close to 1

    # Ascend E_pi[log D] + E_expert[log(1 - D)], matching the minimax objective;
    # separate means let the two batches differ in size
    disc_loss = -(torch.log(D_policy + 1e-8).mean() +
                  torch.log(1 - D_expert + 1e-8).mean())

    disc_opt.zero_grad()
    disc_loss.backward()
    disc_opt.step()

    # --- Implicit reward for policy update ---
    with torch.no_grad():
        D_vals = disc(s_p, a_p)
        # Low D:  pair passes for the expert's -> high reward
        # High D: discriminator caught us -> reward near zero
        rewards = -torch.log(D_vals + 1e-8)

    # --- Policy update via PPO/TRPO with rewards ---
    pi_optimizer.step(s_p, a_p, rewards)  # pseudocode
In GAIL, what signal does the policy use to improve itself?

Chapter 7: BC vs IRL — When to Use What

Every method in this chapter trades off cost, generalization, and complexity. Making a good choice requires understanding what each method actually buys you.

The fundamental split:

BC and DAgger/SMILe learn a policy from demonstrations. They're asking: "What action should I take?" They generalize to new states only as well as the policy function class allows.

IRL and GAIL learn an objective from demonstrations. They're asking: "What was the expert optimizing?" They can generalize to scenarios with different dynamics — new obstacles, different starting positions — because the reward function is transferable in a way that a policy is not.
Method | Expert Access | Error Scaling | Generalization | Cost
Behavioral Cloning | Batch only | O(T^2) cascading | Only covered states | Very low
DAgger | Interactive oracle | O(T) linear | States policy visits | Low–medium
SMILe | Interactive oracle | O(T) linear | States mixture visits | Medium
Max-Margin IRL | Batch only | Depends on RL | Any state via reward | High (QP + MDP)
MaxEnt IRL | Batch only | Depends on RL | Any state via reward | High (DP per step)
GAIL | Batch + RL env | Depends on RL | Any state via reward | High (RL + GAN)

The critical practical distinction is expert access. If you have a simulator or a teleoperation system where you can collect new expert labels on demand, DAgger is usually the right choice — simple, efficient, strong guarantees. If you only have a batch of recordings and need a transferable reward, MaxEnt IRL or GAIL are the tools.

Trajectory Quality vs Expert Queries

Compare how trajectory quality (lower cost = better) evolves with more expert data. BC saturates early. DAgger keeps improving. IRL methods require many queries but eventually produce a transferable reward. Adjust the horizon T to see how error scaling changes.

The "covariate shift" framing: Train/test mismatch in BC is called covariate shift — the distribution of states at test time (induced by the learned policy) differs from train time (induced by the expert). DAgger eliminates covariate shift by training directly on the test distribution. IRL sidesteps it by learning an objective that transfers, not a policy that overfits.
Which imitation learning approach is most likely to transfer well to a modified environment with different dynamics?

Chapter 8: Modern Methods — Diffusion Policy & ACT

The methods in Kochenderfer Ch. 18 establish the conceptual foundations. But the field has moved quickly. Two approaches now dominate practical robot imitation learning: Diffusion Policy and Action Chunking with Transformers (ACT).

Both are behavioral cloning at heart — they learn π(a|s) directly from expert demonstrations. What they improve is the expressiveness of the policy representation.

Diffusion Policy (Chi et al., 2023)

Standard BC parameterizes π(a|s) as a unimodal distribution (Gaussian, softmax). But expert behavior is often multimodal: when approaching a fork in the road, the expert turns left or right — never straight-ahead as a Gaussian mean would predict. Averaging modes produces a bad policy.

Diffusion Policy uses a denoising diffusion probabilistic model to represent the action distribution. Instead of outputting a single action, the policy runs a denoising chain: starting from Gaussian noise a_T, it iteratively denoises to produce the action a_0 (here t indexes denoising steps, not environment time):

a_{t−1} = α_t a_t − σ_t ε_θ(a_t, s, t)

where εθ is a neural network that predicts the noise. The data flow is:

Input
Observation s (image + proprioception): e.g., (224,224,3) RGB + (7,) joint positions
↓ ViT / ResNet encoder
Visual embedding
Shape (B, D_obs). Condition for denoising network.
↓ K denoising steps (K = 100 training, 10–20 inference)
Action chunk
Shape (B, H, D_action): H future actions at once. H = 8–16 for manipulation.
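The denoising update and the tensor shapes above can be traced with a toy loop. Everything specific here is an assumption of the sketch: the noise predictor is a hand-written stand-in rather than a trained network, the schedule is constant (α = 1, σ = 0.5), and no fresh noise is injected between steps, unlike a real diffusion sampler.

```python
import numpy as np

def denoise_action_chunk(eps_model, obs, shape, alphas, sigmas, rng):
    """Apply a <- alpha_k * a - sigma_k * eps_model(a, obs, k) from the last step down to 0."""
    a = rng.standard_normal(shape)                 # start from pure Gaussian noise
    for k in reversed(range(len(alphas))):
        a = alphas[k] * a - sigmas[k] * eps_model(a, obs, k)
    return a                                       # denoised action chunk

def toy_eps_model(a, obs, k):
    """Stand-in predictor: treats the offset from a target chunk as the noise."""
    target = np.broadcast_to(obs[:, None, :], a.shape)
    return a - target

B, H, D_action, K = 2, 8, 3, 20                    # batch, chunk length, action dim, steps
rng = np.random.default_rng(0)
obs = rng.uniform(-1.0, 1.0, size=(B, D_action))   # hypothetical embedding used as target
chunk = denoise_action_chunk(toy_eps_model, obs, (B, H, D_action),
                             alphas=[1.0] * K, sigmas=[0.5] * K, rng=rng)
```

With this stand-in each step halves the distance to the target, so the loop mostly illustrates the control flow and the (B, H, D_action) output shape, not the statistics of a learned sampler.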
Why action chunks? Predicting a single action at each step requires the policy to react to its own outputs at 50–100 Hz, which amplifies noise. Predicting H future actions at once (an "action chunk") smooths trajectories and lets the policy commit to coherent motion plans rather than reacting to every sensory fluctuation.
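A minimal sketch of chunked execution, assuming the simplest schedule: run the whole chunk open loop, then query the policy again. The policy, the step function, and the 1D state are placeholders.

```python
def run_chunked(policy, step_fn, obs, total_steps, chunk_len):
    """Execute a chunk-predicting policy, re-planning every chunk_len steps."""
    n_queries, executed = 0, []
    while len(executed) < total_steps:
        chunk = policy(obs)[:chunk_len]     # H actions predicted at once
        n_queries += 1
        for a in chunk:
            if len(executed) == total_steps:
                break
            obs = step_fn(obs, a)           # no policy call inside the chunk
            executed.append(a)
    return obs, executed, n_queries

# Hypothetical 1D example: the policy always plans four +1 moves
policy = lambda obs: [1, 1, 1, 1]
step_fn = lambda obs, a: obs + a
final_obs, executed, n_queries = run_chunked(policy, step_fn, 0,
                                             total_steps=10, chunk_len=4)
```

Ten environment steps cost three policy queries here instead of ten. The trade-off is that the agent cannot react to anything that happens inside a chunk, which is why chunk lengths stay short.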

Action Chunking with Transformers (ACT, Zhao et al., 2023)

ACT uses a conditional VAE architecture to handle the multimodality problem differently. During training, a Transformer encoder takes the current joint positions and the expert's action sequence and encodes a style vector z that captures which "mode" of behavior the expert used. During inference the encoder is discarded and z is set to the mean of the prior N(0, I), that is, the zero vector, so the decoder produces a deterministic action chunk.

ACT data flow (training):
Input: images [st] (shape (B,4,3,H,W)), joint states (B,14), future actions (B,H,14).
Style encoder: Transformer on joint+action sequence → μz, σz (shape (B,32)). Sample z via reparameterization.
Policy decoder: DETR-style Transformer with image tokens + z as memory → H predicted actions (B,H,14).
Loss: L1 on predicted vs expert actions + weighted KL(N(μz, σz²) || N(0,I)).
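The loss and the reparameterized sample in the training flow above have simple closed forms. In this NumPy sketch the reconstruction term is an absolute-error (L1) loss and `kl_weight` is an arbitrary placeholder; both are assumptions of the sketch, and a squared error drops in the same way.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

def sample_z(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * noise, with sigma = exp(log_var / 2)."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def cvae_loss(pred, target, mu, log_var, kl_weight=1.0):
    """Mean over the batch of per-sample L1 reconstruction plus weighted KL."""
    recon = np.mean(np.abs(pred - target), axis=(1, 2))
    return float(np.mean(recon + kl_weight * kl_to_standard_normal(mu, log_var)))

rng = np.random.default_rng(0)
actions = rng.standard_normal((4, 8, 14))          # (B, H, action_dim), made-up data
mu, log_var = np.zeros((4, 32)), np.zeros((4, 32))
z = sample_z(mu, log_var, rng)                      # shape (B, 32)
loss_perfect = cvae_loss(actions, actions, mu, log_var)
```

The KL term is zero exactly when the encoder outputs the prior (μ = 0, σ = 1), which is what makes substituting the prior mean for z at inference time reasonable.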

ACT data flow (inference):
z = 0 (zero-vector, takes the mode of the prior). Decoder: images + z → H actions. Execute first action, re-plan every H steps or on a fixed schedule.
Method | Multimodal? | Action repr. | Inference cost | Best for
BC (MLP) | No | Single action | Very low | Simple, unimodal tasks
Diffusion Policy | Yes | Action chunk | High (K denoising) | Dexterous manipulation
ACT | Yes (via CVAE) | Action chunk | Medium (Transformer) | Bimanual manipulation
DAgger + any | Depends on policy | Single or chunk | Depends | Sim-to-real transfer
The data hunger: Both Diffusion Policy and ACT typically need on the order of 50–200 expert demonstrations per task. For 25-DoF dexterous hands, collecting this data is itself a research challenge. The field is actively exploring few-shot imitation (10 demos), play data (uncurated exploration with recovery), and foundation models pre-trained on large robot datasets (RT-2, π0).
Why do Diffusion Policy and ACT use action chunks (predicting multiple future actions at once)?

Chapter 9: Connections & What's Next

Imitation learning sits at the intersection of supervised learning, reinforcement learning, and game theory. Every chapter in this book connects here.

From Chapter 17 (Model-Free RL):
GAIL's policy update uses TRPO or PPO from model-free RL. The discriminator reward replaces the environment reward. All the convergence theory from model-free RL applies, with the caveat that the reward function itself is non-stationary (it changes as the discriminator updates). This can cause instability — the same training difficulties as vanilla GANs.
From Chapter 9 (Online Planning):
IRL-recovered reward functions can be plugged into MCTS or other online planners for look-ahead planning. This is especially powerful when the reward is dense and smooth (MaxEnt IRL) rather than sparse.
From Chapter 14 (Policy Validation):
BC policies are easy to validate: just compare held-out state-action accuracy. DAgger policies are harder — their distribution shifts across iterations. IRL policies are hardest: you must validate the recovered reward and the downstream policy jointly. Reward model error compounds with RL optimization error.
From Chapter 20+ (POMDPs):
Most imitation learning assumes the expert observes the full state. When the expert only receives partial observations (a camera image), recovering a reward over the latent state requires POMDP-aware IRL. This is an active research area.

Active Research Frontiers

RLHF (Reinforcement Learning from Human Feedback) is closely related to IRL: human preference labels (A is better than B) replace trajectory demonstrations. It is widely used in LLM alignment (InstructGPT, Claude, Gemini). The reward model is typically a Bradley-Terry preference model, and the policy is fine-tuned against it, commonly via PPO.
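The Bradley-Terry model mentioned above turns a pair of scalar rewards into a preference probability. This is a minimal sketch with hypothetical function names; in practice the two rewards come from a learned network and the loss is averaged over many labeled comparisons.

```python
import math

def preference_probability(r_a, r_b):
    """Bradley-Terry: P(A preferred over B) = sigmoid(r_A - r_B)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def preference_loss(r_preferred, r_rejected):
    """Negative log-likelihood of one labeled comparison."""
    return -math.log(preference_probability(r_preferred, r_rejected))

p_tie = preference_probability(1.0, 1.0)
loss_agree = preference_loss(2.0, 0.0)      # reward model agrees with the label
loss_disagree = preference_loss(0.0, 2.0)   # reward model contradicts the label
```

Only the difference r_A − r_B matters, so the learned reward is identified up to an additive constant, an echo of the reward ambiguity that IRL faces.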

Play data (Lynch et al., 2020) replaces curated demonstrations with uncurated "play" — an operator freely manipulating objects with no specific goal. A goal-conditioned policy is then extracted. This dramatically reduces data collection cost.

Foundation models for imitation (π0, RT-2, RoboFlamingo) pre-train on internet-scale data then fine-tune with a handful of task-specific demonstrations. The pre-trained visual representations and language grounding dramatically reduce the number of demonstrations needed per task.

Method Landscape — Axes of Comparison

Where each method sits in (expert access required) × (generalization capability) space. Click a method to highlight its position.


"It is easier to show than to tell. The hardest part of imitation learning is realizing that the expert and the learner inhabit different worlds — and building a bridge between them."
— paraphrasing Stéphane Ross, DAgger paper, 2011
RLHF (used in LLM alignment) is most closely related to which imitation learning paradigm?