Imitation Learning — Kochenderfer et al., Chapter 18

Chapter 0: Learning from Demonstrations

Imagine teaching a child to ride a bike by writing down a reward function. "You get +10 for moving forward, −5 for wobbling, −100 for falling..." You'd spend days arguing over the exact numbers, only to watch the child optimize your metric in ways you never intended — leaning against a wall to avoid falling while never actually moving.

Now imagine instead sitting on the back of their bike for a few laps and pedaling for them. That communicates the intent effortlessly. Expert demonstration is often vastly more natural than reward specification.

Imitation learning is the family of methods that learn from expert demonstrations — recorded (state, action) pairs from someone (or some system) that already knows how to do the task. No reward function required.

The core promise: Instead of designing R(s,a) — which requires knowing what you want, formalized exactly — you collect a dataset D = {(s₁,a₁), (s₂,a₂), …} from an expert, and infer both the behavior and the objective from it. This works even when the reward is hard to articulate.

This lesson covers Chapter 18 of Kochenderfer et al. (pp. 355–373). It presents six methods, grouped below into three tiers by what they assume about expert access and what they try to learn:

Behavioral Cloning

Batch demos only. Treat it as supervised learning: π(a|s) ← expert demos. Simple, fast, fragile.

↓ needs more

DAgger / SMILe

Interactive expert. Run learned policy, query expert at visited states. Fixes distributional shift.

↓ needs even more

Inverse RL / GAIL

Inverse RL recovers the expert's hidden reward and then trains a policy against it. GAIL skips the explicit reward and trains the policy against a learned discriminator.

The methods form a spectrum. As you move down, you get more powerful solutions to deeper problems — but at the cost of more expert access, more computation, or stronger assumptions.

Why not just clone? The expert's demonstrations only cover the states the expert visits. As soon as a learner makes even a tiny error, it can drift into unfamiliar territory. Without expert coverage of those states, the policy has nothing to fall back on — and errors compound. This distributional mismatch is the central challenge the whole chapter is solving.

What is the core challenge that distinguishes imitation learning from standard supervised learning?

The learner's policy can drift into states the expert never demonstrated, causing compounding errors
Expert data always has labelling errors
Neural networks cannot be trained on state-action pairs

Chapter 1: Behavioral Cloning

You have a dataset D = {(s⁽¹⁾, a⁽¹⁾), …, (s⁽ᵐ⁾, a⁽ᵐ⁾)} from an expert. The simplest thing imaginable: train a supervised classifier to predict the expert's action from the state.

For a discrete action space, the policy is a conditional distribution. The likelihood of the expert data under policy π_θ is:

L(θ) = ∏_{i=1}^{m} π_θ(a⁽ⁱ⁾ | s⁽ⁱ⁾)

Maximize this (equivalently, minimize cross-entropy loss) using gradient ascent. For a tabular policy with no function approximation, this reduces to counting: π(a|s) = N(s,a) / ∑_a' N(s,a'). For a neural network policy, it's standard backprop.
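As a small check of the counting claim, the sketch below evaluates log L(θ) for a tabular policy and compares the counting estimate with an arbitrary alternative. The demo pairs and the helper names `counting_policy` and `log_likelihood` are illustrative assumptions, not taken from the textbook.

```python
import math
from collections import Counter, defaultdict

def counting_policy(demos):
    """Maximum-likelihood tabular policy: pi(a|s) = N(s,a) / sum_a' N(s,a')."""
    counts = defaultdict(Counter)
    for s, a in demos:
        counts[s][a] += 1
    return {s: {a: n / sum(c.values()) for a, n in c.items()}
            for s, c in counts.items()}

def log_likelihood(policy, demos):
    """log L(theta) = sum_i log pi(a_i | s_i)."""
    return sum(math.log(policy[s][a]) for s, a in demos)

# Hypothetical demos: state 0 is mostly "left", state 1 is always "right".
demos = [(0, "left"), (0, "left"), (0, "right"), (1, "right")]
pi_mle = counting_policy(demos)
pi_other = {0: {"left": 0.5, "right": 0.5}, 1: {"right": 1.0}}

ll_mle = log_likelihood(pi_mle, demos)
ll_other = log_likelihood(pi_other, demos)
print(pi_mle[0], ll_mle, ll_other)
```

The counting policy assigns 2/3 to "left" in state 0 and reaches a higher log-likelihood than the 50/50 alternative, which is what the maximum-likelihood argument predicts.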

The mountain car example (textbook p. 357): With 10 expert rollouts on mountain car, the policy learns correct (velocity, position) → action mappings in the region the expert visits. But the expert always starts near center with moderate velocity — in regions far from those, the policy assigns uniform random probability to actions. The agent looks fine on the training distribution but fails as soon as it wanders.

The resulting policy can be tested by rolling out the learned policy from the initial state. On step 1, it follows the expert closely. On step 2, it's slightly off. By step 20, it's in a state the expert never visited. The policy guesses. Guess wrong, drift further. Error cascades.

python
# Behavioral cloning: pure supervised learning on expert demos
import numpy as np

def behavioral_clone_tabular(demos, n_states, n_actions):
    """
    demos: list of (state, action) tuples from expert.
    Returns: policy pi[s][a] = probability of action a in state s.
    """
    counts = np.ones((n_states, n_actions))  # Laplace smoothing
    for s, a in demos:
        counts[s, a] += 1
    policy = counts / counts.sum(axis=1, keepdims=True)
    return policy  # shape (n_states, n_actions)

# Neural network variant: just replace with torch.nn.CrossEntropyLoss
# loss = F.cross_entropy(pi_theta(states), expert_actions)
# optimizer.zero_grad(); loss.backward(); optimizer.step()

def rollout_bc_policy(policy, env, max_steps=200):
    """Roll out learned policy, collecting trajectory."""
    s = env.reset()
    traj = []
    for _ in range(max_steps):
        a = np.argmax(policy[s])   # greedy w.r.t. cloned policy
        sp, r, done = env.step(a)
        traj.append((s, a, r))
        if done: break
        s = sp
    return traj
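To make the coverage problem concrete, here is a minimal sketch on a hypothetical 5-state chain where the expert is only ever observed in states 1 to 3. It re-implements the smoothed counting estimator under the name `clone_with_smoothing` (an assumed name) so the snippet runs on its own.

```python
import numpy as np

def clone_with_smoothing(demos, n_states, n_actions):
    """Tabular behavioral cloning with add-one (Laplace) smoothing."""
    counts = np.ones((n_states, n_actions))
    for s, a in demos:
        counts[s, a] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical data: the expert always picks action 1 ("right") in states 1..3.
demos = [(1, 1), (2, 1), (3, 1)] * 10
policy = clone_with_smoothing(demos, n_states=5, n_actions=2)

covered = sorted({s for s, _ in demos})
uncovered = [s for s in range(5) if s not in covered]
print("covered:", covered, "uncovered:", uncovered)
print(policy)
```

In the covered states the clone is confident about action 1, while in states 0 and 4 it falls back to the uniform prior introduced by smoothing, which is exactly where a rollout starts guessing.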

Behavioral Cloning — Data Coverage vs Error

The expert (teal curve) drives through a 1D track. The clone (orange) copies it — but only has data near the teal regions. Drag the slider to add more expert demonstrations and watch coverage improve. Then press "Rollout" to see how the clone performs from its own starting point.

In behavioral cloning, what determines whether the policy is reliable in a given state?

Whether the expert's demonstrations included that state or nearby states
The total number of gradient descent steps
Whether the action space is continuous

Chapter 2: DAgger — Dataset Aggregation

Behavioral cloning fails because the learned policy visits states the expert never showed us. The fix is surgical: go get data from those states. Run the learned policy, observe where it goes, ask the expert what they'd do there, add that to the dataset, retrain. Repeat.

This is DAgger (Ross et al., 2011) — Dataset Aggregation. At iteration N, the dataset D_N contains expert labels for every state the learned policy has ever visited across all N iterations. A new policy is trained on this growing aggregate.

D_{N+1} = D_N ∪ { (s, π^*(s)) : s ∈ rollout(π_N) }

Here π^*(s) is what the expert would do in state s. The next policy π_{N+1} is trained on all of D_{N+1}.

Why this fixes distributional shift: After enough iterations, D contains expert labels for every state the policy is likely to visit — because those states were visited, and labeled, in previous iterations. The policy's training distribution converges to match its test distribution (the states it actually visits).

DAgger's key theoretical result (Ross et al., 2011): after N iterations, the expected cost of the best policy found is bounded by the expert's cost plus a term that shrinks with N (roughly as 1/N, up to logarithmic factors, under the paper's assumptions) and a term for the imitation error the policy class cannot remove. Behavioral cloning's error bound grows quadratically with the horizon T, while DAgger's grows linearly in T. For long-horizon tasks, this is a qualitative improvement.
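A quick back-of-the-envelope comparison shows why the change from quadratic to linear matters. The functions below just evaluate ε·T² and ε·T with all constants dropped; the per-step error ε = 0.01 is an assumed value for illustration, and the outputs are orders of magnitude, not actual bounds.

```python
def bc_error_bound(eps, T):
    """Behavioral cloning: per-step error eps can compound, order eps * T^2."""
    return eps * T ** 2

def dagger_error_bound(eps, T):
    """DAgger: error grows on the order of eps * T."""
    return eps * T

for T in (10, 100, 1000):
    print(T, bc_error_bound(0.01, T), dagger_error_bound(0.01, T))
```

The ratio between the two is T itself, so the gap widens from 10x on a 10-step task to 1000x on a 1000-step task.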

The price of DAgger: It requires an interactive expert — someone (or something) that can be queried on arbitrary states during training, not just during data collection. For human experts this can be expensive. For experts that are themselves learned models or simulators, it's cheap. Much of modern robot learning uses DAgger with a teleoperated reference policy.

The algorithm is conceptually simple but the implementation detail matters: you collect states visited by π_N and label them with expert actions, not the policy's own actions. This is critical — you're asking "what should have been done here," not "what did we do."

python
def collect_expert_rollout(expert_policy, env, n_steps):
    """Roll out the expert and record (state, expert_action) pairs."""
    data, s = [], env.reset()
    for _ in range(n_steps):
        a = expert_policy(s)
        data.append((s, a))
        s, _, done = env.step(a)
        if done: break
    return data

def dagger(env, expert_policy, behavioral_clone, n_iters=10, n_rollout_steps=200):
    """
    DAgger: Dataset Aggregation.
    expert_policy: callable s -> a (the oracle we can query anywhere)
    behavioral_clone: callable D -> policy with an .act(s) method
        (any supervised learner; passed in so this function is self-contained)
    Returns: final trained policy after n_iters rounds.
    """
    D = []  # aggregate dataset

    # Iteration 0: behavioral clone on expert demos
    D += collect_expert_rollout(expert_policy, env, n_rollout_steps)
    pi = behavioral_clone(D)

    for N in range(1, n_iters):
        # Step 1: roll out the CURRENT learned policy
        visited_states = []
        s = env.reset()
        for _ in range(n_rollout_steps):
            a = pi.act(s)              # learned policy chooses action
            visited_states.append(s)
            sp, _, done = env.step(a)
            if done: break
            s = sp

        # Step 2: query expert for EACH visited state
        new_labels = [(s, expert_policy(s)) for s in visited_states]

        # Step 3: aggregate and retrain
        D += new_labels            # D grows every iteration
        pi = behavioral_clone(D)  # retrain on full aggregate

    return pi
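The loop above can be exercised end to end on a toy problem. The sketch below is a self-contained variant under stated assumptions: `Chain` is a hypothetical 1D environment whose resets cycle through start states, `fit` is a majority-vote tabular learner that defaults to the wrong action (0, left) in unseen states, and the expert always moves right.

```python
class Chain:
    """Hypothetical chain: states 0..n-1, action 1 moves right, 0 moves left."""
    def __init__(self, n=10):
        self.n, self.k, self.s = n, 0, 0
    def reset(self):
        self.s = self.k % (self.n - 1)   # cycle deterministically through starts
        self.k += 1
        return self.s
    def step(self, a):
        self.s = max(0, min(self.n - 1, self.s + (1 if a == 1 else -1)))
        return self.s, 0.0, self.s == self.n - 1

def expert(s):
    return 1  # the goal is the rightmost state, so always move right

def fit(D):
    """Tabular clone: majority expert label per state, action 0 if unseen."""
    votes = {}
    for s, a in D:
        votes.setdefault(s, []).append(a)
    table = {s: max(set(v), key=v.count) for s, v in votes.items()}
    return lambda s: table.get(s, 0)

def dagger(env, expert, D, n_iters=9, n_steps=20):
    pi = fit(D)
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(n_steps):
            D.append((s, expert(s)))      # label the visited state with the EXPERT's action
            s, _, done = env.step(pi(s))  # but act with the LEARNED policy
            if done:
                break
        pi = fit(D)                       # retrain on the full aggregate
    return pi, D

initial_demos = [(5, 1), (6, 1), (7, 1), (8, 1)]  # expert only seen from state 5 onward
pi_bc = fit(list(initial_demos))
pi_dagger, D = dagger(Chain(), expert, list(initial_demos))
print([pi_bc(s) for s in range(9)], [pi_dagger(s) for s in range(9)])
```

The plain clone moves left in every state below 5, while the DAgger policy has expert labels for every non-goal state it can start in and moves right everywhere.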

DAgger Showcase — Interactive Dataset Aggregation

A 1D agent (orange dot) tries to follow the expert's sinusoidal target trajectory (teal). The gray region shows states covered by training data. Press DAgger Step to run one iteration: the learned policy rolls out (red path), the expert labels those states (new gray coverage), and the policy improves. Watch how coverage expands and trajectories converge.

Teal = expert trajectory Red = current policy rollout Gray bands = expert-labeled states

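Convergence guarantee: After N DAgger iterations, the best policy found satisfies, informally, E[C(π_N)] ≤ C(π^*) + O(1/N), plus a term proportional to the best imitation error achievable within the policy class. This holds even if the environment dynamics are stochastic. When the policy class can represent the expert, the policy's cost approaches the expert's cost as iterations increase.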

What does DAgger add to the training dataset at each iteration?

Expert labels for the states the current learned policy actually visits during rollout
Random state-action pairs to improve exploration
More demonstrations from the expert's own trajectory

Chapter 3: SMILe — Stochastic Mixing Iterative Learning

DAgger retrains from scratch on a growing dataset each iteration. SMILe (Ross & Bagnell, 2010) takes a different path: instead of aggregating data, it aggregates policies.

At each iteration k, SMILe trains a component policy π_k on data collected from a mixture of the expert and all previously trained policies. The final policy is a mixture of all component policies, weighted so that older policies contribute more (they were trained on the most representative states).

Concretely, define the mixing parameter β ∈ (0,1). At iteration k, the data-collection policy is:

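π_mix^(k) = (1−β)^k · π^* + ∑_{i=1}^{k} β(1−β)^{i−1} · π̂_i

where π̂_i is the component policy trained at iteration i. The component weights sum to 1 − (1−β)^k, and older components receive the larger weights.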

The weight on the expert, (1−β)^k, decays toward zero. Early on, most data comes from the expert. Later, most comes from the learned policy. This is a gradual handoff.
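The handoff can be tabulated directly. The sketch below assumes the usual SMILe weighting, in which the i-th trained component receives weight β(1−β)^(i−1); the function name `smile_weights` is hypothetical.

```python
def smile_weights(beta, k):
    """Mixture weights after k components have been trained.

    Returns (expert_weight, [weight of component 1, ..., weight of component k]).
    """
    expert_w = (1 - beta) ** k
    comp_w = [beta * (1 - beta) ** (i - 1) for i in range(1, k + 1)]
    return expert_w, comp_w

expert_w, comp_w = smile_weights(beta=0.5, k=3)
print(expert_w, comp_w, expert_w + sum(comp_w))
```

The weights always sum to one, the expert's share shrinks geometrically, and the oldest component carries the largest share among the learned components.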

SMILe vs DAgger — two paths to the same destination:

DAgger: One policy at a time, retrained on everything seen so far. Simple, practical, widely used.

SMILe: Mixture of policies, each trained on progressively less expert-guided data. More principled convergence analysis, but harder to implement. The final SMILe policy is a stochastic mixture: at each step, it flips a (biased) coin to decide which component policy to follow.

Both achieve O(T) error scaling vs BC's O(T²). DAgger is usually preferred in practice for its simplicity.

python
def smile(env, expert_policy, n_iters=10, beta=0.5, steps_per_iter=200):
    """
    SMILe: Stochastic Mixing Iterative Learning.
    beta: mixing decay parameter (0 < beta < 1).
    Returns: list of (weight, component_policy) pairs.
    """
    components = []     # (weight, policy) list

    for k in range(n_iters):
        expert_weight = (1 - beta) ** k   # decays to 0
        learned_weight = 1 - expert_weight

        # Build data-collection mixture policy
        def mix_policy(s):
            if np.random.random() < expert_weight:
                return expert_policy(s)
            elif components:
                # sample from previous mixture
                w, p = zip(*components)
                w = np.array(w) / sum(w)
                chosen = np.random.choice(len(components), p=w)
                return components[chosen][1](s)
            else:
                return expert_policy(s)  # fallback

        # Collect data under mix_policy
        data = rollout_with_expert_labels(mix_policy, expert_policy,
                                           env, steps_per_iter)
        # Train new component on this iteration's data only
        new_component = behavioral_clone(data)
        # Component k gets weight beta * (1 - beta)^k, so OLDER components weigh more
        components.append((beta * (1 - beta) ** k, new_component))

    return components  # sample from this at test time
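At test time the returned list is used as a stochastic mixture. A minimal sketch of that sampling step follows; `act_with_mixture` is an assumed helper name, and the two constant policies exist only to make the behavior visible.

```python
import random

def act_with_mixture(components, s, rng=random):
    """Pick one component in proportion to its weight, then act with it.

    components: list of (weight, policy) pairs; weights need not be normalized.
    """
    weights = [w for w, _ in components]
    policies = [p for _, p in components]
    chosen = rng.choices(policies, weights=weights, k=1)[0]
    return chosen(s)

# Two hypothetical component policies that ignore the state.
components = [(3.0, lambda s: "a"), (1.0, lambda s: "b")]
rng = random.Random(0)
draws = [act_with_mixture(components, 0, rng) for _ in range(10000)]
frac_a = draws.count("a") / len(draws)
print(frac_a)
```

With weights 3 and 1, the first component is followed on roughly three quarters of the steps.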

SMILe Policy Mixing — Expert Weight Decay

Adjust β to see how quickly the expert's influence fades. Lower β = expert stays relevant longer. Higher β = learned policy takes over faster. The right plot shows the per-iteration expert weight (1−β)^k.

In SMILe, what happens to the expert's influence on data collection as iterations increase?

It decays geometrically: the expert contributes (1−β)^k of the data at iteration k
It stays constant at β throughout all iterations
It increases linearly as the policy improves

Chapter 4: Max-Margin IRL

DAgger and SMILe require an interactive expert. What if you only have a batch of recordings — and you want to understand why the expert does what it does, not just copy the behavior?

Inverse reinforcement learning (IRL) assumes the expert was optimizing some unknown reward function R_φ. Given expert demonstrations, recover φ, then train an agent to optimize R_φ. If the recovered reward captures the expert's intent, the agent can generalize even to states the expert never visited.

The textbook presents the maximum margin formulation. Assume a linear reward:

R_φ(s, a) = φ^T β(s, a)

where β(s,a) ∈ {0,1}^k is a binary feature vector indicating which features are "active" in this (state, action), and φ ∈ ℝ^k with ||φ||₂ ≤ 1 are unknown reward weights. The discounted expected feature count under policy π is:

μ_π = E_{τ∼π}[ ∑_{t=0}^{∞} γ^t β(s_t, a_t) ]
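For a finite set of recorded trajectories, the expectation becomes a plain average of per-trajectory discounted sums. A minimal sketch, assuming a made-up two-feature map `beta_fn` and two short demonstrations:

```python
import numpy as np

def discounted_feature_count(traj, beta_fn, gamma):
    """mu(tau) = sum_t gamma^t * beta(s_t, a_t) for one trajectory of (s, a) pairs."""
    return sum((gamma ** t) * beta_fn(s, a) for t, (s, a) in enumerate(traj))

def expert_feature_expectation(demos, beta_fn, gamma):
    """Empirical mu_E: average the discounted feature counts over trajectories."""
    return np.mean([discounted_feature_count(tr, beta_fn, gamma) for tr in demos],
                   axis=0)

# Hypothetical binary features: [state index is even, action equals 1].
def beta_fn(s, a):
    return np.array([float(s % 2 == 0), float(a == 1)])

demos = [[(0, 1), (1, 1)], [(2, 0), (3, 1)]]
mu_E = expert_feature_expectation(demos, beta_fn, gamma=0.5)
print(mu_E)
```

The first trajectory contributes [1, 1.5] and the second [1, 0.5], so the average is [1, 1].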

The expert's empirical feature count μ_E is estimated from the demonstrations by averaging across trajectories. The maximum margin objective finds weights φ such that the expert's expected reward exceeds every other policy's reward by the largest possible margin:

maximize_φ min_π [ φ^T (μ_E − μ_π) ]

Reading the objective: φ^Tμ_E is the expert's expected reward under weights φ. φ^Tμ_π is any other policy's expected reward. We want the expert to be best by as large a margin as possible. The "min over π" finds the competitor policy that is hardest to beat. This is a max-min optimization — a classic game between reward weights and competitor policies.
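When the competitor set is finite, the inner minimum can be evaluated by direct enumeration. The sketch below does this for one candidate φ; the feature counts are invented numbers and `margin` is an assumed helper name.

```python
import numpy as np

def margin(phi, mu_E, competitor_mus):
    """Return min over competitors of phi^T (mu_E - mu_pi), and the arg min."""
    gaps = [float(phi @ (mu_E - mu)) for mu in competitor_mus]
    worst = int(np.argmin(gaps))
    return gaps[worst], worst

mu_E = np.array([1.0, 0.2])                       # expert feature counts (assumed)
competitors = [np.array([0.5, 0.9]), np.array([0.9, 0.1])]
phi = np.array([1.0, 0.0])                        # candidate reward weights

m, idx = margin(phi, mu_E, competitors)
print(m, idx)
```

Under this φ the second competitor is the hardest to beat, with a gap of about 0.1; the outer maximization would look for a φ on the unit ball that widens that smallest gap.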

Solving it iteratively: The algorithm alternates: (1) given current φ, find the worst competitor π by solving the MDP; (2) given π, update φ to increase the margin. This is a form of constraint generation, similar to boosting. The textbook (Algorithm 18.4) adds slack variables to allow small violations when perfect separation is impossible.

The maximum margin formulation reformulates as a quadratic program. Denoting the set of competing policies seen so far as {π₁, …, π_n}:

minimize_φ ||φ||₂² + C ∑_i ξ_i subject to: φ^T(μ_E − μ_{π_i}) ≥ 1 − ξ_i, ξ_i ≥ 0

This is exactly a support vector machine formulation, with feature count differences playing the role of feature vectors and the expert-versus-policy gap playing the role of the class margin.

python
from scipy.optimize import minimize
import numpy as np

def max_margin_irl(mu_E, solver, n_features, C=1.0, n_iters=10):
    """
    Max-margin IRL via iterative constraint generation.
    mu_E: expert feature counts (shape: n_features,)
    solver: callable phi -> (policy, mu_pi) -- solves MDP and returns feature counts
    C: slack penalty
    """
    phi = np.zeros(n_features)
    policies = []   # accumulated competitor policies

    for _ in range(n_iters):
        # Step 1: find best competitor policy under current phi
        _, mu_pi = solver(phi)
        policies.append(mu_pi)   # only the competitor's feature counts are needed

        # Step 2: solve the QP for new phi
        # minimize 0.5*||phi||^2 + C*sum(xi)
        # s.t. phi^T(mu_E - mu_pi) >= 1 - xi_i  and  xi_i >= 0
        def objective(x):
            phi_ = x[:n_features]
            xi_  = x[n_features:]
            return 0.5 * np.dot(phi_, phi_) + C * xi_.sum()

        def constraints(x):
            phi_, xi_ = x[:n_features], x[n_features:]
            return np.array([np.dot(phi_, mu_E - mu_p) + xi_[i] - 1
                             for i, mu_p in enumerate(policies)])

        x0 = np.zeros(n_features + len(policies))
        # Slack variables must stay nonnegative; phi is unbounded at this stage
        bounds = [(None, None)] * n_features + [(0, None)] * len(policies)
        res = minimize(objective, x0, method='SLSQP', bounds=bounds,
                       constraints=[{'type':'ineq','fun':constraints}])
        phi = res.x[:n_features]
        phi /= max(1.0, np.linalg.norm(phi))  # project onto unit ball

    return phi  # use phi to construct final reward and retrain

What is the "maximum margin" in max-margin IRL?

The largest reward gap between the expert policy and any competing policy under the learned reward weights
The L2 norm of the reward weight vector φ
The maximum number of expert demonstrations allowed

Chapter 5: MaxEnt IRL

Max-margin IRL has an uncomfortable property: many reward functions can explain the same demonstrations equally well. If the expert always turns left at the intersection, does it prefer left turns? Avoid right turns? Minimize travel time? All three reward functions rank the expert's behavior first.

Maximum entropy IRL (Ziebart et al., 2008) resolves this ambiguity with an elegant principle: among all probability distributions over trajectories that match the observed feature expectations, choose the one with maximum entropy. This is the least committal choice — you're not inventing preferences the expert didn't reveal.

The maximum entropy trajectory distribution subject to feature matching constraints has a closed form:

P(τ ; φ) = (1 / Z(φ)) · exp( φ^T μ(τ) )

where μ(τ) = ∑_t γ^t β(s_t, a_t) is the trajectory's feature count and Z(φ) = ∑_τ exp(φ^Tμ(τ)) is the partition function. Higher-reward trajectories are exponentially more likely. This is a Boltzmann distribution over trajectories.

Feature matching, precisely: The maximum entropy solution satisfies E_P[μ(τ)] = μ_E. The expected feature counts under the learned distribution exactly match the expert's empirical feature counts. This is the constraint — MaxEnt picks the highest-entropy distribution that satisfies it. The reward weights φ are Lagrange multipliers for these constraints.

To learn φ, maximize the log-likelihood of expert trajectories under this model:

maximize_φ ∑_{τ∈D} log P(τ;φ) = ∑_{τ∈D} [φ^Tμ(τ) − log Z(φ)]

Per demonstration, the gradient is exactly the feature matching condition: ∇_φL = μ_E − E_P[μ(τ)]. Computing E_P[μ(τ)] requires the partition function, which in turn requires a forward pass through the MDP (value iteration or dynamic programming). This is the expensive part.
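When the set of possible trajectories is small enough to enumerate, the partition function is a plain sum and the gradient can be followed directly. The sketch below assumes a hypothetical world with three trajectories and two features; `maxent_fit` is an illustrative name, and real problems replace the enumeration with dynamic programming.

```python
import numpy as np

def maxent_fit(traj_features, mu_E, lr=0.5, n_iters=2000):
    """Gradient ascent on the average log-likelihood over an enumerated trajectory set.

    traj_features: (n_traj, k) array, row i = mu(tau_i) for every possible trajectory.
    """
    phi = np.zeros(traj_features.shape[1])
    for _ in range(n_iters):
        logits = traj_features @ phi
        p = np.exp(logits - logits.max())
        p /= p.sum()                              # P(tau; phi) = exp(phi^T mu(tau)) / Z(phi)
        phi += lr * (mu_E - p @ traj_features)    # mu_E - E_P[mu(tau)]
    logits = traj_features @ phi
    p = np.exp(logits - logits.max())
    return phi, p / p.sum()

F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # features of the three trajectories
mu_E = np.array([0.8, 0.5])                          # assumed expert feature counts
phi, p = maxent_fit(F, mu_E)
print(phi, p, p @ F)
```

At convergence the model's expected feature counts equal μ_E, and in this example the only distribution that does so is (0.5, 0.2, 0.3), which the Boltzmann form recovers.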

Soft value iteration: The MaxEnt forward pass uses "soft" Bellman equations in which the max over actions is replaced by log-sum-exp. The soft Q-values are Q_soft(s,a) = r(s,a) + γ ∑_{s'} T(s'|s,a) V_soft(s'), and the soft values are V_soft(s) = log ∑_a exp(Q_soft(s,a)). The resulting policy π(a|s) = exp(Q_soft(s,a) − V_soft(s)) gives the expected feature counts by dynamic programming, at roughly the cost of standard value iteration.

python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(R, T, gamma=0.99, n_iters=50):
    """
    Compute soft values for MaxEnt IRL.
    R: (n_states, n_actions) reward array
    T: (n_states, n_actions, n_states) transition tensor
    Returns: V (n_states,), Q (n_states, n_actions)
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Q_soft(s,a) = R(s,a) + gamma * E_T[V(s')]
        Q = R + gamma * np.einsum('ijk,k->ij', T, V)
        # V_soft(s) = log sum_a exp(Q(s,a))  [soft max over actions]
        V = logsumexp(Q, axis=1)
    return V, Q

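def compute_feature_expectations(pi, T, beta_fn, gamma, p0, horizon):
    """Discounted expected feature counts of policy pi, starting from p0."""
    n_states, n_actions = pi.shape
    feats = np.array([[beta_fn(s, a) for a in range(n_actions)]
                      for s in range(n_states)])       # (S, A, K)
    d, mu = p0.copy(), np.zeros(feats.shape[-1])
    for t in range(horizon):
        sa = d[:, None] * pi                           # P(s_t = s, a_t = a)
        mu += (gamma ** t) * np.einsum('sa,sak->k', sa, feats)
        d = np.einsum('sa,sap->p', sa, T)              # next-state distribution
    return mu

def maxent_irl(demos, T, beta_fn, gamma=0.99, lr=0.01, n_iters=100):
    """
    Maximum entropy IRL via gradient ascent on log-likelihood.
    demos: list of trajectories (each: list of (s,a) pairs, integer states)
    T: transition model, shape (n_states, n_actions, n_states)
    beta_fn: feature function mapping (s,a) -> feature vector
    """
    n_states, n_actions = T.shape[:2]
    n_features = beta_fn(0, 0).shape[0]
    phi = np.zeros(n_features)

    # Expert feature expectations (empirical mean over demos)
    mu_E = np.zeros(n_features)
    for traj in demos:
        for t, (s, a) in enumerate(traj):
            mu_E += (gamma**t) * beta_fn(s, a)
    mu_E /= len(demos)

    # Assumption: start-state distribution and horizon are estimated from the demos
    p0 = np.bincount([traj[0][0] for traj in demos], minlength=n_states) / len(demos)
    horizon = max(len(traj) for traj in demos)

    for _ in range(n_iters):
        # Build reward from current phi
        R = np.array([[np.dot(phi, beta_fn(s, a))
                        for a in range(n_actions)]
                       for s in range(n_states)])
        # Soft value iteration -> expected feature counts
        V, Q = soft_value_iteration(R, T, gamma)
        pi_soft = np.exp(Q - V[:, None])  # Boltzmann policy
        # Compute expected feature counts via state visitation
        mu_pi = compute_feature_expectations(pi_soft, T, beta_fn, gamma, p0, horizon)
        # Gradient = expert feature counts - model feature counts
        phi += lr * (mu_E - mu_pi)
    return phi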

MaxEnt IRL — Reward Heatmap Recovery

Expert trajectories (teal lines) cross a 6×6 grid. Press Run MaxEnt IRL to infer the reward function. The heatmap shows recovered reward — bright = high reward. Expert trajectories should align with high-reward states.

What does the maximum entropy principle guarantee about the recovered reward distribution?

It is the least assumptive distribution that matches the expert's expected feature counts
It is the distribution that assigns zero probability to the expert's trajectories
It maximizes the reward gap between the expert and all other policies

Chapter 6: GAIL — Generative Adversarial Imitation Learning

IRL is principled but expensive: every gradient step requires solving a full MDP with the current reward weights. Can we skip the reward-recovery step entirely and learn the policy directly from demonstrations?

GAIL (Ho & Ermon, 2016) says yes. It frames imitation as a game between two networks, borrowing the GAN architecture:

Discriminator D_φ(s,a): classifies whether (s,a) came from the expert or the learner. Output is a probability ∈ [0,1]. Expert pairs → 1, learner pairs → 0.
Policy π_θ: the "generator." Tries to produce (s,a) pairs that fool the discriminator into outputting 1.

The minimax objective:

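max_θ max_φ is not the game here; the discriminator maximizes and the policy minimizes:

min_θ max_φ E_{π*}[log D_φ(s,a)] + E_{π_θ}[log(1 − D_φ(s,a))]

With this ordering the objective matches the labeling above: D is pushed toward 1 on expert pairs and toward 0 on learner pairs.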

Where does the reward come from? The discriminator output D_φ(s,a) implicitly defines a reward signal: r̃(s,a) = log D_φ(s,a). When D assigns high probability to the learner's pair (D ≈ 1, fooled), the reward is close to zero. When D correctly identifies it as a learner pair (D ≈ 0), the reward is large and negative, penalizing the policy. Equivalently, −log D_φ(s,a) is a cost that the policy minimizes. The policy maximizes the implicit reward via standard on-policy RL (e.g., TRPO, PPO).

The training loop alternates:

Collect rollouts

Run π_θ in the environment for N steps

↓

Update discriminator

Maximize log D(π*) + log(1−D(π_θ)) via gradient ascent

↓

Update policy

Run TRPO/PPO with reward r̃(s,a) = log D_φ(s,a), i.e., minimize the cost −log D_φ(s,a)

↻ until D outputs 0.5 everywhere (indistinguishable)

Connection to IRL: GAIL approximates the occupancy measure matching objective from IRL without explicitly solving an MDP at each step. The discriminator implicitly computes the density ratio between expert and learner state-action distributions. When both match, the discriminator is 50/50 everywhere, the implicit reward is the same constant for every (s,a), and there is no further learning signal: training is done. This is equivalent to finding the policy whose occupancy measure most closely matches the expert's.
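The density-ratio claim can be checked on a tiny discrete example. The sketch below uses the labeling above (expert pairs map to 1) and the standard GAN result that the pointwise-optimal discriminator is ρ_E / (ρ_E + ρ_π); the occupancy values are invented, and `optimal_discriminator` is an assumed name.

```python
import numpy as np

def optimal_discriminator(rho_expert, rho_policy):
    """Pointwise maximizer of E_expert[log D] + E_policy[log(1 - D)]."""
    return rho_expert / (rho_expert + rho_policy)

# Hypothetical occupancy measures over three (s, a) pairs.
rho_E = np.array([0.5, 0.3, 0.2])
rho_pi = np.array([0.2, 0.3, 0.5])

D_star = optimal_discriminator(rho_E, rho_pi)
D_matched = optimal_discriminator(rho_E, rho_E)
print(D_star, D_matched)
```

Pairs the expert visits more often than the learner get D above 0.5, pairs the learner over-visits get D below 0.5, and once the two occupancy measures coincide the discriminator outputs 0.5 everywhere.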

GAIL on trajectories: The textbook notes that passing entire trajectories (not just single (s,a) pairs) through the discriminator lets it capture temporal features — like whether lane changes are smooth, not just whether a steering angle is plausible. In the driving domain this produces noticeably more natural-looking behavior.

python
# GAIL training loop (simplified, using PPO for policy update)
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
            nn.Sigmoid()   # output in [0,1]: 1 = expert, 0 = policy
        )

    def forward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        return self.net(x).squeeze(-1)   # shape: (batch,)

def gail_step(disc, pi, expert_sa, policy_sa, disc_opt, pi_optimizer):
    # --- Discriminator update ---
    s_e, a_e = expert_sa               # expert (state, action)
    s_p, a_p = policy_sa              # learner (state, action)

    D_expert = disc(s_e, a_e)         # should be close to 1
    D_policy = disc(s_p, a_p)         # should be close to 0

    disc_loss = -(torch.log(D_expert + 1e-8) +
                  torch.log(1 - D_policy + 1e-8)).mean()

    disc_opt.zero_grad()
    disc_loss.backward()
    disc_opt.step()

    # --- Implicit reward for policy update ---
    with torch.no_grad():
        D_vals = disc(s_p, a_p)
        # High D: policy looks like expert -> reward near 0
        # Low D:  discriminator caught us -> large negative reward
        rewards = torch.log(D_vals + 1e-8)

    # --- Policy update via PPO/TRPO with rewards ---
    # pseudocode: pi_optimizer stands in for a PPO/TRPO update routine
    pi_optimizer.step(s_p, a_p, rewards)

In GAIL, what signal does the policy use to improve itself?

A discriminator-derived signal: the policy is penalized by −log D(s,a) whenever the discriminator can tell it apart from the expert
The ground-truth environment reward function
The direct log-likelihood of the expert's actions

Chapter 7: BC vs IRL — When to Use What

Every method in this chapter trades off cost, generalization, and complexity. Making a good choice requires understanding what each method actually buys you.

The fundamental split:

BC and DAgger/SMILe learn a policy from demonstrations. They're asking: "What action should I take?" They generalize to new states only as well as the policy function class allows.

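IRL and GAIL reason about an objective rather than copying actions. They're asking: "What was the expert optimizing?" A reward recovered by IRL can transfer to scenarios with different dynamics (new obstacles, different starting positions) in a way that a policy cannot. GAIL matches the expert's occupancy measure in the same spirit, but it does not return a standalone reward, so moving to a new environment means training again there.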

Method	Expert Access	Error Scaling	Generalization	Cost
Behavioral Cloning	Batch only	O(T²) cascading	Only covered states	Very low
DAgger	Interactive oracle	O(T) linear	States policy visits	Low–medium
SMILe	Interactive oracle	O(T) linear	States mixture visits	Medium
Max-Margin IRL	Batch only	Depends on RL	Any state via reward	High (QP + MDP)
MaxEnt IRL	Batch only	Depends on RL	Any state via reward	High (DP per step)
GAIL	Batch + RL env	Depends on RL	Any state via reward	High (RL + GAN)

The critical practical distinction is expert access. If you have a simulator or a teleoperation system where you can collect new expert labels on demand, DAgger is usually the right choice — simple, efficient, strong guarantees. If you only have a batch of recordings and need a transferable reward, MaxEnt IRL or GAIL are the tools.

Trajectory Quality vs Expert Queries

Compare how trajectory quality (lower cost = better) evolves with more expert data. BC saturates early. DAgger keeps improving. IRL methods need more computation rather than more expert queries, but eventually produce a transferable reward. Adjust the horizon T to see how error scaling changes.

The "covariate shift" framing: Train/test mismatch in BC is called covariate shift — the distribution of states at test time (induced by the learned policy) differs from train time (induced by the expert). DAgger eliminates covariate shift by training directly on the test distribution. IRL sidesteps it by learning an objective that transfers, not a policy that overfits.

Which imitation learning approach is most likely to transfer well to a modified environment with different dynamics?

IRL / GAIL, because they learn a transferable reward function rather than a policy tied to specific states
Behavioral cloning, because it copies the expert's exact actions
DAgger, because it queries the expert on all visited states

Chapter 8: Modern Methods — Diffusion Policy & ACT

The methods in Kochenderfer Ch. 18 establish the conceptual foundations. But the field has moved quickly. Two approaches that are now widely used in practical robot imitation learning are Diffusion Policy and Action Chunking with Transformers (ACT).

Both are behavioral cloning at heart — they learn π(a|s) directly from expert demonstrations. What they improve is the expressiveness of the policy representation.

Diffusion Policy (Chi et al., 2023)

Standard BC parameterizes π(a|s) as a unimodal distribution (Gaussian, softmax). But expert behavior is often multimodal: when approaching a fork in the road, the expert turns left or right — never straight-ahead as a Gaussian mean would predict. Averaging modes produces a bad policy.

Diffusion Policy uses a denoising diffusion probabilistic model to represent the action distribution. Instead of outputting a single action, the policy runs a denoising chain: starting from Gaussian noise a_T, iteratively denoise to produce action a₀:

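a_{t−1} = α_t a_t − σ_t ε_θ(a_t, s, t)

where ε_θ is a neural network that predicts the noise and t indexes the denoising step, not environment time. Stochastic samplers such as DDPM also add fresh Gaussian noise at every step except the last. The data flow is: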

Input

Observation s (image + proprioception): e.g., (224,224,3) RGB + (7,) joint positions

↓ ViT / ResNet encoder

Visual embedding

Shape (B, D_obs). Condition for denoising network.

↓ K denoising steps (K = 100 training, 10–20 inference)

Action chunk

Shape (B, H, D_action): H future actions at once. H = 8–16 for manipulation.

Why action chunks? Predicting a single action at each step requires the policy to react to its own outputs at 50–100 Hz, which amplifies noise. Predicting H future actions at once (an "action chunk") smooths trajectories and lets the policy commit to coherent motion plans rather than reacting to every sensory fluctuation.
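The denoising chain that produces a chunk can be sketched without a trained network. The code below uses the standard DDPM ancestral sampler with a linear noise schedule and noise variance equal to β_k, which is one common choice and not necessarily the scheduler used by Diffusion Policy. The function `oracle_eps` is a stand-in for ε_θ: it is the exact noise prediction for a dataset in which every expert chunk equals `target`, and a real policy would also condition on the observation embedding.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Start from Gaussian noise and denoise for len(betas) steps."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    a = rng.standard_normal(shape)                     # initial pure-noise chunk
    for k in range(len(betas) - 1, -1, -1):
        eps = eps_model(a, k)                          # predicted noise at step k
        a = (a - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        if k > 0:                                      # no fresh noise on the last step
            a = a + np.sqrt(betas[k]) * rng.standard_normal(shape)
    return a

H, D_action = 8, 2                                     # chunk length and action dimension
target = np.tile(np.linspace(0.0, 1.0, H)[:, None], (1, D_action))
betas = np.linspace(1e-4, 0.02, 100)
alpha_bars = np.cumprod(1.0 - betas)

def oracle_eps(a, k):
    """Ideal noise prediction when the data distribution is a point mass at target."""
    return (a - np.sqrt(alpha_bars[k]) * target) / np.sqrt(1.0 - alpha_bars[k])

chunk = ddpm_sample(oracle_eps, (H, D_action), betas, np.random.default_rng(0))
print(chunk.shape, np.abs(chunk - target).max())
```

The sampler returns a whole (H, D_action) chunk in one call, and with the oracle predictor it lands on the target chunk, which is the behavior a well-trained ε_θ approximates for each mode of the expert data.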

Action Chunking with Transformers (ACT, Zhao et al., 2023)

ACT uses a conditional VAE architecture to handle the multimodality problem differently. During training, a Transformer encoder takes the current joint positions and the expert's action chunk and encodes a style vector z that captures which "mode" of behavior the expert used. During inference, the encoder is discarded and z is set to the mean of the prior N(0, I), i.e., zero, so decoding is deterministic.

ACT data flow (training):
Input: images [s_t] (shape (B,4,3,H,W)), joint states (B,14), future actions (B,H,14).
Style encoder: Transformer on joint+action sequence → μ_z, σ_z (shape (B,32)). Sample z via reparameterization.
Policy decoder: DETR-style Transformer with image tokens + z as memory → H predicted actions (B,H,14).
Loss: L1 on predicted vs expert actions + KL(N(μ_z, σ_z²) || N(0,I)).

ACT data flow (inference):
z = 0 (zero-vector, takes the mode of the prior). Decoder: images + joint states + z → H actions. Either execute the chunk and re-plan every H steps, or query the policy at every step and average the overlapping predictions (temporal ensembling).
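One way to combine overlapping chunks at inference is an exponentially weighted average of every prediction made for the current step, with the oldest prediction weighted most. The sketch below assumes weights exp(−m·i) with i = 0 for the oldest prediction; `predict_chunk` is a hypothetical stand-in for the decoder that takes the step index instead of an observation, and environment interaction is omitted.

```python
import numpy as np

def run_with_temporal_ensembling(predict_chunk, T, H, m=0.1):
    """predict_chunk(t) -> array (H, action_dim) with actions for steps t..t+H-1."""
    buffer = {}                                   # step -> predictions, oldest first
    executed = []
    for t in range(T):
        chunk = predict_chunk(t)                  # query the policy at every step
        for i in range(H):
            buffer.setdefault(t + i, []).append(chunk[i])
        preds = np.stack(buffer.pop(t))           # all predictions made for step t
        w = np.exp(-m * np.arange(len(preds)))    # index 0 = oldest prediction
        executed.append((w[:, None] * preds).sum(axis=0) / w.sum())
    return np.stack(executed)

# Dummy policy whose prediction for step t+i is always the value t+i.
T, H = 10, 4
actions = run_with_temporal_ensembling(
    lambda t: np.array([[float(t + i)] for i in range(H)]), T, H)
print(actions[:, 0])
```

Because every prediction for a given step agrees in this dummy setup, the averaged action equals that shared value; with a real policy the average smooths out disagreement between successive chunks.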

Method	Multimodal?	Action repr.	Inference cost	Best for
BC (MLP)	No	Single action	Very low	Simple, unimodal tasks
Diffusion Policy	Yes	Action chunk	High (K denoising)	Dexterous manipulation
ACT	Yes (via CVAE)	Action chunk	Medium (Transformer)	Bimanual manipulation
DAgger + any	Depends on policy	Single or chunk	Depends	Sim-to-real transfer

The data hunger: Both Diffusion Policy and ACT typically use on the order of 50–200 expert demonstrations per task. For 25-DoF dexterous hands, collecting this data is itself a research challenge. The field is actively exploring few-shot imitation (10 demos), play data (uncurated exploration with recovery), and foundation models pre-trained on large robot datasets (RT-2, π₀).

Why do Diffusion Policy and ACT use action chunks (predicting multiple future actions at once)?

Chunking smooths trajectories and lets the policy commit to coherent motion plans, reducing noise amplification from step-by-step reactive control
The robot hardware requires all future commands in advance
It reduces the total amount of expert demonstration data needed

Chapter 9: Connections & What's Next

Imitation learning sits at the intersection of supervised learning, reinforcement learning, and game theory. Every chapter in this book connects here.

From Chapters 12 and 17 (Policy Gradient Optimization, Model-Free RL):
GAIL's policy update uses TRPO or PPO, which are model-free policy-gradient methods. The discriminator reward replaces the environment reward. The usual convergence arguments carry over only loosely, because the reward function itself is non-stationary (it changes as the discriminator updates). This can cause instability, with the same training difficulties as vanilla GANs.

From Chapter 9 (Online Planning):
IRL-recovered reward functions can be plugged into MCTS or other online planners for look-ahead planning. This is especially powerful when the reward is dense and smooth (MaxEnt IRL) rather than sparse.

From Chapter 14 (Policy Validation):
BC policies are easy to validate: just compare held-out state-action accuracy. DAgger policies are harder — their distribution shifts across iterations. IRL policies are hardest: you must validate the recovered reward and the downstream policy jointly. Reward model error compounds with RL optimization error.

From Chapter 19 onward (POMDPs):
Most imitation learning assumes the expert observes the full state. When the expert receives only partial observations (a camera image), recovering a reward over the latent state requires POMDP-aware IRL. This is an active research area.

Active Research Frontiers

RLHF (Reinforcement Learning from Human Feedback) is closely related to IRL: human preference labels (A is better than B) replace trajectory demonstrations. It is widely used in LLM alignment, with InstructGPT as the canonical example. The reward model is typically a Bradley-Terry preference model, and the policy is typically fine-tuned against it via PPO.
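The Bradley-Terry model turns a pair of scalar rewards into a preference probability, P(A preferred over B) = sigmoid(r_A − r_B), and the reward model is trained by minimizing the negative log-likelihood of the human labels. A minimal sketch with hand-picked reward values (the function names are illustrative):

```python
import math

def preference_prob(r_a, r_b):
    """Bradley-Terry: P(A preferred over B) = sigmoid(r_A - r_B)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def preference_loss(r_a, r_b):
    """Negative log-likelihood of the label 'A is better than B'."""
    return -math.log(preference_prob(r_a, r_b))

print(preference_prob(1.0, 1.0))   # equal rewards: no preference
print(preference_prob(2.0, 0.0))   # A scored higher: A is likely preferred
print(preference_loss(2.0, 0.0), preference_loss(0.0, 2.0))
```

The loss is small when the reward model already ranks the preferred response higher and large when it ranks the pair the wrong way round, which is the gradient signal that shapes the reward model.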

Play data (Lynch et al., 2020) replaces curated demonstrations with uncurated "play" — an operator freely manipulating objects with no specific goal. A goal-conditioned policy is then extracted. This dramatically reduces data collection cost.

Foundation models for imitation (π₀, RT-2, RoboFlamingo) pre-train on internet-scale data then fine-tune with a handful of task-specific demonstrations. The pre-trained visual representations and language grounding dramatically reduce the number of demonstrations needed per task.

Method Landscape — Axes of Comparison

Where each method sits in (expert access required) × (generalization capability) space. Click a method to highlight its position.

Related micro-lessons:

Ch 17: Model-Free Methods (Q-learning and Sarsa, the value-based side of model-free RL)
Ch 12: Policy Gradient Optimization (TRPO/PPO used inside GAIL's policy update)
Ch 14: Policy Validation (how to evaluate imitation learning policies)
Ch 19: Beliefs (POMDPs) (what happens when the state isn't fully observable)

"It is easier to show than to tell. The hardest part of imitation learning is realizing that the expert and the learner inhabit different worlds — and building a bridge between them."
— paraphrasing Stéphane Ross, DAgger paper, 2011

RLHF (used in LLM alignment) is most closely related to which imitation learning paradigm?

Inverse RL: a reward model is learned from human feedback (preferences), and a policy is optimized against it
Behavioral cloning: the LLM is trained to clone human-written text
DAgger: human feedback labels states visited by the LLM during generation
