Teaching agents from expert demonstrations: from supervised copying, through iterative expert queries, to inferring hidden reward functions and adversarial imitation.
Imagine teaching a child to ride a bike by writing down a reward function. "You get +10 for moving forward, −5 for wobbling, −100 for falling..." You'd spend days arguing over the exact numbers, only to watch the child optimize your metric in ways you never intended — leaning against a wall to avoid falling while never actually moving.
Now imagine instead sitting on the back of their bike for a few laps and pedaling for them. That communicates the intent effortlessly. Expert demonstration is often vastly more natural than reward specification.
Imitation learning is the family of methods that learn from expert demonstrations — recorded (state, action) pairs from someone (or some system) that already knows how to do the task. No reward function required.
This chapter covers Chapter 18 of Kochenderfer et al. (pp. 355–373). There are six main approaches, organized by what they assume about expert access and what they try to learn:
The methods form a spectrum. As you move down, you get more powerful solutions to deeper problems — but at the cost of more expert access, more computation, or stronger assumptions.
You have a dataset D = {(s(1), a(1)), …, (s(m), a(m))} from an expert. The simplest thing imaginable: train a supervised classifier to predict the expert's action from the state.
For a discrete action space, the policy is a conditional distribution. The likelihood of the expert data under policy πθ is:
Maximize this (equivalently, minimize cross-entropy loss) using gradient ascent. For a tabular policy with no function approximation, this reduces to counting: π(a|s) = N(s,a) / ∑a' N(s,a'). For a neural network policy, it's standard backprop.
The resulting policy can be tested by rolling out the learned policy from the initial state. On step 1, it follows the expert closely. On step 2, it's slightly off. By step 20, it's in a state the expert never visited. The policy guesses. Guess wrong, drift further. Error cascades.
python # Behavioral cloning: pure supervised learning on expert demos import numpy as np def behavioral_clone_tabular(demos, n_states, n_actions): """ demos: list of (state, action) tuples from expert. Returns: policy pi[s][a] = probability of action a in state s. """ counts = np.ones((n_states, n_actions)) # Laplace smoothing for s, a in demos: counts[s, a] += 1 policy = counts / counts.sum(axis=1, keepdims=True) return policy # shape (n_states, n_actions) # Neural network variant: just replace with torch.nn.CrossEntropyLoss # loss = F.cross_entropy(pi_theta(states), expert_actions) # optimizer.zero_grad(); loss.backward(); optimizer.step() def rollout_bc_policy(policy, env, max_steps=200): """Roll out learned policy, collecting trajectory.""" s = env.reset() traj = [] for _ in range(max_steps): a = np.argmax(policy[s]) # greedy w.r.t. cloned policy sp, r, done = env.step(a) traj.append((s, a, r)) if done: break s = sp return traj
The expert (teal curve) drives through a 1D track. The clone (orange) copies it — but only has data near the teal regions. Drag the slider to add more expert demonstrations and watch coverage improve. Then press "Rollout" to see how the clone performs from its own starting point.
Behavioral cloning fails because the learned policy visits states the expert never showed us. The fix is surgical: go get data from those states. Run the learned policy, observe where it goes, ask the expert what they'd do there, add that to the dataset, retrain. Repeat.
This is DAgger (Ross et al., 2011) — Dataset Aggregation. At iteration N, the dataset DN contains expert labels for every state the learned policy has ever visited across all N iterations. A new policy is trained on this growing aggregate.
Where π*(s) is what the expert would do from state s. The next policy πN+1 is trained on all of DN+1.
DAgger's key theoretical result (Ross & Bagnell, 2011): the expected cost of the trained policy after N iterations is bounded by the expert's cost plus a term that shrinks as 1/N. Behavioral cloning's error bound grows quadratically with horizon T. DAgger's bound is linear in T. For long-horizon tasks, this is a qualitative improvement.
The algorithm is conceptually simple but the implementation detail matters: you collect states visited by πN and label them with expert actions, not the policy's own actions. This is critical — you're asking "what should have been done here," not "what did we do."
python def dagger(env, expert_policy, n_iters=10, n_rollout_steps=200): """ DAgger: Dataset Aggregation. expert_policy: callable s -> a (the oracle we can query anywhere) Returns: final trained policy after n_iters rounds. """ D = [] # aggregate dataset # Iteration 0: behavioral clone on expert demos D += collect_expert_rollout(expert_policy, env, n_rollout_steps) pi = behavioral_clone(D) for N in range(1, n_iters): # Step 1: roll out the CURRENT learned policy visited_states = [] s = env.reset() for _ in range(n_rollout_steps): a = pi.act(s) # learned policy chooses action visited_states.append(s) sp, _, done = env.step(a) if done: break s = sp # Step 2: query expert for EACH visited state new_labels = [(s, expert_policy(s)) for s in visited_states] # Step 3: aggregate and retrain D += new_labels # D grows every iteration pi = behavioral_clone(D) # retrain on full aggregate return pi
A 1D agent (orange dot) tries to follow the expert's sinusoidal target trajectory (teal). The gray region shows states covered by training data. Press DAgger Step to run one iteration: the learned policy rolls out (red path), the expert labels those states (new gray coverage), and the policy improves. Watch how coverage expands and trajectories converge.
DAgger retrains from scratch on a growing dataset each iteration. SMILe (Ross & Bagnell, 2010) takes a different path: instead of aggregating data, it aggregates policies.
At each iteration k, SMILe trains a component policy πk on data collected from a mixture of the expert and all previously trained policies. The final policy is a mixture of all component policies, weighted so that older policies contribute more (they were trained on the most representative states).
Concretely, define the mixing parameter β ∈ (0,1). At iteration k, the data-collection policy is:
The weight on the expert, (1−β)k, decays toward zero. Early on, most data comes from the expert. Later, most comes from the learned policy. This is a gradual handoff.
python def smile(env, expert_policy, n_iters=10, beta=0.5, steps_per_iter=200): """ SMILe: Stochastic Mixing Iterative Learning. beta: mixing decay parameter (0 < beta < 1). Returns: list of (weight, component_policy) pairs. """ components = [] # (weight, policy) list for k in range(n_iters): expert_weight = (1 - beta) ** k # decays to 0 learned_weight = 1 - expert_weight # Build data-collection mixture policy def mix_policy(s): if np.random.random() < expert_weight: return expert_policy(s) elif components: # sample from previous mixture w, p = zip(*components) w = np.array(w) / sum(w) chosen = np.random.choice(len(components), p=w) return components[chosen][1](s) else: return expert_policy(s) # fallback # Collect data under mix_policy data = rollout_with_expert_labels(mix_policy, expert_policy, env, steps_per_iter) # Train new component on this data new_component = behavioral_clone(data) components.append(((1-beta)**(n_iters-1-k), new_component)) return components # sample from this at test time
Adjust β to see how quickly the expert's influence fades. Lower β = expert stays relevant longer. Higher β = learned policy takes over faster. The right plot shows the per-iteration expert weight (1−β)k.
DAgger and SMILe require an interactive expert. What if you only have a batch of recordings — and you want to understand why the expert does what it does, not just copy the behavior?
Inverse reinforcement learning (IRL) assumes the expert was optimizing some unknown reward function Rφ. Given expert demonstrations, recover φ. Then train an agent to optimize Rφ — it will generalize even to states the expert never visited.
The textbook presents the maximum margin formulation. Assume a linear reward:
where β(s,a) ∈ {0,1}k is a binary feature vector indicating which features are "active" in this (state, action), and φ ∈ ℝk with ||φ||2 ≤ 1 are unknown reward weights. The discounted expected feature count under policy π is:
The expert's empirical feature count μE is estimated from the demonstrations by averaging across trajectories. The maximum margin objective finds weights φ such that the expert's expected reward exceeds every other policy's reward by the largest possible margin:
The maximum margin formulation reformulates as a quadratic program. Denoting the set of competing policies seen so far as {π1, …, πn}:
This is exactly a support vector machine formulation, with feature count differences playing the role of feature vectors and the expert-versus-policy gap playing the role of the class margin.
python from scipy.optimize import minimize import numpy as np def max_margin_irl(mu_E, solver, n_features, C=1.0, n_iters=10): """ Max-margin IRL via iterative constraint generation. mu_E: expert feature counts (shape: n_features,) solver: callable phi -> (policy, mu_pi) -- solves MDP and returns feature counts C: slack penalty """ phi = np.zeros(n_features) policies = [] # accumulated competitor policies for _ in range(n_iters): # Step 1: find best competitor policy under current phi pi_i, mu_pi = solver(phi) policies.append(mu_pi) # Step 2: solve the QP for new phi # minimize 0.5*||phi||^2 + C*sum(xi) s.t. phi^T(mu_E - mu_pi) >= 1 - xi def objective(x): phi_ = x[:n_features] xi_ = x[n_features:] return 0.5 * np.dot(phi_, phi_) + C * xi_.sum() def constraints(x): phi_, xi_ = x[:n_features], x[n_features:] return [np.dot(phi_, mu_E - mu_p) + xi_[i] - 1 for i, mu_p in enumerate(policies)] x0 = np.zeros(n_features + len(policies)) res = minimize(objective, x0, method='SLSQP', constraints=[{'type':'ineq','fun':constraints}]) phi = res.x[:n_features] phi /= max(1.0, np.linalg.norm(phi)) # project onto unit ball return phi # use phi to construct final reward and retrain
Max-margin IRL has an uncomfortable property: many reward functions can explain the same demonstrations equally well. If the expert always turns left at the intersection, does it prefer left turns? Avoid right turns? Minimize travel time? All three reward functions rank the expert's behavior first.
Maximum entropy IRL (Ziebart et al., 2008) resolves this ambiguity with an elegant principle: among all probability distributions over trajectories that match the observed feature expectations, choose the one with maximum entropy. This is the least committal choice — you're not inventing preferences the expert didn't reveal.
The maximum entropy trajectory distribution subject to feature matching constraints has a closed form:
where μ(τ) = ∑t γt β(st, at) is the trajectory's feature count and Z(φ) = ∑τ exp(φTμ(τ)) is the partition function. Higher-reward trajectories are exponentially more likely. This is a Boltzmann distribution over trajectories.
To learn φ, maximize the log-likelihood of expert trajectories under this model:
The gradient is exactly the feature matching condition: ∇φL = μE − EP[μ(τ)]. To compute EP[μ(τ)], you need to know the partition function — which requires a forward pass through the MDP (value iteration or dynamic programming). This is the expensive part.
python import numpy as np from scipy.special import logsumexp def soft_value_iteration(R, T, gamma=0.99, n_iters=50): """ Compute soft values for MaxEnt IRL. R: (n_states, n_actions) reward array T: (n_states, n_actions, n_states) transition tensor Returns: V (n_states,), Q (n_states, n_actions) """ n_states, n_actions = R.shape V = np.zeros(n_states) for _ in range(n_iters): # Q_soft(s,a) = R(s,a) + gamma * E_T[V(s')] Q = R + gamma * np.einsum('ijk,k->ij', T, V) # V_soft(s) = log sum_a exp(Q(s,a)) [soft max over actions] V = logsumexp(Q, axis=1) return V, Q def maxent_irl(demos, T, beta_fn, gamma=0.99, lr=0.01, n_iters=100): """ Maximum entropy IRL via gradient ascent on log-likelihood. demos: list of trajectories (each: list of (s,a) pairs) T: transition model beta_fn: feature function mapping (s,a) -> feature vector """ n_states, n_actions = T.shape[:2] n_features = beta_fn(0, 0).shape[0] phi = np.zeros(n_features) # Expert feature expectations (empirical mean over demos) mu_E = np.zeros(n_features) for traj in demos: for t, (s, a) in enumerate(traj): mu_E += (gamma**t) * beta_fn(s, a) mu_E /= len(demos) for _ in range(n_iters): # Build reward from current phi R = np.array([[np.dot(phi, beta_fn(s, a)) for a in range(n_actions)] for s in range(n_states)]) # Soft value iteration -> expected feature counts V, Q = soft_value_iteration(R, T, gamma) pi_soft = np.exp(Q - V[:, None]) # Boltzmann policy # Compute expected feature counts via state visitation mu_pi = compute_feature_expectations(pi_soft, T, beta_fn, gamma) # Gradient = expert feature counts - model feature counts phi += lr * (mu_E - mu_pi) return phi
Expert trajectories (teal lines) cross a 6×6 grid. Press Run MaxEnt IRL to infer the reward function. The heatmap shows recovered reward — bright = high reward. Expert trajectories should align with high-reward states.
IRL is principled but expensive: every gradient step requires solving a full MDP with the current reward weights. Can we skip the reward-recovery step entirely and learn the policy directly from demonstrations?
GAIL (Ho & Ermon, 2016) says yes. It frames imitation as a game between two networks, borrowing the GAN architecture:
The minimax objective:
The training loop alternates:
python # GAIL training loop (simplified, using PPO for policy update) import torch import torch.nn as nn class Discriminator(nn.Module): def __init__(self, state_dim, action_dim): super().__init__() self.net = nn.Sequential( nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1), nn.Sigmoid() # output in [0,1]: 1 = expert, 0 = policy ) def forward(self, s, a): x = torch.cat([s, a], dim=-1) return self.net(x) # shape: (batch,) def gail_step(disc, pi, expert_sa, policy_sa, disc_opt, pi_optimizer): # --- Discriminator update --- s_e, a_e = expert_sa # expert (state, action) s_p, a_p = policy_sa # learner (state, action) D_expert = disc(s_e, a_e) # should be close to 1 D_policy = disc(s_p, a_p) # should be close to 0 disc_loss = -(torch.log(D_expert + 1e-8) + torch.log(1 - D_policy + 1e-8)).mean() disc_opt.zero_grad() disc_loss.backward() disc_opt.step() # --- Implicit reward for policy update --- with torch.no_grad(): D_vals = disc(s_p, a_p) # High D: policy looks like expert -> low penalty # Low D: discriminator caught us -> high penalty rewards = -torch.log(D_vals + 1e-8) # --- Policy update via PPO/TRPO with rewards --- pi_optimizer.step(s_p, a_p, rewards) # pseudocode
Every method in this chapter trades off cost, generalization, and complexity. Making a good choice requires understanding what each method actually buys you.
| Method | Expert Access | Error Scaling | Generalization | Cost |
|---|---|---|---|---|
| Behavioral Cloning | Batch only | O(T2) cascading | Only covered states | Very low |
| DAgger | Interactive oracle | O(T) linear | States policy visits | Low–medium |
| SMILe | Interactive oracle | O(T) linear | States mixture visits | Medium |
| Max-Margin IRL | Batch only | Depends on RL | Any state via reward | High (QP + MDP) |
| MaxEnt IRL | Batch only | Depends on RL | Any state via reward | High (DP per step) |
| GAIL | Batch + RL env | Depends on RL | Any state via reward | High (RL + GAN) |
The critical practical distinction is expert access. If you have a simulator or a teleoperation system where you can collect new expert labels on demand, DAgger is usually the right choice — simple, efficient, strong guarantees. If you only have a batch of recordings and need a transferable reward, MaxEnt IRL or GAIL are the tools.
Compare how trajectory quality (lower cost = better) evolves with more expert data. BC saturates early. DAgger keeps improving. IRL methods require many queries but eventually produce a transferable reward. Adjust the horizon T to see how error scaling changes.
The methods in Kochenderfer Ch. 18 establish the conceptual foundations. But the field has moved quickly. Two approaches now dominate practical robot imitation learning: Diffusion Policy and Action Chunking with Transformers (ACT).
Both are behavioral cloning at heart — they learn π(a|s) directly from expert demonstrations. What they improve is the expressiveness of the policy representation.
Standard BC parameterizes π(a|s) as a unimodal distribution (Gaussian, softmax). But expert behavior is often multimodal: when approaching a fork in the road, the expert turns left or right — never straight-ahead as a Gaussian mean would predict. Averaging modes produces a bad policy.
Diffusion Policy uses a denoising diffusion probabilistic model to represent the action distribution. Instead of outputting a single action, the policy runs a denoising chain: starting from Gaussian noise aT, iteratively denoise to produce action a0:
where εθ is a neural network that predicts the noise. The data flow is:
ACT uses a conditional VAE architecture to handle the multimodality problem differently. During training, a Transformer encoder takes the full expert trajectory (joint positions + images) and encodes a style vector z that captures which "mode" of behavior the expert used. During inference, z is sampled from the prior N(0, I) — picking a behavior mode at random, like the expert did.
| Method | Multimodal? | Action repr. | Inference cost | Best for |
|---|---|---|---|---|
| BC (MLP) | No | Single action | Very low | Simple, unimodal tasks |
| Diffusion Policy | Yes | Action chunk | High (K denoising) | Dexterous manipulation |
| ACT | Yes (via CVAE) | Action chunk | Medium (Transformer) | Bimanual manipulation |
| DAgger + any | Depends on policy | Single or chunk | Depends | Sim-to-real transfer |
Imitation learning sits at the intersection of supervised learning, reinforcement learning, and game theory. Every chapter in this book connects here.
RLHF (Reinforcement Learning from Human Feedback) is an IRL variant where human preference labels (A is better than B) replace trajectory demonstrations. Used in every major LLM alignment method (InstructGPT, Claude, Gemini). The reward model is a Bradley-Terry preference model, and the policy is fine-tuned against it via PPO.
Play data (Lynch et al., 2020) replaces curated demonstrations with uncurated "play" — an operator freely manipulating objects with no specific goal. A goal-conditioned policy is then extracted. This dramatically reduces data collection cost.
Foundation models for imitation (π0, RT-2, RoboFlamingo) pre-train on internet-scale data then fine-tune with a handful of task-specific demonstrations. The pre-trained visual representations and language grounding dramatically reduce the number of demonstrations needed per task.
Where each method sits in (expert access required) × (generalization capability) space. Click a method to highlight its position.
Related micro-lessons:
"It is easier to show than to tell. The hardest part of imitation learning is realizing that the expert and the learner inhabit different worlds — and building a bridge between them."
— paraphrasing Stéphane Ross, DAgger paper, 2011