Hierarchy in Imitation and Reinforcement Learning

Roadmap

What You'll Master

01 The Long-Horizon Problem
02 The Main Idea: Two Policies
03 Rolling Out a Hierarchy
04 Why Hierarchy Helps
05 Hierarchy vs. Alternatives
06 Goal Representations
07 Supervising Each Level
08 Subgoal Transitions
09 Imitation with Language Goals
10 Imitation with Image Goals
11 Hierarchical RL
12 Summary & Cheat Sheet

Chapter 01

The Long-Horizon Problem

Think about what "cook focaccia" actually requires. You need to gather flour, yeast, olive oil, salt. Mix them. Knead the dough. Let it rise for an hour. Dimple it with your fingers. Drizzle more oil. Bake at 425°F for 20 minutes. Each of those steps is itself a multi-step procedure. A single flat policy would need to learn the entire sequence, from opening the flour bag to pulling the bread from the oven, as one monolithic behavior.

This isn't unique to cooking. Consider these long-horizon tasks:

• Fix a bug that's causing neural network training loss to explode — requires reading logs, forming hypotheses, editing code, re-running training, interpreting results.

• Drive to Yosemite — navigate out of the parking lot, merge onto the highway, handle 3 hours of varied road conditions, find the campsite.

• Give feedback on an 8-page report — read each section, note issues, check citations, write coherent comments, re-read for consistency.

Why Are These Hard?

Three compounding difficulties make long-horizon tasks brutal for any learning algorithm:

1. Enormous state space. A 100-step task visits far more distinct states than a 10-step task, and the number of possible trajectories grows exponentially with horizon length. With |A| possible actions per step and a horizon of T, there are |A|^T possible action sequences. Even with just 4 actions and a horizon of 50, that's 4⁵⁰ ≈ 1.27 × 10³⁰ possible trajectories.

Trajectory Space
|trajectory space| = |A|^T
With 4 actions and T = 50: 4⁵⁰ ≈ 1.27 × 10³⁰ trajectories

2. Compounding errors. At each timestep, the policy has some probability ε of making a mistake. If mistakes are independent across steps and any single mistake ruins the rollout, the probability of a perfect rollout over T steps is (1-ε)^T. For ε = 0.05 and T = 100, that's 0.95¹⁰⁰ ≈ 0.006: less than a 1% chance of success even with a policy that is 95% accurate at every step.

Compounding Errors
P(perfect trajectory) = (1 - ε)^T
ε = 0.05, T = 100: (0.95)¹⁰⁰ ≈ 0.0059, about 0.6% success

3. Getting stuck. Long tasks create many opportunities for the agent to enter irrecoverable states. A robot that drops a key ingredient halfway through cooking has no path to success. In RL terms, many states become "absorbing failure states" from which no sequence of actions can reach the goal.
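The two numbers quoted in items 1 and 2 are easy to check yourself. Here is a minimal Python sketch that recomputes them; the function names are illustrative only, and the success probability assumes independent per-step errors where any single mistake is fatal.

```python
def num_trajectories(num_actions: int, horizon: int) -> int:
    """Number of distinct action sequences of length `horizon`."""
    return num_actions ** horizon


def perfect_rollout_prob(eps: float, horizon: int) -> float:
    """Probability of zero mistakes, assuming independent per-step errors."""
    return (1.0 - eps) ** horizon


if __name__ == "__main__":
    print(f"{num_trajectories(4, 50):.2e}")          # roughly 1.27e+30
    print(f"{perfect_rollout_prob(0.05, 100):.4f}")  # roughly 0.0059
```

Try halving ε to 0.025: the success probability at T = 100 rises to only about 8%, which is why shrinking the per-step error alone does not rescue long horizons.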

The Core Challenge

Long-horizon tasks are hard not because any individual step is hard, but because the number of steps compounds every source of error. A policy that's excellent at each individual action can still fail catastrophically over 100 steps. We need a way to structure the problem so we don't require perfection over the entire horizon.

[Interactive figure: Compounding Error Probability, with an adjustable error rate ε (default 0.05)]

Chapter 02

The Main Idea: Two Policies

Here's the key insight: "bake a cheesecake" is impossibly hard as one monolithic task. But "buy ingredients" is manageable. "Go to the store" is easier still. "Walk to the door" is trivial. "Take a step" is reflexive. We naturally decompose complex tasks into a hierarchy of subtasks, each at a different level of abstraction.

The hierarchical policy framework makes this decomposition explicit with two (or more) policies:

Definition

High-Level Policy — π_HL

Takes the current observation o_t and the overall task description (e.g., "bake a cheesecake"), and outputs an intermediate goal g_t. This goal is a subcommand like "preheat the oven" or "mix the batter." The high-level policy runs at a low frequency — it only decides on new goals occasionally.

Definition

Low-Level Policy — π_LL

Takes the current observation o_t and the current goal g_t (from the high-level policy), and outputs primitive actions a_t (e.g., joint torques, motor commands). The low-level policy runs at a high frequency — it produces actions every timestep.

Hierarchical Policy Structure
g_t ∼ π_HL(· | o_t, task) ← slow, strategic
a_t ∼ π_LL(· | o_t, g_t) ← fast, reactive

The variable g_t goes by many names in the literature: subgoals, subtasks, skills, options, or high-level actions. They all refer to the same thing: an intermediate target that the low-level policy tries to accomplish.

Key Insight

You can have more than two levels of hierarchy. A three-level system might have a strategic planner ("make dinner"), a tactical planner ("prepare the pasta sauce"), and a motor controller ("rotate wrist 15 degrees"). But two levels are by far the most common in practice, and the principles generalize.

[Interactive figure: Hierarchical Policy Architecture]

Chapter 03

Rolling Out a Hierarchy

How does a hierarchical policy actually execute? Let's trace through the rollout procedure step by step. This is the algorithm that converts two trained policies into actual behavior.

Algorithm: Hierarchical Policy Rollout

Initialize: Observe initial observation o₁ at t = 1
Plan: Query high-level policy for a goal: g_t ∼ π_HL(· | o_t, task prompt)
Execute: Until a new goal is selected (see Chapter 8 for when):
1. Execute action: a_t ∼ π_LL(· | o_t, g_t)
2. Step environment: t ← t + 1, observe new o_t
Replan: When the goal is "done" (or after n steps), return to the Plan step and query π_HL for a new goal
Terminate: When the overall task is complete or time runs out

Notice the two timescales. The high-level policy might produce a new goal every 20 timesteps, while the low-level policy acts every single timestep. If the robot runs at 50 Hz control, the high-level policy might replan at 2.5 Hz — every 400ms it decides "what should we be doing now?", while the low-level policy handles the moment-to-moment motor commands at full speed.
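To make the two loops concrete, here is a minimal Python sketch of the rollout. It assumes fixed-interval replanning (one of several possible transition rules) and a toy 1-D world; `rollout`, `toy_hl`, `toy_ll`, and `toy_step` are all made-up names for illustration, not part of any real system.

```python
from typing import Callable, List, Optional, Tuple

Obs = int      # toy observation: position on a line
Goal = int     # toy goal: target position
Action = int   # toy action: -1, 0, or +1


def rollout(
    pi_hl: Callable[[Obs, str], Goal],
    pi_ll: Callable[[Obs, Goal], Action],
    step: Callable[[Obs, Action], Obs],
    o: Obs,
    task: str,
    horizon: int,
    replan_every: int,
) -> Tuple[List[Obs], int]:
    """Run the two-level loop with fixed-interval replanning."""
    observations, hl_queries = [o], 0
    g: Optional[Goal] = None
    for t in range(horizon):
        if t % replan_every == 0:      # slow loop: pick a new goal
            g = pi_hl(o, task)
            hl_queries += 1
        a = pi_ll(o, g)                # fast loop: act every timestep
        o = step(o, a)
        observations.append(o)
    return observations, hl_queries


# Toy instantiation (hypothetical stand-ins for trained policies).
def toy_hl(o: Obs, task: str) -> Goal:
    return o + 5                       # "move five cells to the right"


def toy_ll(o: Obs, g: Goal) -> Action:
    return (g > o) - (g < o)           # take one step toward the goal


def toy_step(o: Obs, a: Action) -> Obs:
    return o + a


obs, n_hl = rollout(toy_hl, toy_ll, toy_step, o=0, task="go right",
                    horizon=100, replan_every=20)
print(n_hl, len(obs) - 1)              # high-level queries vs. low-level steps
```

With `replan_every=20`, the high-level policy is queried 5 times over 100 low-level steps, which is the 20:1 ratio described above.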

A Concrete Example

Imagine a robot making a sandwich. Here's what the rollout looks like:

Time	π_HL Output (g_t)	π_LL Output (a_t)
t=1	"pick up bread"	move arm to bread, close gripper
t=20	(same goal, still executing)	lift arm, move to plate
t=35	"spread peanut butter"	pick up knife, scoop, spread motion
t=70	"place second slice"	reach for bread, position, release

The high-level policy only makes 3 decisions. The low-level policy makes 70+ decisions. Each low-level decision is simple (move arm slightly), while each high-level decision is strategic (what to do next).

Data Flow

Input to π_HL: RGB images from workspace cameras (e.g., 3×256×256), task string "make sandwich" → Output: language string or latent vector representing "pick up bread"

Input to π_LL: RGB images + joint positions (e.g., 7-DOF arm state) + goal "pick up bread" → Output: 7 joint target positions (or velocities) at 50 Hz

Chapter 04

Why Hierarchy Helps

Why not just train a single flat policy to go directly from observations to actions? Hierarchy provides four concrete advantages:

1. Supervision Signal Decomposition

A flat policy for "bake a cheesecake" gets a single reward at the very end: did the cheesecake turn out well? That's an incredibly sparse signal over hundreds of actions. With hierarchy, the low-level policy gets much denser feedback: did you successfully "pick up the bowl"? Did you "crack the egg"? Each subtask provides its own supervision signal, making learning tractable.

2. Knowledge Sharing Across Subtasks

The skill "pick up an object" appears in hundreds of tasks: cooking, cleaning, organizing, assembling. With a flat policy, the robot learns picking-up from scratch for each task. With hierarchy, the low-level policy learns "pick up" once and the high-level policy reuses it across many different tasks. This is the compositional benefit of hierarchy.

Compositionality

If you have K subtask skills and N tasks, a flat approach needs to learn all N tasks separately. A hierarchical approach needs K skills + N high-level plans. When K ≪ N (many tasks reuse the same skills), hierarchy wins dramatically.

3. Structured Exploration (RL)

In reinforcement learning, exploration is critical. A flat policy explores in action space — randomly wiggling each joint. A hierarchical policy explores in goal space — "what if I tried going to that location?" or "what if I picked up the blue object instead?" Goal-space exploration is far more structured and covers meaningful variations much faster than random motor noise.

4. Practical Latency

Running a large VLM (vision-language model) takes time — maybe 200ms per inference. If you need actions at 50 Hz (20ms), you can't run the VLM every timestep. Hierarchy solves this naturally: the expensive VLM runs as the high-level policy at 5 Hz, while a fast, small neural network runs as the low-level policy at 50 Hz.
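The arithmetic behind that split is worth spelling out. This small Python sketch uses the numbers from the paragraph above; the function names are illustrative.

```python
def ll_steps_per_hl_query(hl_latency_s: float, control_hz: float) -> int:
    """Low-level control steps that elapse during one high-level inference."""
    return round(hl_latency_s * control_hz)


def max_hl_rate_hz(hl_latency_s: float) -> float:
    """Upper bound on the replanning rate if inferences run back to back."""
    return 1.0 / hl_latency_s


print(ll_steps_per_hl_query(0.200, 50))   # 10 control steps per VLM call
print(max_hl_rate_hz(0.200))              # 5.0 Hz
```

So a 200ms model can never drive a 50 Hz loop directly, but it can comfortably hand a new goal to the fast policy every 10 control steps.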

Latency Example — Physical Intelligence π0.5

π_HL: Pre-trained VLA (vision-language-action model), runs infrequently. Outputs subtask predictions like "pick up the pillow." Inference: ~300ms.

π_LL: Diffusion action expert (300M params), runs at full control frequency. Takes subtask + observation → continuous joint actions. Inference: ~5ms.

The high-level VLA decides what to do. The lightweight low-level model handles how to do it, fast enough for real-time robot control.

Chapter 05

Hierarchy vs. Alternatives

Hierarchy isn't the only way to handle long-horizon tasks. Let's compare three approaches head-to-head.

Option A: Flat Policy

Flat Policy
a_t ∼ π(· | o_t, task)
One policy does everything: observe → act. Simple but struggles with long horizons.

A single neural network maps observations directly to actions. No intermediate representation. This is the simplest approach, and it works well for short tasks (pick up an object, push a button). But for tasks requiring 50+ steps with diverse subtasks, flat policies struggle with the compounding error and sparse reward problems we discussed.

Option B: Chain-of-Thought Policy

Chain-of-Thought Policy
(g_t, a_t) ∼ π(· | o_t, task)
A single policy outputs both the subgoal reasoning and the action. One network, two heads.

A single policy outputs both a subgoal (as a "thought") and an action. Think of this as the policy narrating its reasoning: "I should pick up the bread [thought] → move arm right 5cm [action]." This can benefit from the same structured supervision as hierarchy, and it's simpler to implement since there's only one model.

The Comparison

Property	Hierarchy	Flat	Chain-of-Thought
Number of models	2 (HL + LL)	1	1
Benefits from subtask labels	Yes	No	Yes
Different frequencies	Yes — key advantage	N/A	Possible but awkward
50 Hz real-time control	Yes (small LL)	Only if fast	Expensive at every step
Modularity	High — swap HL/LL independently	None	Low

No Conclusive Winner (Yet)

There is no conclusive empirical comparison between hierarchy and chain-of-thought for robotics. Chain-of-thought may be too computationally expensive for 50 Hz control loops. But for slower domains (planning, coding, document editing), chain-of-thought may be sufficient. This is an active research question. The latency argument is the strongest practical case for hierarchy in robotics.

Chapter 06

Goal Representations

The high-level policy outputs some goal g_t. But what is g_t concretely? A string? A vector? An image? This is one of the most important design decisions in hierarchical systems, and the best answer depends on the domain.

Language Goals

g_t is a natural language string: "pick up the bowl", "move arm to the left", "open the drawer." This is the most intuitive representation. Humans naturally describe subgoals in language, and we can collect language-annotated demonstrations easily. The high-level policy is a VLM that outputs text; the low-level policy is a language-conditioned behavior cloning (LCBC) policy.

Pros: Interpretable, easy to debug ("why did the robot fail? it chose 'pour into bag' when it should have chosen 'position bag under nozzle'"), easy to provide human corrections.

Cons: Requires segmented demonstrations with language annotations. Language can be ambiguous ("move it over there").

Image Goals

g_t is a target image showing what the world should look like after completing the subtask. The high-level policy is an image editing model: given the current image, it produces a modified image showing the desired next state. The low-level policy is a goal-image-conditioned policy.

Pros: No need for language annotations — can learn from unlabeled video. Can leverage human demonstration videos. Captures spatial details that language misses ("move the bowl 3 inches left" is hard to say precisely but easy to show in an image).

Cons: Harder to interpret. Image generation can introduce artifacts. The low-level policy needs to be robust to imperfect goal images.

State Goals

g_t is a target state vector: a desired (x, y, z) position of the robot's end-effector, or a desired relative position between the agent and objects. This is the most compact representation.

Pros: Very precise. Easy to measure completion (did the agent reach the target state?). Works naturally with hindsight relabeling in RL.

Cons: Limited expressiveness — "approach from the left" vs. "approach from the right" may have the same target position but very different required trajectories. Only works when you can define a meaningful state space.

Properties of Good Goal Representations

Three Properties

1. Expressive — can communicate many different low-level behaviors. If the goal space is too small, the high-level policy can't express important distinctions.

2. Structured — similar behaviors should have similar goal representations. "Pick up red bowl" and "pick up blue bowl" should be close in goal space, not arbitrary distant points.

3. Appropriate abstraction level — not so high-level that the low-level policy can't figure out what to do ("make it work"), and not so low-level that the high-level policy has to micromanage ("rotate joint 3 by 0.02 radians").

Goal Type	Expressiveness	Structure	Abstraction	Best For
Language	High	Inherited from LLMs	Natural	Manipulation with supervision
Image	Very high	Pixel space	Flexible	When unlabeled video available
State vector	Moderate	Euclidean	Low-level	Navigation, simple RL tasks
Latent vector	Depends on training	Learned	Learned	When no natural goal exists
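One way to see how these goal types differ in practice is to write them down as data structures. The sketch below is an assumption about how you might represent them in Python; the class names, fields, and the 1 cm tolerance are illustrative, not taken from any of the systems discussed.

```python
import math
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class LanguageGoal:
    text: str                              # e.g. "pick up the bowl"


@dataclass(frozen=True)
class ImageGoal:
    pixels: bytes                          # encoded target image
    height: int
    width: int


@dataclass(frozen=True)
class StateGoal:
    target: Tuple[float, float, float]     # desired (x, y, z) position

    def reached(self, state: Tuple[float, float, float],
                tol: float = 0.01) -> bool:
        """State goals make completion easy to measure: a distance check."""
        return math.dist(state, self.target) <= tol


Goal = Union[LanguageGoal, ImageGoal, StateGoal]


def describe(goal: Goal) -> str:
    """Human-readable summary, handy when debugging high-level decisions."""
    if isinstance(goal, LanguageGoal):
        return f"language: {goal.text}"
    if isinstance(goal, ImageGoal):
        return f"image: {goal.height}x{goal.width}"
    return f"state: {goal.target}"


print(describe(LanguageGoal("pick up the bowl")))
print(StateGoal((0.3, 0.1, 0.2)).reached((0.3, 0.1, 0.205)))
```

Notice that only `StateGoal` gets a cheap `reached` check. For language and image goals, deciding whether the subtask is finished needs a learned model, which matters when you choose a transition rule.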

Chapter 07

Supervising Each Level

You have two policies. Both need training. But each depends on the other, creating a chicken-and-egg problem.

The Low-Level Policy

The low-level policy is trained to accomplish a goal g — not the original long-horizon task. Its loss function looks like imitation learning (for IL) or goal-conditioned RL (for RL), conditioned on g:

Low-Level Objective (Imitation Learning)
L_LL(θ_LL) = E_{(o, a, g) ~ D} [ -log π_LL(a | o, g; θ_LL) ]
Standard behavior cloning loss, but conditioned on the goal g from segmented demos.

Critical question: For which distribution of goals should we train the low-level policy? Ideally, whatever goals the high-level policy will actually output. But we haven't trained the high-level policy yet, so we don't know what goals it will produce. In imitation learning, we can use the goals from the demonstration data. In RL, we often train on a broad distribution of goals (uniformly sampled from a goal space, or using hindsight relabeling).
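To ground the low-level objective, here is a minimal Python sketch of the goal-conditioned behavior cloning loss. It assumes a discrete action space with a softmax policy, purely to keep the example dependency-free; real systems use continuous actions and a neural network. `policy_logits` is a hypothetical stand-in for π_LL.

```python
import math
from typing import Callable, List, Sequence, Tuple

# One training example: (observation, action index, goal). Toy types.
Example = Tuple[Sequence[float], int, str]


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)                        # subtract the max for stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def bc_loss(
    policy_logits: Callable[[Sequence[float], str], Sequence[float]],
    batch: Sequence[Example],
) -> float:
    """Mean of -log pi_LL(a | o, g) over the batch."""
    total = 0.0
    for o, a, g in batch:
        total -= log_softmax(policy_logits(o, g))[a]
    return total / len(batch)


# An untrained policy that ignores its inputs and is uniform over 4 actions.
uniform = lambda o, g: [0.0, 0.0, 0.0, 0.0]
batch = [([0.1], 2, "pick up bread"), ([0.3], 0, "pick up bread")]
print(bc_loss(uniform, batch))             # equals log(4) for a uniform policy
```

The only difference from ordinary behavior cloning is the extra argument g: the same observation can demand different actions under different goals.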

The High-Level Policy

The high-level policy is trained to accomplish the original long-horizon task. Its performance depends on the low-level policy actually being able to execute the goals it outputs.

High-Level Objective (Imitation Learning)
L_HL(θ_HL) = E_{(o, task, g*) ~ D} [ -log π_HL(g* | o, task; θ_HL) ]
Predict the correct subgoal g* given observation o and the task description. g* comes from demonstration labels.

Critical question: With which low-level policy should we train and evaluate the high-level policy? Ideally, with the learned low-level policy it will be deployed with. But that policy may not be trained yet, or may still be imperfect.

The Chicken-and-Egg Resolution

Key Resolution

The two policies can be trained separately first — LL on goal-reaching, HL on goal-prediction — then at least one must be adapted to the deficiencies of the other. If the LL policy is bad at certain goals, the HL policy should learn to avoid outputting those goals (or vice versa). Ideally, both are fine-tuned jointly, but single-sided adaptation also works.

A natural insight: LLMs are often good high-level policies! They already understand task decomposition from pre-training on internet text. A frozen LLM can output reasonable subgoal sequences for many tasks. You then only need to train the low-level policy to follow those subgoals.

Why Not End-to-End with Latent Goals?

You might think: why not let the goal representation be latent (a learned vector) and train everything end-to-end? The answer: the result is a flat policy. If g_t has no structure and both policies are trained jointly through g_t, gradient descent will simply learn to pass whatever information is useful through g_t, which is exactly what a flat policy with an internal hidden state does. The benefits of hierarchy come from giving g_t meaning (language, images, states) and training the levels with separate objectives.

Research Wisdom

"Think carefully about where the benefits are coming from!" If you can't articulate why your hierarchical design is better than a flat policy with the same capacity, it probably isn't. The benefits come from: (1) meaningful goal representations that enable transfer, (2) separate supervision signals at each level, (3) different operating frequencies, or (4) pre-trained components (like LLMs as high-level policies).

Chapter 08

Subgoal Transitions

We've been saying "until a new goal is selected" without specifying when the high-level policy gets re-queried. This turns out to be a critical design decision with no universally right answer.

Option 1: Completion-Based Transitions

Re-query π_HL when the low-level policy has completed the current goal g_t. This requires estimating progress toward the goal — for example, a learned classifier that predicts "has the subtask been achieved?"

Advantage: Ideal in principle — each goal runs exactly as long as it needs to.

Problems: Hard to estimate when a subtask is "done." If the completion estimator is wrong, the agent gets perpetually stuck — it thinks it hasn't finished, so it never moves on. Also: what if the agent makes a mistake that requires undoing a previous goal? The completion-based approach can't easily handle backtracking.

Fatal Errors

Errors in estimating completion are more fatal than errors in subgoal prediction. If the high-level policy picks a slightly wrong subgoal, the low-level policy might still do something reasonable. But if the transition detector never fires, the agent repeats the same subtask forever.

Option 2: Fixed-Interval Transitions

Re-query π_HL every n timesteps, regardless of progress. For example, replan every 20 steps (400ms at 50 Hz control).

Advantage: Simple. No completion estimation needed. Robust to getting stuck — even if the current subtask is going badly, the agent will replan in n steps.

Problems: If the high-level policy picks a wrong subtask, the low-level policy executes bad actions for n full steps before any correction. Small n means more compute spent on the (expensive) high-level policy. Large n means slower recovery from errors. There's a tradeoff between compute cost (more HL queries) and delay (longer before correction).

[Interactive figure: Subgoal Transition Timing, with an adjustable replanning interval n (default 15)]

Property	Completion-Based	Fixed-Interval
Implementation complexity	High (need classifier)	Low (just count steps)
Failure mode	Agent gets stuck forever	n steps of wrong actions
Failure severity	Fatal — no recovery	Recoverable at next replan
Compute cost	Variable (adaptive)	Predictable (every n steps)
Used in practice	Less common	More common (e.g., π0.5)
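Both transition rules fit in a few lines of Python. In the sketch below, `done_prob` stands for the output of a hypothetical completion classifier, and the 0.5 threshold is an arbitrary choice. Passing both arguments gives a completion rule with a timeout, which is one possible mitigation rather than something prescribed above.

```python
from typing import Optional


def should_replan(
    steps_on_goal: int,
    *,
    interval: Optional[int] = None,
    done_prob: Optional[float] = None,
    done_threshold: float = 0.5,
) -> bool:
    """Return True when the high-level policy should be re-queried.

    interval:  fixed-interval rule, replan once `interval` steps have elapsed.
    done_prob: completion rule, replan once the classifier's estimate that
               the subtask is finished reaches `done_threshold`.
    """
    if done_prob is not None and done_prob >= done_threshold:
        return True
    if interval is not None and steps_on_goal >= interval:
        return True
    return False


# Fixed interval: fires at step 20 no matter how the subtask is going.
print(should_replan(20, interval=20))
# Completion only: a classifier stuck at 0.1 never fires, however long we wait.
print(should_replan(10_000, done_prob=0.1))
```

The second call shows the failure mode from the table: with a completion rule alone, a pessimistic classifier blocks replanning indefinitely.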

Chapter 09

Hierarchical Imitation with Language Goals

Let's ground these abstractions in a real system. "Yell At Your Robot" (Shi, Hu, Zhao et al., RSS 2024) and "Hi Robot" (Shi et al., ICML 2025) are hierarchical imitation learning systems that use language as the goal representation.

The Data

Training data consists of segmented demonstrations with language annotations. Each demonstration is a video of a robot performing a task, manually segmented into subtask segments, each labeled with a language description:

Segment	Language Label	Robot Data
1	"pick up the bag"	Images + joint positions for 30 timesteps
2	"pick up the metal scoop"	Images + joint positions for 25 timesteps
3	"scoop M&Ms"	Images + joint positions for 40 timesteps
4	"pour into the bag"	Images + joint positions for 35 timesteps
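A segmented demonstration like this one yields training data for both levels at once. The Python sketch below shows one way to split it, using strings as stand-ins for images, joint positions, and actions; the `Segment` class and `to_training_sets` function are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    label: str                 # language annotation for the subtask
    observations: List[str]    # stand-ins for images + joint positions
    actions: List[str]         # stand-ins for joint targets


def to_training_sets(demo: List[Segment]):
    """Split one segmented demo into low-level and high-level examples."""
    ll_examples: List[Tuple[str, str, str]] = []   # (o, a, g) for pi_LL
    hl_examples: List[Tuple[str, str]] = []        # (o, g*) for pi_HL
    for seg in demo:
        for o, a in zip(seg.observations, seg.actions):
            ll_examples.append((o, a, seg.label))
            hl_examples.append((o, seg.label))
    return ll_examples, hl_examples


# Segment lengths copied from the table above.
lengths = {"pick up the bag": 30, "pick up the metal scoop": 25,
           "scoop M&Ms": 40, "pour into the bag": 35}
demo = [Segment(lbl, [f"o{i}" for i in range(n)], [f"a{i}" for i in range(n)])
        for lbl, n in lengths.items()]
ll, hl = to_training_sets(demo)
print(len(ll), len(hl))        # one example per timestep for each level
```

Every timestep contributes an (o, a, g) tuple for the low-level policy and an (o, g*) pair for the high-level policy, so a single annotation pass supervises both.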

The Architecture

The system uses an ALOHA bimanual robot workcell with multiple cameras. The two policies are:

System Component

High-Level Language Policy

Input: RGB images from workspace cameras (current observation)

Output: Language string (the next subtask to execute)

Architecture: Vision-Language Model — processes images through a vision encoder, concatenates with previous language context, produces next language command via autoregressive decoding.

System Component

Low-Level LCBC Policy

Input: RGB images + joint positions + language goal string

Output: Joint target positions (7-DOF per arm × 2 arms = 14 values)

Architecture: Language-Conditioned Behavior Cloning (LCBC) — a policy that takes a language instruction and produces motor actions via a Transformer or diffusion policy.

Training: High-Level DAgger

Both policies are initially trained with supervised learning on the segmented demonstrations. But can we improve the high-level policy after deployment? Yes — with DAgger applied at the high level.

Recall that DAgger collects on-policy data by running the learned policy, then getting expert corrections. For hierarchical systems, the expert corrections are language commands. A human watches the robot and says "no, pick up the scoop first" when the high-level policy makes a wrong decision.

Algorithm: High-Level DAgger for Hierarchical IL

Deploy: Run the hierarchical policy (HL + LL) on the robot
Intervene: Human provides language corrections when HL makes wrong decisions. Language corrections override the high-level policy's prediction — the LL policy receives the human's command instead
Freeze LL: Keep the low-level LCBC policy frozen (no updates to motor control)
Update HL: Fine-tune the high-level policy by supervising on the human language corrections (treat corrections as ground-truth labels for the observations where they occurred)
Repeat: Deploy updated HL policy, collect more corrections, iterate
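The loop above can be sketched in Python with a deliberately tiny model: the high-level policy is a lookup table from observation to subgoal, and "fine-tuning" just overwrites entries with the human's corrections. Everything here (`hl_dagger`, `deploy`, the string observations) is a toy assumption, and the frozen low-level policy is hidden inside `deploy`.

```python
from typing import Callable, Dict, List, Optional, Tuple

Obs = str
Goal = str


def hl_dagger(
    hl_table: Dict[Obs, Goal],
    deploy: Callable[[Dict[Obs, Goal]], List[Obs]],
    human_correction: Callable[[Obs, Goal], Optional[Goal]],
    rounds: int,
) -> Dict[Obs, Goal]:
    """Toy high-level DAgger on a lookup-table policy.

    `deploy` stands in for running HL + frozen LL on the robot and returns
    the observations at which the high-level policy was queried.
    """
    dataset: List[Tuple[Obs, Goal]] = []
    for _ in range(rounds):
        for o in deploy(hl_table):
            predicted = hl_table.get(o, "wait")
            correction = human_correction(o, predicted)   # None: no objection
            if correction is not None:
                dataset.append((o, correction))           # correction is the label
        for o, g in dataset:                              # "fine-tune" = overwrite
            hl_table[o] = g
    return hl_table


expert = {"bag on table": "pick up the bag",
          "bag in hand": "pick up the metal scoop"}
initial = {"bag on table": "pick up the bag",
           "bag in hand": "pour into the bag"}            # one wrong entry


def visited(_table: Dict[Obs, Goal]) -> List[Obs]:
    return ["bag on table", "bag in hand"]                # fake deployment


def human(o: Obs, g: Goal) -> Optional[Goal]:
    return None if expert[o] == g else expert[o]


trained = hl_dagger(dict(initial), visited, human, rounds=2)
print(trained == expert)
```

The human only ever supplies strings, never motor commands, which is the whole appeal of doing DAgger at the high level.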

Why This Is Elegant

In standard DAgger for flat policies, the human must provide motor-level corrections — physically teleoperating the robot. That's expensive and requires hardware access. With hierarchical DAgger, the human only provides language corrections — literally yelling at the robot. This is cheap, fast, and doesn't require any special equipment. You can even do it from another room.

After DAgger fine-tuning, the high-level policy can self-correct: if it starts to make a mistake (e.g., reaching for the wrong item), it has learned from past corrections to re-route to the right action without any human intervention.

Does It Work?

Two key empirical results from the Hi Robot and Yell At Your Robot papers:

1. Hierarchy beats flat. On long-horizon tasks, hierarchical VLA policies outperform flat VLA policies by 19-34% on average success rate. The gap grows with task horizon — the longer the task, the more hierarchy helps.

2. HL DAgger beats vanilla imitation. After high-level DAgger fine-tuning, task performance jumps by ~20% compared to the base hierarchical policy. The DAgger-trained policy approaches the performance of having a human oracle correct every mistake in real time.

Real-World Result — π0.5 (Physical Intelligence, 2025)

The same principles at industry scale. π0.5 uses a pre-trained VLA as the high-level policy that outputs subtask language labels ("pick up the pillow"), and a small diffusion action expert (300M parameters) as the low-level policy that converts subtask labels into continuous motor actions. The system handles household tasks with horizons of 100+ steps (cleaning bedrooms, organizing laundry) that no flat policy has achieved reliably.

Chapter 10

Hierarchical Imitation with Image Goals

What if you don't have language annotations for your demonstration data? SuSIE (Black, Nakamoto, Atreya, Walke, Finn, Kumar, Levine, ICLR 2024) replaces language goals with image goals, opening the door to learning from unlabeled video.

The Architecture

Instead of outputting a language string, the high-level policy is an image editing model. Given the current image, it produces a goal image showing what the scene should look like after completing the next subtask — for example, showing the bowl moved to a new position.

SuSIE Architecture
g_t = ImageEdit(o_t, task) ← "what should the world look like next?"
a_t = π_LL(o_t, g_t) ← "how do I get there?"

The low-level policy is a goal-image-conditioned policy: given the current observation and a goal image, produce actions that transform the scene from the current state toward the goal state. No language is involved anywhere in the pipeline.

Training Data: No Language Required

The key benefit: you don't need segmented demonstrations with language labels. The high-level image editing model can be trained on unlabeled videos — just pairs of (current frame, future frame). These can come from robot data, or even from videos of humans performing tasks.
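Building those (current frame, future frame) pairs takes only a few lines. In this Python sketch the fixed `offset` is an assumption made for brevity, since the text does not say how far ahead the future frame is sampled, and integers stand in for real image frames.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def future_frame_pairs(video: Sequence[Frame],
                       offset: int) -> List[Tuple[Frame, Frame]]:
    """All (frame_t, frame_{t+offset}) pairs from one unlabeled video."""
    if offset <= 0:
        raise ValueError("offset must be positive")
    return [(video[t], video[t + offset])
            for t in range(len(video) - offset)]


video = list(range(10))                 # frame indices as stand-ins for images
pairs = future_frame_pairs(video, offset=3)
print(len(pairs), pairs[0], pairs[-1])
```

No labels, actions, or robot state appear anywhere in this function, which is why the same pipeline works on human videos.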

Human Videos as Training Data

Since the high-level policy only needs to understand "what the world should look like next" (not how to move robot joints), it can learn from human demonstration videos. A human opening a drawer looks different from a robot opening a drawer at the motor level, but the goal image — "drawer is now open" — is the same regardless of who performed the action.

Dataset	Scene B (Avg.)	Scene C (Avg.)
BridgeData (robot) only	0.30	0.80
BridgeData + Something-Something (human videos)	0.50	0.88

Adding human video data to the high-level policy training improved average success rates by 8 to 20 percentage points (0.30 → 0.50 in Scene B, 0.80 → 0.88 in Scene C). The human videos teach the model about object affordances and common manipulation sequences, even though they contain no robot-specific information.

The Hierarchy Insight for Data

Hierarchy enables asymmetric data requirements. The high-level policy (goal imagination) can train on abundant, cheap data (internet videos, human demos). The low-level policy (motor execution) needs expensive robot-specific data, but for much simpler tasks (reach this goal state). You're matching each level to its natural data source.

Chapter 11

Hierarchical Reinforcement Learning

Everything so far has used imitation learning — learning from demonstrations. Can we apply hierarchy to RL, where we learn from reward signals instead? Yes, but the design choices are different.

State-Reaching Goals: HIRO

HIRO (Nachum, Gu, Lee, Levine, NeurIPS 2018) uses state vectors as goals. The goal g_t specifies a target relative position — e.g., "move 2 meters forward and 1 meter left."

System

HIRO — HIerarchical Reinforcement learning with Off-policy correction

Low-level: Goal-conditioned policy trained with a simple goal-reaching reward: r_LL = -||current_state - goal_state||. No task-specific reward needed for the low-level.

High-level: Trained to output goal states using the task reward. HL actions are goal vectors. Trained with off-policy RL (like TD3 or SAC).
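The low-level reward is simple enough to write in one line. This Python sketch follows the formula above literally and assumes a Euclidean norm and absolute goal states; the function name is illustrative.

```python
import math
from typing import Sequence


def low_level_reward(state: Sequence[float],
                     goal_state: Sequence[float]) -> float:
    """r_LL = -||current_state - goal_state|| (Euclidean norm)."""
    return -math.dist(state, goal_state)


print(low_level_reward((0.0, 0.0), (3.0, 4.0)))   # -5.0
print(low_level_reward((3.0, 4.0), (3.0, 4.0)))   # -0.0, the best possible
```

Because this reward depends only on the state and the goal, the low-level policy can be trained without ever seeing the task reward.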

Hindsight Relabeling for the High Level

Here's the clever part. The high-level policy outputs a goal g_t, saying "go to position X." The low-level policy tries but ends up at position Y instead (because it's imperfect). In standard RL, this is just a failure. But with hindsight relabeling, we can pretend the high-level policy intended position Y all along. We relabel the high-level action g_t with the state the agent actually reached, and store this relabeled transition in the replay buffer.

Hindsight Relabeling
Original: (o_t, g_t, reward, o_{t+n}) ← HL intended g_t, agent reached s_{t+n}
Relabeled: (o_t, s_{t+n}, reward, o_{t+n}) ← pretend HL intended s_{t+n}
The reward stays the same; only the goal recorded as "intended" changes.

This is crucial for off-policy learning. Without relabeling, the high-level transitions in the replay buffer would be useless because the low-level policy has changed since they were collected. With relabeling, we can reuse old experience even as the low-level policy improves.
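Here is a minimal Python sketch of that relabeling step on a replay buffer of high-level transitions. It follows the simplified description in this section; the HIRO paper's off-policy correction chooses the relabeled goal among several candidates, which is not modeled here. `HLTransition` and `relabel` are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple

State = Tuple[float, ...]


@dataclass(frozen=True)
class HLTransition:
    obs: State          # o_t
    goal: State         # g_t chosen by the high-level policy
    reward: float       # task reward collected over the n low-level steps
    next_obs: State     # o_{t+n}


def relabel(tr: HLTransition, reached: State) -> HLTransition:
    """Replace the intended goal with the state actually reached."""
    return replace(tr, goal=reached)


def relabel_buffer(buffer: Sequence[HLTransition],
                   reached_states: Sequence[State]) -> List[HLTransition]:
    return [relabel(tr, s) for tr, s in zip(buffer, reached_states)]


tr = HLTransition(obs=(0.0, 0.0), goal=(2.0, 1.0),
                  reward=-3.0, next_obs=(1.5, 0.5))
new = relabel(tr, reached=(1.5, 0.5))
print(new.goal, new.reward)   # goal changes, reward is untouched
```

Only the `goal` field is rewritten. The observation, reward, and next observation are real experience and stay exactly as collected.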

Language Goals for HRL

Jiang, Gu, Murphy, and Finn (NeurIPS 2019) extended this idea to language goals. The high-level policy outputs language commands, and the low-level policy is language-conditioned. They use hindsight language relabeling: after a trajectory, they describe what the agent actually did in language and relabel the high-level action with that description.

Skill Discovery Without Supervision

A fascinating research direction: can you discover a diverse set of useful skills without any task-specific supervision?

DIAYN (Eysenbach, Gupta, Ibarz, Levine, ICLR 2019) answers yes. The idea: train a set of skills (low-level policies) to be as diverse as possible — each skill should produce different behavior, and you should be able to tell which skill was used by observing the outcome.

DIAYN Objective
max I(S; Z) + H[A | S] − I(A; Z | S)
Maximize the mutual information between skill z and the states visited, keep the policy's actions high-entropy, and minimize the information that actions carry about the skill beyond what the state already reveals.
This means: skills should be distinguishable from the states they reach rather than from individual actions, and each skill should otherwise act as randomly as possible.

The result: without any reward function, the agent learns a library of diverse skills — different locomotion gaits, turning behaviors, jumping patterns. These can then be composed by a high-level policy for downstream tasks.
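In practice, the diversity objective is optimized through a per-step pseudo-reward, r = log q(z | s) − log p(z), where q is a learned discriminator that guesses the skill from the state and p(z) is the prior over skills. The Python sketch below assumes a uniform prior and takes the discriminator's output probabilities as given; the function name is illustrative.

```python
import math
from typing import Sequence


def diayn_reward(disc_probs: Sequence[float], skill: int,
                 num_skills: int) -> float:
    """Pseudo-reward log q(z | s) - log p(z) with a uniform prior p(z)."""
    return math.log(disc_probs[skill]) - math.log(1.0 / num_skills)


# Discriminator cannot tell skills apart: zero reward.
print(diayn_reward([0.25, 0.25, 0.25, 0.25], skill=0, num_skills=4))
# State clearly identifies skill 0: positive reward for skill 0.
print(diayn_reward([0.97, 0.01, 0.01, 0.01], skill=0, num_skills=4))
```

A skill earns reward exactly when the states it visits give it away, so the skills are pushed toward different regions of the state space without any task reward.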

Open Research Direction

Fine-tuning large-scale hierarchical robot learning systems with RL is an open and important research direction. Most current industrial systems (PI's π0.5, NVIDIA GR00T N1, Figure Helix, Google Gemini Robotics) use hierarchical imitation learning. Adding RL fine-tuning on top could improve performance further, but the combination of hierarchy + large models + RL has not yet been cracked at scale.

[Interactive figure: Hierarchical RL with Hindsight Relabeling, with an adjustable low-level skill level (default 0.6)]

Chapter 12

Summary & Cheat Sheet

The Core Framework

Hierarchical Policy
g_t ∼ π_HL(· | o_t, task) ← slow, strategic, runs at ~2-5 Hz
a_t ∼ π_LL(· | o_t, g_t) ← fast, reactive, runs at ~50 Hz

Decision Cheat Sheet

Design Choice	Options	Recommendation
Goal representation	Language, image, state, latent	Language if you have annotations; image if unlabeled video available; state for simple RL
Training paradigm	IL, RL, IL + RL fine-tuning	IL is proven at scale; RL for HL fine-tuning is promising but open
Supervision	Joint, separate, pre-trained HL	Pre-trained LLM/VLM as HL + train LL, or separate then adapt
Transitions	Fixed interval, completion-based	Fixed interval is simpler and more robust; completion risks getting stuck
HL improvement	More data, DAgger, RL	HL DAgger with language corrections is cheap and effective

Key Papers

Paper	Contribution	Goal Type	IL/RL
Yell At Your Robot (RSS 2024)	HL DAgger with language corrections	Language	IL
Hi Robot (ICML 2025)	Hierarchical VLA at scale	Language	IL
π0.5 (arXiv 2025)	Industry-scale hierarchical VLA + action expert	Language	IL
SuSIE (ICLR 2024)	Image editing as HL policy	Image	IL
HIRO (NeurIPS 2018)	Off-policy HRL with hindsight relabeling	State	RL
Language Abstraction (NeurIPS 2019)	Language goals for HRL	Language	RL
DIAYN (ICLR 2019)	Unsupervised skill discovery	Latent	RL

The Big Picture

Where This Fits

Hierarchy is one of three main approaches to long-horizon robot learning covered in CS 224R:

1. Hierarchy (this lecture) — decompose into HL + LL policies.

2. Multi-task & meta-learning (previous lectures) — learn shared representations across tasks.

3. RL fine-tuning of foundation models (next lectures) — start from VLAs, improve with RL + sim-to-real.

These approaches are complementary, not competing. State-of-the-art systems like π0.5 combine all three: hierarchical structure, multi-task pre-training, and (potentially) RL fine-tuning.
