Deep RL · Lecture 15 · Hierarchical Policies

Hierarchy in Imitation & Reinforcement Learning

How to decompose impossibly long tasks into manageable subtasks. From design choices to state-of-the-art robot systems.

Long-horizon tasks · HL / LL policies · Language & image subgoals · DAgger for hierarchy · HIRO & skill discovery
Roadmap

What You'll Master

Chapter 01

The Long-Horizon Problem

Think about what "cook focaccia" actually requires. You need to gather flour, yeast, olive oil, salt. Mix them. Knead the dough. Let it rise for an hour. Dimple it with your fingers. Drizzle more oil. Bake at 425°F for 20 minutes. Each of those steps is itself a multi-step procedure. A single flat policy would need to learn the entire sequence, from opening the flour bag to pulling the bread from the oven, as one monolithic behavior.

This isn't unique to cooking. Consider these long-horizon tasks:

Fix a bug that's causing neural network training loss to explode — requires reading logs, forming hypotheses, editing code, re-running training, interpreting results.

Drive to Yosemite — navigate out of the parking lot, merge onto the highway, handle 3 hours of varied road conditions, find the campsite.

Give feedback on an 8-page report — read each section, note issues, check citations, write coherent comments, re-read for consistency.

Why Are These Hard?

Three compounding difficulties make long-horizon tasks brutal for any learning algorithm:

1. Enormous state space. A 100-step task visits far more distinct states than a 10-step task. The number of possible trajectories grows exponentially with horizon length. If you have |A| possible actions per step and a horizon of T, the space of possible trajectories is |A|^T. Even with just 4 actions and a horizon of 50, that's 4^50 ≈ 10^30 possible trajectories.

Trajectory Space: |trajectory space| = |A|^T
With 4 actions and T = 50: 4^50 ≈ 1.27 × 10^30 trajectories

2. Compounding errors. At each timestep, the policy has some probability ε of making a mistake. Over T steps, the probability of a perfect rollout is (1 - ε)^T. For ε = 0.05 and T = 100, that's 0.95^100 ≈ 0.006: less than a 1% chance of success even with a 95%-accurate policy.

Compounding Errors: P(perfect trajectory) = (1 - ε)^T
ε = 0.05, T = 100: 0.95^100 ≈ 0.0059, about 0.6% success

3. Getting stuck. Long tasks create many opportunities for the agent to enter irrecoverable states. A robot that drops a key ingredient halfway through cooking has no path to success. In RL terms, many states become "absorbing failure states" from which no sequence of actions can reach the goal.

The Core Challenge

Long-horizon tasks are hard not because any individual step is hard, but because the number of steps compounds every source of error. A policy that's excellent at each individual action can still fail catastrophically over 100 steps. We need a way to structure the problem so we don't require perfection over the entire horizon.
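
The two formulas from this chapter are easy to check numerically. The sketch below is illustrative only: the function names are made up for this example, and it assumes, as the formula does, that per-step mistakes are independent. The last helper inverts the compounding-error formula to ask how long a task can be before success drops below a chosen threshold.

```python
import math


def num_trajectories(num_actions: int, horizon: int) -> int:
    """Number of distinct action sequences: |A|^T."""
    return num_actions ** horizon


def perfect_rollout_probability(eps: float, horizon: int) -> float:
    """Chance of zero mistakes over `horizon` steps, assuming independent errors."""
    return (1.0 - eps) ** horizon


def longest_horizon(eps: float, target: float) -> int:
    """Largest T for which (1 - eps)^T is still at least `target`."""
    return math.floor(math.log(target) / math.log(1.0 - eps))


print(f"{num_trajectories(4, 50):.2e}")                 # 1.27e+30
print(f"{perfect_rollout_probability(0.05, 100):.4f}")  # 0.0059
print(longest_horizon(0.05, 0.5))                       # 13
```

With a 5% per-step error rate, a coin-flip chance of a perfect rollout is already lost after 13 steps, which is why shortening the effective horizon of each policy matters.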

Interactive — Compounding Error Probability
Chapter 02

The Main Idea: Two Policies

Here's the key insight: "bake a cheesecake" is impossibly hard as one monolithic task. But "buy ingredients" is manageable. "Go to the store" is easier still. "Walk to the door" is trivial. "Take a step" is reflexive. We naturally decompose complex tasks into a hierarchy of subtasks, each at a different level of abstraction.

The hierarchical policy framework makes this decomposition explicit with two (or more) policies:

Definition
High-Level Policy: π_HL

Takes the current observation o_t and the overall task description (e.g., "bake a cheesecake"), and outputs an intermediate goal g_t. This goal is a subcommand like "preheat the oven" or "mix the batter." The high-level policy runs at a low frequency: it only decides on new goals occasionally.

Definition
Low-Level Policy: π_LL

Takes the current observation o_t and the current goal g_t (from the high-level policy), and outputs primitive actions a_t (e.g., joint torques, motor commands). The low-level policy runs at a high frequency: it produces actions every timestep.

Hierarchical Policy Structure:
g_t ∼ π_HL(· | o_t, task)  ← slow, strategic
a_t ∼ π_LL(· | o_t, g_t)  ← fast, reactive

The variable g_t goes by many names in the literature: subgoals, subtasks, skills, options, or high-level actions. They all refer to the same thing: an intermediate target that the low-level policy tries to accomplish.

Key Insight

You can have more than two levels of hierarchy. A three-level system might have a strategic planner ("make dinner"), a tactical planner ("prepare the pasta sauce"), and a motor controller ("rotate wrist 15 degrees"). But two levels are by far the most common in practice, and the principles generalize.

Interactive — Hierarchical Policy Architecture
Chapter 03

Rolling Out a Hierarchy

How does a hierarchical policy actually execute? Let's trace through the rollout procedure step by step. This is the algorithm that converts two trained policies into actual behavior.

Algorithm: Hierarchical Policy Rollout
  1. Initialize: Observe initial observation o_1 at t = 1
  2. Plan: Query high-level policy for a goal: g_t ∼ π_HL(· | o_t, task prompt)
  3. Execute: Until a new goal is selected (see Chapter 8 for when):
    1. Execute action: a_t ∼ π_LL(· | o_t, g_t)
    2. Step environment: t ← t + 1, observe new o_t
  4. Replan: When the goal is "done" (or after n steps), go to step 2
  5. Terminate: When the overall task is complete or time runs out
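
The rollout procedure above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function and parameter names are hypothetical, both policies are deterministic callables rather than sampled distributions, replanning uses a fixed interval, and the "environment" is a toy 1-D walk invented for the example.

```python
from typing import Callable, List, Optional, Tuple


def hierarchical_rollout(
    obs: int,
    high_level: Callable[[int], int],
    low_level: Callable[[int, int], int],
    env_step: Callable[[int, int], int],
    task_done: Callable[[int], bool],
    replan_every: int,
    max_steps: int,
) -> Tuple[int, List[int], int]:
    """Run the two-timescale loop; return (final obs, goals issued, steps taken)."""
    goals: List[int] = []
    goal: Optional[int] = None
    t = 0
    while t < max_steps and not task_done(obs):
        if t % replan_every == 0:          # steps 2 and 4: (re)plan
            goal = high_level(obs)
            goals.append(goal)
        action = low_level(obs, goal)      # step 3.1: act toward the goal
        obs = env_step(obs, action)        # step 3.2: advance the environment
        t += 1
    return obs, goals, t


# Toy 1-D task (an assumption for illustration): walk from position 0 to 50.
TARGET = 50
final_obs, goals, steps = hierarchical_rollout(
    obs=0,
    high_level=lambda o: min(o + 10, TARGET),   # subgoal: 10 units ahead
    low_level=lambda o, g: (g > o) - (g < o),   # step -1, 0, or +1 toward g
    env_step=lambda o, a: o + a,
    task_done=lambda o: o == TARGET,
    replan_every=10,
    max_steps=200,
)
print(final_obs, goals, steps)  # 50 [10, 20, 30, 40, 50] 50
```

The high-level callable runs 5 times while the low-level callable runs 50 times, which is the two-timescale structure in miniature.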

Notice the two timescales. The high-level policy might produce a new goal every 20 timesteps, while the low-level policy acts every single timestep. If the robot runs at 50 Hz control, the high-level policy might replan at 2.5 Hz — every 400ms it decides "what should we be doing now?", while the low-level policy handles the moment-to-moment motor commands at full speed.

A Concrete Example

Imagine a robot making a sandwich. Here's what the rollout looks like:

Time | π_HL output (g_t) | π_LL output (a_t)
t=1 | "pick up bread" | move arm to bread, close gripper
t=20 | (same goal, still executing) | lift arm, move to plate
t=35 | "spread peanut butter" | pick up knife, scoop, spread motion
t=70 | "place second slice" | reach for bread, position, release

The high-level policy only makes 3 decisions. The low-level policy makes 70+ decisions. Each low-level decision is simple (move arm slightly), while each high-level decision is strategic (what to do next).

Data Flow

Input to π_HL: RGB images from workspace cameras (e.g., 3×256×256), task string "make sandwich" → Output: language string or latent vector representing "pick up bread"

Input to π_LL: RGB images + joint positions (e.g., 7-DOF arm state) + goal "pick up bread" → Output: 7 joint target positions (or velocities) at 50 Hz

Chapter 04

Why Hierarchy Helps

Why not just train a single flat policy to go directly from observations to actions? Hierarchy provides four concrete advantages:

1. Supervision Signal Decomposition

A flat policy for "bake a cheesecake" gets a single reward at the very end: did the cheesecake turn out well? That's an incredibly sparse signal over hundreds of actions. With hierarchy, the low-level policy gets much denser feedback: did you successfully "pick up the bowl"? Did you "crack the egg"? Each subtask provides its own supervision signal, making learning tractable.

2. Knowledge Sharing Across Subtasks

The skill "pick up an object" appears in hundreds of tasks: cooking, cleaning, organizing, assembling. With a flat policy, the robot learns picking-up from scratch for each task. With hierarchy, the low-level policy learns "pick up" once and the high-level policy reuses it across many different tasks. This is the compositional benefit of hierarchy.

Compositionality

If you have K subtask skills and N tasks, a flat approach needs to learn all N tasks separately. A hierarchical approach needs K skills + N high-level plans. When K ≪ N (many tasks reuse the same skills), hierarchy wins dramatically, because each high-level plan only sequences existing skills and is far cheaper to learn than a full motor policy.

3. Structured Exploration (RL)

In reinforcement learning, exploration is critical. A flat policy explores in action space — randomly wiggling each joint. A hierarchical policy explores in goal space — "what if I tried going to that location?" or "what if I picked up the blue object instead?" Goal-space exploration is far more structured and covers meaningful variations much faster than random motor noise.

4. Practical Latency

Running a large VLM (vision-language model) takes time — maybe 200ms per inference. If you need actions at 50 Hz (20ms), you can't run the VLM every timestep. Hierarchy solves this naturally: the expensive VLM runs as the high-level policy at 5 Hz, while a fast, small neural network runs as the low-level policy at 50 Hz.
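
The arithmetic behind this argument is simple enough to write down. The helper names below are hypothetical, and the sketch assumes the high-level model can only start a new inference once the previous one has finished.

```python
import math


def min_replan_interval(hl_latency_ms: int, control_hz: int) -> int:
    """Fewest control steps that elapse while one high-level inference runs."""
    return math.ceil(hl_latency_ms * control_hz / 1000)


def max_hl_rate_hz(hl_latency_ms: int, control_hz: int) -> float:
    """Highest high-level query rate that stays aligned to whole control steps."""
    return control_hz / min_replan_interval(hl_latency_ms, control_hz)


print(min_replan_interval(200, 50), max_hl_rate_hz(200, 50))  # 10 5.0
```

A 200ms model paired with a 50 Hz controller can replan at most every 10 control steps, which is the 5 Hz figure quoted above.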

Latency Example — Physical Intelligence π0.5

π_HL: Pre-trained VLA (vision-language-action model), runs infrequently. Outputs subtask predictions like "pick up the pillow." Inference: ~300ms.

π_LL: Diffusion action expert (300M params), runs at full control frequency. Takes subtask + observation → continuous joint actions. Inference: ~5ms.

The high-level VLA decides what to do. The lightweight low-level model handles how to do it, fast enough for real-time robot control.

Chapter 05

Hierarchy vs. Alternatives

Hierarchy isn't the only way to handle long-horizon tasks. Let's compare three approaches head-to-head.

Option A: Flat Policy

Flat Policy: a_t ∼ π(· | o_t, task)
One policy does everything: observe → act. Simple but struggles with long horizons.

A single neural network maps observations directly to actions. No intermediate representation. This is the simplest approach, and it works well for short tasks (pick up an object, push a button). But for tasks requiring 50+ steps with diverse subtasks, flat policies struggle with the compounding error and sparse reward problems we discussed.

Option B: Chain-of-Thought Policy

Chain-of-Thought Policy: (g_t, a_t) ∼ π(· | o_t, task)
Single policy outputs BOTH the subgoal reasoning and the action. One network, two heads.

A single policy outputs both a subgoal (as a "thought") and an action. Think of this as the policy narrating its reasoning: "I should pick up the bread [thought] → move arm right 5cm [action]." This can benefit from the same structured supervision as hierarchy, and it's simpler to implement since there's only one model.

The Comparison

Property | Hierarchy | Flat | Chain-of-Thought
Number of models | 2 (HL + LL) | 1 | 1
Benefits from subtask labels | Yes | No | Yes
Different frequencies | Yes (key advantage) | N/A | Possible but awkward
50 Hz real-time control | Yes (small LL) | Only if fast | Expensive at every step
Modularity | High (swap HL/LL independently) | None | Low

No Conclusive Winner (Yet)

There is no conclusive empirical comparison between hierarchy and chain-of-thought for robotics. Chain-of-thought may be too computationally expensive for 50 Hz control loops. But for slower domains (planning, coding, document editing), chain-of-thought may be sufficient. This is an active research question. The latency argument is the strongest practical case for hierarchy in robotics.

Chapter 06

Goal Representations

The high-level policy outputs some goal g_t. But what is g_t concretely? A string? A vector? An image? This is one of the most important design decisions in hierarchical systems, and the best answer depends on the domain.

Language Goals

g_t is a natural language string: "pick up the bowl", "move arm to the left", "open the drawer." This is the most intuitive representation. Humans naturally describe subgoals in language, and we can collect language-annotated demonstrations easily. The high-level policy is a VLM that outputs text; the low-level policy is a language-conditioned behavior cloning (LCBC) policy.

Pros: Interpretable, easy to debug ("why did the robot fail? it chose 'pour into bag' when it should have chosen 'position bag under nozzle'"), easy to provide human corrections.

Cons: Requires segmented demonstrations with language annotations. Language can be ambiguous ("move it over there").

Image Goals

g_t is a target image showing what the world should look like after completing the subtask. The high-level policy is an image editing model: given the current image, it produces a modified image showing the desired next state. The low-level policy is a goal-image-conditioned policy.

Pros: No need for language annotations — can learn from unlabeled video. Can leverage human demonstration videos. Captures spatial details that language misses ("move the bowl 3 inches left" is hard to say precisely but easy to show in an image).

Cons: Harder to interpret. Image generation can introduce artifacts. The low-level policy needs to be robust to imperfect goal images.

State Goals

g_t is a target state vector: a desired (x, y, z) position of the robot's end-effector, or a desired relative position between the agent and objects. This is the most compact representation.

Pros: Very precise. Easy to measure completion (did the agent reach the target state?). Works naturally with hindsight relabeling in RL.

Cons: Limited expressiveness — "approach from the left" vs. "approach from the right" may have the same target position but very different required trajectories. Only works when you can define a meaningful state space.

Properties of Good Goal Representations

Three Properties

1. Expressive — can communicate many different low-level behaviors. If the goal space is too small, the high-level policy can't express important distinctions.

2. Structured — similar behaviors should have similar goal representations. "Pick up red bowl" and "pick up blue bowl" should be close in goal space, not arbitrary distant points.

3. Appropriate abstraction level — not so high-level that the low-level policy can't figure out what to do ("make it work"), and not so low-level that the high-level policy has to micromanage ("rotate joint 3 by 0.02 radians").

Goal Type | Expressiveness | Structure | Abstraction | Best For
Language | High | Inherited from LLMs | Natural | Manipulation with supervision
Image | Very high | Pixel space | Flexible | When unlabeled video available
State vector | Moderate | Euclidean | Low-level | Navigation, simple RL tasks
Latent vector | Depends on training | Learned | Learned | When no natural goal exists

Chapter 07

Supervising Each Level

You have two policies. Both need training. But each depends on the other, creating a chicken-and-egg problem.

The Low-Level Policy

The low-level policy is trained to accomplish a goal g — not the original long-horizon task. Its loss function looks like imitation learning (for IL) or goal-conditioned RL (for RL), conditioned on g:

Low-Level Objective (Imitation Learning): L_LL(θ_LL) = E_{(o, a, g) ∼ D} [ -log π_LL(a | o, g; θ_LL) ]
Standard behavior cloning loss, but conditioned on the goal g from segmented demos.
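
To make the objective concrete, here is a small sketch of the goal-conditioned behavior cloning loss. It assumes a discrete action space with a softmax policy purely to keep the example short (the robot systems discussed in this lecture output continuous joint targets), and the function names are illustrative rather than taken from any library.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities from unnormalized scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def bc_loss(batch_logits: Sequence[Sequence[float]], expert_actions: Sequence[int]) -> float:
    """Mean of -log pi_LL(a | o, g) over a batch of demonstration samples.

    batch_logits[i] holds the policy's scores over actions for sample i's
    (observation, goal) pair; expert_actions[i] is the demonstrated action index.
    """
    nll = [-log_softmax(logits)[a] for logits, a in zip(batch_logits, expert_actions)]
    return sum(nll) / len(nll)


uniform = bc_loss([[0.0, 0.0, 0.0, 0.0]], [2])     # policy has no preference
confident = bc_loss([[0.0, 0.0, 5.0, 0.0]], [2])   # policy favors the expert action
print(uniform > confident)  # True
```

The only difference from flat behavior cloning is what the logits are conditioned on: the network that produces them sees the goal g alongside the observation o.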

Critical question: For which distribution of goals should we train the low-level policy? Ideally, whatever goals the high-level policy will actually output. But we haven't trained the high-level policy yet, so we don't know what goals it will produce. In imitation learning, we can use the goals from the demonstration data. In RL, we often train on a broad distribution of goals (uniformly sampled from a goal space, or using hindsight relabeling).

The High-Level Policy

The high-level policy is trained to accomplish the original long-horizon task. Its performance depends on the low-level policy actually being able to execute the goals it outputs.

High-Level Objective (Imitation Learning): L_HL(θ_HL) = E_{(o, g*) ∼ D} [ -log π_HL(g* | o; θ_HL) ]
Predict the correct subgoal g* given observation o. g* comes from demonstration labels.

Critical question: With which low-level policy should we evaluate the high-level policy? Ideally with the learned low-level policy. But we haven't trained it yet, or it's still imperfect.

The Chicken-and-Egg Resolution

Key Resolution

The two policies can be trained separately first — LL on goal-reaching, HL on goal-prediction — then at least one must be adapted to the deficiencies of the other. If the LL policy is bad at certain goals, the HL policy should learn to avoid outputting those goals (or vice versa). Ideally, both are fine-tuned jointly, but single-sided adaptation also works.

A natural insight: LLMs are often good high-level policies! They already understand task decomposition from pre-training on internet text. A frozen LLM can output reasonable subgoal sequences for many tasks. You then only need to train the low-level policy to follow those subgoals.

Why Not End-to-End with Latent Goals?

You might think: why not let the goal representation be latent (a learned vector) and train everything end-to-end? The answer: the result is a flat policy. If g_t has no structure and both policies are trained jointly through g_t, gradient descent will simply learn to pass whatever information is useful through g_t, which is exactly what a flat policy with an internal hidden state does. The benefits of hierarchy come from giving g_t meaning (language, images, states) and training the levels with separate objectives.

Research Wisdom

"Think carefully about where the benefits are coming from!" If you can't articulate why your hierarchical design is better than a flat policy with the same capacity, it probably isn't. The benefits come from: (1) meaningful goal representations that enable transfer, (2) separate supervision signals at each level, (3) different operating frequencies, or (4) pre-trained components (like LLMs as high-level policies).

Chapter 08

Subgoal Transitions

We've been saying "until a new goal is selected" without specifying when the high-level policy gets re-queried. This turns out to be a critical design decision with no universally right answer.

Option 1: Completion-Based Transitions

Re-query π_HL when the low-level policy has completed the current goal g_t. This requires estimating progress toward the goal, for example with a learned classifier that predicts "has the subtask been achieved?"

Advantage: Ideal in principle — each goal runs exactly as long as it needs to.

Problems: Hard to estimate when a subtask is "done." If the completion estimator is wrong, the agent gets stuck: it thinks it hasn't finished, so it never moves on. Also: what if the agent makes a mistake that requires undoing a previous goal? The completion-based approach can't easily handle backtracking.

Fatal Errors

Errors in estimating completion are more damaging than errors in subgoal prediction. If the high-level policy picks a slightly wrong subgoal, the low-level policy might still do something reasonable. But if the transition detector never fires, the agent repeats the same subtask forever.

Option 2: Fixed-Interval Transitions

Re-query π_HL every n timesteps, regardless of progress. For example, replan every 20 steps (400ms at 50 Hz control).

Advantage: Simple. No completion estimation needed. Robust to getting stuck — even if the current subtask is going badly, the agent will replan in n steps.

Problems: If the high-level policy picks a wrong subtask, the low-level policy executes bad actions for n full steps before any correction. Small n means more compute spent on the (expensive) high-level policy. Large n means slower recovery from errors. There's a tradeoff between compute cost (more HL queries) and delay (longer before correction).
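
The two transition rules can be written as small predicates. This is a sketch under stated assumptions: the names are hypothetical, and the completion check uses a distance threshold on a state goal as a stand-in for the learned completion classifier described above.

```python
import math
from typing import Sequence


def replan_fixed_interval(steps_since_replan: int, n: int) -> bool:
    """Option 2: re-query the high level after every n low-level steps."""
    return steps_since_replan >= n


def replan_on_completion(state: Sequence[float], goal: Sequence[float], tol: float) -> bool:
    """Option 1: re-query once the goal looks achieved (here: within `tol` of a state goal)."""
    return math.dist(state, goal) <= tol


# A low-level policy that stalls 0.5 units short of its goal.
stuck_state, goal = (1.0, 0.5), (1.0, 1.0)
print(replan_on_completion(stuck_state, goal, tol=0.1))    # False, at every step
print(replan_fixed_interval(steps_since_replan=20, n=20))  # True, after 20 steps
```

In the stalled case the completion rule never fires, while the fixed-interval rule hands control back to the high-level policy after n steps regardless of progress.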

Interactive — Subgoal Transition Timing

Property | Completion-Based | Fixed-Interval
Implementation complexity | High (need classifier) | Low (just count steps)
Failure mode | Agent gets stuck forever | n steps of wrong actions
Failure severity | Fatal (no recovery) | Recoverable at next replan
Compute cost | Variable (adaptive) | Predictable (every n steps)
Used in practice | Less common | More common (e.g., π0.5)

Chapter 09

Hierarchical Imitation with Language Goals

Let's ground these abstractions in a real system. "Yell At Your Robot" (Shi, Hu, Zhao et al., RSS 2024) and "Hi Robot" (Shi et al., ICML 2025) are hierarchical imitation learning systems that use language as the goal representation.

The Data

Training data consists of segmented demonstrations with language annotations. Each demonstration is a video of a robot performing a task, manually segmented into subtask segments, each labeled with a language description:

Segment | Language Label | Robot Data
1 | "pick up the bag" | Images + joint positions for 30 timesteps
2 | "pick up the metal scoop" | Images + joint positions for 25 timesteps
3 | "scoop M&Ms" | Images + joint positions for 40 timesteps
4 | "pour into the bag" | Images + joint positions for 35 timesteps

The Architecture

The system uses an ALOHA bimanual robot workcell with multiple cameras. The two policies are:

System Component
High-Level Language Policy

Input: RGB images from workspace cameras (current observation)

Output: Language string (the next subtask to execute)

Architecture: Vision-Language Model — processes images through a vision encoder, concatenates with previous language context, produces next language command via autoregressive decoding.

System Component
Low-Level LCBC Policy

Input: RGB images + joint positions + language goal string

Output: Joint target positions (7-DOF per arm × 2 arms = 14 values)

Architecture: Language-Conditioned Behavior Cloning (LCBC) — a policy that takes a language instruction and produces motor actions via a Transformer or diffusion policy.

Training: High-Level DAgger

Both policies are initially trained with supervised learning on the segmented demonstrations. But can we improve the high-level policy after deployment? Yes — with DAgger applied at the high level.

Recall that DAgger collects on-policy data by running the learned policy, then getting expert corrections. For hierarchical systems, the expert corrections are language commands. A human watches the robot and says "no, pick up the scoop first" when the high-level policy makes a wrong decision.

Algorithm: High-Level DAgger for Hierarchical IL
  1. Deploy: Run the hierarchical policy (HL + LL) on the robot
  2. Intervene: Human provides language corrections when HL makes wrong decisions. Language corrections override the high-level policy's prediction — the LL policy receives the human's command instead
  3. Freeze LL: Keep the low-level LCBC policy frozen (no updates to motor control)
  4. Update HL: Fine-tune the high-level policy by supervising on the human language corrections (treat corrections as ground-truth labels for the observations where they occurred)
  5. Repeat: Deploy updated HL policy, collect more corrections, iterate
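
The loop above can be sketched with toy stand-ins. Everything here is an illustrative assumption rather than the papers' implementation: the high-level policy is a lookup table from a text description of the observation to a command, "fine-tuning" is a table update on the aggregated corrections, and the human is a function that returns a correction only when it disagrees.

```python
from typing import Callable, Dict, List, Optional, Tuple

Correction = Tuple[str, str]  # (observation, corrected language command)


def hl_dagger_round(
    hl_policy: Dict[str, str],
    visited_observations: List[str],
    human: Callable[[str, str], Optional[str]],
    dataset: List[Correction],
) -> Dict[str, str]:
    """One deploy / intervene / update cycle for the high-level policy only."""
    for obs in visited_observations:
        predicted = hl_policy.get(obs, "wait")
        correction = human(obs, predicted)     # None means "no objection"
        if correction is not None:
            dataset.append((obs, correction))  # aggregate corrections across rounds
        # The frozen low-level policy would execute `correction or predicted` here.
    updated = dict(hl_policy)
    for obs, command in dataset:               # stand-in for supervised fine-tuning
        updated[obs] = command
    return updated


expert = {"bag on table": "pick up the bag", "bag in hand": "pick up the metal scoop"}


def human(obs: str, predicted: str) -> Optional[str]:
    return None if predicted == expert[obs] else expert[obs]


policy = {"bag on table": "pick up the bag", "bag in hand": "pour into the bag"}
dataset: List[Correction] = []
policy = hl_dagger_round(policy, list(expert), human, dataset)
print(policy["bag in hand"])  # pick up the metal scoop
```

Note that the low-level policy never appears in the update: only the mapping from observations to language commands changes.
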
Why This Is Elegant

In standard DAgger for flat policies, the human must provide motor-level corrections — physically teleoperating the robot. That's expensive and requires hardware access. With hierarchical DAgger, the human only provides language corrections — literally yelling at the robot. This is cheap, fast, and doesn't require any special equipment. You can even do it from another room.

After DAgger fine-tuning, the high-level policy can self-correct: if it starts to make a mistake (e.g., reaching for the wrong item), it has learned from past corrections to re-route to the right action without any human intervention.

Does It Work?

Two key empirical results from the Hi Robot and Yell At Your Robot papers:

1. Hierarchy beats flat. On long-horizon tasks, hierarchical VLA policies outperform flat VLA policies by 19-34% on average success rate. The gap grows with task horizon — the longer the task, the more hierarchy helps.

2. HL DAgger beats vanilla imitation. After high-level DAgger fine-tuning, task performance jumps by ~20% compared to the base hierarchical policy. The DAgger-trained policy approaches the performance of having a human oracle correct every mistake in real time.

Real-World Result — π0.5 (Physical Intelligence, 2025)

The same principles at industry scale. π0.5 uses a pre-trained VLA as the high-level policy that outputs subtask language labels ("pick up the pillow"), and a small diffusion action expert (300M parameters) as the low-level policy that converts subtask labels into continuous motor actions. The system handles household tasks with horizons of 100+ steps (cleaning bedrooms, organizing laundry) that no flat policy has achieved reliably.

Chapter 10

Hierarchical Imitation with Image Goals

What if you don't have language annotations for your demonstration data? SuSIE (Black, Nakamoto, Atreya, Walke, Finn, Kumar, Levine, ICLR 2024) replaces language goals with image goals, opening the door to learning from unlabeled video.

The Architecture

Instead of outputting a language string, the high-level policy is an image editing model. Given the current image, it produces a goal image showing what the scene should look like after completing the next subtask — for example, showing the bowl moved to a new position.

SuSIE Architecture:
g_t = ImageEdit(o_t, task)  ← "what should the world look like next?"
a_t = π_LL(o_t, g_t)    ← "how do I get there?"

The low-level policy is a goal-image-conditioned policy: given the current observation and a goal image, produce actions that transform the scene from the current state toward the goal state. No language is involved anywhere in the pipeline.

Training Data: No Language Required

The key benefit: you don't need segmented demonstrations with language labels. The high-level image editing model can be trained on unlabeled videos — just pairs of (current frame, future frame). These can come from robot data, or even from videos of humans performing tasks.

Human Videos as Training Data

Since the high-level policy only needs to understand "what the world should look like next" (not how to move robot joints), it can learn from human demonstration videos. A human opening a drawer looks different from a robot opening a drawer at the motor level, but the goal image — "drawer is now open" — is the same regardless of who performed the action.

Dataset | Scene B (Avg.) | Scene C (Avg.)
BridgeData (robot) only | 0.30 | 0.80
BridgeData + Something-Something (human videos) | 0.50 | 0.88

Adding human video data to the high-level policy training improved average success rates by 8 to 20 percentage points (0.80 → 0.88 in Scene C, 0.30 → 0.50 in Scene B). The human videos teach the model about object affordances and common manipulation sequences, even though they contain no robot-specific information.

The Hierarchy Insight for Data

Hierarchy enables asymmetric data requirements. The high-level policy (goal imagination) can train on abundant, cheap data (internet videos, human demos). The low-level policy (motor execution) needs expensive robot-specific data, but for much simpler tasks (reach this goal state). You're matching each level to its natural data source.

Chapter 11

Hierarchical Reinforcement Learning

Everything so far has used imitation learning — learning from demonstrations. Can we apply hierarchy to RL, where we learn from reward signals instead? Yes, but the design choices are different.

State-Reaching Goals: HIRO

HIRO (Nachum, Gu, Lee, Levine, NeurIPS 2018) uses state vectors as goals. The goal g_t specifies a target relative position, e.g., "move 2 meters forward and 1 meter left."

System
HIRO — HIerarchical Reinforcement learning with Off-policy correction

Low-level: Goal-conditioned policy trained with a simple goal-reaching reward: r_LL = -||current_state - goal_state||. No task-specific reward needed for the low-level.

High-level: Trained to output goal states using the task reward. HL actions are goal vectors. Trained with off-policy RL (like TD3 or SAC).
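
The low-level reward is just a negative distance, so it is easy to sketch. The helper names are hypothetical, and the example assumes a Euclidean norm and a goal given as an offset from the current state, following the "relative position" description above.

```python
import math
from typing import Sequence, Tuple


def absolute_goal(state: Sequence[float], relative_goal: Sequence[float]) -> Tuple[float, ...]:
    """Convert a relative goal ("move by this offset") into a target state."""
    return tuple(s + g for s, g in zip(state, relative_goal))


def low_level_reward(current_state: Sequence[float], goal_state: Sequence[float]) -> float:
    """r_LL = -||current_state - goal_state|| (Euclidean norm)."""
    return -math.dist(current_state, goal_state)


target = absolute_goal((0.0, 0.0), (2.0, 1.0))   # "2 m forward, 1 m left"
print(low_level_reward((0.0, 0.0), target))      # about -2.236
print(low_level_reward((1.0, 0.5), target))      # halfway there: about -1.118
```

The reward rises toward zero as the agent approaches the target, and it never references the task reward, which is what lets the low-level policy train independently of the task.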

Hindsight Relabeling for the High Level

Here's the clever part. The high-level policy outputs a goal g_t, saying "go to position X." The low-level policy tries but ends up at position Y instead (because it's imperfect). In standard RL, this is just a failure. But with hindsight relabeling, we can pretend the high-level policy intended position Y all along. We relabel the high-level action g_t with the state the agent actually reached, and store this relabeled transition in the replay buffer.

Hindsight Relabeling:
Original: (o_t, g_t, reward, o_{t+n})  ← HL intended g_t, agent reached s_{t+n}
Relabeled: (o_t, s_{t+n}, reward, o_{t+n})  ← pretend HL intended s_{t+n}
The reward stays the same; we just relabel which goal was "intended."

This is crucial for off-policy learning. Without relabeling, the high-level transitions in the replay buffer would be useless because the low-level policy has changed since they were collected. With relabeling, we can reuse old experience even as the low-level policy improves.
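
A relabeling step in this simplified form is a one-line edit to a stored transition. The sketch below follows the description in this section rather than the exact procedure in the HIRO paper; the class and function names are hypothetical, and it assumes the observation is the full state so the reached state can be read off `next_obs`.

```python
from typing import NamedTuple, Tuple

State = Tuple[float, ...]


class HighLevelTransition(NamedTuple):
    obs: State        # o_t, when the goal was issued
    goal: State       # the high-level action g_t
    reward: float     # task reward for this high-level step
    next_obs: State   # o_{t+n}


def hindsight_relabel(tr: HighLevelTransition, reached_state: State) -> HighLevelTransition:
    """Swap the intended goal for the state actually reached; keep everything else."""
    return tr._replace(goal=reached_state)


original = HighLevelTransition(obs=(0.0, 0.0), goal=(2.0, 1.0), reward=-3.0, next_obs=(1.5, 0.2))
relabeled = hindsight_relabel(original, reached_state=original.next_obs)
print(relabeled.goal, relabeled.reward)  # (1.5, 0.2) -3.0
```

Only the goal field changes; the observation, reward, and next observation are stored exactly as they were collected.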

Language Goals for HRL

Jiang, Gu, Murphy, and Finn (NeurIPS 2019) extended this idea to language goals. The high-level policy outputs language commands, and the low-level policy is language-conditioned. They use hindsight language relabeling: after a trajectory, they describe what the agent actually did in language and relabel the high-level action with that description.

Skill Discovery Without Supervision

A fascinating research direction: can you discover a diverse set of useful skills without any task-specific supervision?

DIAYN (Eysenbach, Gupta, Ibarz, Levine, ICLR 2019) answers yes. The idea: train a set of skills (low-level policies) to be as diverse as possible — each skill should produce different behavior, and you should be able to tell which skill was used by observing the outcome.

DIAYN Objective: max I(s; z) + H[a | s] - I(a; z | s)
Maximize mutual information between skill z and states visited,
keep the policy's actions as random as possible,
and minimize the information that actions carry about the skill beyond what the state already reveals.
This means: skills should be distinguishable by the states they reach, not by the particular actions they take.

The result: without any reward function, the agent learns a library of diverse skills — different locomotion gaits, turning behaviors, jumping patterns. These can then be composed by a high-level policy for downstream tasks.
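
In practice DIAYN optimizes its objective through a per-step pseudo-reward, log q(z | s) - log p(z), where q is a learned discriminator that guesses the active skill from the state and p(z) is the prior over skills. The sketch below computes that reward from a given discriminator output; the function name is illustrative, the prior is assumed uniform, and the discriminator itself (a trained classifier) is not shown.

```python
import math
from typing import Sequence


def diayn_pseudo_reward(discriminator_probs: Sequence[float], skill: int) -> float:
    """log q(z | s) - log p(z), with a uniform prior p(z) over skills."""
    num_skills = len(discriminator_probs)
    return math.log(discriminator_probs[skill]) - math.log(1.0 / num_skills)


print(diayn_pseudo_reward([0.25, 0.25, 0.25, 0.25], skill=0))      # 0.0
print(diayn_pseudo_reward([0.70, 0.10, 0.10, 0.10], skill=0) > 0)  # True
```

A state that tells the discriminator nothing about the skill earns zero reward, while a state that identifies the skill earns a positive reward, which pushes each skill toward its own region of state space.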

Open Research Direction

Fine-tuning large-scale hierarchical robot learning systems with RL is an open and important research direction. Most current industrial systems (PI's π0.5, NVIDIA GR00T N1, Figure Helix, Google Gemini Robotics) use hierarchical imitation learning. Adding RL fine-tuning on top could improve performance further, but the combination of hierarchy + large models + RL has not yet been cracked at scale.

Interactive — Hierarchical RL with Hindsight Relabeling
Chapter 12

Summary & Cheat Sheet

The Core Framework

Hierarchical Policy:
g_t ∼ π_HL(· | o_t, task)    slow, strategic, runs at ~2-5 Hz
a_t ∼ π_LL(· | o_t, g_t)     fast, reactive, runs at ~50 Hz

Decision Cheat Sheet

Design Choice | Options | Recommendation
Goal representation | Language, image, state, latent | Language if you have annotations; image if unlabeled video available; state for simple RL
Training paradigm | IL, RL, IL + RL fine-tuning | IL is proven at scale; RL for HL fine-tuning is promising but open
Supervision | Joint, separate, pre-trained HL | Pre-trained LLM/VLM as HL + train LL, or separate then adapt
Transitions | Fixed interval, completion-based | Fixed interval is simpler and more robust; completion risks getting stuck
HL improvement | More data, DAgger, RL | HL DAgger with language corrections is cheap and effective

Key Papers

Paper | Contribution | Goal Type | IL/RL
Yell At Your Robot (RSS 2024) | HL DAgger with language corrections | Language | IL
Hi Robot (ICML 2025) | Hierarchical VLA at scale | Language | IL
π0.5 (arXiv 2025) | Industry-scale hierarchical VLA + action expert | Language | IL
SuSIE (ICLR 2024) | Image editing as HL policy | Image | IL
HIRO (NeurIPS 2018) | Off-policy HRL with hindsight relabeling | State | RL
Language Abstraction (NeurIPS 2019) | Language goals for HRL | Language | RL
DIAYN (ICLR 2019) | Unsupervised skill discovery | Latent | RL

The Big Picture

Where This Fits

Hierarchy is one of three main approaches to long-horizon robot learning covered in CS 224R:

1. Hierarchy (this lecture) — decompose into HL + LL policies.

2. Multi-task & meta-learning (previous lectures) — learn shared representations across tasks.

3. RL fine-tuning of foundation models (next lectures) — start from VLAs, improve with RL + sim-to-real.

These approaches are complementary, not competing. State-of-the-art systems like π0.5 combine all three: hierarchical structure, multi-task pre-training, and (potentially) RL fine-tuning.