Think about what "cook focaccia" actually requires. You need to gather flour, yeast, olive oil, salt. Mix them. Knead the dough. Let it rise for an hour. Dimple it with your fingers. Drizzle more oil. Bake at 425°F for 20 minutes. Each of those steps is itself a multi-step procedure. A single flat policy would need to learn the entire sequence, from opening the flour bag to pulling the bread from the oven, as one monolithic behavior.
This isn't unique to cooking. Consider these long-horizon tasks:
• Fix a bug that's causing neural network training loss to explode — requires reading logs, forming hypotheses, editing code, re-running training, interpreting results.
• Drive to Yosemite — navigate out of the parking lot, merge onto the highway, handle 3 hours of varied road conditions, find the campsite.
• Give feedback on an 8-page report — read each section, note issues, check citations, write coherent comments, re-read for consistency.
Why Are These Hard?
Three compounding difficulties make long-horizon tasks brutal for any learning algorithm:
1. Enormous state space. A 100-step task visits far more distinct states than a 10-step task. The number of possible trajectories grows exponentially with horizon length. If you have |A| possible actions per step and a horizon of T, the number of possible trajectories is $|A|^T$. Even with just 4 actions and a horizon of 50, that's $4^{50} \approx 10^{30}$ possible trajectories.
Trajectory Space
$|\text{trajectory space}| = |A|^T$
With 4 actions and $T = 50$: $4^{50} \approx 1.27 \times 10^{30}$ trajectories
2. Compounding errors. At each timestep, the policy has some probability ε of making a mistake. Over T steps, the probability of a perfect rollout is $(1-\varepsilon)^T$. For ε = 0.05 and T = 100, that's $0.95^{100} \approx 0.006$: less than a 1% chance of success even with a 95%-accurate policy.
3. Getting stuck. Long tasks create many opportunities for the agent to enter irrecoverable states. A robot that drops a key ingredient halfway through cooking has no path to success. In RL terms, many states become "absorbing failure states" from which no sequence of actions can reach the goal.
The Core Challenge
Long-horizon tasks are hard not because any individual step is hard, but because the number of steps compounds every source of error. A policy that's excellent at each individual action can still fail catastrophically over 100 steps. We need a way to structure the problem so we don't require perfection over the entire horizon.
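The two numbers quoted above are easy to reproduce. The sketch below is only an illustration: the function names are made up for this example, and the success formula assumes that mistakes are independent across steps and that any single mistake ruins the rollout.

```python
def num_trajectories(num_actions: int, horizon: int) -> int:
    """Number of distinct action sequences of length `horizon`: |A|^T."""
    return num_actions ** horizon


def perfect_rollout_probability(eps: float, horizon: int) -> float:
    """Chance of zero mistakes in `horizon` steps, assuming each step
    independently goes wrong with probability `eps`: (1 - eps)^T."""
    return (1.0 - eps) ** horizon
```

Calling `num_trajectories(4, 50)` gives roughly 1.27 × 10^30, and `perfect_rollout_probability(0.05, 100)` gives roughly 0.006, matching the figures in the text.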
Interactive — Compounding Error Probability
Chapter 02
The Main Idea: Two Policies
Here's the key insight: "bake a cheesecake" is impossibly hard as one monolithic task. But "buy ingredients" is manageable. "Go to the store" is easier still. "Walk to the door" is trivial. "Take a step" is reflexive. We naturally decompose complex tasks into a hierarchy of subtasks, each at a different level of abstraction.
The hierarchical policy framework makes this decomposition explicit with two (or more) policies:
Definition
High-Level Policy ($\pi_{HL}$)
Takes the current observation $o_t$ and the overall task description (e.g., "bake a cheesecake"), and outputs an intermediate goal $g_t$. This goal is a subcommand like "preheat the oven" or "mix the batter." The high-level policy runs at a low frequency: it only decides on new goals occasionally.
Definition
Low-Level Policy ($\pi_{LL}$)
Takes the current observation $o_t$ and the current goal $g_t$ (from the high-level policy), and outputs primitive actions $a_t$ (e.g., joint torques, motor commands). The low-level policy runs at a high frequency: it produces actions every timestep.
The variable $g_t$ goes by many names in the literature: subgoals, subtasks, skills, options, or high-level actions. They all refer to the same thing: an intermediate target that the low-level policy tries to accomplish.
Key Insight
You can have more than two levels of hierarchy. A three-level system might have a strategic planner ("make dinner"), a tactical planner ("prepare the pasta sauce"), and a motor controller ("rotate wrist 15 degrees"). But two levels are by far the most common in practice, and the principles generalize.
Interactive — Hierarchical Policy Architecture
Chapter 03
Rolling Out a Hierarchy
How does a hierarchical policy actually execute? Let's trace through the rollout procedure step by step. This is the algorithm that converts two trained policies into actual behavior.
Algorithm: Hierarchical Policy Rollout
1. Initialize: Observe the initial observation $o_1$ at $t = 1$.
2. Plan: Query the high-level policy for a goal: $g_t \sim \pi_{HL}(\cdot \mid o_t, \text{task prompt})$.
3. Execute: Until a new goal is selected (see Chapter 8 for when):
   a. Execute action: $a_t \sim \pi_{LL}(\cdot \mid o_t, g_t)$
   b. Step environment: $t \leftarrow t + 1$, observe new $o_t$
4. Replan: When the goal is "done" (or after n steps), go to step 2.
5. Terminate: When the overall task is complete or time runs out.
Notice the two timescales. The high-level policy might produce a new goal every 20 timesteps, while the low-level policy acts every single timestep. If the robot runs at 50 Hz control, the high-level policy might replan at 2.5 Hz — every 400ms it decides "what should we be doing now?", while the low-level policy handles the moment-to-moment motor commands at full speed.
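The rollout procedure can be sketched as a short loop. This is a minimal illustration, not a reference implementation: `env_reset`, `env_step`, `high_level`, and `low_level` are hypothetical callables standing in for a real environment and trained policies, and the sketch assumes the simplest replanning rule (a fixed interval of `replan_every` steps).

```python
from typing import Any, Callable, List, Tuple


def hierarchical_rollout(
    env_reset: Callable[[], Any],
    env_step: Callable[[Any], Tuple[Any, bool]],
    high_level: Callable[[Any, str], Any],
    low_level: Callable[[Any, Any], Any],
    task: str,
    replan_every: int = 20,
    max_steps: int = 200,
) -> List[Tuple[int, Any, Any]]:
    """Run one episode and return a log of (t, goal, action) triples."""
    obs = env_reset()                  # Initialize: observe o_1
    goal = high_level(obs, task)       # Plan: first goal from the HL policy
    log: List[Tuple[int, Any, Any]] = []
    for t in range(1, max_steps + 1):
        action = low_level(obs, goal)  # Execute: LL acts every timestep
        log.append((t, goal, action))
        obs, done = env_step(action)   # Step the environment
        if done:                       # Terminate: task complete
            break
        if t % replan_every == 0:      # Replan: HL is queried only occasionally
            goal = high_level(obs, task)
    return log
```

With `replan_every=20` and a 50 Hz control loop, the high-level policy is queried every 400 ms while the low-level policy produces an action on every iteration of the loop.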
A Concrete Example
Imagine a robot making a sandwich. Here's what the rollout looks like:
| Time | $\pi_{HL}$ output ($g_t$) | $\pi_{LL}$ output ($a_t$) |
|---|---|---|
| t=1 | "pick up bread" | move arm to bread, close gripper |
| t=20 | (same goal, still executing) | lift arm, move to plate |
| t=35 | "spread peanut butter" | pick up knife, scoop, spread motion |
| t=70 | "place second slice" | reach for bread, position, release |
The high-level policy only makes 3 decisions. The low-level policy makes 70+ decisions. Each low-level decision is simple (move arm slightly), while each high-level decision is strategic (what to do next).
Data Flow
Input to $\pi_{HL}$: RGB images from workspace cameras (e.g., 3×256×256), task string "make sandwich" → Output: language string or latent vector representing "pick up bread"
Input to $\pi_{LL}$: RGB images + joint positions (e.g., 7-DOF arm state) + goal "pick up bread" → Output: 7 joint target positions (or velocities) at 50 Hz
Chapter 04
Why Hierarchy Helps
Why not just train a single flat policy to go directly from observations to actions? Hierarchy provides four concrete advantages:
1. Supervision Signal Decomposition
A flat policy for "bake a cheesecake" gets a single reward at the very end: did the cheesecake turn out well? That's an incredibly sparse signal over hundreds of actions. With hierarchy, the low-level policy gets much denser feedback: did you successfully "pick up the bowl"? Did you "crack the egg"? Each subtask provides its own supervision signal, making learning tractable.
2. Knowledge Sharing Across Subtasks
The skill "pick up an object" appears in hundreds of tasks: cooking, cleaning, organizing, assembling. With a flat policy, the robot learns picking-up from scratch for each task. With hierarchy, the low-level policy learns "pick up" once and the high-level policy reuses it across many different tasks. This is the compositional benefit of hierarchy.
Compositionality
If you have K subtask skills and N tasks, a flat approach needs to learn all N tasks separately. A hierarchical approach needs K skills + N high-level plans. When $K \ll N$ (many tasks reuse the same skills), hierarchy wins dramatically.
3. Structured Exploration (RL)
In reinforcement learning, exploration is critical. A flat policy explores in action space — randomly wiggling each joint. A hierarchical policy explores in goal space — "what if I tried going to that location?" or "what if I picked up the blue object instead?" Goal-space exploration is far more structured and covers meaningful variations much faster than random motor noise.
4. Practical Latency
Running a large VLM (vision-language model) takes time — maybe 200ms per inference. If you need actions at 50 Hz (20ms), you can't run the VLM every timestep. Hierarchy solves this naturally: the expensive VLM runs as the high-level policy at 5 Hz, while a fast, small neural network runs as the low-level policy at 50 Hz.
Latency Example — Physical Intelligence π0.5
$\pi_{HL}$: Pre-trained VLA (vision-language-action model), runs infrequently. Outputs subtask predictions like "pick up the pillow." Inference: ~300ms.
$\pi_{LL}$: Diffusion action expert (300M params), runs at full control frequency. Takes subtask + observation → continuous joint actions. Inference: ~5ms.
The high-level VLA decides what to do. The lightweight low-level model handles how to do it, fast enough for real-time robot control.
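The latency argument comes down to simple arithmetic on the numbers quoted above. The helper names below are invented for this sketch, and it assumes the high-level model handles one inference at a time, so it cannot be queried more often than once per inference latency.

```python
def max_hl_rate_hz(hl_latency_ms: int) -> float:
    """Highest replanning rate a model with this inference latency can sustain."""
    return 1000.0 / hl_latency_ms


def min_replan_interval(control_hz: int, hl_latency_ms: int) -> int:
    """Fewest low-level control steps that cover one high-level inference
    (integer ceiling of latency_ms * control_hz / 1000)."""
    return -(-(hl_latency_ms * control_hz) // 1000)
```

For the 200 ms VLM and a 50 Hz control loop, `max_hl_rate_hz(200)` is 5.0 and `min_replan_interval(50, 200)` is 10, so the low-level policy has to cover at least 10 control steps between high-level decisions. With the ~300 ms figure from the example box, the interval grows to 15 steps.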
Chapter 05
Hierarchy vs. Alternatives
Hierarchy isn't the only way to handle long-horizon tasks. Let's compare three approaches head-to-head.
Option A: Flat Policy
Flat Policy
$a_t \sim \pi(\cdot \mid o_t, \text{task})$
One policy does everything: observe → act. Simple but struggles with long horizons.
A single neural network maps observations directly to actions. No intermediate representation. This is the simplest approach, and it works well for short tasks (pick up an object, push a button). But for tasks requiring 50+ steps with diverse subtasks, flat policies struggle with the compounding error and sparse reward problems we discussed.
Option B: Chain-of-Thought Policy
Chain-of-Thought Policy
$(g_t, a_t) \sim \pi(\cdot \mid o_t, \text{task})$
Single policy outputs BOTH the subgoal reasoning and the action. One network, two heads.
A single policy outputs both a subgoal (as a "thought") and an action. Think of this as the policy narrating its reasoning: "I should pick up the bread [thought] → move arm right 5cm [action]." This can benefit from the same structured supervision as hierarchy, and it's simpler to implement since there's only one model.
The Comparison
| Property | Hierarchy | Flat | Chain-of-Thought |
|---|---|---|---|
| Number of models | 2 (HL + LL) | 1 | 1 |
| Benefits from subtask labels | Yes | No | Yes |
| Different frequencies | Yes (key advantage) | N/A | Possible but awkward |
| 50 Hz real-time control | Yes (small LL) | Only if fast | Expensive at every step |
| Modularity | High (swap HL/LL independently) | None | Low |
No Conclusive Winner (Yet)
There is no conclusive empirical comparison between hierarchy and chain-of-thought for robotics. Chain-of-thought may be too computationally expensive for 50 Hz control loops. But for slower domains (planning, coding, document editing), chain-of-thought may be sufficient. This is an active research question. The latency argument is the strongest practical case for hierarchy in robotics.
Chapter 06
Goal Representations
The high-level policy outputs some goal $g_t$. But what is $g_t$ concretely? A string? A vector? An image? This is one of the most important design decisions in hierarchical systems, and the best answer depends on the domain.
Language Goals
$g_t$ is a natural language string: "pick up the bowl", "move arm to the left", "open the drawer." This is the most intuitive representation. Humans naturally describe subgoals in language, and we can collect language-annotated demonstrations easily. The high-level policy is a VLM that outputs text; the low-level policy is a language-conditioned behavior cloning (LCBC) policy.
Pros: Interpretable, easy to debug ("why did the robot fail? it chose 'pour into bag' when it should have chosen 'position bag under nozzle'"), easy to provide human corrections.
Cons: Requires segmented demonstrations with language annotations. Language can be ambiguous ("move it over there").
Image Goals
$g_t$ is a target image showing what the world should look like after completing the subtask. The high-level policy is an image editing model: given the current image, it produces a modified image showing the desired next state. The low-level policy is a goal-image-conditioned policy.
Pros: No need for language annotations — can learn from unlabeled video. Can leverage human demonstration videos. Captures spatial details that language misses ("move the bowl 3 inches left" is hard to say precisely but easy to show in an image).
Cons: Harder to interpret. Image generation can introduce artifacts. The low-level policy needs to be robust to imperfect goal images.
State Goals
$g_t$ is a target state vector: a desired (x, y, z) position of the robot's end-effector, or a desired relative position between the agent and objects. This is the most compact representation.
Pros: Very precise. Easy to measure completion (did the agent reach the target state?). Works naturally with hindsight relabeling in RL.
Cons: Limited expressiveness — "approach from the left" vs. "approach from the right" may have the same target position but very different required trajectories. Only works when you can define a meaningful state space.
Properties of Good Goal Representations
Three Properties
1. Expressive — can communicate many different low-level behaviors. If the goal space is too small, the high-level policy can't express important distinctions.
2. Structured — similar behaviors should have similar goal representations. "Pick up red bowl" and "pick up blue bowl" should be close in goal space, not arbitrary distant points.
3. Appropriate abstraction level — not so high-level that the low-level policy can't figure out what to do ("make it work"), and not so low-level that the high-level policy has to micromanage ("rotate joint 3 by 0.02 radians").
| Goal Type | Expressiveness | Structure | Abstraction | Best For |
|---|---|---|---|---|
| Language | High | Inherited from LLMs | Natural | Manipulation with supervision |
| Image | Very high | Pixel space | Flexible | When unlabeled video available |
| State vector | Moderate | Euclidean | Low-level | Navigation, simple RL tasks |
| Latent vector | Depends on training | Learned | Learned | When no natural goal exists |
Chapter 07
Supervising Each Level
You have two policies. Both need training. But each depends on the other, creating a chicken-and-egg problem.
The Low-Level Policy
The low-level policy is trained to accomplish a goal g, not the original long-horizon task. Its objective is a standard imitation-learning loss (for IL) or a goal-conditioned RL objective (for RL), in both cases conditioned on g:
Low-Level Objective (Imitation Learning)
$\mathcal{L}_{LL}(\theta_{LL}) = \mathbb{E}_{(o, a, g) \sim \mathcal{D}}\left[ -\log \pi_{LL}(a \mid o, g; \theta_{LL}) \right]$
Standard behavior cloning loss, but conditioned on the goal g from segmented demos.
Critical question: For which distribution of goals should we train the low-level policy? Ideally, whatever goals the high-level policy will actually output. But we haven't trained the high-level policy yet, so we don't know what goals it will produce. In imitation learning, we can use the goals from the demonstration data. In RL, we often train on a broad distribution of goals (uniformly sampled from a goal space, or using hindsight relabeling).
The High-Level Policy
The high-level policy is trained to accomplish the original long-horizon task. Its performance depends on the low-level policy actually being able to execute the goals it outputs.
High-Level Objective (Imitation Learning)
$\mathcal{L}_{HL}(\theta_{HL}) = \mathbb{E}_{(o, g^*) \sim \mathcal{D}}\left[ -\log \pi_{HL}(g^* \mid o; \theta_{HL}) \right]$
Predict the correct subgoal g* given observation o. g* comes from demonstration labels.
Critical question: For which low-level policy should we evaluate? Ideally with the learned low-level policy. But we haven't trained it yet, or it's still imperfect.
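Both imitation objectives are the same negative log-likelihood, just over different targets. The sketch below makes that concrete with toy tabular policies. Everything in it is illustrative: `toy_pi_ll` and `toy_pi_hl` are hypothetical stand-ins that return probability tables, whereas a real system would use neural networks and minibatches.

```python
import math
from typing import Dict, Iterable


def mean_nll(log_probs: Iterable[float]) -> float:
    """Average negative log-likelihood over a batch of log-probabilities."""
    values = list(log_probs)
    return -sum(values) / len(values)


def low_level_loss(pi_ll, demos) -> float:
    """L_LL: mean of -log pi_LL(a | o, g) over (o, a, g) tuples."""
    return mean_nll(math.log(pi_ll(o, g)[a]) for o, a, g in demos)


def high_level_loss(pi_hl, labels) -> float:
    """L_HL: mean of -log pi_HL(g* | o) over (o, g*) pairs."""
    return mean_nll(math.log(pi_hl(o)[g_star]) for o, g_star in labels)


def toy_pi_ll(obs, goal) -> Dict[str, float]:
    """Hypothetical LL policy: action probabilities depend on the goal."""
    if goal == "pick up bread":
        return {"close_gripper": 0.8, "open_gripper": 0.2}
    return {"close_gripper": 0.2, "open_gripper": 0.8}


def toy_pi_hl(obs) -> Dict[str, float]:
    """Hypothetical HL policy: uniform over two subgoals."""
    return {"pick up bread": 0.5, "place second slice": 0.5}
```

The only structural difference between the two losses is what gets predicted: the low-level loss scores actions given an observation and a goal, while the high-level loss scores the labeled subgoal given an observation.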
The Chicken-and-Egg Resolution
Key Resolution
The two policies can be trained separately first — LL on goal-reaching, HL on goal-prediction — then at least one must be adapted to the deficiencies of the other. If the LL policy is bad at certain goals, the HL policy should learn to avoid outputting those goals (or vice versa). Ideally, both are fine-tuned jointly, but single-sided adaptation also works.
A natural insight: LLMs are often good high-level policies! They already understand task decomposition from pre-training on internet text. A frozen LLM can output reasonable subgoal sequences for many tasks. You then only need to train the low-level policy to follow those subgoals.
Why Not End-to-End with Latent Goals?
You might think: why not let the goal representation be latent (a learned vector) and train everything end-to-end? The answer: the result is a flat policy. If $g_t$ has no structure and both policies are trained jointly through $g_t$, training will simply learn to pass whatever information is useful through $g_t$, which is exactly what a flat policy with an internal hidden state does. The benefits of hierarchy come from giving $g_t$ meaning (language, images, states) and training the levels with separate objectives.
Research Wisdom
"Think carefully about where the benefits are coming from!" If you can't articulate why your hierarchical design is better than a flat policy with the same capacity, it probably isn't. The benefits come from: (1) meaningful goal representations that enable transfer, (2) separate supervision signals at each level, (3) different operating frequencies, or (4) pre-trained components (like LLMs as high-level policies).
Chapter 08
Subgoal Transitions
We've been saying "until a new goal is selected" without specifying when the high-level policy gets re-queried. This turns out to be a critical design decision with no universally right answer.
Option 1: Completion-Based Transitions
Re-query $\pi_{HL}$ when the low-level policy has completed the current goal $g_t$. This requires estimating progress toward the goal, for example with a learned classifier that predicts "has the subtask been achieved?"
Advantage: Ideal in principle — each goal runs exactly as long as it needs to.
Problems: Hard to estimate when a subtask is "done." If the completion estimator is wrong, the agent gets perpetually stuck — it thinks it hasn't finished, so it never moves on. Also: what if the agent makes a mistake that requires undoing a previous goal? The completion-based approach can't easily handle backtracking.
Fatal Errors
Errors in estimating completion are more fatal than errors in subgoal prediction. If the high-level policy picks a slightly wrong subgoal, the low-level policy might still do something reasonable. But if the transition detector never fires, the agent repeats the same subtask forever.
Option 2: Fixed-Interval Transitions
Re-query $\pi_{HL}$ every n timesteps, regardless of progress. For example, replan every 20 steps (400 ms at 50 Hz control).
Advantage: Simple. No completion estimation needed. Robust to getting stuck — even if the current subtask is going badly, the agent will replan in n steps.
Problems: If the high-level policy picks a wrong subtask, the low-level policy executes bad actions for n full steps before any correction. Small n means more compute spent on the (expensive) high-level policy. Large n means slower recovery from errors. There's a tradeoff between compute cost (more HL queries) and delay (longer before correction).
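Both transition rules fit in one small predicate, which also makes the failure mode easy to see. This is a sketch under stated assumptions: `should_replan` and its arguments are hypothetical names, and supplying both `n` and `is_done` at once (a completion check with a step-count timeout) is a combination added here for illustration, not something the text above prescribes.

```python
from typing import Callable, Optional


def should_replan(
    steps_on_goal: int,
    n: Optional[int] = None,
    is_done: Optional[Callable[[], bool]] = None,
) -> bool:
    """Decide whether to re-query the high-level policy.

    n only        -> fixed-interval transitions
    is_done only  -> completion-based transitions
    both          -> completion-based with a timeout (illustrative hybrid)
    """
    if is_done is not None and is_done():
        return True   # completion estimator says the subtask is finished
    if n is not None and steps_on_goal >= n:
        return True   # step budget for this goal is used up
    return False


def broken_estimator() -> bool:
    """A completion classifier that never fires (the 'stuck forever' case)."""
    return False
```

With `broken_estimator` as the only trigger, `should_replan` never returns True no matter how many steps pass. Adding any finite `n` bounds the damage to n steps, which is the recoverability property attributed to fixed-interval transitions.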
Interactive — Subgoal Transition Timing
| Property | Completion-Based | Fixed-Interval |
|---|---|---|
| Implementation complexity | High (need classifier) | Low (just count steps) |
| Failure mode | Agent gets stuck forever | n steps of wrong actions |
| Failure severity | Fatal (no recovery) | Recoverable at next replan |
| Compute cost | Variable (adaptive) | Predictable (every n steps) |
| Used in practice | Less common | More common (e.g., π0.5) |
Chapter 09
Hierarchical Imitation with Language Goals
Let's ground these abstractions in a real system. "Yell At Your Robot" (Shi, Hu, Zhao et al., RSS 2024) and "Hi Robot" (Shi et al., ICML 2025) are hierarchical imitation learning systems that use language as the goal representation.
The Data
Training data consists of segmented demonstrations with language annotations. Each demonstration is a video of a robot performing a task, manually segmented into subtask segments, each labeled with a language description:
| Segment | Language Label | Robot Data |
|---|---|---|
| 1 | "pick up the bag" | Images + joint positions for 30 timesteps |
| 2 | "pick up the metal scoop" | Images + joint positions for 25 timesteps |
| 3 | "scoop M&Ms" | Images + joint positions for 40 timesteps |
| 4 | "pour into the bag" | Images + joint positions for 35 timesteps |
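One way to picture how a segmented demonstration feeds both policies is to flatten it into two training sets. The sketch below is an assumption about the bookkeeping, not the papers' actual data pipeline: the `Segment` class and `to_training_sets` function are hypothetical, and labeling every frame of a segment with that segment's language label is a simplification.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class Segment:
    """One labeled chunk of a demonstration (hypothetical container)."""
    label: str                 # language annotation, e.g. "scoop M&Ms"
    observations: List[Any]    # images + joint positions per timestep
    actions: List[Any]         # recorded robot actions per timestep


def to_training_sets(
    demo: List[Segment],
) -> Tuple[List[Tuple[Any, Any, str]], List[Tuple[Any, str]]]:
    """Split a segmented demo into LL tuples (o, a, g) and HL pairs (o, g*)."""
    ll_data: List[Tuple[Any, Any, str]] = []
    hl_data: List[Tuple[Any, str]] = []
    for seg in demo:
        for obs, act in zip(seg.observations, seg.actions):
            ll_data.append((obs, act, seg.label))  # LL: imitate action given goal
            hl_data.append((obs, seg.label))       # HL: predict goal from observation
    return ll_data, hl_data
```

For the four segments in the table (30, 25, 40, and 35 timesteps), this yields 130 low-level tuples and 130 high-level pairs from a single demonstration.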
The Architecture
The system uses an ALOHA bimanual robot workcell with multiple cameras. The two policies are:
System Component
High-Level Language Policy
Input: RGB images from workspace cameras (current observation)
Output: Language string (the next subtask to execute)
Architecture: Vision-Language Model — processes images through a vision encoder, concatenates with previous language context, produces next language command via autoregressive decoding.
System Component
Low-Level LCBC Policy
Input: RGB images + joint positions + language goal string
Output: Joint target positions (7-DOF per arm × 2 arms = 14 values)
Architecture: Language-Conditioned Behavior Cloning (LCBC) — a policy that takes a language instruction and produces motor actions via a Transformer or diffusion policy.
Training: High-Level DAgger
Both policies are initially trained with supervised learning on the segmented demonstrations. But can we improve the high-level policy after deployment? Yes — with DAgger applied at the high level.
Recall that DAgger collects on-policy data by running the learned policy, then getting expert corrections. For hierarchical systems, the expert corrections are language commands. A human watches the robot and says "no, pick up the scoop first" when the high-level policy makes a wrong decision.
Algorithm: High-Level DAgger for Hierarchical IL
1. Deploy: Run the hierarchical policy (HL + LL) on the robot.
2. Intervene: Human provides language corrections when HL makes wrong decisions. Language corrections override the high-level policy's prediction, so the LL policy receives the human's command instead.
3. Freeze LL: Keep the low-level LCBC policy frozen (no updates to motor control).
4. Update HL: Fine-tune the high-level policy by supervising on the human language corrections (treat corrections as ground-truth labels for the observations where they occurred).
5. Repeat: Deploy updated HL policy, collect more corrections, iterate.
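The loop structure of high-level DAgger can be sketched in a few lines. This is a schematic under stated assumptions: `run_episode` and `finetune` are hypothetical callables (deployment with a human in the loop, and a supervised update of the high-level policy, respectively), and aggregating corrections across rounds follows standard DAgger rather than a detail confirmed by the text.

```python
from typing import Any, Callable, List, Tuple

Correction = Tuple[Any, str]  # (observation, corrected language command)


def high_level_dagger(
    hl_policy: Any,
    run_episode: Callable[[Any], List[Correction]],
    finetune: Callable[[Any, List[Correction]], Any],
    rounds: int,
) -> Tuple[Any, List[Correction]]:
    """Iteratively improve the HL policy from language corrections.

    The low-level policy never appears here: it stays frozen and is
    only used inside `run_episode` to execute whatever command is active.
    """
    dataset: List[Correction] = []
    for _ in range(rounds):
        corrections = run_episode(hl_policy)      # deploy + human interventions
        dataset.extend(corrections)               # aggregate on-policy labels
        hl_policy = finetune(hl_policy, dataset)  # supervised update of HL only
    return hl_policy, dataset
```

The design point worth noticing is the type of `Correction`: it pairs an observation with a language string, so collecting it needs a person who can watch and speak, not a teleoperation rig.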
Why This Is Elegant
In standard DAgger for flat policies, the human must provide motor-level corrections — physically teleoperating the robot. That's expensive and requires hardware access. With hierarchical DAgger, the human only provides language corrections — literally yelling at the robot. This is cheap, fast, and doesn't require any special equipment. You can even do it from another room.
After DAgger fine-tuning, the high-level policy can self-correct: if it starts to make a mistake (e.g., reaching for the wrong item), it has learned from past corrections to re-route to the right action without any human intervention.
Does It Work?
Two key empirical results from the Hi Robot and Yell At Your Robot papers:
1. Hierarchy beats flat. On long-horizon tasks, hierarchical VLA policies outperform flat VLA policies by 19-34% on average success rate. The gap grows with task horizon — the longer the task, the more hierarchy helps.
2. HL DAgger beats vanilla imitation. After high-level DAgger fine-tuning, task performance jumps by ~20% compared to the base hierarchical policy. The DAgger-trained policy approaches the performance of having a human oracle correct every mistake in real time.
Real-World Result — π0.5 (Physical Intelligence, 2025)
The same principles at industry scale. π0.5 uses a pre-trained VLA as the high-level policy that outputs subtask language labels ("pick up the pillow"), and a small diffusion action expert (300M parameters) as the low-level policy that converts subtask labels into continuous motor actions. The system handles household tasks with horizons of 100+ steps, such as cleaning bedrooms and organizing laundry, that no flat policy has achieved reliably.
Chapter 10
Hierarchical Imitation with Image Goals
What if you don't have language annotations for your demonstration data? SuSIE (Black, Nakamoto, Atreya, Walke, Finn, Kumar, Levine, ICLR 2024) replaces language goals with image goals, opening the door to learning from unlabeled video.
The Architecture
Instead of outputting a language string, the high-level policy is an image editing model. Given the current image, it produces a goal image showing what the scene should look like after completing the next subtask — for example, showing the bowl moved to a new position.
SuSIE Architecture
$g_t = \text{ImageEdit}(o_t, \text{task})$ ← "what should the world look like next?"
$a_t = \pi_{LL}(o_t, g_t)$ ← "how do I get there?"
The low-level policy is a goal-image-conditioned policy: given the current observation and a goal image, produce actions that transform the scene from the current state toward the goal state. No language is involved anywhere in the pipeline.
Training Data: No Language Required
The key benefit: you don't need segmented demonstrations with language labels. The high-level image editing model can be trained on unlabeled videos — just pairs of (current frame, future frame). These can come from robot data, or even from videos of humans performing tasks.
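Building (current frame, future frame) pairs from an unlabeled video needs no annotation at all. The sketch below assumes a fixed look-ahead of `horizon` frames purely for simplicity; the function name is hypothetical, and how far ahead the real system samples its goal frames is not specified in the text above.

```python
from typing import Any, List, Sequence, Tuple


def frame_pairs(video: Sequence[Any], horizon: int) -> List[Tuple[Any, Any]]:
    """Pair each frame with the frame `horizon` steps later.

    Each pair is one training example for the image-editing model:
    input = current frame, target = what the scene looks like later.
    Returns an empty list if the video is shorter than the horizon.
    """
    return [(video[t], video[t + horizon]) for t in range(len(video) - horizon)]
```

Because the function only indexes into a sequence of frames, the same code applies to robot video and to human video, which is what allows the high-level model to draw on both.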
Human Videos as Training Data
Since the high-level policy only needs to understand "what the world should look like next" (not how to move robot joints), it can learn from human demonstration videos. A human opening a drawer looks different from a robot opening a drawer at the motor level, but the goal image — "drawer is now open" — is the same regardless of who performed the action.
| Dataset | Scene B (Avg.) | Scene C (Avg.) |
|---|---|---|
| BridgeData (robot) only | 0.30 | 0.80 |
| BridgeData + Something-Something (human videos) | 0.50 | 0.88 |
Adding human video data to the high-level policy training improved average success rates by 8 to 20 percentage points (0.80 to 0.88 in Scene C, 0.30 to 0.50 in Scene B). The human videos teach the model about object affordances and common manipulation sequences, even though they contain no robot-specific information.
The Hierarchy Insight for Data
Hierarchy enables asymmetric data requirements. The high-level policy (goal imagination) can train on abundant, cheap data (internet videos, human demos). The low-level policy (motor execution) needs expensive robot-specific data, but for much simpler tasks (reach this goal state). You're matching each level to its natural data source.
Chapter 11
Hierarchical Reinforcement Learning
Everything so far has used imitation learning — learning from demonstrations. Can we apply hierarchy to RL, where we learn from reward signals instead? Yes, but the design choices are different.
State-Reaching Goals: HIRO
HIRO (Nachum, Gu, Lee, Levine, NeurIPS 2018) uses state vectors as goals. The goal gt specifies a target relative position — e.g., "move 2 meters forward and 1 meter left."
System
HIRO — HIerarchical Reinforcement learning with Off-policy correction
Low-level: Goal-conditioned policy trained with a simple goal-reaching reward: $r_{LL} = -\lVert \text{current\_state} - \text{goal\_state} \rVert$. No task-specific reward needed for the low-level.
High-level: Trained to output goal states using the task reward. HL actions are goal vectors. Trained with off-policy RL (like TD3 or SAC).
Hindsight Relabeling for the High Level
Here's the clever part. The high-level policy outputs a goal $g_t$, saying "go to position X." The low-level policy tries but ends up at position Y instead (because it's imperfect). In standard RL, this is just a failure. But with hindsight relabeling, we can pretend the high-level policy intended position Y all along. We relabel the high-level action $g_t$ with the state the agent actually reached, and store this relabeled transition in the replay buffer.
Hindsight Relabeling
Original: $(o_t, g_t, \text{reward}, o_{t+n})$ ← HL intended $g_t$, agent reached $s_{t+n}$
Relabeled: $(o_t, s_{t+n}, \text{reward}, o_{t+n})$ ← pretend HL intended $s_{t+n}$
The reward stays the same; we just relabel which goal was "intended."
This is crucial for off-policy learning. Without relabeling, the high-level transitions in the replay buffer would be useless because the low-level policy has changed since they were collected. With relabeling, we can reuse old experience even as the low-level policy improves.
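The goal-reaching reward and the relabeling step described above can be sketched as follows. This is a simplified illustration of the idea as this chapter presents it, not HIRO's actual code: `HLTransition`, `low_level_reward`, and `relabel` are hypothetical names, and states and goals are plain coordinate tuples.

```python
import math
from typing import NamedTuple, Sequence


class HLTransition(NamedTuple):
    """One high-level replay-buffer entry spanning n low-level steps."""
    obs: Sequence[float]
    goal: Sequence[float]       # the high-level "action"
    reward: float               # task reward accumulated over the n steps
    next_obs: Sequence[float]


def low_level_reward(state: Sequence[float], goal: Sequence[float]) -> float:
    """Goal-reaching reward: negative Euclidean distance to the goal."""
    return -math.dist(state, goal)


def relabel(tr: HLTransition, reached_state: Sequence[float]) -> HLTransition:
    """Replace the intended goal with the state actually reached.

    Observation, reward, and next observation are left untouched.
    """
    return tr._replace(goal=tuple(reached_state))
```

A transition whose intended goal was (2.0, 1.0) but whose rollout ended at (1.5, 0.2) is stored with goal (1.5, 0.2) and the same reward, so it stays consistent with what the low-level policy actually did.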
Language Goals for HRL
Jiang, Gu, Murphy, and Finn (NeurIPS 2019) extended this idea to language goals. The high-level policy outputs language commands, and the low-level policy is language-conditioned. They use hindsight language relabeling: after a trajectory, they describe what the agent actually did in language and relabel the high-level action with that description.
Skill Discovery Without Supervision
A fascinating research direction: can you discover a diverse set of useful skills without any task-specific supervision?
DIAYN (Eysenbach, Gupta, Ibarz, Levine, ICLR 2019) answers yes. The idea: train a set of skills (low-level policies) to be as diverse as possible — each skill should produce different behavior, and you should be able to tell which skill was used by observing the outcome.
DIAYN Objective
$\max \; I(z; s) + \mathcal{H}[a \mid s] - I(z; a \mid s)$
Maximize the mutual information between the skill z and the states visited, keep the policy's action entropy high, and minimize the information that actions carry about the skill once the state is known. This means skills should lead to distinguishable states, and it is the states, not the individual actions, that reveal which skill is running.
The result: without any reward function, the agent learns a library of diverse skills — different locomotion gaits, turning behaviors, jumping patterns. These can then be composed by a high-level policy for downstream tasks.
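In the DIAYN paper, this objective is optimized through a learned discriminator q(z | s) that guesses the skill from the state, and each skill is rewarded with log q(z | s) − log p(z). The sketch below only computes that pseudo-reward from a given discriminator probability; the function name is hypothetical, and the uniform prior over skills is an assumption made here for simplicity.

```python
import math


def diayn_reward(q_z_given_s: float, num_skills: int) -> float:
    """Pseudo-reward log q(z|s) - log p(z), with uniform p(z) = 1/num_skills.

    Positive when the discriminator identifies the active skill better
    than chance, zero at chance level, negative below chance.
    """
    return math.log(q_z_given_s) - math.log(1.0 / num_skills)
```

With 4 skills, a discriminator that is only at chance (probability 0.25) yields zero reward, so a skill earns reward only by visiting states that make it recognizable.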
Open Research Direction
Fine-tuning large-scale hierarchical robot learning systems with RL is an open and important research direction. Most current industrial systems (PI's π0.5, NVIDIA GR00T N1, Figure Helix, Google Gemini Robotics) use hierarchical imitation learning. Adding RL fine-tuning on top could improve performance further, but the combination of hierarchy + large models + RL has not yet been cracked at scale.
Interactive — Hierarchical RL with Hindsight Relabeling
Chapter 12
Summary & Cheat Sheet
The Core Framework
Hierarchical Policy
$g_t \sim \pi_{HL}(\cdot \mid o_t, \text{task})$: slow, strategic, runs at ~2-5 Hz
$a_t \sim \pi_{LL}(\cdot \mid o_t, g_t)$: fast, reactive, runs at ~50 Hz
Decision Cheat Sheet
| Design Choice | Options | Recommendation |
|---|---|---|
| Goal representation | Language, image, state, latent | Language if you have annotations; image if unlabeled video available; state for simple RL |
| Training paradigm | IL, RL, IL + RL fine-tuning | IL is proven at scale; RL for HL fine-tuning is promising but open |
| Supervision | Joint, separate, pre-trained HL | Pre-trained LLM/VLM as HL + train LL, or separate then adapt |
| Transitions | Fixed interval, completion-based | Fixed interval is simpler and more robust; completion risks getting stuck |
| HL improvement | More data, DAgger, RL | HL DAgger with language corrections is cheap and effective |
Key Papers
| Paper | Contribution | Goal Type | IL/RL |
|---|---|---|---|
| Yell At Your Robot (RSS 2024) | HL DAgger with language corrections | Language | IL |
| Hi Robot (ICML 2025) | Hierarchical VLA at scale | Language | IL |
| π0.5 (arXiv 2025) | Industry-scale hierarchical VLA + action expert | Language | IL |
| SuSIE (ICLR 2024) | Image editing as HL policy | Image | IL |
| HIRO (NeurIPS 2018) | Off-policy HRL with hindsight relabeling | State | RL |
| Language Abstraction (NeurIPS 2019) | Language goals for HRL | Language | RL |
| DIAYN (ICLR 2019) | Unsupervised skill discovery | Latent | RL |
The Big Picture
Where This Fits
Hierarchy is one of three main approaches to long-horizon robot learning covered in CS 224R:
1. Hierarchical policies (this lecture).
2. Multi-task pre-training.
3. RL fine-tuning of foundation models (next lectures): start from VLAs, improve with RL + sim-to-real.
These approaches are complementary, not competing. State-of-the-art systems like π0.5 combine all three: hierarchical structure, multi-task pre-training, and (potentially) RL fine-tuning.