HER — Veanors

Chapter 0: The Problem

You are a robotic arm. Your task: push a block to a red target position on a table. You get a reward of 0 when the block reaches the target and −1 at every other timestep. That is it. No partial credit for getting close. No shaping. Just: did you succeed, yes or no?

This is called a sparse binary reward. And it makes reinforcement learning nearly impossible.

Why? Because a random policy will almost never push the block to exactly the right spot by accident. The agent receives −1 on every timestep of every episode. It has no gradient signal and no indication that one trajectory was better than another. Every failure looks identical from the reward's perspective.

The scale of the problem: In the bit-flipping environment from the paper with n = 40 bits, there are 2⁴⁰ ≈ 1.1 trillion possible states. A random agent will essentially never stumble onto the goal state. Standard DQN fails completely for n > 13. Yet the problem is trivially easy for a human: just flip the bits that differ.
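To make the bit-flipping setup concrete, here is a minimal sketch of such an environment. The class name `BitFlipEnv`, its methods, and the `random_rollout` helper are illustrative assumptions, not code from the paper; the only behavior taken from the text is that each action flips one bit and the reward is 0 on an exact match with the goal and −1 otherwise.

```python
import random


class BitFlipEnv:
    """Toy bit-flipping environment (hypothetical): state and goal are n-bit tuples."""

    def __init__(self, n, seed=0):
        self.n = n
        self.rng = random.Random(seed)

    def reset(self):
        # Sample a random start state and a random goal.
        self.state = tuple(self.rng.randint(0, 1) for _ in range(self.n))
        self.goal = tuple(self.rng.randint(0, 1) for _ in range(self.n))
        return self.state, self.goal

    def step(self, action):
        # Action i flips bit i; the reward is 0 only on an exact match.
        bits = list(self.state)
        bits[action] ^= 1
        self.state = tuple(bits)
        reward = 0 if self.state == self.goal else -1
        return self.state, reward


def random_rollout(env, horizon):
    """Run one episode with uniformly random actions; return the per-step rewards."""
    env.reset()
    rewards = []
    for _ in range(horizon):
        _, r = env.step(env.rng.randrange(env.n))
        rewards.append(r)
    return rewards


env = BitFlipEnv(n=40)
rewards = random_rollout(env, horizon=40)
print(2 ** 40)        # number of distinct 40-bit states
print(sum(rewards))   # total reward of one random episode
```

With n = 40, a random episode will in practice return −1 at every step, which is exactly the "no signal" situation described above.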

The traditional fix is reward shaping — hand-crafting a dense reward function that provides gradient signal along the way (e.g., negative distance to the goal). But this has problems:

It requires domain expertise to design
The shaped reward may not align with the true objective (optimizing distance ≠ completing the task)
Surprisingly, the paper shows that shaped rewards often perform worse than sparse rewards + HER

Why do sparse binary rewards make RL nearly impossible for manipulation tasks?

A random policy never accidentally achieves the goal, so the agent receives −1 on every timestep and has no gradient signal to learn from
The reward is too large and causes gradient explosions
Binary rewards require too much memory to store

Chapter 1: The Key Insight

Imagine you are learning to shoot a hockey puck into a net. You aim, you shoot — the puck sails wide, missing the net to the right. A standard RL algorithm concludes: "that sequence of actions did not lead to a goal." Reward = −1. Nothing learned.

But a human draws a different conclusion: "If the net had been placed further to the right, that would have been a perfect shot." The actions were fine — they just led to a different goal than intended.

The HER (Hindsight Experience Replay) insight: After a failed episode, pretend the state you actually reached was the goal all along. Now you have a "successful" episode to learn from. The actions did not change. The environment dynamics did not change. Only the goal label on the replay buffer entry changed, along with the reward recomputed against it, and suddenly you have reward signal.

This is possible because of a key property: the goal does not affect the environment dynamics. Whether you were aiming for position A or position B, the physics of pushing a block is the same. The goal only determines the reward. So replaying a trajectory with a different goal produces valid training data for an off-policy algorithm.

Think of it like grading a student's exam against a different answer key. The student's work did not change — but against the new key, they got a perfect score. And we can learn from both the original grade and the re-graded version.

The same trajectory can be viewed two ways. Against the original goal, the agent failed. With the achieved state substituted as the goal, the episode is a success. Same actions, same physics, different label.

What does HER change when replaying a failed episode?

It substitutes the goal with a state that was actually achieved during the episode, turning a failure into a success for learning
It replays the episode with different actions
It changes the environment dynamics to make the episode succeed

Chapter 2: Goal-Conditioned RL

HER requires a specific RL setup: goal-conditioned policies. Instead of a policy that just maps states to actions, π(a|s), we have a policy that also takes a goal as input:

π(a | s, g)

"Given state s and goal g, what action should I take?" This is a Universal Value Function Approximator (UVFA) setup from Schaul et al. (2015).

The goal space

For the robotic manipulation tasks in the paper, goals are 3D positions: g ∈ R³. The goal specifies where the object should end up. A mapping function m(s) extracts the "achieved goal" from any state — typically just the current position of the object being manipulated.

The sparse reward

The reward function is brutally simple:

r(s, a, g) = −[f_g(s') = 0] = −[||s'_obj − g|| > ε]

In words: you get 0 if the object is within ε of the goal after executing action a, and −1 otherwise. No shaping, no partial credit.
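A minimal sketch of this reward in Python follows. The dictionary layout of the state, the key `"object_pos"`, and the tolerance `eps=0.05` are assumptions chosen for illustration; the paper's actual state representation and threshold may differ.

```python
import math


def achieved_goal(state):
    """Mapping m(s): here the state is a dict and the goal is the object position."""
    return state["object_pos"]


def sparse_reward(next_state, goal, eps=0.05):
    """Return 0 if the object is within eps of the goal after the action, else -1."""
    dist = math.dist(achieved_goal(next_state), goal)
    return 0.0 if dist <= eps else -1.0


goal = (0.5, 0.2, 0.0)
near = {"object_pos": (0.52, 0.2, 0.0)}   # 0.02 away from the goal
far = {"object_pos": (0.9, 0.2, 0.0)}     # 0.4 away from the goal
print(sparse_reward(near, goal))  # 0.0
print(sparse_reward(far, goal))   # -1.0
```

Note that the reward depends only on the resulting state and the goal, which is what later makes it cheap to recompute for a different goal.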

The Q-function

The Q-function is also goal-conditioned:

Q^π(s, a, g) = E[R_t | s_t = s, a_t = a, goal = g]

Both the policy and the Q-function take (state, goal) as input. In practice, state and goal are concatenated: the network receives [s || g] as a single input vector.
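The concatenation can be sketched as follows. `LinearQ` is a hypothetical stand-in for a neural network critic (a single weight vector instead of an MLP), used only to show that the same parameters can score one (state, action) pair against different goals.

```python
import random


def concat(state, goal):
    """Build the network input [s || g] as one flat list."""
    return list(state) + list(goal)


class LinearQ:
    """Stand-in for a goal-conditioned critic: Q(s, a, g) = w . [s || g || a]."""

    def __init__(self, state_dim, goal_dim, action_dim, seed=0):
        rng = random.Random(seed)
        size = state_dim + goal_dim + action_dim
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(size)]

    def __call__(self, state, action, goal):
        x = concat(state, goal) + list(action)
        return sum(wi * xi for wi, xi in zip(self.w, x))


q = LinearQ(state_dim=3, goal_dim=3, action_dim=2)
s, a = (0.1, 0.2, 0.0), (1.0, 0.0)
# Same state and action, evaluated against two different goals.
print(q(s, a, (0.5, 0.5, 0.0)))
print(q(s, a, (0.0, 0.0, 0.0)))
```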

Why goal-conditioning is essential for HER: If the policy were not conditioned on the goal, we could not replay trajectories with different goals. The whole trick is that the same network can evaluate any (state, action, goal) triple — so we can re-label the goal after the fact and get valid Q-value targets.

Why must the policy be goal-conditioned for HER to work?

Because HER relabels goals in the replay buffer, the same network must be able to evaluate (state, action) pairs against any goal
Because goal-conditioning makes the policy network smaller
Because only goal-conditioned policies can use experience replay

Chapter 3: The HER Algorithm

The algorithm is elegantly simple. After collecting an episode, you store each transition twice (or more) in the replay buffer:

Original goal: Store (s_t || g, a_t, r_t, s_{t+1} || g) with the original reward r_t, which is −1 at every step of a failed episode
Hindsight goal: Store (s_t || g', a_t, r'_t, s_{t+1} || g') where g' is a goal actually achieved during the episode (the mapping m applied to a visited state), and r'_t is recomputed against this new goal

The key realization: The transition (s_t, a_t, s_{t+1}) is a fact about the environment. It happened regardless of which goal we were pursuing. The goal only determines the reward label. So we can attach any goal and recompute the reward, producing valid off-policy training data.

Pseudocode

// HER algorithm (simplified)
for episode = 1 to M:
    sample goal g, initial state s_0
    for t = 0 to T-1:
        a_t = π(s_t || g) + noise          // behavioral policy
        s_{t+1} = env.step(a_t)

    for t = 0 to T-1:
        // Standard replay: original goal
        r_t = reward(s_t, a_t, g)          // evaluated on the resulting state s_{t+1}
        store (s_t || g, a_t, r_t, s_{t+1} || g) in buffer

        // Hindsight replay: goals achieved in this episode
        G' = sample_goals(episode, t)      // e.g. k states reached after step t
        for g' in G':
            r' = reward(s_t, a_t, g')
            store (s_t || g', a_t, r', s_{t+1} || g') in buffer

    // Train with standard off-policy RL (e.g. DDPG)
    for i = 1 to N:
        minibatch = sample(buffer)
        update networks with minibatch

The beauty is that HER is algorithm-agnostic. It wraps around any off-policy algorithm (DQN, DDPG, NAF, SAC). The underlying RL algorithm does not even know that goal relabeling happened — it just sees a replay buffer with valid transitions.
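The storage step can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: the tuple layout `(s, goal, a, reward, s_next)`, the function names, and the exact-match reward (borrowed from the bit-flipping setting) are all assumptions.

```python
import random


def reward_fn(achieved, goal):
    """Sparse reward: 0 on an exact match, -1 otherwise (bit-flipping style)."""
    return 0 if achieved == goal else -1


def store_episode_with_her(buffer, episode, goal, k=4, rng=random):
    """Store each transition once with the original goal and k times with
    goals sampled from states achieved later in the same episode."""
    T = len(episode)
    for t, (s, a, s_next) in enumerate(episode):
        # Standard replay: original goal.
        buffer.append((s, goal, a, reward_fn(s_next, goal), s_next))
        # Hindsight replay with goals taken from this step onward.
        for _ in range(k):
            future_t = rng.randint(t, T - 1)
            g_new = episode[future_t][2]   # state reached by step future_t
            buffer.append((s, g_new, a, reward_fn(s_next, g_new), s_next))


# A 3-step episode on 3 bits that never reaches the goal (1, 1, 1).
episode = [
    ((0, 0, 0), 0, (1, 0, 0)),
    ((1, 0, 0), 0, (0, 0, 0)),
    ((0, 0, 0), 1, (0, 1, 0)),
]
buffer = []
store_episode_with_her(buffer, episode, goal=(1, 1, 1), k=4, rng=random.Random(0))
print(len(buffer))                         # 3 * (1 + 4) = 15
print(any(tr[3] == 0 for tr in buffer))    # True: some relabeled rewards are 0
```

Every entry stored with the original goal has reward −1, yet the buffer now contains transitions with reward 0, because the last transition always reaches its own relabeled goal.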

When HER stores a transition with a hindsight goal g', what changes and what stays the same?

The state, action, and next state stay the same; only the goal label and the reward (recomputed against the new goal) change
The actions are replayed with different noise
The entire trajectory is re-simulated in the environment

Chapter 4: Why It Works

HER works because it creates an implicit curriculum. Early in training, the policy is nearly random. It pushes the block to random places. HER takes those random achieved states and uses them as goals — so the agent first learns to reach easy, nearby positions.

As the policy improves and can reach more states reliably, the achieved states in the replay buffer become more diverse. The agent gradually learns to reach harder, more distant positions. The curriculum emerges automatically from the agent's own experience — no manual design needed.

Implicit curriculum: Random policy → learn to reach random nearby states → policy improves → achieved states get more diverse → learn to reach harder goals → eventually learn to reach any goal. The difficulty ramps up naturally because the "goals" in the replay buffer are states the agent actually visited.

Why off-policy is essential

HER requires an off-policy algorithm. Why? Because we are learning from transitions that were generated under a different goal than the one we are now evaluating. The agent took action a_t while pursuing goal g, but we are asking: "what would the Q-value be if the goal had been g'?" Only off-policy methods (which can learn from data generated by any policy) can handle this.

On-policy methods like REINFORCE or PPO cannot use HER directly, because their gradient estimates assume the training data was generated by the current policy pursuing the goal it was actually given.

What creates HER's implicit curriculum?

The hindsight goals are states the agent actually achieved: initially easy/nearby, becoming harder/more diverse as the policy improves
A hand-designed schedule that increases goal difficulty over time
The learning rate is gradually decreased

Chapter 5: Goal Substitution Strategies

When replaying with hindsight goals, we need to decide which achieved states to use as substitute goals. The paper explores four strategies:

The four strategies

future (k=4)

Sample k states from the remainder of the episode after the current transition. "Where did I end up after this point?"

final

Use the last state of the episode as the substitute goal. "Where did I end up at the end?"

episode

Sample k states uniformly from the entire episode. "Where was I at any point?"

random

Sample k states from the entire replay buffer. "Where has the agent ever been?"

The winner: future with k=4. The "future" strategy works best because it creates the most informative training signal. If the agent is at state s_t and we use a future state s_{t+j} as the goal, then the transitions from s_t to s_{t+j} form a successful trajectory for reaching that goal. Each transition yields k=4 additional training examples, which sharply increases the density of successful experience in the buffer.

Why does "future" beat "final"? Because "final" provides only one substitute goal per transition. With k=4 future goals, each transition generates 4 additional replay entries, four times as many relabeled entries as "final" produces. More importantly, "future" goals have a natural distance gradient: a goal sampled from step t+1 is close (easy), while a goal sampled from the end of the episode is far (hard).

Why does "random" perform poorly? Because random goals from the buffer have no relationship to the current trajectory. The agent may never come close to those goals during this episode, so the relabeled reward usually stays at −1 and little useful signal is gained.
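The four strategies differ only in the pool that substitute goals are drawn from, which a short sketch makes explicit. The function `sample_goals` and its arguments are hypothetical names; `achieved[i]` is assumed to hold the achieved goal after step i of the current episode, and `all_achieved` a pool of achieved goals from the whole replay buffer.

```python
import random


def sample_goals(strategy, achieved, t, k=4, all_achieved=None, rng=random):
    """Return substitute goals for the transition at step t."""
    if strategy == "final":
        return [achieved[-1]]                                  # last state only
    if strategy == "future":
        return [rng.choice(achieved[t:]) for _ in range(k)]    # from step t onward
    if strategy == "episode":
        return [rng.choice(achieved) for _ in range(k)]        # anywhere in episode
    if strategy == "random":
        return [rng.choice(all_achieved) for _ in range(k)]    # anywhere in buffer
    raise ValueError(f"unknown strategy: {strategy}")


achieved = ["p1", "p2", "p3", "p4", "p5"]
rng = random.Random(0)
print(sample_goals("final", achieved, t=1))             # ['p5']
print(sample_goals("future", achieved, t=3, rng=rng))   # contains only 'p4' or 'p5'
```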

Why does the "future" strategy outperform the others?

Future states are guaranteed to be achievable from the current state (the agent already reached them), providing the densest positive reward signal with a natural difficulty gradient
Future states have higher reward
Future states require less memory

Chapter 6: Multi-Goal Replay

In the paper, HER is combined with DDPG (Deep Deterministic Policy Gradients) for continuous control. Let's walk through how the two pieces fit together.

DDPG recap

DDPG maintains two networks:

Actor π(s || g) → a: deterministic policy, outputs continuous actions
Critic Q(s || g, a) → R: estimates Q-value for state-goal-action triples

Both networks are goal-conditioned: they receive [state || goal] as input. The critic is trained to minimize the Bellman error:

L = E[(Q(s||g, a) − y)²] where y = r + γ Q'(s'||g, π'(s'||g)), and Q', π' are slowly updated target copies of the critic and actor

The actor is trained to maximize the critic's estimate:

L_actor = −E[Q(s||g, π(s||g))]
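As a worked example of these two formulas, the sketch below computes the critic target, the critic loss, and the actor loss for a one-transition minibatch. It only evaluates the losses and performs no gradient step. The 1-D `actor` and `critic` lambdas are hypothetical stand-ins for the networks, and `gamma=0.98` is an assumed discount factor.

```python
def critic_target(r, s_next, g, actor_target, critic_target_net, gamma=0.98):
    """y = r + gamma * Q'(s' || g, pi'(s' || g))."""
    a_next = actor_target(s_next, g)
    return r + gamma * critic_target_net(s_next, g, a_next)


def critic_loss(batch, critic, actor_target, critic_target_net, gamma=0.98):
    """Mean squared Bellman error over a minibatch of (s, g, a, r, s_next)."""
    errors = []
    for s, g, a, r, s_next in batch:
        y = critic_target(r, s_next, g, actor_target, critic_target_net, gamma)
        errors.append((critic(s, g, a) - y) ** 2)
    return sum(errors) / len(errors)


def actor_loss(batch, critic, actor):
    """Negative mean Q-value of the actor's own actions."""
    total = sum(critic(s, g, actor(s, g)) for s, g, a, r, s_next in batch)
    return -total / len(batch)


# Hypothetical 1-D stand-ins for the networks.
actor = lambda s, g: g - s                    # step straight to the goal
critic = lambda s, g, a: -abs(g - (s + a))    # higher when s + a lands near g

# One transition: s=0, g=1, a=0.5, r=-1, s'=0.5.
batch = [(0.0, 1.0, 0.5, -1.0, 0.5)]
print(critic_loss(batch, critic, actor, critic))   # 0.25
print(actor_loss(batch, critic, actor))
```

Here the target is y = −1 + 0.98 · 0 = −1 and the critic predicts −0.5, giving a squared error of 0.25. For brevity the same functions serve as both online and target networks.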

The combined pipeline

// DDPG + HER training loop
for epoch = 1 to 200:
    for cycle = 1 to 50:
        // Collect experience
        for episode = 1 to 16:
            rollout with π(s||g) + N(0,0.2)
            store transitions + HER relabeled transitions

        // Train
        for step = 1 to 40:
            minibatch = sample(buffer, 256)
            update critic (Bellman loss)
            update actor (policy gradient through critic)
            soft-update target networks (θ' ← 0.95·θ' + 0.05·θ)
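The soft update in the last line of the loop can be sketched as below. Here the 0.95 coefficient is read as the weight kept on the old target parameters (Polyak averaging); representing parameters as flat lists and the name `soft_update` are assumptions for illustration.

```python
def soft_update(target_params, online_params, decay=0.95):
    """Polyak averaging: target <- decay * target + (1 - decay) * online."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online)
print(target)   # first entry moves 5% of the way toward the online value
```

A decay close to 1 makes the target networks change slowly, which keeps the Bellman targets stable between updates.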

Practical details

Observation normalization: Running mean/std normalization of inputs — crucial for stable training
Action space: 4D continuous — 3D gripper position delta + 1D finger distance
Parallel workers: 8 workers with averaged parameters
MLP architecture: 3 hidden layers of 256 units with ReLU
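Running mean/std normalization from the list above can be sketched with Welford's online algorithm. The class name, the `eps` floor on the standard deviation, and the clipping range are illustrative assumptions rather than values taken from the paper.

```python
import math


class RunningNormalizer:
    """Tracks a running mean and variance per dimension (Welford's method)."""

    def __init__(self, dim, eps=1e-2, clip=5.0):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim   # running sum of squared deviations
        self.eps = eps          # floor on the std to avoid dividing by ~0
        self.clip = clip        # normalized values are clipped to [-clip, clip]

    def update(self, x):
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def normalize(self, x):
        out = []
        for i, xi in enumerate(x):
            std = max(math.sqrt(self.m2[i] / max(self.n, 1)), self.eps)
            z = (xi - self.mean[i]) / std
            out.append(max(-self.clip, min(self.clip, z)))
        return out


norm = RunningNormalizer(dim=2)
for obs in [(1.0, 10.0), (3.0, 10.0)]:
    norm.update(obs)
print(norm.normalize((3.0, 10.0)))   # [1.0, 0.0]
```

In a goal-conditioned setup the same normalizer would be applied to the concatenated [state || goal] input before it reaches the actor and critic.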

HER is drop-in: From DDPG's perspective, nothing changed. It just sees a replay buffer with transitions. Some of those transitions happen to have relabeled goals — but DDPG does not know or care. This is why HER can be combined with any off-policy algorithm: DQN, DDPG, SAC, TD3, etc.

Why can HER be combined with any off-policy RL algorithm?

HER only modifies the replay buffer contents; the RL algorithm just sees standard (state, action, reward, next_state) transitions and is unaware of the relabeling
HER modifies the loss function of the RL algorithm
HER is only compatible with actor-critic methods

Chapter 7: Results

The results are striking. Three robotic manipulation tasks with sparse binary rewards:

Pushing

Move a block to a target position on the table. Fingers locked (no grasping). DDPG alone: 0% success. DDPG+HER: ~100% success.

Sliding

Hit a puck so it slides to a target outside the arm's reach. Must control force and direction precisely. DDPG alone: 0%. DDPG+HER: ~100%.

Pick-and-place

Grasp a block and place it at a target position in the air. Requires grasping + lifting + placing. DDPG alone: 0%. DDPG+HER: ~100%.

Shaped rewards do not help

Surprisingly, the paper found that shaped rewards (e.g., r = −||s_obj − g||²) actually performed worse than sparse rewards + HER. Neither DDPG nor DDPG+HER could solve the tasks with shaped rewards. Why?

Shaped rewards create a gap between what you optimize (distance) and what you care about (binary success)
Shaped rewards can penalize exploration — the agent learns to not touch the block rather than risk moving it in the wrong direction
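A small worked example illustrates the second point. Using the squared-distance reward quoted above on a hypothetical 2-D table, a clumsy push in the wrong direction scores worse than leaving the block untouched, so an early, inaccurate policy is rewarded for not interacting with the block at all. The positions are made-up numbers.

```python
import math


def shaped_reward(obj_pos, goal):
    """Distance-based shaped reward: r = -||obj - goal||^2."""
    return -(math.dist(obj_pos, goal) ** 2)


goal = (1.0, 0.0)
start = (0.0, 0.0)

r_untouched = shaped_reward(start, goal)         # block left where it was
r_bad_push = shaped_reward((-0.5, 0.0), goal)    # block pushed the wrong way
r_good_push = shaped_reward((0.5, 0.0), goal)    # block pushed toward the goal

print(r_untouched, r_bad_push, r_good_push)      # -1.0 -2.25 -0.25
```

Under the sparse reward, all three outcomes would receive the same −1, so there is no incentive to avoid touching the block.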

A counterintuitive finding: With HER, the simple distance-based reward shaping the paper tried is not just unnecessary; it is actively harmful. The simplest possible reward (binary success/failure) combined with HER outperforms those shaped rewards. This is a major practical advantage: no reward engineering needed.

The bit-flipping benchmark

In the bit-flipping environment (n bits, goal = target bit string), standard DQN fails for n > 13. DQN + HER easily solves the task for n up to 50. The gap is this large because HER supplies reward signal that would otherwise be completely absent.

Why did shaped rewards perform worse than sparse rewards + HER?

Shaped rewards create a mismatch between what is optimized (distance) and the actual objective (binary success), and can penalize exploration
Shaped rewards require more compute
Shaped rewards make the replay buffer larger

Chapter 8: Sim-to-Real Transfer

The paper demonstrates that policies trained entirely in MuJoCo simulation can be deployed on a physical Fetch robot arm — and they work, with no fine-tuning on the real robot.

Why sim-to-real transfer works here

Simple observations: The policy only receives gripper position, relative object position, relative target position, and finger distance — all easily measured on a real robot
Position control: Actions specify desired gripper position deltas, not raw torques. The low-level controller handles the physics gap
Robustness from diversity: Training on multiple randomized goals makes the policy robust to perturbations

Sparse rewards help sim-to-real: Because the policy was trained with binary rewards (succeed/fail), it learned the actual task objective rather than a shaped proxy. A policy trained to minimize distance might exploit simulation artifacts. A policy trained on binary success must actually complete the task, making it more likely to transfer to reality.

Limitations observed

The deployment in the paper uses the pick-and-place policy, with the box position estimated from camera images. The first attempt succeeded in only 2 of 5 trials: the policy had been trained on perfect simulator state and was not robust to small errors in the estimated box position. After retraining in simulation with Gaussian noise added to the observations, it succeeded in 5 of 5 trials. More complex tasks would likely benefit from domain randomization or additional techniques.

What property of HER-trained policies aids sim-to-real transfer?

Training on binary success (the true task objective) rather than shaped rewards makes policies robust to simulation-reality gaps
HER policies are smaller in parameter count
HER policies do not use neural networks

Chapter 9: Connections

What HER built on

Universal Value Function Approximators (Schaul et al., 2015): Introduced goal-conditioned Q-functions Q(s, a, g). HER makes UVFA practical by solving the sparse reward problem that makes training goal-conditioned policies difficult.

Experience Replay (Lin, 1992): The idea of storing and reusing past transitions. HER extends this by also relabeling the stored transitions with different goals.

DDPG (Lillicrap et al., 2015): The off-policy continuous control algorithm used in the paper. HER wraps around it but could use any off-policy algorithm.

What HER enabled

Goal-Conditioned RL (GCRL): HER established goal-conditioned RL as a practical paradigm. Nearly all subsequent work on multi-goal RL builds on HER's relabeling idea.

Robotic Manipulation RL: Before HER, RL for manipulation required extensive reward engineering. After HER, sparse binary rewards became viable, making RL much more accessible for robotics researchers.

RIG (Nair et al., 2018): Combines HER with learned goal representations from images, enabling visual goal-conditioned RL.

Goal-Conditioned Behavior Cloning: The relabeling idea extends beyond RL. In imitation learning, you can relabel demonstrations with different goals — the same principle of reinterpreting what was "intended."

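Foundation for OpenAI Robotics: HER fed directly into OpenAI's subsequent robotics research, including the Gym Fetch and Shadow Hand environments, which were released with HER baselines. The later Rubik's cube solving hand took a different route, training with PPO and automatic domain randomization rather than HER.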

HER's lasting insight: Failure is just success at a different task. This reframing principle, that any trajectory is a demonstration of reaching something, has become one of the most influential ideas in modern RL. It suggests that a major bottleneck in sparse-reward RL was not only exploration or optimization but also our failure to extract all available information from the agent's experience.

Cheat sheet

Core idea

Replay failed episodes with achieved states as substitute goals → transform failures into successes

Requirements

Off-policy RL algorithm + goal-conditioned policy π(a|s,g) + mapping m(s) → achieved goal

Best strategy

"future" with k=4: sample 4 future achieved states per transition as substitute goals

Key result

Tasks that DDPG alone cannot solve with sparse rewards (0% success) become solvable (~100% success) with HER

Legacy

Foundation of goal-conditioned RL, robotic manipulation RL, and the principle that failure = success at a different task

What is HER's lasting conceptual contribution to RL?

Any trajectory is a demonstration of reaching some state: failure is just success at a different task, and we can extract learning signal by relabeling the goal
Off-policy algorithms are better than on-policy algorithms
Robotic manipulation requires dense rewards

Hindsight Experience Replay