Andrychowicz, Wolski, Ray, Schneider, Fong, Welinder, McGrew, Tobin, Abbeel, Zaremba — OpenAI, 2017

Hindsight Experience Replay

After every failure, pretend the place you ended up was where you meant to go. Sparse binary rewards go from impossible to solved.

Prerequisites: Experience replay + Goal-conditioned RL
10
Chapters
5+
Simulations

Chapter 0: The Problem

You are a robotic arm. Your task: push a block to a red target position on a table. You get a reward of 0 when the block reaches the target and −1 at every other timestep. That is it. No partial credit for getting close. No shaping. Just: did you succeed, yes or no?

This is called a sparse binary reward. And it makes reinforcement learning nearly impossible.

Why? Because a random policy will never accidentally push the block to exactly the right spot. The agent receives −1 on every timestep of every episode. It has no gradient signal. No indication that one trajectory was better than another. Every failure looks identical from the reward's perspective.

The scale of the problem: In the bit-flipping environment from the paper (n=40 binary bits), there are 240 ≈ 1 trillion possible states. A random agent will never stumble onto the goal state. Standard DQN fails completely for n > 13. Yet the problem is trivially easy for a human: just flip the bits that differ.

The traditional fix is reward shaping — hand-crafting a dense reward function that provides gradient signal along the way (e.g., negative distance to the goal). But this has problems:

Press "Run Episode" below to watch a random agent try to push a block to a target. Notice: it never gets any useful reward signal.

Why do sparse binary rewards make RL nearly impossible for manipulation tasks?

Chapter 1: The Key Insight

Imagine you are learning to shoot a hockey puck into a net. You aim, you shoot — the puck sails wide, missing the net to the right. A standard RL algorithm concludes: "that sequence of actions did not lead to a goal." Reward = −1. Nothing learned.

But a human draws a different conclusion: "If the net had been placed further to the right, that would have been a perfect shot." The actions were fine — they just led to a different goal than intended.

The HER insight: After a failed episode, pretend the state you actually reached was the goal all along. Now you have a "successful" episode to learn from. The actions did not change. The environment dynamics did not change. Only the goal label on the replay buffer entry changed — and suddenly you have reward signal.

This is possible because of a key property: the goal does not affect the environment dynamics. Whether you were aiming for position A or position B, the physics of pushing a block is the same. The goal only determines the reward. So replaying a trajectory with a different goal produces valid training data for an off-policy algorithm.

Think of it like grading a student's exam against a different answer key. The student's work did not change — but against the new key, they got a perfect score. And we can learn from both the original grade and the re-graded version.

The visualization above shows the same trajectory viewed two ways. On the left, the original goal — the agent failed. On the right, the achieved state is substituted as the goal — now it is a success. Same actions, same physics, different label.

What does HER change when replaying a failed episode?

Chapter 2: Goal-Conditioned RL

HER requires a specific RL setup: goal-conditioned policies. Instead of a policy that just maps states to actions, π(a|s), we have a policy that also takes a goal as input:

π(a | s, g)

"Given state s and goal g, what action should I take?" This is a Universal Value Function Approximator (UVFA) setup from Schaul et al. (2015).

The goal space

For the robotic manipulation tasks in the paper, goals are 3D positions: g ∈ R3. The goal specifies where the object should end up. A mapping function m(s) extracts the "achieved goal" from any state — typically just the current position of the object being manipulated.

The sparse reward

The reward function is brutally simple:

r(s, a, g) = −[fg(s') = 0] = −[||s'obj − g|| > ε]

In words: you get 0 if the object is within ε of the goal after executing action a, and −1 otherwise. No shaping, no partial credit.

The Q-function

The Q-function is also goal-conditioned:

Qπ(s, a, g) = E[Rt | st = s, at = a, goal = g]

Both the policy and the Q-function take (state, goal) as input. In practice, state and goal are concatenated: the network receives [s || g] as a single input vector.

Why goal-conditioning is essential for HER: If the policy were not conditioned on the goal, we could not replay trajectories with different goals. The whole trick is that the same network can evaluate any (state, action, goal) triple — so we can re-label the goal after the fact and get valid Q-value targets.
Why must the policy be goal-conditioned for HER to work?

Chapter 3: The HER Algorithm

The algorithm is elegantly simple. After collecting an episode, you store each transition twice (or more) in the replay buffer:

  1. Original goal: Store (st || g, at, rt, st+1 || g) with the original reward rt = −1 (because we failed)
  2. Hindsight goal: Store (st || g', at, r't, st+1 || g') where g' is a state actually achieved during the episode, and r't is recomputed against this new goal
The key realization: The transition (st, at, st+1) is a fact about the environment — it happened regardless of which goal we were pursuing. The goal only determines the reward label. So we can attach any goal and recompute the reward, producing valid off-policy training data.

Pseudocode

// HER algorithm (simplified)
for episode = 1 to M:
    sample goal g, initial state s0
    for t = 0 to T-1:
        at = π(st || g) + noise      // behavioral policy
        st+1 = env.step(at)

    for t = 0 to T-1:
        // Standard replay
        rt = reward(st, at, g)
        store (st||g, at, rt, st+1||g) in buffer

        // Hindsight replay
        G' = sample_goals(episode)   // e.g. future states
        for g' in G':
            r' = reward(st, at, g')
            store (st||g', at, r', st+1||g') in buffer

    // Train with standard off-policy RL (e.g. DDPG)
    for t = 1 to N:
        minibatch = sample(buffer)
        update networks with minibatch

The beauty is that HER is algorithm-agnostic. It wraps around any off-policy algorithm (DQN, DDPG, NAF, SAC). The underlying RL algorithm does not even know that goal relabeling happened — it just sees a replay buffer with valid transitions.

When HER stores a transition with a hindsight goal g', what changes and what stays the same?

Chapter 4: Why It Works

HER works because it creates an implicit curriculum. Early in training, the policy is nearly random. It pushes the block to random places. HER takes those random achieved states and uses them as goals — so the agent first learns to reach easy, nearby positions.

As the policy improves and can reach more states reliably, the achieved states in the replay buffer become more diverse. The agent gradually learns to reach harder, more distant positions. The curriculum emerges automatically from the agent's own experience — no manual design needed.

Implicit curriculum: Random policy → learn to reach random nearby states → policy improves → achieved states get more diverse → learn to reach harder goals → eventually learn to reach any goal. The difficulty ramps up naturally because the "goals" in the replay buffer are states the agent actually visited.

Why off-policy is essential

HER requires an off-policy algorithm. Why? Because we are learning from transitions that were generated under a different goal than the one we are now evaluating. The agent took action at while pursuing goal g, but we are asking: "what would the Q-value be if the goal had been g'?" Only off-policy methods (which can learn from data generated by any policy) can handle this.

On-policy methods like REINFORCE or PPO cannot use HER because they require that training data was generated by the current policy under the current objective.

Interactive: the implicit curriculum

Click "Step Curriculum" repeatedly to see how HER's implicit curriculum progresses. Early on, goals are easy (close to start). As the agent improves, goals become harder (further away).

Stage 1/5: random policy
What creates HER's implicit curriculum?

Chapter 5: Goal Substitution Strategies

When replaying with hindsight goals, we need to decide which achieved states to use as substitute goals. The paper explores four strategies:

The four strategies

future (k=4)
Sample k states from the remainder of the episode after the current transition. "Where did I end up after this point?"
final
Use the last state of the episode as the substitute goal. "Where did I end up at the end?"
episode
Sample k states uniformly from the entire episode. "Where was I at any point?"
random
Sample k states from the entire replay buffer. "Where has the agent ever been?"
The winner: future with k=4. The "future" strategy works best because it creates the most informative training signal. If the agent is at state st and we use a future state st+j as the goal, then the transitions st...st+j become a successful trajectory for reaching that goal. For each transition, we get k=4 additional training examples. This dramatically increases the density of successful experiences in the buffer.

Why does "future" beat "final"? Because "final" only provides one substitute goal per transition. With k=4 future goals, each transition generates 4 additional replay entries — quadrupling the amount of successful experience. More importantly, "future" goals have a natural distance gradient: goals sampled from t+1 are close (easy), goals from t+T are far (hard).

Why does "random" perform poorly? Because random goals from the buffer have no relationship to the current trajectory. The agent may never be close to those goals during this episode, so the relabeled reward is still −1 — no useful signal gained.

Why does the "future" strategy outperform the others?

Chapter 6: Multi-Goal Replay

In the paper, HER is combined with DDPG (Deep Deterministic Policy Gradients) for continuous control. Let's walk through how the two pieces fit together.

DDPG recap

DDPG maintains two networks:

Both networks are goal-conditioned: they receive [state || goal] as input. The critic is trained to minimize the Bellman error:

L = E[(Q(s||g, a) − y)2]   where   y = r + γ Q'(s'||g, π'(s'||g))

The actor is trained to maximize the critic's estimate:

Lactor = −E[Q(s||g, π(s||g))]

The combined pipeline

// DDPG + HER training loop
for epoch = 1 to 200:
    for cycle = 1 to 50:
        // Collect experience
        for episode = 1 to 16:
            rollout with π(s||g) + N(0,0.2)
            store transitions + HER relabeled transitions

        // Train
        for step = 1 to 40:
            minibatch = sample(buffer, 256)
            update critic (Bellman loss)
            update actor (policy gradient through critic)
            soft-update target networks (τ=0.95)

Practical details

HER is drop-in: From DDPG's perspective, nothing changed. It just sees a replay buffer with transitions. Some of those transitions happen to have relabeled goals — but DDPG does not know or care. This is why HER can be combined with any off-policy algorithm: DQN, DDPG, SAC, TD3, etc.
Why can HER be combined with any off-policy RL algorithm?

Chapter 7: Results

The results are striking. Three robotic manipulation tasks with sparse binary rewards:

Pushing
Move a block to a target position on the table. Fingers locked (no grasping). DDPG alone: 0% success. DDPG+HER: ~100% success.
Sliding
Hit a puck so it slides to a target outside the arm's reach. Must control force and direction precisely. DDPG alone: 0%. DDPG+HER: ~100%.
Pick-and-place
Grasp a block and place it at a target position in the air. Requires grasping + lifting + placing. DDPG alone: 0%. DDPG+HER: ~100%.

Shaped rewards do not help

Surprisingly, the paper found that shaped rewards (e.g., r = −||sobj − g||2) actually performed worse than sparse rewards + HER. Neither DDPG nor DDPG+HER could solve the tasks with shaped rewards. Why?

A counterintuitive finding: Domain-agnostic reward shaping is not just unnecessary with HER — it is actively harmful. The simplest possible reward (binary success/failure) combined with HER outperforms carefully designed shaped rewards. This is a major practical advantage: no reward engineering needed.

The bit-flipping benchmark

In the bit-flipping environment (n bits, goal = target bit string), standard DQN fails for n > 13. DQN + HER solves n = 50 easily. The improvement factor is enormous because HER provides reward signal that would otherwise be completely absent.

Why did shaped rewards perform worse than sparse rewards + HER?

Chapter 8: Sim-to-Real Transfer

The paper demonstrates that policies trained entirely in MuJoCo simulation can be deployed on a physical Fetch robot arm — and they work, with no fine-tuning on the real robot.

Why sim-to-real transfer works here

Sparse rewards help sim-to-real: Because the policy was trained with binary rewards (succeed/fail), it learned the actual task objective rather than a shaped proxy. A policy trained to minimize distance might exploit simulation artifacts. A policy trained on binary success must actually complete the task, making it more likely to transfer to reality.

Limitations observed

The sim-to-real transfer is demonstrated on pushing tasks. Pick-and-place transfer is harder because grasping is more sensitive to physics modeling errors (friction, contact dynamics). The paper shows successful pushing transfer but notes that more complex tasks would benefit from domain randomization or additional techniques.

What property of HER-trained policies aids sim-to-real transfer?

Chapter 9: Connections

What HER built on

Universal Value Function Approximators (Schaul et al., 2015): Introduced goal-conditioned Q-functions Q(s, a, g). HER makes UVFA practical by solving the sparse reward problem that makes training goal-conditioned policies difficult.

Experience Replay (Lin, 1992): The idea of storing and reusing past transitions. HER extends this by also relabeling the stored transitions with different goals.

DDPG (Lillicrap et al., 2015): The off-policy continuous control algorithm used in the paper. HER wraps around it but could use any off-policy algorithm.

What HER enabled

Goal-Conditioned RL (GCRL): HER established goal-conditioned RL as a practical paradigm. Nearly all subsequent work on multi-goal RL builds on HER's relabeling idea.

Robotic Manipulation RL: Before HER, RL for manipulation required extensive reward engineering. After HER, sparse binary rewards became viable, making RL much more accessible for robotics researchers.

RIG (Nair et al., 2018): Combines HER with learned goal representations from images, enabling visual goal-conditioned RL.

Goal-Conditioned Behavior Cloning: The relabeling idea extends beyond RL. In imitation learning, you can relabel demonstrations with different goals — the same principle of reinterpreting what was "intended."

Foundation for OpenAI Robotics: HER was the engine behind OpenAI's subsequent robotics work, including the Rubik's cube solving hand (which combined HER with domain randomization and automatic curriculum).

HER's lasting insight: Failure is just success at a different task. This reframing principle — that any trajectory is a demonstration of reaching something — has become one of the most influential ideas in modern RL. It shows that the bottleneck in sparse reward RL was never exploration or optimization — it was our failure to extract all available information from the agent's experience.

Cheat sheet

Core idea
Replay failed episodes with achieved states as substitute goals → transform failures into successes
Requirements
Off-policy RL algorithm + goal-conditioned policy π(a|s,g) + mapping m(s) → achieved goal
Best strategy
"future" with k=4: sample 4 future achieved states per transition as substitute goals
Key result
Tasks impossible with sparse rewards (0% success) become solvable (~100% success) with HER
Legacy
Foundation of goal-conditioned RL, robotic manipulation RL, and the principle that failure = success at a different task
What is HER's lasting conceptual contribution to RL?