After every failure, pretend the place you ended up was where you meant to go. Sparse binary rewards go from impossible to solved.
You are a robotic arm. Your task: push a block to a red target position on a table. You get a reward of 0 when the block reaches the target and −1 at every other timestep. That is it. No partial credit for getting close. No shaping. Just: did you succeed, yes or no?
This is called a sparse binary reward. And it makes reinforcement learning nearly impossible.
Why? Because a random policy will never accidentally push the block to exactly the right spot. The agent receives −1 on every timestep of every episode. It has no gradient signal. No indication that one trajectory was better than another. Every failure looks identical from the reward's perspective.
The traditional fix is reward shaping — hand-crafting a dense reward function that provides gradient signal along the way (e.g., negative distance to the goal). But this has problems:
Press "Run Episode" below to watch a random agent try to push a block to a target. Notice: it never gets any useful reward signal.
Imagine you are learning to shoot a hockey puck into a net. You aim, you shoot — the puck sails wide, missing the net to the right. A standard RL algorithm concludes: "that sequence of actions did not lead to a goal." Reward = −1. Nothing learned.
But a human draws a different conclusion: "If the net had been placed further to the right, that would have been a perfect shot." The actions were fine — they just led to a different goal than intended.
This is possible because of a key property: the goal does not affect the environment dynamics. Whether you were aiming for position A or position B, the physics of pushing a block is the same. The goal only determines the reward. So replaying a trajectory with a different goal produces valid training data for an off-policy algorithm.
Think of it like grading a student's exam against a different answer key. The student's work did not change — but against the new key, they got a perfect score. And we can learn from both the original grade and the re-graded version.
The visualization above shows the same trajectory viewed two ways. On the left, the original goal — the agent failed. On the right, the achieved state is substituted as the goal — now it is a success. Same actions, same physics, different label.
HER requires a specific RL setup: goal-conditioned policies. Instead of a policy that just maps states to actions, π(a|s), we have a policy that also takes a goal as input:
"Given state s and goal g, what action should I take?" This is a Universal Value Function Approximator (UVFA) setup from Schaul et al. (2015).
For the robotic manipulation tasks in the paper, goals are 3D positions: g ∈ R3. The goal specifies where the object should end up. A mapping function m(s) extracts the "achieved goal" from any state — typically just the current position of the object being manipulated.
The reward function is brutally simple:
In words: you get 0 if the object is within ε of the goal after executing action a, and −1 otherwise. No shaping, no partial credit.
The Q-function is also goal-conditioned:
Both the policy and the Q-function take (state, goal) as input. In practice, state and goal are concatenated: the network receives [s || g] as a single input vector.
The algorithm is elegantly simple. After collecting an episode, you store each transition twice (or more) in the replay buffer:
// HER algorithm (simplified) for episode = 1 to M: sample goal g, initial state s0 for t = 0 to T-1: at = π(st || g) + noise // behavioral policy st+1 = env.step(at) for t = 0 to T-1: // Standard replay rt = reward(st, at, g) store (st||g, at, rt, st+1||g) in buffer // Hindsight replay G' = sample_goals(episode) // e.g. future states for g' in G': r' = reward(st, at, g') store (st||g', at, r', st+1||g') in buffer // Train with standard off-policy RL (e.g. DDPG) for t = 1 to N: minibatch = sample(buffer) update networks with minibatch
The beauty is that HER is algorithm-agnostic. It wraps around any off-policy algorithm (DQN, DDPG, NAF, SAC). The underlying RL algorithm does not even know that goal relabeling happened — it just sees a replay buffer with valid transitions.
HER works because it creates an implicit curriculum. Early in training, the policy is nearly random. It pushes the block to random places. HER takes those random achieved states and uses them as goals — so the agent first learns to reach easy, nearby positions.
As the policy improves and can reach more states reliably, the achieved states in the replay buffer become more diverse. The agent gradually learns to reach harder, more distant positions. The curriculum emerges automatically from the agent's own experience — no manual design needed.
HER requires an off-policy algorithm. Why? Because we are learning from transitions that were generated under a different goal than the one we are now evaluating. The agent took action at while pursuing goal g, but we are asking: "what would the Q-value be if the goal had been g'?" Only off-policy methods (which can learn from data generated by any policy) can handle this.
On-policy methods like REINFORCE or PPO cannot use HER because they require that training data was generated by the current policy under the current objective.
Click "Step Curriculum" repeatedly to see how HER's implicit curriculum progresses. Early on, goals are easy (close to start). As the agent improves, goals become harder (further away).
When replaying with hindsight goals, we need to decide which achieved states to use as substitute goals. The paper explores four strategies:
Why does "future" beat "final"? Because "final" only provides one substitute goal per transition. With k=4 future goals, each transition generates 4 additional replay entries — quadrupling the amount of successful experience. More importantly, "future" goals have a natural distance gradient: goals sampled from t+1 are close (easy), goals from t+T are far (hard).
Why does "random" perform poorly? Because random goals from the buffer have no relationship to the current trajectory. The agent may never be close to those goals during this episode, so the relabeled reward is still −1 — no useful signal gained.
In the paper, HER is combined with DDPG (Deep Deterministic Policy Gradients) for continuous control. Let's walk through how the two pieces fit together.
DDPG maintains two networks:
Both networks are goal-conditioned: they receive [state || goal] as input. The critic is trained to minimize the Bellman error:
The actor is trained to maximize the critic's estimate:
// DDPG + HER training loop for epoch = 1 to 200: for cycle = 1 to 50: // Collect experience for episode = 1 to 16: rollout with π(s||g) + N(0,0.2) store transitions + HER relabeled transitions // Train for step = 1 to 40: minibatch = sample(buffer, 256) update critic (Bellman loss) update actor (policy gradient through critic) soft-update target networks (τ=0.95)
The results are striking. Three robotic manipulation tasks with sparse binary rewards:
Surprisingly, the paper found that shaped rewards (e.g., r = −||sobj − g||2) actually performed worse than sparse rewards + HER. Neither DDPG nor DDPG+HER could solve the tasks with shaped rewards. Why?
In the bit-flipping environment (n bits, goal = target bit string), standard DQN fails for n > 13. DQN + HER solves n = 50 easily. The improvement factor is enormous because HER provides reward signal that would otherwise be completely absent.
The paper demonstrates that policies trained entirely in MuJoCo simulation can be deployed on a physical Fetch robot arm — and they work, with no fine-tuning on the real robot.
The sim-to-real transfer is demonstrated on pushing tasks. Pick-and-place transfer is harder because grasping is more sensitive to physics modeling errors (friction, contact dynamics). The paper shows successful pushing transfer but notes that more complex tasks would benefit from domain randomization or additional techniques.
Universal Value Function Approximators (Schaul et al., 2015): Introduced goal-conditioned Q-functions Q(s, a, g). HER makes UVFA practical by solving the sparse reward problem that makes training goal-conditioned policies difficult.
Experience Replay (Lin, 1992): The idea of storing and reusing past transitions. HER extends this by also relabeling the stored transitions with different goals.
DDPG (Lillicrap et al., 2015): The off-policy continuous control algorithm used in the paper. HER wraps around it but could use any off-policy algorithm.
Goal-Conditioned RL (GCRL): HER established goal-conditioned RL as a practical paradigm. Nearly all subsequent work on multi-goal RL builds on HER's relabeling idea.
Robotic Manipulation RL: Before HER, RL for manipulation required extensive reward engineering. After HER, sparse binary rewards became viable, making RL much more accessible for robotics researchers.
RIG (Nair et al., 2018): Combines HER with learned goal representations from images, enabling visual goal-conditioned RL.
Goal-Conditioned Behavior Cloning: The relabeling idea extends beyond RL. In imitation learning, you can relabel demonstrations with different goals — the same principle of reinterpreting what was "intended."
Foundation for OpenAI Robotics: HER was the engine behind OpenAI's subsequent robotics work, including the Rubik's cube solving hand (which combined HER with domain randomization and automatic curriculum).