Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, Sergey Levine (UC Berkeley + Google Brain) — NeurIPS 2020

D4RL: Datasets for Deep Data-Driven RL

The benchmark that revealed offline RL was mostly broken — static datasets from realistic sources (humans, planners, mixed policies) exposed fundamental failures in every algorithm, reshaping the field's research direction.

Prerequisites: MDP basics (state, action, reward) + What a policy is + Basic RL intuition. That's it.

Chapter 0: Why Offline RL Benchmarks?

Imagine you are a hospital administrator sitting on ten years of patient records. Every doctor's decision is logged: which drug was prescribed, what dosage, what happened next. You want to use reinforcement learning to discover better treatment policies. But you cannot experiment on patients. You cannot prescribe random drugs to see what happens. You have a fixed, static dataset and that is all you will ever have.

This is offline reinforcement learning (also called batch RL). Instead of an agent exploring an environment and collecting new data, you hand it a frozen dataset of past interactions and say: "Learn the best policy you can from this alone."

Offline RL is exciting because static datasets are everywhere. Hospitals have patient records. Autonomous driving companies have millions of miles of logged driving. Robotics labs have years of teleoperation logs. If offline RL worked well, it could unlock all of this data for RL without any dangerous real-world exploration.

But in 2020, there was a problem. Researchers were testing their offline RL algorithms on the wrong benchmarks.

The benchmark gap

Before D4RL, most offline RL papers tested on data collected from partially-trained online RL agents. The pipeline looked like this: train SAC or PPO on HalfCheetah for a while, save the replay buffer, then hand that replay buffer to your offline RL algorithm and see how it does.

This seems reasonable until you realize what it means. The data comes from an RL agent that was actively exploring. It covers a wide range of states. It was collected by a policy that is representable by the same neural network class you are using. The data distribution is smooth and well-behaved.

Real-world offline data looks nothing like this.

| Property | Data from RL Training Runs | Real-World Data |
| --- | --- | --- |
| Source | Automated RL agent | Humans, hand-coded controllers, mixed policies |
| Coverage | Broad exploration | Narrow, biased toward common behaviors |
| Policy class | Same neural net architecture | Non-Markovian, non-representable |
| Quality | Gradually improving | Mixed: some expert, some random, some suboptimal |
| Rewards | Dense, well-shaped | Often sparse (did you succeed or not?) |

When Wu et al. (2019) noticed this, they found something alarming: on the standard benchmarks, simple behavioral cloning matched or beat every offline RL algorithm. The benchmarks were too easy. They could not differentiate between methods. Progress was an illusion.

The D4RL thesis: If you want offline RL to work in the real world, you need to test it on data that looks like the real world. That means human demonstrations, hand-designed controllers, mixed-quality datasets, sparse rewards, and complex multi-task environments. When you do this, the results are sobering: most algorithms fail catastrophically. But at least now you know where to improve.

[Interactive demo: Online vs Offline RL. In online RL, the agent explores and collects new data. In offline RL, it is stuck with a fixed dataset forever.]

Chapter 1: The Offline RL Problem

Before we dive into D4RL's benchmark design, we need to understand what makes offline RL fundamentally harder than online RL. The difficulty is not just "less data" — it is a qualitatively different problem with a unique failure mode called distribution shift.

The setup

In offline RL, you are given a fixed dataset D of transitions collected by some unknown behavior policy πB:

$$D = \{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^{N}$$

Each transition records: what state the agent was in, what action it took, what reward it received, and what state it transitioned to. The behavior policy πB is whatever process generated this data — a human operator, a hand-coded controller, a partially-trained RL agent, or a mixture of all three.

Your goal is to find a policy π that maximizes expected cumulative discounted reward:

$$J(\pi) = \mathbb{E}_{\pi, P, \rho_0}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$$

But you cannot interact with the environment. You cannot try actions and see what happens. You are stuck with D, forever.
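
As a concrete reading of the objective J(π), the sketch below computes the discounted return of one logged trajectory. The helper name `discounted_return` and the example rewards are illustrative assumptions, not part of D4RL.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one trajectory (hypothetical helper)."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example)
```

J(π) is the expectation of this quantity over trajectories generated by π. In the offline setting that expectation can only be estimated from the transitions already in D.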

Why this is dangerous: distribution shift

Standard off-policy RL algorithms like Q-learning estimate a Q-function Q(s, a) from data, then derive a policy that picks the action with the highest Q-value. The problem is that Q-learning uses bootstrapping: it updates Q(s, a) using its own estimate of Q(s', a'). If the dataset never contains the (s', a') pair that the learned policy would visit, the Q-value for that pair is completely made up. It is an extrapolation with no grounding in real data.

In online RL, this is fine. If the Q-function overestimates a particular (s, a) pair, the agent will try that action, get a low reward, and correct the estimate. The agent self-corrects through exploration.

In offline RL, there is no self-correction. The Q-function can hallucinate arbitrarily high values for state-action pairs that never appear in the dataset. The learned policy then confidently selects those hallucinated actions, producing catastrophic behavior.
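
A toy illustration of this effect, under an assumed error model (small estimation error on actions that appear in the data, large error on actions that do not). All numbers here are made up for the sketch; the point is only that a max over all actions can never be lower than a max over dataset actions, so large errors on unseen actions get selected whenever they are positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 10

true_q = np.zeros(n_actions)              # every action is truly worth 0
in_data = np.zeros(n_actions, dtype=bool)
in_data[:3] = True                        # the dataset only contains actions 0, 1, 2

# Assumed error model: small error where data exists, large error elsewhere.
noise_scale = np.where(in_data, 0.05, 1.0)
q_hat = true_q + rng.normal(size=n_actions) * noise_scale

target_all = q_hat.max()                  # standard Q-learning backup: max over all actions
target_in_support = q_hat[in_data].max()  # backup restricted to actions seen in the data

print(f"max over all actions:     {target_all:.3f}")
print(f"max over dataset actions: {target_in_support:.3f}")
```

Because the true value of every action is 0, any positive backup target here is pure estimation error. Bootstrapping then propagates that error to earlier states.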

Distribution shift in one sentence: The learned policy visits states and takes actions that are not in the training data, where the Q-function's predictions are unreliable extrapolations. Without the ability to explore and get corrected by real rewards, these errors compound and the policy diverges.
python
# The offline RL problem in code
import gym
import d4rl  # importing d4rl registers the offline environments with gym

# Step 1: Load environment and dataset
env = gym.make('halfcheetah-medium-v2')
dataset = env.get_dataset()

# dataset is a dict with keys including:
# 'observations':  np.array of shape [N, obs_dim]
# 'actions':       np.array of shape [N, act_dim]
# 'rewards':       np.array of shape [N]
# 'terminals':     np.array of shape [N] (bool)
# 'timeouts':      np.array of shape [N] (bool, episode hit the time limit)
# 'next_observations' is present in some dataset versions;
# d4rl.qlearning_dataset(env) provides it explicitly.

print(dataset['observations'].shape)  # (1000000, 17): 1M transitions, 17-dim state
print(dataset['actions'].shape)       # (1000000, 6): 6-dim continuous actions

# Step 2: Train your offline RL algorithm on this dataset
# NO env.step() calls allowed: pure offline learning
# (train_offline_rl is a placeholder for your algorithm, not a d4rl function)
policy = train_offline_rl(dataset)

# Step 3: Evaluate in the simulator
total_reward = 0
obs = env.reset()
for _ in range(1000):
    action = policy.act(obs)
    obs, reward, done, info = env.step(action)
    total_reward += reward
    if done:  # stop at episode end instead of stepping a finished env
        break

# Step 4: Compute normalized score
normalized = env.get_normalized_score(total_reward) * 100

The three approaches to distribution shift

By 2020, three families of algorithms had emerged to handle distribution shift. D4RL was designed to stress-test all of them:

Policy Constraint: Keep the learned policy close to the behavior policy. If πB never took action a in state s, don't try it either. Methods: BCQ, BEAR, BRAC, AWR.

Conservative Q-Learning: Penalize Q-values for out-of-distribution actions so the policy never picks them. Method: CQL.

Importance Weighting: Re-weight data transitions by the ratio π/πB to correct for the mismatch. Methods: AlgaeDICE, DualDICE.

D4RL's design goal: create datasets where each of these approaches breaks in a different way. Policy constraint methods fail when the behavior policy is a mixture. Conservative methods fail when data coverage is narrow. Importance weighting fails when the behavior policy is non-Markovian.
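
To make the conservative idea concrete, here is a minimal sketch of a regularizer in the spirit of CQL for a discrete-action toy case: it pushes down a soft maximum of Q over all actions and pushes up Q on the action the dataset actually took. The function name, the toy Q-table, and the discrete-action setting are assumptions for illustration; this is not the CQL authors' implementation, which targets continuous actions and adds the usual TD loss.

```python
import numpy as np

def conservative_penalty(q_values, data_actions):
    """Mean over the batch of logsumexp_a Q(s, a) - Q(s, a_data).

    q_values:     float array [batch, n_actions], Q(s, a) for every action
    data_actions: int array [batch], the action the dataset took in each state
    """
    # Numerically stable log-sum-exp over the action axis
    q_max = q_values.max(axis=1, keepdims=True)
    lse = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return float(np.mean(lse - q_data))

q = np.array([[1.0, 5.0, 0.0],   # action 1 has a suspiciously high Q-value
              [2.0, 2.0, 2.0]])
a_data = np.array([0, 2])        # actions actually present in the dataset
penalty = conservative_penalty(q, a_data)
print(penalty)
```

The penalty is never negative, because the log-sum-exp is at least as large as any single Q-value. It grows when actions outside the data receive high Q-values, which is exactly the situation a conservative method tries to discourage.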

[Interactive demo: Distribution Shift Visualizer. The blue curve is the data distribution (where the behavior policy visited). The orange curve is where the learned policy wants to go. As training progresses, the learned policy drifts into regions with no data, where Q-value estimates are unreliable.]

Chapter 2: Dataset Design Principles

D4RL is not just a collection of random datasets. Every dataset was chosen to exercise a specific failure mode of offline RL algorithms. The paper identifies six properties that real-world datasets have and that existing benchmarks ignored.

Property 1: Narrow and biased distributions

When a human demonstrates a task, they do it one way. They do not randomly explore the state space. The resulting dataset covers a tiny slice of what is possible. Narrow distributions are the norm in practice — expert demonstrations, hand-designed controllers, and logged human behavior all produce them.

This is toxic for offline RL because the Q-function has no data to ground its estimates outside the narrow band. Even a small deviation from the demonstrated behavior leads to extrapolation.

Property 2: Undirected and multitask data

Imagine logging everything a self-driving car does for a year. The car drives to the grocery store, to work, through construction zones, in rain. This data was not collected to solve any particular task — it is undirected. If you later want to learn a policy for "navigate from A to B," you need to stitch together sub-trajectories: one trajectory goes from A to C, another goes from C to B, and your algorithm must combine them.

Stitching is the ability to combine portions of different trajectories to solve a task that no single trajectory solves. This is fundamentally different from imitation learning, which requires at least one complete demonstration of the desired behavior. D4RL's Maze2D, AntMaze, and Kitchen domains specifically test stitching.
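
Stitching can be sketched on a toy discrete problem: treat every logged transition as an edge in a graph, then search for a route that uses edges from more than one trajectory. The state names and the `stitch` helper are hypothetical, and real offline RL algorithms stitch implicitly through value functions rather than explicit graph search.

```python
from collections import deque

# Two logged trajectories; neither goes from A to B on its own.
trajectories = [
    ["A", "X", "C"],
    ["C", "Y", "B"],
]

# Build a graph of observed transitions: state -> set of next states.
graph = {}
for traj in trajectories:
    for s, s_next in zip(traj, traj[1:]):
        graph.setdefault(s, set()).add(s_next)

def stitch(graph, start, goal):
    """Breadth-first search that only follows transitions seen in the data."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable from the logged data

print(stitch(graph, "A", "B"))
```

The returned path joins the first trajectory to the second at the shared state C. An imitation learner that copies whole trajectories has no demonstration of this route to copy.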

Property 3: Sparse rewards

Many real-world tasks have binary outcomes: did the robot grasp the object (1) or not (0)? Did the patient survive (1) or not (0)? Sparse rewards make credit assignment extremely hard — the algorithm must figure out which of the thousands of actions in a trajectory actually mattered.
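
The contrast between a dense and a sparse reward can be sketched for a simple goal-reaching task. The distance-based dense reward and the 0.5 success threshold are assumptions chosen for illustration, not the reward definitions used by any D4RL environment.

```python
import numpy as np

def dense_reward(pos, goal):
    """Negative distance to the goal: informative at every step."""
    return -float(np.linalg.norm(np.asarray(goal) - np.asarray(pos)))

def sparse_reward(pos, goal, threshold=0.5):
    """1.0 only when within `threshold` of the goal, else 0.0."""
    dist = np.linalg.norm(np.asarray(goal) - np.asarray(pos))
    return 1.0 if dist <= threshold else 0.0

goal = (3.0, 4.0)
for pos in [(0.0, 0.0), (3.0, 3.0), (3.0, 3.8)]:
    print(pos, dense_reward(pos, goal), sparse_reward(pos, goal))
```

With the sparse version, the first two positions receive the same reward even though one is much closer to the goal, so the learner gets no signal about progress until it succeeds.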

Property 4: Suboptimal data

Real data rarely comes from experts. Human demonstrators make mistakes. Hand-coded controllers are approximate. The dataset may contain mostly mediocre behavior with a few good episodes. Algorithms must improve beyond the quality of the data, not just imitate it.

Property 5: Non-representable behavior policies

When the data comes from a human or a hand-designed controller, the behavior policy may not be representable by the neural network class you are using. The human uses memory, context, and planning. A simple feedforward network cannot represent this. Methods that try to estimate πB(a|s) — like importance weighting — will get the wrong answer.
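
A small worked example shows why a wrong estimate of πB hurts importance weighting. It uses a three-action bandit with made-up probabilities and rewards, and computes expectations exactly rather than sampling. The misestimated policy `pi_b_hat` stands in for whatever a limited model class would fit.

```python
import numpy as np

rewards = np.array([0.0, 1.0, 2.0])    # reward of each of 3 actions
pi_b = np.array([0.7, 0.2, 0.1])       # true behavior policy
pi = np.array([0.1, 0.2, 0.7])         # policy we want to evaluate
pi_b_hat = np.array([0.4, 0.3, 0.3])   # a misestimated behavior policy (assumed)

true_value = float(np.sum(pi * rewards))

def is_estimate(pi_b_model):
    """Expected importance-weighted reward when data is drawn from pi_b:
    E_{a ~ pi_b}[ (pi(a) / pi_b_model(a)) * r(a) ]."""
    return float(np.sum(pi_b * (pi / pi_b_model) * rewards))

print(true_value, is_estimate(pi_b), is_estimate(pi_b_hat))
```

With the true πB in the denominator the weights cancel and the estimate equals the value of π. With the misestimated model they do not cancel, and the estimate is biased no matter how much data is collected.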

Property 6: Partial observability

Real systems often have partial observations. A self-driving car sees a 48x48 camera image, not the full state of the world. This compounds the challenges above because the Markov assumption is violated.

| Property | What It Breaks | D4RL Domain Testing It |
| --- | --- | --- |
| Narrow distributions | Q-value extrapolation | Adroit (human), Gym-MuJoCo (medium) |
| Undirected data | Imitation learning (no complete demo) | Maze2D, AntMaze, Kitchen |
| Sparse rewards | Credit assignment | AntMaze, Adroit |
| Suboptimal data | Pure imitation (copies bad behavior) | Gym-MuJoCo (medium, random) |
| Non-representable πB | Importance weighting, policy constraint | Maze2D (planner), Flow (IDM), CARLA |
| Partial observability | Markov assumption | CARLA (camera images) |

[Interactive demo: Narrow vs Broad Data Distributions. As the data distribution moves from broad (RL agent exploration) to narrow (expert demonstrations), the area with no data coverage grows, creating danger zones where Q-values are unreliable.]

Chapter 3: Task Domains

D4RL spans seven task domains, each chosen to test a different combination of the dataset properties from Chapter 2. Think of them as stress tests: each domain breaks algorithms in a specific way.

Maze2D — The stitching test

A 2D ball must navigate to a fixed goal in a maze. The dataset contains trajectories from a planner that navigates to random goals — not the evaluation goal. To succeed, the algorithm must stitch sub-trajectories: "this trajectory passed through the goal area on its way somewhere else, let me use that segment."

Three layouts: umaze (simple U-shape), medium (4-room), large (complex corridors). The planner uses waypoints and a PD controller, making the behavior policy non-Markovian (it remembers which waypoints it has visited).
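
The non-Markovian nature of a waypoint planner can be sketched with a small PD controller that remembers which waypoint it is heading to. The class name, gains, and route are hypothetical and are not taken from the D4RL planner. The sketch only shows that the same observation can map to different actions depending on hidden memory.

```python
import numpy as np

class WaypointController:
    """PD controller that steers toward waypoints in order (illustrative)."""

    def __init__(self, waypoints, kp=1.0, kd=0.5, reach_radius=0.1):
        self.waypoints = [np.asarray(w, dtype=float) for w in waypoints]
        self.kp, self.kd, self.reach_radius = kp, kd, reach_radius
        self.index = 0  # hidden memory: which waypoint is currently targeted

    def act(self, pos, vel):
        pos, vel = np.asarray(pos, dtype=float), np.asarray(vel, dtype=float)
        target = self.waypoints[self.index]
        # Advance to the next waypoint once the current one is reached
        if (np.linalg.norm(target - pos) < self.reach_radius
                and self.index < len(self.waypoints) - 1):
            self.index += 1
            target = self.waypoints[self.index]
        # Proportional term pulls toward the target, derivative term damps velocity
        return self.kp * (target - pos) - self.kd * vel

route = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
early, late = WaypointController(route), WaypointController(route)
late.index = 2  # same controller, further along its route

obs_pos, obs_vel = (0.5, 0.5), (0.0, 0.0)
action_early = early.act(obs_pos, obs_vel)
action_late = late.act(obs_pos, obs_vel)
print(action_early, action_late)
```

Both controllers see the same position and velocity but output opposite actions. A policy model that conditions only on the observation cannot reproduce both.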

AntMaze — Stitching with a complex robot

Same maze concept but replacing the 2D ball with an 8-DOF quadruped "Ant" robot. This tests stitching with a morphologically complex agent and sparse 0-1 rewards (1 only when reaching the goal). The dataset comes from a goal-conditioned policy navigating to random locations.

Gym-MuJoCo — The classic (but now harder)

HalfCheetah, Hopper, and Walker2d are the workhorses of RL benchmarking. D4RL keeps these for backward compatibility but adds new dataset types (medium-expert, medium-replay) that expose failures invisible in prior benchmarks. Rewards are dense, actions are continuous, and observations are low-dimensional (11 dimensions for Hopper, 17 for HalfCheetah and Walker2d).

Adroit — Human dexterity

A 24-DOF Shadow Hand robot must hammer a nail, open a door, twirl a pen, or relocate a ball. The key dataset here is human demonstrations — only 25 trajectories per task, collected from actual human teleoperators. This tests whether algorithms can learn from extremely limited, narrow, non-representable data with sparse rewards.

FrankaKitchen — Multitask stitching

A 9-DOF Franka robot in a kitchen must open the microwave, move the kettle, turn on the light, and open cabinets — all in the right order. The dataset contains human demonstrations of different sub-tasks. No single trajectory completes the full evaluation task. The algorithm must generalize across manipulation sub-tasks.

Flow — Traffic control

Control autonomous vehicles in traffic simulations (ring road, highway merge). Data comes from the Intelligent Driver Model (IDM), a hand-designed model of human driving. Tests non-representable behavior policies in a realistic domain.

CARLA — Vision-based driving

High-fidelity autonomous driving with 48x48 RGB images as observations. Lane following and town navigation tasks. Tests partial observability (camera images, not full state) combined with undirected data from hand-designed controllers.

[Interactive demo: D4RL Domain Explorer. Each domain targets a different combination of offline RL failure modes.]

python
# Loading different D4RL domains
import gym
import d4rl

# Maze2D — 2D navigation, stitching test
env = gym.make('maze2d-large-v1')
ds = env.get_dataset()
print(ds['observations'].shape)  # (4000000, 4) — pos_x, pos_y, vel_x, vel_y

# Adroit — 24-DoF hand, human demos
env = gym.make('pen-human-v1')
ds = env.get_dataset()
print(ds['observations'].shape)  # (5000, 45) — only 25 trajectories!

# AntMaze — sparse reward, complex morphology
env = gym.make('antmaze-medium-diverse-v0')
ds = env.get_dataset()
print(ds['rewards'].mean())  # ~0.0 — almost all 0, reward only at goal

# Kitchen — multitask, human demos
env = gym.make('kitchen-mixed-v0')
ds = env.get_dataset()
print(ds['observations'].shape)  # (136950, 60) — 9-DoF robot + object states

Chapter 4: Dataset Types

For each task domain, D4RL provides multiple datasets of varying quality and composition. This is deliberate — the same environment can be easy or impossible depending on what data you have.

The six dataset types

random — Data from a randomly initialized, untrained policy. Actions are essentially noise. The coverage is broad (the agent goes everywhere because it moves randomly), but the quality is terrible. No useful behavior to imitate, but Q-learning can potentially learn from the diverse state coverage.

medium — Data from a partially-trained SAC agent, early-stopped at about 1/3 of expert performance. This is the most commonly used dataset type. The behavior is coherent but clearly suboptimal. The question: can your algorithm improve beyond this mediocre policy?

medium-replay — The entire replay buffer from training SAC up to medium performance. Unlike "medium," this includes all the data from the beginning of training (when the agent was nearly random) through to the medium level. The distribution is a mixture of many policies at different skill levels.

medium-expert — A 50/50 mix of medium data and expert data. This is a realistic scenario: you have some high-quality demonstrations and a bunch of mediocre data. The challenge is that policy constraint methods may constrain to the average of the mixture, which is neither medium nor expert.

expert — Data from a fully-trained SAC agent or human expert. Narrow distribution of near-optimal behavior. Behavioral cloning should work well here. The interesting question is whether offline RL can match or exceed BC.

human — Data from actual human teleoperators (Adroit domain) or hand-designed human behavior models (Flow domain). Limited in quantity, non-Markovian in nature, and non-representable by standard policy classes.
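
The difficulty with mixture data can be illustrated in one dimension. Assume, purely for the sketch, that "medium" actions cluster around -1 and "expert" actions cluster around +1. Fitting a single Gaussian behavior model to the 50/50 mix places its mean where almost no data lies.

```python
import numpy as np

rng = np.random.default_rng(0)
medium_actions = rng.normal(loc=-1.0, scale=0.1, size=5000)  # assumed "medium" mode
expert_actions = rng.normal(loc=+1.0, scale=0.1, size=5000)  # assumed "expert" mode
actions = np.concatenate([medium_actions, expert_actions])   # 50/50 mixture

# Fit a single Gaussian behavior model by maximum likelihood
mu, sigma = actions.mean(), actions.std()

# How much data actually lies near the fitted mean?
frac_near_mean = np.mean(np.abs(actions - mu) < 0.5)

print(f"fitted mean={mu:.3f}, std={sigma:.3f}, data near mean={frac_near_mean:.4f}")
```

A constraint that pulls the learned policy toward this unimodal fit pulls it toward actions that neither the medium nor the expert policy took.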

The key finding: Algorithms that performed well on "medium" data (the prior standard) often failed on "medium-expert" and "human" data. The mixture in medium-expert confused policy constraint methods, and the narrow non-Markovian nature of human data broke importance weighting. This is why D4RL matters — it revealed these failures.

| Dataset Type | Source | Coverage | Quality | # Samples (typical) |
| --- | --- | --- | --- | --- |
| random | Random policy | Broad | Very low | 1M |
| medium | Early-stopped SAC | Moderate | ~33% expert | 1M |
| medium-replay | SAC replay buffer | Broad (mixed) | Increasing | ~100K-200K |
| medium-expert | 50/50 medium + expert | Bimodal | Mixed | 2M |
| expert | Fully-trained SAC | Narrow | High | 1M |
| human | Human demonstrators | Very narrow | Variable | 5K-25K |

python
# Comparing dataset types for HalfCheetah
import gym, d4rl
import numpy as np

for name in ['random', 'medium', 'medium-replay', 'medium-expert', 'expert']:
    env = gym.make(f'halfcheetah-{name}-v2')
    ds = env.get_dataset()

    # Trajectory return = sum of rewards per episode.
    # HalfCheetah has no early termination, so episodes end on 'timeouts'.
    ends = np.logical_or(ds['terminals'], ds['timeouts'])
    returns = []
    ep_ret = 0
    for i in range(len(ds['rewards'])):
        ep_ret += ds['rewards'][i]
        if ends[i]:
            returns.append(ep_ret)
            ep_ret = 0

    print(f'{name:20s}  mean_return={np.mean(returns):8.1f}'
          f'  std={np.std(returns):7.1f}  n_traj={len(returns)}')

# Illustrative output (exact values depend on the dataset version):
# random                mean_return=  -280.5  std=   78.2  n_traj=1000
# medium                mean_return=  4770.8  std=  105.3  n_traj=1000
# medium-replay         mean_return=  2180.3  std= 1850.6  n_traj=  97
# medium-expert         mean_return=  7490.2  std= 2870.1  n_traj=2000
# expert                mean_return= 12135.0  std=   17.8  n_traj=1000
Notice the standard deviation of returns. Expert data has σ=17.8 (very narrow — every trajectory looks the same). Medium-replay has σ=1850.6 (enormous — it spans from random to medium quality). This is the distribution difference that breaks algorithms.

[Interactive demo: Dataset Composition Visualizer. The trajectory return distribution shifts with policy quality. At "expert" level, the distribution is narrow and high-quality. At "random," it is broad and low-quality. At "medium-expert," it is bimodal, with two peaks.]

Chapter 5: Benchmark Results

This is the payoff. D4RL evaluated eight algorithms across all domains, and the results were devastating for the field. Most offline RL algorithms that looked promising on prior benchmarks fell apart on realistic data.

The algorithms

BC (Behavioral Cloning) — Simple supervised learning. Copy the behavior policy. Ignores rewards entirely. This is the baseline that offline RL should beat.

SAC-off (Soft Actor-Critic, offline) — Standard SAC trained on the static dataset with no environment interaction. No distribution shift mitigation. Often diverges.

BCQ (Batch-Constrained Q-learning) — Constrains the policy to only take actions that appear in the dataset, using a generative model of πB.

BEAR (Bootstrapping Error Accumulation Reduction) — Constrains the policy to stay within the support of the data distribution using MMD distance.

BRAC (Behavior Regularized Actor-Critic) — Regularizes the actor to stay close to the behavior policy via KL divergence.

AWR (Advantage Weighted Regression) — Weights actions by their advantage, emphasizing good actions in the dataset.

CQL (Conservative Q-Learning) — Learns a lower bound on the Q-function by penalizing Q-values for out-of-distribution actions. Published concurrently with D4RL by the same group.

AlgaeDICE — Uses the DICE framework for off-policy evaluation, correcting distribution mismatch via importance weighting.
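
Since BC is the baseline every other method is measured against, a minimal sketch may help. It fits a linear policy to (observation, action) pairs by least squares on synthetic data. The linear policy class and the synthetic dataset are assumptions for illustration; practical BC baselines typically use neural networks trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, n = 4, 2, 500

# Synthetic stand-in for a dataset: actions are a linear function of observations
true_weights = rng.normal(size=(obs_dim, act_dim))
observations = rng.normal(size=(n, obs_dim))
actions = observations @ true_weights

# Behavioral cloning with a linear policy:
# least-squares regression of actions on observations. Rewards are never used.
weights, *_ = np.linalg.lstsq(observations, actions, rcond=None)

def bc_policy(obs):
    return obs @ weights

max_error = np.abs(bc_policy(observations) - actions).max()
print(max_error)
```

BC can at best reproduce the behavior in the data. Offline RL methods use the rewards as well, which is what gives them a chance to do better than the data on suboptimal or mixed datasets.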

The key findings

Finding 1: On Gym-MuJoCo with RL-agent data, all methods look good. This is the setting most prior papers used. BEAR, BRAC, BCQ, and CQL all outperform BC on medium data. This is the misleading picture that pre-D4RL benchmarks painted.
Finding 2: On realistic data, most methods fail. On AntMaze (sparse reward + stitching), all methods except CQL scored near zero. On Adroit human demos, most methods matched or underperformed BC. On Kitchen mixed data, all methods struggled.
Finding 3: Mixture data confuses constraint methods. On medium-expert data, algorithms performed roughly on par with medium-only data — despite having access to expert demonstrations. The mixture broke the constraint target.
Finding 4: Offline RL beats online RL on exploration-hard tasks. A positive result: on AntMaze and Adroit, offline methods with good data outperformed online SAC, which struggled to explore the sparse-reward landscape from scratch.
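
| Domain | Dataset | BC | SAC-off | BEAR | CQL |
| --- | --- | --- | --- | --- | --- |
| HalfCheetah | medium | 36.1 | -4.3 | 41.7 | 44.0 |
| Hopper | medium | 29.0 | 0.8 | 52.1 | 58.5 |
| Walker2d | medium | 6.6 | 0.9 | 33.7 | 72.5 |
| AntMaze | umaze | 65.0 | 0.0 | 73.0 | 74.0 |
| AntMaze | medium-diverse | 0.0 | 0.0 | 8.0 | 53.7 |
| AntMaze | large-diverse | 0.0 | 0.0 | 0.0 | 14.9 |
| Pen | human | 34.4 | 6.3 | -1.0 | 37.5 |
| Hammer | human | 1.5 | 0.5 | 0.3 | 4.4 |
| Kitchen | mixed | 47.5 | 2.5 | 47.2 | 51.0 |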

Look at AntMaze large-diverse: BC scores 0.0, SAC-off scores 0.0, BEAR scores 0.0, and even CQL only gets 14.9. This task requires stitching together trajectories in a large maze with sparse rewards — and almost nothing works. This is the kind of problem D4RL was designed to expose.

[Interactive demo: Algorithm Comparison Dashboard. Compares algorithm performance by domain and dataset type. Normalized scores: 0 = random policy, 100 = expert. The rankings change dramatically between easy benchmarks and hard ones.]

Chapter 6: Evaluation Protocol

Raw reward numbers are meaningless across tasks. A score of 5000 on HalfCheetah and a score of 3.2 on Pen-twirl cannot be compared. D4RL introduces a normalized scoring protocol that puts all tasks on the same 0-100 scale.

The normalization formula

normalized score = 100 × (score − random score) / (expert score − random score)

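Here, score is the average return of the policy being evaluated, random score is the average return of a random policy, and expert score is the return of the domain's reference expert.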

A normalized score of 0 means "as good as random." A score of 100 means "as good as the expert." Scores above 100 are possible (your algorithm found something better than the reference expert). Scores below 0 mean your algorithm is worse than random — which happens more often than you'd think with offline RL due to distribution shift.
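
The formula can be written as a small standalone function. The reference returns below are round placeholder numbers chosen to make the arithmetic easy to follow. They are not D4RL's stored reference scores, and `normalized_score` is a hypothetical helper rather than the library's API.

```python
def normalized_score(score, random_score, expert_score):
    """Map a raw return onto the 0-100 scale defined by two reference returns."""
    return 100.0 * (score - random_score) / (expert_score - random_score)

# Reference returns are illustrative placeholders, not D4RL's stored values
RANDOM, EXPERT = -300.0, 12000.0

for raw in (-300.0, 5850.0, 12000.0, 13230.0, -1530.0):
    print(raw, normalized_score(raw, RANDOM, EXPERT))
```

The five inputs cover the cases described above: equal to random, halfway between random and expert, equal to expert, better than the expert reference, and worse than random.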

What counts as "expert"?

The expert reference varies by domain because the notion of "best possible" differs:

| Domain | Expert Reference | Why This Choice |
| --- | --- | --- |
| Gym-MuJoCo | Fully-trained SAC | Gold standard for these tasks |
| Maze2D | Hand-designed planner | Optimal path from planner |
| AntMaze | Maximum possible (1.0) | Sparse binary reward |
| Adroit | BC + RL fine-tuned | Best known policy |
| Kitchen | Maximum possible (4.0) | 4 subtasks, 1 point each |
| Flow | Hand-designed controller | IDM performance |
| CARLA | Maximum estimate | Theoretical best |

The hyperparameter problem

Prior works tuned hyperparameters using online evaluation — running the learned policy in the simulator during training and picking the best hyperparameters. But in real offline RL, you cannot do this. You have no simulator. D4RL addresses this by splitting tasks into training tasks (tune hyperparameters here) and evaluation tasks (report final performance here, no tuning allowed).

Why this matters: Wu et al. (2019) showed that hyperparameter sensitivity was the main differentiator between algorithms on prior benchmarks. With extensive online tuning, almost any method could score well. By restricting tuning to training tasks, D4RL reveals which algorithms are robust versus which require per-task tuning.
python
# D4RL's normalized scoring API
import gym, d4rl
import numpy as np

env = gym.make('halfcheetah-medium-v2')

# Evaluate policy over 100 episodes
# (policy is assumed to be an already-trained object with an act(obs) method)
returns = []
for _ in range(100):
    obs = env.reset()
    total = 0
    done = False
    while not done:
        action = policy.act(obs)
        obs, reward, done, info = env.step(action)
        total += reward
    returns.append(total)

raw_score = np.mean(returns)  # e.g., 5200.0

# Normalize: the env knows its own reference scores
normalized = env.get_normalized_score(raw_score)
print(f'Normalized: {normalized * 100:.1f}')  # e.g., 44.1

# Under the hood:
# ref_min = -280.2  (random policy average return)
# ref_max = 12135.0 (expert SAC average return)
# normalized = (5200 - (-280.2)) / (12135 - (-280.2)) = 0.441

[Interactive demo: Normalized Score Calculator. Maps a raw score to the 0-100 normalized scale for different environments. The min (random) and max (expert) reference scores vary per task.]

Chapter 7: Connections

D4RL was published in 2020, and it reshaped the offline RL field overnight. By exposing the failures of existing algorithms on realistic data, it set the research agenda for the next several years. Here is how D4RL connects to the broader landscape.

Algorithms born from D4RL's challenges

| Algorithm | Year | Key Idea | D4RL Problem It Solves |
| --- | --- | --- | --- |
| CQL | 2020 | Conservative Q-value lower bound | Distribution shift: penalizes OOD actions |
| IQL | 2021 | Implicit Q-learning via expectile regression | Avoids querying OOD actions entirely |
| TD3+BC | 2021 | TD3 with BC regularization term | Simple, strong baseline, beats complex methods |
| Decision Transformer | 2021 | Sequence modeling, not RL | Sidesteps Q-function entirely |
| Diffusion-QL | 2023 | Diffusion model for policy | Expressive policy for multimodal data |

Every one of these algorithms was evaluated on D4RL. The benchmark became the standard: if your offline RL paper does not report D4RL scores, reviewers will ask why.

D4RL's lasting contributions

Normalized Scoring
The 0-100 normalization became the universal reporting standard for offline RL. Every paper now reports normalized scores.
Realistic Data Focus
Shifted the field from "data from RL training" to "data from humans, planners, and mixed sources." Made realism the expectation.
Stitching as a Core Challenge
AntMaze and Kitchen made stitching a first-class problem. New algorithms are now specifically designed to stitch trajectories.
Open-Source API
Two lines of code to load any dataset. Lowered the barrier to entry for offline RL research dramatically.

Limitations and successors

D4RL is not perfect. The simulated environments, while battle-tested, are still far from the complexity of real-world systems. The action spaces are continuous but low-dimensional. Stochastic environment dynamics (stock markets, weather) are not represented, and large action spaces (recommender systems) are absent.

| Benchmark | Year | What It Adds Beyond D4RL |
| --- | --- | --- |
| RL Unplugged | 2020 | Perceptual complexity (pixel observations), Atari |
| NeoRL | 2022 | Near real-world industrial control tasks |
| ExORL | 2022 | Unsupervised pre-training data for RL |
| D5RL | 2023 | Extends D4RL with pixel observations, more realistic tasks |

The bigger picture

D4RL's impact extends beyond benchmarks. It crystallized a core message: offline RL is not just "RL but without exploration." It is a fundamentally different problem with unique challenges that require purpose-built solutions. The field took this lesson seriously, and the algorithms that followed (CQL, IQL, Decision Transformer) each addressed specific failure modes that D4RL exposed.

The benchmark also demonstrated the power of good dataset design. In supervised learning, ImageNet showed that data scale matters. In offline RL, D4RL showed that data composition matters. The same algorithm can score 100 on expert data and 0 on human data from the same environment. The dataset is not just fuel — it determines what is learnable.

D4RL's legacy in one sentence: It forced the offline RL community to stop testing on easy data, face the real challenges of learning from static datasets, and build algorithms that actually work on the kind of data you get in practice.

[Interactive demo: D4RL Impact Timeline. Shows how D4RL influenced the offline RL research timeline, with key papers and their D4RL results by year.]