Offline RL without querying out-of-distribution actions — approximate policy improvement implicitly through expectile regression on the value function.
Imagine you have a giant dataset of robot trajectories. Some are good, some mediocre, some terrible. You want to learn a policy that's better than any single trajectory in the dataset. That's the promise of offline reinforcement learning.
But here's the fundamental dilemma. Standard Q-learning computes the bootstrap target:
$$ y = r + \gamma \max_{a'} Q(s', a') $$
That max operation queries the Q-function at actions a' that the agent might never have seen in the dataset. If a' is out-of-distribution (OOD), the Q-network can hallucinate arbitrarily high values for it. The max then selects these hallucinated values, which feed back into the Bellman update, creating a vicious cycle of overestimation that diverges.
The blue region shows the data distribution. The red curve shows Q-values learned by standard Q-learning; note how Q explodes outside the data support, and how repeated Bellman updates propagate the overestimation.
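Most prior offline RL methods tried to solve the OOD action problem by either constraining the learned policy to stay close to the data (BCQ, TD3+BC) or penalizing Q-values at out-of-distribution actions (CQL).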
All of these still query Q-values at actions not in the dataset — they just try to mitigate the damage. IQL takes a radically different approach: never query the Q-function at unseen actions at all.
The key idea: approximate the maximization over actions implicitly through expectile regression on the value function. Instead of computing max_{a'} Q(s', a'), learn a state-value function V(s) that approximates this maximum using only (state, action) pairs that appear in the dataset.
This is the fundamental contribution: by replacing the explicit max with an implicit approximation via expectile regression, IQL avoids querying Q-values outside the data support entirely. The entire training pipeline operates only on in-distribution state-action pairs.
Let's formalize the offline RL problem. You have a fixed dataset D = {(s, a, r, s')} collected by some unknown behavior policy πβ. The goal: learn a policy π that maximizes expected return, using only this dataset — no additional environment interaction.
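As a concrete picture of this setup, here is a minimal sketch of a fixed dataset of (s, a, r, s') tuples and a batch sampler. The names `Transition` and `sample_batch` and the toy integer states are illustrative assumptions, not part of any IQL codebase. The point is only that training reads from a list and never steps an environment.

```python
import random
from typing import List, NamedTuple


class Transition(NamedTuple):
    """One logged (s, a, r, s') tuple from the unknown behavior policy."""
    state: int
    action: int
    reward: float
    next_state: int


def sample_batch(dataset: List[Transition], batch_size: int,
                 rng: random.Random) -> List[Transition]:
    """Draw a minibatch with replacement; no environment interaction occurs."""
    return [rng.choice(dataset) for _ in range(batch_size)]


# Hypothetical toy dataset: three logged transitions.
dataset = [
    Transition(state=0, action=1, reward=0.0, next_state=1),
    Transition(state=1, action=1, reward=0.0, next_state=2),
    Transition(state=2, action=0, reward=1.0, next_state=3),
]
batch = sample_batch(dataset, batch_size=2, rng=random.Random(0))
```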
Standard Q-learning updates:
$$ Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a') $$
The max over a' is the culprit. With function approximation, Q-networks generalize: they produce outputs for any input (s, a), even ones never seen during training. In the online setting, this is fine: if Q overestimates an action, the agent tries it, gets a low reward, and the error self-corrects.
Offline, there's no self-correction. The Q-network can assign arbitrarily high values to OOD actions. The max operator selects these overestimates as targets. These inflated targets then train the Q-network to be even more overoptimistic at nearby states. Over many iterations, Q-values explode.
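The selection effect of the max can be seen without any neural network. The sketch below is a simplified illustration, not the experiment behind this article: it assumes the true Q-value of every action is 0 and that each estimate carries independent Gaussian noise. Taking the max over the noisy estimates yields a clearly positive number on average, while the estimate at any single fixed action stays near 0.

```python
import random


def max_bias_demo(num_actions=10, noise_std=1.0, trials=2000, seed=0):
    """Return (mean of max_a Q_hat(a), mean of Q_hat at one fixed action).

    Assumption: true Q is 0 everywhere, so any positive mean is pure
    estimation error selected by the max operator.
    """
    rng = random.Random(seed)
    max_sum = 0.0
    fixed_sum = 0.0
    for _ in range(trials):
        q_hat = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        max_sum += max(q_hat)      # what a max-based target would use
        fixed_sum += q_hat[0]      # what a single in-data action would give
    return max_sum / trials, fixed_sum / trials


avg_max, avg_fixed = max_bias_demo()
print(f"mean of max over actions: {avg_max:.2f}")
print(f"mean at one fixed action: {avg_fixed:.2f}")
```

Offline, nothing corrects this upward bias, and each Bellman update bootstraps from it again.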
The deeper issue is distributional shift. The Q-function is trained on transitions from πβ but evaluated on actions from π (the learned policy). If π differs significantly from πβ, the Q-function is extrapolating — producing outputs in regions of action space where it has no data. Neural networks are notoriously bad extrapolators.
The blue distribution shows actions in the dataset (behavior policy πβ). The teal distribution shows what the learned policy π tries to do. The gap between them is where Q-values are unreliable, and it grows as the learned policy shifts away from the data.
Before diving into IQL's algorithm, we need to understand the key mathematical tool: expectile regression.
You know quantiles: the median is the value where 50% of data falls below. Expectiles are similar, but instead of counting data points, they weight squared errors asymmetrically.
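The τ-expectile of a random variable X is the value mτ that minimizes:
$$ m_\tau = \arg\min_{m} \ \mathbb{E}\left[ L_2^\tau(X - m) \right] $$
where the asymmetric squared loss is:
$$ L_2^\tau(u) = \left| \tau - \mathbb{1}(u < 0) \right| u^2 $$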
When τ = 0.5, this is ordinary least squares — the solution is the mean. When τ > 0.5, the loss penalizes underestimates more than overestimates, pushing the solution above the mean. When τ → 1, the expectile converges to the maximum.
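These limiting cases follow from the first-order condition. As a short derivation sketch, assume the asymmetric loss has the usual form L_2^τ(u) = |τ − 1(u < 0)| u², and write (x)_+ for max(x, 0). Setting the derivative of the expected loss with respect to m to zero gives:

```latex
\begin{aligned}
\frac{d}{dm}\,\mathbb{E}\big[L_2^\tau(X-m)\big]
  &= -2\,\mathbb{E}\big[\,\lvert\tau-\mathbb{1}(X<m)\rvert\,(X-m)\big] = 0 \\
\Longrightarrow\quad
\tau\,\mathbb{E}\big[(X-m)_+\big] &= (1-\tau)\,\mathbb{E}\big[(m-X)_+\big]
\end{aligned}
```

At τ = 0.5 the two sides balance when E[X − m] = 0, so m is the mean. As τ → 1 the right-hand side vanishes, which forces E[(X − m)_+] → 0 and pushes m to the top of the support of X.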
Think of expectiles as an asymmetric averaging operation. The mean treats all errors equally. The τ-expectile says: "penalize sitting below a sample τ/(1−τ) times more than sitting above one." At τ = 0.9, underestimates are penalized 9x more than overestimates, so the solution is pulled toward the top of the distribution. At τ = 0.99, it's 99x, and the solution sits nearly at the maximum.
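To make the asymmetric averaging concrete, here is a small sketch that computes a sample expectile by repeatedly taking a weighted mean, with weight τ on samples above the current estimate and 1 − τ on the rest. The function name `expectile`, the iteration count, and the sample values are illustrative assumptions.

```python
def expectile(samples, tau, iters=100):
    """Sample tau-expectile via fixed-point iteration.

    Each pass recomputes a weighted mean where samples above the current
    estimate get weight tau and the others get weight 1 - tau.
    """
    m = sum(samples) / len(samples)  # start from the ordinary mean
    for _ in range(iters):
        weights = [tau if x > m else 1.0 - tau for x in samples]
        m = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return m


xs = [0.0, 1.0, 2.0, 3.0, 10.0]
for tau in (0.5, 0.9, 0.99):
    print(f"tau={tau}: expectile={expectile(xs, tau):.3f}")
```

With τ = 0.5 the weights are all equal and the result is the plain mean; larger τ moves the result toward the largest sample without ever exceeding it.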
As τ increases, the expectile (teal line) moves from the mean (τ=0.5) toward the maximum (τ→1). The warm dots are data samples; the asymmetric penalty is shown below.
IQL trains three networks with three losses. Let's build them up one at a time.
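The value function Vψ(s) is trained to approximate the upper expectile of Q-values over in-distribution actions:
$$ L_V(\psi) = \mathbb{E}_{(s,a)\sim D}\left[ L_2^\tau\big( Q_{\hat\theta}(s, a) - V_\psi(s) \big) \right] $$
Here L_2^τ is the asymmetric squared loss from the expectile section. This fits V(s) to be the τ-expectile of Q(s, a) over the distribution of actions a that appear in the dataset at state s. With τ close to 1, V(s) ≈ max over dataset actions a of Q(s, a), an approximation of the optimal value that never queries Q at unseen actions.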
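The following sketch fits a single scalar V(s) by gradient descent on the expectile loss, using made-up Q-values at one state. The action names, the Q-table, the learning rate and the step count are all illustrative assumptions. The table deliberately includes an inflated value for an action that never occurs in the data, to show that the loss never reads it.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss L_tau(diff), with diff = Q(s, a) - V(s)."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_value(q_seen, tau, lr=0.1, steps=2000):
    """Fit a scalar V(s) by gradient descent on the mean expectile loss."""
    v = 0.0
    for _ in range(steps):
        grad = 0.0
        for q in q_seen:
            diff = q - v
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # d/dv of weight * (q - v)^2
        v -= lr * grad / len(q_seen)
    return v


# Hypothetical Q-values at one state, including one out-of-distribution entry.
q_table = {"left": 1.0, "stay": 2.0, "right": 3.0, "ood_jump": 1000.0}
dataset_actions = ["left", "stay", "right", "right"]  # actions logged at s
q_seen = [q_table[a] for a in dataset_actions]        # "ood_jump" is never read

v_mean = fit_value(q_seen, tau=0.5)
v_upper = fit_value(q_seen, tau=0.9)
print(f"tau=0.5 -> V={v_mean:.3f}, tau=0.9 -> V={v_upper:.3f}")
```

However large the OOD entry is, V stays between the smallest and largest Q-values of actions that were actually logged.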
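The Q-function is then trained with a standard TD loss:
$$ L_Q(\theta) = \mathbb{E}_{(s,a,r,s')\sim D}\left[ \big( r + \gamma V_\psi(s') - Q_\theta(s, a) \big)^2 \right] $$
Instead of max_{a'} Q(s', a') as the bootstrap target, it uses V(s'). Since V was trained via expectile regression to approximate the max, this avoids explicitly maximizing over actions. The target network Qθ̂ (a slow Polyak average of θ) supplies the Q-values in the V loss, which stabilizes training.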
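As a minimal sketch of these two pieces, the helpers below compute a V-based TD target and a Polyak-averaged target update on plain lists of floats. The function names, the `done` flag, and the default values of `gamma` and `rate` are assumptions chosen for illustration, not values taken from this article.

```python
def td_target(reward, next_value, done, gamma=0.99):
    """Bootstrap from V(s') instead of a max over next actions."""
    return reward + (0.0 if done else gamma * next_value)


def polyak_update(target_params, online_params, rate=0.005):
    """Move each target parameter a small step toward its online counterpart."""
    return [(1.0 - rate) * t + rate * o
            for t, o in zip(target_params, online_params)]


target = td_target(reward=1.0, next_value=10.0, done=False, gamma=0.9)
new_target_params = polyak_update([0.0, 2.0], [1.0, 2.0], rate=0.1)
```

Note that `td_target` takes a single number V(s'); there is no argument through which an unseen action could enter.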
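The policy is then extracted from Q and V. In practice, this is implemented as weighted behavioral cloning:
$$ L_\pi(\phi) = -\,\mathbb{E}_{(s,a)\sim D}\left[ \exp\big( \beta A(s, a) \big) \log \pi_\phi(a \mid s) \right] $$
where A(s, a) = Q(s, a) − V(s) is the advantage. Actions with positive advantage (better than the expectile value V(s)) get upweighted. The inverse temperature β controls how aggressively we favor high-advantage actions.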
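The weighting can be sketched in a few lines. The helper names, the default β, and the weight cap `max_weight` are assumptions for illustration; capping exponentiated advantages is a common practical safeguard, but the specific value here is not taken from this article.

```python
import math


def awr_weights(q_values, v_values, beta=3.0, max_weight=100.0):
    """Per-sample weights exp(beta * advantage), capped at max_weight."""
    log_cap = math.log(max_weight)
    weights = []
    for q, v in zip(q_values, v_values):
        advantage = q - v
        # Clamp the exponent so very large advantages cannot overflow.
        weights.append(math.exp(min(beta * advantage, log_cap)))
    return weights


def weighted_bc_loss(weights, log_probs):
    """Negative weighted log-likelihood of the dataset actions."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(weights)


q_batch = [1.0, 2.0, 3.0]
v_batch = [2.0, 2.0, 2.0]
weights = awr_weights(q_batch, v_batch, beta=1.0)
loss = weighted_bc_loss([1.0, 1.0], [-1.0, -3.0])
```

Only dataset actions appear in `log_probs`, so the policy is never trained toward an action it has not seen.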
Diagram showing how the three losses interact. V approximates the implicit max of Q, Q bootstraps from V, and the policy is extracted via advantage-weighted BC.
This is the mathematical heart of IQL. Why does expectile regression on V give us a good approximation of the max over actions?
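Consider the following constrained optimization problem at a single state s:
$$ \max_{\pi} \ \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q(s, a) \right] \quad \text{s.t.} \quad D_f\big( \pi(\cdot \mid s) \,\|\, \pi_\beta(\cdot \mid s) \big) \le \epsilon $$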
This asks: what's the best policy we can find that stays near the behavior policy? The solution to this problem — when the closeness constraint uses a specific f-divergence — turns out to be equivalent to the τ-expectile of Q(s, a) under πβ.
More precisely, the optimal value of this constrained problem equals Vτ(s) when the constraint strength is linked to τ. As τ → 1, the constraint loosens and V approaches the unconstrained max. As τ → 0.5, the constraint tightens and V approaches the mean (behavioral cloning).
Crucially, the max is taken only over actions in the support of πβ. If an action never appears in the dataset, it cannot influence V. This is what makes IQL safe offline:
$$ \lim_{\tau \to 1} V_\tau(s) = \max_{a \,:\, \pi_\beta(a \mid s) > 0} Q(s, a) $$
Q-values for different actions at a single state. The warm dots show Q(s, a) for in-distribution actions. The teal line is Vτ(s), which moves from the mean toward the max as τ increases. The red region shows OOD actions, which are never queried.
One of the most important capabilities an offline RL algorithm can have is trajectory stitching — the ability to combine good parts from different suboptimal trajectories to produce a better-than-any-single-trajectory policy.
Consider an AntMaze task: a robot ant must navigate from start to goal. Suppose your dataset contains two trajectories, A and B, each covering only part of the route.
Neither trajectory solves the task. But a smart algorithm can stitch the first half of A with the second half of B to find the complete solution. This requires multi-step dynamic programming — the algorithm must propagate value information across trajectories.
IQL's Bellman backup with V targets enables stitching naturally:
$$ Q(s, a) \leftarrow r + \gamma V(s') $$
Because V(s') approximates the best achievable value from s' (over in-distribution actions), the Q-function at state s can "see" the value of reaching s' even if the original trajectory from s never reached the goal. As long as the dataset covers the transition from s to s' and from s' onward (possibly from a different trajectory), the value propagates backward through the Bellman backup.
Methods like advantage-weighted regression (AWR) without multi-step Bellman backups also struggle. They weight trajectories by their total return — so a trajectory that gets halfway to the goal still gets low weight because it never receives the goal reward. IQL's dynamic programming propagates goal rewards backward through Q/V, giving credit to states that are on the path to success even if the trajectory that visited them failed.
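Stitching can be reproduced on a toy problem. The sketch below is a tabular simplification under stated assumptions, not the neural-network algorithm: a five-state chain with deterministic transitions, a hypothetical trajectory A that starts at state 0 and turns back at state 2, and a hypothetical trajectory B that starts at state 2 and reaches the goal. V is set to the exact τ-expectile of Q over logged actions, and Q is set directly to its TD target.

```python
def expectile(samples, tau, iters=50):
    """Sample tau-expectile via repeated asymmetric weighted means."""
    m = sum(samples) / len(samples)
    for _ in range(iters):
        w = [tau if x > m else 1.0 - tau for x in samples]
        m = sum(wi * x for wi, x in zip(w, samples)) / sum(w)
    return m


GOAL = 4
# Each entry is (state, action, reward, next_state, done).
traj_a = [(0, "right", 0.0, 1, False),
          (1, "right", 0.0, 2, False),
          (2, "left", 0.0, 1, False)]    # A turns back and never sees a reward
traj_b = [(2, "right", 0.0, 3, False),
          (3, "right", 1.0, GOAL, True)]  # B starts midway and reaches the goal
dataset = traj_a + traj_b


def tabular_iql(dataset, tau=0.9, gamma=0.9, sweeps=200):
    """Alternate V and Q updates using only logged (state, action) pairs."""
    q = {(s, a): 0.0 for s, a, _, _, _ in dataset}
    v = {}
    for s, _, _, s2, _ in dataset:
        v[s] = 0.0
        v[s2] = 0.0
    for _ in range(sweeps):
        # V step: upper expectile of Q over actions logged at each state.
        for s in v:
            seen = [q[(s0, a)] for s0, a, _, _, _ in dataset if s0 == s]
            if seen:
                v[s] = expectile(seen, tau)
        # Q step: with deterministic transitions the TD minimizer is the target.
        for s, a, r, s2, done in dataset:
            q[(s, a)] = r + (0.0 if done else gamma * v[s2])
    return q, v


q, v = tabular_iql(dataset)
mc_return_a = sum(r for _, _, r, _, _ in traj_a)  # return seen by trajectory A
print(f"Monte Carlo return of A: {mc_return_a}")
print(f"V(0) after backups: {v[0]:.3f}")
print(f"Q(2, right)={q[(2, 'right')]:.3f}  Q(2, left)={q[(2, 'left')]:.3f}")
```

Trajectory A has a return of zero, so return-based weighting gives its states no credit. The backups, by contrast, carry B's goal reward through state 2 back to state 0, and at state 2 the action from B ends up with the higher Q-value.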
Two suboptimal trajectories (orange and blue), neither of which reaches the goal. IQL stitches them via Bellman backups: value propagation connects the trajectories.
IQL achieves state-of-the-art or competitive performance across all D4RL task categories. The results demonstrate IQL's key advantage: it excels on tasks that require stitching while matching or exceeding prior methods on standard tasks.
On MuJoCo locomotion tasks (HalfCheetah, Hopper, Walker2d) with medium, medium-replay, and medium-expert datasets, IQL matches or exceeds prior methods.
AntMaze is where IQL truly shines. These tasks require navigating a simulated ant through mazes, and the dataset contains suboptimal trajectories that don't individually solve the task.
On the hardest tasks (large mazes), IQL more than doubles CQL's success rate. This directly reflects trajectory stitching: the large maze requires combining 3-5 suboptimal trajectory segments, and IQL's multi-step value propagation handles this far better than CQL's conservative Q-function.
Success rates on D4RL AntMaze tasks. IQL (teal) dominates on tasks requiring trajectory stitching, especially the large mazes.
An often overlooked advantage of IQL: it's an excellent initialization for online RL. After offline pretraining, you can deploy the policy in the environment and continue improving it with online data.
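Many offline RL methods learn representations that are overly conservative or specialized to the offline data. When switched to online learning, they need to "unlearn" conservative biases before making progress. IQL doesn't have this problem because its Q-function is trained without an explicit conservatism penalty, so there is no pessimistic bias to undo once online data arrives.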
On AntMaze-large-diverse, online fine-tuning from an IQL initialization reaches ~90% success within 250k online steps. Starting from scratch (no offline pretraining) doesn't reach 90% even after 2M steps. Starting from CQL pretraining is slower to improve because it must first overcome the conservative bias.
BCQ (Fujimoto et al., 2019): Batch-Constrained Q-learning — first identified the OOD action problem in offline RL and constrained the policy to a VAE-decoded neighborhood of the data. IQL eliminates the need for a generative model of the data distribution.
CQL (Kumar et al., 2020): Conservative Q-Learning — adds a regularizer that pushes down Q-values at OOD actions while pushing up Q-values at in-distribution actions. IQL avoids the need for any OOD query by design, rather than mitigating it with penalties.
AWR (Peng et al., 2019): Advantage-Weighted Regression — IQL's policy extraction step is AWR, but IQL adds multi-step value propagation via Q/V learning. Pure AWR without Bellman backups can't stitch trajectories.
TD3+BC (Fujimoto & Gu, 2021): Adds a behavioral cloning regularizer to TD3. Simpler than IQL but less effective on stitching tasks.
Decision Transformer (Chen et al., 2021): Concurrent work that frames offline RL as sequence modeling. DT doesn't do dynamic programming (no Bellman backups), so it can't stitch. IQL and DT represent two philosophies: value-based stitching vs sequence-based return conditioning.
Cal-QL (Nakamoto et al., 2023): Calibrated Q-Learning, which keeps CQL-style conservatism but calibrates the learned Q-values so they do not fall below the value of a reference policy (such as the behavior policy), for better offline-to-online transfer.
DPO (Rafailov et al., 2023): Direct Preference Optimization for language models has a structural similarity: both IQL and DPO avoid explicit reward/value maximization by implicitly solving the optimization through a different loss function. DPO avoids learning a reward model; IQL avoids computing the max over actions.
Offline RL for robotics: IQL became a go-to baseline for real-robot offline RL due to its simplicity and stability — no adversarial training, no generative models, just three regression losses.