CS224R — Offline RL: The Complete Guide

Roadmap

What You'll Master

01 RL Recap & Value Functions
02 Why Offline RL?
03 Data Stitching
04 Distribution Shift
05 Filtered Behavior Cloning
06 Advantage-Weighted Regression
07 AWAC
08 Implicit Q-Learning
09 Summary & Connections

Chapter 01

RL Recap & Value Functions

Before we dive into offline RL, we need the machinery of value functions and temporal difference learning. These are the building blocks every offline method uses to extract signal from a fixed dataset.

Definition

State-Value Function — V^π(s)

V^π(s) = 𝔼_π[ Σ_t=0^∞ γ^t r(s_t, a_t) | s₀ = s ]

The expected discounted return starting from state s and following policy π thereafter. "How good is it to be in this state?"

Definition

Action-Value Function — Q^π(s, a)

Q^π(s, a) = 𝔼_π[ Σ_t=0^∞ γ^t r(s_t, a_t) | s₀ = s, a₀ = a ]

The expected return starting from state s, taking action a, and following π thereafter. "How good is this specific action in this state?"

Definition

Advantage Function — A^π(s, a)

A^π(s, a) = Q^π(s, a) − V^π(s)

How much better is action a compared to the average action in state s? Positive = better than average. Negative = worse. Zero = exactly average.
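
As a small illustration, assuming a discrete action set and the identity V^π(s) = Σ_a π(a|s) Q^π(s, a), the advantage can be computed directly from Q-values and action probabilities. The function name `advantages` and the numbers are hypothetical.

```python
def advantages(q_values, policy_probs):
    """Return (V(s), [A(s, a) for each action]) for a single state."""
    # V is the policy-weighted average of the Q-values.
    value = sum(p * q for p, q in zip(policy_probs, q_values))
    return value, [q - value for q in q_values]


# Two actions, chosen with equal probability by the policy.
value, adv = advantages(q_values=[3.0, 7.0], policy_probs=[0.5, 0.5])
```

Here the second action has a positive advantage and the first a negative one, and the policy-weighted advantages sum to zero.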

Fitting Value Functions from Data

Given a dataset of transitions (s, a, r, s'), how do we learn V or Q?

Definition

Monte Carlo (MC) Target

V(s_t) ← Σ_k=0^T-t γ^k r_t+k

Sum all future rewards in the trajectory. Unbiased but high variance (depends on all future randomness). Requires complete trajectories.

Definition

Temporal Difference (TD) Target

V(s_t) ← r_t + γ V(s_t+1)

Use one-step reward plus the estimated value of the next state. Biased (bootstraps from its own estimate) but low variance (only depends on one transition). Works with individual transitions.

Definition

TD for Q-function

Q(s_t, a_t) ← r_t + γ Q(s_t+1, a_t+1)

Same idea, but for action-values. The key question: where does a_t+1 come from? If from the dataset (SARSA), we evaluate π_β. If from π_θ, we evaluate the learned policy.
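
A minimal sketch of the MC and TD targets on one logged trajectory, assuming a finite episode whose value after the last step is zero. The function names, rewards, and stand-in value estimates are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def mc_returns(rewards, gamma):
    """Discounted return-to-go for every step of one complete trajectory."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def td_targets(rewards, next_values, gamma):
    """One-step targets r_t + gamma * V(s_{t+1}), one per transition."""
    return [r + gamma * v_next for r, v_next in zip(rewards, next_values)]


rewards = [1.0, 0.0, 2.0]
# Stand-in estimates for V(s_1), V(s_2), and V(terminal) = 0.
next_values = [0.5, 1.0, 0.0]

mc = mc_returns(rewards, gamma=0.5)
td = td_targets(rewards, next_values, gamma=0.5)
```

The MC targets need the whole reward sequence, while each TD target needs only one transition plus a value estimate, which is why TD works on shuffled transitions from a fixed dataset.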

Online RL Algorithm Summary

Algorithm	Type	Key Idea	Value Fn
Vanilla PG	On-policy	Weight log-probs by returns	None (or baseline V)
PPO	On-policy	Clipped IS ratio	V for advantage
SAC	Off-policy	Max entropy + replay buffer	Q (two critics)
DQN	Off-policy	Q-learning + target net	Q only

The key difference for offline

All these algorithms assume you can collect more data after each update. Offline RL removes this assumption entirely. You get a fixed dataset. No more interactions. This simple change breaks almost everything.

Interactive: hover to see V(s) and Q(s,a) for a simple gridworld. Red = low value, green = high value.

Chapter 02

Why Offline RL?

You're an engineer at a hospital. You have five years of patient treatment records — thousands of trajectories of diagnoses, treatments, and outcomes. You want to learn an improved treatment policy. But you can't just "try random actions" on real patients to explore. Online RL would require experimenting with people's lives.

This is the core motivation for offline reinforcement learning (also called batch RL): learning a policy from a fixed dataset of previously collected experience, without any further interaction with the environment.

Definition

Offline RL Problem

Given D = { (s_i, a_i, r_i, s'_i) }_i=1^N collected by π_β

Find a policy π_θ that maximizes expected return, without ever interacting with the environment again. The dataset D is all you get. π_β is the behavior policy — whatever policy (or mix of policies) generated the data.
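
One possible in-memory representation of D, sketched under the assumption of small discrete states and actions. The `Transition` type, its field names (including the `done` flag), and `sample_batch` are hypothetical helpers, not part of any particular library.

```python
import random
from typing import List, NamedTuple


class Transition(NamedTuple):
    state: int
    action: int
    reward: float
    next_state: int
    done: bool  # True if next_state is terminal


def sample_batch(dataset: List[Transition], batch_size: int,
                 rng: random.Random) -> List[Transition]:
    """Sample transitions uniformly with replacement from the fixed dataset."""
    return [rng.choice(dataset) for _ in range(batch_size)]


dataset = [
    Transition(0, 1, 0.0, 1, False),
    Transition(1, 0, 1.0, 2, True),
]
batch = sample_batch(dataset, batch_size=4, rng=random.Random(0))
```

Nothing in this loop ever touches an environment: every training batch is drawn from the same fixed list.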

Definition

Behavior Policy — π_β

The policy (or mixture of policies) that generated the dataset. Could be a single expert, a random agent, a mix of human operators over time, or the output of a previous RL algorithm. We often don't know π_β explicitly — we only observe its data.

Where Offline Data Comes From

Domain	Data Source	Why Online RL Fails
Healthcare	Patient records (5+ years)	Can't experiment on patients
Autonomous driving	Millions of miles of logs	Dangerous to explore
Robotics	Teleoperation demos	Robots break, data is slow
Dialogue	Human conversation logs	Users don't tolerate bad outputs
Recommendation	User interaction history	Bad recs lose engagement

The offline RL promise

Train policies entirely from logged data. No simulator needed. No dangerous exploration. No expensive environment interactions. Just D = {(s, a, r, s')} and an algorithm.

Online vs. Offline: The Fundamental Difference

In online RL, the agent collects data, updates its policy, then collects more data with the improved policy. The data distribution improves over time because the agent explores more promising regions.

In offline RL, the dataset is fixed. The data was collected by π_β, which might be suboptimal, might cover only a small part of the state space, and might have dangerous gaps. You must learn the best policy you can from whatever data you got.

Left: Online RL iterates between data collection and policy improvement. Right: Offline RL learns from a fixed dataset only.

Chapter 03

Data Stitching

Why not just do imitation learning on the best trajectories? After all, if the dataset contains some good trajectories, can't we just clone those?

The answer reveals the deepest insight of offline RL: data stitching. The optimal policy might never appear as a complete trajectory in the data, but it might be constructible from pieces of different trajectories.

The Stitching Example

Consider a navigation task with states s₁ through s₉ (s₉ = goal):

Trajectory A: s₁ → s₃ → s₅ → s₇ (dead end, reward = 3)

Trajectory B: s₄ → s₃ → s₆ → s₉ (reaches goal from s₃, reward = 10)

Stitched policy: s₁ → s₃ (from A) → s₆ → s₉ (from B). Reward = 10!

Neither trajectory alone reaches the goal from s₁. But combining the first step of A with the tail of B creates a policy that does. This is stitching.

Why imitation learning can't stitch

Behavioral cloning learns to mimic entire trajectories. It copies actions from s₁ → s₃ → s₅ → s₇ (the only trajectory starting at s₁). It never discovers that going s₃ → s₆ is better, because that transition only appears in trajectory B which starts at s₄.

TD Learning Enables Stitching, MC Does Not

The ability to stitch comes from how we estimate value.

Monte Carlo estimates V(s₃) by averaging the returns of trajectories passing through s₃. Trajectory A gives return 3 from s₃. Trajectory B gives return 10 from s₃. MC average: V(s₃) ≈ 6.5. But MC only credits a state with returns that were actually observed after visiting it. The only trajectory through s₁ is A, so MC can never assign s₁ more than A's return of 3: no single trajectory achieves return 10 from s₁.

Temporal Difference propagates value through the Bellman equation: V(s₁) = r + γ V(s₃). Since V(s₃) is high (because from s₃ you can reach s₉), TD assigns high value to s₁ — even though no trajectory starting at s₁ ever reached s₉. TD stitches across trajectories via the Bellman backup.

Why TD Stitches

V(s₁) = r(s₁,a) + γ V(s₃) [from trajectory A: s₁ → s₃]
V(s₃) = r(s₃,a) + γ V(s₆) [from trajectory B: s₃ → s₆]
V(s₆) = r(s₆,a) + γ V(s₉) [from trajectory B: s₆ → s₉]

↑ Value propagates backward through Bellman chains, across trajectories
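
The backward propagation can be checked on the two logged trajectories with a tabular sketch. It rests on several assumptions that are not in the example itself: transitions are deterministic, an action is identified with the state it leads to, the rewards 3 and 10 are paid on the final step of each trajectory, γ = 0.9, and the backup takes a max only over successors that appear in the data. `fitted_q` is a hypothetical helper name.

```python
def fitted_q(transitions, gamma, sweeps=20):
    """Tabular Q over logged (state, next_state, reward) triples."""
    successors = {}
    for s, s_next, _ in transitions:
        successors.setdefault(s, set()).add(s_next)
    q = {(s, s_next): 0.0 for s, s_next, _ in transitions}
    for _ in range(sweeps):
        for s, s_next, r in transitions:
            # Max over in-data successors only; terminal states contribute 0.
            best_next = max(
                (q[(s_next, s2)] for s2 in successors.get(s_next, ())),
                default=0.0,
            )
            q[(s, s_next)] = r + gamma * best_next
    return q


transitions = [
    # Trajectory A: s1 -> s3 -> s5 -> s7 (dead end)
    ("s1", "s3", 0.0), ("s3", "s5", 0.0), ("s5", "s7", 3.0),
    # Trajectory B: s4 -> s3 -> s6 -> s9 (goal)
    ("s4", "s3", 0.0), ("s3", "s6", 0.0), ("s6", "s9", 10.0),
]
q = fitted_q(transitions, gamma=0.9)

# Discounted return actually observed from s1 (trajectory A only).
mc_return_from_s1 = 0.0 + 0.9 * 0.0 + 0.9 ** 2 * 3.0
```

Under these assumptions the TD value of leaving s₁ toward s₃ reflects the goal reached in trajectory B, while the Monte Carlo return from s₁ only ever sees the dead end.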

Data stitching: TD combines segments from different trajectories (blue and orange) to find the optimal path (green).

Stitching requires TD — and TD requires querying values at next states

The Bellman backup V(s) = r + γV(s') or Q(s,a) = r + γQ(s',a') requires evaluating the value function at the next state (or state-action). In offline RL, this is where trouble begins — what if s' or a' is outside the data distribution?

Checkpoint — Before you move on

You now understand stitching: TD learning combines partial trajectories to find optimal paths that never appeared in the data. But there's a catch. What could go wrong when TD bootstraps from Q(s', a') and the action a' was never taken in the dataset? Think about what the Q-function "knows" vs what it's "guessing."


Model Answer

When Q(s', a') is queried for an action a' that never appeared in the dataset at state s', the neural network is extrapolating. Neural networks are universal approximators for interpolation but terrible at extrapolation. The Q-function has no training signal for these out-of-distribution (OOD) actions, so its estimate is essentially random — and often overestimates the true value (because the policy is optimized to find high-Q actions, and random extrapolation noise is more likely to be picked up when it's positive). This OOD overestimation is the core failure mode of naive offline RL.
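
The selection effect described above can be simulated without any neural network. This sketch assumes every action has the same true value and that estimation errors are independent zero-mean Gaussian noise, which is a simplification of real extrapolation error. `mean_max_estimate` is a hypothetical helper.

```python
import random


def mean_max_estimate(true_q, noise_std, num_actions, num_trials, seed=0):
    """Average of max_a Q_hat(s, a) when Q_hat = true_q + zero-mean noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        estimates = [true_q + rng.gauss(0.0, noise_std)
                     for _ in range(num_actions)]
        total += max(estimates)  # a greedy policy picks the largest estimate
    return total / num_trials


single_action = mean_max_estimate(5.0, 1.0, num_actions=1, num_trials=2000)
four_actions = mean_max_estimate(5.0, 1.0, num_actions=4, num_trials=2000)
```

With one action the average estimate stays near the true value, but taking a max over several noisy estimates is biased upward even though each individual error has zero mean.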

Chapter 04

Distribution Shift Deep Dive

Here's the fundamental problem. You have a dataset collected by π_β. You want to learn a policy π_θ that's better than π_β. To improve, you need to evaluate actions that π_β never took. But your Q-function was only trained on actions that π_β did take. You're asking the Q-function to extrapolate to regions it's never seen.

Definition

Out-of-Distribution (OOD) Action

An action a at state s that has zero or negligible probability under the behavior policy: π_β(a|s) ≈ 0. The dataset contains no (or almost no) examples of (s, a) pairs where this action was taken in this state. The Q-function's estimate Q(s, a) for OOD actions is unreliable.

Definition

Data Support

The set of (s, a) pairs that appear in the dataset D with non-negligible frequency. supp(D) = {(s, a) : π_β(a|s) > 0}. Actions outside the support are OOD.

The Off-Policy Actor-Critic on Static Data

Consider running a standard off-policy actor-critic (like SAC) on a fixed dataset. The algorithm has two components:

Q-function update (critic): Q(s, a) ← r(s, a) + γ Q(s', a') where a' = arg max_a Q(s', a)

Policy update (actor): π_θ ← arg max_θ 𝔼_{s ~ D}[ Q(s, π_θ(s)) ]

The actor finds actions that maximize Q. The critic bootstraps from Q(s', a') where a' comes from the actor. But a' is the action the learned policy would take — not the action in the dataset. If π_θ ≠ π_β, then a' is likely OOD.

The Overestimation Feedback Loop

Here's where it gets catastrophic. It's not just that Q is wrong for OOD actions — the error compounds:

Hand Calculation: Error Explosion

Suppose the true Q*(s, a) = 5 for all (s, a). The behavior policy covers actions a₁, a₂, a₃ at state s. There exists an OOD action a₄ that π_β never takes.

Step 1: Neural net randomly assigns Q(s, a₄) = 8 (extrapolation error ε = +3).

Step 2: Policy selects a₄ because Q(s, a₄) = 8 > Q(s, a_1..3) = 5.

Step 3: At some other state s', the Bellman target becomes r + γ Q(s, a₄) = r + 0.99 × 8 = r + 7.92 instead of r + 4.95. This overestimate propagates to Q(s').

Step 4: The inflated Q(s') causes MORE OOD actions to look good, creating more overestimates...

After many backups: the error approaches ε/(1−γ). With ε = 3, γ = 0.99: max error ≈ 3/(1−0.99) = 300. Q-values explode to hundreds while true values are 5.

OOD Error Propagation Bound: max|Q̂(s,a) − Q*(s,a)| ≤ ε_OOD / (1 − γ)

ε_OOD = max extrapolation error on OOD actions. With γ close to 1, this blows up.
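
A quick numerical check of how the bound is approached, assuming the worst case in which each backup adds a fresh local error ε on top of the discounted error already present (the recursion E_{k+1} = ε + γ E_k with E₀ = ε). The function name is hypothetical.

```python
def propagated_error(eps, gamma, num_backups):
    """Worst-case error after num_backups applications of E <- eps + gamma * E."""
    error = eps  # E_0: the raw extrapolation error
    for _ in range(num_backups):
        error = eps + gamma * error
    return error


after_one = propagated_error(3.0, 0.99, num_backups=1)
after_many = propagated_error(3.0, 0.99, num_backups=5000)
limit = 3.0 / (1.0 - 0.99)
```

The error after one backup is still close to ε, and it climbs toward ε/(1−γ) as backups accumulate, which matches the observation that training longer makes things worse.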

The vicious cycle

1. Policy picks OOD action (high Q) → 2. Bellman backup propagates overestimate → 3. More states get inflated Q → 4. Policy picks more OOD actions → repeat. This is NOT a convergence issue — training longer makes it worse.

Q-value overestimation: blue = true Q values (in-distribution), red = estimated Q on OOD actions. Click "Run" to watch the feedback loop explode.

The Core Challenge of Offline RL

Every offline RL algorithm is, at its core, a different answer to one question: how do we prevent the policy from exploiting unreliable Q-estimates on OOD actions?

Three families of solutions:

Approach	Strategy	Methods
Constrain the policy	Keep π_θ close to π_β	Filtered BC, AWR, AWAC, BEAR
Constrain the value	Penalize Q on OOD actions	CQL
Avoid OOD queries	Never evaluate Q on actions outside data	IQL

🔨 Derivation: Derive the OOD Error Propagation Bound ε/(1−γ)

The Bellman equation for the estimated Q is: Q̂(s,a) = r + γ max_a' Q̂(s',a'). Suppose that for any OOD action, the estimate has error at most ε: |Q̂(s,a_OOD) − Q*(s,a_OOD)| ≤ ε.

Your task: (1) Show that after one Bellman backup, the error at the backed-up state is at most γε + ε. (2) Show that after infinite backups, the maximum error converges to ε/(1−γ). (3) Compute the numerical bound for γ=0.99 and ε=1.

If the policy picks an OOD action at s' because Q̂(s',a'_OOD) is overestimated by ε, then the Bellman target r + γQ̂(s',a') is off by at most γε. Add the local error ε at (s,a) itself and you get γε + ε = ε(1+γ) for one step. But the recursive structure means this compounds.

Let E_k = max error after k backups. E₀ = ε. E_k+1 = ε + γE_k (local error + discounted propagated error). This is a geometric series: E_∞ = ε + γε + γ²ε + ... = ε/(1−γ).

Part 1 — One backup:

Q̂(s,a) = r + γ max_a' Q̂(s',a'). The max selects an OOD action with error ε. True target: r + γQ*(s',a*) where a* is the truly optimal action.

|Q̂(s,a) − Q*(s,a)| = |γ(Q̂(s',a'_OOD) − Q*(s',a*))| ≤ γ · ε (from the OOD error at s').

But Q̂(s,a) itself may also have local error ε, giving total ≤ ε + γε.

Part 2 — Infinite backups:

E_k+1 = ε + γ E_k. Unrolling: E_k = ε(1 + γ + γ² + ... + γ^k) = ε(1 − γ^k+1)/(1 − γ).

As k → ∞: E_∞ = ε/(1 − γ) ■

Part 3 — Numerical: γ=0.99, ε=1: Error bound = 1/(1−0.99) = 100. A "small" extrapolation error of 1 becomes a Q-value error of 100 after full propagation.

The key insight: Discount factor γ determines the effective horizon 1/(1−γ). The same factor that lets TD "see far into the future" also amplifies errors far into the future. High γ = powerful stitching BUT catastrophic error amplification. This is the fundamental tension of offline RL.

Chapter 05

Filtered Behavior Cloning

The simplest offline RL algorithm. So simple it barely counts as RL — but it's a strong baseline that many complex methods fail to beat.

The idea: rank all trajectories in D by their total return. Keep only the top k%. Do imitation learning on this filtered subset.

Algorithm: Filtered Behavior Cloning

Compute returns: For each trajectory τ_i in D, compute R(τ_i) = Σ_t r_t
Rank & filter: Sort trajectories by R(τ). Keep top k% (e.g., k=10)
Imitate: θ* = arg min_θ Σ_{(s,a) ∈ D_top} −log π_θ(a|s)
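
The first two steps can be sketched in a few lines, assuming each trajectory is a list of (state, action, reward) tuples and that returns are undiscounted sums as in step 1. `filter_top_percent` is a hypothetical helper, and rounding the kept count up (so at least one trajectory survives) is a choice made here, not something the algorithm prescribes.

```python
import math


def filter_top_percent(trajectories, k_percent):
    """Return the (state, action) pairs of the top k% trajectories by return."""
    scored = [(sum(r for _, _, r in traj), traj) for traj in trajectories]
    # Sort by return only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, math.ceil(len(scored) * k_percent / 100))
    kept = [traj for _, traj in scored[:n_keep]]
    return [(s, a) for traj in kept for s, a, _ in traj]


trajectories = [
    [("s0", "a0", 1.0)],                      # return 1
    [("s1", "a1", 5.0), ("s2", "a2", 4.0)],   # return 9
    [("s3", "a3", 2.0)],                      # return 2
]
top_pairs = filter_top_percent(trajectories, k_percent=10)
```

The resulting pairs form D_top, on which an ordinary behavior-cloning loss (step 3) is minimized.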

Hand Calculation

Dataset has 100 trajectories with returns: [2, 5, 1, 8, 3, 7, 9, 4, 6, 10, ...].

Filter top 10%: keep the 10 trajectories with highest returns (R ≥ 8).

Train policy to imitate only these 10 "good" trajectories.

Result: Policy mimics the best behavior seen in the data. Cannot exceed the best trajectory's return. Cannot stitch.

Strengths and weaknesses

Strengths: Simple. No Q-function. No OOD problem (only imitates data actions). No distribution shift. Surprisingly competitive on many benchmarks.

Weaknesses: (1) Cannot stitch — bounded by best trajectory. (2) Wastes information — ignores what makes bad trajectories bad. (3) The "top k%" threshold is a hyperparameter that requires tuning.

Filtered BC: slide k% to see how filtering threshold affects learned behavior. Only shaded trajectories are imitated.

Chapter 06

Advantage-Weighted Regression

Filtered BC is binary: keep or discard. But some trajectories are slightly good, some are very good. Shouldn't better trajectories get more weight? AWR makes the weighting continuous.

The Core Idea

Instead of hard filtering, weight each transition (s, a) by exp(A(s,a) / α) where A is the advantage and α is a temperature:

AWR Objective: π_θ = arg max_θ 𝔼_{(s,a) ~ D}[ exp(A(s,a) / α) · log π_θ(a|s) ]

Definition

Temperature — α

Controls how aggressively we weight good actions over bad ones. α → ∞: uniform weighting (behavior cloning). α → 0: puts all weight on the single best action (hard argmax). Typical values: α ∈ [0.1, 10].

Why Exponential Weighting?

AWR Approximates KL-Constrained Optimization

The AWR objective is derived from the closed-form solution to:

max_π 𝔼_{s~D, a~π}[A(s,a)] − α · D_KL(π || π_β)

Maximize advantage while staying close to the behavior policy. The solution is:

π*(a|s) ∝ π_β(a|s) · exp(A(s,a) / α)

Since we can't sample from π* directly (we don't know π_β or the normalizer), we project onto our policy class via weighted maximum likelihood — which gives the AWR objective.
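
For a single state with a handful of discrete actions, the closed form can be checked numerically. The sketch below assumes π_β is known, which is exactly what is not available in practice, and uses made-up probabilities and advantages. The helper names are hypothetical.

```python
import math


def kl_regularized_objective(pi, pi_beta, adv, alpha):
    """E_{a~pi}[A(s, a)] - alpha * KL(pi || pi_beta) for one state."""
    expected_adv = sum(p * a for p, a in zip(pi, adv))
    kl = sum(p * math.log(p / b) for p, b in zip(pi, pi_beta) if p > 0)
    return expected_adv - alpha * kl


def closed_form_policy(pi_beta, adv, alpha):
    """pi*(a|s) proportional to pi_beta(a|s) * exp(A(s, a) / alpha)."""
    unnorm = [b * math.exp(a / alpha) for b, a in zip(pi_beta, adv)]
    z = sum(unnorm)  # the normalizer Z(s)
    return [u / z for u in unnorm], z


pi_beta = [0.5, 0.3, 0.2]
adv = [-1.0, 0.5, 2.0]
alpha = 1.0

pi_star, z = closed_form_policy(pi_beta, adv, alpha)
j_star = kl_regularized_objective(pi_star, pi_beta, adv, alpha)
j_beta = kl_regularized_objective(pi_beta, pi_beta, adv, alpha)
j_greedyish = kl_regularized_objective([0.1, 0.1, 0.8], pi_beta, adv, alpha)
```

In this example π* scores higher than both the behavior policy and a hand-picked, more aggressive policy, and its objective value equals α log Z(s).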

Computing the Advantage

AWR (original version) uses Monte Carlo returns for the advantage:

MC Advantage Estimate

A(s_t, a_t) = R_t − V(s_t)
R_t = Σ_k=0^T-t γ^k r_t+k (MC return)
V(s) ≈ (1/|visits(s)|) Σ_{t: s_t=s} R_t (or a learned V network)

Algorithm: Advantage-Weighted Regression (AWR)

Fit V: Train V_φ(s) to predict MC returns: minimize Σ_t (V_φ(s_t) − R_t)²
Compute advantages: A_t = R_t − V_φ(s_t) for all transitions in D
Compute weights: w_t = exp(A_t / α), then normalize: w_t ← w_t / Σ w
Weighted BC: θ ← arg max_θ Σ_t w_t · log π_θ(a_t|s_t)

Hand Calculation: AWR Weights

Three transitions at state s: (s, a₁, R=3), (s, a₂, R=7), (s, a₃, R=1). V(s) = 3.67 (average return).

Advantages: A₁ = 3 − 3.67 = −0.67, A₂ = 7 − 3.67 = 3.33, A₃ = 1 − 3.67 = −2.67

With α = 1: weights = exp(−0.67) = 0.51, exp(3.33) = 28.0, exp(−2.67) = 0.07

Normalized: w₁ = 0.018, w₂ = 0.980, w₃ = 0.002

Action a₂ gets about 98% of the weight! The policy almost exclusively imitates the best action.

With α = 5: weights = exp(−0.13) = 0.88, exp(0.67) = 1.95, exp(−0.53) = 0.59

Normalized: w₁ = 0.257, w₂ = 0.571, w₃ = 0.172

More uniform — higher temperature smooths the distribution.
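
The weights in this hand calculation can be recomputed with a short function. Subtracting the largest advantage before exponentiating is a numerical-stability choice made in this sketch. It cancels in the normalization and does not change the weights. `awr_weights` is a hypothetical helper name.

```python
import math


def awr_weights(returns, value, alpha):
    """Normalized exp(A / alpha) weights for transitions sharing one state."""
    advantages = [r - value for r in returns]
    max_adv = max(advantages)
    # Shift by the max so the largest exponent is exactly 0.
    unnorm = [math.exp((a - max_adv) / alpha) for a in advantages]
    total = sum(unnorm)
    return [u / total for u in unnorm]


returns = [3.0, 7.0, 1.0]
value = sum(returns) / len(returns)  # average return at this state

sharp = awr_weights(returns, value, alpha=1.0)
smooth = awr_weights(returns, value, alpha=5.0)
```

Lowering α concentrates the weight on the highest-return action, and raising it moves the weights back toward uniform behavior cloning.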

AWR weighting: adjust temperature α to see how exponential weights concentrate on high-advantage actions.

AWR Limitation: No Stitching

AWR uses MC returns → each transition's advantage depends on the trajectory it came from. A great action followed by bad luck gets low advantage. AWR cannot stitch because MC doesn't propagate value across trajectories.

🔨 Derivation: Prove that AWR Solves the KL-Constrained Policy Improvement

The KL-constrained objective is: max_π 𝔼_a~π[A(s,a)] − α D_KL(π(.|s) || π_β(.|s)).

Your task: (1) Write out the Lagrangian/functional derivative. (2) Solve for the optimal π*(a|s). (3) Show that π*(a|s) ∝ π_β(a|s) exp(A(s,a)/α). (4) Explain why we can't sample from π* directly and need the weighted regression step.

D_KL(π || π_β) = Σ_a π(a|s) log(π(a|s)/π_β(a|s)). So the objective becomes: Σ_a π(a|s)[A(s,a) − α log(π(a|s)/π_β(a|s))]. Take the functional derivative w.r.t. π(a|s) and set to zero (don't forget the constraint Σπ=1 via a Lagrange multiplier).

Setting ∂/∂π(a) = 0: A(s,a) − α[log(π/π_β) + 1] − λ = 0. Solving: log π*(a|s) = log π_β(a|s) + A(s,a)/α + const. Exponentiating: π*(a|s) ∝ π_β(a|s) exp(A(s,a)/α).

Step 1 — Write the Lagrangian:

L(π, λ) = Σ_a π(a|s)[A(s,a) − α log(π(a|s)/π_β(a|s))] − λ(Σ_a π(a|s) − 1)

Step 2 — Functional derivative:

∂L/∂π(a|s) = A(s,a) − α[log(π(a|s)/π_β(a|s)) + 1] − λ = 0

Step 3 — Solve:

log(π*(a|s)/π_β(a|s)) = A(s,a)/α − 1 − λ/α

π*(a|s) = π_β(a|s) · exp(A(s,a)/α) · exp(−1 − λ/α)

The last term is a constant (normalizer Z(s)): π*(a|s) = π_β(a|s) exp(A(s,a)/α) / Z(s) ■

Step 4 — Why weighted regression:

We can't sample from π* because: (a) we don't know π_β explicitly (only have its samples), (b) we can't compute Z(s) without summing over all actions. Instead, we use the dataset samples (which come from π_β) and weight them by exp(A/α). The weighted MLE: max_θ Σ_(s,a)~D exp(A(s,a)/α) log π_θ(a|s) projects π* onto our parametric policy class π_θ.

Chapter 07

AWAC: Advantage-Weighted Actor-Critic

AWR's fatal flaw: MC returns prevent stitching. AWAC fixes this by replacing MC with TD-learned Q-values.

AWAC = AWR + TD Q-function

Keep the beautiful exponential weighting from AWR. But compute the advantage using a learned Q-function (TD updates) instead of MC returns. This enables stitching while maintaining the policy constraint.

The AWAC Objective

AWAC Policy Update: π_θ = arg max_θ 𝔼_{(s,a) ~ D}[ exp((Q(s,a) − V(s)) / α) · log π_θ(a|s) ]

Same as AWR but advantage = Q(s,a) − V(s) from learned value functions

The Q-function Update

Here's where AWAC gets subtle. The Q-function is updated with standard TD:

AWAC Critic Update: Q_φ(s,a) ← r + γ Q_φ(s', a') where a' ~ π_θ(.|s')

The next-state action a' comes from the learned policy π_θ, not the behavior policy. This means Q evaluates π_θ (the thing we're improving), not π_β (the thing that collected data).

Wait — doesn't this have the OOD problem?

Yes! When we query Q(s', a') for a' ~ π_θ, that action might be OOD. AWAC's defense: the policy constraint (exponential weighting on data actions only) keeps π_θ close to π_β, so a' ~ π_θ is usually in-distribution. The KL penalty implicitly prevents the policy from drifting too far into OOD territory. But this is a soft constraint — not a guarantee.

Algorithm: AWAC (Advantage-Weighted Actor-Critic)

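Initialize policy π_θ, Q-function Q_φ, target Q-function Q_φ' ← Q_φ
Repeat:
1. Critic step: Sample (s, a, r, s') from D. Compute target: y = r + γ Q_φ'(s', a'), a' ~ π_θ(.|s'). Update: φ ← φ − α_Q ∇_φ(Q_φ(s,a) − y)², then move the target φ' slowly toward φ.
2. Actor step: Sample (s, a) from D. Compute advantage: A = Q_φ(s,a) − V(s). Weight: w = exp(A/α), using the same temperature α as in the AWAC objective (α_Q and α_π are learning rates, not temperatures). Update: θ ← θ + α_π ∇_θ(w · log π_θ(a|s))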
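
The two quantities at the heart of this loop can be sketched for a discrete action space. Two assumptions go beyond the text: the critic target uses the exact expectation over a' ~ π_θ instead of a single sample, and V(s) is taken to be the policy-weighted average of Q(s, ·). Function names and numbers are hypothetical, and the gradient steps are omitted.

```python
import math


def awac_critic_target(reward, gamma, q_next, pi_next, done=False):
    """y = r + gamma * E_{a' ~ pi_theta}[Q(s', a')], with no bootstrap at terminals."""
    if done:
        return reward
    expected_q_next = sum(p * q for p, q in zip(pi_next, q_next))
    return reward + gamma * expected_q_next


def awac_actor_weight(q_sa, q_s, pi_s, alpha):
    """exp((Q(s, a) - V(s)) / alpha) for a logged action a."""
    v_s = sum(p * q for p, q in zip(pi_s, q_s))  # assumed definition of V(s)
    return math.exp((q_sa - v_s) / alpha)


target = awac_critic_target(1.0, 0.9, q_next=[2.0, 4.0], pi_next=[0.5, 0.5])
weight = awac_actor_weight(q_sa=4.0, q_s=[2.0, 4.0], pi_s=[0.5, 0.5], alpha=1.0)
```

Note that `q_next` is evaluated at every action the learned policy might take in s', which is exactly where an out-of-distribution query can enter.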

AWAC vs AWR: The Stitching Test

Back to our navigation example. At state s₃:

AWR (MC): R(s₃, go-right) = 3 (from trajectory A ending in dead end). R(s₃, go-left) = 10 (from trajectory B reaching goal). AWR correctly weights go-left higher — but only because trajectory B happened to pass through s₃. It cannot propagate this back to s₁.

AWAC (TD): Q(s₃, go-left) = r + γQ(s₆, ...) = r + γ(r + γQ(s₉,...)) = high. This high Q propagates back via Bellman: Q(s₁, go-to-s₃) = r + γQ(s₃, go-left) = high. AWAC at s₁ weights "go to s₃" heavily — it stitched!

AWAC combines TD-learned Q-values (enabling stitching) with constrained policy updates (preventing OOD exploitation).

Chapter 08

Implicit Q-Learning (IQL)

AWAC still has a weakness: it queries Q(s', a') where a' comes from the learned policy. If π_θ occasionally picks OOD actions, the Q-function gets corrupted. Can we build a method that never evaluates Q on actions outside the dataset?

IQL's Key Insight

Standard Bellman backup: Q(s,a) = r + γ max_a' Q(s',a'). The max requires evaluating Q on actions we might never have seen. IQL replaces max with a learned V function that estimates the best in-data value — without ever querying Q on OOD actions.

The Three-Step IQL Update

IQL separates the update into three decoupled steps, each using only in-distribution quantities:

Step 1: Fit V with Expectile Regression. V_ψ(s) ← arg min_ψ 𝔼_{(s,a) ~ D}[ L₂^τ( Q_φ(s,a) − V_ψ(s) ) ]

where L₂^τ(u) = |τ − 1(u < 0)| · u²

Step 2: Standard TD for Q (using V, not max Q). Q_φ(s,a) ← r(s,a) + γ V_ψ(s')

No max! No policy query! Just r + γV(s') where V is from step 1.

Step 3: Extract Policy with AWR. π_θ ← arg max_θ 𝔼_{(s,a) ~ D}[ exp((Q_φ(s,a) − V_ψ(s)) / α) · log π_θ(a|s) ]

The Magic: Expectile Regression

Definition

Expectile Regression Loss — L₂^τ

L₂^τ(u) = |τ − 1(u < 0)| · u²

Expanded: L₂^τ(u) = τ · u² if u ≥ 0 (underestimate)
= (1−τ) · u² if u < 0 (overestimate)

An asymmetric squared loss. When τ = 0.5, this is standard MSE. When τ > 0.5, underestimates (u ≥ 0) are penalized τ/(1−τ) times more than overestimates. The minimizer converges to the τ-th expectile of the target distribution.

What Expectile Regression Does

Consider Q-values for state s in the dataset: Q(s, a₁) = 3, Q(s, a₂) = 5, Q(s, a₃) = 8.

τ = 0.5: V(s) minimizes mean squared error. Solution: V(s) ≈ 5.33 (the mean).

τ = 0.7: Underestimates penalized more. V(s) shifts toward 8. Solution: V(s) ≈ 6.2.

τ = 0.9: Strong asymmetry. V(s) ≈ 7.3 (close to the max).

τ → 1.0: V(s) → max Q(s,a) = 8 (approaches the maximum in the data).

High τ makes V(s) approximate max_{a ∈ data} Q(s,a) without ever computing a max over actions.
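
The expectile of a small set of Q-values can be computed exactly with a fixed-point iteration, which makes it easy to check the effect of τ. This sketch assumes 0 < τ < 1 and equally weighted samples. `expectile` is a hypothetical helper, and the iteration shown is one standard way to solve the first-order condition, not the gradient-based training used for V_ψ.

```python
def expectile(values, tau, iterations=100):
    """tau-th expectile of a finite sample, for 0 < tau < 1."""
    estimate = sum(values) / len(values)  # tau = 0.5 gives the mean
    for _ in range(iterations):
        # Samples above the estimate (underestimates) get weight tau,
        # samples at or below it get weight 1 - tau.
        weights = [tau if v > estimate else 1.0 - tau for v in values]
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate


q_values = [3.0, 5.0, 8.0]
v_by_tau = {tau: expectile(q_values, tau) for tau in (0.5, 0.7, 0.9, 0.99)}
```

As τ grows the estimate moves monotonically from the mean toward the largest Q-value in the sample, without ever exceeding it.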

Why this avoids OOD

Standard Bellman: Q(s,a) = r + γ max_a' Q(s',a'). The max is over ALL actions, including OOD ones. IQL: Q(s,a) = r + γ V(s'), where V(s') ≈ max_{a' ∈ data} Q(s',a'). The "max" is implicit — V learns to output the maximum over data actions only, through the asymmetric loss. We never query Q on any action outside the dataset.

Expectile regression: adjust τ to see how V(s) shifts from the mean toward the maximum of in-data Q-values.

Algorithm: Implicit Q-Learning (IQL)

Initialize Q_φ, V_ψ, π_θ, target Q_φ'
Repeat (sample mini-batch from D):
1. Update V: ψ ← ψ − α_V ∇_ψ 𝔼[ L₂^τ(Q_φ'(s,a) − V_ψ(s)) ]
2. Update Q: φ ← φ − α_Q ∇_φ 𝔼[ (Q_φ(s,a) − (r + γ V_ψ(s')))² ]
3. Update π: θ ← θ + α_π ∇_θ 𝔼[ exp((Q_φ'(s,a) − V_ψ(s))/β) · log π_θ(a|s) ] (β is the policy-extraction temperature, written α in Step 3 above)
4. Update target: φ' ← (1−ρ) φ' + ρ φ

python
import copy

import torch


class IQL:
    def __init__(self, q_net, v_net, policy, tau=0.7, beta=3.0, gamma=0.99):
        self.q = q_net       # Q(s, a) -> scalar
        self.v = v_net       # V(s) -> scalar
        self.pi = policy     # pi(a|s), assumed to expose log_prob(s, a)
        # Target copy of Q used for the V and policy updates; it should be
        # refreshed toward self.q as in step 4 of the algorithm above.
        self.q_target = copy.deepcopy(q_net)
        self.tau = tau
        self.beta = beta
        self.gamma = gamma

    def expectile_loss(self, diff):
        # Asymmetric squared loss: |tau - 1(diff < 0)| * diff^2
        weight = torch.abs(self.tau - (diff < 0).float())
        return (weight * diff**2).mean()

    def update_v(self, s, a):
        # Step 1: V toward high-Q actions via expectile regression
        with torch.no_grad():
            q_target = self.q_target(s, a)
        v_pred = self.v(s)
        loss = self.expectile_loss(q_target - v_pred)
        return loss

    def update_q(self, s, a, r, s_next, done=None):
        # Step 2: Standard TD using V (NO max, NO policy query)
        with torch.no_grad():
            v_next = self.v(s_next)
            if done is not None:
                # Do not bootstrap past terminal transitions
                v_next = (1 - done) * v_next
            target = r + self.gamma * v_next
        q_pred = self.q(s, a)
        loss = ((q_pred - target)**2).mean()
        return loss

    def update_policy(self, s, a):
        # Step 3: AWR-style extraction on dataset actions only
        with torch.no_grad():
            adv = self.q_target(s, a) - self.v(s)
            weights = torch.exp(adv / self.beta)
            weights = weights / weights.sum()  # normalize over the batch
        log_probs = self.pi.log_prob(s, a)
        loss = -(weights * log_probs).sum()
        return loss

🔨 Derivation: Prove that Expectile Regression Converges to the τ-th Expectile

The expectile loss is L₂^τ(u) = |τ − 1(u<0)| · u². The τ-th expectile m_τ of a distribution F is defined by: τ · 𝔼[(X − m_τ)⁺] = (1−τ) · 𝔼[(m_τ − X)⁺], where (x)⁺ = max(x,0).

Your task: Show that arg min_m 𝔼[L₂^τ(X − m)] = m_τ. Take the derivative, set to zero, and recover the expectile defining equation.

𝔼[L₂^τ(X−m)] = τ · 𝔼[(X−m)² | X≥m] P(X≥m) + (1−τ) · 𝔼[(X−m)² | X<m] P(X<m). Take d/dm and set to zero.

L(m) = 𝔼[L₂^τ(X − m)] = τ ∫_m^∞ (x−m)² dF(x) + (1−τ) ∫_−∞^m (x−m)² dF(x)

Taking d/dm:

dL/dm = −2τ ∫_m^∞(x−m) dF(x) + 2(1−τ) ∫_−∞^m(m−x) dF(x) = 0

This gives: τ 𝔼[(X−m)⁺] = (1−τ) 𝔼[(m−X)⁺]

This is exactly the first-order condition that defines the τ-th expectile m_τ. Since L(m) is convex in m, this stationary point is the minimizer. ■

Key insight: As τ → 1, the penalty on underestimation dominates, pushing m toward the maximum of the support. This is how IQL approximates "max Q over data actions" without ever computing an explicit max.

⚙ Implementation: IQL Three-Step Update From Scratch

Implement the full IQL training step. Given a batch of (s, a, r, s', done) tuples, compute the losses for V, Q, and policy networks. Pay attention to: (1) which networks use stop-gradient, (2) the asymmetric loss for V, (3) using V(s') NOT max Q(s',a') for the Q target.

Signature

def iql_step(batch, q_net, v_net, policy, q_target_net, tau=0.7, beta=3.0, gamma=0.99):
    """
    Args:
        batch: dict with keys 's', 'a', 'r', 's_next', 'done' (all tensors)
        q_net: Q(s, a) network
        v_net: V(s) network
        policy: pi(a|s) network with .log_prob(s, a) method
        q_target_net: target Q network (EMA of q_net)
        tau: expectile parameter (0.5 = mean, 0.9 = near-max)
        beta: temperature for AWR policy extraction
        gamma: discount factor
    Returns:
        v_loss, q_loss, pi_loss (scalars)
    """

Test Case

# With tau=0.5, V should converge to mean of Q values (standard MSE)
# As tau -> 1.0, V should approach the max of Q values over data actions
# Q target should NEVER involve max_a Q(s', a) or policy(s')

python
import torch


def iql_step(batch, q_net, v_net, policy, q_target_net, tau=0.7, beta=3.0, gamma=0.99):
    s, a, r, s_next, done = batch['s'], batch['a'], batch['r'], batch['s_next'], batch['done']

    # Step 1: V loss, expectile regression against target Q
    with torch.no_grad():
        q_values = q_target_net(s, a)        # Q from target net
    v_pred = v_net(s)
    diff = q_values - v_pred                  # positive = V underestimates
    weight = torch.abs(tau - (diff < 0).float())   # tau if diff >= 0, else 1 - tau
    v_loss = (weight * diff**2).mean()

    # Step 2: Q loss, TD with V(s'), NOT max Q(s',a')
    with torch.no_grad():
        v_next = v_net(s_next)                # V(s'), no policy needed
        target = r + gamma * (1 - done) * v_next
    q_pred = q_net(s, a)
    q_loss = ((q_pred - target)**2).mean()

    # Step 3: Policy loss, AWR with Q-V as advantage
    with torch.no_grad():
        adv = q_values - v_net(s)             # reuse the target-net Q from step 1
        weights = torch.exp(adv / beta)
        weights = weights / weights.sum()     # normalize for stability
    log_probs = policy.log_prob(s, a)
    pi_loss = -(weights * log_probs).sum()

    return v_loss, q_loss, pi_loss

Adversarial Quiz: What does IQL with τ = 0.5 reduce to?

Setting: You run IQL with expectile parameter τ = 0.5. All other hyperparameters are standard. You observe that V(s) converges to the mean of Q(s, a) over data actions at state s.

Standard Q-learning (because V approximates max Q)
AWAC (because we still use TD for Q)
SARSA-style policy evaluation of π_β (no improvement)
Behavioral cloning (because all advantages become zero)

Follow-up: Why doesn't τ=0.5 give policy improvement? What role does the asymmetry play in extracting a better-than-behavior policy?

With τ=0.5, V(s) = mean of Q(s,a) over data actions = expected value under π_β. The Q update uses r + γV(s'), which just evaluates π_β (SARSA-style). There's no "implicit max" happening: V doesn't favor high-Q actions over low-Q ones. The advantage Q(s,a) − V(s) still tells you how much better an action is than π_β's average, so the extraction step can reweight toward better-than-average data actions, but the value functions themselves never improve beyond π_β because V doesn't bootstrap from the best action.

The asymmetry (τ > 0.5) is essential: it makes V(s) approach max_a∈data Q(s,a), which means the Q-update r + γV(s') effectively does r + γ max Q(s',a') — the Bellman optimality equation — but restricted to in-data actions. Without asymmetry, there's no optimality pressure.

Chapter 09

Summary & Connections

The Progression of Ideas

Method	Policy Constraint	Value Learning	Stitching?	OOD Risk
Filtered BC	Hard (only data actions)	None	No	None
AWR	Soft (exp weighting)	MC returns	No	None
AWAC	Soft (exp weighting)	TD (Q with policy)	Yes	Moderate (policy query)
IQL	Soft (exp weighting)	TD (Q with V)	Yes	None (no OOD query)

The key tradeoff

More stitching power = more risk of OOD exploitation. Filtered BC is safe but limited. AWR adds soft weighting but still can't stitch. AWAC adds TD stitching but risks OOD through policy queries. IQL achieves stitching with zero OOD risk by replacing max with asymmetric regression. Each method trades off between expressiveness and safety.

The Three Solutions to Distribution Shift

Taxonomy

Solution 1: Constrain the Policy

Keep π_θ close to π_β so it only takes actions seen in data. Methods: Filtered BC, AWR, AWAC, BCQ.

Taxonomy

Solution 2: Constrain the Value

Explicitly penalize Q-values on OOD actions (push them down). Methods: CQL (adds penalty α 𝔼_a~π[Q(s,a)] − α 𝔼_a~D[Q(s,a)] to the Q-loss).
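
For one state with discrete actions, the penalty term quoted above can be written out directly. This sketch follows only the simplified form in the text, not the full CQL objective, and it estimates the data expectation from action counts. `cql_penalty` and the numbers are hypothetical.

```python
def cql_penalty(q_values, policy_probs, data_action_counts, alpha):
    """alpha * (E_{a~pi}[Q(s, a)] - E_{a~D}[Q(s, a)]) for a single state."""
    expected_q_policy = sum(p * q for p, q in zip(policy_probs, q_values))
    total = sum(data_action_counts)
    expected_q_data = sum(c / total * q
                          for c, q in zip(data_action_counts, q_values))
    return alpha * (expected_q_policy - expected_q_data)


# The third action never appears in the data but has an inflated Q-value.
q_values = [5.0, 5.0, 8.0]
ood_seeking = cql_penalty(q_values, [0.0, 0.0, 1.0], [1, 1, 0], alpha=1.0)
data_matching = cql_penalty(q_values, [0.5, 0.5, 0.0], [1, 1, 0], alpha=1.0)
```

The penalty is positive when the policy puts mass on actions whose Q-values exceed those of the logged actions, and it vanishes when the policy matches the data distribution, so minimizing it pushes the inflated estimate down.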

Taxonomy

Solution 3: Avoid OOD Entirely

Never query Q on actions outside data. Use implicit maximization through expectile regression. Methods: IQL.

Method comparison: each approach's tradeoff between stitching ability (performance ceiling) and safety (resistance to OOD errors).

🧪 Break-It Lab: Offline RL Failure Modes

Toggle off key components to see how offline RL breaks. The canvas below shows Q-value estimates and policy performance.

Remove conservatism (allow OOD exploitation)

Without conservatism, the policy exploits spurious high Q-values on OOD actions. Q-values explode. Actual performance crashes because the "good" actions the policy found don't actually work.

Disable stitching (use MC instead of TD)

Without TD/stitching, the method is bounded by the best complete trajectory in the dataset. Performance ceiling drops dramatically on tasks requiring combining sub-trajectories.

Remove advantage weighting (uniform BC)

Without advantage weighting, we imitate all data actions equally — including bad ones. The policy averages over good and bad behavior, performing about as well as the average trajectory in the dataset.

🏗 Design Challenge: Design Offline RL for a Hospital Treatment Optimizer

A hospital has 5 years of ICU patient records: vitals every hour, medication doses, lab results, and outcomes (survival/length of stay). You must design an offline RL system to recommend treatment adjustments. Safety is paramount — a bad recommendation could kill someone.

States

48-dim vitals + labs

Actions

25 med/dose combos

Dataset

50k trajectories

Safety

Must not diverge from clinical practice

Horizon

~72 hours

Reward

Survival + reduced stay

Decisions to make: (1) Which offline RL algorithm? (2) How conservative should it be? (3) How do you handle the action space? (4) How do you validate before deployment? (5) What τ / α values?

🔗 Cross-Domain Connection

Offline RL ↔ RLHF (Reinforcement Learning from Human Feedback)

Offline RL

Fixed dataset D from π_β
Reward: environment signal r(s,a)
Challenge: OOD actions overestimated
Solution: stay close to π_β

RLHF / DPO

Fixed dataset D of human preferences
Reward: learned from comparisons
Challenge: reward hacking (OOD text)
Solution: KL penalty to reference model

Both domains share the same core structure: learn from fixed data while staying close to a reference behavior. In RLHF, the "behavior policy" is the supervised fine-tuned (SFT) model. The KL penalty D_KL(π || π_SFT) serves exactly the same role as the policy constraint in offline RL — preventing the model from exploiting unreliable reward signals in OOD regions.

DPO (Direct Preference Optimization) is essentially AWR for language models: it derives a closed-form policy update from the KL-constrained objective, avoiding explicit reward modeling entirely — just like AWR derives weights from advantages without needing a separate policy optimization loop.

If RLHF has the same structure as offline RL, what's the analog of "data stitching" for language model alignment?
