Goldie, Mirhoseini, Zhou, Cai, Manning — Stanford & Google DeepMind, 2025

Step-Wise Reinforcement Learning

Single-step RL struggles to teach multi-step reasoning. SWiRL decomposes trajectories into sub-trajectories and applies RL on each step, improving relative accuracy on multi-step reasoning and tool-use benchmarks by up to 21.5% and generalizing across tasks.

Prerequisites: DPO / RLHF basics + multi-step reasoning + tool-augmented LLMs
10 chapters · 4+ simulations

Chapter 0: The Problem

Imagine you ask a language model: "Who is older, Glenn Hughes or Ross Lynch?" The model can't answer from memory alone. It needs to search for Glenn Hughes's age, search for Ross Lynch's age, then compare. Three steps, each depending on the last.

Standard RLHF and DPO treat the entire response as a single action. The model produces one output, gets one reward. This works fine for single-turn tasks like "summarize this article" or "translate this sentence." But for multi-step reasoning and tool use, single-step RL has a fundamental limitation.

The compounding error problem: In a multi-step chain, an error at step 2 corrupts all subsequent steps. Single-step RL only sees the final answer. If the answer is wrong, the model has no idea which step went wrong. Was the first search query bad? Did it misinterpret the search result? Did it fail at comparison? A single final reward gives no signal about the intermediate reasoning.

Consider a 5-step trajectory where the model searches, reads, searches again, reads again, then answers. With single-step RL, the model gets a thumbs-up or thumbs-down on the entire chain. The credit assignment problem is severe: reward must be distributed across all 5 steps, but the model has no mechanism to know which steps were good and which were bad.

Figure (interactive): Single-Step vs. Step-Wise RL. Compares single-step RL (one reward for the whole trajectory) with step-wise RL (feedback at each step); the step-wise view pinpoints the problematic step.

This isn't just a theoretical concern. The paper demonstrates it empirically: models trained with single-step RL (standard DPO on full trajectories) actually degrade on multi-step tasks compared to the base model. Supervised fine-tuning on the same data also hurts performance. The multi-step setting demands a fundamentally different approach.

Why does standard single-step RL struggle with multi-step reasoning tasks?

Chapter 1: The Key Insight

SWiRL's core idea is disarmingly simple:

Don't optimize the whole trajectory at once. Break each multi-step trajectory into sub-trajectories at every action boundary. Then apply RL independently on each sub-trajectory. A reward model evaluates each action given its context — not the final answer.

Here's what this means concretely. A trajectory with 3 tool calls produces 3 sub-trajectories:

  1. Sub-trajectory 1: The prompt → the model's first thought + first search query
  2. Sub-trajectory 2: The prompt + first action + search result → the model's second thought + second search query
  3. Sub-trajectory 3: Everything so far → the model's final reasoning + answer

Each sub-trajectory is scored independently by a reward model: "Given everything that has happened so far, is this next action reasonable?" The reward model doesn't need the ground-truth answer. It just evaluates whether each step makes sense in context.

This transforms the credit assignment problem. Instead of asking "was the final answer right?" you ask "was each individual decision reasonable?" An incorrect final answer might still contain 4 good steps and 1 bad one. SWiRL can learn from the good steps even when the trajectory fails.

No golden labels needed. Unlike outcome-based filtering (which requires knowing the correct answer), SWiRL's process-based approach only asks whether each step is reasonable. A model judge (Gemini 1.5 Pro in the paper) evaluates this. This means SWiRL can learn from trajectories that end in wrong answers, and the paper shows this actually helps performance.

What is SWiRL's core decomposition strategy?

Chapter 2: Multi-Step Trajectories

Before we can decompose trajectories, we need to understand their structure. In SWiRL, a trajectory is a sequence of alternating states and actions:

τ = (s_1, a_1, s_2, a_2, ..., s_K, a_K)

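The pieces are: s_1, the original prompt; a_i, the model's output at step i (its reasoning followed by either a tool call or a final answer); and K, the number of actions. After each tool call, the environment's response is added to the context that the next action sees.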

Here's a concrete example. The question is: "Who is older, Glenn Hughes or Ross Lynch?"

State s_1 (Prompt)
"Who is older, Glenn Hughes or Ross Lynch?"
Action a_1 (Thought + Tool Call)
"To figure out who is older, I should first search for the age of Glenn Hughes. <search_query>age of Glenn Hughes</search_query>"
Environment Response
"Glenn Hughes (born August 21, 1951) is an English rock bassist..."
Action a_2 (Thought + Tool Call)
"Glenn Hughes was born in 1951. Next, I should find Ross Lynch's age. <search_query>Ross Lynch age</search_query>"
Environment Response
"Ross Shor Lynch (born December 29, 1995) is an American singer..."
Action a_3 (Final Answer)
"Given the results of my previous searches, Glenn Hughes (born 1951) is older than Ross Lynch (born 1995). <answer>Glenn Hughes</answer>"

Notice the pattern: the model generates a chain of thought (its reasoning), then takes an action (either a tool call wrapped in special tags, or a final answer). The tool response comes from the environment (a search engine, calculator, etc.), not from the model. The trajectory ends when the model produces an <answer> tag.

Tool markers: The model uses <search_query>...</search_query> for search, <math_exp>...</math_exp> for calculator calls, and <answer>...</answer> to signal it's done. These tags are parsed and executed automatically — the model never sees raw API calls.
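The tag convention above is easy to picture as a small parser. This is a minimal sketch, not the paper's implementation: the tag names come from the text, while the regular expression, the function name `parse_action`, and the choice to return only the first tag found are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Tag names are taken from the text; the parsing strategy is an assumption.
TAG_PATTERN = re.compile(
    r"<(search_query|math_exp|answer)>(.*?)</\1>", re.DOTALL
)


def parse_action(model_output: str) -> Optional[Tuple[str, str]]:
    """Return (tag, payload) for the first recognised tag, or None if absent."""
    match = TAG_PATTERN.search(model_output)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


step = (
    "To figure out who is older, I should first search for the age of "
    "Glenn Hughes. <search_query>age of Glenn Hughes</search_query>"
)
print(parse_action(step))  # ('search_query', 'age of Glenn Hughes')
```

A real harness would then route `search_query` payloads to a search backend and `math_exp` payloads to a calculator, and stop the trajectory on `answer`.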
In SWiRL's trajectory notation τ = (s_1, a_1, ..., s_K, a_K), what does state s_i contain for i > 1?

Chapter 3: Step-Wise Decomposition

This is the mechanical heart of SWiRL. Given a trajectory with K actions, we create K sub-trajectories. Each sub-trajectory contains all the context up to step i, and the action at step i is the thing we're evaluating and optimizing.

The decomposition rule: A trajectory with K actions produces K sub-trajectories. Sub-trajectory i contains context (s_1, a_1, ..., s_i) as the "prompt" and a_i as the "response." Each sub-trajectory is an independent (context, action) pair for RL.

Let's see this concretely. Our 3-step Glenn Hughes trajectory becomes 3 sub-trajectories:

Figure (interactive): Trajectory Decomposition. For each sub-trajectory, shows the context the model receives (teal) and the action under evaluation (orange).

The key property: each sub-trajectory is self-contained. The reward model scores a_i given context s_i: "is this action reasonable given everything that happened before?" It doesn't need to know what comes after, and it doesn't need to know the correct final answer.
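The decomposition rule can be sketched in a few lines. This is an illustrative sketch under assumptions: the `Step` dataclass, the `decompose` function, and the newline-joined string context are hypothetical choices, not the paper's data format. The example text is the Glenn Hughes trajectory from Chapter 2.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    action: str       # model output: thought plus tool call or final answer
    observation: str  # environment response ("" after the final answer)


def decompose(prompt: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Turn a K-action trajectory into K (context, action) pairs."""
    pairs = []
    context = prompt
    for step in steps:
        pairs.append((context, step.action))
        # Everything seen so far becomes the context for the next action.
        context = "\n".join(
            part for part in (context, step.action, step.observation) if part
        )
    return pairs


PROMPT = "Who is older, Glenn Hughes or Ross Lynch?"
STEPS = [
    Step("I should first search for the age of Glenn Hughes. "
         "<search_query>age of Glenn Hughes</search_query>",
         "Glenn Hughes (born August 21, 1951) is an English rock bassist..."),
    Step("Next, I should find Ross Lynch's age. "
         "<search_query>Ross Lynch age</search_query>",
         "Ross Shor Lynch (born December 29, 1995) is an American singer..."),
    Step("Glenn Hughes is older. <answer>Glenn Hughes</answer>", ""),
]

pairs = decompose(PROMPT, STEPS)
print(len(pairs))  # 3
```

Note that the first pair's context is just the prompt, while the third pair's context contains both search results. Nothing from later steps leaks into an earlier pair.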

Formally, the objective becomes:

J(θ) = E_{s ~ T, a ~ π_θ(s)} [ R(a | s) ]

where T is the set of all states across all sub-trajectories in the dataset, and R(a | s) is the reward for action a given context s. Each sub-trajectory contributes one (state, action) pair to this expectation.

Compare this to single-step RL, which would be:

J_single(θ) = E_{τ ~ D} [ R(a_K | s_1) ]

Single-step RL only optimizes the final action given the original prompt. SWiRL optimizes every action given its full context.
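To make the contrast between the two objectives concrete, here is a small sketch that averages logged rewards both ways. It is only an illustration: the function names and the toy reward values are assumptions, and averaging over logged actions matches the expectation above only when those actions were sampled from the policy being evaluated.

```python
from typing import List


def stepwise_estimate(step_rewards: List[List[float]]) -> float:
    """Mean reward over every (state, action) pair in every trajectory."""
    flat = [reward for trajectory in step_rewards for reward in trajectory]
    return sum(flat) / len(flat)


def final_only_estimate(step_rewards: List[List[float]]) -> float:
    """Mean reward over final actions only (the single-step view)."""
    finals = [trajectory[-1] for trajectory in step_rewards]
    return sum(finals) / len(finals)


# Hypothetical per-step rewards for two trajectories (3 and 4 actions).
rewards = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0, 1.0]]

print(round(stepwise_estimate(rewards), 3))    # 0.714 (5 of 7 steps rewarded)
print(final_only_estimate(rewards))            # 0.5   (1 of 2 final actions)
```

The step-wise view uses seven training signals from these two trajectories; the final-only view uses two.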

A trajectory with 5 tool calls produces how many sub-trajectories in SWiRL?

Chapter 4: Synthetic Data Generation

SWiRL has two stages. Stage 1 is about generating and filtering the multi-step training data. You don't need human annotators writing multi-step trajectories — the model generates them itself.

Generation

The process is straightforward:

  1. Take a dataset of questions (e.g., 10,000 from HotPotQA, 7,500 from GSM8K)
  2. For each question, generate 5 trajectories using the base model (Gemma 2-27B)
  3. At each step, the model either calls a tool (search engine or calculator) or produces an answer
  4. Tool calls are executed automatically and results injected into context
  5. Trajectory ends when the model produces an <answer> tag or hits the step limit (5 for QA, 10 for math)

This produces 50,000 trajectories for HotPotQA and 37,500 for GSM8K. The entire process can be parallelized since tool calls are pre-recorded.
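The generation steps above amount to a simple rollout loop. The sketch below is an assumption-laden illustration, not the paper's pipeline: `rollout`, `scripted_policy`, the stub search tool, and the decision to stop when no tag is found are all hypothetical. Only the tag names and the idea of a step limit come from the text.

```python
import re
from typing import Callable, Dict, List, Tuple

TAG = re.compile(r"<(search_query|math_exp|answer)>(.*?)</\1>", re.DOTALL)


def rollout(
    question: str,
    policy: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int,
) -> List[Tuple[str, str]]:
    """Generate one trajectory as a list of (action, observation) pairs."""
    context = question
    trajectory = []
    for _ in range(max_steps):
        action = policy(context)
        match = TAG.search(action)
        # Assumption: stop on an answer tag, or when no tag is emitted at all.
        if match is None or match.group(1) == "answer":
            trajectory.append((action, ""))
            break
        tool = tools.get(match.group(1))
        observation = tool(match.group(2).strip()) if tool else "[unknown tool]"
        trajectory.append((action, observation))
        # Inject the tool result into the context for the next step.
        context = f"{context}\n{action}\n{observation}"
    return trajectory


def scripted_policy(outputs: List[str]) -> Callable[[str], str]:
    """Stand-in for a language model: replays fixed outputs in order."""
    remaining = iter(outputs)
    return lambda context: next(remaining)


SCRIPT = [
    "Search first. <search_query>age of Glenn Hughes</search_query>",
    "Now the other. <search_query>Ross Lynch age</search_query>",
    "Glenn Hughes is older. <answer>Glenn Hughes</answer>",
]
TOOLS = {"search_query": lambda query: f"[stub result for: {query}]"}

trajectory = rollout("Who is older, Glenn Hughes or Ross Lynch?",
                     scripted_policy(SCRIPT), TOOLS, max_steps=5)
print(len(trajectory))  # 3
```

With `max_steps=2` the same script would be cut off after two tool calls, which is how the step limit produces trajectories with no final answer.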

Filtering

Not all trajectories are good training data. The paper explores four filtering strategies:

  1. No filtering: use all generated trajectories, good or bad. Surprisingly, this still improves over the baseline.
  2. Process filtering (best): a model judge (Gemini 1.5 Pro) evaluates each step: "Is action a_i reasonable given context s_i?" Keep trajectories where every step passes. No golden labels needed.
  3. Outcome filtering: keep trajectories where the final answer matches the ground truth. Requires golden labels.
  4. Process + outcome: the intersection. Every step is reasonable AND the final answer is correct. Most restrictive.

The surprising finding: Process-only filtering gives the best results. Outcome filtering actually hurts. SWiRL benefits from seeing trajectories that reason well but arrive at wrong answers, because these contain valuable signal about good intermediate reasoning. This is the opposite of what you'd expect from supervised fine-tuning, where outcome filtering is critical.

Why does outcome filtering hurt? The paper hypothesizes that SWiRL benefits from having both positive and negative examples of final answers, as long as the intermediate steps are sound. When you filter for correct outcomes only, you lose the negative examples that help the RL optimizer learn what not to do at the answer step.
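The four filtering strategies differ only in which flags they check. Here is a minimal sketch, assuming each trajectory has already been labelled; the `ScoredTrajectory` fields and the strategy names are hypothetical, chosen to mirror the descriptions above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredTrajectory:
    step_ok: List[bool]   # judge verdict for each step (hypothetical field)
    answer_correct: bool  # final answer matches ground truth (needs labels)


def keep(trajectory: ScoredTrajectory, strategy: str) -> bool:
    """Decide whether a trajectory survives the given filtering strategy."""
    process_ok = all(trajectory.step_ok)
    if strategy == "none":
        return True
    if strategy == "process":
        return process_ok
    if strategy == "outcome":
        return trajectory.answer_correct
    if strategy == "process+outcome":
        return process_ok and trajectory.answer_correct
    raise ValueError(f"unknown strategy: {strategy}")


data = [
    ScoredTrajectory([True, True, True], answer_correct=True),
    ScoredTrajectory([True, True, True], answer_correct=False),  # sound steps, wrong answer
    ScoredTrajectory([True, False, True], answer_correct=True),  # lucky answer, bad step
    ScoredTrajectory([False, True], answer_correct=False),
]

for strategy in ("none", "process", "outcome", "process+outcome"):
    print(strategy, sum(keep(t, strategy) for t in data))
```

On this toy data, process filtering keeps the second trajectory (sound steps, wrong answer) and drops the third (correct answer reached through a bad step), which is exactly where it diverges from outcome filtering.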

Which filtering strategy gives SWiRL the best results, and why is this surprising?

Chapter 5: RL Optimization

Stage 2 of SWiRL takes the filtered sub-trajectories and uses them to fine-tune the base model with RL. The objective is the expected sum of step-wise rewards:

J(θ) = E_{s ~ T, a ~ π_θ(s)} [ R(a | s) ]

Here π_θ is the base model being fine-tuned, T is the set of all states in the decomposed sub-trajectories, and R(a | s) is the reward from a generative reward model (Gemini 1.5 Pro) that scores each action given its context.

The reward signal

The reward model is a generative judge, not a trained classifier. For each sub-trajectory, Gemini 1.5 Pro is prompted: "Given this context, is this action reasonable?" It produces a binary judgment. No golden labels, no human annotations.
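A generative judge of this kind boils down to a prompt template plus a parser for the verdict. The sketch below is hypothetical throughout: the template wording, the YES/NO protocol, and the `step_reward` function are assumptions, and the judge is a local stub rather than a call to any real model API.

```python
from typing import Callable

# Hypothetical prompt wording; the paper's actual judge prompt is not shown here.
JUDGE_TEMPLATE = (
    "Context so far:\n{context}\n\n"
    "Proposed next action:\n{action}\n\n"
    "Is this action reasonable given the context? Answer YES or NO."
)


def step_reward(context: str, action: str, judge: Callable[[str], str]) -> float:
    """Map the judge's free-text reply to a binary reward."""
    reply = judge(JUDGE_TEMPLATE.format(context=context, action=action))
    return 1.0 if reply.strip().upper().startswith("YES") else 0.0


def stub_judge(prompt: str) -> str:
    """Toy stand-in for a model judge: approves any action containing a known tag."""
    return "YES" if "<search_query>" in prompt or "<answer>" in prompt else "NO"


question = "Who is older, Glenn Hughes or Ross Lynch?"
print(step_reward(question, "<search_query>age of Glenn Hughes</search_query>", stub_judge))  # 1.0
print(step_reward(question, "I like rock music.", stub_judge))  # 0.0
```

Passing the judge in as a callable keeps the reward logic testable without network access; swapping in a real model would only change the `judge` argument.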

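This differs from standard RLHF reward models in two important ways: it scores each intermediate action in its context rather than one complete response, and it is a prompted generative judge rather than a classifier trained on human preference labels.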

Policy gradient optimization

SWiRL uses the same policy gradient algorithm as Gemma 2. The key difference from standard RL is that optimization happens at the step level: each sub-trajectory is an independent training example. The model receives feedback on each individual action, not just the final answer.

Offline RL advantage: Because all trajectories are generated and scored offline (Stage 1), SWiRL doesn't need to interact with a live environment during training. The search results, calculator outputs, and reward scores are all pre-computed. This makes training reproducible and avoids the cost of running tools during RL optimization.
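To see what a step-level update on pre-scored data looks like, here is a toy REINFORCE-style sketch. It is a stand-in, not the algorithm used for Gemma 2 or in the paper: the policy is a softmax over three hypothetical candidate actions for one fixed context, there is no baseline or KL term, and the logged actions are treated as if they were sampled from the current policy.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(
    logits: List[float], batch: List[Tuple[int, float]], lr: float = 0.5
) -> List[float]:
    """One gradient-ascent step on the mean of reward * log pi(action)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, reward in batch:
        for j in range(len(logits)):
            indicator = 1.0 if j == action else 0.0
            # d/d(logit_j) of log softmax(action) is (indicator - prob_j).
            grad[j] += reward * (indicator - probs[j])
    return [w + lr * g / len(batch) for w, g in zip(logits, grad)]


# Logged (action index, judge reward) pairs for one context; values are made up.
BATCH = [(0, 1.0), (1, 0.0), (2, 0.0)]

logits = [0.0, 0.0, 0.0]
for _ in range(20):
    logits = reinforce_update(logits, BATCH)
print([round(p, 3) for p in softmax(logits)])
```

With binary rewards and no baseline, zero-reward actions contribute no gradient of their own; they lose probability only because the rewarded action gains it.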

What the model learns

Through step-wise optimization, the model learns two types of skills simultaneously:

  1. Local decision-making: At each step, what's the best next action? Should I search for more information, or do I have enough to answer? What query should I use?
  2. Global trajectory optimization: How do I structure a multi-step plan? When do I stop searching? How do I synthesize information from multiple sources?

The paper confirms this by measuring average process label accuracy: SWiRL-trained models produce trajectories where each individual step is scored higher by the reward model, even on out-of-distribution tasks.

How does SWiRL's reward signal differ from standard RLHF?

Chapter 6: Results

SWiRL is evaluated on 5 benchmarks: HotPotQA, CofCA, MuSiQue, BeerQA (all multi-hop QA), and GSM8K (math reasoning). The model is Gemma 2-27B trained on HotPotQA trajectories with process filtering.

SWiRL Improvements Over Base Model

Relative accuracy improvements of SWiRL (Gemma 2-27B) over the base model across 5 benchmarks. All trained on HotPotQA only.

The numbers are striking:

SWiRL vs. SFT

The paper finds that supervised fine-tuning (SFT) on the same trajectories actually degrades performance compared to the base model. This is consistent with prior work showing SFT can harm reasoning capabilities. The key difference: SFT memorizes specific trajectories, while SWiRL learns general step-wise reasoning skills.

SFT memorizes, SWiRL generalizes. SFT performs best with process+outcome filtered data (the cleanest examples to memorize). SWiRL performs best with process-only filtered data (a richer set including negative outcomes). This asymmetry reveals a fundamental difference: SFT teaches patterns by imitation, while RL teaches strategies by rewarding good decisions.

SWiRL vs. proprietary models

On HotPotQA, SWiRL Gemma 2-27B (76.9 partial match) outperforms GPT-4 (74.8), GPT-3.5 (62.8), and Bing Chat (72.1), and matches o1-preview (76.9 vs. 76.9). A 27B open-weight model matching frontier proprietary models on multi-hop QA: that's the power of step-wise optimization.

Why does SFT on multi-step trajectories actually hurt performance while SWiRL improves it?

Chapter 7: Cross-Task Generalization

The most exciting result in the paper isn't the in-domain numbers. It's this:

Train on QA, improve on math. SWiRL trained only on HotPotQA (text question-answering with a search engine) improves zero-shot performance on GSM8K (mathematical reasoning with a calculator) by 16.9% relative. The two tasks use different domains, different tools, and different reasoning patterns. Yet step-wise RL transfers.

Cross-Task Transfer

Each cell shows accuracy. Rows = training data, columns = evaluation benchmark. Green = improvement over base model.

The transfer goes both ways: training on HotPotQA improves GSM8K, and training on GSM8K in turn improves multi-hop QA.

Why does this work? The paper's hypothesis: SWiRL doesn't just teach task-specific skills. It teaches general multi-step reasoning — when to search, how to formulate queries, when to stop, how to synthesize. These meta-skills transfer across domains because the structure of multi-step problem-solving is shared even when the content differs.

The paper supports this with process label analysis. After SWiRL training on HotPotQA, the model's average process label accuracy improves from 87.5% to 91.6% on GSM8K trajectories — an out-of-distribution task. The model generates better intermediate steps even on problems it wasn't trained on.
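This chapter mixes two ways of reporting a gain: benchmark improvements are quoted as relative changes, while the process label figures are raw percentages. The helper names below are hypothetical; the sketch simply works out both readings for the 87.5% to 91.6% figures quoted above.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Change as a fraction of the starting value."""
    return (after - before) / before


before, after = 87.5, 91.6  # process label accuracy on GSM8K, from the text

print(f"{absolute_gain(before, after):.1f} points")  # 4.1 points
print(f"{relative_gain(before, after):.1%}")         # 4.7%
```

So the process label improvement is 4.1 percentage points, or about 4.7% relative, which is worth keeping in mind when comparing it with the 16.9% relative accuracy gain.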

Smaller models don't generalize as well

The generalization story depends on scale. Gemma 2-27B shows strong cross-task transfer. Gemma 2-9B and Gemma 2-2B can benefit from in-domain SWiRL (training and testing on the same task), but don't display the same zero-shot generalization to new tasks. This suggests that a certain model capacity is needed to extract and transfer abstract reasoning patterns.

Why does SWiRL training on text QA improve math performance despite using different domains and tools?

Chapter 8: Iterative Training

SWiRL isn't a one-shot process. The paper describes an iterative loop: generate trajectories, filter, train with RL, then use the improved model to generate better trajectories, and repeat.

Round 1: Generate
Base model (Gemma 2-27B) generates 50K trajectories with tool access. Filter with process judge.
Round 1: Train
Decompose into sub-trajectories. Apply step-wise RL. Produce SWiRL model v1.
Round 2: Generate
SWiRL v1 generates new trajectories. These are higher quality since the model is better at multi-step reasoning.
Round 2: Train
Decompose and train again. SWiRL v2 improves further.
Repeat
Continue until convergence or diminishing returns.

This iterative approach has a self-improving dynamic: as the model gets better at multi-step reasoning, it generates better training data, which further improves the model. It's reminiscent of expert iteration (ExIt) and AlphaZero's self-play loop, adapted for language model reasoning.
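The generate, filter, train cycle can be written as a short loop. This is a structural sketch only: `iterate_swirl` and its callable arguments are hypothetical, and the toy stubs replace the model with a single integer "skill" so that the loop runs end to end.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Trajectory = TypeVar("Trajectory")


def iterate_swirl(
    model: Model,
    questions: List[int],
    generate: Callable[[Model, List[int]], List[Trajectory]],
    process_filter: Callable[[Trajectory], bool],
    train: Callable[[Model, List[Trajectory]], Model],
    rounds: int,
) -> Tuple[Model, List[int]]:
    """Run generate -> filter -> train for a fixed number of rounds."""
    kept_per_round = []
    for _ in range(rounds):
        trajectories = generate(model, questions)
        kept = [t for t in trajectories if process_filter(t)]
        model = train(model, kept)  # the improved model generates the next round
        kept_per_round.append(len(kept))
    return model, kept_per_round


# Toy stubs: a "model" is an integer skill level, a "trajectory" is a quality score.
def toy_generate(skill: int, questions: List[int]) -> List[int]:
    return [skill + difficulty for difficulty in questions]


final_model, kept_per_round = iterate_swirl(
    model=0,
    questions=[0, 1, 2],
    generate=toy_generate,
    process_filter=lambda quality: quality >= 2,
    train=lambda skill, kept: skill + 1,
    rounds=3,
)
print(final_model, kept_per_round)  # 3 [1, 2, 3]
```

In the toy run, more trajectories pass the filter each round because the "model" improves, which is the self-improving dynamic described above in miniature. A practical loop would also need a stopping rule for diminishing returns.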

Scaling the dataset

The paper measures how SWiRL performance scales with dataset size:

Data efficiency: 1,000 trajectories is enough for significant improvement. At ~3 steps per trajectory, that's about 3,000 sub-trajectories. For a 27B parameter model, this is remarkably little data — suggesting that the step-wise decomposition creates a much richer training signal per trajectory than single-step approaches.

Comparison against the reward model

A natural question: is SWiRL just distilling the reward model (Gemini 1.5 Pro) into Gemma 2-27B? The paper shows that SWiRL actually outperforms Gemini 1.5 Pro on some out-of-distribution benchmarks (CofCA and BeerQA). This means SWiRL isn't just copying the reward model — it's extracting general reasoning patterns that the reward model itself hasn't mastered.

How does SWiRL demonstrate that it's not simply distilling the reward model?

Chapter 9: Connections

What SWiRL builds on

RLHF (Christiano et al., 2017): The foundation of preference-based RL for LLMs. SWiRL extends RLHF from single-step to multi-step by decomposing trajectories into sub-trajectories.

DPO (Rafailov et al., 2023): Showed that RL objectives can be optimized directly on preference data. SWiRL takes this further by applying the optimization step-wise, not on full responses.

ReAct (Yao et al., 2023): The reasoning-and-acting paradigm that interleaves thought and tool use. SWiRL's trajectory format follows ReAct's thought-action-observation pattern.

RLEF (Gehring et al., 2025): RL from Execution Feedback — uses environment signals (like code test results) as reward. SWiRL goes beyond execution feedback by using a generative judge for process-level evaluation.

Related multi-step RL approaches

DQO (Liu et al., 2024): Offline RL for multi-step reasoning, but operates at the token level. SWiRL works at the step level, which the paper argues is more effective.

OREO (Wang et al., 2024): Multi-step offline RL that requires training a separate value network alongside the policy. SWiRL avoids this by using a generative reward model, making it simpler and cheaper.

Math-Shepherd (Wang et al., 2023): Process reward model for math reasoning. SWiRL generalizes this idea to tool use and multi-hop QA, not just math.

PRIME (Cui et al., 2025): Online multi-step RL, but doesn't support tool use or offline training. SWiRL's offline approach enables pre-computing all tool results.

What SWiRL enables

Agentic RL at scale: As models increasingly act as agents — browsing the web, writing code, using APIs — SWiRL's step-wise approach becomes the natural training paradigm. Each tool call is a distinct RL action.

Cross-domain agents: The generalization results suggest that step-wise RL could train general-purpose agents that transfer reasoning skills across tools and domains.

Self-improving agents: The iterative generation-training loop is a template for agents that improve from their own experience, generating progressively better trajectories.

Cheat sheet

Core idea
Decompose multi-step trajectories into sub-trajectories. Apply RL on each step independently with a per-step reward.
Key finding
Process-only filtering beats outcome filtering. SWiRL learns from trajectories with wrong final answers.
Generalization
Train on QA → +16.9% on math. Step-wise RL teaches general multi-step reasoning.
Practical
Offline, no golden labels, 1K trajectories sufficient, outperforms proprietary models on multi-hop QA.
Objective
J(θ) = E_{s ~ T, a ~ π_θ(s)} [ R(a | s) ], where T is all states across all sub-trajectories.

What distinguishes SWiRL from prior multi-step RL methods like DQO and OREO?