Single-step RL can't teach multi-step reasoning. SWiRL decomposes each trajectory into sub-trajectories and applies RL at every step, improving accuracy on multi-step reasoning and tool-use benchmarks by up to 21.5% (relative) and generalizing across tasks.
Imagine you ask a language model: "Who is older, Glenn Hughes or Ross Lynch?" The model can't answer from memory alone. It needs to search for Glenn Hughes's age, search for Ross Lynch's age, then compare. Three steps, each depending on the last.
Standard RLHF and DPO treat the entire response as a single action. The model produces one output, gets one reward. This works fine for single-turn tasks like "summarize this article" or "translate this sentence." But for multi-step reasoning and tool use, single-step RL has a fundamental limitation.
Consider a 5-step trajectory where the model searches, reads, searches again, reads again, then answers. With single-step RL, the model gets a thumbs-up or thumbs-down on the entire chain. The credit assignment problem is severe: reward must be distributed across all 5 steps, but the model has no mechanism to know which steps were good and which were bad.
[Interactive figure: single-step RL gives one reward for the whole trajectory, while step-wise RL gives feedback at each step and so pinpoints the problematic step.]
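The difference between the two feedback signals can be made concrete with a small sketch. The function names and the example labels below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List


def single_step_signal(final_answer_correct: bool, num_steps: int) -> List[float]:
    """Trajectory-level feedback: every step inherits the same scalar reward."""
    reward = 1.0 if final_answer_correct else 0.0
    return [reward] * num_steps


def step_wise_signal(step_ok: List[bool]) -> List[float]:
    """Step-level feedback: each step gets its own reward."""
    return [1.0 if ok else 0.0 for ok in step_ok]


# Hypothetical 5-step trajectory: the third step is bad and the final answer is wrong.
step_ok = [True, True, False, True, True]

print(single_step_signal(final_answer_correct=False, num_steps=len(step_ok)))
print(step_wise_signal(step_ok))
```

With trajectory-level feedback all five steps look equally bad, while the step-wise signal isolates the third step.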
This isn't just a theoretical concern. The paper demonstrates it empirically: models trained with single-step RL (standard DPO on full trajectories) actually degrade on multi-step tasks compared to the base model. Supervised fine-tuning on the same data also hurts performance. The multi-step setting demands a fundamentally different approach.
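SWiRL's core idea is disarmingly simple: break each multi-step trajectory into one sub-trajectory per action, then apply RL to every step instead of only to the final answer.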
Here's what this means concretely. A trajectory with 3 tool calls produces 3 sub-trajectories:
Each sub-trajectory is scored independently by a reward model: "Given everything that has happened so far, is this next action reasonable?" The reward model doesn't need the ground-truth answer. It just evaluates whether each step makes sense in context.
This transforms the credit assignment problem. Instead of asking "was the final answer right?" you ask "was each individual decision reasonable?" An incorrect final answer might still contain 4 good steps and 1 bad one. SWiRL can learn from the good steps even when the trajectory fails.
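Before we can decompose trajectories, we need to understand their structure. In SWiRL, a trajectory is a sequence of alternating states and actions:

$$\tau = (s_1, a_1, s_2, a_2, \ldots, s_K, a_K)$$

Let's define each piece precisely. The first state s_1 is the original prompt. Each action a_i is the model's output at step i: its chain of thought followed by either a tool call or the final answer. Each later state s_i is everything that has happened before step i, meaning the prompt, the earlier actions, and the tool responses returned by the environment. K is the number of actions, and a_K is the final answer.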
Here's a concrete example. The question is: "Who is older, Glenn Hughes or Ross Lynch?"
Notice the pattern: the model generates a chain of thought (its reasoning), then takes an action (either a tool call wrapped in special tags, or a final answer). The tool response comes from the environment (a search engine, calculator, etc.), not from the model. The trajectory ends when the model produces an <answer> tag.
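The parse step described above can be sketched as follows. The tag names are the ones SWiRL uses (listed just below); the regex-based parser and the name `parse_action` are assumptions made for illustration, not the paper's implementation.

```python
import re
from typing import Optional, Tuple

# Tag names come from the article; the parsing approach is an assumption.
_TAG_PATTERN = re.compile(
    r"<(search_query|math_exp|answer)>(.*?)</\1>", re.DOTALL
)


def parse_action(model_output: str) -> Optional[Tuple[str, str]]:
    """Return (tag, payload) for the first recognised tag, or None if none is found."""
    match = _TAG_PATTERN.search(model_output)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


step = (
    "I need Glenn Hughes's birth date first. "
    "<search_query>Glenn Hughes date of birth</search_query>"
)
print(parse_action(step))
```

The chain of thought before the tag is ignored by the parser; only the tagged payload is passed to a tool or treated as the final answer.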
The model signals its actions with special tags: <search_query>...</search_query> for search, <math_exp>...</math_exp> for calculator calls, and <answer>...</answer> to signal it's done. These tags are parsed and executed automatically; the model never sees raw API calls.

Decomposition is the mechanical heart of SWiRL. Given a trajectory with K actions, we create K sub-trajectories. Each sub-trajectory contains all the context up to step i, and the action at step i is the thing we're evaluating and optimizing.
Let's see this concretely. Our 3-step Glenn Hughes trajectory becomes 3 sub-trajectories:
[Interactive figure: the three sub-trajectories. For each one, the prior context the model receives is shown in teal and the action under evaluation in orange.]
The key property: each sub-trajectory is self-contained. The reward model scores action a_i given context s_i: "is this action reasonable given everything that happened before?" It doesn't need to know what comes after, and it doesn't need to know the correct final answer.
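A minimal sketch of the decomposition, assuming a trajectory is stored as a prompt plus a list of (action, tool response) steps. The class names, the `decompose` function, and the placeholder strings are hypothetical; the real search results are not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    action: str       # model output: chain of thought plus tool call or answer
    observation: str  # tool response from the environment ("" for the final answer)


@dataclass
class SubTrajectory:
    context: str  # s_i: prompt plus all earlier actions and tool responses
    action: str   # a_i: the action being scored and optimised


def decompose(prompt: str, steps: List[Step]) -> List[SubTrajectory]:
    """Turn a K-action trajectory into K (context, action) pairs."""
    subs = []
    context = prompt
    for step in steps:
        subs.append(SubTrajectory(context=context, action=step.action))
        # Later steps see this action and its tool response as context.
        context += "\n" + step.action
        if step.observation:
            context += "\n" + step.observation
    return subs


prompt = "Who is older, Glenn Hughes or Ross Lynch?"
steps = [
    Step("<search_query>Glenn Hughes age</search_query>", "[search result 1]"),
    Step("<search_query>Ross Lynch age</search_query>", "[search result 2]"),
    Step("<answer>[final answer]</answer>", ""),
]
for i, sub in enumerate(decompose(prompt, steps), start=1):
    print(f"sub-trajectory {i}: {len(sub.context)} context chars -> {sub.action}")
```

Each sub-trajectory only looks backwards: the context never contains the action being evaluated or anything after it.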
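Formally, writing πθ for the policy being trained, the objective becomes:

$$J(\theta) = \mathbb{E}_{s \sim T,\; a \sim \pi_\theta(\cdot \mid s)}\left[ R(a \mid s) \right]$$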
where T is the set of all states across all sub-trajectories in the dataset, and R(a | s) is the reward for action a given context s. Each sub-trajectory contributes one (state, action) pair to this expectation.
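Compare this to single-step RL, which in the same notation would be:

$$J_{\text{single}}(\theta) = \mathbb{E}_{s_1 \sim D,\; a \sim \pi_\theta(\cdot \mid s_1)}\left[ R(a \mid s_1) \right]$$

where D (a symbol introduced here for the comparison) is the set of original prompts and s_1 is a prompt with no intermediate context. Single-step RL only optimizes the final action given the original prompt. SWiRL optimizes every action given its full context.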
SWiRL has two stages. Stage 1 is about generating and filtering the multi-step training data. You don't need human annotators writing multi-step trajectories — the model generates them itself.
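The process is straightforward: the model is prompted with a question, generates a step, any tool call in that step is executed and its response is appended to the context, and the loop repeats until the model produces an <answer> tag or hits the step limit (5 for QA, 10 for math).

This produces 50,000 trajectories for HotPotQA and 37,500 for GSM8K. The entire process can be parallelized since tool calls are pre-recorded.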
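The generation loop (generate a step, run the tool, append the response, stop at an answer or at the step limit) might look roughly like this. The `rollout` function, the scripted stand-in policy, and the stub tools are all hypothetical; a real setup would call the language model and an actual search engine or calculator.

```python
import re
from typing import Callable, Dict, List, Tuple

TAG = re.compile(r"<(search_query|math_exp|answer)>(.*?)</\1>", re.DOTALL)


def rollout(
    prompt: str,
    policy: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int,
) -> List[Tuple[str, str]]:
    """Collect (action, observation) pairs until an answer or the step limit."""
    context = prompt
    steps = []
    for _ in range(max_steps):
        action = policy(context)
        match = TAG.search(action)
        # Assumption: an output with no recognised tag also ends the rollout.
        if match is None or match.group(1) == "answer":
            steps.append((action, ""))
            break
        observation = tools[match.group(1)](match.group(2).strip())
        steps.append((action, observation))
        context += "\n" + action + "\n" + observation
    return steps


def scripted_policy(outputs: List[str]) -> Callable[[str], str]:
    """Stand-in for the model: replays fixed outputs, ignoring the context."""
    remaining = iter(outputs)
    return lambda context: next(remaining)


# Stub tools that return placeholder text instead of real results.
TOOLS = {
    "search_query": lambda query: f"[search results for: {query}]",
    "math_exp": lambda expression: f"[calculator result for: {expression}]",
}

outputs = [
    "First lookup. <search_query>Glenn Hughes age</search_query>",
    "Second lookup. <search_query>Ross Lynch age</search_query>",
    "Now I can compare. <answer>[final answer]</answer>",
]
trajectory = rollout("Who is older?", scripted_policy(outputs), TOOLS, max_steps=5)
print(len(trajectory), "steps")
```

Passing `max_steps=5` mirrors the QA limit quoted above; a math run would use 10.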
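Not all trajectories are good training data. The paper explores four filtering strategies: no filtering, process filtering (keep trajectories in which every step is judged reasonable), outcome filtering (keep trajectories whose final answer is correct), and process-and-outcome filtering (apply both).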
Why does outcome filtering hurt? The paper hypothesizes that SWiRL benefits from having both positive and negative examples of final answers, as long as the intermediate steps are sound. When you filter for correct outcomes only, you lose the negative examples that help the RL optimizer learn what not to do at the answer step.
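As a rough sketch, the filtering strategies can be expressed as predicates over labeled trajectories. The data layout, the strategy names, and the rule that process filtering requires every step to pass are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledTrajectory:
    step_ok: List[bool]   # one judge verdict per step (process labels)
    answer_correct: bool  # final answer matches the reference (outcome label)


def keep(traj: LabeledTrajectory, strategy: str) -> bool:
    """Decide whether a trajectory survives the chosen filter."""
    process_ok = all(traj.step_ok)
    if strategy == "none":
        return True
    if strategy == "process":
        return process_ok
    if strategy == "outcome":
        return traj.answer_correct
    if strategy == "process_and_outcome":
        return process_ok and traj.answer_correct
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical examples: sound steps with a wrong answer, and the reverse.
sound_but_wrong = LabeledTrajectory(step_ok=[True, True, True], answer_correct=False)
lucky_guess = LabeledTrajectory(step_ok=[True, False, True], answer_correct=True)

for strategy in ("none", "process", "outcome", "process_and_outcome"):
    print(strategy, keep(sound_but_wrong, strategy), keep(lucky_guess, strategy))
```

Process filtering keeps `sound_but_wrong`, which is the kind of trajectory the paragraph above argues is still useful: reasonable steps paired with a negative example at the answer step.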
Stage 2 of SWiRL takes the filtered sub-trajectories and uses them to fine-tune the base model with RL. The objective is the expected sum of step-wise rewards:

$$J(\theta) = \mathbb{E}_{s \sim T,\; a \sim \pi_\theta(\cdot \mid s)}\left[ R(a \mid s) \right]$$
Here πθ is the base model being fine-tuned, T is the set of all states in the decomposed sub-trajectories, and R(a | s) is the reward from a generative reward model (Gemini 1.5 Pro) that scores each action given its context.
The reward model is a generative judge, not a trained classifier. For each sub-trajectory, Gemini 1.5 Pro is prompted: "Given this context, is this action reasonable?" It produces a binary judgment. No golden labels, no human annotations.
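A generative judge of this kind can be wrapped as a reward function along the following lines. The prompt wording, the YES/NO reply convention, and the `judge` callable are all hypothetical; the actual prompt sent to Gemini 1.5 Pro is not reproduced in this article, and no real API is called here.

```python
from typing import Callable

# Hypothetical prompt template; the paper's real prompt may differ.
JUDGE_TEMPLATE = (
    "Context so far:\n{context}\n\n"
    "Proposed next action:\n{action}\n\n"
    "Is this action reasonable given the context? Answer YES or NO."
)


def step_reward(context: str, action: str, judge: Callable[[str], str]) -> float:
    """Map the judge's free-text reply to a binary reward R(a | s)."""
    reply = judge(JUDGE_TEMPLATE.format(context=context, action=action))
    return 1.0 if reply.strip().upper().startswith("YES") else 0.0


def fake_judge(prompt: str) -> str:
    """Stand-in for a call to a generative model; approves search actions only."""
    return "YES" if "<search_query>" in prompt else "NO"


print(step_reward("Who is older?", "<search_query>Glenn Hughes age</search_query>", fake_judge))
print(step_reward("Who is older?", "I will just guess.", fake_judge))
```

Because the judge sees only the context and the proposed action, the same function works for any sub-trajectory without access to a reference answer.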
This is different from standard RLHF reward models in two important ways:
SWiRL uses the same policy gradient algorithm as Gemma 2. The key difference from standard RL is that optimization happens at the step level: each sub-trajectory is an independent training example. The model receives feedback on each individual action, not just the final answer.
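To make "optimization at the step level" concrete, here is a toy sketch with a tabular softmax policy in which each context is an independent training example. It uses a plain, exact policy gradient on the expected step reward; it is not Gemma 2's algorithm, and the contexts, rewards, and learning rate are invented for illustration.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(
    logits: Dict[str, List[float]],
    rewards: Dict[str, List[float]],
    lr: float,
) -> None:
    """One gradient-ascent step on the sum over contexts of E_{a ~ pi}[R(a | s)]."""
    for state, r in rewards.items():
        probs = softmax(logits[state])
        expected = sum(p * x for p, x in zip(probs, r))  # E[R] under the current policy
        for k in range(len(r)):
            # For a softmax policy: d E[R] / d logit_k = pi_k * (R_k - E[R]).
            logits[state][k] += lr * probs[k] * (r[k] - expected)


# Two hypothetical contexts, each with two candidate actions and a binary step reward.
logits = {"after_question": [0.0, 0.0], "after_first_search": [0.0, 0.0]}
rewards = {"after_question": [1.0, 0.0], "after_first_search": [0.0, 1.0]}

for _ in range(200):
    policy_gradient_step(logits, rewards, lr=1.0)

for state in logits:
    print(state, [round(p, 3) for p in softmax(logits[state])])
```

Each context is updated from its own reward, so a good first step is reinforced even if a later step in the same trajectory is penalized.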
Through step-wise optimization, the model learns two types of skills simultaneously:
The paper confirms this by measuring average process label accuracy: SWiRL-trained models produce trajectories where each individual step is scored higher by the reward model, even on out-of-distribution tasks.
SWiRL is evaluated on 5 benchmarks: HotPotQA, CofCA, MuSiQue, BeerQA (all multi-hop QA), and GSM8K (math reasoning). The model is Gemma 2-27B trained on HotPotQA trajectories with process filtering.
[Figure: relative accuracy improvements of SWiRL (Gemma 2-27B) over the base model across 5 benchmarks. All models were trained on HotPotQA only.]
The numbers are striking:
The paper finds that supervised fine-tuning (SFT) on the same trajectories actually degrades performance compared to the base model. This is consistent with prior work showing SFT can harm reasoning capabilities. The key difference: SFT memorizes specific trajectories, while SWiRL learns general step-wise reasoning skills.
On HotPotQA, SWiRL Gemma 2-27B (76.9 partial match) outperforms GPT-4 (74.8), GPT-3.5 (62.8), and Bing Chat (72.1), and it matches the o1-preview score quoted here (76.9 vs. 76.9). A 27B open-weight model on par with frontier proprietary models on multi-hop QA: that's the power of step-wise optimization.
The most exciting result in the paper isn't the in-domain numbers. It's this:
[Figure: accuracy by training dataset (rows) and evaluation benchmark (columns). Green marks an improvement over the base model.]
The transfer goes both ways:
Why does this work? The paper's hypothesis: SWiRL doesn't just teach task-specific skills. It teaches general multi-step reasoning — when to search, how to formulate queries, when to stop, how to synthesize. These meta-skills transfer across domains because the structure of multi-step problem-solving is shared even when the content differs.
The paper supports this with process label analysis. After SWiRL training on HotPotQA, the model's average process label accuracy improves from 87.5% to 91.6% on GSM8K trajectories — an out-of-distribution task. The model generates better intermediate steps even on problems it wasn't trained on.
The generalization story depends on scale. Gemma 2-27B shows strong cross-task transfer. Gemma 2-9B and 2-2B can benefit from in-domain SWiRL (training and testing on the same task), but don't display the same zero-shot generalization to new tasks. This suggests that a certain model capacity is needed to extract and transfer abstract reasoning patterns.
SWiRL isn't a one-shot process. The paper describes an iterative loop: generate trajectories, filter, train with RL, then use the improved model to generate better trajectories, and repeat.
This iterative approach has a self-improving dynamic: as the model gets better at multi-step reasoning, it generates better training data, which further improves the model. It's reminiscent of expert iteration (ExIt) and AlphaZero's self-play loop, adapted for language model reasoning.
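The loop can be sketched as a higher-order function. Everything here is a stand-in: `iterate_swirl` and its parameters are hypothetical names, and the demo uses integers in place of models and trajectories purely to show the control flow.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")
Trajectory = TypeVar("Trajectory")


def iterate_swirl(
    model: Model,
    generate: Callable[[Model], List[Trajectory]],
    keep: Callable[[Trajectory], bool],
    train: Callable[[Model, List[Trajectory]], Model],
    rounds: int,
) -> Model:
    """Repeat: generate with the current model, filter, train, use the new model."""
    for _ in range(rounds):
        trajectories = generate(model)
        filtered = [t for t in trajectories if keep(t)]
        model = train(model, filtered)
    return model


# Toy stand-ins: a "model" is a skill level, a "trajectory" is a quality score.
def toy_generate(skill: int) -> List[int]:
    return [skill - 1, skill, skill + 1]


def toy_keep(quality: int) -> bool:
    return quality > 0


def toy_train(skill: int, data: List[int]) -> int:
    return skill + len(data)


print(iterate_swirl(1, toy_generate, toy_keep, toy_train, rounds=3))
```

The key design point is that `generate` always receives the most recently trained model, which is what gives the loop its self-improving character.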
The paper measures how SWiRL performance scales with dataset size:
A natural question: is SWiRL just distilling the reward model (Gemini 1.5 Pro) into Gemma 2-27B? The paper shows that SWiRL actually outperforms Gemini 1.5 Pro on some out-of-distribution benchmarks (CofCA and BeerQA). This means SWiRL isn't just copying the reward model — it's extracting general reasoning patterns that the reward model itself hasn't mastered.
RLHF (Christiano et al., 2017): The foundation of preference-based RL for LLMs. SWiRL extends RLHF from single-step to multi-step by decomposing trajectories into sub-trajectories.
DPO (Rafailov et al., 2023): Showed that RL objectives can be optimized directly on preference data. SWiRL takes this further by applying the optimization step-wise, not on full responses.
ReAct (Yao et al., 2023): The reasoning-and-acting paradigm that interleaves thought and tool use. SWiRL's trajectory format follows ReAct's thought-action-observation pattern.
RLEF (Gehring et al., 2025): RL from Execution Feedback — uses environment signals (like code test results) as reward. SWiRL goes beyond execution feedback by using a generative judge for process-level evaluation.
DQO (Liu et al., 2024): Offline RL for multi-step reasoning, but operates at the token level. SWiRL works at the step level, which the paper argues is more effective.
OREO (Wang et al., 2024): Multi-step offline RL that requires training a separate value network alongside the policy. SWiRL avoids this by using a generative reward model, making it simpler and cheaper.
Math-Shepherd (Wang et al., 2023): Process reward model for math reasoning. SWiRL generalizes this idea to tool use and multi-hop QA, not just math.
PRIME (Cui et al., 2025): Online multi-step RL, but doesn't support tool use or offline training. SWiRL's offline approach enables pre-computing all tool results.
Agentic RL at scale: As models increasingly act as agents — browsing the web, writing code, using APIs — SWiRL's step-wise approach becomes the natural training paradigm. Each tool call is a distinct RL action.
Cross-domain agents: The generalization results suggest that step-wise RL could train general-purpose agents that transfer reasoning skills across tools and domains.
Self-improving agents: The iterative generation-training loop is a template for agents that improve from their own experience, generating progressively better trajectories.