SWiRL — Veanors

Chapter 0: The Problem

Imagine you ask a language model: "Who is older, Glenn Hughes or Ross Lynch?" The model can't answer from memory alone. It needs to search for Glenn Hughes's age, search for Ross Lynch's age, then compare. Three steps, each depending on the last.

Standard RLHF and DPO treat the entire response as a single action. The model produces one output, gets one reward. This works fine for single-turn tasks like "summarize this article" or "translate this sentence." But for multi-step reasoning and tool use, single-step RL has a fundamental limitation.

The compounding error problem: In a multi-step chain, an error at step 2 corrupts all subsequent steps. Single-step RL only sees the final answer. If the answer is wrong, the model has no idea which step went wrong. Was the first search query bad? Did it misinterpret the search result? Did it fail at comparison? A single final reward gives no signal about the intermediate reasoning.

Consider a 5-step trajectory where the model searches, reads, searches again, reads again, then answers. With single-step RL, the model gets a thumbs-up or thumbs-down on the entire chain. The credit assignment problem is severe: reward must be distributed across all 5 steps, but the model has no mechanism to know which steps were good and which were bad.

Figure: Single-Step vs. Step-Wise RL. Single-step RL assigns one reward to the whole trajectory; step-wise RL gives feedback at each step, which pinpoints the problematic step.

This isn't just a theoretical concern. The paper demonstrates it empirically: models trained with single-step RL (standard DPO on full trajectories) actually degrade on multi-step tasks compared to the base model. Supervised fine-tuning on the same data also hurts performance. The multi-step setting demands a fundamentally different approach.

Why does standard single-step RL struggle with multi-step reasoning tasks?

It gives one reward for the entire trajectory, making it impossible to identify which intermediate step caused errors (the credit assignment problem)
It requires too much compute
It can only optimize for one task at a time

Chapter 1: The Key Insight

SWiRL's core idea is disarmingly simple:

Don't optimize the whole trajectory at once. Break each multi-step trajectory into sub-trajectories at every action boundary. Then apply RL independently on each sub-trajectory. A reward model evaluates each action given its context — not the final answer.

Here's what this means concretely. A trajectory with 3 tool calls produces 3 sub-trajectories:

Sub-trajectory 1: The prompt → the model's first thought + first search query
Sub-trajectory 2: The prompt + first action + search result → the model's second thought + second search query
Sub-trajectory 3: Everything so far → the model's final reasoning + answer

Each sub-trajectory is scored independently by a reward model: "Given everything that has happened so far, is this next action reasonable?" The reward model doesn't need the ground-truth answer. It just evaluates whether each step makes sense in context.

This transforms the credit assignment problem. Instead of asking "was the final answer right?" you ask "was each individual decision reasonable?" An incorrect final answer might still contain 4 good steps and 1 bad one. SWiRL can learn from the good steps even when the trajectory fails.

No golden labels needed. Unlike outcome-based filtering (which requires knowing the correct answer), SWiRL's process-based approach only asks whether each step is reasonable. A model judge (Gemini 1.5 Pro in the paper) evaluates this. This means SWiRL can learn from trajectories that end in wrong answers — and the paper shows this actually helps performance.

What is SWiRL's core decomposition strategy?

Break each multi-step trajectory into sub-trajectories at each action boundary, and apply RL on each sub-trajectory independently with a per-step reward
Apply RL only on the final step of each trajectory
Decompose tokens into individual RL actions

Chapter 2: Multi-Step Trajectories

Before we can decompose trajectories, we need to understand their structure. In SWiRL, a trajectory is a sequence of alternating states and actions:

τ = (s₁, a₁, s₂, a₂, ..., s_K, a_K)

Let's define each piece precisely:

State s₁ is the original prompt — the question the user asked
Action a_i is the model's complete response at step i — a chain of thought followed by either a tool call or a final answer
State s_{i+1} contains everything so far: all previous states, actions, and tool responses. It's the full context window the model sees at step i+1
Action a_K (the last action) is the model's final answer

Here's a concrete example. The question is: "Who is older, Glenn Hughes or Ross Lynch?"

State s₁ (Prompt)

"Who is older, Glenn Hughes or Ross Lynch?"

↓

Action a₁ (Thought + Tool Call)

"To figure out who is older, I should first search for the age of Glenn Hughes. <search_query>age of Glenn Hughes</search_query>"

↓

Environment Response

"Glenn Hughes (born August 21, 1951) is an English rock bassist..."

↓

Action a₂ (Thought + Tool Call)

"Glenn Hughes was born in 1951. Next, I should find Ross Lynch's age. <search_query>Ross Lynch age</search_query>"

↓

Environment Response

"Ross Shor Lynch (born December 29, 1995) is an American singer..."

↓

Action a₃ (Final Answer)

"Given the results of my previous searches, Glenn Hughes (born 1951) is older than Ross Lynch (born 1995). <answer>Glenn Hughes</answer>"

Notice the pattern: the model generates a chain of thought (its reasoning), then takes an action (either a tool call wrapped in special tags, or a final answer). The tool response comes from the environment (a search engine, calculator, etc.), not from the model. The trajectory ends when the model produces an <answer> tag.

Tool markers: The model uses <search_query>...</search_query> for search, <math_exp>...</math_exp> for calculator calls, and <answer>...</answer> to signal it's done. These tags are parsed and executed automatically — the model never sees raw API calls.
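
To make the tag convention concrete, here is a minimal sketch of how one action could be split into its thought and its tagged call. The tag names come from the text above; the `ParsedAction` type and `parse_action` helper are hypothetical names for this illustration, not the paper's parser.

```python
import re
from typing import NamedTuple, Optional

# Tag names are taken from the text; everything else here is an assumption.
TAG_PATTERN = re.compile(r"<(search_query|math_exp|answer)>(.*?)</\1>", re.DOTALL)


class ParsedAction(NamedTuple):
    thought: str   # free-text reasoning that precedes the tag
    kind: str      # "search_query", "math_exp", or "answer"
    payload: str   # text found inside the tag


def parse_action(action_text: str) -> Optional[ParsedAction]:
    """Split one model action into its thought and its first tagged call."""
    match = TAG_PATTERN.search(action_text)
    if match is None:
        return None  # no recognized tag, so the step cannot be executed
    thought = action_text[: match.start()].strip()
    return ParsedAction(thought, match.group(1), match.group(2).strip())


a1 = (
    "To figure out who is older, I should first search for the age of "
    "Glenn Hughes. <search_query>age of Glenn Hughes</search_query>"
)
parsed = parse_action(a1)
```

A dispatcher would then route `parsed.kind` to the search engine or calculator, and stop the trajectory when the kind is `answer`.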

In SWiRL's trajectory notation τ = (s₁, a₁, ..., s_K, a_K), what does state s_i contain for i > 1?

The full context so far: the original prompt, all previous actions, and all environment responses (the complete history the model sees at step i)
Only the most recent environment response
A fixed embedding of the trajectory

Chapter 3: Step-Wise Decomposition

This is the mechanical heart of SWiRL. Given a trajectory with K actions, we create K sub-trajectories. Each sub-trajectory contains all the context up to step i, and the action at step i is the thing we're evaluating and optimizing.

The decomposition rule: A trajectory with K actions produces K sub-trajectories. Sub-trajectory i uses state s_i as the "prompt" and a_i as the "response." Because s_i already bundles the original prompt, every earlier action, and every tool response, each sub-trajectory is an independent (context, action) pair for RL.

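Let's see this concretely. Our 3-step Glenn Hughes trajectory becomes 3 sub-trajectories, each pairing the prior context with the action under evaluation:

Sub-trajectory 1: context s₁ (the prompt) → action a₁ (thought + search for Glenn Hughes's age)
Sub-trajectory 2: context s₂ (prompt + a₁ + first search result) → action a₂ (thought + search for Ross Lynch's age)
Sub-trajectory 3: context s₃ (prompt + a₁ + first search result + a₂ + second search result) → action a₃ (final reasoning + answer)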

The key property: each sub-trajectory is self-contained. The reward model scores a_i given context s_i — "is this action reasonable given everything that happened before?" It doesn't need to know what comes after, and it doesn't need to know the correct final answer.
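
The decomposition rule can be sketched in a few lines. This is an illustration under simple assumptions: contexts are plain strings joined with newlines, and the `Step` type and `decompose` function are hypothetical names, not the paper's code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    action: str            # a_i: thought plus a tool call or the final answer
    observation: str = ""  # environment response; empty after the final answer


def decompose(prompt: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Return one (context, action) pair per action in the trajectory."""
    pairs = []
    context = prompt  # s_1 is just the prompt
    for step in steps:
        pairs.append((context, step.action))
        # s_{i+1} = s_i + a_i + environment response (assumed string join)
        parts = (context, step.action, step.observation)
        context = "\n".join(part for part in parts if part)
    return pairs


prompt = "Who is older, Glenn Hughes or Ross Lynch?"
steps = [
    Step("I should first search for the age of Glenn Hughes. "
         "<search_query>age of Glenn Hughes</search_query>",
         "Glenn Hughes (born August 21, 1951) is an English rock bassist..."),
    Step("Glenn Hughes was born in 1951. Next, I should find Ross Lynch's age. "
         "<search_query>Ross Lynch age</search_query>",
         "Ross Shor Lynch (born December 29, 1995) is an American singer..."),
    Step("Glenn Hughes (born 1951) is older than Ross Lynch (born 1995). "
         "<answer>Glenn Hughes</answer>"),
]
pairs = decompose(prompt, steps)
```

Each element of `pairs` is one independent training example: the first item is what the model saw, the second is the action the reward model scores.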

Formally, the objective becomes:

J(θ) = E_{s~T, a~π_θ(s)} [R(a | s)]

where T is the set of all states across all sub-trajectories in the dataset, and R(a | s) is the reward for action a given context s. Each sub-trajectory contributes one (state, action) pair to this expectation.

Compare this to single-step RL, which would be:

J_single(θ) = E_{τ~D} [R(a_K | s₁)]

where D is the dataset of full trajectories. Single-step RL only optimizes the final action given the original prompt. SWiRL optimizes every action given its full context.

A trajectory with 5 tool calls produces how many sub-trajectories in SWiRL?

5: one per action, where each sub-trajectory contains all prior context and the action at that step
1: the whole trajectory is one unit
10: two per action (one for the thought, one for the tool call)

Chapter 4: Synthetic Data Generation

SWiRL has two stages. Stage 1 is about generating and filtering the multi-step training data. You don't need human annotators writing multi-step trajectories — the model generates them itself.

Generation

The process is straightforward:

Take a dataset of questions (e.g., 10,000 from HotPotQA, 7,500 from GSM8K)
For each question, generate 5 trajectories using the base model (Gemma 2-27B)
At each step, the model either calls a tool (search engine or calculator) or produces an answer
Tool calls are executed automatically and results injected into context
Trajectory ends when the model produces an <answer> tag or hits the step limit (5 for QA, 10 for math)

This produces 50,000 trajectories for HotPotQA (10,000 × 5) and 37,500 for GSM8K (7,500 × 5). Because each trajectory is generated independently of the others, the process can be parallelized.
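
The generation loop above can be sketched as follows. This is a rough illustration, not the paper's pipeline: the scripted policy stands in for Gemma 2-27B, the search tool is a stub, and the names `rollout` and `scripted_policy` are hypothetical.

```python
from typing import Callable, Dict, List


def rollout(question: str,
            policy: Callable[[str], str],
            tools: Dict[str, Callable[[str], str]],
            max_steps: int) -> List[dict]:
    """Generate one trajectory; stop at an <answer> tag or at the step limit."""
    context = question
    steps: List[dict] = []
    for _ in range(max_steps):
        action = policy(context)
        step = {"context": context, "action": action, "observation": ""}
        steps.append(step)
        if "<answer>" in action:
            break
        # Execute the first tool whose tag pair appears in the action.
        for tag, tool in tools.items():
            open_tag, close_tag = f"<{tag}>", f"</{tag}>"
            if open_tag in action and close_tag in action:
                query = action.split(open_tag, 1)[1].split(close_tag, 1)[0]
                step["observation"] = tool(query)
                break
        # Inject the action and the tool result into the next context.
        context = "\n".join([context, action, step["observation"]])
    return steps


def scripted_policy(actions: List[str]) -> Callable[[str], str]:
    """Stand-in for the base model: replays a fixed list of actions."""
    remaining = iter(actions)
    return lambda context: next(remaining)


script = [
    "I should look up Glenn Hughes first. "
    "<search_query>age of Glenn Hughes</search_query>",
    "Now Ross Lynch. <search_query>Ross Lynch age</search_query>",
    "Glenn Hughes is older. <answer>Glenn Hughes</answer>",
]
tools = {"search_query": lambda query: f"[stub search result for: {query}]"}

trajectory = rollout("Who is older, Glenn Hughes or Ross Lynch?",
                     scripted_policy(script), tools, max_steps=5)

# Dataset size from the text: questions times samples per question.
total_trajectories = 10_000 * 5
```

In a real run the policy would be sampled five times per question with a nonzero temperature, so the five trajectories differ.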

Filtering

Not all trajectories are good training data. The paper explores four filtering strategies:

No Filtering

Use all generated trajectories, good or bad. Surprisingly, this still improves over the baseline.

Process Filtering ← BEST

A model judge (Gemini 1.5 Pro) evaluates each step: "Is action a_i reasonable given context s_i?" Keep trajectories where every step passes. No golden labels needed.

Outcome Filtering

Keep trajectories where the final answer matches the ground truth. Requires golden labels.

Process + Outcome

Intersection: every step is reasonable AND the final answer is correct. Most restrictive.
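
Here is a small sketch of the four strategies side by side. The trajectory dictionaries, the `toy_judge` rule, and the function names are all made up for illustration; in the paper the judge is Gemini 1.5 Pro, not a string check.

```python
from typing import Callable, Dict, List

Judge = Callable[[str, str], bool]  # (context, action) -> is the step reasonable?


def process_ok(traj: Dict, judge: Judge) -> bool:
    """True when every step passes the judge. No gold answer is needed."""
    return all(judge(context, action) for context, action in traj["steps"])


def outcome_ok(traj: Dict) -> bool:
    """True when the final answer matches the gold label."""
    return traj["final_answer"] == traj["gold_answer"]


def filter_trajectories(trajs: List[Dict], strategy: str, judge: Judge) -> List[Dict]:
    if strategy == "none":
        return list(trajs)
    if strategy == "process":
        return [t for t in trajs if process_ok(t, judge)]
    if strategy == "outcome":
        return [t for t in trajs if outcome_ok(t)]
    if strategy == "process+outcome":
        return [t for t in trajs if process_ok(t, judge) and outcome_ok(t)]
    raise ValueError(f"unknown strategy: {strategy}")


def toy_judge(context: str, action: str) -> bool:
    # Hypothetical stand-in for the model judge.
    return "UNREASONABLE" not in action


trajs = [
    {"name": "A", "steps": [("q", "search Glenn Hughes"), ("q+r", "answer")],
     "final_answer": "Glenn Hughes", "gold_answer": "Glenn Hughes"},
    {"name": "B", "steps": [("q", "search Ross Lynch"), ("q+r", "answer")],
     "final_answer": "Ross Lynch", "gold_answer": "Glenn Hughes"},
    {"name": "C", "steps": [("q", "UNREASONABLE search"), ("q+r", "answer")],
     "final_answer": "Glenn Hughes", "gold_answer": "Glenn Hughes"},
]
strategies = ("none", "process", "outcome", "process+outcome")
kept = {s: [t["name"] for t in filter_trajectories(trajs, s, toy_judge)]
        for s in strategies}
```

Trajectory B is the interesting case: sound steps, wrong answer. Process filtering keeps it, and outcome filtering throws it away.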

The surprising finding: Process-only filtering gives the best results. Outcome filtering actually hurts. SWiRL benefits from seeing trajectories that reason well but arrive at wrong answers — these contain valuable signal about good intermediate reasoning. This is the opposite of what you'd expect from supervised fine-tuning, where outcome filtering is critical.

Why does outcome filtering hurt? The paper hypothesizes that SWiRL benefits from having both positive and negative examples of final answers, as long as the intermediate steps are sound. When you filter for correct outcomes only, you lose the negative examples that help the RL optimizer learn what not to do at the answer step.

Which filtering strategy gives SWiRL the best results, and why is this surprising?

Process-only filtering (keep trajectories where each step is reasonable, regardless of final answer correctness). This is surprising because SWiRL can learn from trajectories with wrong answers, unlike SFT, which needs correct outcomes
Outcome filtering (keep only correct answers)
No filtering at all

Chapter 5: RL Optimization

Stage 2 of SWiRL takes the filtered sub-trajectories and uses them to fine-tune the base model with RL. The objective is the expected sum of step-wise rewards:

J(θ) = E_{s~T, a~π_θ(s)} [R(a | s)]

Here π_θ is the base model being fine-tuned, T is the set of all states in the decomposed sub-trajectories, and R(a | s) is the reward from a generative reward model (Gemini 1.5 Pro) that scores each action given its context.

The reward signal

The reward model is a generative judge, not a trained classifier. For each sub-trajectory, Gemini 1.5 Pro is prompted: "Given this context, is this action reasonable?" It produces a binary judgment. No golden labels, no human annotations.

This is different from standard RLHF reward models in two important ways:

Contextual: The reward is conditioned on all previous steps, not just the prompt and final output
Process-based: It evaluates reasoning quality, not answer correctness
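
A generative judge can be turned into a binary reward with very little glue. The sketch below is an assumption-heavy illustration: the prompt wording, the GOOD/BAD reply format, and the `call_judge` callable are invented here and do not reproduce the paper's actual judge prompt or any real API.

```python
from typing import Callable

# Hypothetical prompt template; the paper's exact wording is not shown here.
JUDGE_TEMPLATE = (
    "You are reviewing one step of a multi-step tool-use trajectory.\n"
    "Context so far:\n{context}\n\n"
    "Proposed next action:\n{action}\n\n"
    "Is this action reasonable given the context? Reply GOOD or BAD."
)


def step_reward(context: str, action: str,
                call_judge: Callable[[str], str]) -> float:
    """Map the judge's free-text verdict to a binary reward R(a | s)."""
    reply = call_judge(JUDGE_TEMPLATE.format(context=context, action=action))
    return 1.0 if reply.strip().upper().startswith("GOOD") else 0.0


def fake_judge(prompt: str) -> str:
    # Stand-in for a call to the judge model; flags one off-topic keyword.
    return "BAD" if "pizza" in prompt else "GOOD"


context = "Who is older, Glenn Hughes or Ross Lynch?"
good = step_reward(context, "<search_query>age of Glenn Hughes</search_query>", fake_judge)
bad = step_reward(context, "<search_query>best pizza near me</search_query>", fake_judge)
```

Note that nothing in `step_reward` touches a gold answer. The reward depends only on the context and the proposed action.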

Policy gradient optimization

SWiRL uses the same policy gradient algorithm that was used to train Gemma 2. The key difference from standard RL is that optimization happens at the step level: each sub-trajectory is an independent training example. The model receives feedback on each individual action, not just the final answer.

Offline RL advantage: Because all trajectories are generated and scored offline (Stage 1), SWiRL doesn't need to interact with a live environment during training. The search results, calculator outputs, and reward scores are all pre-computed. This makes training reproducible and avoids the cost of running tools during RL optimization.
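
To see what step-level optimization means mechanically, here is a toy sketch. It is not the Gemma 2 algorithm, whose details the text does not give. It swaps in an exact policy gradient on a tiny tabular softmax policy, with made-up binary rewards for three candidate actions in each of two sub-trajectory contexts.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits: List[float], rewards: List[float]) -> float:
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def policy_gradient_step(logits: List[float], rewards: List[float],
                         lr: float) -> List[float]:
    """One exact gradient-ascent step on E_{a~pi}[R(a | s)] for one state."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    # For a softmax policy, d E[R] / d logit_j = pi_j * (R_j - E[R]).
    return [x + lr * p * (r - baseline)
            for x, p, r in zip(logits, probs, rewards)]


# Made-up step rewards: each key is one sub-trajectory context (state).
step_rewards: Dict[str, List[float]] = {
    "s1": [1.0, 0.0, 0.0],
    "s2": [0.0, 1.0, 1.0],
}
policy = {state: [0.0, 0.0, 0.0] for state in step_rewards}


def objective(current: Dict[str, List[float]]) -> float:
    """Average expected step reward over all states, mirroring J(theta)."""
    values = [expected_reward(current[s], r) for s, r in step_rewards.items()]
    return sum(values) / len(values)


before = objective(policy)
for _ in range(200):
    for state, rewards in step_rewards.items():
        policy[state] = policy_gradient_step(policy[state], rewards, lr=0.4)
after = objective(policy)
```

Every state is updated from its own reward, independently of the others, which is the point of the decomposition. A real implementation would estimate this gradient from sampled actions and backpropagate through the language model instead of a lookup table.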

What the model learns

Through step-wise optimization, the model learns two types of skills simultaneously:

Local decision-making: At each step, what's the best next action? Should I search for more information, or do I have enough to answer? What query should I use?
Global trajectory optimization: How do I structure a multi-step plan? When do I stop searching? How do I synthesize information from multiple sources?

The paper confirms this by measuring average process label accuracy: SWiRL-trained models produce trajectories where each individual step is scored higher by the reward model, even on out-of-distribution tasks.

How does SWiRL's reward signal differ from standard RLHF?

SWiRL evaluates each action given its full context (process-based, per-step reward), while RLHF gives one reward for the entire response (outcome-based, single reward)
SWiRL uses human annotators while RLHF uses a model
SWiRL only rewards correct final answers

Chapter 6: Results

SWiRL is evaluated on 5 benchmarks: HotPotQA, CofCA, MuSiQue, BeerQA (all multi-hop QA), and GSM8K (math reasoning). The model is Gemma 2-27B trained on HotPotQA trajectories with process filtering.

Figure: SWiRL Improvements Over Base Model. Relative accuracy improvements of SWiRL (Gemma 2-27B) over the base model across 5 benchmarks. The multi-hop QA results come from training on HotPotQA only; the GSM8K result comes from training on math data.

The numbers are striking:

GSM8K: +21.5% relative accuracy (0.65 → 0.79, base to SWiRL; unlike the rows below, this figure comes from training on math data)
HotPotQA: +12.3% relative accuracy (0.65 → 0.73)
CofCA: +14.8% relative accuracy (0.54 → 0.62)
MuSiQue: +11.1% relative accuracy (0.45 → 0.50)
BeerQA: +15.3% relative accuracy (0.59 → 0.68)
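
"Relative accuracy" here means the gain divided by the base score, not the raw difference in points. A quick check of the figures above, using only the accuracies quoted in the list (the `relative_gain` helper is a name made up for this snippet):

```python
# (base accuracy, SWiRL accuracy) pairs copied from the list above.
scores = {
    "GSM8K": (0.65, 0.79),
    "HotPotQA": (0.65, 0.73),
    "CofCA": (0.54, 0.62),
    "MuSiQue": (0.45, 0.50),
    "BeerQA": (0.59, 0.68),
}


def relative_gain(base: float, new: float) -> float:
    """Improvement as a fraction of the base score."""
    return (new - base) / base


gains = {name: round(100 * relative_gain(base, new), 1)
         for name, (base, new) in scores.items()}
```

So HotPotQA's move from 0.65 to 0.73 is 8 points absolute but about 12.3% relative.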

SWiRL vs. SFT

The paper finds that supervised fine-tuning (SFT) on the same trajectories actually degrades performance compared to the base model. This is consistent with prior work showing SFT can harm reasoning capabilities. The key difference: SFT memorizes specific trajectories, while SWiRL learns general step-wise reasoning skills.

SFT memorizes, SWiRL generalizes. SFT performs best with process+outcome filtered data (the cleanest examples to memorize). SWiRL performs best with process-only filtered data (a richer set including negative outcomes). This asymmetry reveals a fundamental difference: SFT teaches patterns by imitation, while RL teaches strategies by rewarding good decisions.

SWiRL vs. proprietary models

On HotPotQA, SWiRL Gemma 2-27B (76.9 partial match) outperforms GPT-4 (74.8), GPT-3.5 (62.8), and Bing Chat (72.1), and it is level with o1-preview (76.9 vs. 76.9). A 27B open-source model matching frontier proprietary models on multi-hop QA: that's the power of step-wise optimization.

Why does SFT on multi-step trajectories actually hurt performance while SWiRL improves it?

SFT memorizes specific trajectories and overfits to training patterns, while SWiRL learns general step-wise reasoning through per-step reward maximization, enabling it to generalize to new scenarios
SFT uses a different optimizer
SFT trains for fewer epochs

Chapter 7: Cross-Task Generalization

The most exciting result in the paper isn't the in-domain numbers. It's this:

Train on QA, improve on math. SWiRL trained only on HotPotQA (text question-answering with a search engine) improves zero-shot performance on GSM8K (mathematical reasoning with a calculator) by 16.9% relative. The two tasks use different domains, different tools, and different reasoning patterns. Yet step-wise RL transfers.

Figure: Cross-Task Transfer. Each cell shows accuracy; rows are training data and columns are evaluation benchmarks. Green marks an improvement over the base model.

The transfer goes both ways:

Train on HotPotQA (QA) → GSM8K (math) improves by +16.9%
Train on GSM8K (math) → HotPotQA (QA) improves by +9.2%
Train on HotPotQA → BeerQA improves by +15.3%, CofCA by +14.8%, MuSiQue by +11.1%

Why does this work? The paper's hypothesis: SWiRL doesn't just teach task-specific skills. It teaches general multi-step reasoning — when to search, how to formulate queries, when to stop, how to synthesize. These meta-skills transfer across domains because the structure of multi-step problem-solving is shared even when the content differs.

The paper supports this with process label analysis. After SWiRL training on HotPotQA, the model's average process label accuracy improves from 87.5% to 91.6% on GSM8K trajectories — an out-of-distribution task. The model generates better intermediate steps even on problems it wasn't trained on.

Smaller models don't generalize as well

The generalization story depends on scale. Gemma 2-27B shows strong cross-task transfer. Gemma 2-9B and 2-2B can benefit from in-domain SWiRL (training and testing on the same task), but don't display the same zero-shot generalization to new tasks. This suggests that a certain model capacity is needed to extract and transfer abstract reasoning patterns.

Why does SWiRL training on text QA improve math performance despite using different domains and tools?

SWiRL teaches general multi-step reasoning skills (when to use tools, how to formulate queries, when to synthesize) that transfer across domains because the structure of multi-step problem solving is shared
The math and QA datasets share many of the same questions
The model memorizes the search results and applies them to math

Chapter 8: Iterative Training

SWiRL isn't a one-shot process. The paper describes an iterative loop: generate trajectories, filter, train with RL, then use the improved model to generate better trajectories, and repeat.

Round 1: Generate

Base model (Gemma 2-27B) generates 50K trajectories with tool access. Filter with process judge.

↓

Round 1: Train

Decompose into sub-trajectories. Apply step-wise RL. Produce SWiRL model v1.

↓

Round 2: Generate

SWiRL v1 generates new trajectories. These are higher quality since the model is better at multi-step reasoning.

↓

Round 2: Train

Decompose and train again. SWiRL v2 improves further.

↓

Repeat

Continue until convergence or diminishing returns.

This iterative approach has a self-improving dynamic: as the model gets better at multi-step reasoning, it generates better training data, which further improves the model. It's reminiscent of expert iteration (ExIt) and AlphaZero's self-play loop, adapted for language model reasoning.
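
The control flow of this loop is simple enough to sketch. Everything below is a placeholder: `generate`, `keep`, and `train` are hypothetical callables standing in for trajectory generation, process filtering, and step-wise RL, and the "model" is just a version label.

```python
from typing import Callable, List, Tuple

SubTrajectory = Tuple[str, str]  # (context, action)


def iterate(model: str,
            generate: Callable[[str], List[SubTrajectory]],
            keep: Callable[[SubTrajectory], bool],
            train: Callable[[str, List[SubTrajectory]], str],
            rounds: int) -> List[str]:
    """Run generate -> filter -> train for a fixed number of rounds."""
    versions = [model]
    for _ in range(rounds):
        candidates = generate(versions[-1])       # newest model makes the data
        kept = [pair for pair in candidates if keep(pair)]
        versions.append(train(versions[-1], kept))
    return versions


# Toy stand-ins that only exercise the control flow.
kept_counts: List[int] = []


def toy_generate(model: str) -> List[SubTrajectory]:
    return [(f"context from {model}", "good action"),
            (f"context from {model}", "bad action")]


def toy_keep(pair: SubTrajectory) -> bool:
    return pair[1].startswith("good")


def toy_train(model: str, data: List[SubTrajectory]) -> str:
    kept_counts.append(len(data))
    return f"v{int(model[1:]) + 1}"


versions = iterate("v0", toy_generate, toy_keep, toy_train, rounds=2)
```

The one design point worth noticing is that each round generates from `versions[-1]`, so the data always comes from the most recently trained model.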

Scaling the dataset

The paper measures how SWiRL performance scales with dataset size:

100 trajectories: Insufficient. The model can't generalize from so few examples
1,000 trajectories: Solid gains across all benchmarks, even out-of-distribution. This is a remarkably data-efficient method.
10,000 trajectories: Further improvements, with consistent gains on both in-distribution and out-of-distribution tasks

Data efficiency: 1,000 trajectories is enough for significant improvement. At ~3 steps per trajectory, that's about 3,000 sub-trajectories. For a 27B parameter model, this is remarkably little data — suggesting that the step-wise decomposition creates a much richer training signal per trajectory than single-step approaches.

Comparison against the reward model

A natural question: is SWiRL just distilling the reward model (Gemini 1.5 Pro) into Gemma 2-27B? The paper shows that SWiRL actually outperforms Gemini 1.5 Pro on some out-of-distribution benchmarks (CofCA and BeerQA). This means SWiRL isn't just copying the reward model — it's extracting general reasoning patterns that the reward model itself hasn't mastered.

How does SWiRL demonstrate that it's not simply distilling the reward model?

SWiRL-trained Gemma 2-27B outperforms the reward model (Gemini 1.5 Pro) on some out-of-distribution benchmarks, showing it extracts general reasoning patterns beyond what the reward model provides
SWiRL uses a different architecture than the reward model
The reward model is not involved in training

Chapter 9: Connections

What SWiRL builds on

RLHF (Christiano et al., 2017): The foundation of preference-based RL for LLMs. SWiRL extends RLHF from single-step to multi-step by decomposing trajectories into sub-trajectories.

DPO (Rafailov et al., 2023): Showed that RL objectives can be optimized directly on preference data. SWiRL takes this further by applying the optimization step-wise, not on full responses.

ReAct (Yao et al., 2023): The reasoning-and-acting paradigm that interleaves thought and tool use. SWiRL's trajectory format follows ReAct's thought-action-observation pattern.

RLEF (Gehring et al., 2025): RL from Execution Feedback — uses environment signals (like code test results) as reward. SWiRL goes beyond execution feedback by using a generative judge for process-level evaluation.

Related multi-step RL approaches

DQO (Liu et al., 2024): Offline RL for multi-step reasoning, but operates at the token level. SWiRL works at the step level, which the paper argues is more effective.

OREO (Wang et al., 2024): Multi-step offline RL that requires training a separate value network alongside the policy. SWiRL avoids this by using a generative reward model, making it simpler and cheaper.

Math-Shepherd (Wang et al., 2023): Process reward model for math reasoning. SWiRL generalizes this idea to tool use and multi-hop QA, not just math.

PRIME (Cui et al., 2025): Online multi-step RL, but doesn't support tool use or offline training. SWiRL's offline approach enables pre-computing all tool results.

What SWiRL enables

Agentic RL at scale: As models increasingly act as agents — browsing the web, writing code, using APIs — SWiRL's step-wise approach becomes the natural training paradigm. Each tool call is a distinct RL action.

Cross-domain agents: The generalization results suggest that step-wise RL could train general-purpose agents that transfer reasoning skills across tools and domains.

Self-improving agents: The iterative generation-training loop is a template for agents that improve from their own experience, generating progressively better trajectories.

Cheat sheet

Core idea

Decompose multi-step trajectories into sub-trajectories. Apply RL on each step independently with a per-step reward.

Key finding

Process-only filtering beats outcome filtering. SWiRL learns from trajectories with wrong final answers.

Generalization

Train on QA → +16.9% on math. Step-wise RL teaches general multi-step reasoning.

Practical

Offline, no golden labels, 1K trajectories sufficient, outperforms proprietary models on multi-hop QA.

Objective

J(θ) = E_{s~T, a~π_θ(s)} [R(a | s)], where T is all states across all sub-trajectories.

What distinguishes SWiRL from prior multi-step RL methods like DQO and OREO?

SWiRL operates at the step level (not token level like DQO), uses a generative reward model instead of a separate value network (unlike OREO), supports tool use, works offline, and requires no golden labels
SWiRL uses a larger model
SWiRL trains for more epochs

Step-Wise Reinforcement Learning