When a reasoning model hits a knowledge gap mid-thought, let it search the web, filter the results through a separate Reason-in-Documents module, and resume its chain of thought with clean, relevant facts — not raw document dumps.
Large reasoning models like OpenAI's o1, QwQ, and DeepSeek-R1 can think for a long time. They decompose hard problems into dozens of steps, check their own work, backtrack when stuck. This extended chain-of-thought is their superpower.
But it's also their Achilles' heel. Every step in a long reasoning chain is an opportunity to introduce a factual error. And when you're reasoning for hundreds of tokens, you will encounter knowledge gaps — questions whose answers aren't in the model's parameters.
Consider a real example from the paper. A chemistry problem asks: "What is the carbon atom count of Product 3?" where Product 3 is the result of three sequential reactions starting from trans-cinnamaldehyde. The reasoning model needs to know the structure of trans-cinnamaldehyde. It doesn't, so it guesses.
The authors measured how often this happens. They analyzed QwQ-32B-Preview on the GPQA diamond set (PhD-level science questions) and counted uncertainty words — "perhaps," "alternatively," "wait," "likely" — in the model's reasoning chains. The results are striking:
| Uncertainty word | Avg. occurrences per output |
|---|---|
| "perhaps" | 30.4 |
| "alternatively" | 26.4 |
| "wait" | 15.8 |
| "likely" | 7.8 |
On average, the model says "perhaps" 30 times per reasoning chain on hard science problems. Each "perhaps" is a moment where the model knows it doesn't know — and decides to guess anyway.
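Tallies like these are easy to reproduce on your own model outputs. The following is a minimal sketch, assuming case-insensitive whole-word matching; the exact counting procedure behind the table is not described here, and the function names are illustrative.

```python
import re
from collections import Counter

# Words tracked in the table above.
UNCERTAINTY_WORDS = ("perhaps", "alternatively", "wait", "likely")


def count_uncertainty_words(text: str) -> Counter:
    """Count whole-word, case-insensitive occurrences of each tracked word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in UNCERTAINTY_WORDS)
    # Keep every tracked word in the result, even when its count is 0.
    return Counter({w: counts.get(w, 0) for w in UNCERTAINTY_WORDS})


def average_counts(outputs: list[str]) -> dict[str, float]:
    """Average occurrences per output across a list of reasoning chains."""
    totals = Counter()
    for out in outputs:
        totals.update(count_uncertainty_words(out))
    n = max(len(outputs), 1)  # avoid dividing by zero on an empty list
    return {w: totals[w] / n for w in UNCERTAINTY_WORDS}


sample = "Wait, perhaps it is an aldehyde. Alternatively, perhaps a ketone is more likely."
print(count_uncertainty_words(sample))
```

Because matching is on whole words, "unlikely" does not count as "likely".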
You might think: just use RAG. Retrieve relevant documents before the model starts reasoning. But the authors show this doesn't work well either. Standard RAG retrieves documents based on the original question, but in complex multi-step reasoning, the knowledge needed at step 15 is completely different from what was needed at step 1. A single upfront retrieval can't anticipate every knowledge gap along a branching chain of thought.
The core idea of Search-o1 is deceptively simple: let the reasoning model decide when to search.
Instead of retrieving documents once before reasoning begins, embed the search capability inside the reasoning loop. When the model reaches a point where it's uncertain — where it would normally write "perhaps" or "alternatively" — it instead emits a search query. The system retrieves documents, processes them, and injects the refined information back into the reasoning chain. Then the model continues where it left off.
This is the difference between giving someone a textbook before an exam (standard RAG) versus letting them look things up during the exam exactly when they need to (Search-o1).
But there's a subtlety. If you just dump raw retrieved documents into the middle of a reasoning chain, you break the model's flow. Retrieved web pages are verbose, full of navigation elements, ads, and tangential information. A 2000-token web page crammed between reasoning step 14 and step 15 will derail the chain of thought.
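The full Search-o1 framework thus stacks two innovations: an agentic search workflow that lets the model issue queries from inside its reasoning loop, and a Reason-in-Documents module that distills the retrieved pages before anything is injected back into the chain.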
Together, these let the reasoning model access the world's knowledge on-demand without losing its train of thought.
Let's trace the entire Search-o1 pipeline step by step. The model receives a question and begins reasoning. At any point, it can emit a special token sequence to trigger a search.
Search-o1 uses four special token boundaries:
| Token | Purpose |
|---|---|
| <\|begin_search_query\|> | Start of a search query emitted by the model |
| <\|end_search_query\|> | End of query; triggers the retrieval system |
| <\|begin_search_result\|> | Start of injected search results |
| <\|end_search_result\|> | End of injected results; model resumes reasoning |
A single search round works as follows:

1. The model emits a query: <|begin_search_query|> structure of trans-cinnamaldehyde <|end_search_query|>
2. The system detects <|end_search_query|> and extracts the query string.
3. The system retrieves documents and refines them with Reason-in-Documents.
4. The system inserts <|begin_search_result|> $r_{\text{final}}$ <|end_search_result|> into R, and the model continues generating.

Crucially, this is iterable. A single reasoning session can trigger multiple searches. The chemistry example might search for the structure of trans-cinnamaldehyde, then later search for the product of a Grignard reaction, then later search for DMSO reactivity. Each search is triggered by the model itself, at exactly the moment it needs external knowledge.
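The trigger-and-inject loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` and `search_and_refine` are hypothetical callables standing in for the reasoning model (assumed to stop right after emitting the end-of-query token) and for retrieval plus Reason-in-Documents.

```python
from typing import Callable, Optional

BEGIN_QUERY = "<|begin_search_query|>"
END_QUERY = "<|end_search_query|>"
BEGIN_RESULT = "<|begin_search_result|>"
END_RESULT = "<|end_search_result|>"


def extract_pending_query(chain: str) -> Optional[str]:
    """Return the query if the chain ends with a completed search query, else None."""
    text = chain.rstrip()
    if not text.endswith(END_QUERY):
        return None
    end = len(text) - len(END_QUERY)
    start = text.rfind(BEGIN_QUERY, 0, end)
    if start == -1:
        return None
    return text[start + len(BEGIN_QUERY):end].strip()


def reason_with_search(
    question: str,
    generate: Callable[[str], str],                # hypothetical: context -> continuation
    search_and_refine: Callable[[str, str], str],  # hypothetical: (query, chain) -> refined info
    max_searches: int = 5,
) -> str:
    """Alternate between generation and search until the model stops asking."""
    chain = ""
    searches = 0
    while True:
        chain += generate(question + chain)
        query = extract_pending_query(chain)
        # Stop when the model finished without a query or the budget is spent.
        if query is None or searches >= max_searches:
            return chain
        searches += 1
        refined = search_and_refine(query, chain)
        # Inject the refined result so the model can resume where it paused.
        chain += f"\n{BEGIN_RESULT} {refined} {END_RESULT}\n"
```

The `max_searches` cap plays the role of a search budget; without it, a model that keeps emitting queries would loop forever.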
The complete generation can be written as:
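A sketch of that factorization, written to be consistent with the definitions in this section rather than quoted from the paper. The symbols $I$ (task instruction), $q$ (question), $a$ (final answer), and the lengths $T_r$ and $T_a$ are assumptions introduced for illustration; $R$ is the reasoning chain and $R_t$ its $t$-th token.

$$
P(R, a \mid I, q) \;=\; \prod_{t=1}^{T_r} P\big(R_t \mid R_{<t},\, I,\, q,\, \{r_{\text{final}}^{(j)}\}_{j \le i(t)}\big) \;\cdot\; \prod_{t=1}^{T_a} P\big(a_t \mid a_{<t},\, R,\, I,\, q\big)
$$

The first product generates the reasoning chain token by token, the second generates the answer once the chain is complete.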
Here, $\{r_{\text{final}}^{(j)}\}_{j \le i(t)}$ represents all the refined search results injected up to the current reasoning step $t$, with $i(t)$ the index of the latest search completed by that step. The model conditions on both its own prior reasoning and all previously retrieved knowledge.
The Reason-in-Documents module is the showcase innovation of the paper. Without it, agentic RAG actually hurts performance on some tasks. With it, Search-o1 matches or outperforms every baseline in the results reported below.
There are two problems with injecting raw retrieved documents into a reasoning chain:
The solution is elegant: run a separate generation pass dedicated entirely to document analysis. This module receives three inputs: the previous reasoning steps, the current search query, and the retrieved web pages.
Here is the core of the Reason-in-Documents instruction, paraphrased from the appendix:
> Task: Read and analyze web pages based on:
> - Previous Reasoning Steps
> - Current Search Query
> - Searched Web Pages
>
> Guidelines:
> 1. Analyze each web page carefully
> 2. Identify factual information relevant to the Current Search Query
> 3. Select information that advances the Previous Reasoning Steps
>
> Output format:
> - If helpful info found: Final Information [Helpful information]
> - If no helpful info found: Final Information No helpful information found.
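In code, the module comes down to assembling a prompt from those inputs and pulling out whatever follows the "Final Information" marker. The sketch below assumes the paraphrased wording above rather than the paper's exact prompt; `build_rid_prompt` and `parse_final_information` are hypothetical helper names.

```python
from typing import List

NO_INFO = "No helpful information found."


def build_rid_prompt(prev_reasoning: str, query: str, pages: List[str]) -> str:
    """Assemble a Reason-in-Documents style prompt (paraphrased wording)."""
    numbered = "\n\n".join(
        f"[Web page {i}]\n{page}" for i, page in enumerate(pages, start=1)
    )
    return (
        "Task: Read and analyze the web pages below based on the previous "
        "reasoning steps and the current search query.\n"
        "If helpful information is found, end with:\n"
        "Final Information\n[Helpful information]\n"
        f"Otherwise end with:\nFinal Information\n{NO_INFO}\n\n"
        f"Previous Reasoning Steps:\n{prev_reasoning}\n\n"
        f"Current Search Query:\n{query}\n\n"
        f"Searched Web Pages:\n{numbered}\n"
    )


def parse_final_information(output: str) -> str:
    """Return the text after the last 'Final Information' marker."""
    marker = "Final Information"
    idx = output.rfind(marker)
    if idx == -1:
        return NO_INFO  # the module did not follow the output format
    refined = output[idx + len(marker):].strip(" \n\t:*")
    return refined or NO_INFO
```

Only the parsed string is injected into the reasoning chain, so the main model never sees the raw pages or the module's intermediate analysis.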
Let's see what this looks like in practice. The model is reasoning about a chemistry problem and searches for "structure of trans-cinnamaldehyde."
This 100x compression — from thousands of tokens of raw web content to tens of tokens of distilled fact — is what preserves reasoning coherence.
How does the model know when to search? This is the "agentic" part of the system. The model doesn't have an explicit uncertainty detector — instead, the instruction prompt teaches it to recognize its own knowledge gaps.
The key part of the instruction given to the reasoning model:
> You are a reasoning assistant with the ability to perform web searches.
>
> To perform a search: write <|begin_search_query|> your query <|end_search_query|>
>
> The system will search and analyze relevant web pages, then provide you with helpful information.
>
> You can repeat the search process multiple times if necessary. The maximum number of search attempts is limited to {MAX_SEARCH_LIMIT}.
That's it. The model is simply told it can search, and it decides during its own chain of thought when searching would be helpful. No separate classifier. No confidence threshold. No uncertainty metric. No additional training.
From the paper's examples, searches are triggered when the model encounters:
The beauty of this approach is that the same reasoning capabilities that make LRMs powerful at problem-solving also make them good at recognizing their own limitations. A model that can say "perhaps" and "alternatively" and "wait" already knows when it's uncertain — it just needs permission and a mechanism to act on that uncertainty.
When the model decides to search, the quality of its query determines the quality of retrieved documents. The model doesn't just copy the original question — it formulates a targeted search query based on the current reasoning context.
The search query at step i is generated conditioned on everything the model has seen so far:
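A sketch of that conditional, written to match the description here rather than quoted from the paper. The symbols $q_{\text{search}}^{(i)}$ (the $i$-th search query), $I$ (task instruction), $q$ (original question), and $T_q^{(i)}$ (query length in tokens) are assumptions introduced for illustration.

$$
P\big(q_{\text{search}}^{(i)} \mid I,\, q,\, R^{(i-1)}\big) \;=\; \prod_{t=1}^{T_q^{(i)}} P\big(q_{\text{search},t}^{(i)} \mid q_{\text{search},<t}^{(i)},\, I,\, q,\, R^{(i-1)}\big)
$$

The query is generated token by token like any other text; the only thing that changes from one search to the next is the context it is conditioned on.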
Here, $R^{(i-1)}$ includes all prior reasoning steps and all prior search results. This means later queries benefit from earlier retrievals: the model can ask progressively more specific questions.
Consider a multi-step chemistry problem. The model's searches might evolve like this:
| Step | Search query | Why this query? |
|---|---|---|
| Search 1 | "structure of trans-cinnamaldehyde" | Basic fact needed for step 1 of reaction |
| Search 2 | "Grignard reaction with aldehyde product" | Now knows the aldehyde structure, needs reaction outcome |
| Search 3 | "DMSO oxidation of alcohol mechanism" | Knows Product 2 is an alcohol, needs to understand the final reaction |
Each query is more specific than the last because it builds on information gathered from previous searches. This is fundamentally different from standard RAG, which gets exactly one shot at formulating a query from the original question alone.
The injection point is critical. Search results don't replace reasoning — they augment it. The refined information from Reason-in-Documents slots into the reasoning chain at exactly the point where the model paused.
Here's a simplified view of an actual reasoning chain with search insertions:
> [Reasoning]
>
> Step 1: The problem asks for the carbon count of Product 3. Let me trace the reactions...
>
> Step 2: trans-Cinnamaldehyde + MeMgBr. I need the structure of trans-cinnamaldehyde.
>
> <|begin_search_query|> structure of trans-cinnamaldehyde <|end_search_query|>
>
> <|begin_search_result|> Trans-cinnamaldehyde: C6H5CH=CHCHO (phenyl + propenal, 9 carbons total) <|end_search_result|>
>
> Step 3: Now I know the structure has 9 carbons. Adding MeMgBr (Grignard) to the aldehyde gives an alcohol with 10 carbons...
>
> [...reasoning continues with correct facts...]
>
> Step N: Product 3 has 11 carbon atoms. ✓
In practice, Search-o1 processes multiple questions simultaneously. The inference algorithm (Algorithm 1 in the paper) maintains two sets: S, the sequences still being generated, and F, the sequences that have finished.
At each iteration, the system generates all sequences in S in parallel until each either hits EOS or emits <|end_search_query|>. Sequences that need search are batched together for retrieval and Reason-in-Documents processing. Sequences that finish are moved to F. This batched approach maximizes GPU utilization even though different sequences trigger searches at different points.
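The batched loop can be sketched as follows. This is a simplified illustration of the idea, not a transcription of Algorithm 1: `generate_batch` and `search_and_refine_batch` are hypothetical callables, and the `max_rounds` cap is an added safeguard.

```python
from typing import Callable, Dict, List, Optional

BEGIN_QUERY = "<|begin_search_query|>"
END_QUERY = "<|end_search_query|>"
BEGIN_RESULT = "<|begin_search_result|>"
END_RESULT = "<|end_search_result|>"


def pending_query(chain: str) -> Optional[str]:
    """Return the query if the chain ends with a completed search query, else None."""
    text = chain.rstrip()
    if not text.endswith(END_QUERY):
        return None
    end = len(text) - len(END_QUERY)
    start = text.rfind(BEGIN_QUERY, 0, end)
    return None if start == -1 else text[start + len(BEGIN_QUERY):end].strip()


def batch_inference(
    questions: List[str],
    generate_batch: Callable[[List[str]], List[str]],
    search_and_refine_batch: Callable[[List[str], List[str]], List[str]],
    max_rounds: int = 10,
) -> List[str]:
    S: Dict[int, str] = {i: "" for i in range(len(questions))}  # unfinished chains
    F: Dict[int, str] = {}                                      # finished chains
    for _ in range(max_rounds):
        if not S:
            break
        ids = list(S)
        # Generate every unfinished sequence in one batch.
        outputs = generate_batch([questions[i] + S[i] for i in ids])
        waiting: Dict[int, str] = {}
        for i, out in zip(ids, outputs):
            S[i] += out
            query = pending_query(S[i])
            if query is None:
                F[i] = S.pop(i)     # hit EOS: move to the finished set
            else:
                waiting[i] = query  # paused on a search query
        if waiting:
            wids = list(waiting)
            # Retrieve and refine for all paused sequences together.
            refined = search_and_refine_batch(
                [waiting[i] for i in wids], [S[i] for i in wids]
            )
            for i, info in zip(wids, refined):
                S[i] += f"\n{BEGIN_RESULT} {info} {END_RESULT}\n"
    F.update(S)  # sequences that ran out of rounds are returned unfinished
    return [F[i] for i in range(len(questions))]
```

Each round shrinks the generation batch to the sequences that are still open, so finished questions stop consuming compute.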
The paper evaluates Search-o1 on two categories: challenging reasoning tasks (science, math, coding) and open-domain QA (single-hop and multi-hop). Let's look at both.
GPQA Diamond is the flagship benchmark: 198 PhD-level multiple-choice questions in physics, chemistry, and biology.
| Method | Physics | Chemistry | Biology | Overall |
|---|---|---|---|---|
| QwQ-32B (no retrieval) | 75.6 | 39.8 | 68.4 | 58.1 |
| RAG-QwQ-32B | 76.7 | 38.7 | 73.7 | 58.6 |
| RAgent-QwQ-32B | 76.7 | 46.2 | 68.4 | 61.6 |
| Search-o1 (Ours) | 77.9 | 47.3 | 78.9 | 63.6 |
Search-o1 achieves 63.6% overall, a 5.5-point improvement over direct reasoning (58.1%) and a 2.0-point improvement over the next best retrieval method (RAgent at 61.6%).
| Method | MATH500 | AMC23 | AIME24 | LiveCodeBench |
|---|---|---|---|---|
| QwQ-32B | 83.2 | 82.5 | 53.3 | 33.0 |
| RAgent-QwQ | 85.0 | 85.0 | 56.7 | 26.8 |
| Search-o1 | 86.4 | 85.0 | 56.7 | 33.0 |
Search-o1 matches or exceeds all baselines on math. On coding (LiveCodeBench), it maintains performance where RAgent actually drops from 33.0 to 26.8, showing that Reason-in-Documents prevents the noise from raw documents from hurting code generation.
| Method | Physics | Chemistry | Biology | Overall |
|---|---|---|---|---|
| Human experts (domain) | 57.9 | 72.6 | 68.9 | — |
| Human experts (overall) | — | — | — | 39.9 |
| Search-o1 | 68.7 | 40.7 | 69.5 | 57.9 |
Search-o1 outperforms overall human expert performance (57.9 vs 39.9) and beats domain experts in physics and biology. Chemistry remains the hardest domain for the model.
On multi-hop QA tasks (HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle), the agentic RAG approach shines most. Search-o1 exceeds RAG-QwQ by an average of 29.6% EM on multi-hop tasks and RAgent-QwQ by 5.3%. The iterable search mechanism is perfectly suited for questions that require chaining facts from multiple sources.
The paper reveals an important asymmetry. Agentic RAG (without Reason-in-Documents) helps reasoning models but can hurt non-reasoning models:
Why? Non-reasoning models can't effectively use search as a tool during problem-solving. They lack the chain-of-thought structure that makes agentic search useful. This tells us that the combination of reasoning + retrieval is synergistic, not just additive.
The paper analyzes performance as the number of retrieved documents (top-k) varies from 1 to 10:
Performance generally improves as k increases from 1 to 10, with chemistry showing the most sensitivity to document count (it needs more sources to find correct chemical structures). Physics and biology are more robust to document count variation.
The paper measures the frequency of uncertainty words before and after retrieval:
| Uncertainty word | Direct reasoning | Standard RAG | Search-o1 |
|---|---|---|---|
| "perhaps" | 30.4 | 27.1 | 21.6 |
| "alternatively" | 26.4 | 21.6 | 11.9 |
| "wait" | 15.8 | 9.3 | 8.2 |
| "likely" | 7.8 | 3.2 | 2.6 |
Search-o1 reduces "perhaps" occurrences by 29% compared to direct reasoning and "alternatively" by 55%. The model is genuinely more confident when it has access to search-verified facts.
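The percentages follow directly from the table. A short check, using only the Direct reasoning and Search-o1 columns above:

```python
# Average occurrences per output, copied from the table above.
direct = {"perhaps": 30.4, "alternatively": 26.4, "wait": 15.8, "likely": 7.8}
search_o1 = {"perhaps": 21.6, "alternatively": 11.9, "wait": 8.2, "likely": 2.6}


def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


for word in direct:
    print(f"{word}: {relative_reduction(direct[word], search_o1[word]):.0f}%")
# perhaps: 29%
# alternatively: 55%
# wait: 48%
# likely: 67%
```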
A practical detail: when Search-o1 fails to produce a final answer (e.g., the reasoning chain gets too long or retrieval fails), the system falls back to the direct reasoning result. This keeps such failures from costing more than the base model would have lost anyway, although a completed Search-o1 chain can still reach a wrong answer where direct reasoning would have been right.
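The fallback logic is simple to express. A minimal sketch, assuming answers are marked with a hypothetical "Final answer:" prefix; the real answer format and extraction rules are not specified here.

```python
import re
from typing import Optional


def extract_final_answer(output: str) -> Optional[str]:
    """Return the last 'Final answer: ...' in the output, or None if absent.

    The 'Final answer:' marker is an assumption made for this sketch.
    """
    matches = re.findall(r"Final answer:\s*(.+)", output)
    return matches[-1].strip() if matches else None


def answer_with_fallback(search_o1_output: str, direct_output: str) -> Optional[str]:
    """Prefer the Search-o1 answer; use direct reasoning only if none was produced."""
    answer = extract_final_answer(search_o1_output)
    if answer is not None:
        return answer
    return extract_final_answer(direct_output)
```

The fallback triggers only when no answer can be extracted from the Search-o1 output; an extracted answer is always kept, whatever the direct reasoning run concluded.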
Search-o1 connects to several major research threads:
Large Reasoning Models (o1, QwQ, DeepSeek-R1): Search-o1 doesn't replace these models — it augments them. The backbone is QwQ-32B-Preview, and the framework is designed to work with any model that produces extended chains of thought. As reasoning models improve, Search-o1's agentic search becomes more effective because better reasoners are also better at knowing when they need help.
Retrieval-Augmented Generation (RAG): Search-o1 evolves RAG from a one-shot retrieval into an iterative, model-controlled process. Standard RAG retrieves once before generation. Agentic RAG systems like MindSearch and ReAct let models decide when to search. Search-o1 adds the Reason-in-Documents layer on top, addressing the noise problem that all prior agentic RAG systems suffer from.
Tool-augmented reasoning (ReAct, LATS): ReAct synergizes reasoning and acting by interleaving thought and action steps. LATS (Language Agent Tree Search) combines language agents with tree search for complex planning. Search-o1 follows this tradition but specializes the "action" to be web search with a refinement layer, and targets the specific use case of knowledge-supplemented reasoning.
Self-RAG and adaptive retrieval: Self-RAG teaches models to retrieve, generate, and critique through self-reflection. Search-o1 takes a different approach — instead of training the model to know when to retrieve, it leverages the existing uncertainty-awareness in LRMs and adds an external retrieval mechanism at inference time. No additional training required.
Open questions: