Search-o1 — Veanors

Chapter 0: The Problem

Large reasoning models like OpenAI's o1, QwQ, and DeepSeek-R1 can think for a long time. They decompose hard problems into dozens of steps, check their own work, and backtrack when stuck. This extended chain-of-thought is their superpower.

But it's also their Achilles' heel. Every step in a long reasoning chain is an opportunity to introduce a factual error. And when you're reasoning for hundreds of tokens, you will encounter knowledge gaps — questions whose answers aren't in the model's parameters.

Consider a real example from the paper. A chemistry problem asks: "What is the carbon atom count of Product 3?" where Product 3 is the result of three sequential reactions starting from trans-cinnamaldehyde. The reasoning model needs to know the structure of trans-cinnamaldehyde. It doesn't. So it guesses:

"Perhaps the structure of trans-Cinnamaldehyde is C₆H₅CH=CH-CO-CH₃."

This guess is wrong. The real structure is C₆H₅CH=CHCHO. That single error cascades through all subsequent reaction steps, producing an incorrect final answer of 10 carbon atoms instead of the correct 11.

The authors measured how often this happens. They analyzed QwQ-32B-Preview on the GPQA diamond set (PhD-level science questions) and counted uncertainty words — "perhaps," "alternatively," "wait," "likely" — in the model's reasoning chains. The results are striking:

Uncertainty word	Avg. occurrences per output
"perhaps"	30.4
"alternatively"	26.4
"wait"	15.8
"likely"	7.8

On average, the model says "perhaps" 30 times per reasoning chain on hard science problems. Each "perhaps" is a moment where the model knows it doesn't know — and decides to guess anyway.
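The counting behind the table above can be sketched in a few lines. This is a minimal illustration, not the authors' script: the function names are hypothetical, and it assumes simple case-insensitive whole-word matching, which the paper may or may not have used.

```python
import re
from collections import Counter

# The four marker words reported in the table above.
UNCERTAINTY_WORDS = ("perhaps", "alternatively", "wait", "likely")

def count_uncertainty_words(text: str) -> dict:
    """Count whole-word, case-insensitive occurrences of each marker word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in UNCERTAINTY_WORDS)
    return {w: counts[w] for w in UNCERTAINTY_WORDS}

def average_counts(outputs: list) -> dict:
    """Average occurrences per output across a list of reasoning chains."""
    totals = Counter()
    for out in outputs:
        totals.update(count_uncertainty_words(out))
    return {w: totals[w] / len(outputs) for w in UNCERTAINTY_WORDS}

sample = "Perhaps the structure is X. Wait, alternatively it is Y. Perhaps not."
print(count_uncertainty_words(sample))
```

Whole-word matching keeps "unlikely" from being counted as "likely"; a substring count would inflate that row.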

You might think: just use RAG. Retrieve relevant documents before the model starts reasoning. But the authors show this doesn't work well either. Standard RAG retrieves documents based on the original question, but in complex multi-step reasoning, the knowledge needed at step 15 is completely different from what was needed at step 1. A single upfront retrieval can't anticipate every knowledge gap along a branching chain of thought.

Why does standard RAG fail to help reasoning models on complex multi-step problems?

Standard RAG retrieves too many documents
Standard RAG retrieves documents based on the original question, but different reasoning steps need different knowledge, so a single upfront retrieval can't cover all gaps
Standard RAG models are too slow

Chapter 1: The Key Insight

The core idea of Search-o1 is deceptively simple: let the reasoning model decide when to search.

Instead of retrieving documents once before reasoning begins, embed the search capability inside the reasoning loop. When the model reaches a point where it's uncertain — where it would normally write "perhaps" or "alternatively" — it instead emits a search query. The system retrieves documents, processes them, and injects the refined information back into the reasoning chain. Then the model continues where it left off.

This is the difference between giving someone a textbook before an exam (standard RAG) versus letting them look things up during the exam exactly when they need to (Search-o1).

But there's a subtlety. If you just dump raw retrieved documents into the middle of a reasoning chain, you break the model's flow. Retrieved web pages are verbose, full of navigation elements, ads, and tangential information. A 2000-token web page crammed between reasoning step 14 and step 15 will derail the chain of thought.

The second insight — Reason-in-Documents: Before injecting retrieved content into the reasoning chain, run a separate analysis pass. A dedicated module reads the documents, extracts only the relevant facts for the current query, and produces a clean, concise summary. This summary — not the raw documents — is what gets injected into the reasoning chain.

The full Search-o1 framework thus has two innovations stacked together:

Agentic RAG: The model autonomously triggers search during reasoning, as many times as needed
Reason-in-Documents: A separate module that distills raw documents into clean facts before injection

Together, these let the reasoning model access the world's knowledge on-demand without losing its train of thought.

What are the two key innovations in Search-o1?

Agentic RAG (model decides when to search during reasoning) and Reason-in-Documents (separate module filters raw documents before injection into the reasoning chain)
Fine-tuning the model on search results and using a larger context window
MCTS-based search and reward model scoring

Chapter 2: Agentic Search Workflow

Let's trace the entire Search-o1 pipeline step by step. The model receives a question and begins reasoning. At any point, it can emit a special token sequence to trigger a search.

The special tokens

Search-o1 uses four special token boundaries:

Token	Purpose
<|begin_search_query|>	Start of a search query emitted by the model
<|end_search_query|>	End of query; triggers the retrieval system
<|begin_search_result|>	Start of injected search results
<|end_search_result|>	End of injected results; model resumes reasoning

The pipeline, step by step

1. Start reasoning

Model receives instruction I + question q, begins generating chain of thought R

↓

2. Encounter knowledge gap

Model generates <|begin_search_query|> structure of trans-cinnamaldehyde <|end_search_query|>

↓

3. Pause generation

System detects <|end_search_query|>, extracts the query string

↓

4. Retrieve documents

Bing Web Search API returns top-k documents D = {d₁, d₂, ..., d_k}

↓

5. Reason-in-Documents

Separate module analyzes D given the query + prior reasoning → produces concise r_final

↓

6. Inject and resume

Insert <|begin_search_result|> r_final <|end_search_result|> into R, model continues generating

↓

7. Repeat or finish

Steps 2–6 can repeat multiple times. When model generates EOS, output final answer a.

Crucially, this is iterable. A single reasoning session can trigger multiple searches. The chemistry example might search for the structure of trans-cinnamaldehyde, then later search for the product of a Grignard reaction, then later search for DMSO reactivity. Each search is triggered by the model itself, at exactly the moment it needs external knowledge.
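The seven steps above can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `generate`, `search`, and `refine` are hypothetical callables standing in for the reasoning model, the web search API, and the Reason-in-Documents module, and `generate` is assumed to stop either at EOS or right after emitting the end-of-query token.

```python
from typing import Callable, List, Optional

BEGIN_Q, END_Q = "<|begin_search_query|>", "<|end_search_query|>"
BEGIN_R, END_R = "<|begin_search_result|>", "<|end_search_result|>"

def extract_query(text: str) -> Optional[str]:
    """Return the pending query if the text ends with a complete query block."""
    if not text.rstrip().endswith(END_Q):
        return None
    start, end = text.rfind(BEGIN_Q), text.rfind(END_Q)
    if start == -1 or start > end:
        return None
    return text[start + len(BEGIN_Q):end].strip()

def run_search_o1(
    prompt: str,
    generate: Callable[[str], str],                # hypothetical: stops at END_Q or EOS
    search: Callable[[str], List[str]],            # hypothetical: query -> documents
    refine: Callable[[str, str, List[str]], str],  # hypothetical: Reason-in-Documents
    max_searches: int = 5,
) -> str:
    reasoning, searches = "", 0
    while True:
        reasoning += generate(prompt + reasoning)      # steps 1-2
        query = extract_query(reasoning)               # step 3
        if query is None or searches >= max_searches:
            return reasoning                           # step 7: EOS or budget spent
        docs = search(query)                           # step 4
        r_final = refine(reasoning, query, docs)       # step 5
        reasoning += f"\n{BEGIN_R} {r_final} {END_R}\n"  # step 6
        searches += 1

# Toy run with a scripted "model" that searches once, then finishes.
script = iter([
    f"I need the structure. {BEGIN_Q} structure of trans-cinnamaldehyde {END_Q}",
    " So the starting material has 9 carbons.",
])
trace = run_search_o1(
    "Q: carbon count of Product 3? ",
    generate=lambda context: next(script),
    search=lambda q: ["page about " + q],
    refine=lambda r, q, d: "C6H5CH=CHCHO",
)
print(trace)
```

In this sketch a run that exhausts its search budget simply stops; a real system would need to decide how to let the model finish its answer in that case.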

Formal generation

The complete generation can be written as:

P(R, a | I, q) = ∏_{t=1..T_r} P(R_t | R_<t, I, q, {r_final^(j)}_{j ≤ i(t)}) · ∏_{t=1..T_a} P(a_t | a_<t, R, I, q)

Where {r_final^(j)}_{j ≤ i(t)} represents all the refined search results injected up to the current reasoning step t. The model conditions on both its own prior reasoning and all previously retrieved knowledge.

What happens when the system detects <|end_search_query|> in the model's output?

Generation pauses, the query is extracted, documents are retrieved, Reason-in-Documents produces a refined summary, and the result is injected back before generation resumes
The model restarts its reasoning from scratch with the new information
The raw search results are directly appended to the output

Chapter 3: Reason-in-Documents

This is the showcase innovation of the paper. Without it, agentic RAG actually hurts performance on some tasks. With it, Search-o1 consistently outperforms all baselines.

Why raw documents break reasoning

There are two problems with injecting raw retrieved documents into a reasoning chain:

Redundancy: A web page about trans-cinnamaldehyde contains its molecular formula, its use in cinnamon flavoring, its CAS number, its safety data sheet, synthesis routes, and twenty paragraphs of tangential chemistry. The model only needs one fact: the molecular structure. Everything else is noise that dilutes the reasoning chain.
Limited document understanding: LRMs have been specifically aligned for reasoning tasks during pre-training and post-training. This alignment has, to some degree, caused catastrophic forgetting of general capabilities like long-document comprehension. Asking a reasoning model to also be a perfect document summarizer mid-thought is asking too much.

The Reason-in-Documents module

The solution is elegant: run a separate generation pass dedicated entirely to document analysis. This module receives three inputs:

Input 1: Previous reasoning steps

R_<i — everything the model has reasoned so far, for context

Input 2: Current search query

q_search⁽ⁱ⁾ — what the model is looking for, e.g. "structure of trans-cinnamaldehyde"

Input 3: Retrieved documents

D⁽ⁱ⁾ = {d₁, d₂, ..., d_k} — raw web pages from Bing

↓

Reason-in-Documents generation

Separate LLM call analyzes documents, extracts relevant facts → r_docs⁽ⁱ⁾

↓

Final refined output

r_final⁽ⁱ⁾ — concise, relevant knowledge injected into the main reasoning chain

The actual prompt

Here is the core of the Reason-in-Documents instruction, paraphrased from the appendix:

Task: Read and analyze web pages based on:
  - Previous Reasoning Steps
  - Current Search Query
  - Searched Web Pages

Guidelines:
1. Analyze each web page carefully
2. Identify factual information relevant to the
   Current Search Query
3. Select information that advances the Previous
   Reasoning Steps

Output format:
If helpful info found:
  Final Information
  [Helpful information]

If no helpful info found:
  Final Information
  No helpful information found.
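To make the three inputs and the output format concrete, here is a minimal sketch of assembling such a prompt and pulling the refined text out of the module's reply. The function names and the exact wording of the template are assumptions for illustration; only the three input sections and the "Final Information" marker come from the paraphrased prompt above.

```python
from typing import List

MARKER = "Final Information"
NO_INFO = "No helpful information found."

def build_reason_in_documents_prompt(
    prev_reasoning: str, query: str, documents: List[str]
) -> str:
    """Assemble the three inputs into one prompt (wording is illustrative)."""
    pages = "\n\n".join(
        f"[Web page {i}]\n{doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Task: Read and analyze web pages based on the inputs below.\n\n"
        f"Previous Reasoning Steps:\n{prev_reasoning}\n\n"
        f"Current Search Query:\n{query}\n\n"
        f"Searched Web Pages:\n{pages}\n\n"
        f"Answer with a line '{MARKER}' followed by the helpful information, "
        f"or by '{NO_INFO}'"
    )

def parse_final_information(output: str) -> str:
    """Keep only what follows the last 'Final Information' marker."""
    _, found, tail = output.rpartition(MARKER)
    if not found:
        return NO_INFO  # the module did not follow the format
    return tail.strip(" \t\r\n:*") or NO_INFO

reply = "The first page gives the formula.\nFinal Information\nTrans-cinnamaldehyde is C6H5CH=CHCHO."
print(parse_final_information(reply))
```

Treating a reply without the marker as "no helpful information" is a conservative choice: nothing unvetted gets injected into the reasoning chain.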

Concrete example: before vs. after

Let's see what this looks like in practice. The model is reasoning about a chemistry problem and searches for "structure of trans-cinnamaldehyde."

Raw retrieved documents (what the web returns):
"Trans-cinnamaldehyde, or (E)-cinnamaldehyde, is an organic compound with the formula C₆H₅CH=CHCHO. It is the E(trans) stereoisomer of cinnamaldehyde and is the principal component of the essential oil of cinnamon bark. It is a pale yellow viscous liquid with a sweet, warm, spicy smell. It is used as a flavoring agent, in perfumery, and has antimicrobial properties. CAS number: 14371-10-9. Molecular weight: 132.16 g/mol. Boiling point: 248 °C. The compound was first isolated in 1834 by Dumas and Péligot from cinnamon oil..."

Plus 9 more documents of similar length — easily 5,000+ tokens of raw text.

After Reason-in-Documents:
"Trans-cinnamaldehyde has the structure C₆H₅CH=CHCHO. It is a phenyl group attached to a propenal chain (3 carbons including the aldehyde). Total carbon count: 9 (6 from phenyl + 3 from the propenal chain)."

Clean, relevant, directly useful for the next reasoning step. ~40 tokens instead of ~5,000.

This compression of roughly two orders of magnitude (about 125x for these illustrative figures), from thousands of tokens of raw web content to tens of tokens of distilled fact, is what preserves reasoning coherence.

Why does Search-o1 use a separate Reason-in-Documents module instead of injecting raw retrieved documents?

Raw documents are verbose and noisy, disrupting reasoning coherence; the module extracts only the relevant facts, reducing ~5000 tokens to ~40, preserving the chain-of-thought flow
Raw documents are too expensive to store in memory
The reasoning model cannot read HTML

Chapter 4: Uncertainty Detection

How does the model know when to search? This is the "agentic" part of the system. The model doesn't have an explicit uncertainty detector — instead, the instruction prompt teaches it to recognize its own knowledge gaps.

The system instruction

The key part of the instruction given to the reasoning model:

You are a reasoning assistant with the ability
to perform web searches.

To perform a search: write
  <|begin_search_query|> your query
  <|end_search_query|>

The system will search and analyze relevant web
pages, then provide you with helpful information.

You can repeat the search process multiple times
if necessary. The maximum number of search
attempts is limited to {MAX_SEARCH_LIMIT}.

That's it. The model is simply told it can search, and it learns through the chain-of-thought process when searching would be helpful. No separate classifier. No confidence threshold. No uncertainty metric.
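A short sketch of how the `{MAX_SEARCH_LIMIT}` placeholder and the search budget might be handled on the system side. The template text follows the instruction quoted above; the helper names and the idea of counting end-of-query tokens to track the budget are assumptions, not details taken from the paper.

```python
END_Q = "<|end_search_query|>"

INSTRUCTION_TEMPLATE = (
    "You are a reasoning assistant with the ability to perform web searches.\n"
    "To perform a search: write <|begin_search_query|> your query "
    "<|end_search_query|>\n"
    "The system will search and analyze relevant web pages, then provide you "
    "with helpful information.\n"
    "You can repeat the search process multiple times if necessary. "
    "The maximum number of search attempts is limited to {max_search_limit}.\n"
)

def build_instruction(max_search_limit: int) -> str:
    """Fill the search budget into the instruction text."""
    return INSTRUCTION_TEMPLATE.format(max_search_limit=max_search_limit)

def searches_used(reasoning: str) -> int:
    """Each completed query ends with exactly one END_Q token."""
    return reasoning.count(END_Q)

def may_search_again(reasoning: str, max_search_limit: int) -> bool:
    return searches_used(reasoning) < max_search_limit

print(build_instruction(5))
```

The prompt only tells the model about the limit; the orchestrating code still has to enforce it, which is what `may_search_again` is for.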

What triggers a search in practice?

From the paper's examples, searches are triggered when the model encounters:

Factual uncertainty: "I need the structure of trans-Cinnamaldehyde" — the model knows it doesn't know a specific fact
Domain-specific knowledge: "Let me look up epistaxis quickly" — specialized medical or scientific terminology
Verification needs: "Wait, perhaps it's referring to dimethyl sulfone, but that doesn't seem right" — the model suspects its own answer is wrong
Recall failures: "As I recall, Quinuclidine is a seven-membered ring... likely not having the required symmetry" — uncertain memory of a specific detail

The beauty of this approach is that the same reasoning capabilities that make LRMs powerful at problem-solving also make them good at recognizing their own limitations. A model that can say "perhaps" and "alternatively" and "wait" already knows when it's uncertain — it just needs permission and a mechanism to act on that uncertainty.

A practical benefit: Because the model controls when to search, Search-o1 is self-calibrating. On easy questions where the model is confident, it generates zero searches. On hard knowledge-intensive questions, it may trigger 5-10 searches. The compute cost naturally scales with problem difficulty.

How does Search-o1 decide when to trigger a search?

The model decides autonomously through the chain-of-thought process: when it recognizes a knowledge gap, it emits special search query tokens. No separate classifier is needed.
A separate uncertainty classifier monitors the model's hidden states and triggers search when confidence drops below a threshold
Search is triggered at fixed intervals during reasoning

Chapter 5: Query Formulation

When the model decides to search, the quality of its query determines the quality of retrieved documents. The model doesn't just copy the original question — it formulates a targeted search query based on the current reasoning context.

How queries are generated

The search query at step i is generated conditioned on everything the model has seen so far:

P(q_search⁽ⁱ⁾ | I, q, R^(i-1)) = ∏_t P(q_search,t⁽ⁱ⁾ | q_search,<t⁽ⁱ⁾, I, q, R^(i-1))

Where R^(i-1) includes all prior reasoning steps and all prior search results. This means later queries benefit from earlier retrievals — the model can ask progressively more specific questions.

Query evolution example

Consider a multi-step chemistry problem. The model's searches might evolve like this:

Step	Search query	Why this query?
Search 1	"structure of trans-cinnamaldehyde"	Basic fact needed for step 1 of reaction
Search 2	"Grignard reaction with aldehyde product"	Now knows the aldehyde structure, needs reaction outcome
Search 3	"DMSO oxidation of alcohol mechanism"	Knows Product 2 is an alcohol, needs to understand the final reaction

Each query is more specific than the last because it builds on information gathered from previous searches. This is fundamentally different from standard RAG, which gets exactly one shot at formulating a query from the original question alone.

Key difference from standard RAG: In standard RAG, you retrieve once based on the question "What is the carbon count of Product 3?" This generic query returns documents about counting carbon atoms, not about the specific chemical structures involved. In Search-o1, the model has already been reasoning about the specific reactions and knows exactly what fact it needs next.

Why are later search queries in Search-o1 typically more targeted than earlier ones?

The system automatically refines queries using a query optimizer
Later queries are conditioned on all prior reasoning steps AND prior search results, so the model can ask progressively more specific questions
The model is fine-tuned to generate better queries over time

Chapter 6: Integration into Reasoning

The injection point is critical. Search results don't replace reasoning — they augment it. The refined information from Reason-in-Documents slots into the reasoning chain at exactly the point where the model paused.

What the reasoning chain looks like

Here's a simplified view of an actual reasoning chain with search insertions:

[Reasoning]
Step 1: The problem asks for the carbon count
of Product 3. Let me trace the reactions...

Step 2: trans-Cinnamaldehyde + MeMgBr
I need the structure of trans-cinnamaldehyde.

<|begin_search_query|>
structure of trans-cinnamaldehyde
<|end_search_query|>

<|begin_search_result|>
Trans-cinnamaldehyde: C6H5CH=CHCHO
(phenyl + propenal, 9 carbons total)
<|end_search_result|>

Step 3: Now I know the structure has 9 carbons.
Adding MeMgBr (Grignard) to the aldehyde
gives an alcohol with 10 carbons...

[...reasoning continues with correct facts...]

Step N: Product 3 has 11 carbon atoms. ✓
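Because the special tokens delimit every query and result, a finished chain like the one above is easy to inspect after the fact. The following is a small sketch using a regular expression; the function names are hypothetical and the sample chain is abbreviated from the example.

```python
import re

# One query block immediately followed by its injected result block.
PAIR = re.compile(
    r"<\|begin_search_query\|>(.*?)<\|end_search_query\|>\s*"
    r"<\|begin_search_result\|>(.*?)<\|end_search_result\|>",
    re.DOTALL,
)

def extract_search_pairs(chain: str) -> list:
    """Return (query, refined_result) tuples in the order they occurred."""
    return [(q.strip(), r.strip()) for q, r in PAIR.findall(chain)]

def strip_search_blocks(chain: str) -> str:
    """Remove query and result blocks, leaving only the model's own reasoning."""
    return PAIR.sub("", chain)

chain = """Step 2: I need the structure of trans-cinnamaldehyde.

<|begin_search_query|>
structure of trans-cinnamaldehyde
<|end_search_query|>

<|begin_search_result|>
Trans-cinnamaldehyde: C6H5CH=CHCHO
<|end_search_result|>

Step 3: Now I know the structure has 9 carbons."""

print(extract_search_pairs(chain))
```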

The batch inference mechanism

In practice, Search-o1 processes multiple questions simultaneously. The inference algorithm (Algorithm 1 in the paper) maintains two sets:

S — set of unfinished sequences, actively being generated
F — set of finished sequences (model generated EOS)

At each iteration, the system generates all sequences in S in parallel until each either hits EOS or emits <|end_search_query|>. Sequences that need search are batched together for retrieval and Reason-in-Documents processing. Sequences that finish are moved to F. This batched approach maximizes GPU utilization even though different sequences trigger searches at different points.
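The S/F bookkeeping can be sketched as follows. This is a simplified illustration of the idea, not a transcription of Algorithm 1: `generate_batch`, `search`, and `refine` are hypothetical callables, searches are issued one by one rather than as a true batch, and the `max_rounds` cap is an added safeguard.

```python
from typing import Callable, Dict, List

BEGIN_Q, END_Q = "<|begin_search_query|>", "<|end_search_query|>"
BEGIN_R, END_R = "<|begin_search_result|>", "<|end_search_result|>"

def batch_inference(
    questions: List[str],
    generate_batch: Callable[[List[str]], List[str]],  # hypothetical batched decoder
    search: Callable[[str], List[str]],                # hypothetical retriever
    refine: Callable[[str, str, List[str]], str],      # hypothetical Reason-in-Documents
    max_rounds: int = 10,
) -> List[str]:
    unfinished: Dict[int, str] = {i: "" for i in range(len(questions))}  # set S
    finished: Dict[int, str] = {}                                        # set F
    for _ in range(max_rounds):
        if not unfinished:
            break
        ids = list(unfinished)
        # Generate every unfinished sequence in parallel until EOS or END_Q.
        chunks = generate_batch([questions[i] + unfinished[i] for i in ids])
        need_search = []
        for i, chunk in zip(ids, chunks):
            unfinished[i] += chunk
            if unfinished[i].rstrip().endswith(END_Q):
                need_search.append(i)
            else:
                finished[i] = unfinished.pop(i)  # hit EOS: move from S to F
        # Sequences that paused for search are handled together.
        for i in need_search:
            text = unfinished[i]
            query = text[text.rfind(BEGIN_Q) + len(BEGIN_Q):text.rfind(END_Q)].strip()
            r_final = refine(text, query, search(query))
            unfinished[i] += f"\n{BEGIN_R} {r_final} {END_R}\n"
    finished.update(unfinished)  # anything still open when rounds run out
    return [finished[i] for i in range(len(questions))]

def toy_generate_batch(contexts: List[str]) -> List[str]:
    """Scripted stand-in: 'hard' questions search once, others answer directly."""
    out = []
    for c in contexts:
        if BEGIN_R in c:
            out.append(" Answer: 11.")
        elif "hard" in c:
            out.append(f" {BEGIN_Q} lookup {END_Q}")
        else:
            out.append(" Answer: 4.")
    return out

results = batch_inference(
    ["easy question", "hard question"],
    toy_generate_batch,
    search=lambda q: ["page about " + q],
    refine=lambda reasoning, q, docs: "fact",
)
print(results)
```

In the toy run the easy question finishes in the first round while the hard one pauses for a search and finishes in the second, which is the situation the two sets are meant to handle.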

Implementation detail: The system uses Bing Web Search API for retrieval (top-k = 10 documents, US-EN region) and Jina Reader API to fetch full web page content. Generation uses temperature 0.7, top_p 0.8, top_k 20, repetition penalty 1.05, max 32,768 tokens. All experiments on 8× NVIDIA A800-80GB GPUs using QwQ-32B-Preview as the backbone.
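For reference, the settings listed above can be gathered into a small configuration object. The class and field names here are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class GenerationConfig:
    # Sampling settings as reported in the implementation detail above.
    temperature: float = 0.7
    top_p: float = 0.8
    top_k: int = 20
    repetition_penalty: float = 1.05
    max_tokens: int = 32768

@dataclass(frozen=True)
class RetrievalConfig:
    # Number of documents requested per search, and the search region.
    top_k_documents: int = 10
    region: str = "US-EN"

print(asdict(GenerationConfig()))
print(asdict(RetrievalConfig()))
```

Note that the two `top_k` values mean different things: one limits the sampling vocabulary at each decoding step, the other limits how many documents come back per query.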

How does Search-o1 handle batch inference when different sequences trigger searches at different points?

Sequences that need search are batched together for retrieval and Reason-in-Documents processing in parallel, while finished sequences are moved to a completed set, maximizing GPU utilization
Each sequence is processed entirely sequentially
All sequences must search at the same points

Chapter 7: Results

The paper evaluates Search-o1 on two categories: challenging reasoning tasks (science, math, coding) and open-domain QA (single-hop and multi-hop). Let's look at both.

PhD-level science QA (GPQA Diamond)

This is the flagship benchmark — 198 PhD-level multiple-choice questions in physics, chemistry, and biology.

Method	Physics	Chemistry	Biology	Overall
QwQ-32B (no retrieval)	75.6	39.8	68.4	58.1
RAG-QwQ-32B	76.7	38.7	73.7	58.6
RAgent-QwQ-32B	76.7	46.2	68.4	61.6
Search-o1 (Ours)	77.9	47.3	78.9	63.6

Search-o1 achieves 63.6% overall, a 5.5-point improvement over direct reasoning (58.1%) and a 2.0-point improvement over the next best retrieval method (RAgent at 61.6%).
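The Overall column is a question-weighted average of the three subject columns, which makes the table easy to sanity-check. The sketch below assumes the 198 questions split into 86 physics, 93 chemistry, and 19 biology; that split is an assumption not stated in the text above, and the helper name is hypothetical.

```python
# Assumed subject split of the 198 GPQA Diamond questions.
SUBJECT_COUNTS = {"physics": 86, "chemistry": 93, "biology": 19}

def overall_accuracy(per_subject: dict) -> float:
    """Question-weighted average of per-subject accuracies (in percent)."""
    total = sum(SUBJECT_COUNTS.values())
    return sum(per_subject[s] * n for s, n in SUBJECT_COUNTS.items()) / total

qwq = {"physics": 75.6, "chemistry": 39.8, "biology": 68.4}
search_o1 = {"physics": 77.9, "chemistry": 47.3, "biology": 78.9}

print(round(overall_accuracy(qwq), 1))
print(round(overall_accuracy(search_o1), 1))
print(round(overall_accuracy(search_o1) - overall_accuracy(qwq), 1))
```

Under that assumed split, the first and last rows of the table reproduce their reported Overall values of 58.1 and 63.6, and the gap comes out at the stated 5.5 points. Chemistry carries nearly half the weight, so a swing in that column moves the overall score the most.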

Math and coding benchmarks

Method	MATH500	AMC23	AIME24	LiveCodeBench
QwQ-32B	83.2	82.5	53.3	33.0
RAgent-QwQ	85.0	85.0	56.7	26.8
Search-o1	86.4	85.0	56.7	33.0

Search-o1 matches or exceeds all baselines on math. On coding (LiveCodeBench), it maintains performance where RAgent actually drops from 33.0 to 26.8, showing that Reason-in-Documents prevents the noise from raw documents from hurting code generation.

Comparison with human experts (GPQA Extended)

Method	Physics	Chemistry	Biology	Overall
Human experts (domain)	57.9	72.6	68.9	—
Human experts (overall)	—	—	—	39.9
Search-o1	68.7	40.7	69.5	57.9

Search-o1 outperforms overall human expert performance (57.9 vs 39.9) and beats domain experts in physics and biology. Chemistry remains the hardest domain for the model.

Multi-hop QA

On multi-hop QA tasks (HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle), the agentic RAG approach shines most. Search-o1 exceeds RAG-QwQ by an average of 29.6% EM on multi-hop tasks and RAgent-QwQ by 5.3%. The iterable search mechanism is perfectly suited for questions that require chaining facts from multiple sources.

Where does Search-o1 show the largest improvement over baselines?

Multi-hop QA tasks and PhD-level science QA, where iterable search and knowledge refinement are most needed for chaining facts across multiple sources
Simple math problems like MATH500
Code generation tasks

Chapter 8: Analysis

When does search help vs. hurt?

The paper reveals an important asymmetry. Agentic RAG (without Reason-in-Documents) helps reasoning models but can hurt non-reasoning models:

RAgent-QwQ (reasoning model + agentic RAG) outperforms direct QwQ on most tasks
RAgent-Qwen2.5 (non-reasoning model + agentic RAG) performs similarly to standard RAG on GPQA and decreases performance on math and code

Why? Non-reasoning models can't effectively use search as a tool during problem-solving. They lack the chain-of-thought structure that makes agentic search useful. This tells us that the combination of reasoning + retrieval is synergistic, not just additive.

Scaling with number of retrieved documents

The paper analyzes performance as the number of retrieved documents (top-k) varies from 1 to 10:

Key finding: Retrieving even one document with Search-o1's agentic approach outperforms standard RAG with 10 documents on overall GPQA performance. This means it's not about having more documents — it's about retrieving the right document at the right time.

Performance generally improves as k increases from 1 to 10, with chemistry showing the most sensitivity to document count (it needs more sources to find correct chemical structures). Physics and biology are more robust to document count variation.

Uncertainty reduction

The paper measures the frequency of uncertainty words before and after retrieval:

Uncertainty word	Direct reasoning	Standard RAG	Search-o1
"perhaps"	30.4	27.1	21.6
"alternatively"	26.4	21.6	11.9
"wait"	15.8	9.3	8.2
"likely"	7.8	3.2	2.6

Search-o1 reduces "perhaps" occurrences by 29% compared to direct reasoning and "alternatively" by 55%. The model is genuinely more confident when it has access to search-verified facts.
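The percentages follow directly from the table as relative reductions, (before − after) / before. A short worked check, using only the numbers in the table above:

```python
# (direct reasoning, standard RAG, Search-o1) average occurrences per output
COUNTS = {
    "perhaps": (30.4, 27.1, 21.6),
    "alternatively": (26.4, 21.6, 11.9),
    "wait": (15.8, 9.3, 8.2),
    "likely": (7.8, 3.2, 2.6),
}

def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100

for word, (direct, rag, search_o1) in COUNTS.items():
    print(
        f"{word:>13}: RAG {relative_reduction(direct, rag):5.1f}%  "
        f"Search-o1 {relative_reduction(direct, search_o1):5.1f}%"
    )
```

The same calculation gives roughly 48% for "wait" and 67% for "likely" with Search-o1, so the drop holds across all four words and is not specific to the two quoted above.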

Back-off strategy

A practical detail: when Search-o1 fails to produce a final answer (e.g., the reasoning chain gets too long or retrieval fails), the system falls back to the direct reasoning result. This keeps a failed run from being scored as a missing answer. It does not guarantee parity with the base model on every question, since a completed Search-o1 run can still arrive at a wrong answer.
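A minimal sketch of such a fallback is shown below. It assumes final answers are marked with `\boxed{...}`, which is a common convention for reasoning models but an assumption here, and the function names are hypothetical.

```python
import re
from typing import Optional

# Assumed answer marker: the last \boxed{...} in the output.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_answer(output: str) -> Optional[str]:
    """Return the last boxed answer, or None if the output never produced one."""
    matches = BOXED.findall(output)
    return matches[-1].strip() if matches else None

def answer_with_backoff(search_o1_output: str, direct_output: str) -> Optional[str]:
    """Prefer the Search-o1 answer; fall back to direct reasoning if it is missing."""
    answer = extract_answer(search_o1_output)
    if answer is not None:
        return answer
    return extract_answer(direct_output)

print(answer_with_backoff("the chain was cut off mid-thought", r"Thus \boxed{11}"))
print(answer_with_backoff(r"Final: \boxed{B}", r"Thus \boxed{C}"))
```

The fallback only fires when no answer is found; when both runs answer, the Search-o1 answer wins even if the direct one happened to be correct.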

Why does agentic RAG help reasoning models (QwQ) but not non-reasoning models (Qwen2.5)?

Non-reasoning models lack the chain-of-thought structure that enables effective use of search as a tool during problem-solving; reasoning + retrieval is synergistic, not just additive
Non-reasoning models have larger context windows
Non-reasoning models retrieve lower-quality documents

Chapter 9: Connections

Where Search-o1 sits in the landscape

Search-o1 connects to several major research threads:

Large Reasoning Models (o1, QwQ, DeepSeek-R1): Search-o1 doesn't replace these models — it augments them. The backbone is QwQ-32B-Preview, and the framework is designed to work with any model that produces extended chains of thought. As reasoning models improve, Search-o1's agentic search becomes more effective because better reasoners are also better at knowing when they need help.

Retrieval-Augmented Generation (RAG): Search-o1 evolves RAG from a one-shot retrieval into an iterative, model-controlled process. Standard RAG retrieves once before generation. Agentic RAG systems like MindSearch and ReAct let models decide when to search. Search-o1 adds the Reason-in-Documents layer on top, addressing the noise problem that all prior agentic RAG systems suffer from.

Tool-augmented reasoning (ReAct, LATS): ReAct synergizes reasoning and acting by interleaving thought and action steps. LATS (Language Agent Tree Search) applies tree search to language agents for complex planning. Search-o1 follows this tradition but specializes the "action" to be web search with a refinement layer, and targets the specific use case of knowledge-supplemented reasoning.

Self-RAG and adaptive retrieval: Self-RAG teaches models to retrieve, generate, and critique through self-reflection. Search-o1 takes a different approach — instead of training the model to know when to retrieve, it leverages the existing uncertainty-awareness in LRMs and adds an external retrieval mechanism at inference time. No additional training required.

The big picture: Search-o1 demonstrates that the right architecture for knowledge-augmented reasoning isn't a better single model — it's a system that decomposes the problem. The reasoning model reasons. The retrieval system retrieves. The Reason-in-Documents module filters. Each component does what it's best at, and the special tokens provide a clean interface between them.

Open questions:

Can the Reason-in-Documents module be a smaller, specialized model instead of the full LRM?
How does Search-o1 interact with multimodal reasoning (images, code execution)?
Could the search queries be optimized through reinforcement learning for better retrieval?
How does performance scale with the backbone model size (beyond 32B)?

What is the key architectural principle that makes Search-o1 effective?

Training a single model to be good at everything
Using more parameters than competing approaches
Decomposing the problem into specialized components (reasoning, retrieval, document filtering) connected by clean interfaces (special tokens), so each module does what it's best at

Search-o1: Agentic Search-Enhanced Large Reasoning Models