Patel, Arabzadeh et al. — Stanford / UC Berkeley — 2026

DeepScholar-Bench

A live benchmark and automated evaluation for generative research synthesis. No system exceeds 31% geometric mean across knowledge synthesis, retrieval quality, and verifiability — the hardest open problem in AI-assisted research.

Prerequisites: RAG basics + LLM evaluation + Information retrieval
10 Chapters · 5+ Simulations

Chapter 0: The Problem

Imagine you have just finished a research project. You need to write the related work section of your paper. This means finding every relevant prior paper, understanding what each one contributed, organizing them into coherent themes, and citing each claim precisely. It takes a domain expert days of careful work.

Now imagine asking an AI system to do it for you. The system must search the live web, retrieve papers, read them, synthesize their contributions into a coherent narrative, and attach correct citations to every claim. This is generative research synthesis — and it is one of the hardest knowledge tasks we can give an AI.

The core problem: A new class of AI systems (OpenAI DeepResearch, Perplexity Deep Research, Gemini Deep Research) claims to do this. But how do we measure whether they actually work? Existing benchmarks fall short. Question-answering benchmarks (SimpleQA, GAIA) test short factual answers. Expert-curated datasets go stale within months and leak into training data. No prior benchmark captures what research synthesis actually requires.

The problem breaks into three parts that no existing benchmark handles simultaneously:

  1. Knowledge synthesis — Can the system surface the key facts and organize them coherently? Not just retrieving information, but weaving it into a narrative that a human researcher would recognize as useful.
  2. Retrieval quality — Did the system find the right papers? Not just topically relevant ones, but the important, high-impact sources that a domain expert would cite.
  3. Verifiability — Can you actually check the system's claims? Does each citation support what the sentence says? Are there claims floating without any source?
Why Existing Benchmarks Fail

Each benchmark type covers only a slice of what research synthesis requires.

DeepScholar-Bench solves all three problems at once. It draws from recent ArXiv papers so the data is never stale. It evaluates a real task — writing a related work section — not a simplified proxy. And its automated metrics measure all three dimensions with strong agreement with human experts.

Concept: Research synthesis is fundamentally different from question answering — it requires retrieval, comprehension, organization, and precise citation in a single long-form output. Realization: No existing benchmark captures all three dimensions (synthesis + retrieval + verifiability) simultaneously, and static datasets become contaminated as models train on the public web. You need a live benchmark that regenerates itself from the latest research.
Why can't a static expert-curated benchmark adequately evaluate generative research synthesis systems over time?

Chapter 1: The Key Insight

Here is the clever trick. Every good research paper already contains a human-written example of research synthesis: the related work section. It was written by a domain expert who searched the literature, identified the most relevant prior work, organized it thematically, and cited each source precisely.

And ArXiv receives thousands of new papers every month, many of which are later accepted at peer-reviewed venues. Each recent one is a fresh, uncontaminated benchmark instance.

The key insight: Use recent ArXiv papers as a self-renewing source of ground truth. The paper's title and abstract become the query. The paper's related work section becomes the exemplar answer. The paper's bibliography provides the reference set. No manual curation is needed, and the risk of contamination is low as long as the papers postdate the training data of the models being evaluated. The benchmark renews itself with every new batch of papers.

The data pipeline works like this:

1. Scrape Recent ArXiv
Load papers from configured domains (cs.ML, cs.CV, cs.CL, etc.) within a recent date range. Keep only v1 papers to avoid information leakage from later revisions.
2. Quality Filter
Keep only papers marked as "accepted" or "published" at a conference. Exclude papers without an explicit Related Work section or .bib file. This ensures high-quality exemplars.
3. Extract Content
Parse LaTeX to extract the related work section, clean it, and extract all citations. Use ArXiv and OpenAlex APIs to recover full metadata (abstracts, authors, citation counts) for each reference.
4. Benchmark Instance
Each paper yields: a query (title + abstract), an exemplar answer (related work section), and a reference set (all cited papers with metadata). Average: 23 unique references per instance, 63% on ArXiv.
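The four steps above can be sketched as a filter followed by a mapping. This is a minimal illustration, not the authors' implementation: the `ArxivPaper`, `Reference`, and `BenchmarkInstance` records and their field names are hypothetical, and the scraping and LaTeX parsing are assumed to have already produced the `ArxivPaper` objects.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Reference:
    title: str
    citation_count: int
    on_arxiv: bool


@dataclass
class ArxivPaper:
    # Hypothetical record produced by the scraping and parsing steps.
    title: str
    abstract: str
    version: int
    published: date
    accepted_at_venue: bool
    related_work: Optional[str]  # None if there is no explicit Related Work section
    has_bib: bool
    references: List[Reference] = field(default_factory=list)


@dataclass
class BenchmarkInstance:
    query: str      # title + abstract
    exemplar: str   # human-written related work section
    references: List[Reference]


def passes_filters(p: ArxivPaper, start: date, end: date) -> bool:
    """Steps 1-2: recent v1 papers that were accepted and ship usable sources."""
    return (
        start <= p.published <= end
        and p.version == 1
        and p.accepted_at_venue
        and p.related_work is not None
        and p.has_bib
    )


def build_instances(papers, start, end):
    """Steps 3-4: turn each surviving paper into a benchmark instance."""
    return [
        BenchmarkInstance(f"{p.title}\n\n{p.abstract}", p.related_work, p.references)
        for p in papers
        if passes_filters(p, start, end)
    ]


# Toy data: only the first paper survives (the second is a v2 without a .bib).
papers = [
    ArxivPaper("Paper A", "Abstract A", 1, date(2025, 5, 1), True,
               "Prior work on ...", True, [Reference("CLIP", 1000, True)]),
    ArxivPaper("Paper B", "Abstract B", 2, date(2025, 5, 3), True,
               "Prior work on ...", False),
]
instances = build_instances(papers, date(2025, 4, 1), date(2025, 6, 30))
```

Keeping the filter as a single predicate makes each exclusion rule easy to audit or relax when instantiating a new dataset window.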

The first dataset instantiation, DeepScholar-June-2025, uses papers published between April and June 2025 — after the April 5th release of Llama-4, the main open-source model benchmarked. It spans 18 ArXiv domains and contains 63 queries. A later expansion, DeepScholar-Nov-2025, grows to 200 queries across 75+ subject areas including physics, quantitative biology, and economics.

Why does the pipeline keep only v1 ArXiv papers and filter for conference acceptance?

Chapter 2: Task Design

The benchmark task is deceptively simple to state: given a paper's title and abstract, generate its related work section by retrieving, synthesizing, and citing prior work from the live web.

But think about what this actually requires. The system receives a short description of a paper — say, a new approach to vision-language alignment. It must then:

  1. Understand the research context — What subfield is this? What are the key problems? What prior approaches exist?
  2. Search the live web — Find relevant papers through search engines, ArXiv, and other sources. Not from a closed corpus — from the entire internet.
  3. Assess relevance and importance — Out of hundreds of search results, identify the 20-30 papers that are truly essential. Not just topically related, but foundational and high-impact.
  4. Synthesize into a narrative — Organize the retrieved papers into themes, compare and contrast their approaches, and write a coherent multi-paragraph section.
  5. Cite precisely — Every factual claim must be supported by a specific cited source. No hallucinated citations. No unsupported claims.
Why related work? This is not an arbitrary choice. A related work section is one of the few research artifacts where the ground truth is both (a) available in structured form (the text + bibliography) and (b) genuinely requires all three capabilities: deep retrieval, knowledge synthesis, and precise citation. It is the ideal test of generative research synthesis.

Formally, the task is: given a paper description d (the title and abstract), produce a retrieved set S of sources and a written related work section W that synthesizes and cites those sources. The evaluation then compares S and W against the human-written exemplar along three dimensions.

Here is what a concrete instance looks like:

| Component | Example |
|---|---|
| Query (input) | "We propose a method for efficient multi-modal alignment using contrastive learning with hard negative mining..." |
| Exemplar answer | The human-written related work section from the paper (~500 words, ~23 citations) |
| Reference set | All papers cited in the related work section, with metadata: title, authors, abstract, citation count, ArXiv link |
| System output | The AI-generated related work section with its own retrieved sources and citations |

A critical design choice: each system is restricted to searching only through the ArXiv API, and search results published after the query paper are filtered out. This prevents information leakage — the system cannot find the query paper itself or papers that cite it.
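A minimal sketch of that leakage filter, assuming each search result is a dictionary with hypothetical `id` and `published` fields. Results dated after the query paper are dropped, as described above. The explicit check against the query paper's own ID is an added assumption to make the "cannot find the query paper itself" property visible, and the IDs are made up.

```python
from datetime import date


def filter_search_results(results, query_id, query_date):
    """Drop the query paper itself and anything published after it."""
    return [
        r for r in results
        if r["id"] != query_id and r["published"] <= query_date
    ]


results = [
    {"id": "2501.00001", "published": date(2025, 1, 3)},
    {"id": "2505.00042", "published": date(2025, 5, 2)},   # the query paper itself
    {"id": "2506.00077", "published": date(2025, 6, 20)},  # appeared later, may cite it
]
visible = filter_search_results(results, "2505.00042", date(2025, 5, 2))
```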

Why is generating a related work section a better benchmark task for research synthesis than answering factual questions?

Chapter 3: Three Evaluation Dimensions

A related work section can fail in three very different ways. It can be well-written but miss important papers. It can cite the right papers but scramble the narrative. It can sound authoritative but hallucinate citations. You need separate metrics for each failure mode.

DeepScholar-Bench evaluates seven metrics across three dimensions:

| Dimension | Metric | What it measures |
|---|---|---|
| Knowledge Synthesis | Organization & Coherency | Is the text well-structured and coherent? |
| Knowledge Synthesis | Nugget Coverage | Does it surface the essential facts? |
| Retrieval Quality | Relevance Rate | Are the retrieved sources topically relevant? |
| Retrieval Quality | Reference Coverage | Does it find the important papers a human expert would cite? |
| Retrieval Quality | Document Importance | Are the retrieved sources notable and high-impact? |
| Verifiability | Citation Precision | Does each cited source actually support the claim? |
| Verifiability | Claim Coverage | Is every claim backed by a cited source? |

Why three dimensions, not one? Consider two systems. System A retrieves every important paper but writes an incoherent mess with broken citations. System B writes beautifully organized prose with perfect citations — but misses half the key references. A single score hides these tradeoffs. You need to see where each system excels and where it fails. The geometric mean across all seven metrics provides the overall score, but the per-dimension breakdown is where the insights live.
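To make the overall score concrete, here is a small sketch of a geometric mean over seven metric values. The numbers are the OpenAI DeepResearch scores reported in the Results chapter. Returning 0 when any metric is 0 is our own convention for the sketch, not something the source specifies.

```python
import math


def geometric_mean(scores):
    """Geometric mean of metric values in (0, 1]; any zero drags the mean to 0."""
    if any(s <= 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


openai_deepresearch = {
    "organization": 0.857,
    "nugget_coverage": 0.392,
    "relevance_rate": 0.629,
    "reference_coverage": 0.187,
    "document_importance": 0.124,
    "citation_precision": 0.399,
    "claim_coverage": 0.138,
}
overall = geometric_mean(openai_deepresearch.values())
print(f"{overall:.3f}")  # 0.309
```

Unlike an arithmetic mean (about 0.389 for these values), the geometric mean is pulled down sharply by the weakest metrics, so a system cannot hide poor retrieval behind polished prose.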

Let us walk through each dimension.

Dimension 1: Knowledge Synthesis

Organization & Coherency uses an LLM-as-a-judge to perform pairwise comparison between the system output and the human exemplar. The judge sees both (in randomized order to avoid position bias) and picks a winner. The metric is the win rate across all queries.

Nugget Coverage is more granular. An information nugget is a single essential fact extracted from the human exemplar. For example: "CLIP uses contrastive learning to align image-text representations." The metric counts what fraction of the human exemplar's nuggets appear in the system output.

Dimension 2: Retrieval Quality

Relevance Rate scores each retrieved source from 0 to 2 using an LLM judge, averages over the set, and divides by the maximum score of 2. A score of 1.0 means every source is maximally relevant.

Reference Coverage checks how many of the "important" references from the human exemplar the system also found. Important references are those that cannot be substituted — foundational papers that any expert would cite.

Document Importance compares the median citation count of retrieved papers against the human exemplar's references. If the human cites papers with a median of 500 citations and your system cites papers with a median of 50, your score is 0.1.

Dimension 3: Verifiability

Citation Precision checks each citation: does the referenced source actually support a claim in the accompanying sentence? Average across all citations.

Claim Coverage checks each sentence: are all claims in this sentence supported by cited sources? Uses a sliding window of w sentences to account for citations placed in adjacent sentences.

A system retrieves all the right papers (high Reference Coverage and Relevance Rate) but scrambles the narrative and attaches citations to the wrong sentences. Which metrics would catch this?

Chapter 4: Automated Evaluation

Each of the seven metrics must be computed automatically — you cannot have humans grade every output of every system on every query. Here is exactly how each metric is measured.

Organization & Coherency: Pairwise LLM Judge

Present the LLM judge with two documents: the system output and the human exemplar. Randomize which appears first (to avoid position bias). Ask: "Which is better organized and more coherent?" The judge picks a winner or declares a tie. Run this for every query. Report the system's win rate.

Validation: 71.4% agreement with human expert annotators on pairwise comparisons. Strong disagreements (judge picks system when humans pick exemplar, or vice versa) are rare.
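The pairwise protocol can be sketched as follows. The `judge` callable is a stand-in for the LLM judge and is assumed to return `"first"`, `"second"`, or `"tie"`. Counting a tie as half a win is our assumption, since the text does not say how ties are scored.

```python
import random


def judge_pair(system_text, exemplar_text, judge, rng):
    """Return 'system', 'exemplar', or 'tie', hiding which text is which."""
    system_first = rng.random() < 0.5
    if system_first:
        first, second = system_text, exemplar_text
    else:
        first, second = exemplar_text, system_text
    verdict = judge(first, second)
    if verdict == "tie":
        return "tie"
    picked_first = verdict == "first"
    return "system" if picked_first == system_first else "exemplar"


def win_rate(pairs, judge, seed=0):
    """Fraction of queries the system wins; ties count 0.5 (assumption)."""
    rng = random.Random(seed)
    points = {"system": 1.0, "tie": 0.5, "exemplar": 0.0}
    outcomes = [judge_pair(s, e, judge, rng) for s, e in pairs]
    return sum(points[o] for o in outcomes) / len(outcomes)


def toy_judge(first, second):
    """Toy stand-in for the LLM: prefers the longer text."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


pairs = [
    ("a long system section", "short"),
    ("tiny", "a longer exemplar"),
    ("same", "same"),
]
rate = win_rate(pairs, toy_judge)
```

Because the order is shuffled before the judge sees the texts and mapped back afterwards, a judge that always favors the first position would score close to 50% regardless of quality.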

Nugget Coverage: Automated Fact Extraction

Step 1: Extract information nuggets from the human exemplar. Each nugget is one essential fact or claim. Step 2: For each nugget, check whether it appears (in any form) in the system output. Step 3: Report the fraction of nuggets covered.

Validation: 83.3% agreement with human annotators on nugget importance labeling. False positives and false negatives are both under 10%.

Concrete example of nugget evaluation:
Human exemplar nugget: "CLIP aligns image and text representations using contrastive learning on 400M image-text pairs."

System output: "Radford et al. (2021) introduced CLIP, which learns joint image-text embeddings through a contrastive objective trained on a large web-scraped dataset."

Verdict: Covered — the core fact (contrastive learning for image-text alignment) is present, even though the exact phrasing and the "400M" detail differ.
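In code, the metric is a simple fraction once a coverage judge exists. The sketch below uses a keyword check as a toy stand-in for the LLM judge. The `required_terms` mapping and the second nugget are hypothetical and exist only to make the example runnable.

```python
def nugget_coverage(nuggets, system_output, is_covered):
    """Fraction of exemplar nuggets that the judge finds in the system output."""
    if not nuggets:
        return 0.0
    return sum(bool(is_covered(n, system_output)) for n in nuggets) / len(nuggets)


def keyword_judge(required_terms):
    """Toy judge: a nugget is covered if all of its key terms appear in the output."""
    def is_covered(nugget, output):
        text = output.lower()
        return all(term in text for term in required_terms[nugget])
    return is_covered


clip_nugget = "CLIP aligns image and text representations using contrastive learning."
mining_nugget = "Hard negative mining improves contrastive training."  # hypothetical
required_terms = {
    clip_nugget: ["clip", "contrastive", "image-text"],
    mining_nugget: ["hard negative"],
}
system_output = (
    "Radford et al. (2021) introduced CLIP, which learns joint image-text "
    "embeddings through a contrastive objective trained on a large "
    "web-scraped dataset."
)
coverage = nugget_coverage(
    [clip_nugget, mining_nugget], system_output, keyword_judge(required_terms)
)
print(coverage)  # 0.5
```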

Relevance Rate: Graded Relevance Scoring

For each source s in the retrieved set S, an LLM judge assigns a relevance score Rel(s) from 0 to 2. The relevance rate averages over the set and normalizes by the max score:

$$RR(S) = \frac{1}{2\,|S|} \sum_{s \in S} \mathrm{Rel}(s)$$

A score of 1.0 means every retrieved source got the maximum relevance score of 2.
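As a quick check of the formula, here is a sketch that takes judge scores as plain integers. The scores in the example are made up.

```python
def relevance_rate(scores, max_score=2):
    """Mean relevance score over the retrieved set, normalized to [0, 1]."""
    if not scores:
        return 0.0
    return sum(scores) / (max_score * len(scores))


# Four retrieved sources judged 2, 2, 1, and 0.
rr = relevance_rate([2, 2, 1, 0])
print(rr)  # 0.625
```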

Reference Coverage: Important Source Recall

First, label each reference from the human exemplar as "important" (cannot be substituted) or "not important" (could be replaced). Then check what fraction of the important set E the system found:

$$RC(S, E) = \frac{1}{|E|} \sum_{s \in S} \mathbb{I}[s \in E]$$

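Validation: 65.9% agreement with human annotators on importance labels. The low false-negative rate (9.8%) means the judge rarely drops a reference that humans consider important. The higher false-positive rate (24.2%) means it sometimes marks substitutable references as important, which enlarges E and makes the metric conservative: it tends to understate, not inflate, a system's coverage.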
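A sketch of the recall computation, assuming sources can be matched by a normalized identifier (the identifiers below are placeholders). Deduplicating the retrieved set first keeps a source that was retrieved twice from being counted twice.

```python
def reference_coverage(retrieved_ids, important_ids):
    """Share of the important exemplar references that the system retrieved."""
    important = set(important_ids)
    if not important:
        return 0.0
    return len(set(retrieved_ids) & important) / len(important)


retrieved = ["a", "b", "b", "x", "y"]       # S, with one duplicate
important = ["a", "b", "c", "d", "e"]       # E
rc = reference_coverage(retrieved, important)
print(rc)  # 0.4
```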

Document Importance: Citation Count Comparison

Compare the median citation count of the system's retrieved sources against the human exemplar's references:

$$DI(S, S^*) = \min\!\left( \frac{\mathrm{median\text{-}cites}(S)}{\mathrm{median\text{-}cites}(S^*)},\ 1 \right)$$

If the human expert cites papers with a median of 200 citations, and your system cites papers with a median of 20, your Document Importance is 0.1. The score caps at 1.0.
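The same calculation in code, using made-up citation counts whose medians match the example above. The handling of an empty system set and of a zero exemplar median are our assumptions, since the text does not cover those edge cases.

```python
from statistics import median


def document_importance(system_cites, exemplar_cites):
    """Ratio of median citation counts, capped at 1.0."""
    if not system_cites:
        return 0.0          # assumption: nothing retrieved scores zero
    reference = median(exemplar_cites)
    if reference == 0:
        return 1.0          # assumption: nothing to fall short of
    return min(median(system_cites) / reference, 1.0)


system_cites = [5, 20, 35]          # median 20
exemplar_cites = [50, 200, 5000]    # median 200, despite one very large outlier
di = document_importance(system_cites, exemplar_cites)
print(di)  # 0.1
```

The 5000-citation outlier leaves the exemplar's median at 200, whereas it would pull the mean to 1750.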

Citation Precision & Claim Coverage: Entailment Checks

Citation Precision: For each citation in the system output, check whether the cited source actually supports a claim in the accompanying sentence. Average the precision across all citations.

Claim Coverage: For each sentence, check whether all claims are fully supported by the sources cited within a sliding window of w neighboring sentences. A claim is also considered covered if the system's query context (the paper abstract) implicitly supports it, since this context is given to the system.
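Both checks can be sketched with a stand-in entailment function. Here `supports` replaces the LLM entailment judge with a lookup table, and the window is taken to be symmetric (w sentences on each side). Both are our assumptions, and the abstract-context rule is omitted for brevity.

```python
def citation_precision(sentences, supports):
    """Share of (sentence, citation) pairs where the source supports the sentence."""
    checks = [supports(src, text) for text, cites in sentences for src in cites]
    return sum(checks) / len(checks) if checks else 0.0


def claim_coverage(sentences, supports, w=1):
    """Share of sentences supported by some source cited within +/- w sentences."""
    if not sentences:
        return 0.0
    covered = 0
    for i, (text, _) in enumerate(sentences):
        lo, hi = max(0, i - w), min(len(sentences), i + w + 1)
        window_sources = {src for _, cites in sentences[lo:hi] for src in cites}
        covered += any(supports(src, text) for src in window_sources)
    return covered / len(sentences)


# Each sentence is (text, [ids of sources cited in that sentence]).
sentences = [
    ("CLIP aligns images and text contrastively.", ["clip"]),
    ("It was trained on web-scale data.", []),
    ("Transformers dominate language modeling.", ["clip"]),
]
# Toy ground truth: which sentences each source really supports.
evidence = {"clip": {sentences[0][0], sentences[1][0]}}


def supports(src, text):
    return text in evidence.get(src, set())


precision = citation_precision(sentences, supports)       # 1 of 2 citations hold
coverage_w1 = claim_coverage(sentences, supports, w=1)    # 2 of 3 sentences
coverage_w0 = claim_coverage(sentences, supports, w=0)    # 1 of 3 sentences
```

With w=0 the uncited second sentence counts as unsupported. With w=1 it inherits the citation from its neighbor, which is the behavior the sliding window is meant to allow.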

Concept: Six of the seven metrics use an LLM-as-a-judge or automated entailment checking, and Document Importance uses citation counts, so no human is in the loop at evaluation time. Realization: The human validation study shows 65-83% agreement between LLM judges and CS PhD annotators on the validated tasks. This level of agreement lets the automated pipeline run continuously on fresh data, which is what makes the live benchmark design practical.
Document Importance compares median citation counts between system-retrieved sources and human-exemplar sources. Why use the median rather than the mean?

Chapter 5: DeepScholar-ref

Along with the benchmark, the authors release DeepScholar-ref, an open-source reference pipeline for generative research synthesis. It is built on the LOTUS framework, which provides declarative semantic operators for LLM-based data processing.

The pipeline has three stages:

Stage 1: Iterative Search
Given the query, generate web-search queries iteratively. After each round of search, summarize the results and generate new, more targeted queries. This mimics how a human researcher refines their search strategy as they learn more about the topic.
Stage 2: Semantic Filtering & Ranking
Pass all search results through LOTUS semantic operators. First, a semantic filter removes irrelevant sources using LLM judgment. Then, a semantic top-k ranks the remaining sources by relevance to the query.
Stage 3: Semantic Aggregation
A semantic aggregation operator takes the top-ranked sources and generates the final related work section, synthesizing and citing all remaining sources into a coherent narrative.
Why LOTUS? LOTUS provides declarative semantic operators — filter, top-k, aggregate — that can be composed like SQL operators but powered by LLMs. This means the pipeline is modular: you can swap in different models for each stage, change the filtering criteria, or adjust the ranking without rewriting the pipeline. The key insight is that structured data processing (filter, rank, aggregate) works better than asking a single LLM to do everything at once.
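The three stages compose as shown in this sketch. It deliberately uses plain Python callables in place of LLM calls and does not use the LOTUS API: every function name and parameter here is hypothetical, and the toy stubs exist only to show how data flows from search to filter to ranking to aggregation.

```python
def research_pipeline_sketch(query, *, search, summarize, next_queries,
                             is_relevant, relevance, write, rounds=2, k=30):
    # Stage 1: iterative search, refining queries from a running summary.
    queries, results = [query], []
    for _ in range(rounds):
        for q in queries:
            results.extend(search(q))
        notes = summarize(results)
        queries = next_queries(query, notes)

    # Deduplicate by id while preserving first-seen order.
    unique = list({r["id"]: r for r in results}.values())

    # Stage 2: semantic filter, then top-k by relevance.
    kept = [r for r in unique if is_relevant(query, r)]
    top = sorted(kept, key=lambda r: relevance(query, r), reverse=True)[:k]

    # Stage 3: aggregate the surviving sources into one cited section.
    return write(query, top), top


# Toy stubs standing in for the search API and the LLM calls.
corpus = [
    {"id": "p1", "title": "contrastive alignment", "score": 0.9},
    {"id": "p2", "title": "cooking recipes", "score": 0.1},
    {"id": "p3", "title": "hard negative mining", "score": 0.7},
]
section, sources = research_pipeline_sketch(
    "efficient multi-modal alignment",
    search=lambda q: corpus,
    summarize=lambda results: f"{len(results)} results so far",
    next_queries=lambda query, notes: [query + " survey"],
    is_relevant=lambda query, r: r["score"] > 0.5,
    relevance=lambda query, r: r["score"],
    write=lambda query, top: " ".join(f"[{r['id']}]" for r in top),
    k=2,
)
```

Because each stage is an injected callable, swapping the model behind one stage (for example, a different synthesis model in `write`) leaves the rest of the pipeline untouched, which is the modularity argument made above.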

The pipeline is tested with five model configurations:

| Configuration | Search + Filter Model | Synthesis Model |
|---|---|---|
| DeepScholar-ref (Llama-4) | Llama-4-Scout | Llama-4-Scout |
| DeepScholar-ref (GPT-4.1) | GPT-4.1 | GPT-4.1 |
| DeepScholar-ref (GPT-4.1, o3) | GPT-4.1 | o3 |
| DeepScholar-ref (GPT-4.1, Claude) | GPT-4.1 | Claude Opus 4 |
| DeepScholar-ref (GPT-4.1, Gemini) | GPT-4.1 | Gemini 2.5 Pro |

The strongest configuration (GPT-4.1 for search, o3 for synthesis) performs competitively with OpenAI DeepResearch while being 4.3x cheaper and 2.28x faster. Across configurations, DeepScholar-ref achieves up to 6.3x higher verifiability.

Why does DeepScholar-ref use separate models for search/filtering versus synthesis, rather than a single model for everything?

Chapter 6: Systems Evaluated

The benchmark evaluates four categories of systems, spanning the full spectrum from open-source research tools to the most expensive commercial offerings:

Category 1: Open-Source Research Systems

These are specialized systems built explicitly for research synthesis: OpenScholar, STORM, and DeepResearcher.

Category 2: Search Agents

These are general-purpose LLMs given search tool access. Five models tested: Llama-4-Scout, GPT-4.1, o3, Claude Opus 4, and Gemini 2.5 Pro. Each gets the same search interface (ArXiv API only) and the same prompt.

Category 3: Commercial Deep Research

OpenAI o3-deep-research is the commercial baseline that represents the current state of the art. It is the most expensive system tested and the only one with a dedicated deep research mode.

Category 4: DeepScholar-ref

The reference pipeline from this paper, tested in five model configurations as described in Chapter 5.

Fair comparison design: All systems are restricted to the same retrieval corpus (ArXiv API only) and the same information access (no results published after the query paper's date). This isolates the quality of synthesis and retrieval from the breadth of web access. Without this control, systems with broader web access would have an unfair advantage that confounds the measurement.
Why do all systems access only the ArXiv API rather than the full web during evaluation?

Chapter 7: Results

Here are the headline numbers. No system surpasses a geometric mean of 31% across all seven metrics. The benchmark is far from saturated.

The main result in one sentence: Even the best system (OpenAI DeepResearch, geometric mean 30.9%) fails to surface about 61% of the key facts experts mention (Nugget Coverage .392), misses 81% of the important references (Reference Coverage .187), and retrieves sources with only 12% of the citation impact of human-selected references (Document Importance .124).

Let us break down each dimension:

Knowledge Synthesis

| System | Organization | Nugget Coverage |
|---|---|---|
| Human exemplar | .500 | 1.000 |
| OpenAI DeepResearch | .857 | .392 |
| Search Agent (o3) | .849 | .348 |
| DeepScholar-ref (GPT-4.1, o3) | .857 | .384 |
| STORM (Llama-4) | .119 | .183 |

Systems using strong models (o3, Claude, Gemini) write well-organized prose — often better organized than the human exemplars. But they all score below 40% on Nugget Coverage. They produce polished text that misses most of the essential facts.

Retrieval Quality

| System | Relevance Rate | Ref Coverage | Doc Importance |
|---|---|---|---|
| Human exemplar | .900 | 1.000 | .850 |
| OpenAI DeepResearch | .629 | .187 | .124 |
| Search Agent (Claude) | .583 | .131 | .008 |
| DeepScholar-ref (GPT-4.1, o3) | .645 | .167 | .009 |

This is where systems fail hardest. OpenAI DeepResearch finds relevant sources (Relevance Rate .629), but misses 81% of the important references (Reference Coverage .187) and retrieves papers with only 12% of the citation impact of human-selected ones (Document Importance .124). Systems can find topically related papers, but cannot identify the foundational, high-impact works that experts cite.

Verifiability

| System | Citation Precision | Claim Coverage |
|---|---|---|
| Search Agent (Claude) | .701 | .760 |
| DeepScholar-ref (GPT-4.1, Claude) | .944 | .895 |
| OpenAI DeepResearch | .399 | .138 |
| DeepResearcher (Llama-4) | .312 | .396 |

A surprise: OpenAI DeepResearch scores worst on verifiability among strong systems, with only 39.9% Citation Precision and 13.8% Claim Coverage. DeepScholar-ref with the Claude synthesis model achieves 6.3x higher verifiability. The structured pipeline (filter then rank then synthesize) produces much more verifiable output than an end-to-end system.

OpenAI DeepResearch achieves the highest Organization score (.857) but one of the lowest Claim Coverage scores (.138). What does this tell us?

Chapter 8: Analysis

The authors conduct an ablation study that reveals exactly where the bottleneck is. They replace the search component of DeepScholar-ref with oracle retrievers that provide the system with the exact references from the human exemplar. If the problem is retrieval, oracle retrieval should saturate performance.

The Retrieval Bottleneck

With oracle retrieval, DeepScholar-ref (GPT-4.1, Claude) nearly saturates Retrieval Quality and Verifiability metrics:

| Setting | Ref Coverage | Doc Importance | Cite Precision | Claim Coverage |
|---|---|---|---|---|
| arxiv.org retrieval | .152 | .009 | .944 | .895 |
| Oracle (ArXiv only) | 1.000 | 1.000 | .955 | .899 |
| Oracle (All refs) | 1.000 | .822 | .941 | .828 |

The jump from .152 to 1.000 in Reference Coverage confirms that the system can write about the right papers when given them. The problem is finding them in the first place.

The synthesis bottleneck persists. Even with oracle retrieval, Nugget Coverage only reaches .487 (up from .307 with normal retrieval). The system still misses more than half the key facts, even when handed the perfect set of papers. This means better retrieval alone will not solve the problem — you also need better synthesis that extracts and surfaces more information from retrieved sources.

Where Systems Fail: Three Retrieval APIs

The ablation also tests three different retrieval APIs (arxiv.org, parallel.ai, and tavily.com), revealing that retrieval quality varies dramatically by API.

Tavily produces the most polished-looking output but finds the fewest important papers. The API that makes the system look best on surface quality is worst at the substance underneath.

Human Validation

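Eleven CS PhD students from four research universities provided over 300 annotations. The agreement rates validate the automated metrics: 71.4% on Organization & Coherency pairwise comparisons, 83.3% on nugget importance labeling, and 65.9% on reference importance labeling.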

Concept: The two-bottleneck finding is the key takeaway: performance is limited by both retrieval (finding the right papers) and synthesis (extracting facts from found papers). Realization: Even oracle retrieval only gets Nugget Coverage to ~50%. Solving generative research synthesis requires advances on both fronts simultaneously — neither better search nor better writing alone is sufficient.
With oracle retrieval, Reference Coverage jumps from .152 to 1.000 but Nugget Coverage only rises from .307 to .487. What does this tell us about the fundamental challenge?

Chapter 9: Connections

DeepScholar-Bench sits at the intersection of several research threads:

Deep Research Systems

The systems being benchmarked represent a new product category. OpenAI DeepResearch (2025), Perplexity Deep Research, Gemini Deep Research, and xAI Grok Deep Research all offer automated research synthesis as a product. DeepScholar-Bench is the first rigorous benchmark for comparing them. The finding that none exceeds 31% geometric mean suggests these products are still far from replacing human researchers.

Open-Source Research Tools

OpenScholar (Asai et al., 2024) specializes in scientific literature synthesis with retrieval-augmented LMs. STORM (Shao et al., 2024) generates Wikipedia-like articles using multi-perspective questioning. DeepResearcher (Zheng et al., 2025) trains research agents via RL. All are benchmarked here and all fall below DeepScholar-ref, suggesting that structured pipelines outperform end-to-end approaches for this task.

RAG and Information Retrieval

DeepScholar-Bench can be viewed as the hardest RAG evaluation — one where the corpus is the entire live web, the queries require deep domain expertise, and the output must be long-form with precise citations. Benchmarks like BEIR (Thakur et al., 2021) and FreshStack (Thakur et al., 2025) evaluate retrieval in isolation. DeepScholar-Bench evaluates the full pipeline end-to-end.

Live and Contamination-Resistant Benchmarks

BrowseComp (Wei et al., 2025) tests browsing agents with hard factual questions that require deep web navigation, but focuses on short answers, not synthesis. LiveDRBench (Java et al., 2025) also targets deep research but uses expert-curated questions. DeepScholar-Bench's automated pipeline from ArXiv avoids both the short-answer limitation and the curation bottleneck.

The LOTUS Framework

LOTUS (Patel et al., 2025) provides the semantic operators that power DeepScholar-ref. The idea of declarative, composable operators for LLM-based data processing is powerful beyond this benchmark — it suggests that structured pipelines with explicit filter/rank/aggregate stages can match or beat monolithic systems while being cheaper and more debuggable.

Agent Evaluation More Broadly

The Agent Evaluation Survey (Yehudai et al., 2025) maps the full landscape of how we measure what agents can do. DeepScholar-Bench fills a specific gap in that landscape: evaluating research synthesis agents on a real, complex, long-form task with multiple evaluation dimensions. It is closer in spirit to SWE-Bench (for coding agents) or WebArena (for web agents) than to traditional QA benchmarks.

The big picture: We are entering an era where AI systems generate long-form research documents from live web sources. DeepScholar-Bench shows we are far from solving this: the best systems miss most key facts and most important references, and many leave their claims without verifiable citations. But the structured evaluation framework and open-source baseline point the way forward: separate retrieval from synthesis, measure both independently, and iterate on each.
What is the most important open problem revealed by DeepScholar-Bench?