A live benchmark and automated evaluation for generative research synthesis. No system exceeds a 31% geometric mean across knowledge synthesis, retrieval quality, and verifiability, which marks this as one of the hardest open problems in AI-assisted research.
Imagine you have just finished a research project. You need to write the related work section of your paper. This means finding every relevant prior paper, understanding what each one contributed, organizing them into coherent themes, and citing each claim precisely. It takes a domain expert days of careful work.
Now imagine asking an AI system to do it for you. The system must search the live web, retrieve papers, read them, synthesize their contributions into a coherent narrative, and attach correct citations to every claim. This is generative research synthesis — and it is one of the hardest knowledge tasks we can give an AI.
The problem breaks into three parts that no existing benchmark handles simultaneously:
Each benchmark type covers only a slice of what research synthesis requires.
DeepScholar-Bench solves all three problems at once. It draws from recent ArXiv papers so the data is never stale. It evaluates a real task — writing a related work section — not a simplified proxy. And its automated metrics measure all three dimensions with strong agreement with human experts.
Here is the clever trick. Every good research paper already contains a human-written example of research synthesis: the related work section. It was written by a domain expert who searched the literature, identified the most relevant prior work, organized it thematically, and cited each source precisely.
And ArXiv publishes thousands of new preprints every month. Each recent paper is a potential fresh, uncontaminated benchmark instance.
The data pipeline works like this:
The first dataset instantiation, DeepScholar-June-2025, uses papers published between April and June 2025 — after the April 5th release of Llama-4, the main open-source model benchmarked. It spans 18 ArXiv domains and contains 63 queries. A later expansion, DeepScholar-Nov-2025, grows to 200 queries across 75+ subject areas including physics, quantitative biology, and economics.
Each step of the automated pipeline filters and transforms the raw ArXiv data into a benchmark instance.
The benchmark task is deceptively simple to state: given a paper's title and abstract, generate its related work section by retrieving, synthesizing, and citing prior work from the live web.
But think about what this actually requires. The system receives a short description of a paper — say, a new approach to vision-language alignment. It must then:
Formally, the task is: given a paper description d (the title and abstract), produce a retrieved set S of sources and a written related work section W that synthesizes and cites those sources. The evaluation then compares S and W against the human-written exemplar along three dimensions.
Here is what a concrete instance looks like:
| Component | Example |
|---|---|
| Query (input) | "We propose a method for efficient multi-modal alignment using contrastive learning with hard negative mining..." |
| Exemplar answer | The human-written related work section from the paper (~500 words, ~23 citations) |
| Reference set | All papers cited in the related work section, with metadata: title, authors, abstract, citation count, ArXiv link |
| System output | The AI-generated related work section with its own retrieved sources and citations |
A critical design choice: each system is restricted to searching only through the ArXiv API, and search results published after the query paper are filtered out. This prevents information leakage — the system cannot find the query paper itself or papers that cite it.
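The leakage filter described above can be sketched as a simple date cutoff. This is only an illustration, not the benchmark's code: `SearchResult`, `filter_leakage`, and the field names are hypothetical, and the sketch assumes every search result carries a publication date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical record for one paper returned by the search API."""
    arxiv_id: str
    title: str
    published: date


def filter_leakage(results, query_id, query_published):
    """Drop the query paper itself and anything published after it."""
    return [
        r for r in results
        if r.arxiv_id != query_id and r.published <= query_published
    ]


# Toy data: one older paper, the query paper, and a later follow-up.
results = [
    SearchResult("a1", "Older prior work", date(2024, 11, 2)),
    SearchResult("q0", "The query paper", date(2025, 5, 10)),
    SearchResult("b2", "A later follow-up", date(2025, 6, 1)),
]
kept = filter_leakage(results, query_id="q0", query_published=date(2025, 5, 10))
```

Only the older paper survives the filter, so the system can neither find the query paper nor any later work that cites it.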
A related work section can fail in three very different ways. It can be well-written but miss important papers. It can cite the right papers but scramble the narrative. It can sound authoritative but hallucinate citations. You need separate metrics for each failure mode.
DeepScholar-Bench evaluates seven metrics across three dimensions:
| Dimension | Metric | What it measures |
|---|---|---|
| Knowledge Synthesis | Organization & Coherency | Is the text well-structured and coherent? |
| Knowledge Synthesis | Nugget Coverage | Does it surface the essential facts? |
| Retrieval Quality | Relevance Rate | Are the retrieved sources topically relevant? |
| Retrieval Quality | Reference Coverage | Does it find the important papers a human expert would cite? |
| Retrieval Quality | Document Importance | Are the retrieved sources notable and high-impact? |
| Verifiability | Citation Precision | Does each cited source actually support the claim? |
| Verifiability | Claim Coverage | Is every claim backed by a cited source? |
Let us walk through each dimension.
Organization & Coherency uses an LLM-as-a-judge to perform pairwise comparison between the system output and the human exemplar. The judge sees both (in randomized order to avoid position bias) and picks a winner. The metric is the win rate across all queries.
Nugget Coverage is more granular. An information nugget is a single essential fact extracted from the human exemplar. For example: "CLIP uses contrastive learning to align image-text representations." The metric counts what fraction of the human exemplar's nuggets appear in the system output.
Relevance Rate scores each retrieved source from 0 to 2 using an LLM judge, averages over the set, and divides by the maximum score of 2. A score of 1.0 therefore means every source is maximally relevant.
Reference Coverage checks how many of the "important" references from the human exemplar the system also found. Important references are those that cannot be substituted — foundational papers that any expert would cite.
Document Importance compares the median citation count of retrieved papers against the human exemplar's references. If the human cites papers with a median of 500 citations and your system cites papers with a median of 50, your score is 0.1.
Citation Precision checks each citation: does the referenced source actually support a claim in the accompanying sentence? The score is the average across all citations.
Claim Coverage checks each sentence: are all claims in this sentence supported by cited sources? It uses a sliding window of w sentences to account for citations placed in adjacent sentences.
Each dimension catches different failure modes.
Each of the seven metrics must be computed automatically — you cannot have humans grade every output of every system on every query. Here is exactly how each metric is measured.
Organization & Coherency: Present the LLM judge with two documents, the system output and the human exemplar, in randomized order to avoid position bias. Ask: "Which is better organized and more coherent?" The judge picks a winner or declares a tie. Run this for every query and report the system's win rate.
Validation: 71.4% agreement with human expert annotators on pairwise comparisons. Strong disagreements (judge picks system when humans pick exemplar, or vice versa) are rare.
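The pairwise protocol can be sketched in a few lines. This is a hedged illustration rather than the benchmark's implementation: `win_rate` and `toy_judge` are hypothetical names, the judge is a stand-in for the LLM call, and counting a tie as half a win is an assumption the text does not confirm.

```python
import random


def win_rate(pairs, judge, seed=0):
    """Win rate of system outputs against human exemplars.

    `judge(first, second)` stands in for the LLM judge and must return
    "first", "second", or "tie". Ties count as half a win (an assumption).
    """
    if not pairs:
        raise ValueError("need at least one (system, exemplar) pair")
    rng = random.Random(seed)
    score = 0.0
    for system_text, exemplar_text in pairs:
        # Randomize presentation order to guard against position bias.
        if rng.random() < 0.5:
            verdict, system_label = judge(system_text, exemplar_text), "first"
        else:
            verdict, system_label = judge(exemplar_text, system_text), "second"
        if verdict == "tie":
            score += 0.5
        elif verdict == system_label:
            score += 1.0
    return score / len(pairs)


def toy_judge(first, second):
    """Toy stand-in: prefers the text with more paragraph breaks."""
    a, b = first.count("\n\n"), second.count("\n\n")
    if a == b:
        return "tie"
    return "first" if a > b else "second"


pairs = [("a\n\nb\n\nc", "a b c"), ("x", "y\n\nz"), ("p", "q")]
rate = win_rate(pairs, toy_judge)
```

In the toy data the system wins one comparison, loses one, and ties one, so the judge's verdicts do not depend on which document was shown first.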
Nugget Coverage: The computation has three steps. Step 1: Extract information nuggets from the human exemplar, where each nugget is one essential fact or claim. Step 2: For each nugget, check whether it appears (in any form) in the system output. Step 3: Report the fraction of nuggets covered.
Validation: 83.3% agreement with human annotators on nugget importance labeling. False positives and false negatives are both under 10%.
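Steps 2 and 3 reduce to a simple fraction once nuggets have been extracted. The sketch below assumes the nuggets are already available and uses a crude keyword check where the benchmark uses an LLM; `nugget_coverage`, `keyword_match`, and the nugget dictionary layout are all hypothetical.

```python
def nugget_coverage(nuggets, system_output, is_covered):
    """Fraction of exemplar nuggets that the system output conveys."""
    if not nuggets:
        raise ValueError("need at least one nugget")
    hits = sum(1 for nugget in nuggets if is_covered(nugget, system_output))
    return hits / len(nuggets)


def keyword_match(nugget, text):
    """Crude placeholder for the LLM check: every keyword must appear."""
    lowered = text.lower()
    return all(word in lowered for word in nugget["keywords"])


# Toy nuggets; the second one is invented purely for illustration.
nuggets = [
    {"fact": "CLIP uses contrastive learning to align image-text representations.",
     "keywords": ["clip", "contrastive"]},
    {"fact": "A toy nugget about hard negative mining.",
     "keywords": ["hard negative"]},
]
output = "CLIP relies on a contrastive objective to align images and text."
coverage = nugget_coverage(nuggets, output, keyword_match)
```

Passing the matcher in as a parameter keeps the scoring logic separate from the judgment of whether a nugget is present, which is the part an LLM would handle.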
Relevance Rate: For each source s in the retrieved set S, an LLM judge assigns a relevance score Rel(s) from 0 to 2. The relevance rate averages over the set and normalizes by the maximum score:

$$\mathrm{RR}(S) = \frac{1}{2\,|S|} \sum_{s \in S} \mathrm{Rel}(s)$$
A score of 1.0 means every retrieved source got the maximum relevance score of 2.
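As a small sketch of this normalization, assuming the judge's scores are already collected in a list (the function name `relevance_rate` is ours, not the benchmark's):

```python
def relevance_rate(scores, max_score=2):
    """Mean judge score over the retrieved set, normalized to [0, 1]."""
    if not scores:
        raise ValueError("need at least one retrieved source")
    if any(s < 0 or s > max_score for s in scores):
        raise ValueError("scores must lie between 0 and max_score")
    return sum(scores) / (max_score * len(scores))


perfect = relevance_rate([2, 2, 2])    # every source maximally relevant
mixed = relevance_rate([2, 1, 0, 1])   # half of the available points
```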
Reference Coverage: First, label each reference from the human exemplar as "important" (cannot be substituted) or "not important" (could be replaced). Then check what fraction of the important set E appears in the retrieved set S:

$$\mathrm{RC}(S, E) = \frac{|S \cap E|}{|E|}$$
Validation: 65.9% agreement with human annotators. The low false-negative rate (9.8%) means the metric rarely falsely penalizes systems. The higher false-positive rate (24.2%) means it is actually conservative — it under-counts truly important references.
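Once the important set is labeled, the metric is a set intersection. A minimal sketch, assuming papers are identified by comparable IDs (the function name and the IDs are hypothetical):

```python
def reference_coverage(retrieved_ids, important_ids):
    """Fraction of the important references that the system retrieved."""
    important = set(important_ids)
    if not important:
        raise ValueError("the important set must not be empty")
    return len(important & set(retrieved_ids)) / len(important)


# The system found two of the four important references, plus one extra paper.
coverage = reference_coverage(["a", "b", "x"], ["a", "b", "c", "d"])
```

Extra retrieved papers do not lower the score; only missing important references do.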
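Document Importance: Compare the median citation count of the system's retrieved sources against the human exemplar's references. Writing c(p) for the citation count of paper p and R for the exemplar's reference set:

$$\mathrm{DI}(S, R) = \min\!\left(1,\ \frac{\operatorname{median}_{s \in S}\, c(s)}{\operatorname{median}_{r \in R}\, c(r)}\right)$$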
If the human expert cites papers with a median of 200 citations, and your system cites papers with a median of 20, your Document Importance is 0.1. The score caps at 1.0.
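The worked example above can be checked with a short sketch. It assumes citation counts are plain integer lists; `document_importance` is a hypothetical name.

```python
from statistics import median


def document_importance(system_citations, exemplar_citations):
    """Ratio of median citation counts, capped at 1.0."""
    exemplar_median = median(exemplar_citations)
    if exemplar_median <= 0:
        raise ValueError("exemplar median citation count must be positive")
    return min(1.0, median(system_citations) / exemplar_median)


low = document_importance([10, 20, 30], [100, 200, 300])   # medians 20 vs 200
capped = document_importance([900, 1000], [100, 200, 300])  # ratio above 1
```

Using the median rather than the mean keeps a single very highly cited paper from dominating the score.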
Citation Precision: For each citation in the system output, check whether the cited source actually supports a claim in the accompanying sentence, then average across all citations.
Claim Coverage: For each sentence, check whether all of its claims are fully supported by the sources cited within a sliding window of w neighboring sentences. A claim also counts as covered if the query context (the paper abstract) implicitly supports it, since that context is given to the system.
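Both verifiability metrics can be sketched with the support check abstracted away. This is an illustration under stated assumptions: the window is taken to be symmetric (w sentences on each side), the abstract-context rule is not modeled, and the `supports` and `fully_supported` callables are toy stand-ins for the LLM judge. All names are hypothetical.

```python
def citation_precision(sentences, supports):
    """Share of citations whose source backs a claim in its own sentence."""
    checks = [
        supports(source, s["text"])
        for s in sentences
        for source in s["citations"]
    ]
    return sum(checks) / len(checks) if checks else 0.0


def claim_coverage(sentences, fully_supported, w=1):
    """Share of sentences backed by sources cited within w sentences."""
    if not sentences:
        return 0.0
    covered = 0
    for i, s in enumerate(sentences):
        lo, hi = max(0, i - w), min(len(sentences), i + w + 1)
        nearby = [c for t in sentences[lo:hi] for c in t["citations"]]
        if fully_supported(s["text"], nearby):
            covered += 1
    return covered / len(sentences)


# Toy data: source A supports s0 and s1; source B supports nothing.
SUPPORT = {("A", "s0"), ("A", "s1")}
sentences = [
    {"text": "s0", "citations": ["A"]},
    {"text": "s1", "citations": []},
    {"text": "s2", "citations": ["B"]},
]


def supports(source, text):
    return (source, text) in SUPPORT


def fully_supported(text, sources):
    return any((src, text) in SUPPORT for src in sources)


precision = citation_precision(sentences, supports)
coverage_w1 = claim_coverage(sentences, fully_supported, w=1)
coverage_w0 = claim_coverage(sentences, fully_supported, w=0)
```

The uncited middle sentence is covered only when the window reaches the citation in the sentence before it, which is the case the sliding window exists to handle.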
Along with the benchmark, the authors release DeepScholar-ref, an open-source reference pipeline for generative research synthesis. It is built on the LOTUS framework, which provides declarative semantic operators for LLM-based data processing.
The pipeline has three stages:
The pipeline is tested with five model configurations:
| Configuration | Search + Filter Model | Synthesis Model |
|---|---|---|
| DeepScholar-ref (Llama-4) | Llama-4-Scout | Llama-4-Scout |
| DeepScholar-ref (GPT-4.1) | GPT-4.1 | GPT-4.1 |
| DeepScholar-ref (GPT-4.1, o3) | GPT-4.1 | o3 |
| DeepScholar-ref (GPT-4.1, Claude) | GPT-4.1 | Claude Opus 4 |
| DeepScholar-ref (GPT-4.1, Gemini) | GPT-4.1 | Gemini 2.5 Pro |
The strongest configuration (GPT-4.1 for search, o3 for synthesis) performs competitively with OpenAI DeepResearch while being 4.3x cheaper and 2.28x faster. DeepScholar-ref configurations also achieve up to 6.3x higher verifiability.
The benchmark evaluates four categories of systems, spanning the full spectrum from open-source research tools to the most expensive commercial offerings:
Open-source research systems: These are specialized systems built explicitly for research synthesis, such as OpenScholar, STORM, and DeepResearcher.
Search agents: These are general-purpose LLMs given search tool access. Five models are tested: Llama-4-Scout, GPT-4.1, o3, Claude Opus 4, and Gemini 2.5 Pro. Each gets the same search interface (ArXiv API only) and the same prompt.
Commercial systems: OpenAI o3-deep-research is the commercial baseline that represents the current state of the art. It is the most expensive system tested and the only one with a dedicated deep research mode.
DeepScholar-ref: The reference pipeline from this paper, tested in five model configurations as described in Chapter 5.
Here are the headline numbers. No system surpasses a geometric mean of 31% across all seven metrics. The benchmark is far from saturated.
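To see why the aggregate stays low, it helps to compute a geometric mean by hand. The sketch below assumes the headline number is a plain unweighted geometric mean of the seven metric values, which the text implies but does not spell out. The scores are the OpenAI DeepResearch values reported in the result tables of this article; the function and variable names are ours.

```python
import math


def geometric_mean(values):
    """Unweighted geometric mean; any zero drives the result to zero."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


openai_deepresearch = {
    "organization": 0.857,
    "nugget_coverage": 0.392,
    "relevance_rate": 0.629,
    "reference_coverage": 0.187,
    "document_importance": 0.124,
    "citation_precision": 0.399,
    "claim_coverage": 0.138,
}
score = geometric_mean(list(openai_deepresearch.values()))
```

Under that assumption the score lands just under 0.31. Unlike an arithmetic mean, the geometric mean punishes weak metrics heavily, so a strong Organization score cannot offset low Reference Coverage or Claim Coverage.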
Let us break down each dimension:
| System | Organization | Nugget Coverage |
|---|---|---|
| Human exemplar | .500 | 1.000 |
| OpenAI DeepResearch | .857 | .392 |
| Search Agent (o3) | .849 | .348 |
| DeepScholar-ref (GPT-4.1, o3) | .857 | .384 |
| STORM (Llama-4) | .119 | .183 |
Systems using strong models (o3, Claude, Gemini) write well-organized prose — often better organized than the human exemplars. But they all score below 40% on Nugget Coverage. They produce polished text that misses most of the essential facts.
| System | Relevance Rate | Ref Coverage | Doc Importance |
|---|---|---|---|
| Human exemplar | .900 | 1.000 | .850 |
| OpenAI DeepResearch | .629 | .187 | .124 |
| Search Agent (Claude) | .583 | .131 | .008 |
| DeepScholar-ref (GPT-4.1, o3) | .645 | .167 | .009 |
This is where systems fail hardest. OpenAI DeepResearch finds relevant sources (Relevance Rate .629), but misses 81% of the important references (Reference Coverage .187) and retrieves papers with only 12% of the citation impact of human-selected ones (Document Importance .124). Systems can find topically related papers, but cannot identify the foundational, high-impact works that experts cite.
| System | Citation Precision | Claim Coverage |
|---|---|---|
| Search Agent (Claude) | .701 | .760 |
| DeepScholar-ref (GPT-4.1, Claude) | .944 | .895 |
| OpenAI DeepResearch | .399 | .138 |
| DeepResearcher (Llama-4) | .312 | .396 |
A surprise: OpenAI DeepResearch scores worst on verifiability among strong systems, with only 39.9% Citation Precision and 13.8% Claim Coverage. DeepScholar-ref with the Claude synthesis model achieves 6.3x higher verifiability. The structured pipeline (filter then rank then synthesize) produces much more verifiable output than an end-to-end system.
The authors conduct an ablation study that reveals exactly where the bottleneck is. They replace the search component of DeepScholar-ref with oracle retrievers that provide the system with the exact references from the human exemplar. If the problem is retrieval, oracle retrieval should saturate performance.
With oracle retrieval, DeepScholar-ref (GPT-4.1, Claude) nearly saturates Retrieval Quality and Verifiability metrics:
| Setting | Ref Coverage | Doc Importance | Cite Precision | Claim Coverage |
|---|---|---|---|---|
| arxiv.org retrieval | .152 | .009 | .944 | .895 |
| Oracle (ArXiv only) | 1.000 | 1.000 | .955 | .899 |
| Oracle (All refs) | 1.000 | .822 | .941 | .828 |
The jump from .152 to 1.000 in Reference Coverage confirms that the system can write about the right papers when given them. The problem is finding them in the first place.
The ablation also tests three different retrieval APIs — arxiv.org, parallel.ai, and tavily.com — revealing that retrieval quality varies dramatically by API:
Tavily produces the most polished-looking output but finds the fewest important papers. The API that makes the system look best on surface quality is worst at the substance underneath.
Eleven CS PhD students from four research universities provided over 300 annotations. The agreement rates validate the automated metrics:
DeepScholar-Bench sits at the intersection of several research threads:
The systems being benchmarked represent a new product category. OpenAI DeepResearch (2025), Perplexity Deep Research, Gemini Deep Research, and xAI Grok Deep Research all offer automated research synthesis as a product. DeepScholar-Bench is the first rigorous benchmark for comparing them. The finding that none exceeds 31% geometric mean suggests these products are still far from replacing human researchers.
OpenScholar (Asai et al., 2024) specializes in scientific literature synthesis with retrieval-augmented LMs. STORM (Shao et al., 2024) generates Wikipedia-like articles using multi-perspective questioning. DeepResearcher (Zheng et al., 2025) trains research agents via RL. All are benchmarked here and all fall below DeepScholar-ref, suggesting that structured pipelines outperform end-to-end approaches for this task.
DeepScholar-Bench can be viewed as the hardest RAG evaluation — one where the corpus is the entire live web, the queries require deep domain expertise, and the output must be long-form with precise citations. Benchmarks like BEIR (Thakur et al., 2021) and FreshStack (Thakur et al., 2025) evaluate retrieval in isolation. DeepScholar-Bench evaluates the full pipeline end-to-end.
BrowseComp (Wei et al., 2025) tests browsing agents with hard factual questions that require deep web navigation, but focuses on short answers, not synthesis. LiveDRBench (Java et al., 2025) also targets deep research but uses expert-curated questions. DeepScholar-Bench's automated pipeline from ArXiv avoids both the short-answer limitation and the curation bottleneck.
LOTUS (Patel et al., 2025) provides the semantic operators that power DeepScholar-ref. The idea of declarative, composable operators for LLM-based data processing is powerful beyond this benchmark — it suggests that structured pipelines with explicit filter/rank/aggregate stages can match or beat monolithic systems while being cheaper and more debuggable.
The Agent Evaluation Survey (Yehudai et al., 2025) maps the full landscape of how we measure what agents can do. DeepScholar-Bench fills a specific gap in that landscape: evaluating research synthesis agents on a real, complex, long-form task with multiple evaluation dimensions. It is closer in spirit to SWE-Bench (for coding agents) or WebArena (for web agents) than to traditional QA benchmarks.