Miao, Davis, Zhang, Pritchard, Zou — Stanford University, 2025

Paper2Agent: Papers as AI Agents

An automated framework that converts research papers into interactive AI agents via MCP servers — turning static publications into tool-using, test-validated, natural-language-queryable systems.

Prerequisites: LLM basics + Tool-use / function calling + Software testing intuition

Chapter 0: The Problem

A new paper drops on arXiv. It describes a powerful genomics method, links to a GitHub repo with 47 Python files, three Jupyter notebooks, and a requirements.txt that pins 23 dependencies. You want to use this method on your data.

So you clone the repo. You create a conda environment. You spend two hours resolving dependency conflicts. You trace through the notebooks, trying to figure out which functions to call, in what order, with what parameters. You hit a cryptic error — the tutorial assumed a specific GPU, or a dataset format slightly different from yours. You open an issue on GitHub and wait.

This experience is so common it has become the default mode of scientific software adoption. A 2022 study found that only 25% of published research code could be executed without errors. The remaining 75% had dependency issues, missing data, undocumented assumptions, or outright bugs.

The fundamental gap: A research paper is a passive artifact. It describes what the authors did, but it cannot do anything. The reader must manually parse the paper, understand the method, navigate the code, configure the environment, and adapt the pipeline to their data. Every step is a potential failure point. The knowledge is trapped in static text.

What if the paper could answer your questions? Not a chatbot summarizing the abstract — an actual agent that understands the method, has access to the working code, can execute the pipeline on your data, and returns validated results?

That is what Paper2Agent builds. It takes a paper and its codebase and automatically constructs an interactive AI agent that embodies the research contribution — an agent you can talk to in natural language, that calls the right functions in the right order, and that has been tested against the paper's own results before you ever interact with it.

The shift: From "read paper, clone repo, debug for hours" to "ask a question, get a validated answer." The paper becomes a knowledgeable entity capable of execution and dialogue, not just a document encoding knowledge.

Why is a research paper fundamentally limited as a vehicle for method adoption?

Chapter 1: The Key Insight

The core idea is deceptively simple: convert a research paper into a Model Context Protocol (MCP) server, then connect that server to a chat agent. The MCP server exposes the paper's methods as callable tools, its data as queryable resources, and its workflows as executable prompts.

What is MCP?

The Model Context Protocol is an open standard (created by Anthropic in late 2024) that defines how an LLM-based agent connects to external tools. Think of it as a USB port for AI — any tool that speaks MCP can be plugged into any agent that speaks MCP. No custom integration code needed.

An MCP server exposes three types of capabilities:

| Component | What it is | Example |
| --- | --- | --- |
| Tools | Callable functions with typed inputs and outputs | score_variant(chr, pos, ref, alt, modality, tissue) |
| Resources | Static assets: data, docs, figures | Training data links, manuscript text, supplementary tables |
| Prompts | Workflow templates that orchestrate tools in sequence | "Preprocess → normalize → cluster → annotate" pipeline |

Why MCP matters here: Before MCP, connecting a method to an AI agent meant writing custom glue code for each method-agent pair. If you had 100 papers and 5 agents, you needed 500 integrations. MCP makes this M + N instead of M × N: build one MCP server per paper, and any MCP-compatible agent can use it. Paper2Agent automates the "build one MCP server" step.
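
The counting argument above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative and not part of MCP or Paper2Agent.

```python
def pairwise_integrations(num_papers: int, num_agents: int) -> int:
    """Custom glue code: one integration per (paper, agent) pair."""
    return num_papers * num_agents


def protocol_integrations(num_papers: int, num_agents: int) -> int:
    """Shared protocol: one server per paper plus one client per agent."""
    return num_papers + num_agents


print(pairwise_integrations(100, 5))  # 500
print(protocol_integrations(100, 5))  # 105
```

Each additional paper adds five integrations under the pairwise scheme but only one under the shared protocol.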

The Paper2Agent Pipeline

The framework has four stages, each handled by a specialized sub-agent:

Stage 1: Analysis
Multi-agent analysis of the paper + codebase. Environment agent configures dependencies. Tutorial scanner identifies reusable code.
Stage 2: Construction
Extract tutorials into single-purpose tools. Build typed schemas, wrap functions, parameterize hardcoded values.
Stage 3: Testing
Auto-generate tests from tutorial examples. Run tests, diagnose failures, fix code. Repeat until all pass or failing tools are removed.
Stage 4: Deployment
Package as MCP server. Deploy to Hugging Face Spaces. Connect to chat agent (e.g., Claude Code).

The entire pipeline is orchestrated by Claude Code acting as a meta-agent — it coordinates the four sub-agents, passes context between stages, and handles failures. On a personal laptop, Paper2Agent converted the AlphaGenome paper into 22 working MCP tools in about 3 hours, and the Scanpy preprocessing pipeline into 7 tools in about 45 minutes — all without human intervention.

The composability payoff: Because MCP servers are modular, you can connect multiple paper MCPs to the same agent. This enables cross-paper reasoning: "Use the genomics method from Paper A to analyze the GWAS data from Paper B." We will see exactly this in Chapter 8.
What makes MCP the right protocol for Paper2Agent?

Chapter 2: Paper Analysis Stage

Before you can turn a paper into an agent, you need to understand what the paper does. This is the job of Stage 1: a multi-agent system that reads the paper, clones the repo, and produces a structured understanding of the codebase.

The Orchestrator and Four Sub-Agents

Paper2Agent is itself implemented as a multi-agent system inside Claude Code. An orchestrator agent coordinates four specialized sub-agents, each with a focused mandate:

| Sub-Agent | Responsibility | Output |
| --- | --- | --- |
| Environment Manager | Create clean, reproducible environment. Install dependencies. Resolve conflicts. | Working conda/pip environment |
| Tutorial Scanner | Scan repo for notebooks, tutorials, examples. Distinguish useful code from boilerplate. | Ranked list of tutorial candidates |
| Tool Extractor | Convert tutorials into single-purpose functions. Parameterize hardcoded values. Add type annotations. | Library of reusable tool functions |
| Test Verifier | Generate tests from tutorial examples. Run, diagnose, fix. Loop until stable. | Validated, tested tool implementations |

What "Analysis" Actually Does

The Environment Manager does not just run pip install -r requirements.txt. It provisions an isolated workspace, analyzes which Python version the code expects, resolves version conflicts between pinned and transitive dependencies, and verifies that all imports succeed. This is the step that eliminates the two-hour dependency debugging session from Chapter 0.
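
The last check, that every required import resolves, can be approximated with the standard library. This is a minimal sketch, assuming a hand-written list of top-level module names. `missing_modules` is an illustrative helper, not part of Paper2Agent, and locating a module is a cheaper proxy than actually importing it.

```python
import importlib.util


def missing_modules(module_names: list[str]) -> list[str]:
    """Return the top-level modules that cannot be located in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# The last name is deliberately bogus to show what a failure looks like.
required = ["json", "math", "surely_not_installed_pkg_xyz"]
print(missing_modules(required))
```

An empty list means every module was found. Anything else is a dependency to resolve before the tutorials are run.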

The Tutorial Scanner solves a subtler problem: most repos contain many files, but only a few are genuine tutorials that demonstrate the method end-to-end. The scanner distinguishes tutorial notebooks from test files, utility scripts, and experimental code. It produces clear summaries of which resources demonstrate which capabilities.

Why multi-agent? Each sub-agent operates in an isolated context with a focused system prompt. The Environment Manager never sees tutorial code. The Tutorial Scanner never touches dependency resolution. This isolation prevents cross-contamination of concerns and allows each agent to use its full context window for its specific task. The orchestrator passes only summaries between agents, not raw outputs.
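
The summaries-only handoff can be sketched as follows. This is a toy illustration, not Paper2Agent's implementation: the sub-agents are plain functions, and the `StageResult` type and the strings they return are invented placeholders.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    raw_output: str  # stays inside the sub-agent's own context
    summary: str     # the only thing the orchestrator forwards


def environment_manager(context: str) -> StageResult:
    raw = "<thousands of lines of installer logs>"
    return StageResult(raw, "environment ready, all imports verified")


def tutorial_scanner(context: str) -> StageResult:
    raw = "<full listing of every file in the repo>"
    return StageResult(raw, "2 tutorial notebooks identified")


def orchestrate(agents) -> list[str]:
    """Run sub-agents in order; each receives only the summaries so far."""
    summaries: list[str] = []
    for agent in agents:
        result = agent("\n".join(summaries))
        summaries.append(result.summary)  # raw_output is never passed on
    return summaries


handoff = orchestrate([environment_manager, tutorial_scanner])
print(handoff)
```

Because `raw_output` never leaves the function that produced it, later agents spend none of their context window on earlier stages' logs.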

The Six Steps in Detail

  1. Locate and download the official repository linked to the paper.
  2. Environment setup: provision workspace, pin dependencies, verify imports.
  3. Tutorial discovery: scan for notebooks, README examples, tutorial directories.
  4. Tutorial execution and audit: run selected tutorials end-to-end with their example data, capture inputs/outputs/figures, record implicit assumptions.
  5. Tool extraction: convert tutorial logic into parameterized, typed functions.
  6. MCP assembly: integrate tools, resources, and prompts into a deployable server.
What makes this different from "just ask an LLM to read the paper": General LLMs hallucinate code. Paper2Agent never generates novel algorithms — it extracts and wraps existing code from the paper's own repository. Every tool traces back to a specific source file. This is what prevents "code hallucination" and ensures the agent's outputs are grounded in the authors' validated implementation.
Why does Paper2Agent use isolated sub-agents rather than a single agent for the analysis stage?

Chapter 3: MCP Construction

The Tool Extractor sub-agent takes the tutorials that were executed and audited during analysis and converts each one into a clean, reusable tool. This is the most technically demanding transformation in the pipeline: turning ad-hoc notebook code into production-grade, schema-typed functions.

From Tutorial to Tool

Consider a Jupyter notebook cell that scores a genomic variant:

# Original tutorial code (AlphaGenome notebook)
from alphagenome.models import dna_client
import os
model = dna_client.create(os.getenv('ALPHAGENOME_API_KEY'))
result = model.score_variant(
    chrom='chr3',        # hardcoded!
    pos=58394738,         # hardcoded!
    ref='A', alt='T',    # hardcoded!
    modality='atac',     # hardcoded!
    tissue='CL:0000100'  # hardcoded!
)
print(result['quantile_score'])

The Tool Extractor transforms this into a parameterized function with typed inputs, default values, and a standardized return format:

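# Generated MCP tool
# Assumed setup (not shown in the original listing): the server object
# and the cached model helper that the tool below relies on.
import os
from functools import lru_cache

from alphagenome.models import dna_client
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("alphagenome")  # server name is an assumption


@lru_cache(maxsize=1)
def _get_model():
    # Create the client once and reuse it across tool calls.
    return dna_client.create(os.getenv('ALPHAGENOME_API_KEY'))


@mcp.tool()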
def score_variant_effect(
    chrom: str,
    pos: int,
    ref: str,
    alt: str,
    modality: str = "atac",
    tissue: str = "CL:0000100",
) -> dict:
    """Score functional effect of a genetic variant.

    Args:
        chrom: Chromosome (e.g., 'chr3')
        pos: Genomic position
        ref: Reference allele
        alt: Alternate allele
        modality: Assay type ('atac', 'rnaseq', 'chipseq')
        tissue: Tissue/cell ontology ID
    Returns:
        Dict with quantile_score, effect_size, source_file
    """
    model = _get_model()  # cached singleton
    result = model.score_variant(
        chrom=chrom, pos=pos, ref=ref, alt=alt,
        modality=modality, tissue=tissue
    )
    return {
        "quantile_score": result["quantile_score"],
        "effect_size": result["effect_size"],
        "source_file": "alphagenome/scoring.py:L42"
    }
The key transformations: (1) Parameterize every hardcoded value. (2) Add type annotations so the LLM knows what to pass. (3) Enforce file-based inputs — no inline data blobs. (4) Save artifacts (figures, tables) to disk. (5) Embed source traceability — every tool links back to the exact line in the original repo. This last point is critical: it means the user can always verify what the tool is actually running.
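
Transformation (2) matters because a typed signature can be turned into a parameter listing mechanically. The sketch below uses the standard `inspect` module to show the idea. `describe_tool` and its output format are illustrative assumptions, not the schema an MCP library actually emits.

```python
import inspect


def describe_tool(fn) -> dict:
    """Build a simple parameter listing from a function's signature."""
    params = {}
    for name, p in inspect.signature(fn).parameters.items():
        params[name] = {
            "type": getattr(p.annotation, "__name__", str(p.annotation)),
            "required": p.default is inspect.Parameter.empty,
        }
    return {"name": fn.__name__, "parameters": params}


def score_variant_effect(chrom: str, pos: int, ref: str, alt: str,
                         modality: str = "atac") -> dict:
    """Placeholder body; only the signature matters here."""
    return {}


schema = describe_tool(score_variant_effect)
print(schema["parameters"]["pos"])
print(schema["parameters"]["modality"])
```

Parameters without defaults come out as required, which is what tells the agent which values it must extract from the user's request.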

Building the Three MCP Components

Tools are the extracted functions. But an MCP server is more than functions. It also contains:

Resources: The Tool Extractor identifies static assets — the manuscript text, supplementary tables, training data links, figure files — and registers them as queryable MCP resources. The AlphaGenome MCP, for instance, includes links to the training data used to train the model, accessible via a standardized resource query.

Prompts: For complex multi-step workflows, the system generates MCP prompts — templates that orchestrate tools in the correct order. A Scanpy MCP prompt might encode: "Run quality_control → normalize_data → select_features → reduce_dims → build_graph → cluster → annotate." These prompts are inferred directly from the paper's tutorials, not manually written.
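
How the three components sit side by side can be shown with a toy stand-in written in plain Python. This is a sketch under stated assumptions: `PaperServer` is an invented class that only mimics the shape of an MCP server, and the tool bodies, resource key, and prompt name are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PaperServer:
    """Toy stand-in for an MCP server: three registries, nothing more."""
    name: str
    tools: dict[str, Callable] = field(default_factory=dict)
    resources: dict[str, str] = field(default_factory=dict)
    prompts: dict[str, list[str]] = field(default_factory=dict)

    def tool(self, fn: Callable) -> Callable:
        """Decorator that registers a function under its own name."""
        self.tools[fn.__name__] = fn
        return fn


server = PaperServer("scanpy-demo")


@server.tool
def normalize_data(path: str) -> dict:
    return {"step": "normalize_data", "input": path}


@server.tool
def cluster(path: str) -> dict:
    return {"step": "cluster", "input": path}


# Resources are static assets addressed by a key (placeholder content).
server.resources["paper://manuscript"] = "Full manuscript text goes here."

# A prompt is an ordered workflow over registered tool names.
server.prompts["standard_pipeline"] = ["normalize_data", "cluster"]


def run_prompt(srv: PaperServer, prompt: str, path: str) -> list[dict]:
    """Execute each tool named by the prompt, in order."""
    return [srv.tools[name](path) for name in srv.prompts[prompt]]


steps = run_prompt(server, "standard_pipeline", "data.h5ad")
print([s["step"] for s in steps])
```

Tools do the work, resources hold what can be looked up, and prompts fix the order in which tools run.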


Why does each MCP tool embed a traceable link to the original source code?

Chapter 4: Test-Driven Refinement

This is the stage that separates Paper2Agent from "just ask an LLM to wrap the code." The Test Verifier sub-agent does not trust the generated tools — it validates them against the paper's own results through an iterative test-fix loop.

The Test-Fix Loop

The process works like this:

  1. Generate tests from the tutorial's own examples. If the tutorial shows that score_variant('chr3', 58394738, 'A', 'T', 'atac', 'CL:0000100') returns a quantile score of -0.0203, that becomes an assertion.
  2. Run all tests. Capture stdout, stderr, return values, and any generated figures.
  3. Diagnose failures. The test agent reads the error messages and identifies root causes — missing imports, incorrect parameter mapping, environment issues, numerical precision mismatches.
  4. Fix the code. Apply targeted edits to the tool implementation or the test itself (if the assertion tolerance was too tight).
  5. Repeat until all tests pass or the agent gives up on a tool.
The safety net: If a tool keeps failing after multiple fix attempts, its @mcp.tool() decorator is removed, so it never appears in the MCP server. This is a crucial design choice: Paper2Agent would rather ship fewer tools that work than more tools that might return wrong results. A tool that passes all tests against the paper's own results is one you can trust. A tool that does not pass is silently excluded.
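
The loop and its safety net can be sketched in a few lines. This is a simplified model, not Paper2Agent's code: here a "tool" is a zero-argument function, the `fix` callback stands in for the agent's diagnose-and-edit step, and the attempt limit and tolerance are assumed values. The expected score is the tutorial value quoted in step 1.

```python
import math
from typing import Callable, Optional

Tool = Callable[[], float]


def run_test_fix_loop(
    tool: Tool,
    expected: float,
    fix: Callable[[Tool], Tool],
    max_attempts: int = 3,
    rel_tol: float = 1e-6,
) -> Optional[Tool]:
    """Return a passing tool, or None if it should be dropped from the server."""
    for _ in range(max_attempts):
        try:
            if math.isclose(tool(), expected, rel_tol=rel_tol):
                return tool
        except Exception:
            pass  # a crash is treated like a failed assertion
        tool = fix(tool)
    return None


def buggy_tool() -> float:
    return 0.0203  # sign error relative to the tutorial value


def flip_sign(tool: Tool) -> Tool:
    return lambda: -tool()


def no_op_fix(tool: Tool) -> Tool:
    return tool


kept = run_test_fix_loop(buggy_tool, expected=-0.0203, fix=flip_sign)
dropped = run_test_fix_loop(buggy_tool, expected=-0.0203, fix=no_op_fix)
print(kept is not None, dropped is None)
```

Returning `None` corresponds to removing the decorator: the caller simply does not register that tool.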

What Gets Tested

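Tests do more than confirm that a function runs without errors. They assert that each tool reproduces the outputs captured when the tutorial was executed, such as the quantile score in the example above.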

For AlphaGenome, this process validated 22 tools across 15 tutorial-based queries and 15 novel queries, achieving 100% accuracy on both sets. Every single tool produces outputs that exactly match the ground truth from the paper's own code.


Locked after validation: Once a tool passes all tests, its implementation is frozen. This means the agent will always run the exact same code path — no runtime code generation, no LLM improvisation. The LLM decides which tool to call and with what arguments, but the tool's internal logic is deterministic and locked. This design minimizes randomness in code generation and strengthens reproducibility.
What happens when a tool repeatedly fails the test-fix loop?

Chapter 5: Chat Agent Integration

The MCP server is built and tested. Now it needs a face — a conversational agent that users actually interact with. This is the final piece: connecting the MCP server to a chat agent like Claude Code.

How the Connection Works

MCP servers can be hosted remotely — Paper2Agent deploys them to Hugging Face Spaces, eliminating local dependency issues entirely. The user's chat agent connects to the MCP server over the network. From the agent's perspective, the MCP tools appear as native functions it can call, just like file reads or shell commands.

When a user types "Score variant chr19:8134523:G>A using ATAC-seq predictions for lung tissue," the agent:

  1. Parses the intent: identifies the variant, modality, and tissue from natural language.
  2. Selects the tool: matches the query to score_variant_effect() from the MCP schema.
  3. Constructs the call: fills in typed parameters — chrom="chr19", pos=8134523, etc.
  4. Executes: the MCP server runs the locked, tested code on its own infrastructure.
  5. Returns results: the agent receives the structured output and presents it in natural language.
The agent is NOT generating code: This is the critical distinction from "give an LLM access to a repo." The agent does not write Python. It calls pre-built, pre-tested functions through a typed interface. The LLM's job is limited to intent parsing and parameter extraction — tasks where it excels. The scientific computation is handled by the locked, validated tool. This separation of concerns is what makes Paper2Agent reliable.
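
Steps 2 through 4 amount to validating a structured call and dispatching it to a fixed function. The sketch below illustrates that division of labor. The call format, the `dispatch` helper, and the stand-in tool body are all assumptions made for illustration, not the MCP wire protocol.

```python
from typing import Any, Callable


def score_variant_effect(chrom: str, pos: int, ref: str, alt: str,
                         modality: str = "atac") -> dict:
    """Stand-in for a locked, validated tool (returns no real scores)."""
    return {"variant": f"{chrom}:{pos}:{ref}>{alt}", "modality": modality}


TOOLS: dict[str, Callable[..., dict]] = {
    "score_variant_effect": score_variant_effect,
}


def dispatch(call: dict[str, Any]) -> dict:
    """Execute a structured tool call; reject unknown tools and bad argument types."""
    name, args = call["tool"], call["arguments"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    fn = TOOLS[name]
    for arg, value in args.items():
        expected = fn.__annotations__.get(arg)
        if expected is None:
            raise TypeError(f"unexpected argument: {arg}")
        if not isinstance(value, expected):
            raise TypeError(f"{arg} must be {expected.__name__}")
    return fn(**args)


# What the LLM produces for the example query: a name plus typed arguments.
call = {
    "tool": "score_variant_effect",
    "arguments": {"chrom": "chr19", "pos": 8134523, "ref": "G", "alt": "A"},
}
result = dispatch(call)
print(result)
```

The model's only output is the `call` dictionary. Everything after that point is ordinary, deterministic Python.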

Multi-Step Workflows

For complex queries, the agent chains multiple tool calls. When asked to "interpret why a variant associates with LDL cholesterol," the AlphaGenome agent constructs a multi-step plan:

  1. Score the variant across multiple modalities (expression, chromatin, splicing).
  2. Filter results for the relevant tissue (liver, for LDL).
  3. Generate modality-specific visualizations.
  4. Compile a report with figures and interpretation.

The agent iteratively plans, acts, observes results, and refines its approach — the classic ReAct pattern. But unlike a general agent, every action is a validated MCP tool call, not ad-hoc code generation.

Multi-Paper Agents

Because MCP servers are modular, you can connect multiple MCPs to the same agent. A researcher could have the AlphaGenome MCP, the TISSUE MCP, and the Scanpy MCP all active simultaneously. The agent seamlessly routes queries to the right server based on the task. This enables cross-paper reasoning that would be extremely difficult to set up manually.
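
One simple way to picture the routing is a lookup table from tool name to owning server. This is a sketch only: in practice the agent chooses tools from their schemas and descriptions, and `predict_spatial_expression` is a hypothetical TISSUE tool name invented for the example.

```python
def build_router(servers: dict[str, list[str]]) -> dict[str, str]:
    """Map each tool name to the server that owns it; reject name clashes."""
    routes: dict[str, str] = {}
    for server_name, tool_names in servers.items():
        for tool_name in tool_names:
            if tool_name in routes:
                raise ValueError(
                    f"{tool_name} is exposed by both {routes[tool_name]} and {server_name}"
                )
            routes[tool_name] = server_name
    return routes


SERVERS = {
    "alphagenome": ["score_variant_effect", "visualize_variant_effects"],
    "scanpy": ["quality_control", "normalize_data", "cluster"],
    "tissue": ["predict_spatial_expression"],  # hypothetical tool name
}

routes = build_router(SERVERS)
print(routes["cluster"], routes["score_variant_effect"])
```

Rejecting duplicate names up front keeps a call from being sent to the wrong paper's implementation.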

Performance baseline: On the AlphaGenome benchmark, the Paper2Agent-generated agent achieved 100% accuracy on both tutorial-based and novel queries. Claude Code with access to the raw repo scored 60-80%. Biomni scored 40-60%. The Paper2Agent agent was also 1.8-4.6x faster in median runtime. Pre-built tools eliminate the overhead of reading source code, understanding APIs, and generating bespoke scripts at query time.
Why is the Paper2Agent approach more reliable than giving an LLM direct access to a repository?

Chapter 6: Case Study — AlphaGenome

AlphaGenome is an AI model from Google DeepMind that predicts how single-nucleotide mutations in human DNA affect gene regulation — expression, chromatin accessibility, splicing, transcription factor binding. It is powerful but complex: the codebase involves custom data loaders, GPU-accelerated inference, tissue ontology lookups, and multi-modal visualization pipelines.

What Paper2Agent Built

Paper2Agent generated 22 MCP tools in roughly 3 hours on a personal laptop.

Each tool exposes flexible, well-annotated parameters. The visualize_variant_effects() tool, for example, lets users toggle organism (human or mouse), sequence context length, and modality (RNA-seq, ATAC-seq, ChIP-seq histone tracks) — all options discoverable through the tool's typed schema.

Benchmark Results

| Agent System | Tutorial Queries (15) | Novel Queries (15) | Median Runtime vs. Paper2Agent |
| --- | --- | --- | --- |
| AlphaGenome Agent (Paper2Agent) | 100% (15/15) | 100% (15/15) | 1.0x (baseline) |
| Claude + Raw Repo | 60% (9/15) | 80% (12/15) | 1.8–3.2x slower |
| Biomni | 40% (6/15) | 60% (9/15) | 3.1–4.6x slower |

Why the gap is so large: When Claude + Repo encounters a query, it must read the source code, understand the API, write a Python script, execute it, and handle errors — all from scratch each time. The Paper2Agent agent just calls score_variant_effect() with the right arguments. The pre-built tools eliminate an entire class of failure modes: wrong imports, incorrect parameter names, missing environment variables, GPU configuration issues.

The SORT1 Reinterpretation

When asked to interpret a GWAS locus associated with LDL cholesterol, the agent prioritized SORT1 as the most likely causal gene — whereas the original paper emphasized CELSR2 and PSRC1. The agent's reasoning: SORT1 had a quantile score of 0.99982 for expression impact in liver, and SORT1 encodes sortilin, a protein directly involved in LDL/VLDL secretion. Independent validation in GTEx eQTL data confirmed the association (p = 1.1e-65).

This was not a bug — it was a genuine scientific reinterpretation, enabled by the agent's ability to run comprehensive multi-modal analysis with a single prompt. The original authors may have emphasized different genes for valid reasons (both CELSR2 and PSRC1 also had high scores), but the agent provided an independent, model-based perspective that users can evaluate.

The broader point: With Paper2Agent, published conclusions become re-evaluable. A single prompt can trigger an analysis that took the original authors days. This shifts the balance of scientific effort from execution to interpretation.
Why did the AlphaGenome agent achieve 100% accuracy while Claude + Repo scored only 60-80%?

Chapter 7: Case Study — Single-Cell Agents

The AlphaGenome case shows Paper2Agent on a complex deep-learning model. The single-cell case studies — TISSUE and Scanpy — show it on a different challenge: multi-step analysis pipelines where the correct sequence of operations matters as much as the individual tools.

TISSUE: Uncertainty-Aware Spatial Transcriptomics

TISSUE is a method for predicting spatial gene expression with calibrated uncertainty estimates. Paper2Agent generated 6 tools covering spatial prediction, prediction interval construction, and uncertainty-aware downstream analysis (hypothesis testing, dimensionality reduction).

The TISSUE agent serves two roles:

MCP Resources as data catalogs: Paper2Agent translated the TISSUE paper's data availability section into a structured registry of spatial transcriptomics datasets, with standardized metadata (species, tissue type, modality, data URL). Users can query by species, download data through the Zenodo REST API, and pipe it directly into the analysis tools — all from natural language.
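
A registry of this kind is essentially a list of records plus a filter. The sketch below shows the idea with the metadata fields named above. The entries and URLs are invented placeholders, not the datasets in the TISSUE registry, and `find_datasets` is an illustrative helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    species: str
    tissue: str
    modality: str
    url: str


# Placeholder entries for illustration only.
REGISTRY = [
    DatasetEntry("demo_mouse_brain", "mouse", "brain", "MERFISH",
                 "https://example.org/demo_mouse_brain"),
    DatasetEntry("demo_human_heart", "human", "heart", "Visium",
                 "https://example.org/demo_human_heart"),
    DatasetEntry("demo_mouse_liver", "mouse", "liver", "Visium",
                 "https://example.org/demo_mouse_liver"),
]


def find_datasets(registry: list[DatasetEntry], **filters: str) -> list[DatasetEntry]:
    """Return entries whose metadata matches every given field=value filter."""
    return [
        entry for entry in registry
        if all(getattr(entry, key) == value for key, value in filters.items())
    ]


mouse = find_datasets(REGISTRY, species="mouse")
print([d.name for d in mouse])
```

A natural-language request such as "show me mouse datasets" reduces to one call with `species="mouse"`, and the returned `url` is what a download step would consume.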

Scanpy: Preprocessing and Clustering

Scanpy is a widely used package with many features. Paper2Agent focused on the most common use case: the preprocessing-to-clustering pipeline. It generated 7 tools in 45 minutes:

| Tool | Function |
| --- | --- |
| quality_control() | Calculate QC metrics, filter cells/genes, detect doublets |
| normalize_data() | Normalize count data |
| select_features() | Identify highly variable genes |
| reduce_dims() | PCA and UMAP |
| build_graph() | Neighborhood graph construction |
| cluster() | Leiden clustering at multiple resolutions |
| annotate() | Cell type annotation via differential expression |

The Role of MCP Prompts

For end-to-end workflows, the correct tool ordering is critical. You cannot cluster before normalizing, or reduce dimensions before selecting features. A general-purpose LLM might get the order wrong.

Paper2Agent solves this with MCP Prompts — workflow templates extracted from the paper's tutorials. The Scanpy MCP prompt encodes: QC → normalize → feature selection → dimensionality reduction → graph construction → clustering → annotation. The prompt also instructs the agent to inspect the data first and adjust parameters if defaults would yield incorrect results.
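
The ordering constraint that a prompt encodes can be written down as a simple check. This is a minimal sketch using the seven tool names from the table above; `is_valid_order` is an illustrative helper, and it checks only relative order, not whether a skipped step was actually required.

```python
PIPELINE = [
    "quality_control", "normalize_data", "select_features",
    "reduce_dims", "build_graph", "cluster", "annotate",
]


def is_valid_order(plan: list[str]) -> bool:
    """True if the plan's steps are unique and follow PIPELINE's relative order."""
    positions = [PIPELINE.index(step) for step in plan]
    return positions == sorted(positions) and len(set(plan)) == len(plan)


print(is_valid_order(PIPELINE))                       # True
print(is_valid_order(["cluster", "normalize_data"]))  # False
```

A prompt template removes the need for such a check at query time, because the agent is handed the sequence instead of having to infer it.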

Users only need the data path. The prompt "Perform standard single-cell preprocessing and clustering pipeline on this single-cell data: data.h5ad" triggers the full workflow. The agent chains all 7 tools in the correct order, producing highly variable gene plots, UMAP embeddings, cluster assignments, and cell type annotations — all matching human researcher results.
What problem do MCP Prompts solve that individual MCP Tools cannot?

Chapter 8: The ADHD Discovery

This is the payoff chapter — the moment Paper2Agent stops being "a convenient way to run code" and becomes a tool for scientific discovery.

The Setup

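Two papers exist independently: one contributes a method (AlphaGenome, which scores the functional effect of genetic variants), and the other contributes a dataset (ADHD GWAS loci with fine-mapped credible sets).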

A human researcher wanting to combine insights from both papers would need to: understand both methods, install both codebases, convert between data formats, design an analysis strategy, execute it, and interpret results. This typically takes weeks.

The AI Co-Scientist

Paper2Agent created MCPs for both papers and connected them to the same Claude Code agent — creating an "AI co-scientist" with access to both a method and a dataset. This agent was then prompted to generate novel hypotheses and execute analyses.

The co-scientist proposed several hypotheses, including:

  1. ADHD risk variants alter regulatory activity in brain-specific cell types.
  2. AlphaGenome can prioritize causal variants within ADHD fine-mapping credible sets.
  3. ADHD-associated variants disrupt transcription factor binding at FOXP family gene loci.
The finding: Among 209 candidate variants in one GWAS locus, the AI co-scientist identified rs1626703 as the most likely causal variant. This intronic variant is predicted to alter MPHOSPH9 splicing and expression specifically in glutamatergic neurons, with AlphaGenome quantile scores of 1.000 for splice junction effects and 0.963 for RNA-seq expression. MPHOSPH9 encodes an M-phase phosphoprotein involved in cell division and ciliogenesis — a plausible mechanism for ADHD risk through disrupted neuronal development.

Scaling to All 39 Loci

The co-scientist did not stop at one locus. It autonomously designed and executed a workflow across all 39 ADHD-associated loci:

  1. Extract credible-set variants from each locus.
  2. Run AlphaGenome functional scoring in glutamatergic neurons.
  3. Filter for protein-coding genes.
  4. Rank by maximum quantile impact scores across modalities.
  5. Compile a comprehensive report for each locus.

The entire analysis — across 39 loci — completed in approximately two hours. Manual execution by human experts would have taken weeks. The results are provided as a supplementary table in the paper, with each locus mapped to its prioritized causal variant, target gene, and molecular mechanism.
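
The filter-and-rank core of this workflow (steps 3 and 4) can be sketched on toy data. Everything in the example is a placeholder: the locus names, variant IDs, gene names, and scores are invented, and `prioritize` is an illustrative helper and not the agent's actual code.

```python
def prioritize(locus_variants: dict, protein_coding: set) -> dict:
    """For each locus, pick the variant with the highest score across modalities,
    considering only variants whose target gene is protein-coding."""
    report = {}
    for locus, variants in locus_variants.items():
        candidates = [v for v in variants if v["gene"] in protein_coding]
        if candidates:
            report[locus] = max(candidates, key=lambda v: max(v["scores"].values()))
    return report


# Invented placeholder data; not values from the paper.
LOCI = {
    "locus_1": [
        {"id": "var_a", "gene": "GENE_X", "scores": {"splicing": 0.41, "rna_seq": 0.97}},
        {"id": "var_b", "gene": "LNC_Y", "scores": {"splicing": 0.99, "rna_seq": 0.50}},
    ],
    "locus_2": [
        {"id": "var_c", "gene": "GENE_Z", "scores": {"splicing": 0.88, "rna_seq": 0.12}},
        {"id": "var_d", "gene": "GENE_Z", "scores": {"splicing": 0.30, "rna_seq": 0.95}},
    ],
}
PROTEIN_CODING = {"GENE_X", "GENE_Z"}

top = prioritize(LOCI, PROTEIN_CODING)
print({locus: v["id"] for locus, v in top.items()})
```

In the first locus the highest raw score belongs to `var_b`, but it is filtered out because its target is not protein-coding, which is why the filter in step 3 runs before the ranking in step 4.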


A new paradigm: This is not AI replacing scientists — it is AI accelerating the most labor-intensive part of scientific collaboration: integrating methods across papers. The human scientist formulated the high-level question ("combine these two papers"). The AI co-scientist designed the analysis, executed it, and surfaced results for human evaluation. The shift is from manual execution to synthesis of actionable insights.
What enabled the ADHD discovery that would not have been possible with either paper alone?

Chapter 9: Connections

What Paper2Agent Builds On

| Prior Work | Connection |
| --- | --- |
| Model Context Protocol (MCP) | The foundational protocol. Paper2Agent automates what MCP makes possible: turning arbitrary tools into standard, composable interfaces. |
| Claude Code architecture | Paper2Agent is implemented in Claude Code. The orchestrator uses Claude Code's sub-agent delegation, tool dispatch, and iterative debugging capabilities. See our Dive into Claude Code lesson. |
| ReAct (Yao et al., 2022) | The paper agents follow the ReAct pattern: reason about the task, call a tool, observe the result, repeat. Paper2Agent constrains this to pre-validated tools, reducing hallucination risk. |
| AI Scientist (Sakana, 2024) | Both envision AI as scientific collaborators. AI Scientist generates papers; Paper2Agent makes existing papers interactive. Complementary visions. |
| Paper2Code (Seo et al., 2025) | Generates code from papers for ML reproducibility. Paper2Agent goes further: it generates tested, deployed, interactive agents, not just code. |
| Biomni (Huang et al., 2025) | A general-purpose biomedical AI agent. Paper2Agent outperforms it on specialized tasks because pre-built tools eliminate runtime code generation. |

Limitations and Open Questions

The Vision: Agent Availability Sections

Just as journals now require data availability and code availability sections, the authors envision an agent availability section — specifying whether and how a paper's contribution has been embodied as an interactive agent. Well-documented, modular, transparent papers will naturally lend themselves to agentification. Papers that cannot be agentified reveal, by that failure, their reproducibility gaps.

Communities of agents: Once scientific knowledge is encoded in active agents rather than static artifacts, agents could interact with each other — linking methods to datasets, combining insights across domains. Paper2Agent's ADHD case study is a proof of concept for this vision: two paper agents collaborating to produce a discovery neither could make alone.

| Metric | Value |
| --- | --- |
| Paper | Paper2Agent (Miao et al., 2025) |
| Core idea | Convert papers to MCP servers → interactive AI agents |
| Pipeline stages | Analysis → Construction → Testing → Deployment |
| Sub-agents | 4 (Environment, Scanner, Extractor, Tester) |
| AlphaGenome tools | 22 tools, 100% accuracy, ~3 hours |
| Scanpy tools | 7 tools, human-matching results, ~45 min |
| Key discovery | rs1626703 → MPHOSPH9 splicing → ADHD risk |
| Speedup vs. Claude+Repo | 1.8–3.2x faster |
| Speedup vs. Biomni | 3.1–4.6x faster |

What makes Paper2Agent's approach fundamentally different from Paper2Code?