An automated framework that converts research papers into interactive AI agents via MCP servers — turning static publications into tool-using, test-validated, natural-language-queryable systems.
A new paper drops on arXiv. It describes a powerful genomics method, links to a GitHub repo with 47 Python files, three Jupyter notebooks, and a requirements.txt that pins 23 dependencies. You want to use this method on your data.
So you clone the repo. You create a conda environment. You spend two hours resolving dependency conflicts. You trace through the notebooks, trying to figure out which functions to call, in what order, with what parameters. You hit a cryptic error — the tutorial assumed a specific GPU, or a dataset format slightly different from yours. You open an issue on GitHub and wait.
This experience is so common it has become the default mode of scientific software adoption. A 2022 study found that only 25% of published research code could be executed without errors. The remaining 75% had dependency issues, missing data, undocumented assumptions, or outright bugs.
What if the paper could answer your questions? Not a chatbot summarizing the abstract — an actual agent that understands the method, has access to the working code, can execute the pipeline on your data, and returns validated results?
That is what Paper2Agent builds. It takes a paper and its codebase and automatically constructs an interactive AI agent that embodies the research contribution — an agent you can talk to in natural language, that calls the right functions in the right order, and that has been tested against the paper's own results before you ever interact with it.
The core idea is deceptively simple: convert a research paper into a Model Context Protocol (MCP) server, then connect that server to a chat agent. The MCP server exposes the paper's methods as callable tools, its data as queryable resources, and its workflows as executable prompts.
The Model Context Protocol is an open standard (created by Anthropic in late 2024) that defines how an LLM-based agent connects to external tools. Think of it as a USB port for AI — any tool that speaks MCP can be plugged into any agent that speaks MCP. No custom integration code needed.
An MCP server exposes three types of capabilities:
| Component | What it is | Example |
|---|---|---|
| Tools | Callable functions with typed inputs and outputs | score_variant(chr, pos, ref, alt, modality, tissue) |
| Resources | Static assets: data, docs, figures | Training data links, manuscript text, supplementary tables |
| Prompts | Workflow templates that orchestrate tools in sequence | "Preprocess → normalize → cluster → annotate" pipeline |
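To make the three capability types concrete, here is a minimal sketch that uses only the Python standard library. It is not the MCP SDK: the `ToyServer` class, its `tool` decorator, and the `paper://manuscript` key are hypothetical stand-ins, chosen only to show how tools, resources, and prompts differ in kind.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyServer:
    """Hypothetical stand-in for an MCP server (not the real SDK)."""
    tools: dict[str, Callable] = field(default_factory=dict)
    resources: dict[str, str] = field(default_factory=dict)
    prompts: dict[str, list[str]] = field(default_factory=dict)

    def tool(self, fn: Callable) -> Callable:
        # Register a callable under its function name.
        self.tools[fn.__name__] = fn
        return fn


server = ToyServer()


@server.tool
def normalize_data(counts: list[float]) -> list[float]:
    """Toy tool: scale counts so they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


# Resource: a static asset addressed by a URI-like key (placeholder content).
server.resources["paper://manuscript"] = "Manuscript text would go here."

# Prompt: an ordered workflow template that names tools in sequence.
server.prompts["basic_pipeline"] = ["normalize_data"]

result = server.tools["normalize_data"]([1.0, 1.0, 2.0])
print(result)  # [0.25, 0.25, 0.5]
```

The point of the split is that only tools execute code. Resources are read, and prompts tell the agent which tools to call and in what order.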
The framework has four stages, each handled by a specialized sub-agent: analysis, construction, testing, and deployment.
The entire pipeline is orchestrated by Claude Code acting as a meta-agent — it coordinates the four sub-agents, passes context between stages, and handles failures. On a personal laptop, Paper2Agent converted the AlphaGenome paper into 22 working MCP tools in about 3 hours, and the Scanpy preprocessing pipeline into 7 tools in about 45 minutes — all without human intervention.
Before you can turn a paper into an agent, you need to understand what the paper does. This is the job of Stage 1: a multi-agent system that reads the paper, clones the repo, and produces a structured understanding of the codebase.
Paper2Agent is itself implemented as a multi-agent system inside Claude Code. An orchestrator agent coordinates four specialized sub-agents, each with a focused mandate:
| Sub-Agent | Responsibility | Output |
|---|---|---|
| Environment Manager | Create clean, reproducible environment. Install dependencies. Resolve conflicts. | Working conda/pip environment |
| Tutorial Scanner | Scan repo for notebooks, tutorials, examples. Distinguish useful code from boilerplate. | Ranked list of tutorial candidates |
| Tool Extractor | Convert tutorials into single-purpose functions. Parameterize hardcoded values. Add type annotations. | Library of reusable tool functions |
| Test Verifier | Generate tests from tutorial examples. Run, diagnose, fix. Loop until stable. | Validated, tested tool implementations |
The Environment Manager does not just run pip install -r requirements.txt. It provisions an isolated workspace, analyzes which Python version the code expects, resolves version conflicts between pinned and transitive dependencies, and verifies that all imports succeed. This is the step that eliminates the two-hour dependency debugging session from Chapter 0.
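The last check in that list, verifying that all imports succeed, could be sketched roughly as follows. The function name and the module list are illustrative assumptions, not taken from Paper2Agent's code.

```python
import importlib


def verify_imports(modules: list[str]) -> dict[str, str]:
    """Try to import each module and return a name -> status map."""
    status = {}
    for name in modules:
        try:
            importlib.import_module(name)
            status[name] = "ok"
        except ImportError as exc:
            # Record the failure instead of stopping at the first one.
            status[name] = f"failed: {exc}"
    return status


# The third entry is a deliberately missing package for the demo.
report = verify_imports(["json", "math", "not_a_real_package_xyz"])
for name, state in report.items():
    print(f"{name}: {state}")
```

Collecting every failure in one pass, rather than raising on the first, gives whoever resolves the environment the full list of broken dependencies at once.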
The Tutorial Scanner solves a subtler problem: most repos contain many files, but only a few are genuine tutorials that demonstrate the method end-to-end. The scanner distinguishes tutorial notebooks from test files, utility scripts, and experimental code. It produces clear summaries of which resources demonstrate which capabilities.
The Tool Extractor sub-agent takes the executed tutorials from Stage 2 and converts each one into a clean, reusable tool. This is the most technically demanding transformation in the pipeline — turning ad-hoc notebook code into production-grade, schema-typed functions.
Consider a Jupyter notebook cell that scores a genomic variant:
```python
# Original tutorial code (AlphaGenome notebook)
from alphagenome.models import dna_client
import os

model = dna_client.create(os.getenv('ALPHAGENOME_API_KEY'))
result = model.score_variant(
    chrom='chr3',          # hardcoded!
    pos=58394738,          # hardcoded!
    ref='A', alt='T',      # hardcoded!
    modality='atac',       # hardcoded!
    tissue='CL:0000100'    # hardcoded!
)
print(result['quantile_score'])
```
The Tool Extractor transforms this into a parameterized function with typed inputs, default values, and a standardized return format:
```python
# Generated MCP tool
@mcp.tool()
def score_variant_effect(
    chrom: str,
    pos: int,
    ref: str,
    alt: str,
    modality: str = "atac",
    tissue: str = "CL:0000100",
) -> dict:
    """Score functional effect of a genetic variant.

    Args:
        chrom: Chromosome (e.g., 'chr3')
        pos: Genomic position
        ref: Reference allele
        alt: Alternate allele
        modality: Assay type ('atac', 'rnaseq', 'chipseq')
        tissue: Tissue/cell ontology ID

    Returns:
        Dict with quantile_score, effect_size, metadata
    """
    model = _get_model()  # cached singleton
    result = model.score_variant(
        chrom=chrom, pos=pos, ref=ref, alt=alt,
        modality=modality, tissue=tissue
    )
    return {
        "quantile_score": result["quantile_score"],
        "effect_size": result["effect_size"],
        "source_file": "alphagenome/scoring.py:L42"
    }
```
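The generated tool calls `_get_model()`, described in its comment as a cached singleton, but the helper itself is not shown. One plausible way to write such a helper uses `functools.lru_cache`. The `DummyModel` class below is a placeholder for the real client, and the whole block is an assumption about how the helper might look, not the generated code.

```python
from functools import lru_cache


class DummyModel:
    """Placeholder for an expensive-to-create model client."""
    instances = 0

    def __init__(self) -> None:
        DummyModel.instances += 1


@lru_cache(maxsize=1)
def _get_model() -> DummyModel:
    # The body runs only on the first call; later calls reuse the cached object.
    return DummyModel()


first = _get_model()
second = _get_model()
print(first is second, DummyModel.instances)  # True 1
```

Caching matters here because each tool call would otherwise rebuild the client, which for a real model can mean repeated authentication or weight loading.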
Tools are the extracted functions. But an MCP server is more than functions. It also contains:
Resources: The Tool Extractor identifies static assets — the manuscript text, supplementary tables, training data links, figure files — and registers them as queryable MCP resources. The AlphaGenome MCP, for instance, includes links to the training data used to train the model, accessible via a standardized resource query.
Prompts: For complex multi-step workflows, the system generates MCP prompts — templates that orchestrate tools in the correct order. A Scanpy MCP prompt might encode: "Run quality_control → normalize_data → select_features → reduce_dims → build_graph → cluster → annotate." These prompts are inferred directly from the paper's tutorials, not manually written.
The three components work together to make the paper's methods accessible.
This is the stage that separates Paper2Agent from "just ask an LLM to wrap the code." The Test Verifier sub-agent does not trust the generated tools — it validates them against the paper's own results through an iterative test-fix loop.
The process works like this:
Expected outputs from the tutorials become test assertions: if the tutorial shows that score_variant('chr3', 58394738, 'A', 'T', 'atac', 'CL:0000100') returns a quantile score of -0.0203, that becomes an assertion. If a tool still fails its tests after the fix loop, its @mcp.tool() decorator is removed, and it will not appear in the MCP server.

This is a crucial design choice: Paper2Agent would rather ship fewer tools that work than more tools that might hallucinate. A tool that passes all tests against the paper's own results is one you can trust. A tool that doesn't pass gets silently excluded.

Tests go beyond "does the function run without errors": they check that each tool's output matches the results recorded in the paper's own tutorials.
For AlphaGenome, this process validated 22 tools across 15 tutorial-based queries and 15 novel queries, achieving 100% accuracy on both sets. Every single tool produces outputs that exactly match the ground truth from the paper's own code.
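The gating step of the test-fix loop, keeping only tools whose outputs match recorded values, can be sketched as follows. The two toy tools, the recorded values, and the numeric tolerance are all made up for illustration, and the diagnose-and-fix step, which the agent performs, is omitted.

```python
from typing import Callable


def score_ok(x: float) -> float:
    return round(x * 2, 4)


def score_buggy(x: float) -> float:
    return round(x * 3, 4)  # deliberately wrong for the demo


# Each case pairs a tool with an input and the output recorded in a tutorial.
# The values here are invented for illustration.
CASES: list[tuple[Callable[[float], float], float, float]] = [
    (score_ok, 0.5, 1.0),
    (score_buggy, 0.5, 1.0),
]


def select_passing(cases, tol: float = 1e-6) -> list[str]:
    """Return names of tools whose outputs match the recorded values."""
    passing = []
    for tool, arg, expected in cases:
        try:
            ok = abs(tool(arg) - expected) <= tol
        except Exception:
            ok = False  # a crash counts as a failed test
        if ok:
            passing.append(tool.__name__)
    return passing


exposed = select_passing(CASES)
print(exposed)  # ['score_ok']
```

Only the names in `exposed` would keep their tool registration. The failing tool is dropped rather than shipped with a warning.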
The MCP server is built and tested. Now it needs a face — a conversational agent that users actually interact with. This is the final piece: connecting the MCP server to a chat agent like Claude Code.
MCP servers can be hosted remotely — Paper2Agent deploys them to Hugging Face Spaces, eliminating local dependency issues entirely. The user's chat agent connects to the MCP server over the network. From the agent's perspective, the MCP tools appear as native functions it can call, just like file reads or shell commands.
When a user types "Score variant chr19:8134523:G>A using ATAC-seq predictions for lung tissue," the agent selects score_variant_effect() from the MCP schema, fills in its arguments (chrom="chr19", pos=8134523, etc.), and calls the tool.

For complex queries, the agent chains multiple tool calls. When asked to "interpret why a variant associates with LDL cholesterol," the AlphaGenome agent constructs a multi-step plan.
The agent iteratively plans, acts, observes results, and refines its approach — the classic ReAct pattern. But unlike a general agent, every action is a validated MCP tool call, not ad-hoc code generation.
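A stripped-down version of that loop is sketched below. A scripted function stands in for the language model's reasoning, and the tool names, variant string, and return values are hypothetical. The part worth noticing is the guard that refuses any action not in the registry.

```python
from typing import Any, Callable, Optional

# Registry of validated tools (toy implementations, hypothetical names).
TOOLS: dict[str, Callable[..., Any]] = {
    "lookup_gene": lambda variant: "GENE_A",
    "score_gene": lambda gene: {"gene": gene, "score": 0.9},
}


def scripted_policy(observations: list[Any]) -> Optional[tuple[str, dict]]:
    """Stand-in for the LLM: pick the next action from what was observed."""
    if not observations:
        return "lookup_gene", {"variant": "chr1:100:A>T"}
    if len(observations) == 1:
        return "score_gene", {"gene": observations[0]}
    return None  # nothing left to do


def run_agent(policy, tools, max_steps: int = 5) -> list[Any]:
    """Plan, act, observe, repeat, using registered tools only."""
    observations: list[Any] = []
    for _ in range(max_steps):
        action = policy(observations)
        if action is None:
            break
        name, kwargs = action
        if name not in tools:
            # Only registered tools may run; no ad-hoc code generation.
            raise KeyError(f"unknown tool: {name}")
        observations.append(tools[name](**kwargs))
    return observations


trace = run_agent(scripted_policy, TOOLS)
print(trace)  # ['GENE_A', {'gene': 'GENE_A', 'score': 0.9}]
```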
Because MCP servers are modular, you can connect multiple MCPs to the same agent. A researcher could have the AlphaGenome MCP, the TISSUE MCP, and the Scanpy MCP all active simultaneously. The agent seamlessly routes queries to the right server based on the task. This enables cross-paper reasoning that would be extremely difficult to set up manually.
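Choosing which tool fits a query is the model's job, but the mechanical half of routing, mapping a chosen tool name to the server that owns it, is simple to illustrate. The inventories below are abbreviated and the function is a hypothetical sketch, not how any particular MCP client implements it.

```python
# Hypothetical, abbreviated tool inventories for two connected servers.
SERVERS: dict[str, set[str]] = {
    "alphagenome": {"score_variant_effect", "visualize_variant_effects"},
    "scanpy": {"quality_control", "normalize_data", "cluster"},
}


def build_routes(servers: dict[str, set[str]]) -> dict[str, str]:
    """Map each tool name to its owning server, rejecting name collisions."""
    routes: dict[str, str] = {}
    for server, tools in servers.items():
        for tool in tools:
            if tool in routes:
                raise ValueError(
                    f"{tool} is defined by both {routes[tool]} and {server}"
                )
            routes[tool] = server
    return routes


routes = build_routes(SERVERS)
print(routes["cluster"])               # scanpy
print(routes["score_variant_effect"])  # alphagenome
```

Rejecting collisions up front is one design option. Prefixing tool names with the server name is another, and it avoids the error at the cost of longer names.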
AlphaGenome is an AI model from Google DeepMind that predicts how single-nucleotide mutations in human DNA affect gene regulation — expression, chromatin accessibility, splicing, transcription factor binding. It is powerful but complex: the codebase involves custom data loaders, GPU-accelerated inference, tissue ontology lookups, and multi-modal visualization pipelines.
Paper2Agent generated 22 MCP tools in roughly 3 hours on a personal laptop.
Each tool exposes flexible, well-annotated parameters. The visualize_variant_effects() tool, for example, lets users toggle organism (human or mouse), sequence context length, and modality (RNA-seq, ATAC-seq, ChIP-seq histone tracks) — all options discoverable through the tool's typed schema.
| Agent System | Tutorial Queries (15) | Novel Queries (15) | Median Speedup |
|---|---|---|---|
| AlphaGenome Agent (Paper2Agent) | 100% (15/15) | 100% (15/15) | 1.0x (baseline) |
| Claude + Raw Repo | 60% (9/15) | 80% (12/15) | 1.8–3.2x slower |
| Biomni | 40% (6/15) | 60% (9/15) | 3.1–4.6x slower |
The Paper2Agent agent does not write code at query time; it calls score_variant_effect() with the right arguments. The pre-built tools eliminate an entire class of failure modes: wrong imports, incorrect parameter names, missing environment variables, GPU configuration issues.

When asked to interpret a GWAS locus associated with LDL cholesterol, the agent prioritized SORT1 as the most likely causal gene, whereas the original paper emphasized CELSR2 and PSRC1. The agent's reasoning: SORT1 had a quantile score of 0.99982 for expression impact in liver, and SORT1 encodes sortilin, a protein directly involved in LDL/VLDL secretion. Independent validation in GTEx eQTL data confirmed the association (p = 1.1e-65).
This was not a bug — it was a genuine scientific reinterpretation, enabled by the agent's ability to run comprehensive multi-modal analysis with a single prompt. The original authors may have emphasized different genes for valid reasons (both CELSR2 and PSRC1 also had high scores), but the agent provided an independent, model-based perspective that users can evaluate.
The AlphaGenome case shows Paper2Agent on a complex deep-learning model. The single-cell case studies — TISSUE and Scanpy — show it on a different challenge: multi-step analysis pipelines where the correct sequence of operations matters as much as the individual tools.
TISSUE is a method for predicting spatial gene expression with calibrated uncertainty estimates. Paper2Agent generated 6 tools covering spatial prediction, prediction interval construction, and uncertainty-aware downstream analysis (hypothesis testing, dimensionality reduction).
The TISSUE agent serves two roles:
Scanpy is a widely used package with many features. Paper2Agent focused on the most common use case: the preprocessing-to-clustering pipeline. It generated 7 tools in 45 minutes:
| Tool | Function |
|---|---|
| quality_control() | Calculate QC metrics, filter cells/genes, detect doublets |
| normalize_data() | Normalize count data |
| select_features() | Identify highly variable genes |
| reduce_dims() | PCA and UMAP |
| build_graph() | Neighborhood graph construction |
| cluster() | Leiden clustering at multiple resolutions |
| annotate() | Cell type annotation via differential expression |
For end-to-end workflows, the correct tool ordering is critical. You cannot cluster before normalizing, or reduce dimensions before selecting features. A general-purpose LLM might get the order wrong.
Paper2Agent solves this with MCP Prompts — workflow templates extracted from the paper's tutorials. The Scanpy MCP prompt encodes: QC → normalize → feature selection → dimensionality reduction → graph construction → clustering → annotation. The prompt also instructs the agent to inspect the data first and adjust parameters if defaults would yield incorrect results.
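The ordering constraint itself is easy to state in code. The sketch below treats the workflow as strictly linear, so every earlier step is a prerequisite of every later one. That is a simplifying assumption, and the checker function is hypothetical, not part of Paper2Agent.

```python
from typing import Optional

# Tool order taken from the workflow described above.
WORKFLOW = [
    "quality_control", "normalize_data", "select_features",
    "reduce_dims", "build_graph", "cluster", "annotate",
]


def first_order_violation(
    calls: list[str], workflow: list[str] = WORKFLOW
) -> Optional[str]:
    """Return the first call made before its prerequisite steps, else None."""
    done: set[str] = set()
    for call in calls:
        required = workflow[: workflow.index(call)]
        if any(step not in done for step in required):
            return call
        done.add(call)
    return None


print(first_order_violation(WORKFLOW))                        # None
print(first_order_violation(["quality_control", "cluster"]))  # cluster
```

An MCP prompt conveys the same ordering in natural language rather than enforcing it programmatically, which leaves the agent room to adjust parameters while keeping the sequence.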
This is the payoff chapter — the moment Paper2Agent stops being "a convenient way to run code" and becomes a tool for scientific discovery.
Two papers exist independently:
A human researcher wanting to combine insights from both papers would need to: understand both methods, install both codebases, convert between data formats, design an analysis strategy, execute it, and interpret results. This typically takes weeks.
Paper2Agent created MCPs for both papers and connected them to the same Claude Code agent — creating an "AI co-scientist" with access to both a method and a dataset. This agent was then prompted to generate novel hypotheses and execute analyses.
The co-scientist proposed several hypotheses, including:
The co-scientist did not stop at one locus. It autonomously designed and executed a workflow across all 39 ADHD-associated loci:
The entire analysis — across 39 loci — completed in approximately two hours. Manual execution by human experts would have taken weeks. The results are provided as a supplementary table in the paper, with each locus mapped to its prioritized causal variant, target gene, and molecular mechanism.
| Prior Work | Connection |
|---|---|
| Model Context Protocol (MCP) | The foundational protocol. Paper2Agent automates what MCP makes possible — turning arbitrary tools into standard, composable interfaces. |
| Claude Code architecture | Paper2Agent is implemented in Claude Code. The orchestrator uses Claude Code's sub-agent delegation, tool dispatch, and iterative debugging capabilities. See our Dive into Claude Code lesson. |
| ReAct (Yao et al., 2022) | The paper agents follow the ReAct pattern: reason about the task, call a tool, observe the result, repeat. Paper2Agent constrains this to pre-validated tools, reducing hallucination risk. |
| AI Scientist (Sakana, 2024) | Both envision AI as scientific collaborators. AI Scientist generates papers; Paper2Agent makes existing papers interactive. Complementary visions. |
| Paper2Code (Seo et al., 2025) | Generates code from papers for ML reproducibility. Paper2Agent goes further: it generates tested, deployed, interactive agents, not just code. |
| Biomni (Huang et al., 2025) | A general-purpose biomedical AI agent. Paper2Agent outperforms it on specialized tasks because pre-built tools eliminate runtime code generation. |
Just as journals now require data availability and code availability sections, the authors envision an agent availability section — specifying whether and how a paper's contribution has been embodied as an interactive agent. Well-documented, modular, transparent papers will naturally lend themselves to agentification. Papers that cannot be agentified reveal, by that failure, their reproducibility gaps.
| Metric | Value |
|---|---|
| Paper | Paper2Agent (Miao et al., 2025) |
| Core idea | Convert papers to MCP servers → interactive AI agents |
| Pipeline stages | Analysis → Construction → Testing → Deployment |
| Sub-agents | 4 (Environment, Scanner, Extractor, Tester) |
| AlphaGenome tools | 22 tools, 100% accuracy, ~3 hours |
| Scanpy tools | 7 tools, human-matching results, ~45 min |
| Key discovery | rs1626703 → MPHOSPH9 splicing → ADHD risk |
| Speedup vs. Claude+Repo | 1.8–3.2x faster |
| Speedup vs. Biomni | 3.1–4.6x faster |