Paper2Agent — Veanors

Chapter 0: The Problem

A new paper drops on arXiv. It describes a powerful genomics method, links to a GitHub repo with 47 Python files, three Jupyter notebooks, and a requirements.txt that pins 23 dependencies. You want to use this method on your data.

So you clone the repo. You create a conda environment. You spend two hours resolving dependency conflicts. You trace through the notebooks, trying to figure out which functions to call, in what order, with what parameters. You hit a cryptic error — the tutorial assumed a specific GPU, or a dataset format slightly different from yours. You open an issue on GitHub and wait.

This experience is so common it has become the default mode of scientific software adoption. A 2022 study found that only 25% of published research code could be executed without errors. The remaining 75% had dependency issues, missing data, undocumented assumptions, or outright bugs.

The fundamental gap: A research paper is a passive artifact. It describes what the authors did, but it cannot do anything. The reader must manually parse the paper, understand the method, navigate the code, configure the environment, and adapt the pipeline to their data. Every step is a potential failure point. The knowledge is trapped in static text.

What if the paper could answer your questions? Not a chatbot summarizing the abstract — an actual agent that understands the method, has access to the working code, can execute the pipeline on your data, and returns validated results?

That is what Paper2Agent builds. It takes a paper and its codebase and automatically constructs an interactive AI agent that embodies the research contribution — an agent you can talk to in natural language, that calls the right functions in the right order, and that has been tested against the paper's own results before you ever interact with it.

The shift: From "read paper, clone repo, debug for hours" to "ask a question, get a validated answer." The paper becomes a knowledgeable entity capable of execution and dialogue, not just a document encoding knowledge.

Why is a research paper fundamentally limited as a vehicle for method adoption?

Because it is a passive artifact: it describes what the authors did but cannot execute anything, requiring readers to manually parse, configure, debug, and adapt the code themselves
Because most papers are poorly written
Because PDF format does not support hyperlinks

Chapter 1: The Key Insight

The core idea is deceptively simple: convert a research paper into a Model Context Protocol (MCP) server, then connect that server to a chat agent. The MCP server exposes the paper's methods as callable tools, its data as queryable resources, and its workflows as executable prompts.

What is MCP?

The Model Context Protocol is an open standard (created by Anthropic in late 2024) that defines how an LLM-based agent connects to external tools. Think of it as a USB port for AI — any tool that speaks MCP can be plugged into any agent that speaks MCP. No custom integration code needed.

An MCP server exposes three types of capabilities:

Component	What it is	Example
Tools	Callable functions with typed inputs and outputs	`score_variant(chr, pos, ref, alt, modality, tissue)`
Resources	Static assets: data, docs, figures	Training data links, manuscript text, supplementary tables
Prompts	Workflow templates that orchestrate tools in sequence	"Preprocess → normalize → cluster → annotate" pipeline

Why MCP matters here: Before MCP, connecting a method to an AI agent meant writing custom glue code for each method-agent pair. If you had 100 papers and 5 agents, you needed 500 integrations. MCP makes this M + N instead of M × N: build one MCP server per paper, and any MCP-compatible agent can use it. Paper2Agent automates the "build one MCP server" step.
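
The arithmetic behind that claim is easy to check. A minimal sketch follows; the function names are illustrative and not part of Paper2Agent or MCP:

```python
def pairwise_integrations(num_papers: int, num_agents: int) -> int:
    """Custom glue code: one integration per (paper, agent) pair."""
    return num_papers * num_agents


def mcp_integrations(num_papers: int, num_agents: int) -> int:
    """Shared protocol: one MCP server per paper plus one MCP client per agent."""
    return num_papers + num_agents


papers, agents = 100, 5
print(pairwise_integrations(papers, agents))  # 500
print(mcp_integrations(papers, agents))       # 105
```

Adding a sixth agent costs 100 new integrations in the pairwise world and one in the MCP world.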

The Paper2Agent Pipeline

The framework has four stages, carried out by specialized sub-agents under a single orchestrator:

Stage 1: Analysis

Multi-agent analysis of the paper + codebase. Environment agent configures dependencies. Tutorial scanner identifies reusable code.

↓

Stage 2: Construction

Extract tutorials into single-purpose tools. Build typed schemas, wrap functions, parameterize hardcoded values.

↓

Stage 3: Testing

Auto-generate tests from tutorial examples. Run tests, diagnose failures, fix code. Repeat until all pass or failing tools are removed.

↓

Stage 4: Deployment

Package as MCP server. Deploy to Hugging Face Spaces. Connect to chat agent (e.g., Claude Code).

The entire pipeline is orchestrated by Claude Code acting as a meta-agent — it coordinates the four sub-agents, passes context between stages, and handles failures. On a personal laptop, Paper2Agent converted the AlphaGenome paper into 22 working MCP tools in about 3 hours, and the Scanpy preprocessing pipeline into 7 tools in about 45 minutes — all without human intervention.

The composability payoff: Because MCP servers are modular, you can connect multiple paper MCPs to the same agent. This enables cross-paper reasoning: "Use the genomics method from Paper A to analyze the GWAS data from Paper B." We will see exactly this in Chapter 8.

What makes MCP the right protocol for Paper2Agent?

It is a standard interface that decouples tools from agents: one MCP server per paper works with any compatible agent, making the cost M+N instead of M×N
It is faster than REST APIs
It was created by the same lab as Paper2Agent

Chapter 2: Paper Analysis Stage

Before you can turn a paper into an agent, you need to understand what the paper does. This is the job of Stage 1: a multi-agent system that reads the paper, clones the repo, and produces a structured understanding of the codebase.

The Orchestrator and Four Sub-Agents

Paper2Agent is itself implemented as a multi-agent system inside Claude Code. An orchestrator agent coordinates four specialized sub-agents, each with a focused mandate:

Sub-Agent	Responsibility	Output
Environment Manager	Create clean, reproducible environment. Install dependencies. Resolve conflicts.	Working conda/pip environment
Tutorial Scanner	Scan repo for notebooks, tutorials, examples. Distinguish useful code from boilerplate.	Ranked list of tutorial candidates
Tool Extractor	Convert tutorials into single-purpose functions. Parameterize hardcoded values. Add type annotations.	Library of reusable tool functions
Test Verifier	Generate tests from tutorial examples. Run, diagnose, fix. Loop until stable.	Validated, tested tool implementations

What "Analysis" Actually Does

The Environment Manager does not just run pip install -r requirements.txt. It provisions an isolated workspace, analyzes which Python version the code expects, resolves version conflicts between pinned and transitive dependencies, and verifies that all imports succeed. This is the step that eliminates the two-hour dependency debugging session from Chapter 0.
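
The "verify that all imports succeed" check can be sketched in a few lines. This is an assumed, simplified version of that one step, not the Environment Manager's actual code; `verify_imports` is a hypothetical helper name:

```python
import importlib


def verify_imports(module_names: list[str]) -> dict[str, str]:
    """Try to import each module; map name -> 'ok' or the error message."""
    report = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            report[name] = "ok"
        except Exception as exc:  # ImportError, or errors raised at import time
            report[name] = f"{type(exc).__name__}: {exc}"
    return report


report = verify_imports(["json", "math", "surely_not_installed_pkg_xyz"])
failed = [name for name, status in report.items() if status != "ok"]
print(failed)  # ['surely_not_installed_pkg_xyz']
```

A non-empty `failed` list is the kind of signal that would send the agent back to dependency resolution before any tutorial is run.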

The Tutorial Scanner solves a subtler problem: most repos contain many files, but only a few are genuine tutorials that demonstrate the method end-to-end. The scanner distinguishes tutorial notebooks from test files, utility scripts, and experimental code. It produces clear summaries of which resources demonstrate which capabilities.
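
To make the ranking idea concrete, here is a deliberately crude sketch that scores files by path alone. The real scanner is an LLM sub-agent that reads file contents; the keyword lists and scoring weights below are assumptions chosen for illustration:

```python
import tempfile
from pathlib import Path

TUTORIAL_HINTS = ("tutorial", "example", "quickstart", "demo")  # assumed keywords
SKIP_HINTS = ("test", "scratch", "experimental")                # assumed keywords


def score_candidate(relative_path: Path) -> int:
    """Heuristic score: higher means more likely a genuine tutorial."""
    text = relative_path.as_posix().lower()
    if any(hint in text for hint in SKIP_HINTS):
        return 0
    score = 1 if relative_path.suffix == ".ipynb" else 0
    score += sum(2 for hint in TUTORIAL_HINTS if hint in text)
    return score


def scan_repo(repo_root: Path) -> list[Path]:
    """Return notebooks and scripts ranked by tutorial likelihood, best first."""
    files = [p.relative_to(repo_root) for p in repo_root.rglob("*")
             if p.suffix in {".ipynb", ".py"}]
    scored = [(score_candidate(p), p) for p in files]
    kept = [item for item in scored if item[0] > 0]
    return [p for _, p in sorted(kept, key=lambda item: (-item[0], item[1].as_posix()))]


def demo() -> list[str]:
    """Build a throwaway repo layout and rank its files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel in ["tutorials/quickstart.ipynb", "examples/variant_demo.py",
                    "tests/test_model.py", "src/utils.py", "notebooks/scratch.ipynb"]:
            target = root / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text("")
        return [p.as_posix() for p in scan_repo(root)]


print(demo())  # ['tutorials/quickstart.ipynb', 'examples/variant_demo.py']
```

Test files, utility modules, and scratch notebooks drop out, which is the distinction the scanner has to make before anything is extracted.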

Why multi-agent? Each sub-agent operates in an isolated context with a focused system prompt. The Environment Manager never sees tutorial code. The Tutorial Scanner never touches dependency resolution. This isolation prevents cross-contamination of concerns and allows each agent to use its full context window for its specific task. The orchestrator passes only summaries between agents, not raw outputs.
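
The summary-only handoff can be sketched as follows. Everything here is a toy stand-in: `SubAgent`, `summarize`, and `orchestrate` are hypothetical names, and the "summary" is just the first line of the raw output rather than an LLM-written digest:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SubAgent:
    name: str
    run: Callable[[str], str]                          # handoff summary in, raw output out
    context: list[str] = field(default_factory=list)   # private to this agent


def summarize(raw_output: str, limit: int = 60) -> str:
    """Stand-in for an LLM summary: keep only the first line, truncated."""
    lines = raw_output.splitlines()
    return lines[0][:limit] if lines else ""


def orchestrate(agents: list[SubAgent], task: str) -> list[str]:
    """Run agents in sequence, forwarding only summaries between them."""
    handoff = task
    summaries = []
    for agent in agents:
        agent.context.append(handoff)   # the agent sees only the handoff
        raw = agent.run(handoff)
        agent.context.append(raw)       # raw output stays in its own context
        handoff = summarize(raw)
        summaries.append(f"{agent.name}: {handoff}")
    return summaries


env_agent = SubAgent("environment", lambda handoff: "environment ready\npip log line 1\npip log line 2")
scanner = SubAgent("scanner", lambda handoff: "2 tutorials found\nnotebook cell dump")
summaries = orchestrate([env_agent, scanner], "set up the repo for the paper")
print(summaries)  # ['environment: environment ready', 'scanner: 2 tutorials found']
```

The scanner's context never contains the pip log lines, only the one-line summary it was handed.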

The Six Steps in Detail

Locate and download the official repository linked to the paper.
Environment setup: provision workspace, pin dependencies, verify imports.
Tutorial discovery: scan for notebooks, README examples, tutorial directories.
Tutorial execution and audit: run selected tutorials end-to-end with their example data, capture inputs/outputs/figures, record implicit assumptions.
Tool extraction: convert tutorial logic into parameterized, typed functions.
MCP assembly: integrate tools, resources, and prompts into a deployable server.

What makes this different from "just ask an LLM to read the paper": General-purpose LLMs can hallucinate code. Paper2Agent never generates novel algorithms; it extracts and wraps existing code from the paper's own repository. Every tool traces back to a specific source file. This is what prevents "code hallucination" and keeps the agent's outputs grounded in the authors' validated implementation.

Why does Paper2Agent use isolated sub-agents rather than a single agent for the analysis stage?

Each sub-agent operates with a focused system prompt and full context window for its specific task, preventing cross-contamination of concerns like dependency resolution and tutorial parsing
Because a single agent would be too slow
To reduce API costs by splitting calls across models

Chapter 3: MCP Construction

The Tool Extractor sub-agent takes the tutorials executed during Stage 1 and converts each one into a clean, reusable tool. This is the most technically demanding transformation in the pipeline: turning ad-hoc notebook code into production-grade, schema-typed functions.

From Tutorial to Tool

Consider a Jupyter notebook cell that scores a genomic variant:

# Original tutorial code (AlphaGenome notebook)
from alphagenome.models import dna_client
import os
model = dna_client.create(os.getenv('ALPHAGENOME_API_KEY'))
result = model.score_variant(
    chrom='chr3',        # hardcoded!
    pos=58394738,         # hardcoded!
    ref='A', alt='T',    # hardcoded!
    modality='atac',     # hardcoded!
    tissue='CL:0000100'  # hardcoded!
)
print(result['quantile_score'])

The Tool Extractor transforms this into a parameterized function with typed inputs, default values, and a standardized return format:

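# Generated MCP tool
import os
from functools import lru_cache

from alphagenome.models import dna_client
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("alphagenome")  # server name is illustrative


@lru_cache(maxsize=1)
def _get_model():
    """Create the AlphaGenome client once and reuse it across tool calls."""
    return dna_client.create(os.getenv('ALPHAGENOME_API_KEY'))


@mcp.tool()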
def score_variant_effect(
    chrom: str,
    pos: int,
    ref: str,
    alt: str,
    modality: str = "atac",
    tissue: str = "CL:0000100",
) -> dict:
    """Score functional effect of a genetic variant.

    Args:
        chrom: Chromosome (e.g., 'chr3')
        pos: Genomic position
        ref: Reference allele
        alt: Alternate allele
        modality: Assay type ('atac', 'rnaseq', 'chipseq')
        tissue: Tissue/cell ontology ID
    Returns:
        Dict with quantile_score, effect_size, source_file
    """
    model = _get_model()  # cached singleton
    result = model.score_variant(
        chrom=chrom, pos=pos, ref=ref, alt=alt,
        modality=modality, tissue=tissue
    )
    return {
        "quantile_score": result["quantile_score"],
        "effect_size": result["effect_size"],
        "source_file": "alphagenome/scoring.py:L42"
    }

The key transformations: (1) Parameterize every hardcoded value. (2) Add type annotations so the LLM knows what to pass. (3) Enforce file-based inputs — no inline data blobs. (4) Save artifacts (figures, tables) to disk. (5) Embed source traceability — every tool links back to the exact line in the original repo. This last point is critical: it means the user can always verify what the tool is actually running.

Building the Three MCP Components

Tools are the extracted functions. But an MCP server is more than functions. It also contains:

Resources: The Tool Extractor identifies static assets (the manuscript text, supplementary tables, training data links, figure files) and registers them as queryable MCP resources. The AlphaGenome MCP, for instance, includes links to the model's training data, accessible via a standardized resource query.

Prompts: For complex multi-step workflows, the system generates MCP prompts — templates that orchestrate tools in the correct order. A Scanpy MCP prompt might encode: "Run quality_control → normalize_data → select_features → reduce_dims → build_graph → cluster → annotate." These prompts are inferred directly from the paper's tutorials, not manually written.
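
How the three components sit side by side can be shown with a toy registry. `PaperServer` below is a hypothetical stand-in written for this illustration, not the MCP SDK; the resource URI and the two stub tools are likewise assumptions:

```python
from typing import Callable


class PaperServer:
    """Toy stand-in for an MCP server: it only mirrors the three registries."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.tools: dict[str, Callable] = {}
        self.resources: dict[str, str] = {}
        self.prompts: dict[str, list[str]] = {}

    def tool(self, func: Callable) -> Callable:
        """Decorator that registers a callable under its function name."""
        self.tools[func.__name__] = func
        return func

    def add_resource(self, uri: str, description: str) -> None:
        self.resources[uri] = description

    def add_prompt(self, name: str, tool_order: list[str]) -> None:
        """Register a workflow; every step must name an existing tool."""
        missing = [step for step in tool_order if step not in self.tools]
        if missing:
            raise ValueError(f"prompt references unknown tools: {missing}")
        self.prompts[name] = tool_order


server = PaperServer("scanpy-demo")


@server.tool
def normalize_data(path: str) -> str:
    return f"normalized:{path}"   # stub body; a real tool would call the paper's code


@server.tool
def cluster(path: str) -> str:
    return f"clustered:{path}"    # stub body


server.add_resource("paper://manuscript", "Full manuscript text")
server.add_prompt("standard_pipeline", ["normalize_data", "cluster"])
print(sorted(server.tools), list(server.resources), server.prompts)
```

Validating prompt steps against the tool registry reflects the dependency between the components: a prompt is only meaningful if every tool it orchestrates actually shipped.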

Why does each MCP tool embed a traceable link to the original source code?

To prevent code hallucination: users can verify that the tool executes the authors' validated implementation, not LLM-generated code
For debugging convenience
Because MCP requires source links

Chapter 4: Test-Driven Refinement

This is the stage that separates Paper2Agent from "just ask an LLM to wrap the code." The Test Verifier sub-agent does not trust the generated tools — it validates them against the paper's own results through an iterative test-fix loop.

The Test-Fix Loop

The process works like this:

Generate tests from the tutorial's own examples. If the tutorial shows that score_variant('chr3', 58394738, 'A', 'T', 'atac', 'CL:0000100') returns a quantile score of -0.0203, that becomes an assertion.
Run all tests. Capture stdout, stderr, return values, and any generated figures.
Diagnose failures. The test agent reads the error messages and identifies root causes — missing imports, incorrect parameter mapping, environment issues, numerical precision mismatches.
Fix the code. Apply targeted edits to the tool implementation or the test itself (if the assertion tolerance was too tight).
Repeat until all tests pass or the agent gives up on a tool.

The safety net: If a tool repeatedly fails after multiple fix attempts, its @mcp.tool() decorator is removed. It will not appear in the MCP server. This is a crucial design choice: Paper2Agent would rather ship fewer tools that work than more tools that might produce wrong results. A tool that passes all tests against the paper's own results is one you can trust. A tool that doesn't pass gets silently excluded.
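
The loop and its safety net can be sketched together. This is a schematic under stated assumptions: the "fix" step is a scripted function standing in for an LLM edit, `refine_tool` is a hypothetical name, and the budget of three fixes is invented because the text only says "multiple fix attempts":

```python
from typing import Callable, Optional

Tool = Callable[[float], float]


def refine_tool(
    tool: Tool,
    run_checks: Callable[[Tool], Optional[str]],
    propose_fix: Callable[[Tool, str], Tool],
    max_fixes: int = 3,  # assumed budget
) -> Optional[Tool]:
    """Return a passing implementation, or None if the tool should be dropped."""
    for attempt in range(max_fixes + 1):
        error = run_checks(tool)
        if error is None:
            return tool
        if attempt < max_fixes:
            tool = propose_fix(tool, error)
    return None


def run_checks(tool: Tool) -> Optional[str]:
    """One assertion derived from a 'tutorial' example: doubling 2.0 gives 4.0."""
    got = tool(2.0)
    return None if got == 4.0 else f"expected 4.0, got {got}"


def scripted_fix(tool: Tool, error: str) -> Tool:
    return lambda x: x * 2   # stand-in for a targeted edit that works


def useless_fix(tool: Tool, error: str) -> Tool:
    return tool              # a "fix" that changes nothing


def buggy(x: float) -> float:
    return x * 2 + 1


server_tools = {}
for name, fixer in [("double", scripted_fix), ("double_unfixable", useless_fix)]:
    result = refine_tool(buggy, run_checks, fixer)
    if result is not None:   # tools that never pass are simply not registered
        server_tools[name] = result

print(sorted(server_tools))  # ['double']
```

The unfixable tool leaves no trace in `server_tools`, which is the code-level equivalent of removing its decorator.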

What Gets Tested

Tests are not just "does the function run without errors." They check:

Numerical accuracy: Does the output match the tutorial's expected value? (With appropriate tolerance for floating-point differences.)
Output structure: Does the return dict contain the expected keys?
Figure generation: Does the visualization tool produce a file? Is it non-empty?
Edge cases: Does the tool handle missing inputs gracefully?
Reproducibility: Does running the same input twice produce the same output? (Critical for scientific tools.)
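
Three of these checks (accuracy, structure, reproducibility) fit in one small function. The stub tool below is a stand-in: it returns the quantile score quoted from the tutorial example, while the `effect_size` value is made up for illustration:

```python
import math
from typing import Callable


def score_variant_stub(chrom: str, pos: int, ref: str, alt: str) -> dict:
    """Deterministic stand-in for a real tool; effect_size is a made-up number."""
    return {"quantile_score": -0.0203, "effect_size": 0.0117}


def check_tool(
    tool: Callable[..., dict],
    args: tuple,
    expected_score: float,
    rel_tol: float = 1e-6,
) -> list[str]:
    """Return a list of failed checks (an empty list means the tool passed)."""
    failures = []
    first = tool(*args)
    second = tool(*args)
    if not {"quantile_score", "effect_size"}.issubset(first):
        failures.append("structure: missing expected keys")
    elif not math.isclose(first["quantile_score"], expected_score, rel_tol=rel_tol):
        failures.append("accuracy: quantile_score outside tolerance")
    if first != second:
        failures.append("reproducibility: repeated call differs")
    return failures


example_args = ("chr3", 58394738, "A", "T")
print(check_tool(score_variant_stub, example_args, expected_score=-0.0203))  # []
```

Comparing with `math.isclose` rather than `==` is what "appropriate tolerance" means in practice; the tolerance value itself is a judgment call per tool.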

For AlphaGenome, this process validated 22 tools across 15 tutorial-based queries and 15 novel queries, achieving 100% accuracy on both sets. Every tool that shipped produces outputs that match the ground truth from the paper's own code, within the test tolerance.

Locked after validation: Once a tool passes all tests, its implementation is frozen. This means the agent will always run the exact same code path — no runtime code generation, no LLM improvisation. The LLM decides which tool to call and with what arguments, but the tool's internal logic is deterministic and locked. This design minimizes randomness in code generation and strengthens reproducibility.

What happens when a tool repeatedly fails the test-fix loop?

Its MCP decorator is removed and it is excluded from the server: Paper2Agent ships fewer tools that work rather than more tools that might produce wrong results
It is shipped with a warning label
The entire pipeline restarts from Stage 1

Chapter 5: Chat Agent Integration

The MCP server is built and tested. Now it needs a face — a conversational agent that users actually interact with. This is the final piece: connecting the MCP server to a chat agent like Claude Code.

How the Connection Works

MCP servers can be hosted remotely — Paper2Agent deploys them to Hugging Face Spaces, eliminating local dependency issues entirely. The user's chat agent connects to the MCP server over the network. From the agent's perspective, the MCP tools appear as native functions it can call, just like file reads or shell commands.

When a user types "Score variant chr19:8134523:G>A using ATAC-seq predictions for lung tissue," the agent:

Parses the intent: identifies the variant, modality, and tissue from natural language.
Selects the tool: matches the query to score_variant_effect() from the MCP schema.
Constructs the call: fills in typed parameters — chrom="chr19", pos=8134523, etc.
Executes: the MCP server runs the locked, tested code on its own infrastructure.
Returns results: the agent receives the structured output and presents it in natural language.
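
The first three steps amount to turning free text into a typed call. In the real system the LLM does this; the regex below is only a stand-in to show the shape of the result. `build_tool_call` is a hypothetical helper, and tissue is left out because mapping "lung" to an ontology ID needs a lookup this sketch does not have:

```python
import re

VARIANT_PATTERN = re.compile(r"(chr[0-9XYM]+):(\d+):([ACGT]+)>([ACGT]+)")
# Modality values follow the tool's docstring: 'atac', 'rnaseq', 'chipseq'.
MODALITY_KEYWORDS = {"atac-seq": "atac", "rna-seq": "rnaseq", "chip-seq": "chipseq"}


def build_tool_call(query: str) -> dict:
    """Extract typed arguments for score_variant_effect from a user query."""
    match = VARIANT_PATTERN.search(query)
    if match is None:
        raise ValueError("no variant of the form chrN:pos:ref>alt found")
    chrom, pos, ref, alt = match.groups()
    arguments = {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}
    lowered = query.lower()
    for keyword, modality in MODALITY_KEYWORDS.items():
        if keyword in lowered:
            arguments["modality"] = modality
    return {"tool": "score_variant_effect", "arguments": arguments}


call = build_tool_call(
    "Score variant chr19:8134523:G>A using ATAC-seq predictions for lung tissue"
)
print(call)
```

Note that `pos` arrives as an `int`, not a string: the typed schema is what lets the server reject malformed calls before any scientific code runs.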

The agent is NOT generating code: This is the critical distinction from "give an LLM access to a repo." The agent does not write Python. It calls pre-built, pre-tested functions through a typed interface. The LLM's job is limited to intent parsing and parameter extraction — tasks where it excels. The scientific computation is handled by the locked, validated tool. This separation of concerns is what makes Paper2Agent reliable.

Multi-Step Workflows

For complex queries, the agent chains multiple tool calls. When asked to "interpret why a variant associates with LDL cholesterol," the AlphaGenome agent constructs a multi-step plan:

Score the variant across multiple modalities (expression, chromatin, splicing).
Filter results for the relevant tissue (liver, for LDL).
Generate modality-specific visualizations.
Compile a report with figures and interpretation.

The agent iteratively plans, acts, observes results, and refines its approach — the classic ReAct pattern. But unlike a general agent, every action is a validated MCP tool call, not ad-hoc code generation.

Multi-Paper Agents

Because MCP servers are modular, you can connect multiple MCPs to the same agent. A researcher could have the AlphaGenome MCP, the TISSUE MCP, and the Scanpy MCP all active simultaneously. The agent seamlessly routes queries to the right server based on the task. This enables cross-paper reasoning that would be extremely difficult to set up manually.
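
Routing across several servers can be pictured as a lookup from tool name to owning server. The class below is a toy stand-in for what an MCP client does, with hypothetical names and stub tools; in practice the LLM picks the tool from the combined schemas:

```python
from typing import Any, Callable


class MultiServerRouter:
    """Maps each tool name to the server that owns it."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[str, Callable[..., Any]]] = {}

    def connect(self, server_name: str, tools: dict[str, Callable[..., Any]]) -> None:
        """Register a server's tools, refusing ambiguous duplicate names."""
        for tool_name, func in tools.items():
            if tool_name in self._tools:
                owner = self._tools[tool_name][0]
                raise ValueError(f"{tool_name!r} already provided by {owner!r}")
            self._tools[tool_name] = (server_name, func)

    def call(self, tool_name: str, **arguments: Any) -> dict:
        server_name, func = self._tools[tool_name]
        return {"server": server_name, "result": func(**arguments)}


router = MultiServerRouter()
router.connect("alphagenome", {"score_variant_effect": lambda chrom, pos: f"scored {chrom}:{pos}"})
router.connect("scanpy", {"normalize_data": lambda path: f"normalized {path}"})

print(router.call("normalize_data", path="data.h5ad"))
# {'server': 'scanpy', 'result': 'normalized data.h5ad'}
```

Rejecting duplicate tool names is one assumed policy; namespacing tools by server would be another reasonable choice.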

Performance baseline: On the AlphaGenome benchmark, the Paper2Agent-generated agent achieved 100% accuracy on both tutorial-based and novel queries. Claude Code with access to the raw repo scored 60-80%. Biomni scored 40-60%. The Paper2Agent agent was also 1.8-4.6x faster in median runtime. Pre-built tools eliminate the overhead of reading source code, understanding APIs, and generating bespoke scripts at query time.

Why is the Paper2Agent approach more reliable than giving an LLM direct access to a repository?

The LLM calls pre-built, pre-tested functions through typed schemas rather than generating novel code, limiting its role to intent parsing where it excels, while scientific computation runs in locked, validated tools
Because MCP is a faster protocol than file system access
Because Claude Code is a better model than other LLMs

Chapter 6: Case Study — AlphaGenome

AlphaGenome is an AI model from Google DeepMind that predicts how single-nucleotide mutations in human DNA affect gene regulation — expression, chromatin accessibility, splicing, transcription factor binding. It is powerful but complex: the codebase involves custom data loaders, GPU-accelerated inference, tissue ontology lookups, and multi-modal visualization pipelines.

What Paper2Agent Built

Paper2Agent generated 22 MCP tools in roughly 3 hours on a personal laptop, covering:

Single-variant scoring: predict functional effects across modalities and tissues.
Batch-variant scoring: process multiple variants in one call.
Sequence-level prediction: given a DNA sequence, predict its regulatory landscape.
Tissue ontology exploration: browse available tissues and cell types.
Visualization suite: generate publication-ready figures for variant effects, TF binding, chromatin accessibility.

Each tool exposes flexible, well-annotated parameters. The visualize_variant_effects() tool, for example, lets users toggle organism (human or mouse), sequence context length, and modality (RNA-seq, ATAC-seq, ChIP-seq histone tracks) — all options discoverable through the tool's typed schema.

Benchmark Results

Agent System	Tutorial Queries (15)	Novel Queries (15)	Median Runtime vs. Paper2Agent
AlphaGenome Agent (Paper2Agent)	100% (15/15)	100% (15/15)	1.0x (baseline)
Claude + Raw Repo	60% (9/15)	80% (12/15)	1.8–3.2x slower
Biomni	40% (6/15)	60% (9/15)	3.1–4.6x slower

Why the gap is so large: When Claude + Repo encounters a query, it must read the source code, understand the API, write a Python script, execute it, and handle errors — all from scratch each time. The Paper2Agent agent just calls score_variant_effect() with the right arguments. The pre-built tools eliminate an entire class of failure modes: wrong imports, incorrect parameter names, missing environment variables, GPU configuration issues.

The SORT1 Reinterpretation

When asked to interpret a GWAS locus associated with LDL cholesterol, the agent prioritized SORT1 as the most likely causal gene — whereas the original paper emphasized CELSR2 and PSRC1. The agent's reasoning: SORT1 had a quantile score of 0.99982 for expression impact in liver, and SORT1 encodes sortilin, a protein directly involved in LDL/VLDL secretion. Independent validation in GTEx eQTL data confirmed the association (p = 1.1e-65).

This was not a bug — it was a genuine scientific reinterpretation, enabled by the agent's ability to run comprehensive multi-modal analysis with a single prompt. The original authors may have emphasized different genes for valid reasons (both CELSR2 and PSRC1 also had high scores), but the agent provided an independent, model-based perspective that users can evaluate.

The broader point: With Paper2Agent, published conclusions become re-evaluable. A single prompt can trigger an analysis that took the original authors days. This shifts the balance of scientific effort from execution to interpretation.

Why did the AlphaGenome agent achieve 100% accuracy while Claude + Repo scored only 60-80%?

Pre-built, pre-tested tools eliminate failure modes from code generation: wrong imports, incorrect parameters, missing environment variables, and GPU configuration issues
The Paper2Agent agent used a more powerful model
The benchmark queries were easier for the Paper2Agent agent

Chapter 7: Case Study — Single-Cell Agents

The AlphaGenome case shows Paper2Agent on a complex deep-learning model. The single-cell case studies — TISSUE and Scanpy — show it on a different challenge: multi-step analysis pipelines where the correct sequence of operations matters as much as the individual tools.

TISSUE: Uncertainty-Aware Spatial Transcriptomics

TISSUE is a method for predicting spatial gene expression with calibrated uncertainty estimates. Paper2Agent generated 6 tools covering spatial prediction, prediction interval construction, and uncertainty-aware downstream analysis (hypothesis testing, dimensionality reduction).

The TISSUE agent serves two roles:

Execution: Given a spatial count matrix and scRNA-seq data, it runs the full TISSUE pipeline — from data loading through imputation and uncertainty estimation — producing outputs identical to human-executed results.
Guidance: Users can ask "What are the required inputs for TISSUE?" and receive structured, actionable instructions — transforming the paper into a live Q&A system about the method.

MCP Resources as data catalogs: Paper2Agent translated the TISSUE paper's data availability section into a structured registry of spatial transcriptomics datasets, with standardized metadata (species, tissue type, modality, data URL). Users can query by species, download data through the Zenodo REST API, and pipe it directly into the analysis tools — all from natural language.
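
A registry like this is essentially a list of records with a filter on top. In the sketch below the metadata fields follow the ones named above (species, tissue, modality, URL), but every entry is a placeholder: the dataset names and `example.org` URLs are invented and do not come from the TISSUE paper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    species: str
    tissue: str
    modality: str
    url: str


# Placeholder entries: names and URLs are invented for illustration only.
CATALOG = [
    DatasetEntry("dataset_a", "mouse", "brain", "spatial", "https://example.org/dataset_a"),
    DatasetEntry("dataset_b", "human", "heart", "spatial", "https://example.org/dataset_b"),
    DatasetEntry("dataset_c", "mouse", "brain", "scRNA-seq", "https://example.org/dataset_c"),
]


def find_datasets(catalog: list[DatasetEntry], **filters: str) -> list[DatasetEntry]:
    """Return entries whose metadata matches every given field=value filter."""
    return [
        entry for entry in catalog
        if all(getattr(entry, field_name) == value for field_name, value in filters.items())
    ]


matches = find_datasets(CATALOG, species="mouse", modality="spatial")
print([entry.name for entry in matches])  # ['dataset_a']
```

Because each record carries its own URL, the result of a query can be handed directly to a download step and then to the analysis tools.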

Scanpy: Preprocessing and Clustering

Scanpy is a widely used package with many features. Paper2Agent focused on the most common use case: the preprocessing-to-clustering pipeline. It generated 7 tools in 45 minutes:

Tool	Function
`quality_control()`	Calculate QC metrics, filter cells/genes, detect doublets
`normalize_data()`	Normalize count data
`select_features()`	Identify highly variable genes
`reduce_dims()`	PCA and UMAP
`build_graph()`	Neighborhood graph construction
`cluster()`	Leiden clustering at multiple resolutions
`annotate()`	Cell type annotation via differential expression

The Role of MCP Prompts

For end-to-end workflows, the correct tool ordering is critical. You cannot cluster before normalizing, or reduce dimensions before selecting features. A general-purpose LLM might get the order wrong.

Paper2Agent solves this with MCP Prompts — workflow templates extracted from the paper's tutorials. The Scanpy MCP prompt encodes: QC → normalize → feature selection → dimensionality reduction → graph construction → clustering → annotation. The prompt also instructs the agent to inspect the data first and adjust parameters if defaults would yield incorrect results.

Users only need the data path. The prompt "Perform standard single-cell preprocessing and clustering pipeline on this single-cell data: data.h5ad" triggers the full workflow. The agent chains all 7 tools in the correct order, producing highly variable gene plots, UMAP embeddings, cluster assignments, and cell type annotations — all matching human researcher results.
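
One way to picture an MCP prompt is as a fixed step list plus the instruction text rendered from it. The sketch below uses the seven tool names from the table above; `render_prompt` and `check_order` are hypothetical helpers, and the strict prefix-order rule is an assumption made to keep the example small:

```python
PIPELINE_ORDER = [
    "quality_control",
    "normalize_data",
    "select_features",
    "reduce_dims",
    "build_graph",
    "cluster",
    "annotate",
]


def render_prompt(data_path: str) -> str:
    """Render the workflow template for one dataset."""
    steps = " -> ".join(PIPELINE_ORDER)
    return (
        f"Run {steps} on {data_path}. "
        "Inspect the data first and adjust parameters if the defaults "
        "would yield incorrect results."
    )


def check_order(requested: list[str]) -> None:
    """Raise ValueError if the requested steps deviate from the canonical order."""
    for position, step in enumerate(requested):
        expected = PIPELINE_ORDER[position] if position < len(PIPELINE_ORDER) else None
        if step != expected:
            raise ValueError(f"step {position}: got {step!r}, expected {expected!r}")


print(render_prompt("data.h5ad"))
check_order(PIPELINE_ORDER[:3])   # a valid prefix passes silently
try:
    check_order(["cluster", "normalize_data"])
except ValueError as exc:
    print(exc)  # step 0: got 'cluster', expected 'quality_control'
```

Encoding the order as data rather than leaving it to the model's judgment is the point: clustering before normalization is rejected mechanically.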

What problem do MCP Prompts solve that individual MCP Tools cannot?

They encode the correct execution order for multi-step workflows, ensuring tools are chained in the right sequence without requiring the user to manually specify the pipeline
They make individual tools run faster
They provide documentation for each tool

Chapter 8: The ADHD Discovery

This is the payoff chapter — the moment Paper2Agent stops being "a convenient way to run code" and becomes a tool for scientific discovery.

The Setup

Two papers exist independently:

AlphaGenome (method paper): predicts variant effects on gene regulation.
ADHD GWAS (data paper): identifies 39 genomic loci associated with Attention-Deficit/Hyperactivity Disorder through a genome-wide association study.

A human researcher wanting to combine insights from both papers would need to: understand both methods, install both codebases, convert between data formats, design an analysis strategy, execute it, and interpret results. This typically takes weeks.

The AI Co-Scientist

Paper2Agent created MCPs for both papers and connected them to the same Claude Code agent — creating an "AI co-scientist" with access to both a method and a dataset. This agent was then prompted to generate novel hypotheses and execute analyses.

The co-scientist proposed several hypotheses, including:

ADHD risk variants alter regulatory activity in brain-specific cell types.
AlphaGenome can prioritize causal variants within ADHD fine-mapping credible sets.
ADHD-associated variants disrupt transcription factor binding at FOXP family gene loci.

The finding: Among 209 candidate variants in one GWAS locus, the AI co-scientist identified rs1626703 as the most likely causal variant. This intronic variant is predicted to alter MPHOSPH9 splicing and expression specifically in glutamatergic neurons, with AlphaGenome quantile scores of 1.000 for splice junction effects and 0.963 for RNA-seq expression. MPHOSPH9 encodes an M-phase phosphoprotein involved in cell division and ciliogenesis — a plausible mechanism for ADHD risk through disrupted neuronal development.

Scaling to All 39 Loci

The co-scientist did not stop at one locus. It autonomously designed and executed a workflow across all 39 ADHD-associated loci:

Extract credible-set variants from each locus.
Run AlphaGenome functional scoring in glutamatergic neurons.
Filter for protein-coding genes.
Rank by maximum quantile impact scores across modalities.
Compile a comprehensive report for each locus.
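
The filter-and-rank core of this workflow is simple once scores exist. The sketch below assumes the scoring step has already run and works on synthetic records: the variant IDs, gene names, and scores are invented placeholders, not values from the ADHD analysis:

```python
from dataclasses import dataclass


@dataclass
class ScoredVariant:
    variant_id: str
    gene: str
    gene_type: str
    quantile_scores: dict[str, float]   # modality -> quantile score


def prioritize(variants: list[ScoredVariant]) -> list[ScoredVariant]:
    """Keep protein-coding targets, rank by the maximum score across modalities."""
    coding = [v for v in variants if v.gene_type == "protein_coding"]
    return sorted(coding, key=lambda v: max(v.quantile_scores.values()), reverse=True)


# Synthetic credible set for one locus (all values are made up).
locus_variants = [
    ScoredVariant("variant_a", "GENE_A", "protein_coding", {"splicing": 0.91, "rna_seq": 0.40}),
    ScoredVariant("variant_b", "GENE_B", "lncRNA", {"splicing": 0.99, "rna_seq": 0.10}),
    ScoredVariant("variant_c", "GENE_C", "protein_coding", {"splicing": 0.20, "rna_seq": 0.95}),
]

ranked = prioritize(locus_variants)
for variant in ranked:
    best_modality = max(variant.quantile_scores, key=variant.quantile_scores.get)
    print(variant.variant_id, variant.gene, best_modality)
# variant_c GENE_C rna_seq
# variant_a GENE_A splicing
```

Running the same function once per locus gives the per-locus report; `variant_b` has the highest raw score but is filtered out because its target is not protein-coding.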

The entire analysis — across 39 loci — completed in approximately two hours. Manual execution by human experts would have taken weeks. The results are provided as a supplementary table in the paper, with each locus mapped to its prioritized causal variant, target gene, and molecular mechanism.

A new paradigm: This is not AI replacing scientists — it is AI accelerating the most labor-intensive part of scientific collaboration: integrating methods across papers. The human scientist formulated the high-level question ("combine these two papers"). The AI co-scientist designed the analysis, executed it, and surfaced results for human evaluation. The shift is from manual execution to synthesis of actionable insights.

What enabled the ADHD discovery that would not have been possible with either paper alone?

Connecting MCPs from both a method paper (AlphaGenome) and a data paper (ADHD GWAS) to the same agent, enabling cross-paper analysis that combines variant interpretation with disease-associated loci in a single automated workflow
Using a more powerful LLM than previous studies
Having access to more GWAS data

Chapter 9: Connections

What Paper2Agent Builds On

Prior Work	Connection
Model Context Protocol (MCP)	The foundational protocol. Paper2Agent automates what MCP makes possible — turning arbitrary tools into standard, composable interfaces.
Claude Code architecture	Paper2Agent is implemented in Claude Code. The orchestrator uses Claude Code's sub-agent delegation, tool dispatch, and iterative debugging capabilities. See our Dive into Claude Code lesson.
ReAct (Yao et al., 2022)	The paper agents follow the ReAct pattern: reason about the task, call a tool, observe the result, repeat. Paper2Agent constrains this to pre-validated tools, reducing hallucination risk.
AI Scientist (Sakana, 2024)	Both envision AI as scientific collaborators. AI Scientist generates papers; Paper2Agent makes existing papers interactive. Complementary visions.
Paper2Code (Seo et al., 2025)	Generates code from papers for ML reproducibility. Paper2Agent goes further: it generates tested, deployed, interactive agents, not just code.
Biomni (Huang et al., 2025)	A general-purpose biomedical AI agent. Paper2Agent outperforms it on specialized tasks because pre-built tools eliminate runtime code generation.

Limitations and Open Questions

Codebase quality dependency: If the original repo is incomplete, poorly documented, or buggy, Paper2Agent cannot fix it. The framework's success is a practical measure of a paper's reproducibility.
Scope of agentification: A paper is not always the right unit. Some ideas span multiple publications. Paper2Agent supports multi-paper MCPs but the optimal granularity remains an open question.
Evaluation reliance on expert knowledge: Benchmarks require manually curated ground truth. Future work could use LLM-as-judge for more scalable evaluation.
Beyond methods papers: Current focus is computational methods. Data papers, discovery papers, and theoretical papers present different agentification challenges.

The Vision: Agent Availability Sections

Just as journals now require data availability and code availability sections, the authors envision an agent availability section — specifying whether and how a paper's contribution has been embodied as an interactive agent. Well-documented, modular, transparent papers will naturally lend themselves to agentification. Papers that cannot be agentified reveal, by that failure, their reproducibility gaps.

Communities of agents: Once scientific knowledge is encoded in active agents rather than static artifacts, agents could interact with each other — linking methods to datasets, combining insights across domains. Paper2Agent's ADHD case study is a proof of concept for this vision: two paper agents collaborating to produce a discovery neither could make alone.

Metric	Value
Paper	Paper2Agent (Miao et al., 2025)
Core idea	Convert papers to MCP servers → interactive AI agents
Pipeline stages	Analysis → Construction → Testing → Deployment
Sub-agents	4 (Environment, Scanner, Extractor, Tester)
AlphaGenome tools	22 tools, 100% accuracy, ~3 hours
Scanpy tools	7 tools, human-matching results, ~45 min
Key discovery	rs1626703 → MPHOSPH9 splicing → ADHD risk
Speedup vs. Claude+Repo	1.8–3.2x faster
Speedup vs. Biomni	3.1–4.6x faster

What makes Paper2Agent's approach fundamentally different from Paper2Code?

Paper2Agent produces tested, deployed, interactive agents (not just code), with validated tools, standard MCP interfaces, and natural language access, going beyond reproducibility to usability
Paper2Agent uses a better LLM
Paper2Code only works on ML papers