Eliminate human-authored code templates. Manage experiments via agentic tree search. Refine figures with a VLM. Result: the first fully AI-generated paper to survive peer review at an ICLR workshop.
In 2024, Lu et al. released The AI Scientist — the first system to automate the full scientific discovery pipeline end-to-end. Give it a research topic and it would generate hypotheses, write code, run experiments, produce figures, and draft a complete manuscript. Impressive, but deeply limited in practice.
The core limitation: human-authored code templates. For every new research domain, a human researcher had to hand-write a baseline codebase. The AI Scientist v1 could only make incremental modifications to this template — swap a loss function, tweak a hyperparameter, add a regularization term. It couldn't build an experiment from scratch.
This made v1 brittle. Want to study compositional generalization? Someone first writes a training loop, a dataset loader, an evaluation harness. Want to study vision transformers? A completely different template. The "automation" was really just "incremental code editing given a scaffold."
The papers v1 produced reflected these limitations. Internal evaluation found them "below the level of top ML venues." The manuscripts lacked depth, had superficial experiment designs, and the figures were often poorly formatted (since no vision model ever looked at them). No v1 paper was submitted to peer review, because the authors judged none were good enough.
Three specific failure modes defined v1:
There was also a subtler issue: v1's manuscript writing used Aider to make incremental LaTeX edits, section by section. This often produced internally inconsistent papers where the introduction promised one thing and the experiments delivered another. The system had no "big picture" view of the document.
The AI Scientist-v2 addresses all of these. The question is: how?
The AI Scientist-v2 rests on one central idea: treat scientific experimentation as tree search, not linear editing.
In v1, the experiment flow looked like a chain:
Every edit depended on the one before it. If Edit 2 was a mistake, Edit 3 inherited the damage. There was no way to go back.
In v2, the flow is a tree. Each experiment is a node. Promising nodes get expanded with refinements. Buggy nodes get debugging children. The system can explore multiple hypotheses in parallel and select the best branch at each stage. This is managed by a dedicated Experiment Manager agent that coordinates four distinct research stages.
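The difference between a chain and a tree is easiest to see in a data structure. The sketch below is a minimal illustration, not the system's actual code: the `Node` class, its field names, and the numeric `metric` are all assumptions made for the example. The point is that a bad branch stays isolated, and the best result can be picked from anywhere in the tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One experiment attempt. All field names are illustrative."""
    name: str
    metric: Optional[float] = None  # None until the experiment has run
    buggy: bool = False
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add_child(self, name, metric=None, buggy=False):
        child = Node(name, metric, buggy, parent=self)
        self.children.append(child)
        return child


def all_nodes(root):
    """Depth-first walk over the whole tree."""
    yield root
    for child in root.children:
        yield from all_nodes(child)


def best_node(root):
    """Best non-buggy node anywhere in the tree (higher metric is better)."""
    candidates = [n for n in all_nodes(root)
                  if not n.buggy and n.metric is not None]
    return max(candidates, key=lambda n: n.metric)


root = Node("baseline", metric=0.60)
a = root.add_child("edit A", metric=0.58)   # a step backwards
b = root.add_child("edit B", metric=0.71)   # an independent sibling branch
a.add_child("edit A then C", buggy=True)    # damage stays inside branch A
print(best_node(root).name)                 # edit B
```

In a chain, "edit B" would have been applied on top of "edit A" and inherited its regression. Here it branches from the baseline instead.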
Three innovations work together to make this possible:
The payoff: one of the three manuscripts v2 generated and submitted to ICLR's ICBINB workshop received scores of 6, 7, and 6 — an average of 6.33, placing it in roughly the top 45% of submissions. It would have been accepted had it been human-authored. This is the first time a fully AI-generated paper has passed peer review.
Let's lay the differences out side by side. These aren't incremental tweaks — nearly every component was redesigned.
| Feature | AI Scientist v1 | AI Scientist v2 |
|---|---|---|
| Codebase | Topic-specific templates (human-authored) | Generated from scratch (domain-general) |
| Idea generation | Conditioned on existing template code | Open-ended, with Semantic Scholar in the loop |
| Experiment planning | Linear — each edit builds on the last | Tree-based — branch, backtrack, explore in parallel |
| Experiment stages | Single pass | 4 stages: feasibility → tuning → execution → ablation |
| Parallel execution | No | Yes — multiple nodes expanded concurrently |
| Figure review | None (text-only models) | VLM checks every figure for clarity and correctness |
| Manuscript writing | Incremental edits via Aider | Single-pass generation + reflection with o1 |
| Reviewer | Text-only LLM | VLM-augmented (sees figures + text together) |
| Human evaluation | Not submitted to peer review | One paper accepted at ICLR workshop (6.33 avg) |
| Dataset handling | Bundled with templates | Hugging Face Hub auto-download |
The v2 pipeline proceeds through five major phases:
Each complete run (idea → manuscript) costs approximately $20-50 in API calls and takes several hours of wall-clock time. Most of the cost comes from the experiment tree search phase, where many LLM calls generate and evaluate code. Experiment execution itself uses standard academic compute (a single GPU). For the ICBINB submission, multiple seeds were run per idea, so the total cost per submitted paper was higher than a single run, but still orders of magnitude below the cost of the researcher time a comparable paper would require.
This is the heart of v2. The Experiment Manager is a dedicated LLM agent that orchestrates the entire experimental process through four stages, using tree search within each stage to explore the hypothesis space systematically.
Real scientific research follows a natural progression: first check if an idea is feasible, then tune the setup, then run the core experiments, then validate with ablations. The Experiment Manager enforces this structure:
Within each stage, the system builds a tree of experiment nodes. Each node contains:
At each iteration, the system selects nodes to expand. The selection strategy depends on the node type:
New child nodes are created and executed in parallel, dramatically accelerating exploration compared to v1's sequential approach.
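One iteration of selection plus parallel expansion might look roughly like the following. This is a sketch under stated assumptions: the `debug_prob` parameter, the "always refine the current best healthy node" rule, and the `expand` stand-in (which only builds a label instead of calling an LLM and running code) are hypothetical choices for illustration, not the system's documented policy.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    code: str
    metric: Optional[float] = None
    buggy: bool = False


def select_nodes(nodes, k, debug_prob=0.3, rng=random):
    """Pick k (node, action) pairs to expand in this iteration."""
    buggy = [n for n in nodes if n.buggy]
    healthy = sorted(
        (n for n in nodes if not n.buggy and n.metric is not None),
        key=lambda n: n.metric,
        reverse=True,
    )
    picks = []
    for _ in range(k):
        # Sometimes repair a buggy node, otherwise refine the best healthy one.
        if buggy and (not healthy or rng.random() < debug_prob):
            picks.append((rng.choice(buggy), "debug"))
        elif healthy:
            picks.append((healthy[0], "refine"))
    return picks


def expand(pick):
    """Stand-in for 'ask the LLM for new code, then execute it'."""
    parent, action = pick
    return Node(code=f"{action}({parent.code})")


def run_iteration(nodes, k=3, seed=0):
    rng = random.Random(seed)
    picks = select_nodes(nodes, k, rng=rng)
    # Children are independent of each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max(1, k)) as pool:
        return list(pool.map(expand, picks))


tree = [Node("a", 0.5), Node("b", 0.7), Node("c", buggy=True)]
children = run_iteration(tree, k=3)
```

Because each child depends only on its parent, the expansions share no state, and that independence is what makes the parallelism safe.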
Each node, once created, follows a fixed execution cycle:
Beyond basic experiment nodes, the system creates specialized variants:
| Node Type | Stage | Purpose |
|---|---|---|
| Hyperparameter | 2 | Systematically explore alternative hyperparameters. Tracks previously tested configs to avoid redundancy. |
| Ablation | 4 | Remove one component at a time. Tracks previously tested ablation conditions. |
| Replication | 3, 4 | Re-run parent experiment with different random seeds. Enables mean ± std reporting. |
| Aggregation | 4 | Collect results from replication nodes. Produce combined figures with error bars. No new experiments. |
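The replication and aggregation rows amount to a small statistical step: collect one metric per seed, then report mean ± standard deviation. The sketch below shows that computation with the standard library. The function name and the three example values are made up for illustration.

```python
from statistics import mean, stdev


def aggregate(seed_results):
    """seed_results maps seed -> metric; returns (mean, sample std)."""
    values = list(seed_results.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return mean(values), stdev(values)


# Hypothetical accuracies from three replication nodes.
m, s = aggregate({0: 0.70, 1: 0.74, 2: 0.72})
print(f"{m:.3f} ± {s:.3f}")  # 0.720 ± 0.020
```

An aggregation node runs no new experiments. It only consumes numbers that replication nodes already produced.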
Suppose we're in Stage 3 (Research Agenda Execution). The tree currently has three non-buggy nodes and one buggy node. Here's what happens in one iteration:
This cycle repeats until the compute budget for the current stage is exhausted. Then the Experiment Manager selects the single best node (via the LLM evaluator) to seed the next stage.
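The hand-off between stages can be sketched as a loop over budgets. Two simplifications are assumed here: each node is reduced to a single score, and the LLM evaluator that picks the best node is replaced by a plain `max`. The stage names follow the comparison table; `run_stage`, `run_pipeline`, and `toy_propose` are hypothetical names.

```python
STAGES = ["feasibility", "tuning", "execution", "ablation"]


def run_stage(seed_metric, budget, propose):
    """Expand until the budget is spent, then return the best score seen."""
    metrics = [seed_metric]
    for step in range(budget):
        # Each step proposes a child of the current best node.
        metrics.append(propose(max(metrics), step))
    return max(metrics)  # stand-in for the LLM evaluator's choice


def run_pipeline(budgets, propose, start=0.0):
    """Run the four stages in order; each stage is seeded by the last winner."""
    best = start
    history = []
    for stage in STAGES:
        best = run_stage(best, budgets[stage], propose)
        history.append((stage, best))
    return history


def toy_propose(parent_metric, step):
    # Deterministic stand-in: even steps help, odd steps hurt.
    return parent_metric + (0.1 if step % 2 == 0 else -0.05)


history = run_pipeline({stage: 2 for stage in STAGES}, toy_propose)
for stage, score in history:
    print(f"{stage}: {score:.2f}")
```

Only the winning node crosses a stage boundary, so a failed branch in one stage never carries over into the next.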
This is where v2 becomes truly autonomous. Instead of starting from a human-written codebase, v2 starts from nothing but a research idea.
The process begins in the idea generation phase. The system is prompted to brainstorm open-ended research directions for a given topic — something like "negative results in deep learning" (the ICBINB workshop theme). It generates roughly twenty candidate ideas, each described as a title and short hypothesis. Critically, the system has access to Semantic Scholar during this phase, so it can check whether an idea is novel and identify relevant prior work.
Once an idea is selected, the Experiment Manager takes over. Here's how Stage 1 (Preliminary Investigation) works without a template:
The system uses datasets.load_dataset() from Hugging Face whenever possible. This provides a consistent API for downloading hundreds of ML datasets with predefined train/val/test splits. Without this, the LLM would have to write custom data loading code for every experiment, which is a major source of bugs.

The manuscript writing phase also changed fundamentally. v1 used Aider to iteratively edit LaTeX files section by section, which was slow and often produced inconsistent text (early sections were written without knowing what later sections would say). v2 uses single-pass generation: the LLM writes the entire manuscript at once, given the experiment results, figures, and summaries from the best experiment nodes.
After the initial draft, a separate reflection stage uses a reasoning model (o1) to review the manuscript holistically. The reflection stage also receives the target page limit (e.g., 4 pages for the ICBINB workshop) alongside the current PDF length, allowing it to suggest cuts or expansions as needed. This two-phase approach (draft + reflect) is cleaner and more reliable than incremental editing, and the separation of concerns means the drafting model can focus on content while the reflection model focuses on quality.
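The page-limit part of reflection can be illustrated with a small prompt builder. This is a sketch only: the wording of the prompt and both function names are invented for the example, and the real system's prompt is not reproduced here.

```python
def length_guidance(current_pages, page_limit):
    """Return one instruction line for the reflection prompt."""
    if current_pages <= 0 or page_limit <= 0:
        raise ValueError("page counts must be positive")
    if current_pages > page_limit:
        over = current_pages - page_limit
        return (f"Draft is {current_pages} pages; cut about {over} "
                f"to fit the {page_limit}-page limit.")
    if current_pages < page_limit:
        spare = page_limit - current_pages
        return (f"Draft is {current_pages} pages; up to {spare} more "
                f"are available under the {page_limit}-page limit.")
    return f"Draft is exactly at the {page_limit}-page limit; keep the length."


def reflection_prompt(manuscript, current_pages, page_limit=4):
    """Combine the holistic review request, length guidance, and the draft."""
    return (
        "Review this manuscript as a whole for consistency and quality.\n"
        f"{length_guidance(current_pages, page_limit)}\n\n"
        f"{manuscript}"
    )


print(reflection_prompt("<LaTeX source here>", current_pages=6))
```

Passing both numbers lets the reviewing model reason about how much to cut or add, which a bare "stay within the limit" instruction does not.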
What does the LLM actually receive when generating experiment code? The prompt includes:
The system doesn't receive any starter code, model architecture, or training recipe. Everything is generated from the idea description alone. This is what makes v2 domain-general: the same pipeline that studies compositional generalization in sequence models could equally study data augmentation in vision or reward shaping in RL.
At the end of experimentation, the system generates a final plot aggregation script. The actual prompt instructs the LLM to:
This is a small but important detail. The system explicitly separates experiment execution (which saves raw data as .npy arrays) from figure generation (which reads those arrays and produces publication-quality plots). This separation prevents a common failure mode: plots that look right but are generated from hallucinated data rather than actual experiment outputs.
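The separation described above can be sketched in two functions: one writes raw results to .npy files, the other builds plot-ready curves using only what it finds on disk. The file naming scheme and the fake loss curve are assumptions for illustration, and the plotting call itself is left out to keep the example dependent on NumPy alone.

```python
import tempfile
from pathlib import Path

import numpy as np


def run_experiment(out_dir, seed):
    """Stand-in for a training run: writes a loss curve to disk as .npy."""
    rng = np.random.default_rng(seed)
    losses = np.linspace(1.0, 0.2, num=5) + rng.normal(0, 0.01, size=5)
    path = Path(out_dir) / f"loss_seed{seed}.npy"
    np.save(path, losses)
    return path


def load_curves(out_dir):
    """Plotting side: reads only saved arrays, never numbers from a prompt."""
    files = sorted(Path(out_dir).glob("loss_seed*.npy"))
    if not files:
        raise FileNotFoundError(f"no result arrays in {out_dir}")
    curves = np.stack([np.load(f) for f in files])
    return curves.mean(axis=0), curves.std(axis=0)


with tempfile.TemporaryDirectory() as tmp:
    for seed in range(3):
        run_experiment(tmp, seed)
    mean_curve, std_curve = load_curves(tmp)

print(mean_curve.shape, std_curve.shape)  # (5,) (5,)
```

If an experiment never ran, `load_curves` fails loudly instead of letting a plot be drawn from invented values.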
v1 had a blind spot: it never looked at its own figures. The language models generating plots couldn't verify whether the output was clear, correctly labeled, or even legible. This led to manuscripts with garbled axes, missing legends, and misleading visualizations — the kind of issues a human reviewer would catch immediately.
v2 integrates Vision-Language Models (VLMs) at two critical points in the pipeline.
After each experiment node generates its figures, the VLM (GPT-4o) reviews them. It checks for:
If the VLM flags any issues, the node is marked buggy and the feedback is recorded. When a debugging child is spawned from this node, the LLM receives the VLM's critique alongside the code, enabling targeted fixes. This means figure quality improves through the same tree search mechanism that improves experiment quality.
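The feedback path from figure review into debugging can be sketched as follows. The `review` argument stands in for the VLM call and is just a function returning a list of issue strings. The `Node` fields and the prompt wording are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    code: str
    buggy: bool = False
    feedback: List[str] = field(default_factory=list)


def apply_figure_review(node, review):
    """Mark the node buggy and record the critique if any issue is flagged."""
    issues = review(node)
    if issues:
        node.buggy = True
        node.feedback.extend(issues)
    return node


def debug_prompt(node):
    """Prompt for a debugging child: the critique travels with the code."""
    critique = "\n".join(f"- {msg}" for msg in node.feedback)
    return f"Fix the following figure issues:\n{critique}\n\nCode:\n{node.code}"


def fake_vlm(node):
    # Stand-in reviewer that always reports one problem.
    return ["y-axis has no label"]


node = apply_figure_review(Node(code="plt.plot(losses)"), fake_vlm)
print(debug_prompt(node))
```

Storing the critique on the node means the debugging child needs no separate channel to learn what was wrong with the figure.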
After the manuscript is drafted, the system extracts each figure along with its caption and the text that references it (identified by searching for "Figure X" in the manuscript). These figure-caption-context triples are sent to the VLM for a second round of checks:
This two-stage VLM integration is one of the simplest innovations in the paper, but one of the most impactful for manuscript quality. Human reviewers at the ICBINB workshop did flag some figure issues in the submitted papers, though presumably fewer than they would have without VLM filtering.
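The figure-caption-context extraction from the second review stage can be approximated with a regular expression. The sketch below assumes captions are already available as a dictionary keyed by figure number and that the manuscript is plain text; the real pipeline works on the generated manuscript, and its exact matching rules are not shown here.

```python
import re


def figure_contexts(text, captions):
    """Return one {figure, caption, context} record per captioned figure."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    records = []
    for number, caption in sorted(captions.items()):
        # \b keeps "Figure 1" from also matching "Figure 12".
        pattern = re.compile(rf"\bFigure\s+{number}\b")
        context = [s for s in sentences if pattern.search(s)]
        records.append({"figure": number, "caption": caption, "context": context})
    return records


text = "Figure 1 shows the loss. Accuracy drops in Figure 2. See also Figure 12."
captions = {1: "Training loss.", 2: "Test accuracy."}
for record in figure_contexts(text, captions):
    print(record)
```

Each record bundles what the reviewing model needs to check that the figure, its caption, and the claims made about it agree.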
v1 included an automated AI reviewer — an LLM that scored generated manuscripts on a 1-10 scale across several criteria. But it had a fundamental limitation: it was text-only. The reviewer read the LaTeX source but never saw the rendered figures. This meant it couldn't assess whether figures supported the claims in the text.
v2 upgrades the reviewer with VLM capabilities. The enhanced reviewer:
The paper evaluates the enhanced reviewer against human peer review scores from the ICBINB workshop. The VLM-augmented reviewer shows higher correlation with human scores than the text-only reviewer from v1. This is expected: a reviewer that can actually see the figures makes more holistic judgments.
| Paper | Human Avg Score | AI Review (v2, VLM) | Outcome |
|---|---|---|---|
| Compositional Regularization | 6.33 | Highest ranked | Accepted (withdrawn post-review) |
| Paper 2 | Below threshold | Middle ranked | Rejected |
| Paper 3 | Below threshold | Lowest ranked | Rejected |
Even the enhanced reviewer has blind spots. The human reviewers at ICBINB flagged issues the AI reviewer did not:
These are higher-level scientific judgment calls that current AI reviewers struggle with. They require understanding what makes a research contribution convincing to the community, not just whether the text and figures are internally consistent.
Three manuscripts were submitted to the ICLR 2025 "I Can't Believe It's Not Better" (ICBINB) workshop. One was accepted. Let's dissect what happened.
The evaluation was carefully designed with ethical safeguards:
The workshop-accepted paper investigated whether adding a compositional regularization term to the training loss could improve neural network generalization. The idea: penalize large deviations between embeddings of successive time steps in an LSTM-based sequence model, hypothesizing this encourages compositional representations.
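One plausible form of such a penalty is the mean squared distance between consecutive hidden states, added to the task loss with a weight. The sketch below assumes that form; the accepted paper's exact term, its weight, and the names `successive_step_penalty`, `total_loss`, and `lam` are not taken from the source.

```python
def successive_step_penalty(hidden_states):
    """Mean squared L2 distance between consecutive hidden-state vectors."""
    if len(hidden_states) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(hidden_states, hidden_states[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(hidden_states) - 1)


def total_loss(task_loss, hidden_states, lam=0.1):
    """Task loss plus the weighted penalty (lam is a hypothetical weight)."""
    return task_loss + lam * successive_step_penalty(hidden_states)


# Three time steps of a 2-dimensional hidden state.
states = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
print(successive_step_penalty(states))  # 2.5
print(total_loss(1.0, states))          # 1.25
```

The two step-to-step squared distances are 1 and 4, so the penalty is their mean, 2.5, and the total loss is 1.0 + 0.1 × 2.5.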
The experimental setup: train LSTM models on synthetic arithmetic expression datasets (tasks like evaluating "3 + (2 * 4)"), with and without the proposed regularization term. Evaluate whether the regularized model generalizes better to unseen combinations of operations.
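A synthetic dataset of this kind can be generated in a few lines. The generator below is an illustrative guess at the general shape of such data (nested expressions over single digits with +, - and *), not a reconstruction of the paper's dataset. The `depth` parameter and function names are assumptions.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_expression(rng, depth):
    """Return (text, value) for a random expression nested up to `depth`."""
    if depth == 0:
        n = rng.randint(0, 9)
        return str(n), n
    symbol = rng.choice(sorted(OPS))
    left_text, left_val = make_expression(rng, depth - 1)
    right_text, right_val = make_expression(rng, rng.randint(0, depth - 1))
    value = OPS[symbol](left_val, right_val)
    return f"({left_text} {symbol} {right_text})", value


def make_dataset(n, depth, seed=0):
    """Build n (expression, answer) pairs with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return [make_expression(rng, depth) for _ in range(n)]


for text, value in make_dataset(3, depth=2):
    print(text, "=", value)
```

Raising `depth` produces more deeply nested expressions, which is one way to vary expression complexity between training and evaluation sets.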
The result was a negative finding — compositional regularization did not significantly improve generalization performance, and in some cases actually harmed training. Furthermore, increasing the complexity of arithmetic expressions made generalization worse regardless of regularization. The paper concludes that explicitly enforcing compositional structure via regularization alone may not be sufficient.
This aligned well with the ICBINB workshop theme (unexpected failures and negative results). The paper received scores of 6 (weak accept), 7 (accept), and 6 (weak accept), averaging 6.33 — placing it roughly in the top 45% of the 43 total submissions.
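The headline numbers are easy to verify. The snippet below recomputes the average from the three reviewer scores and converts "top 45% of 43 submissions" into an approximate paper count.

```python
scores = [6, 7, 6]
average = sum(scores) / len(scores)
print(round(average, 2))  # 6.33

submissions = 43
top_share = 0.45
print(round(top_share * submissions))  # 19
```

In other words, roughly 19 of the 43 submissions fall within the top 45% band that the paper's score placed it in.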
The system started from this one-sentence hypothesis, generated all code from scratch, ran experiments on synthetic arithmetic datasets, discovered the regularization didn't work, and wrote a paper honestly reporting the negative result. The ICBINB workshop specifically values such negative findings — a lucky alignment between the system's output and the venue's theme.
The other two submissions were both rejected. The authors' internal analysis agrees with this outcome: they judged only one of the three to be workshop-quality. The system was run with multiple random seeds per idea, and the best manuscript from each idea's seed runs was selected for submission, much as a professor might select the best work from multiple students.
Systems that autonomously generate scientific manuscripts raise serious concerns. The authors devote considerable space to this, and it deserves careful attention.
The most obvious concern: AI Scientist-v2 could be used to flood conferences and journals with AI-generated submissions. If the system can produce workshop-level papers at scale, it could overwhelm the peer review system — which already struggles with the volume of human submissions.
More subtle: a system that generates plausible-looking research could be used to produce fake but convincing papers that support a desired conclusion. This is especially dangerous in politically sensitive areas (climate science, drug efficacy) where manufactured evidence could influence policy.
Should AI-generated papers be published in the scientific record? The authors deliberately avoid a strong stance. They argue for "transparent experimentation" — conducting controlled studies like the ICBINB evaluation to understand AI capabilities and limitations, rather than either banning AI-generated work outright or allowing it without guardrails.
Perhaps the most immediate concern: if AI-generated papers become common, peer review itself changes. Reviewers may become suspicious of all submissions, demanding proof of human authorship. This creates overhead for legitimate researchers and erodes trust in the review process. The ICBINB experiment was conducted transparently, but not all future uses will be.
The authors note that both v1 and v2 always include an explicit disclosure that manuscripts are AI-generated. But this is a voluntary safeguard. There's nothing preventing someone from removing that disclosure and submitting the output as their own work.
If autonomous systems can produce workshop-level research at low cost ($20-50 per paper run), what happens to early-career researchers whose workshop publications are crucial for career progression? The authors don't address this directly, but it's a consequence worth considering. Science has historically been a human endeavor where the process of doing research — learning to formulate hypotheses, design experiments, interpret results — is as valuable as the output.
On the other hand, if these systems are used as tools rather than replacements — helping researchers explore more hypotheses, catch figure errors, draft initial manuscripts — they could democratize research by reducing the labor barrier to entry. The dual-use nature is real and unresolved.
The AI Scientist-v2 sits at the intersection of several active research threads. Here's how it connects to the broader landscape.
Stepping back, AI Scientist-v2 exemplifies a pattern appearing across many recent systems: LLM + tree search + tool use + multi-modal feedback. The LLM provides the reasoning and code generation. Tree search provides structured exploration. Tool use (Python interpreter, Hugging Face, Semantic Scholar) grounds the system in reality. And VLM feedback closes the loop by evaluating outputs the text-only LLM can't assess. This same pattern shows up in AlphaEvolve (LLM + evolutionary search + code execution) and in concurrent work on agentic coding (LLM + tree search + test execution).