The first framework that takes LLMs from idea generation all the way through experiment execution, paper writing, and peer review — producing full scientific papers for less than $15 each.
Imagine you are a machine learning researcher. You have an idea — maybe a new way to schedule noise in a diffusion model, or a trick to make transformers generalize faster. What happens next?
You spend a day writing the experiment code. You debug it for another day. You run experiments overnight, look at the results in the morning, tweak things, and re-run. After a week, you have enough results to start writing the paper. The writing takes another week. Then you submit, wait months for reviews, revise, and resubmit. The total time from idea to published paper? Months to years.
Now imagine you could hand an LLM a starting codebase and say: "Here is a diffusion model that trains on 2D datasets. Find something interesting to do with it." And the LLM would brainstorm ideas, implement the best one, run the experiments, generate plots, write the paper, and even review it — all autonomously, all for less than $15.
That is exactly what The AI Scientist does. It is the first framework for fully automated, open-ended scientific discovery in machine learning.
Here is the central realization behind The AI Scientist: frontier LLMs are now good enough at every individual step of the research process — brainstorming, coding, writing, reviewing — that you can chain them together into a single autonomous pipeline.
No single capability is new. LLMs have been used to brainstorm ideas (ResearchAgent), write code (Aider on SWE-Bench), generate text (GPT-4 for manuscripts), and evaluate papers (LLM-as-a-judge). The insight is that these capabilities have crossed a quality threshold where the full loop produces useful output without human intervention.
The framework has three core phases, each built on LLM agent techniques: idea generation, experiment execution, and paper writing.
After the paper is written, an Automated Reviewer evaluates it using NeurIPS-style guidelines. The review score and feedback can be fed back into the archive, making the process open-ended — each generation of ideas builds on the last.
The AI Scientist takes a starting code template — a small, self-contained ML experiment — and runs a seven-stage pipeline that produces a complete scientific paper.
Three templates were tested in the paper:
| Template | Domain | What It Does |
|---|---|---|
| 2D Diffusion | Generative models | Trains DDPM on four 2D distributions (circles, moons, dino, etc.) |
| NanoGPT | Language modeling | Trains a small transformer on character-level Shakespeare + enwik8 |
| Grokking | Learning dynamics | Trains a transformer on modular arithmetic to study delayed generalization |
The first step is the hardest to believe: an LLM brainstorming novel research ideas. Not just random word salad — ideas with a clear description, a concrete experiment plan, and self-assessed scores.
The AI Scientist uses an approach inspired by evolutionary computation. It maintains a growing archive of ideas. At each iteration, the LLM is prompted to propose a new idea that is different from everything already in the archive. This is the LLM acting as a mutation operator on the space of research directions.
Each idea includes:
The LLM uses chain-of-thought reasoning and self-reflection to refine each idea over multiple rounds. After generation, a novelty filter kicks in: the LLM queries the Semantic Scholar API to check if the idea has already been published. If a close match is found, the idea is discarded.
Self-assessment is biased. LLMs consistently overestimate how novel their ideas are. The Semantic Scholar check helps, but it is not perfect: it can miss unpublished concurrent work or ideas expressed in different terminology. Still, across runs, roughly 70–97% of ideas pass the novelty check, depending on the model (108 to 149 of 154 ideas in the results table).
An interesting pattern: the archive creates implicit diversity pressure. Because each new idea must differ from the existing archive, the LLM is pushed to explore different directions rather than getting stuck in a rut. This mirrors how Quality-Diversity algorithms work in evolutionary computation.
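The archive loop and novelty filter described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the framework's actual code: `propose_idea` and `is_novel` are hypothetical callables standing in for the LLM prompt and the Semantic Scholar lookup, and the `Idea` fields are guesses based on the description in this section.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Idea:
    # Field names are illustrative assumptions, not the framework's schema.
    title: str
    experiment_plan: str
    self_scores: Dict[str, int] = field(default_factory=dict)


def generate_ideas(
    seed_ideas: List[Idea],
    propose_idea: Callable[[List[Idea]], Idea],  # hypothetical LLM wrapper
    is_novel: Callable[[Idea], bool],            # hypothetical literature check
    num_iterations: int,
) -> List[Idea]:
    """Grow an archive idea by idea, then apply the novelty filter."""
    archive = list(seed_ideas)
    for _ in range(num_iterations):
        # The proposer sees the whole archive, so it can be asked to differ from it.
        candidate = propose_idea(archive)
        # Crude duplicate guard (an assumption): skip exact title repeats.
        if any(candidate.title == old.title for old in archive):
            continue
        archive.append(candidate)
    generated = archive[len(seed_ideas):]
    # The novelty filter runs after generation and drops close matches.
    return [idea for idea in generated if is_novel(idea)]


# Toy stand-ins so the sketch runs without any LLM or network access.
def fake_proposer(archive: List[Idea]) -> Idea:
    return Idea(f"idea-{len(archive)}", "plan", {"novelty": 7})


def fake_novelty_check(idea: Idea) -> bool:
    return idea.title != "idea-2"  # pretend idea-2 was already published


kept = generate_ideas([Idea("seed", "plan")], fake_proposer, fake_novelty_check, 3)
print([idea.title for idea in kept])
```

Passing the proposer and the novelty check in as functions keeps the control flow separate from any particular LLM or search backend, which is the part this section actually describes.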
This is where the magic happens. The AI Scientist has an idea and a code template. Now it needs to actually do the research: implement the idea in code, run experiments, look at results, iterate, and create visualizations.
Implementation is handled by Aider, a state-of-the-art open-source coding assistant. Aider takes the idea's experiment plan and the template codebase, then makes the necessary code changes. If the code fails to run, Aider gets the error traceback and retries — up to 4 attempts.
After each experiment completes, The AI Scientist follows a structured loop:
This loop repeats up to 5 times. At the end, Aider edits the plotting script to create figures.
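The retry-and-iterate control flow described above can be sketched as below. The limits of 4 attempts per run and 5 experiment rounds come from the text; everything else is a hypothetical stand-in. In particular, `edit_code` represents a request to the coding assistant and `run_experiment` represents launching the script, and neither reflects Aider's real interface. Abandoning the idea after the final failed attempt is also an assumption.

```python
from typing import Callable, List, Optional, Tuple, Union

MAX_FIX_ATTEMPTS = 4     # attempts per experiment, per the text
MAX_EXPERIMENT_RUNS = 5  # experiment rounds, per the text

RunResult = Tuple[bool, Union[dict, str]]  # (succeeded, results or traceback)


def run_with_retries(
    edit_code: Callable[[str], None],      # hypothetical coding-assistant call
    run_experiment: Callable[[], RunResult],  # hypothetical script launcher
    instruction: str,
) -> Optional[dict]:
    """Request an edit, run it, and feed the traceback back on failure."""
    prompt = instruction
    for _ in range(MAX_FIX_ATTEMPTS):
        edit_code(prompt)
        succeeded, payload = run_experiment()
        if succeeded:
            return payload  # the results of the run
        # On failure the payload is the error text, which becomes the next prompt.
        prompt = f"The run failed with:\n{payload}\nPlease fix the code."
    return None


def experiment_loop(
    edit_code: Callable[[str], None],
    run_experiment: Callable[[], RunResult],
    plan: List[str],
) -> List[dict]:
    """Work through the plan, capped at MAX_EXPERIMENT_RUNS rounds."""
    results = []
    for step in plan[:MAX_EXPERIMENT_RUNS]:
        outcome = run_with_retries(edit_code, run_experiment, step)
        if outcome is None:
            break  # assumption: abandon the idea once retries are exhausted
        results.append(outcome)
    return results
```

A caller would supply its own `edit_code` and `run_experiment`; with a launcher that fails twice and then succeeds, `experiment_loop` issues three edit requests and returns a single result.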
Here is a simplified version of the kind of code Aider generates. For the dual-scale denoising idea, it creates two parallel branches and a weight network:
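```python
import torch.nn as nn


class MLPDenoiser(nn.Module):
    def __init__(self, ...):
        super().__init__()
        # Two parallel denoiser branches
        self.global_network = nn.Sequential(...)
        self.local_network = nn.Sequential(...)
        # Upscale input for local branch
        self.upscale = nn.Linear(2, 4)
        # Timestep-conditioned weight
        self.weight_network = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, 2),
            nn.Softmax(dim=-1)  # weights sum to 1
        )

    def forward(self, x, t):
        # emb, local_emb, and t_emb stand for the global, local (upscaled),
        # and timestep embeddings derived from x and t; their construction
        # is omitted in this simplified version.
        g_out = self.global_network(emb)
        l_out = self.local_network(local_emb)
        w = self.weight_network(t_emb)
        # Slice with 0:1 and 1:2 to keep a trailing dimension, so the
        # per-sample weights broadcast across the output features.
        return w[:, 0:1] * g_out + w[:, 1:2] * l_out, w
```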
The experiments are done. The plots are ready. Now The AI Scientist needs to write the actual paper — a full scientific manuscript in LaTeX, following the format of a standard ML conference submission.
Writing happens in a carefully structured sequence:
A paper is only as good as its evaluation. The AI Scientist does not just write papers — it also reviews them, using a GPT-4o-based agent that follows NeurIPS conference review guidelines.
The reviewer agent takes the raw text of the PDF (parsed via PyMuPDF) and produces:
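The key question: how good is this automated reviewer compared to actual human reviewers? The authors evaluated it on 500 ICLR 2022 papers with known accept/reject decisions and compared the results with a human baseline taken from the NeurIPS 2021 review consistency experiment.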
| Metric | Human (NeurIPS) | AI Reviewer (GPT-4o) |
|---|---|---|
| Balanced Accuracy | 66% | 65% |
| F1 Score | 0.49 | 0.57 |
| AUC | 0.65 | 0.65 |
| False Negative Rate | 0.52 | 0.39 |
| False Positive Rate | 0.17 | 0.31 |
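
As a quick consistency check on the table, balanced accuracy is the mean of the true positive rate and the true negative rate, so it can be recomputed from the two error rates. The helper below is an illustrative sketch; the function name is an assumption, and the inputs are the rounded values from the table.

```python
def balanced_accuracy(false_negative_rate: float, false_positive_rate: float) -> float:
    """Mean of the true positive rate (1 - FNR) and true negative rate (1 - FPR)."""
    true_positive_rate = 1.0 - false_negative_rate
    true_negative_rate = 1.0 - false_positive_rate
    return (true_positive_rate + true_negative_rate) / 2.0


# Error rates copied from the table above.
human = balanced_accuracy(false_negative_rate=0.52, false_positive_rate=0.17)
ai_reviewer = balanced_accuracy(false_negative_rate=0.39, false_positive_rate=0.31)
print(f"human: {human:.3f}, AI reviewer: {ai_reviewer:.3f}")
```

The results come out at about 0.655 and 0.65, which agree with the 66% and 65% in the table after rounding. The two reviewers reach a similar balanced accuracy from different error profiles: the AI reviewer rejects fewer good papers but accepts more weak ones.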
The best configuration combines several techniques:
Self-reflection alone improves accuracy by 2%. One-shot prompting adds another 2%. Ensembling reduces variance but does not significantly improve mean accuracy.
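The point about ensembling (lower variance, similar mean) can be illustrated with a small simulation. This is a toy sketch under stated assumptions: the reviewer is simulated as Gaussian noise around a fixed "true" score, the ensemble size of 5 is chosen arbitrarily for the demo, and none of this is the reviewer agent's actual code.

```python
import random
import statistics


def noisy_review(true_score: float, rng: random.Random, noise: float = 1.0) -> float:
    """One simulated review: the true score plus Gaussian noise (an assumption)."""
    return true_score + rng.gauss(0.0, noise)


def ensembled_review(
    true_score: float, rng: random.Random, num_reviews: int = 5, noise: float = 1.0
) -> float:
    """Average several independent simulated reviews into one score."""
    scores = [noisy_review(true_score, rng, noise) for _ in range(num_reviews)]
    return statistics.mean(scores)


rng = random.Random(0)  # fixed seed so the demo is repeatable
single = [noisy_review(4.0, rng) for _ in range(2000)]
ensemble = [ensembled_review(4.0, rng) for _ in range(2000)]

print("single:   mean", statistics.mean(single), "stdev", statistics.stdev(single))
print("ensemble: mean", statistics.mean(ensemble), "stdev", statistics.stdev(ensemble))
```

Averaging independent noisy scores leaves the expected score unchanged while shrinking its spread, which matches the pattern reported for the ensembled reviewer. Real LLM reviews are unlikely to be independent or Gaussian, so the size of the effect here should not be read as a prediction.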
The authors ran The AI Scientist across all three templates with four different LLMs: Claude Sonnet 3.5, GPT-4o, DeepSeek Coder, and Llama-3.1 405B. Each run generated about 50 ideas and ran on 8x NVIDIA H100s over roughly 12 hours.
| Model | Novel Ideas | Completed Papers | Mean Score | Max Score | Cost |
|---|---|---|---|---|---|
| Sonnet 3.5 | 149/154 | 83 | 3.77 | 6.0 | ~$750 |
| GPT-4o | 132/154 | 45 | 3.29 | 5.0 | ~$900 |
| DeepSeek | 125/154 | 90 | 3.22 | 5.0 | ~$30 |
| Llama 405B | 108/154 | 72 | 2.20 | 3.0 | ~$360 |
Scores follow NeurIPS guidelines: 6 = Weak Accept (the average accepted paper).
Among the highest-scored papers:
The cost is dominated by LLM API calls for coding and writing. Experiment compute is negligible because the templates are small-scale (tiny transformers, 2D datasets). Review costs are about $0.25–$0.50 per paper.
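Dividing each model's total cost by its number of completed papers gives a rough per-paper figure from the results table. This is back-of-the-envelope arithmetic on the approximate costs listed there, not a number reported by the authors, and it charges the cost of abandoned ideas to the papers that were completed. The function name is an assumption for illustration.

```python
# (approximate total cost in USD, completed papers), copied from the results table
runs = {
    "Sonnet 3.5": (750, 83),
    "GPT-4o": (900, 45),
    "DeepSeek": (30, 90),
    "Llama 405B": (360, 72),
}


def cost_per_paper(total_cost_usd: float, completed_papers: int) -> float:
    """Total spend divided by the number of papers that were finished."""
    return total_cost_usd / completed_papers


for model, (cost, papers) in runs.items():
    print(f"{model}: ${cost_per_paper(cost, papers):.2f} per completed paper")
```

With the table's figures this works out to roughly $9 for Sonnet 3.5, $20 for GPT-4o, $0.33 for DeepSeek, and $5 for Llama 405B per completed paper.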
The AI Scientist is impressive, but it is far from a replacement for human researchers. The authors are refreshingly honest about its shortcomings.
The lack of sandboxing produced genuinely alarming behaviors:
The ability to generate papers at $15 each could overwhelm the peer review system with low-quality submissions. If the automated reviewer were widely adopted by lazy reviewers, review quality would drop further. The authors argue that AI-generated papers and reviews must be clearly labeled.
The AI Scientist sits at the intersection of several active research threads. Here is where it fits in the broader landscape.
| System | What It Does | How The AI Scientist Differs |
|---|---|---|
| AutoML / NAS | Search over architectures and hyperparameters | Operates in code space, not a predefined search space. Produces papers, not just configs. |
| FunSearch | Discovers mathematical functions via evolutionary code search | Domain-restricted. No paper writing. No self-review. |
| ResearchAgent | LLM brainstorms ideas from literature | Ideation only — no execution, no write-up. |
| LLM-as-reviewer | LLM evaluates papers | Review only — no generation. The AI Scientist closes the full loop. |
The AI Scientist (August 2024) opened the floodgates for automated research systems: