
The AI Scientist in Nature: What Institutions Must Decide Now



A Peer-Reviewed Milestone That Changes the Conversation

In March 2026, a paper co-authored by Sakana AI, the University of British Columbia, the Vector Institute, and the University of Oxford landed in Nature — volume 651, pages 914–919. The paper, “Towards End-to-End Automation of AI Research”, is the first peer-reviewed account of a system that completes the full scientific cycle: generating research ideas, reviewing literature, designing and running experiments, and writing the full paper in LaTeX — without human intervention during execution.

The system is called The AI Scientist. You may have seen the headline when Sakana AI first announced that one of its AI-generated papers passed peer review at an ICLR 2025 workshop. What’s different now is the Nature publication itself: this is the scientific establishment formally documenting that end-to-end automated research is no longer theoretical. Institutions that haven’t yet formed a position on this have run out of time to defer the question.

What the System Actually Does — and What It Doesn’t

The AI Scientist-v2 runs a parallelized agentic tree search across the entire research pipeline. Given a broad research direction, it generates novel hypotheses, searches and reads the relevant literature, writes and executes code for computational experiments, analyzes the results, and produces a complete manuscript. The writing step costs roughly $5 in API calls; experiments add another $15–20, for a total of approximately $20–25 per paper.
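To make the architecture concrete, here is a minimal sketch of what a parallelized agentic tree search over experiment variants could look like. Every name in it (ResearchNode, propose_children, score) is a hypothetical stand-in, not Sakana AI's actual implementation; in the real system, proposal and scoring are delegated to foundation models.

```python
# Minimal sketch of a parallelized agentic tree search over experiment
# variants. All names here are hypothetical stand-ins for illustration,
# not Sakana AI's actual implementation.
import heapq
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass(order=True)
class ResearchNode:
    neg_score: float                      # heapq is a min-heap, so store -score
    idea: str = field(compare=False)
    depth: int = field(compare=False, default=0)

def propose_children(node: ResearchNode, branching: int = 3) -> list[str]:
    # Stand-in for an LLM call that mutates the current experiment idea.
    return [f"{node.idea} / variant-{i}" for i in range(branching)]

def score(idea: str) -> float:
    # Stand-in for actually running the experiment and grading the result.
    return random.random()

def tree_search(root_idea: str, budget: int = 20, max_depth: int = 3) -> ResearchNode:
    frontier = [ResearchNode(-score(root_idea), root_idea)]
    best = frontier[0]
    with ThreadPoolExecutor(max_workers=4) as pool:
        while frontier and budget > 0:
            node = heapq.heappop(frontier)        # expand most promising node first
            if -node.neg_score > -best.neg_score:
                best = node
            if node.depth >= max_depth:
                continue
            children = propose_children(node)
            # Run the candidate experiments in parallel.
            for idea, s in zip(children, pool.map(score, children)):
                heapq.heappush(frontier, ResearchNode(-s, idea, node.depth + 1))
                budget -= 1
    return best

if __name__ == "__main__":
    winner = tree_search("compositional regularization study")
    print(f"best idea: {winner.idea} (score {-winner.neg_score:.2f})")
```

Best-first expansion is the point: the budget concentrates on the most promising branches, which is what lets a system explore many hypotheses for a few dollars of API spend.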

One of the three papers it submitted to the ICLR 2025 ICBINB workshop was accepted through standard human peer review, with no human edits to the manuscript. The paper investigated compositional regularization in neural network training and averaged a reviewer score of 6.33, placing it in roughly the top 45% of that workshop's submissions. Sakana AI withdrew it before publication, citing the need to avoid setting a precedent before the scientific community has established clear standards.

That caveat matters. The two other papers submitted to the same workshop scored between 3 and 4 out of 10 and were rejected. Even the accepted paper contained errors: misattributed LSTM concepts and definitional inaccuracies that human reviewers apparently didn’t catch. The system is also currently limited to computational machine learning experiments — it cannot run a wet lab, manage physical equipment, or handle disciplines that require empirical data collection outside a Python script.

The authors are candid about the ceiling: the accepted paper is not at the level of the best human-authored work at comparable venues. But they also include a pointed observation: “Once a new capability starts to work, even with clear limitations, it becomes superhuman surprisingly soon” — through scale and improved foundation models. That’s not a boast; it’s a pattern worth taking seriously.

Why Peer Review Is the Pressure Point

The structural risk isn’t that AI will write bad papers. It’s that it will write many papers, fast, and that existing review infrastructure isn’t designed to absorb that volume. Jevin West, a computational social scientist at the University of Washington, put it directly: what happens when conferences already drowning in submissions get hit with a firehose of $15 papers?

A related concern, flagged in a Nature editorial from March 25, is p-hacking at scale. Systems that iterate analyses algorithmically until something comes out statistically significant could flood review pipelines with noise that looks like signal: automated, large-scale p-hacking that is hard to distinguish from genuine discovery without the deep methodological scrutiny most reviewers have no bandwidth for.
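The dynamic is easy to demonstrate. The toy simulation below (purely illustrative; no real system or data is involved) tests a thousand hypotheses against pure noise. By construction, none of the effects are real, yet roughly five percent clear the conventional significance threshold, and an automated pipeline that generates hypotheses at scale harvests those false positives by default unless it corrects for multiple comparisons.

```python
# Toy illustration of automated p-hacking: test many random "hypotheses"
# against pure noise and count how many clear p < 0.05. Numbers are
# illustrative only; no real data or system is involved.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_hypotheses, n_samples = 1000, 50

false_positives = 0
for _ in range(n_hypotheses):
    a = rng.normal(size=n_samples)   # two groups drawn from the SAME
    b = rng.normal(size=n_samples)   # distribution: no true effect exists
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

# With no true effects at all, ~5% of tests still come out "significant".
print(f"{false_positives}/{n_hypotheses} spurious discoveries")
```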

Attribution is a second structural problem. It’s hard to trace a model’s inspirations. An AI Scientist that reads thousands of papers to generate its hypotheses may be doing something functionally similar to plagiarism — not copying text, but building on ideas without any mechanism for credit or citation. This matters for incentive structures: careers are built on who originated an idea, not just who formalized it.

Earlier this year, vortx.ch reported that ICML desk-rejected 497 papers after detecting policy-violating AI use by 398 reviewers, evidence that the governance gap runs in both directions. It's not just AI writing papers; it's AI reviewing them too, often without disclosure.

What Publishers and Funders Have Said So Far

Nature already requires transparency on LLM use in submitted articles and will not list AI systems as authors. For reproducibility, when a model contributes to the creative aspects of a study, Nature encourages submission of prompt and response transcripts alongside the final outputs — treating model outputs as data. That’s a reasonable first step, but it places the burden of disclosure entirely on researchers, and relies on self-reporting.
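As a sense of what "treating model outputs as data" could look like in practice, the sketch below logs every prompt/response pair to an append-only JSONL file with a content hash. The schema and the tamper-evidence hash are assumptions for illustration; Nature does not mandate any particular format.

```python
# Minimal sketch of recording prompt/response transcripts as research
# data, in the spirit of Nature's disclosure guidance. The log schema
# is an assumption for illustration, not a journal-mandated format.
import hashlib
import json
from datetime import datetime, timezone

def log_interaction(prompt: str, response: str, model: str,
                    path: str = "transcripts.jsonl") -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "response": response,
        # Hash lets reviewers verify the transcript wasn't edited later.
        "sha256": hashlib.sha256((prompt + response).encode()).hexdigest(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Usage: wrap every model call that contributes to the study.
log_interaction("Summarize prior work on X.", "Prior work shows ...",
                model="example-llm-v1")
```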

Funders haven’t moved as quickly. There’s no established standard across major funding bodies — NSF, NIH, Wellcome, the European Research Council — on how AI-generated research proposals or outputs should be disclosed or evaluated. Some institutions are requiring AI use statements in grant applications, but enforcement and consistency vary widely.

The absence of coordinated standards is itself a risk. Researchers at institutions with clear AI disclosure policies operate at a disadvantage relative to those at institutions without them — at least in the short term. A single global standard from an organization like COPE (Committee on Publication Ethics) or an ICMJE update would help level the field.

What Engineering and Research Teams Should Watch

For engineers and technical leads building internal research infrastructure, the AI Scientist is a preview of where automated experimentation pipelines are heading. The underlying architecture — agentic tree search, parallelized hypothesis testing, automated manuscript generation — is not proprietary magic. Similar systems will appear in industry R&D contexts, not just academic ones. If your team runs systematic benchmarks, ablations, or model evaluations, the tooling to partially automate that pipeline exists today and is getting cheaper by the month.
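As a concrete starting point, a partially automated ablation sweep needs little more than a config grid and a results log. In the sketch below, run_trial is a stand-in for your actual training or evaluation entry point, and the grid values are placeholders.

```python
# Hedged sketch of a partially automated ablation sweep: enumerate a
# config grid, run each trial, and persist results for later analysis.
# run_trial() is a stand-in for a real training/eval entry point.
import csv
import itertools

grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "dropout": [0.0, 0.1, 0.3],
    "regularizer": ["none", "compositional"],
}

def run_trial(cfg: dict) -> float:
    # Replace with a real experiment; here a deterministic placeholder.
    return cfg["learning_rate"] * (1 - cfg["dropout"])

keys = list(grid)
with open("ablation_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(keys + ["metric"])
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(keys, values))
        writer.writerow(list(values) + [run_trial(cfg)])
```

From there, the step toward agentic automation is small: let a model read ablation_results.csv and propose the next grid.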

Vortx.ch covered the broader automation trend back in April: AI is automating the full research pipeline, and the question is no longer whether this happens but which parts of the pipeline benefit most from automation versus which require human judgment to remain credible.

The honest answer, based on what the AI Scientist can and can’t do today: hypothesis generation and experiment execution are automatable for well-scoped computational problems. Novelty assessment, methodological rigor under adversarial review, and cross-disciplinary synthesis still require humans — for now. The v2 paper is proof that the gap is closing, not that it’s closed.

What’s Next

The Sakana AI team has been transparent that they intend to keep developing the system and that future versions will target broader scientific domains beyond machine learning. The scaling law they observed — paper quality improves proportionally with foundation model capability — implies that each new generation of frontier models will push this system closer to producing work that clears a higher review bar.

For institutions, the window to set proactive policy is narrowing. The question isn’t whether AI-generated research will appear in journals — it already has. The question is whether the scientific community will have built the disclosure, attribution, and review standards needed to maintain research integrity before the volume makes reactive policy unenforceable.


