From Messy Logs to LaTeX in 39 Minutes
PaperOrchestra, published on April 6, 2026 by researchers at Google Cloud AI Research (Yiwen Song, Yale Song, Tomas Pfister, and Jinsung Yoon), automates the writing half of academic research. Feed it an idea summary and raw experimental logs, and it produces a submission-ready LaTeX manuscript, complete with figures, a literature review, and API-verified citations. The full pipeline takes a mean of 39.6 minutes and roughly 60–70 LLM API calls.
This isn’t a general chatbot with a “write my paper” button. PaperOrchestra orchestrates five specialized agents with distinct roles, each passing structured output to the next. The design choice matters: specialization is the reason it outperforms single-agent baselines by 52–88% on overall manuscript quality in head-to-head evaluations.
The paper also introduces PaperWritingBench — the first standardized benchmark for AI paper writing — built from 200 accepted papers at CVPR 2025 and ICLR 2025, with reverse-engineered raw materials used as input. Every comparison in this article references that benchmark.
The Five-Agent Pipeline
Each agent in PaperOrchestra handles one distinct stage of the writing process. They run mostly in sequence, with Plotting and Literature Review running in parallel after Outline completes.
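In code terms, the control flow looks roughly like the sketch below. This is a minimal reconstruction in Python, not the authors' implementation; every agent function here is a placeholder stub, and only the sequencing (outline first, plotting and literature review in parallel, then writing and refinement) reflects the article.

```python
import asyncio

# Placeholder stubs for the five agents; the real ones are LLM-backed and not public.
async def outline_agent(idea, logs):
    return {"sections": ["Introduction", "Method"], "viz_plan": [], "queries": []}

async def plotting_agent(outline, logs):
    return ["fig1.pdf"]

async def literature_review_agent(outline):
    return [{"title": "A related paper", "verified": True}]

async def section_writing_agent(outline, figures, citations):
    return "\\section{Introduction} ..."

async def refinement_agent(draft):
    return draft

async def run_pipeline(idea, logs):
    outline = await outline_agent(idea, logs)            # stage 1: shared JSON scaffold
    figures, citations = await asyncio.gather(           # plotting and lit review run in parallel
        plotting_agent(outline, logs),
        literature_review_agent(outline),
    )
    draft = await section_writing_agent(outline, figures, citations)
    return await refinement_agent(draft)                 # final review pass

if __name__ == "__main__":
    print(asyncio.run(run_pipeline("idea summary", "raw experimental logs")))
```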
Outline Agent
Reads the idea summary, experimental logs, conference template, and submission guidelines, then produces a structured JSON outline containing a visualization plan, literature search strategy, and section-level writing plan with citation hints. This JSON becomes the shared scaffold that every downstream agent builds from — a concrete design choice that prevents each agent from drifting off in its own direction.
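To make that scaffold concrete, here is one plausible shape for the outline JSON. The field names are illustrative guesses, not the paper's actual schema; only the three ingredients (visualization plan, literature search strategy, section-level writing plan with citation hints) come from the article.

```python
import json

# Hypothetical outline shape; field names are illustrative, not the paper's schema.
outline = {
    "visualization_plan": [
        {"figure_id": "fig_main_results", "source_log": "eval_metrics.csv",
         "plot_type": "grouped_bar", "caption_hint": "Benchmark comparison"},
    ],
    "literature_search": {
        "queries": ["multi-agent LLM pipelines", "automated scientific writing"],
        "max_citations": 50,
    },
    "sections": [
        {"name": "Introduction",
         "writing_plan": "Motivate automating the writeup step",
         "citation_hints": ["AI Scientist-v2"]},
        {"name": "Method",
         "writing_plan": "Describe the five-agent pipeline",
         "citation_hints": []},
    ],
}

# Every downstream agent reads this same structure rather than free-form notes.
print(json.dumps(outline, indent=2))
```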
Plotting Agent
Generates two kinds of figures: canonical scientific plots synthesized from the empirical logs, and conceptual diagrams. It uses iterative VLM-guided refinement: a vision-language model reviews each figure and suggests corrections until the output meets quality criteria. Most competing systems either skip figures entirely or produce static placeholders; PaperOrchestra's iterative approach is one of its more technically distinctive features.
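A minimal sketch of that review-and-revise loop, with render_figure and vlm_critique standing in for real plotting code and a vision-language-model call (neither is specified in the article), and the round cap assumed rather than documented:

```python
MAX_ROUNDS = 3  # assumed cap; the article doesn't say how many passes are allowed

def render_figure(spec):
    """Stand-in for real plotting code; returns the rendered file path."""
    return f"{spec['figure_id']}.pdf"

def vlm_critique(image_path):
    """Stand-in for the vision-language-model review of a rendered figure."""
    return {"acceptable": True, "suggested_fixes": []}

def refine_figure(spec):
    path = render_figure(spec)
    for _ in range(MAX_ROUNDS):
        review = vlm_critique(path)
        if review["acceptable"]:
            break
        spec["fixes"] = review["suggested_fixes"]  # fold the critique back into the spec
        path = render_figure(spec)                 # re-render and review again
    return path

print(refine_figure({"figure_id": "fig_main_results"}))
```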
Literature Review Agent
Uses a hybrid search-and-verification approach: it discovers citations through external search APIs, then validates that each one exists and actually supports the claim it is attached to. The result averages 45–48 citations per paper (close to the human baseline of roughly 59) and a citation F1 of 29.7%, versus 17.2% for AI Scientist-v2 on the CVPR subset. The API-verification step is what separates PaperOrchestra from systems that hallucinate non-existent papers.
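The verify-before-cite step might look something like the sketch below. search_api and claim_supported are placeholders for an external paper-search service and an LLM judgment; the article does not name the exact APIs or the acceptance criteria.

```python
def search_api(query):
    """Stand-in for an external paper-search API (e.g. a scholarly search service)."""
    return [{"title": "A relevant paper", "doi": "10.0000/example",
             "abstract": "Results on multi-agent pipelines..."}]

def claim_supported(abstract, claim):
    """Stand-in for an LLM check that the paper actually backs the claim."""
    return claim.split()[0].lower() in abstract.lower()

def gather_citations(claims):
    verified = []
    for claim in claims:
        for paper in search_api(claim):
            # keep a paper only if it resolves to a real record and supports the claim
            if paper.get("doi") and claim_supported(paper["abstract"], claim):
                verified.append({"claim": claim, **paper})
    return verified

print(gather_citations(["multi-agent pipelines outperform single agents"]))
```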
Section Writing Agent
Takes the outline JSON, figures, and verified citations as inputs, then drafts each section of the paper. The agent writes section by section, maintaining coherence with the structured outline rather than generating the full manuscript in one pass. This reduces the compounding drift that plagues single-pass generation at paper length.
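A rough sketch of that section-by-section loop, with llm_write as a stand-in for the actual model call and the context-passing detail assumed from the article's description rather than quoted from the paper:

```python
def llm_write(section_plan, context):
    """Stand-in for the model call that drafts one section."""
    return f"\\section{{{section_plan['name']}}}\n% plan: {section_plan['writing_plan']}\n\n"

def draft_manuscript(outline, figures, citations):
    manuscript = ""
    for section_plan in outline["sections"]:
        # each section is drafted against the outline plus everything written so far,
        # instead of generating the whole paper in a single pass
        manuscript += llm_write(section_plan, context=manuscript)
    return manuscript

outline = {"sections": [
    {"name": "Introduction", "writing_plan": "motivate the problem"},
    {"name": "Method", "writing_plan": "describe the pipeline"},
]}
print(draft_manuscript(outline, figures=["fig1.pdf"], citations=[]))
```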
Content Refinement Agent
Acts as an automated reviewer — reading the draft and suggesting revisions for clarity, logical consistency, and alignment with conference formatting requirements. The authors are clear this doesn’t replace actual peer review, but it catches the structural and consistency problems that typically require a co-author read-through.
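Conceptually this is a review-revise loop. The sketch below is a guess at its structure: llm_review and llm_revise stand in for model calls, and the stopping rule (quit when nothing is flagged, or after a fixed number of passes) is an assumption.

```python
def llm_review(draft, guidelines):
    """Stand-in for the reviewer pass; returns a list of issues to fix."""
    return []  # e.g. ["Abstract exceeds the word limit", "Sec. 3 contradicts Table 2"]

def llm_revise(draft, issues):
    """Stand-in for the revision pass that applies the reviewer's suggestions."""
    return draft

def refine(draft, guidelines, max_passes=2):
    for _ in range(max_passes):
        issues = llm_review(draft, guidelines)
        if not issues:       # stop once the reviewer agent has nothing left to flag
            break
        draft = llm_revise(draft, issues)
    return draft

print(refine("\\section{Introduction} ...", guidelines="CVPR 2025 author kit"))
```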
How It Stacks Up Against Rivals
The honest comparison is against Sakana AI’s AI Scientist-v2, which generated the first AI-authored paper to pass human peer review (published in Nature in early 2026). AI Scientist-v2 is designed to run the entire research loop — hypothesis generation, experimentation, and writeup — while PaperOrchestra only handles the writeup step.
On PaperWritingBench, PaperOrchestra outperforms AI Scientist-v2 on overall manuscript quality by 39–86% in automated evaluation (using Gemini-3.1-Pro and GPT-5 as judges). In human evaluation with 11 AI researchers across 180 paired manuscript comparisons, the gap narrows: PaperOrchestra achieves 14–38% win rate margins overall, and 50–68% win margins specifically on literature review quality. The timing difference is modest — 39.6 minutes versus AI Scientist-v2’s 35.1 minutes — despite running 20–30 more LLM calls.
The critical caveat: these systems aren’t competing for the same job. AI Scientist-v2 generates and runs experiments. PaperOrchestra assumes experiments are already done and documented. It’s a writing tool, not a research replacement — and that distinction matters enormously for where you’d actually slot it into a real workflow.
What PaperOrchestra Can’t Do
The system cannot evaluate the scientific validity of the ideas it’s given. If your experimental logs document a flawed study design, PaperOrchestra will write a polished paper describing a flawed study. The abstraction layer is clean: good inputs produce better manuscripts than bad inputs, but the system has no mechanism to flag whether the underlying science is sound.
Citation quality, while improved over baselines, still falls short of human-written papers. The human baseline in PaperWritingBench averages roughly 59 citations with higher F1 than any AI system tested. At 29.7% citation F1, PaperOrchestra’s literature synthesis is the strongest among AI systems — but it’s still missing a significant slice of the relevant literature a human expert would naturally include.
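For orientation, citation F1 here is the usual harmonic mean of precision and recall, presumably scored against the human paper's bibliography (the article implies this but doesn't spell out the matching procedure). The citation counts below come from the article; the overlap count is invented purely to show the arithmetic.

```python
# Standard F1 over bibliographies: precision = overlap / generated, recall = overlap / human.
# 47 and 59 are the reported averages; an overlap of 15 is a made-up illustration.
generated, human, overlap = 47, 59, 15

precision = overlap / generated   # ~0.319
recall = overlap / human          # ~0.254
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f}  recall={recall:.3f}  F1={f1:.3f}")  # F1 ~ 0.283
```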
There’s also the novelty problem. Reviewers at top venues aren’t just checking whether a paper is well-written; they’re assessing whether the contribution is meaningful. PaperOrchestra can structure an argument clearly, but whether the underlying contribution is novel enough for ICLR is still a human judgment call that happens before — or after — the system ever touches the manuscript.
These limitations don’t undermine the use case — they define it. PaperOrchestra is most useful for researchers who have solid results and struggle with the writeup, not for researchers still figuring out what they’ve found. AI is increasingly automating the full research pipeline, but the writeup layer and the experimental layer are at different maturity levels right now.
What This Means for the Field
PaperOrchestra matters for two reasons beyond its benchmark numbers. First, PaperWritingBench — 200 papers with reverse-engineered raw materials — gives the community a standardized way to measure progress on automated writing. Before this, comparisons between systems were ad hoc and non-reproducible. That benchmark infrastructure matters more than any single performance number in the paper.
Second, the specialization-over-single-agent architecture is a repeatable lesson. Every domain where multi-agent systems have beaten single agents — coding, retrieval, long-form generation — has followed the same pattern: decompose the task, specialize the agents, pass structured intermediate output between stages. PaperOrchestra confirms that academic writing follows the same logic.
The practical question for researchers isn’t whether tools like this will become part of the workflow — they already are. It’s when they become trustworthy enough to use without careful review. Institutions are still deciding what rules to set around AI-generated manuscripts. PaperOrchestra’s verification-heavy design — API-checking citations, VLM-reviewing figures — signals that Google’s team is building with that accountability question in mind.
Further Reading
- PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing — The original arXiv paper with full benchmark methodology, agent architecture, and evaluation results across both automated and human judges.
- PaperOrchestra Project Page — The official demo page with example outputs, figure samples, and access to the PaperWritingBench dataset.
- Google AI Research Introduces PaperOrchestra (MarkTechPost) — Accessible summary of the architecture and what differentiates it from AI Scientist-v2 and single-agent systems.

