OpenAI’s New North Star: A Machine That Does Science
OpenAI has named a new organizational objective, and it is not a product launch or a benchmark score. The company is building a fully automated AI researcher—a system that reads the literature, designs experiments, runs them, and writes the results—with no human in the loop. In a memo reported by MIT Technology Review in March 2026, OpenAI described this as its “North Star” for the foreseeable future, pulling together its work on reasoning models, agents, and interpretability under a single strategic thread.
The ambition is genuinely large. This isn’t a smarter chatbot that summarizes papers or an assistant that formats citations. OpenAI envisions an agent that can tackle original, unsolved problems in mathematics, physics, biology, chemistry, and business—problems that exceed what any individual human researcher could handle in scope or complexity.
Two Milestones, Two Very Different Systems
OpenAI is approaching the goal in two distinct phases. The first target is an AI research intern, scheduled for release by September 2026. This system will be able to take on a small set of well-scoped research problems autonomously—think: run a defined set of ablations on a model architecture, or test a specific biological hypothesis in a constrained lab environment. It won’t be doing Nobel-caliber work. It will be doing the workload that currently falls on junior PhD students.
The second phase is the full autonomous multi-agent research system, targeted for 2028. This is the version that would credibly qualify as a “researcher”—capable of formulating hypotheses, designing multi-step experimental plans, interpreting results, and iterating on failures. OpenAI frames this as a system that will compress years of scientific progress into months by running at machine speed across domains where humans are the bottleneck.
Sam Altman has attached specific revenue numbers to the agent roadmap. According to eWeek, OpenAI projects that agents alone could generate $29 billion per year by 2029, with research agents priced at up to $20,000 per month. That pricing tier implies that customers would be buying something meaningfully more capable than a knowledge worker assistant—closer to a senior research scientist on contract.
What “Fully Automated” Actually Looks Like in a Lab
The most concrete demonstration of this direction so far isn’t OpenAI’s internal roadmap—it’s a live collaboration between OpenAI and Ginkgo Bioworks. Reported by Scientific American, the setup has GPT-5 generating experimental designs that are fed directly into Ginkgo’s automated wet lab infrastructure. Ginkgo CEO Jason Kelly described their lab as the “Waymo of biology”: a physical system where the human sets the objective, and the AI does the driving. Researchers define what they’re trying to optimize—enzyme activity, protein folding, yield—and the system designs, executes, and evaluates the experiments in sequence.
This is the architecture OpenAI is betting on at scale: a reasoning model as the experimental brain, connected to automated physical or digital infrastructure that executes what the model designs. The bottleneck shifts from “how fast can a PhD student run an experiment” to “how fast can the lab hardware run and how good is the model’s hypothesis generation.” Both of those improve faster than human training cycles.
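In pseudocode, the loop this architecture implies is short, even though each step hides enormous complexity. A minimal sketch, with the caveat that every name here is a hypothetical stand-in, not Ginkgo's or OpenAI's actual APIs:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    protocol: str   # machine-readable instructions for the lab hardware
    objective: str  # what is being optimized, e.g. "enzyme activity"

def design(objective: str, history: list) -> Experiment:
    """Stand-in for the reasoning model proposing the next experiment,
    conditioned on the objective and all prior results."""
    return Experiment(protocol=f"protocol-{len(history)}", objective=objective)

def execute(exp: Experiment) -> float:
    """Stand-in for the automated lab run; returns a measured score."""
    return 0.5  # dummy measurement

def optimize(objective: str, budget: int) -> list:
    history = []
    for _ in range(budget):
        exp = design(objective, history)
        score = execute(exp)
        history.append((exp, score))  # the model sees every prior result
    return history

results = optimize(objective="enzyme activity", budget=3)
```

The human contribution is the `objective` string and the budget; design, execution, and evaluation all sit inside the loop.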
Karpathy’s Autoresearch: The Open-Source Preview
While OpenAI plans for 2028, Andrej Karpathy shipped a working version of this concept in 630 lines of Python on March 7, 2026. His autoresearch project—which collected over 21,000 GitHub stars within days—implements the minimal agent loop for autonomous ML experimentation on a single GPU.
The system works like this: an agent reads its own training script, forms a hypothesis (e.g., “adjusting the learning rate schedule should improve convergence”), modifies the code, runs a 5-minute experiment on a fixed compute budget, evaluates the result, and loops. No human approval required between cycles. Left running for two days on a “depth=12” language model, the agent made approximately 700 autonomous code changes and identified roughly 20 additive improvements that transferred to larger models—dropping the “Time to GPT-2” leaderboard metric from 2.02 hours to 1.80 hours, an 11% efficiency gain. On a single night in March, 333 instances of the agent ran concurrently on the Hyperspace distributed compute network, unsupervised.
Karpathy’s implementation is deliberately minimal. It does not search the literature, manage lab equipment, or write papers. But it demonstrates the core loop: hypothesis → experiment → evaluation → next hypothesis, running faster than any human can sustain. That loop is what OpenAI is trying to generalize across all of science.
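Stripped to its skeleton, that loop fits in a few lines. This is not Karpathy's actual code (autoresearch patches a real training script and runs it on a GPU); `propose_edit` and `run_experiment` are hypothetical stand-ins for the model call and the fixed-budget training run:

```python
import random

def propose_edit(script: str, log: list) -> str:
    """Stand-in for the agent forming a hypothesis and modifying the code."""
    return script + f"\n# tweak {len(log)}"  # dummy mutation

def run_experiment(script: str) -> float:
    """Stand-in for a short, fixed-budget run; returns a metric where
    lower is better (e.g. time to reach a target loss, in hours)."""
    return random.uniform(1.8, 2.1)

def autoloop(script: str, cycles: int) -> tuple[str, float]:
    best_script, best_metric = script, run_experiment(script)
    log = []
    for _ in range(cycles):
        candidate = propose_edit(best_script, log)
        metric = run_experiment(candidate)
        log.append((candidate, metric))
        if metric < best_metric:  # keep only changes that improve the metric
            best_script, best_metric = candidate, metric
    return best_script, best_metric

final_script, final_metric = autoloop("# train.py", cycles=10)
```

No step in the loop waits for human approval, which is exactly why 700 autonomous code changes in two days is possible and why 333 concurrent instances is just a matter of compute.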
The Hard Problems This Doesn’t Solve Yet
The roadmap is serious, but so are the obstacles. A few are worth naming plainly.
Experiment validation and fraud resistance. The research community is already dealing with AI-generated papers passing peer review and conference submissions manipulated by AI. A system that produces thousands of experiments per day at minimal cost will stress every downstream verification mechanism in science—journal review, replication, funding prioritization.
Hallucination in hypothesis generation. Reasoning models still produce confident, plausible-sounding hypotheses that are factually wrong. In software, a wrong hypothesis wastes a few minutes of GPU time. In biology, it may consume weeks of expensive lab reagents or, in clinical contexts, carry real risk. The models need calibrated uncertainty that current systems do not reliably provide.
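One way to see why calibration matters: if the model's stated confidence is trustworthy, a simple expected-value gate can decide whether a hypothesis is worth its experimental cost. A toy sketch with illustrative numbers, not drawn from any real system:

```python
def worth_running(p_success: float, value_if_true: float, cost: float) -> bool:
    """Run the experiment only if expected payoff exceeds its cost."""
    return p_success * value_if_true > cost

# A long-shot hypothesis against cheap GPU time: worth running.
assert worth_running(p_success=0.10, value_if_true=100.0, cost=5.0)

# The same 10% hypothesis against weeks of wet-lab reagents: not worth it...
assert not worth_running(p_success=0.10, value_if_true=100.0, cost=50.0)

# ...unless the model's confidence is both higher and actually calibrated.
assert worth_running(p_success=0.60, value_if_true=100.0, cost=50.0)
```

The gate is only as good as `p_success`: an overconfident model passes bad experiments through it and silently burns the budget, which is the failure mode the paragraph above describes.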
Domain-specific grounding. General-purpose reasoning models perform unevenly across scientific domains. A model that is excellent at generating code experiments may be weak at designing wet lab protocols in microbiology or interpreting spectroscopic data in chemistry. The September 2026 intern is likely to be domain-scoped, not general, for exactly this reason.
Infrastructure cost. Sam Altman’s compute commitment—30 gigawatts of infrastructure, representing over $1.4 trillion in financial obligations—gives a sense of the stakes. Running thousands of research agents continuously is not a laptop-scale problem. The cost structure of automated science will favor organizations with infrastructure at cloud-provider scale, which has implications for who actually benefits from the acceleration.
What September 2026 Will Tell Us
The AI research intern, if it ships on schedule in September, will be the first real-world test of whether OpenAI’s framework holds up outside of controlled demonstrations. The questions that matter: which domains does it actually work in? What does the failure rate look like on real research problems versus curated benchmarks? And how do institutions—universities, labs, journals, funding bodies—adapt their processes to a system that produces results at machine speed?
Karpathy’s autoresearch already gives a working answer for ML experimentation. The Ginkgo collaboration gives one for constrained biology. Whether those answers generalize is the empirical question OpenAI is betting billions to answer. The September milestone won’t resolve it—but it will tell us whether the bet is directionally right.
Further Reading
- OpenAI is throwing everything into building a fully automated researcher — MIT Technology Review’s original reporting on the internal memo and strategic direction.
- OpenAI and Ginkgo Bioworks show how AI can accelerate scientific discovery — Scientific American on the live biology experiment collaboration.
- karpathy/autoresearch on GitHub — The 630-line open-source implementation of autonomous ML experimentation, the clearest working demonstration of the concept today.

