Keynote from Y Combinator AI Startup School • San Francisco • 17 June 2025
Why this talk matters
In “Software Is Changing (Again)” Andrej Karpathy argues that large-language-model agents mark a third great epoch of software:
| Wave | Interface | What engineers write | Compute target |
|---|---|---|---|
| 1.0 | Compilers/CPUs | Imperative code (C++, Java) | Deterministic machines |
| 2.0 | ML frameworks | Data + model architecture | Neural nets |
| 3.0 | LLMs-as-OS | Prompts & feedback loops | Stochastic “people spirits” running in the cloud |
Karpathy’s core claim: “The hottest new programming language is English.”
What “agentic AI” means in practice
| Property of LLM agents | Engineering consequence |
|---|---|
| Jagged intelligence – super-human recall, sub-human arithmetic | Add verify loops & type-checkers around every call |
| Utility / fab / OS triple-role – expensive to build, cheap to consume | Treat model labs like cloud vendors; architect for fail-over & rate limits |
| Programmed in natural language | Shift effort from syntax → prompt design & feedback heuristics |
| Partial autonomy sliders | Design UIs that expose diff views and “confidence knobs” (Cursor, Perplexity, etc.) |
| Agents as first-class users | Ship llms.txt, structured markdown, and self-describing APIs so bots—not just humans—can consume your product |
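As a concrete illustration of the last row, here is a minimal sketch of what an `llms.txt` file might look like for the MenuGen demo mentioned later in the talk. The `llms.txt` convention is a proposed standard (a markdown file at a site's root giving agents a concise map of the product); the product description and URLs below are illustrative, not from the keynote.

```markdown
# MenuGen
> Turns a photo of a restaurant menu into an illustrated web page.
> This file gives LLM agents a concise map of the product and its docs.

## Docs
- [Quickstart](https://example.com/docs/quickstart.md): first request in five minutes
- [API reference](https://example.com/docs/api.md): endpoints, auth, rate limits

## Optional
- [Changelog](https://example.com/changelog.md): recent breaking changes
```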
Five new pillars of Software 3.0 engineering
- Prompt-oriented architecture
  - Store prompts under version control and test them like code.
- Guard-railed execution
  - Wrap every model call in validators (regex, unit tests, type checks).
- Tight generate/verify cycles
  - Latency budgets must include detours for automatic critique or human review.
- Agent-friendly infrastructure
  - Documentation readable by both humans and parsers; think OpenAPI plus `llms.txt`.
- Observability for cognition
  - Log token streams and system messages; inspect why an agent acted, not just what it returned.
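The guard-rail and generate/verify pillars can be sketched in a few lines: wrap the model call in a validator and re-prompt with the failure until the output passes or a retry budget is exhausted. The `fake_llm` stub below stands in for a real model API (it deliberately fails on its first attempt to exercise the retry path); the ISO-date validator is just one example of a check.

```python
import re

def fake_llm(prompt: str, attempt: int) -> str:
    """Stand-in for a real model call; returns malformed output on the
    first attempt so the retry path is exercised."""
    return "mid-June, probably" if attempt == 0 else "2025-06-17"

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # validator: ISO-8601 date

def guarded_call(prompt: str, max_retries: int = 3) -> str:
    """Generate/verify loop: re-prompt with the failure until valid."""
    for attempt in range(max_retries):
        out = fake_llm(prompt, attempt).strip()
        if DATE_RE.match(out):
            return out
        # Feed the rejection back so the next attempt can self-correct.
        prompt += f"\nPrevious answer {out!r} was not ISO-8601. Try again."
    raise ValueError("model never produced a valid date")

print(guarded_call("When was the keynote? Reply as YYYY-MM-DD."))
# prints 2025-06-17
```

The same shape generalises: swap the regex for a JSON-schema check, a unit test, or a type-checker pass, and the loop becomes the "verify detour" the latency budget must account for.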
These practices operationalise Karpathy’s warning that LLMs are “people spirits”: creative but error-prone. They demand DevOps-for-cognition, not just DevOps-for-code.
Opportunities unlocked by agentic AI
| Opportunity | Example today | Why it’s viable now |
|---|---|---|
| One-person MVPs (“vibe coding”) | MenuGen demo: full UI via prompts | LLM handles boilerplate; human supplies vision |
| Autonomous research & data pull | Perplexity Deep Research agents | Cheap in-context search + summarisation |
| Continuous code-base refactoring | Internal GPT-powered linters deleting dead paths at Tesla | Model can reason over millions of LOC |
| Agent-to-agent protocols | DeepWiki, GitIngest | Structured docs allow bots to traverse knowledge |
Risks & open questions
- Reliability debt – Hallucinations trade raw speed for debugging overhead.
- Platform centralisation – Cloud LLMs resemble 1960s mainframes; open-weight models may rebalance power later.
- Skill displacement – Traditional “middleware” coding shrinks, but prompt engineering, evaluation and agent-safety grow.
- Security surface – Agents that read/write code can exfiltrate secrets or commit “prompt injection” supply-chain attacks.
Karpathy advocates a “keep AI on the leash” stance—tight scopes, incremental release, human-in-the-loop—mirroring comments he reiterated in press interviews.
How to prepare your team
- Inventory high-leverage prompts – Identify workflows where English beats code.
- Layer tests – Write property-based tests that re-prompt until they pass.
- Instrument everything – Token-level logs + embeddings let you analyse failures post-hoc.
- Upskill in evaluation – Learn metrics like graded self-consistency (GSC) or LM-confidence via entropy.
- Pilot “agent-first” modules – Start with side-cars (doc summariser, migration assistant) before core logic.
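The "instrument everything" step can start with something as small as a tracing decorator: record every prompt, response, and latency so failures can be analysed post-hoc. A minimal sketch, with `fake_llm` as a stand-in for a real model call (token-level logs and embeddings would hang off the same hook):

```python
import time
from functools import wraps

def traced(model_fn):
    """Record each call's prompt, response, and latency for post-hoc analysis."""
    trace: list = []

    @wraps(model_fn)
    def wrapper(prompt: str) -> str:
        start = time.perf_counter()
        response = model_fn(prompt)
        trace.append({
            "prompt": prompt,
            "response": response,
            "latency_s": round(time.perf_counter() - start, 4),
        })
        return response

    wrapper.trace = trace  # inspect in-process, or ship to a log sink
    return wrapper

@traced
def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"summary of: {prompt}"

fake_llm("the keynote transcript")
print(fake_llm.trace[0])
```

The point of logging the prompt alongside the response is that it captures *why* the agent acted, which a bare output log cannot.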
Verdict on the keynote
Strengths
- Clear, memorable mental model (Software 1.0-2.0-3.0) that frames LLMs as a new compute substrate.
- Concrete engineering anecdotes (Autopilot diff-slider, MenuGen) that ground the hype.
- Honest about limitations—“jagged intelligence” demands guardrails.
Gaps
- Little discussion of on-device LLM inference and its impact on edge privacy.
- Tooling section glosses over non-code disciplines (design, legal) that will also face agentic disruption.
Overall: A must-watch thesis for technologists. It’s less a product roadmap than a design brief for building robust, human-aligned AI agents—and a wake-up call that the next decade of software will be written as much in English as in Python.
