This Saturday afternoon, GPT-5.4 and Claude Opus 4.6 are running autonomous AI research experiments on 16 GPUs. No human is watching. No one is writing code. The only human-editable artifact in the entire setup is a markdown file.
That's not a metaphor. That's karpathy/autoresearch, shipped to GitHub in March 2026, and it's one of the more quietly significant things to happen in applied AI this year.
Three Files, One Research Lab
The autoresearch repo has exactly three files that matter:
- prepare.py — fixed constants and one-time data prep. Not touched after setup.
- train.py — 629 lines of Python containing the full GPT model, optimizer (Muon + AdamW), and training loop. The AI agent edits this.
- program.md — a markdown file containing instructions for the AI research agent. The human edits this.
That's the entire research organization. The agent reads program.md, modifies train.py, runs a 5-minute training experiment, checks the validation loss metric (val_bpb — bits per byte, lower is better), keeps improvements, discards regressions, and repeats. You wake up to a log of what it tried overnight.
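That loop is just a greedy hill climb over edits to train.py. A minimal sketch in Python (the function names `propose_edit` and `run_experiment` are hypothetical stand-ins, not the repo's actual interfaces):

```python
import math

def hill_climb(propose_edit, run_experiment, generations):
    """Greedy keep/discard loop over candidate edits to train.py.

    propose_edit(best_src) -> candidate source text (hypothetical)
    run_experiment(src)    -> val_bpb after a bounded training run (hypothetical)
    """
    best_src, best_bpb = propose_edit(None), math.inf
    log = []
    for gen in range(generations):
        candidate = propose_edit(best_src)
        bpb = run_experiment(candidate)   # bounded 5-minute training budget
        kept = bpb < best_bpb             # lower val_bpb is better
        if kept:
            best_src, best_bpb = candidate, bpb
        log.append((gen, bpb, kept))      # the overnight log a human reviews
    return best_src, best_bpb, log
```

The log of `(generation, metric, kept)` tuples is the artifact you wake up to: every regression is recorded but discarded, every improvement becomes the new baseline.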
Karpathy's framing in the README is deliberately dramatic — he writes from the imagined future perspective of AI agents that have been running for 10,205 generations. The point lands: the experiment loop itself is the product, not any single training run.
The 5-Minute Budget as a Design Constraint
The 5-minute wall-clock time budget per experiment is not a limitation — it's the constraint that makes the whole system tractable. It means:
- Each iteration costs roughly the same in compute regardless of architectural changes
- The agent can run on the order of 100 experiments overnight on a single H100
- The metric (val_bpb) is vocabulary-size-independent, so architectural changes are fairly compared
This is the detail most coverage misses. The autonomous loop only works because the evaluation is bounded and the metric is normalized. Without those two properties, you'd get unbounded compute burns and incomparable results.
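The normalization behind val_bpb is simple: convert the model's mean cross-entropy (nats per token) into total bits, then amortize over raw bytes rather than tokens. A sketch of that conversion (the exact accounting inside train.py may differ):

```python
import math

def val_bpb(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Bits per byte: nats -> bits, then per-token -> per-byte.

    Dividing by raw bytes (not tokens) is what makes the number comparable
    across tokenizers with different vocabulary sizes: a model that uses
    fewer, longer tokens doesn't get a free pass on per-token loss.
    """
    total_bits = mean_loss_nats * total_tokens / math.log(2)
    return total_bits / total_bytes
```

A tokenizer change alters both `mean_loss_nats` and `total_tokens`, but `total_bytes` is fixed by the validation text itself, which is why the ratio stays meaningful.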
Yuchen Jin (Yuchenj_UW), who's running frontier models against the autoresearch setup today, noted that GPT-5.4 and Claude Opus 4.6 are functioning as competing AI researchers on the same task — which doubles as a live eval of which model is the better autonomous scientist.
program.md: The Human's Last Interface
Here's what the repo makes explicit in a way that most AI discourse dances around: the human's job in this system is writing the research agenda, not writing code. Karpathy's description: "you are programming the program.md."
This matters for any engineering lead running a team that's increasingly relying on AI coding agents. The leverage point has shifted upstream. The teams that will extract the most value from autonomous coding and research agents aren't the ones who know the most Python — they're the ones who can write the clearest, most constraint-aware program.md equivalents.
In practice, for a 30-person dev shop running an AI-assisted engineering org, this looks like:
- Before: Senior engineer writes the training loop, junior engineer runs experiments, PM reviews results in a retro
- After: Engineering lead writes the research agenda (the markdown file), the agent runs 80+ experiments overnight, engineer reviews the log and decides what to promote
The role shift isn't from engineer to non-engineer. It's from code-writer to agenda-writer. That's a meaningful distinction that most "AI replaces developers" takes miss entirely.
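To make "agenda-writer" concrete, here is what a program.md-style agenda might look like. This is a hypothetical sketch in the spirit of the repo, not the actual contents of its program.md:

```markdown
# Research agenda (hypothetical example)

Goal: reduce val_bpb on the d12 model. Budget: 5 minutes wall-clock per run.

Constraints:
- Only edit train.py; prepare.py is frozen.
- Keep a change only if val_bpb improves; otherwise revert it.
- Log every experiment: what changed, the resulting val_bpb, kept or reverted.

Suggested directions:
- Learning-rate schedule and warmup tweaks
- Small architectural variants (norm placement, activation choice)
```

Note what's in it: a metric, a budget, hard constraints, and priorities. No code. That's the whole job.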
The Live Evidence Is Running Right Now
As of this post, researchers are already stress-testing the framework with frontier models at scale. A March 5th roundup on blockchain.news noted that agents made 110 changes over 12 hours, reducing validation loss from 0.862415 to 0.858039 on a d12 model, without increasing wall-clock time.
That's not a toy result. A 0.004 delta in val_bpb over 110 unsupervised iterations is exactly the kind of incremental but compounding progress that makes autonomous research loops valuable: any individual improvement is small, but the accumulation is what you're buying.
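The arithmetic behind that claim is worth making explicit, using the headline numbers from the reported run:

```python
start, end, changes = 0.862415, 0.858039, 110

total_delta = start - end           # total val_bpb improvement, ~0.0044
relative = total_delta / start      # ~0.5% relative improvement
per_change = total_delta / changes  # ~4e-05 bpb per change on average

print(f"delta={total_delta:.6f} relative={relative:.4%} per_change={per_change:.2e}")
```

Each accepted change moves the metric by a few hundred-thousandths of a bit per byte. No human would bother running an experiment for that payoff; an agent running 110 of them overnight banks the sum.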
The most useful thing about autoresearch isn't the specific models it trains. It's the proof-of-concept that a human can deploy an AI research agent against a well-defined problem, specify the agenda in plain text, and come back to net-positive results. That loop is replicable across domains well beyond LLM training.
The race to automate the researcher rather than the coder is now clearly underway — and the entry point is a single markdown file on a single GPU. Shops that start practicing agenda-writing now will have a structural head start when the same pattern arrives in their domain.
