Run Autonomous LLM Research with autoresearch: Let AI Agents Train Models Overnight
🤖 What Is autoresearch?
autoresearch is an open-source experiment that hands LLM training to an AI agent. You set up a single-GPU box, write instructions in a Markdown file, and the agent iterates on model architecture, hyperparameters, and optimizers overnight.
The premise is simple: instead of manually tuning code, you let the agent propose changes, run a 5-minute experiment, and log whether the result improved. You wake up to a git history of experiments and (hopefully) a better model.
🚀 Why It Matters
- Hands-off research: The agent runs while you sleep
- Comparable experiments: A fixed 5-minute budget makes every run directly comparable
- Single-file focus: Only train.py is edited; the scope stays manageable
- Human steerable: You control the agent via program.md, not Python
- Self-contained: No distributed training, no complex configs, one GPU
🏁 Quick Start
Requirements: a single NVIDIA GPU, Python 3.10+, and uv.
1. Install and prepare
```
# clone the repo
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# install dependencies
uv sync

# download data and train a tokenizer (one-time, ~2 minutes)
uv run prepare.py
```

2. Verify the baseline
```
uv run train.py
```

This runs a single 5-minute training experiment and prints a summary including val_bpb. If this works, your setup is ready.
3. Launch the agent
Open your AI coding assistant in the repo, point it at program.md, and prompt something like:
Hi, have a look at program.md and let’s kick off a new experiment. Let’s do the setup first.
The agent reads program.md, creates a git branch, and starts the experiment loop.
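The actual program.md ships with the repo and is yours to rewrite. Purely as a hypothetical sketch (not the file's real contents), steering instructions might look like:

```
Mission: minimize val_bpb within the fixed 5-minute training budget.

Rules:
- Edit only train.py; never touch prepare.py or pyproject.toml.
- One idea per commit, described in the commit message.
- After each run, append a row to results.tsv marked keep/discard/crash.
- Prefer the simpler change when two ideas score the same.
```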
🧠 How It Works
Three files matter:
- prepare.py – fixed constants, data prep, tokenizer, dataloader, evaluation. Do not edit.
- train.py – the model, optimizer, and training loop. The agent edits this.
- program.md – your instructions to the agent. You edit this.
The agent runs in a loop:
- Reads the current git state
- Modifies train.py with an experimental idea
- Commits the change
- Runs uv run train.py for 5 minutes
- Logs the result to results.tsv
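To make the loop concrete, here is a minimal Python sketch of one iteration. It is illustrative only: in autoresearch the AI coding assistant drives this itself, and helper names like sh and one_experiment are invented for the example.

```python
import subprocess

def sh(cmd: str) -> str:
    # run a shell command and return its stdout
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

def one_experiment(idea: str) -> None:
    # the agent has already edited train.py with its idea; commit it first
    sh(f'git commit -am "{idea}"')
    commit = sh("git rev-parse --short HEAD").strip()
    # run one fixed-budget experiment, redirecting output to keep the context clean
    sh("uv run train.py > run.log 2>&1")
    # pull the headline metric from the log; a missing line means the run crashed
    val_bpb = None
    with open("run.log") as log:
        for line in log:
            if line.startswith("val_bpb:"):
                val_bpb = line.split(":", 1)[1].strip()
    status = "crash" if val_bpb is None else "keep"
    with open("results.tsv", "a") as results:
        results.write(f"{commit}\t{val_bpb}\t{status}\t{idea}\n")
```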
Each row in results.tsv records the commit hash, val_bpb, peak memory, status (keep / discard / crash), and a short description of what was tried.
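With that schema, picking the winner in the morning takes only a few lines. A sketch assuming exactly the five columns listed above, in that order; adjust the unpacking if the real file differs:

```python
import csv

def best_experiment(path: str = "results.tsv"):
    # return (commit, val_bpb, description) of the best non-crashed run
    best = None
    with open(path) as f:
        for commit, val_bpb, _peak_mem, status, desc in csv.reader(f, delimiter="\t"):
            if status == "crash":
                continue
            if best is None or float(val_bpb) < best[1]:
                best = (commit, float(val_bpb), desc)
    return best
```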
📝 What You Can and Cannot Change
The agent can edit:
- Model architecture
- Optimizer settings
- Hyperparameters
- Batch size
- Training loop logic
The agent cannot edit:
- prepare.py (read-only evaluation harness)
- pyproject.toml (no new dependencies)
- The evaluation metric itself
You can edit:
- program.md to change agent behavior
- Model depth or dataset if you fork for smaller hardware
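For a feel of what an agent edit looks like, here is a hypothetical single-commit change inside train.py. The identifiers (LEARNING_RATE, build_optimizer) are invented for this example and are not the repo's actual names:

```python
import torch

LEARNING_RATE = 3e-4  # experiment: halve the learning rate (was 6e-4)

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # experiment: AdamW with decoupled weight decay instead of plain Adam
    return torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=0.1)
```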
🛠️ Running on Smaller Hardware
The default setup targets an H100. For laptops or smaller GPUs, community forks exist for macOS, Windows RTX, and AMD. To tune manually:
- Use a lower-entropy dataset like TinyStories
- Reduce vocab_size from 8192 down to 1024 or even 256
- Lower MAX_SEQ_LEN in prepare.py (try 256)
- Reduce DEPTH in train.py (try 4)
- Use WINDOW_PATTERN "L" instead of "SSSL"
- Lower TOTAL_BATCH_SIZE to powers of 2 like 2**14
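Putting the list together, a small-GPU fork might set something like the following. These values are just the suggestions above, not tested defaults; check prepare.py and train.py for where each constant actually lives:

```python
# --- prepare.py ---
vocab_size = 1024         # down from 8192; 256 approaches byte-level
MAX_SEQ_LEN = 256         # shorter sequences, less activation memory
# --- train.py ---
DEPTH = 4                 # down from the default 8
WINDOW_PATTERN = "L"      # single window pattern instead of "SSSL"
TOTAL_BATCH_SIZE = 2**14  # keep it a power of two
```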
📊 Reading the Results
After a run, the terminal prints:
```
val_bpb: 0.997900
training_seconds: 300.1
total_seconds: 325.9
peak_vram_mb: 45060.2
mfu_percent: 39.80
total_tokens_M: 499.6
num_steps: 953
num_params_M: 50.3
depth: 8
```

The key number is val_bpb. Lower is better.
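Bits per byte normalizes cross-entropy by raw bytes of text rather than by tokens, so runs with different tokenizers stay comparable. Assuming autoresearch follows the standard definition, the conversion from a mean per-token loss in nats is:

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    # nats/token -> bits/token, then normalize by bytes of raw text
    return (mean_loss_nats / math.log(2)) * (total_tokens / total_bytes)
```

To extract val_bpb from a saved log: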
grep "^val_bpb:" run.log📝 Tips
- Keep the first run as a baseline before any agent changes
- Redirect all output to a log file so the agent context stays clean
- Weigh complexity cost against improvement magnitude; simpler code that performs equally well is a win
- Run each experiment on its own git branch so you can compare histories
- Review the git log in the morning; discard branches that crashed or regressed
That is it. You write the mission, the agent runs the experiments, and the GPU does the work while you sleep.
Crepi il lupo! ("May the wolf croak!") 🐺