Run Autonomous LLM Research with autoresearch: Let AI Agents Train Models Overnight


🤖 What Is autoresearch?

autoresearch is an open-source experiment that hands LLM training to an AI agent. You set up a single-GPU box, write instructions in a Markdown file, and the agent iterates on model architecture, hyperparameters, and optimizers overnight.

The premise is simple: instead of manually tuning code, you let the agent propose changes, run a 5-minute experiment, and log whether the result improved. You wake up to a git history of experiments and (hopefully) a better model.

🚀 Why It Matters

  • Hands-off research: The agent runs while you sleep
  • Comparable experiments: A fixed 5-minute budget makes every run an apples-to-apples comparison
  • Single-file focus: Only train.py is edited; the scope stays manageable
  • Human steerable: You control the agent via program.md, not Python
  • Self-contained: No distributed training, no complex configs, one GPU

🏁 Quick Start

Requirements: a single NVIDIA GPU, Python 3.10+, and uv.

1. Install and prepare

# clone the repo
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# install dependencies
uv sync

# download data and train a tokenizer (one-time, ~2 minutes)
uv run prepare.py

2. Verify the baseline

uv run train.py

This runs a single 5-minute training experiment and prints a summary including val_bpb. If this works, your setup is ready.
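
The metric val_bpb is bits per byte: the model's validation cross-entropy normalized by bytes of raw text rather than by tokens, so scores stay comparable even if the tokenizer changes. A minimal sketch of the conversion, assuming you have the mean per-token loss in nats and the token/byte counts (function name and arguments are illustrative, not from the repo):

```python
import math

def bits_per_byte(mean_nll_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits per byte.

    total_tokens / total_bytes rescales from per-token to per-byte;
    dividing by ln(2) converts nats to bits.
    """
    return mean_nll_nats * total_tokens / total_bytes / math.log(2)

# e.g. a mean loss of 2.0 nats/token with ~4 bytes per token on average:
print(round(bits_per_byte(2.0, total_tokens=250, total_bytes=1000), 4))  # → 0.7213
```

Because the denominator is bytes, a run that "cheats" by using a coarser tokenizer does not get an artificially lower score.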

3. Launch the agent

Open your AI coding assistant in the repo, point it at program.md, and prompt something like:

Hi, have a look at program.md and let’s kick off a new experiment. Let’s do the setup first.

The agent reads program.md, creates a git branch, and starts the experiment loop.

🧠 How It Works

Three files matter:

  • prepare.py – fixed constants, data prep, tokenizer, dataloader, evaluation. Do not edit.
  • train.py – the model, optimizer, and training loop. The agent edits this.
  • program.md – your instructions to the agent. You edit this.

The agent runs in a loop:

  1. Reads the current git state
  2. Modifies train.py with an experimental idea
  3. Commits the change
  4. Runs uv run train.py for 5 minutes
  5. Logs the result to results.tsv
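
The five steps above can be sketched as plain Python: the agent's "propose an edit" step is the LLM call, and everything else is git and subprocess plumbing. This is a hypothetical sketch, not the repo's code; propose_edit and experiment_step are invented names, and the real results.tsv columns may differ.

```python
import subprocess

def sh(cmd: str) -> str:
    """Run a shell command and return its stdout."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

def experiment_step(propose_edit, run=sh, log_path="results.tsv"):
    """One iteration of the (hypothetical) agent loop."""
    head = run("git rev-parse --short HEAD").strip()               # 1. current git state
    description = propose_edit("train.py")                         # 2. agent edits train.py
    run("git add train.py && git commit -m " + repr(description))  # 3. commit the change
    summary = run("uv run train.py").strip()                       # 4. fixed 5-minute run
    with open(log_path, "a") as f:                                 # 5. append the result row
        f.write(f"{head}\t{summary}\t{description}\n")
    return head, summary
```

Injecting `run` as a parameter keeps the loop testable without a real GPU or git repo.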

Each row in results.tsv records the commit hash, val_bpb, peak memory, status (keep / discard / crash), and a short description of what was tried.
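
Given that schema, picking the best surviving run out of an overnight session is a few lines of Python. A sketch, assuming the columns appear in the order just described (hash, val_bpb, peak memory, status, description) and that kept rows carry the literal status "keep":

```python
import csv

def best_kept_run(path="results.tsv"):
    """Return (commit, val_bpb, description) of the lowest-val_bpb kept run.

    Column order is an assumption based on the fields described above.
    """
    best = None
    with open(path, newline="") as f:
        for commit, val_bpb, _peak_mem, status, desc in csv.reader(f, delimiter="\t"):
            if status != "keep":
                continue  # skip discarded and crashed experiments
            if best is None or float(val_bpb) < best[1]:
                best = (commit, float(val_bpb), desc)
    return best
```
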

📝 What You Can and Cannot Change

The agent can edit:

  • Model architecture
  • Optimizer settings
  • Hyperparameters
  • Batch size
  • Training loop logic

The agent cannot edit:

  • prepare.py (read-only evaluation harness)
  • pyproject.toml (no new dependencies)
  • The evaluation metric itself

You can edit:

  • program.md to change agent behavior
  • Model depth or dataset if you fork for smaller hardware

🛠️ Running on Smaller Hardware

The default setup targets an H100. For laptops or smaller GPUs, community forks exist for macOS, Windows RTX, and AMD. To tune manually:

  • Use a lower-entropy dataset like TinyStories
  • Reduce vocab_size from 8192 down to 1024 or even 256
  • Lower MAX_SEQ_LEN in prepare.py (try 256)
  • Reduce DEPTH in train.py (try 4)
  • Use WINDOW_PATTERN "L" instead of "SSSL"
  • Lower TOTAL_BATCH_SIZE to powers of 2 like 2**14
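
Taken together, a small-hardware fork's constants might look like the sketch below. The names and which file each lives in follow the list above; the exact values are illustrative starting points, not tested recommendations:

```python
# prepare.py
vocab_size  = 1024    # down from 8192; 256 is viable on very small GPUs
MAX_SEQ_LEN = 256     # shorter contexts cut memory and per-step time

# train.py
DEPTH            = 4        # fewer transformer blocks
WINDOW_PATTERN   = "L"      # all-local attention instead of "SSSL"
TOTAL_BATCH_SIZE = 2**14    # a smaller power of two
```

Shrink one knob at a time and re-run the 5-minute baseline so you can tell which change bought the speedup.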

📊 Reading the Results

After a run, the terminal prints:

val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
num_steps:        953
num_params_M:     50.3
depth:            8

The key number is val_bpb (validation loss measured in bits per byte); lower is better. To extract it from logs:

grep "^val_bpb:" run.log
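
If you keep one log per run, a short pipeline finds the best run across all of them. The run*.log naming is an assumption; adjust the glob to however you save logs:

```shell
# lowest val_bpb across all saved run logs
grep -H "^val_bpb:" run*.log | sort -t: -k3 -g | head -n 1
```

`grep -H` prefixes each match with its filename, `sort -g` compares the numeric field (tolerating the padded whitespace), and `head` keeps the winner.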

📝 Tips

  • Keep the first run as a baseline before any agent changes
  • Redirect all output to a log file so the agent context stays clean
  • Weigh complexity cost against improvement magnitude; simpler code that performs equally well is a win
  • Run each experiment on its own git branch so you can compare histories
  • Review the git log in the morning; discard branches that crashed or regressed
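
The morning review itself is three git commands. The branch name experiment-1 is a placeholder; substitute whatever branch the agent created:

```shell
git log --oneline main..experiment-1      # what the agent tried overnight
git diff main...experiment-1 -- train.py  # net change to the training script
git branch -D experiment-1                # discard a crashed or regressed branch
```

The two-dot range lists the experiment's commits; the three-dot diff shows changes relative to the common ancestor with main.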

That is it. You write the mission, the agent runs the experiments, and the GPU does the work while you sleep.

Crepi il lupo! (Italian: “may the wolf die” — i.e., good luck) 🐺