Run Autonomous LLM Research with autoresearch: Let AI Agents Train Models Overnight
🤖 What Is autoresearch?
autoresearch is an open-source experiment that hands LLM training to an AI agent. You set up a single-GPU box, write instructions in a Markdown file, and the agent iterates on model architecture, hyperparameters, and optimizers overnight.
The premise is simple: instead of manually tuning code, you let the agent propose changes, run a 5-minute experiment, and log whether the result improved. You wake up to a git history of experiments and (hopefully) a better model.
🚀 Why It Matters
- Hands-off research: The agent runs while you sleep
- Comparable experiments: A fixed 5-minute budget makes every run directly comparable
- Single-file focus: Only train.py is edited; the scope stays manageable
- Human steerable: You control the agent via program.md, not Python
- Self-contained: No distributed training, no complex configs, one GPU
🏁 Quick Start
Requirements: a single NVIDIA GPU, Python 3.10+, and uv.
1. Install and prepare
```
# clone the repo
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# install dependencies
uv sync

# download data and train a tokenizer (one-time, ~2 minutes)
uv run prepare.py
```

2. Verify the baseline
```
uv run train.py
```

This runs a single 5-minute training experiment and prints a summary including val_bpb. If this works, your setup is ready.
3. Launch the agent
Open your AI coding assistant in the repo, point it at program.md, and prompt something like:
Hi, have a look at program.md and let’s kick off a new experiment. Let’s do the setup first.
The agent reads program.md, creates a git branch, and starts the experiment loop.
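The actual program.md ships with the repo and is yours to rewrite. Purely as a hypothetical sketch (not the file's real contents), steering instructions might look like:

```
Mission: minimize val_bpb within the fixed 5-minute training budget.

Rules:
- Edit only train.py; never touch prepare.py or pyproject.toml.
- One idea per commit, described in the commit message.
- After each run, append a row to results.tsv marked keep/discard/crash.
- Prefer the simpler change when two ideas score the same.
```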
🧠 How It Works
Three files matter:
- prepare.py – fixed constants, data prep, tokenizer, dataloader, evaluation. Do not edit.
- train.py – the model, optimizer, and training loop. The agent edits this.
- program.md – your instructions to the agent. You edit this.
The agent runs in a loop:
- Reads the current git state
- Modifies train.py with an experimental idea
- Commits the change
- Runs uv run train.py for 5 minutes
- Logs the result to results.tsv
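To make the loop concrete, here is a minimal Python sketch of one iteration. It is illustrative only: in autoresearch the AI coding assistant drives this itself, and helper names like sh and one_experiment are invented for the example.

```python
import subprocess

def sh(cmd: str) -> str:
    # run a shell command and return its stdout
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

def one_experiment(idea: str) -> None:
    # the agent has already edited train.py with its idea; commit it first
    sh(f'git commit -am "{idea}"')
    commit = sh("git rev-parse --short HEAD").strip()
    # run one fixed-budget experiment, redirecting output to keep the context clean
    sh("uv run train.py > run.log 2>&1")
    # pull the headline metric from the log; a missing line means the run crashed
    val_bpb = None
    with open("run.log") as log:
        for line in log:
            if line.startswith("val_bpb:"):
                val_bpb = line.split(":", 1)[1].strip()
    status = "crash" if val_bpb is None else "keep"
    with open("results.tsv", "a") as results:
        results.write(f"{commit}\t{val_bpb}\t{status}\t{idea}\n")
```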
Each row in results.tsv records the commit hash, val_bpb, peak memory, status (keep / discard / crash), and a short description of what was tried.
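With that schema, picking the winner in the morning takes only a few lines. A sketch assuming exactly the five columns listed above, in that order; adjust the unpacking if the real file differs:

```python
import csv

def best_experiment(path: str = "results.tsv"):
    # return (commit, val_bpb, description) of the best non-crashed run
    best = None
    with open(path) as f:
        for commit, val_bpb, _peak_mem, status, desc in csv.reader(f, delimiter="\t"):
            if status == "crash":
                continue
            if best is None or float(val_bpb) < best[1]:
                best = (commit, float(val_bpb), desc)
    return best
```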
📝 What You Can and Cannot Change
The agent can edit:
- Model architecture
- Optimizer settings
- Hyperparameters
- Batch size
- Training loop logic
The agent cannot edit:
- prepare.py (read-only evaluation harness)
- pyproject.toml (no new dependencies)
- The evaluation metric itself
You can edit:
- program.md to change agent behavior
- Model depth or dataset if you fork for smaller hardware
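For a feel of what an agent edit looks like, here is a hypothetical single-commit change inside train.py. The identifiers (LEARNING_RATE, build_optimizer) are invented for this example and are not the repo's actual names:

```python
import torch

LEARNING_RATE = 3e-4  # experiment: halve the learning rate (was 6e-4)

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # experiment: AdamW with decoupled weight decay instead of plain Adam
    return torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=0.1)
```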
🛠️ Running on Smaller Hardware
The default setup targets an H100. For laptops or smaller GPUs, community forks exist for macOS, Windows RTX, and AMD. To tune manually:
- Use a lower-entropy dataset like TinyStories
- Reduce vocab_size from 8192 down to 1024 or even 256
- Lower MAX_SEQ_LEN in prepare.py (try 256)
- Reduce DEPTH in train.py (try 4)
- Use WINDOW_PATTERN "L" instead of "SSSL"
- Lower TOTAL_BATCH_SIZE to powers of 2 like 2**14
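Putting the list together, a small-GPU fork might set something like the following. These values are just the suggestions above, not tested defaults; check prepare.py and train.py for where each constant actually lives:

```python
# --- prepare.py ---
vocab_size = 1024         # down from 8192; 256 approaches byte-level
MAX_SEQ_LEN = 256         # shorter sequences, less activation memory
# --- train.py ---
DEPTH = 4                 # down from the default 8
WINDOW_PATTERN = "L"      # single window pattern instead of "SSSL"
TOTAL_BATCH_SIZE = 2**14  # keep it a power of two
```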
📊 Reading the Results
After a run, the terminal prints:
```
val_bpb: 0.997900
training_seconds: 300.1
total_seconds: 325.9
peak_vram_mb: 45060.2
mfu_percent: 39.80
total_tokens_M: 499.6
num_steps: 953
num_params_M: 50.3
depth: 8
```

The key number is val_bpb. Lower is better.
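Bits per byte normalizes cross-entropy by raw bytes of text rather than by tokens, so runs with different tokenizers stay comparable. Assuming autoresearch follows the standard definition, the conversion from a mean per-token loss in nats is:

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    # nats/token -> bits/token, then normalize by bytes of raw text
    return (mean_loss_nats / math.log(2)) * (total_tokens / total_bytes)
```

To extract val_bpb from a saved log: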
grep "^val_bpb:" run.log📝 Tips
- Keep the first run as a baseline before any agent changes
- Redirect all output to a log file so the agent context stays clean
- Weigh complexity cost against improvement magnitude; simpler code that performs equally well is a win
- Run each experiment on its own git branch so you can compare histories
- Review the git log in the morning; discard branches that crashed or regressed
That is it. You write the mission, the agent runs the experiments, and the GPU does the work while you sleep.
Crepi il lupo! ("May the wolf croak!") 🐺