How to Train Your GPT: Build a Language Model From Scratch
There are two kinds of ML tutorials: the shallow kind, where you call .fit() and walk away none the wiser, and the academic kind, where you need a PhD to get past the first equation. This project by raiyanyahya does something rare: it builds a complete GPT from scratch, explains every line, and assumes you know basic Python and nothing else.
How to Train Your GPT is a 12-chapter interactive textbook with 7,500+ lines of annotated code. You write the tokenizer, embeddings, attention mechanism, transformer blocks, training loop, and inference engine yourself. The architecture follows LLaMA 3: RoPE positional encoding, RMSNorm, SwiGLU activations, pre-norm residuals. Not GPT-2 style from 2019. The stuff production models actually use today.
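To make one of those names concrete, here is a minimal sketch of RMSNorm in plain Python. It is not taken from the project, which uses PyTorch. The function name `rms_norm`, the `eps` value, and the toy input are assumptions for illustration only.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide a vector by its root-mean-square, then apply a per-element gain.

    Unlike LayerNorm, there is no mean subtraction and no bias term.
    `eps` (an assumed value) guards against division by zero.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

x = [3.0, -4.0, 0.0, 0.0]      # toy activation vector; its RMS is 2.5
gain = [1.0] * len(x)          # the learned gain, left at 1 here
y = rms_norm(x, gain)
print(y)                       # roughly [1.2, -1.6, 0.0, 0.0]
```

After the call, the output vector has a root-mean-square of about 1 regardless of how large the input was, which is the whole point of the layer.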
What You Build
A 151-million-parameter transformer, implemented from scratch in PyTorch. The tiny config (17M params, 4 layers) runs in minutes on a CPU. The full-scale config (768 dims, 12 layers) needs a GPU.
The code is split into clear chunks. BPE tokenizer in about 60 lines. Multi-head attention in 120. The full model in 200. Training pipeline in 250. Every line has a comment explaining what it does and why.
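For a feel of what fits in a tokenizer that small, here is a rough sketch of the core BPE training loop: count adjacent symbol pairs, merge the most frequent one, repeat. This is not the project's code. It works on characters rather than bytes to stay short, and the names `count_pairs`, `merge_pair`, `train_bpe`, and the toy corpus are all hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs over a corpus of {symbol_tuple: frequency}."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

# Toy corpus: each word split into characters, mapped to its frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges, vocab = train_bpe(corpus, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Encoding new text then amounts to replaying the learned merges in the same order.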
How It Teaches
Each chapter follows the same structure. First an analogy in plain language, no jargon. Then a worked example with real numbers traced through. Then the annotated code. Then a diagram.
There are also 15 standalone topic explainers for the key techniques: RoPE, RMSNorm, SwiGLU, KV cache, AdamW, mixed precision, and more. Each one covers what it is, where it goes, why it works, and includes a runnable code example.
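As a taste of what such an explainer might walk through, here is a small sketch of RoPE in plain Python. It is an illustration under assumptions, not the project's implementation: the function name `rope`, the interleaved pairing of dimensions, the `base` value of 10000, and the toy vectors are all choices made for this example.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (x0, x1), (x2, x3), ... by an angle
    proportional to the token position. Later pairs rotate more slowly."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)      # per-pair rotation frequency
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]   # toy query vector
k = [0.2, -1.0, 0.7, 0.4]   # toy key vector

score_a = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
score_b = dot(rope(q, 105), rope(k, 102))   # positions 105 and 102, offset 3
score_c = dot(rope(q, 5), rope(k, 4))       # positions 5 and 4, offset 1
print(abs(score_a - score_b) < 1e-9)        # True: same offset, same score
```

The attention score depends only on the distance between the two positions, not on where they sit in the sequence, so `score_a` and `score_b` agree while `score_c` differs.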
Companion Jupyter notebooks let you skip the explanations and just run the code end to end. Open any notebook, hit Run All, and watch the model learn on a tiny dataset in minutes.
Who It’s For
The prerequisite is Python basics. Variables, functions, classes, pip install. That’s it. No calculus, no linear algebra, no previous ML experience. The explanations teach the math as they go.
I clipped this because it’s the resource I wish had existed when I first tried to understand attention. After finishing, you won’t just know that attention works. You’ll understand why it’s scaled by 1 over root d_k. How RoPE captures relative position through rotation. Why pre-norm beats post-norm. Where every gradient flows during backprop. That level of detail.
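The scaling question is easy to poke at yourself. A dot product of two vectors with d_k independent, unit-variance components has a standard deviation of about root d_k, so unscaled scores grow with the head size and push softmax toward a one-hot output. The sketch below is my own plain-Python illustration, not code from the project; `attention_weights`, the seed, and the sizes are assumptions.

```python
import math
import random

def softmax(xs):
    m = max(xs)                              # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys, scale=True):
    """Softmax over query-key dot products, optionally divided by sqrt(d_k)."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    if scale:
        scores = [s / math.sqrt(d_k) for s in scores]
    return softmax(scores)

random.seed(0)
d_k = 256
q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(8)]

unscaled = attention_weights(q, keys, scale=False)
scaled = attention_weights(q, keys, scale=True)
print(max(unscaled), max(scaled))   # the unscaled weights are far more peaked
```

When one weight sits near 1 and the rest near 0, the gradient through softmax is close to zero, which is why the scaled version trains more smoothly.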
Links
- GitHub — The full project
- Start with Chapter 0 — The overview and big picture
Crepi il lupo! 🐺