Train LLM From Scratch

FareedKhan-dev | MIT

LLM | 3.8K Stars | 521 Forks | 76 views

Train LLM From Scratch is an MIT-licensed Jupyter Notebook and PyTorch project by Fareed Khan that walks developers end to end through building a transformer language model, from raw data ingestion to text generation. It has gained 3,800+ GitHub stars and 520+ forks by treating LLM training as something a single engineer with a consumer GPU can actually do, rather than as a black box that only frontier labs can touch.

## What It Actually Builds

The project reconstructs the architecture from Attention Is All You Need in plain PyTorch and connects it to a real corpus, real tokenization, and a real training loop. It is not a wrapper around the HuggingFace Trainer. Multi-head self-attention, positional embeddings, layer normalization, residual connections, causal masking, and MLP blocks are all spelled out step by step, so the reader sees every tensor shape rather than just a configuration object. The end product is a small but functional GPT-style decoder model that emits coherent text after training.

## Scalable from 13M to 2B+ Parameters

A notable design choice is that the same notebook scales from 13-million-parameter toy models all the way up to runs with more than a billion parameters. The 13M variant trains on a T4 or RTX 3090 and produces grammatically correct English after a modest number of steps, which is enough to demystify the process. The 2B+ configuration documents what changes (deeper architecture, longer schedules, more careful tuning, A100-class hardware), so readers understand where the difficulty actually lives and why frontier-class training is hard.

## Realistic Data Pipeline

The project uses The Pile, an 825 GB open dataset, with Zstandard decompression and HDF5 for efficient on-disk storage. Tokenization uses tiktoken, the same encoder family OpenAI ships for GPT-3, which makes the resulting vocabulary and token counts directly comparable to mainstream LLM literature. This matters because most from-scratch tutorials skip the data engineering step entirely and pretend tokenized corpora appear by magic.

## Training Choices

The optimizer is AdamW with the now-standard warm-up and decay learning-rate schedule. Causal masking, batched iteration, and gradient accumulation are implemented explicitly rather than imported. Inference covers both greedy and probabilistic sampling, so the reader sees how temperature, top-k, and top-p actually change the output distribution.

## Why It Stands Out

Most LLM tutorials either stay at the toy character-level RNN tier or jump straight to fine-tuning Llama 3. Train LLM From Scratch sits in the underserved middle: real transformer, real Pile data, real tokenizer, real consumer GPU. The Jupyter notebook format means each step is runnable and editable, which is a better fit for actual learning than a 20-file PyTorch Lightning project. The accompanying explanations of OOP, PyTorch tensor mechanics, and architectural decisions make it accessible to engineers who know Python but not transformer internals.

## Limitations

This is a pedagogical project, not a path to a state-of-the-art base model. Even the 2B configuration is far from competitive with an open release like Llama 3 or Qwen3. Training on The Pile is also a 2020-era choice, since modern frontier runs use much larger, more carefully filtered web crawls, code, and synthetic data. Anyone who treats this as a starting point for a serious foundation model run should expect months of additional work on data quality, tokenizer design, evaluation infrastructure, and distributed training. As a learning artifact, however, it is one of the clearest end-to-end PyTorch LLM walkthroughs currently maintained.
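
To make the warm-up and decay schedule mentioned under Training Choices concrete, here is a minimal sketch in plain Python. It is not taken from the project's notebook: the cosine decay shape, the function name `lr_at_step`, and every default value are assumptions chosen for illustration, and the project may use a different curve or different hyperparameters.

```python
import math


def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate for one optimizer step: linear warm-up, then cosine decay.

    All defaults are illustrative assumptions, not the project's settings.
    """
    if step < warmup_steps:
        # Linear ramp: reaches max_lr on the last warm-up step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # Cosine factor goes smoothly from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


if __name__ == "__main__":
    for s in (0, 50, 99, 100, 550, 999, 1000):
        print(s, lr_at_step(s))
```

In a PyTorch training loop, a function like this would typically be called once per optimizer step and its result written into each parameter group's `lr` before `optimizer.step()`. With gradient accumulation, that means once per accumulated batch, not once per micro-batch.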

Key Features

End-to-end PyTorch pipeline from data download through tokenization to text generation
Transformer architecture implemented from scratch following Attention Is All You Need
Scales from 13M parameter toy runs to 2B+ parameter configurations
Uses The Pile (825GB open corpus) with Zstandard decompression
tiktoken tokenization for GPT-3-compatible vocabulary
HDF5 storage for efficient large-dataset handling on disk
AdamW optimizer with a warm-up and decay schedule, plus explicitly implemented causal masking and gradient accumulation
Greedy and probabilistic (temperature, top-k, top-p) text generation
Runnable on a single GPU (T4 or RTX 3090 for the 13M configuration, A100-class hardware for the 2B+ configuration)
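
The generation options listed above (greedy, temperature, top-k, top-p) can be illustrated with a short, dependency-free sketch. This is not the project's code: the function names, the convention that `temperature == 0` means greedy decoding, and the choice to measure top-p on the unrenormalized probabilities that survive top-k are all assumptions made for this example. A PyTorch version would operate on logit tensors instead of Python lists.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Zero out tokens outside the top-k set and the top-p nucleus, then renormalize."""
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break  # smallest prefix whose mass reaches top_p
        order = kept
    total = sum(probs[i] for i in order)
    filtered = [0.0] * len(probs)
    for i in order:
        filtered[i] = probs[i] / total
    return filtered


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick the next token id from a list of logits (hypothetical helper)."""
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy (argmax) decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
    print(softmax(logits))
    print(filter_top_k_top_p(softmax(logits), top_k=2))
    print(sample_next_token(logits, temperature=0.8, top_k=3, top_p=0.9))
```

Lowering the temperature concentrates probability on the highest-scoring token, while top-k and top-p remove the low-probability tail before sampling. This is why the three settings are usually tuned together.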