Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
AirLLM is an Apache-2.0 licensed Python library by Gavin Li that lets a single 4GB consumer GPU run inference on 70B-parameter language models and, in more extreme configurations, on Llama 3.1 405B. With 18,900+ GitHub stars and 2,000+ forks, it has become the reference project for layer-wise sequential inference, a different memory strategy from the more common 4-bit and 2-bit quantization tricks used elsewhere.

## The Core Trick: Layer-Wise Sequential Inference

Most "run a giant model on a tiny GPU" projects rely on aggressive quantization (reducing weights from 16-bit to 4-bit or 2-bit) combined with CPU offload. AirLLM takes a different bet. At initialization it decomposes the model and saves each transformer layer separately to disk. At inference time, it loads one layer into VRAM, runs the forward pass for that layer on the current activations, then unloads it and brings in the next layer. Total VRAM needed is roughly one layer's weights plus activations, which for a 70B model fits in about 4GB.

The cost moves from VRAM capacity to disk I/O: inference is no longer bound by how much VRAM you have, but by how fast the SSD can deliver the next layer. This is a fundamentally different bottleneck from the one quantization-only approaches face, and it is the reason AirLLM can run a 405B model at all on consumer hardware.

## Model Coverage

The project exposes an `AutoModel` interface that auto-detects the architecture, with explicit support for Llama (up to Llama 3.1 405B), Mistral, Qwen and Qwen2.5, ChatGLM, Baichuan, and InternLM. This covers most of the open-weight models that engineers actually want to run locally.

## Compression as a Speed Optimization, Not a Memory Optimization

AirLLM also offers block-wise 4-bit and 8-bit quantization, but the framing is unusual: compression is treated as a speed optimization (smaller layers load from disk faster) rather than as the primary memory-saving mechanism.
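
To illustrate why block-wise quantization shrinks what has to be read from disk, here is a minimal sketch of symmetric per-block 4-bit quantization in plain Python. It is not AirLLM's implementation: the block size of 64, the rounding scheme, and the helper names `quantize_blockwise`, `dequantize_blockwise`, and `packed_size_bytes` are all assumptions made for illustration.

```python
BLOCK_SIZE = 64  # assumed block size, chosen only for this illustration


def quantize_blockwise(weights, block_size=BLOCK_SIZE):
    """Quantize floats to signed 4-bit codes with one scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block onto the code value 7.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        codes = [max(-7, min(7, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Reconstruct approximate floats from (scale, codes) pairs."""
    return [scale * c for scale, codes in blocks for c in codes]


def packed_size_bytes(blocks):
    """Size if stored as one float32 scale per block plus two codes per byte."""
    return sum(4 + (len(codes) + 1) // 2 for _, codes in blocks)


# Synthetic weights in [-1, 1]; real layers hold billions of values.
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(1024)]
blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)

fp16_bytes = 2 * len(weights)       # 2 bytes per weight at 16-bit
q_bytes = packed_size_bytes(blocks)  # 16 blocks * (4 + 32) bytes
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(fp16_bytes, q_bytes, max_err)
```

In this toy setup the packed payload is 576 bytes against 2,048 bytes at 16-bit, so each layer would need roughly 3.5x fewer bytes from disk. How much of that turns into end-to-end speedup depends on the disk, the dequantization cost, and the GPU.
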
The documentation cites roughly 3x inference speedup with minimal accuracy degradation when block-wise quantization is enabled. Layer prefetching, which overlaps loading the next layer with the current layer's computation, adds another ~10% on top.

## What It Is Actually Good For

The honest use case is batch inference and exploration, not interactive chat. Because every token's forward pass must stream all layers from disk, single-token latency is dominated by SSD bandwidth. With an NVMe SSD streaming the layers of a quantized 70B model, expect latency on the order of seconds per token, not tens of milliseconds.

This makes AirLLM well suited to evaluating a large model on a fixed test set, running a one-shot deep analysis where total cost matters more than latency, and learning what a 70B or 405B model can actually do without renting an H100. It is the wrong tool for production serving.

## Hardware Notes

NVIDIA GPUs are the primary target, but Apple silicon is supported on macOS and CPU-only inference works as a fallback. The biggest practical constraint after VRAM is disk space: layer-wise splitting roughly doubles disk usage for the model, and there is an option to delete the original weights after splitting to reclaim that space. A fast NVMe SSD is effectively required; running this off a SATA SSD or a hard drive multiplies already-slow latency.

## Limitations

Latency is the obvious one. Layer streaming defeats the speed assumptions that modern inference servers (vLLM, TGI, SGLang) are built around, so AirLLM is not interchangeable with those systems for serving. Some models lack native padding tokens and require manual configuration. KV cache management across the layer-streaming boundary is more complex than in standard inference engines, and longer context windows scale poorly compared to a properly resourced GPU. Finally, the initial decomposition step is disk-heavy and slow, which is acceptable as a one-time cost but surprising to first-time users.
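
The latency limitation can be estimated with back-of-envelope arithmetic: if each decoding step reads every layer file once, the time per step is at least the model's size on disk divided by the sustained read bandwidth. The sketch below assumes 0.5 bytes per parameter for 4-bit weights, 2 bytes for 16-bit, and hypothetical read rates of 3.5 GB/s (NVMe) and 0.5 GB/s (SATA SSD). These figures are illustrative assumptions, not measurements, and the estimate ignores compute time, prefetch overlap, and OS page caching.

```python
# Assumed sustained read rates, for illustration only.
NVME_GB_PER_S = 3.5
SATA_GB_PER_S = 0.5


def stream_seconds_per_step(params_billion, bytes_per_param, read_gb_per_s):
    """Lower bound on one decoding step when all weights stream from disk once."""
    model_gb = params_billion * bytes_per_param  # billions of params * bytes = GB
    return model_gb / read_gb_per_s


nvme_4bit = stream_seconds_per_step(70, 0.5, NVME_GB_PER_S)  # 35 GB at 3.5 GB/s
sata_4bit = stream_seconds_per_step(70, 0.5, SATA_GB_PER_S)  # 35 GB at 0.5 GB/s
nvme_fp16 = stream_seconds_per_step(70, 2.0, NVME_GB_PER_S)  # 140 GB at 3.5 GB/s

# If a batch of sequences shares one pass over the layers, the streaming
# cost per sequence drops in proportion to the batch size.
batch_size = 16
nvme_4bit_per_sequence = nvme_4bit / batch_size

print(nvme_4bit, sata_4bit, nvme_fp16, nvme_4bit_per_sequence)
```

Under these assumptions a 4-bit 70B model costs about 10 seconds per step on NVMe and about 70 seconds on SATA, which is consistent with the "seconds per token" framing and with batch workloads being the better fit.
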
Within its niche, however, AirLLM remains the cleanest open implementation of layer-wise sequential inference and the most practical way to put a 70B-class model on a 4GB GPU.
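
To make the layer-streaming idea concrete, here is a toy sketch in plain Python. It is not AirLLM's code: JSON files stand in for per-layer weight shards, Python lists stand in for GPU tensors, and the names `split_model`, `streamed_forward`, and `apply_layer` are hypothetical. It only shows the shape of the technique: a one-time split into per-layer files, then a forward pass that holds one layer's weights at a time.

```python
import json
import tempfile
from pathlib import Path


def apply_layer(weights, activations):
    """Toy layer: matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * a for w, a in zip(row, activations)))
            for row in weights]


def split_model(layers, out_dir):
    """One-time decomposition: write each layer to its own file."""
    paths = []
    for i, weights in enumerate(layers):
        path = Path(out_dir) / f"layer_{i:03d}.json"
        path.write_text(json.dumps(weights))
        paths.append(path)
    return paths


def streamed_forward(layer_paths, activations):
    """Forward pass that keeps only one layer's weights in memory at a time."""
    for path in layer_paths:
        weights = json.loads(path.read_text())  # load one layer from disk
        activations = apply_layer(weights, activations)
        del weights  # release it before the next layer is loaded
    return activations


# Tiny 3-layer toy model with arbitrary 2x2 weights.
layers = [
    [[1.0, 0.5], [-1.0, 2.0]],
    [[0.5, 0.0], [1.0, 1.0]],
    [[2.0, -1.0], [0.0, 1.0]],
]
x = [1.0, 2.0]

# Reference result with every layer resident in memory at once.
reference = x
for w in layers:
    reference = apply_layer(w, reference)

with tempfile.TemporaryDirectory() as tmp:
    paths = split_model(layers, tmp)
    streamed = streamed_forward(paths, x)

print(streamed == reference)
```

The streamed pass produces the same activations as the all-in-memory pass; only the peak weight footprint changes, at the price of one disk read per layer per forward pass.
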