Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
ExLlamaV3 is the third major generation of the ExLlama project, an MIT-licensed inference library for running quantized large language models efficiently on consumer-class GPUs. Maintained by turboderp, it has built a strong following among local-LLM enthusiasts who push single-GPU performance to its limits. The library introduces a new EXL3 quantization format and a rewritten inference runtime designed around modern consumer hardware like the RTX 4090 and RTX 5090.

## Why ExLlamaV3 Matters

Most high-profile inference engines target data center GPUs and multi-user serving. ExLlamaV3 is unapologetically focused on a different audience: individual users running a single large model on a single workstation GPU. For this audience, raw throughput matters less than fitting the largest possible model into available VRAM with acceptable quality, then generating tokens as fast as memory bandwidth allows.

With aggressive quantization, a Llama-3 70B model can fit on a single 24GB RTX 4090 and generate at memory-bound speeds. This unlocks frontier-class model quality for users without access to multi-GPU rigs or cloud accounts.

## EXL3 Quantization Format

ExLlamaV3 introduces EXL3, a streamlined variant of QTIP (Quantization with Trellis and Incoherence Processing) from Cornell RelaxML. The format supports fine-grained bits-per-weight settings, allowing users to dial in the precise tradeoff between model size and quality.

A conversion utility produces EXL3 weights from a HuggingFace checkpoint in a single step, taking a couple of minutes for smaller models and a few hours for 70B-class models on a single RTX 4090. Unlike formats that require offline calibration on a large dataset, EXL3 conversion is fast and reproducible, making it practical for users to quantize their own fine-tuned models.

## Marlin-Inspired GEMM Kernels

The inference runtime uses Marlin-inspired GEMM kernels that achieve roughly memory-bound latency at 4 bits per weight on the RTX 4090.
This means generation speed is limited by how fast weights can be read from VRAM rather than by arithmetic throughput, which is the theoretical optimum for decode-heavy workloads. The kernels also support mixed-precision activation handling to preserve quality on sensitive layers.

## Optimized Sampling and Speculative Decoding

ExLlamaV3 includes a fast sampler implementation supporting temperature, top-k, top-p, min-p, and DRY repetition penalty, all running on GPU to avoid CPU round trips. Speculative decoding with a smaller draft model is supported natively, providing additional generation speedups for workloads that can tolerate the memory overhead of a second model.

## Long Context Without Sacrificing Speed

The runtime handles long context efficiently through paged attention and tensor-parallel attention computation across multiple GPUs when available. RoPE scaling techniques including dynamic NTK and YaRN are supported, letting users extend Llama-3 and similar models well beyond their native context length without retraining.

## Python and OpenAI-Compatible Server

ExLlamaV3 ships with a clean Python API for embedding in custom applications, plus an OpenAI-compatible HTTP server for use with existing client tooling like OpenWebUI, SillyTavern, and LM Studio. The server supports streaming, function calling, and multi-user request batching despite the project's single-user heritage.

## Active Community Ecosystem

The ExLlama ecosystem includes a vibrant community on r/LocalLLaMA and Discord that produces quantized weights for new model releases within hours of their announcement. Pre-quantized EXL3 files for popular models are widely available on HuggingFace, making it trivial for users to try new architectures without running their own conversions.

## Limitations

ExLlamaV3 is optimized for NVIDIA GPUs and does not currently support AMD or Apple Silicon.
The project's single-user, single-node focus means features expected in production serving stacks, such as cluster-wide autoscaling, distributed KV caching, and multi-tenant SLO management, are out of scope. Documentation is improving but remains primarily aimed at technically sophisticated users comfortable reading source code.
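
The sizing claims earlier in this article (a 70B model fitting in 24GB, and decode speed bounded by memory bandwidth) reduce to simple arithmetic. The sketch below is a back-of-the-envelope estimate, not part of ExLlamaV3 or its API. The function names are hypothetical, the 1.5 GB allowance for KV cache and runtime buffers is an assumed placeholder, and the 1008 GB/s figure is the RTX 4090's nominal memory bandwidth, which real workloads will not fully reach.

```python
# Back-of-the-envelope sizing for a quantized model on a single GPU.
# All names here are illustrative; none of this calls ExLlamaV3.

def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_vram(n_params: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float) -> bool:
    """True if weights plus an assumed overhead (KV cache, buffers) fit in VRAM."""
    return weight_footprint_gb(n_params, bits_per_weight) + overhead_gb <= vram_gb


def decode_ceiling_tokens_per_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed, assuming each generated
    token requires reading every weight from VRAM exactly once."""
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    N_PARAMS = 70e9          # 70B-class model
    VRAM_GB = 24.0           # nominal RTX 4090 capacity
    OVERHEAD_GB = 1.5        # assumed allowance, varies with context length
    BANDWIDTH_GB_S = 1008.0  # nominal RTX 4090 memory bandwidth

    for bpw in (2.0, 2.5, 3.0, 4.0):
        size = weight_footprint_gb(N_PARAMS, bpw)
        ok = fits_in_vram(N_PARAMS, bpw, VRAM_GB, OVERHEAD_GB)
        ceiling = decode_ceiling_tokens_per_s(size, BANDWIDTH_GB_S)
        print(f"{bpw:.1f} bpw: {size:6.2f} GB weights, "
              f"fits={ok}, decode ceiling ~{ceiling:.1f} tok/s")
```

Under these assumptions, a 70B model needs about 21.9 GB at 2.5 bits per weight and 35 GB at 4 bits per weight, which is why fitting it on a 24GB card depends on low bits-per-weight settings. The decode ceiling is an idealized bound; measured speeds will be lower.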