Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
LMDeploy is the Shanghai AI Laboratory's open-source toolkit for compressing, deploying, and serving large language models. Its v0.13.0 release on May 12, 2026 has pulled the project back onto GitHub Trending after two years of steady production use. With 7,872 stars, 699 forks, and Apache 2.0 licensing, LMDeploy sits squarely between research-grade engines and the hyperscaler-only TensorRT-LLM stack: it is fast enough to run alongside vLLM for InternLM, Qwen, Llama, and DeepSeek workloads, and it is one of the few engines with first-class support for Huawei Ascend, Cambricon, and Maca accelerators in addition to NVIDIA CUDA.

## TurboMind: A Pure C++ Inference Engine

The engine at the core of LMDeploy is TurboMind, a C++ runtime that began as a fork of NVIDIA FasterTransformer and was then heavily rewritten to support paged attention, persistent batching, tensor parallelism, and dynamic batch sizes. Because TurboMind is C++ rather than Python, its latency floor is meaningfully lower than that of a typical Python-orchestrated engine. The v0.13.0 scheduler refinements specifically prevent prefill starvation under high decode load, which is the most common failure mode for engines serving long-context chat traffic.

## Comprehensive Quantization Story

LMDeploy supports a wider range of quantization formats than almost any competitor in a single package. The toolkit offers weight-only INT4 and INT8 quantization, KV-cache quantization at INT4 and INT8 precision, 4-bit AWQ for compressed weights, and MXFP4 for Blackwell-class hardware. The v0.13.0 release adds TurboQuant with the new quant_policy=42 mode, which extends KV-cache quantization to longer contexts without the quality loss that plagued earlier INT4 KV implementations.

## Multi-Accelerator and Multi-Model Coverage

NVIDIA remains the primary backend, but v0.13.0 confirms LMDeploy as a serious cross-silicon engine.
Huawei Ascend, AMD ROCm, Cambricon MLU, and MetaX MACA are all secondary backends with active maintenance. Model coverage now includes Qwen3.5 MoE with Blackwell-specific optimization via cublasGemmGroupedBatchedEx, InternS2 in preview, the full InternLM family, Llama 2 and 3, CodeLlama, Mistral, and DeepSeek. Multimodal pipelines for InternVL and similar models are integrated rather than bolted on.

## API Compatibility and Operational Polish

The v0.13.0 release adds Anthropic-compatible serving endpoints alongside the existing OpenAI-compatible API, which lets LMDeploy act as a drop-in replacement for either provider in client code. The release also adds session identity preservation, which maps user-supplied session IDs to internal ones for stable conversational memory, along with an improved XML tool-call parser abstraction and configurable kernel block sizes for tuning throughput on specific accelerators. Sixteen bug fixes target MTP (multi-token prediction) issues, cache sizing edge cases, and non-stream token processing errors that had affected production users.

## Production Heritage With an Open License

LMDeploy was built inside Shanghai AI Lab to serve the InternLM model family at scale, and the production-first design choices show throughout the codebase. Documentation lives at lmdeploy.readthedocs.io and covers benchmarking, distributed deployment, quantization recipes, and multi-accelerator setup with the same depth as commercial inference platforms. For teams that want vLLM-class performance on NVIDIA hardware plus the option to run on Ascend or AMD silicon without changing their serving stack, LMDeploy v0.13.0 is the strongest vendor-independent choice currently available under an Apache 2.0 license.