Reviews AI Tools Open Source Live News AI Official

Open Source

Explore the latest AI open-source projects from GitHub and HuggingFace.

LMDeploy - Open Source | Evermx | Evermx

Back to Open Source

TrendingFeatured

LMDeploy

Shanghai AI Laboratory (InternLM)Apache-2.0

View on GitHub

Inference7.9K Stars699 Forks112 views

LMDeploy is the Shanghai AI Laboratory's open-source toolkit for compressing, deploying, and serving large language models, and the v0.13.0 release on May 12, 2026 has pulled the project back onto GitHub Trending after two years of steady production use. With 7,872 stars, 699 forks, and Apache 2.0 licensing, LMDeploy sits squarely between research-grade engines and the hyperscaler-only TensorRT-LLM stack: it is fast enough to run alongside vLLM for InternLM, Qwen, Llama, and DeepSeek workloads, and it is one of the few engines with first-class support for Huawei Ascend, Cambricon, and Maca accelerators in addition to NVIDIA CUDA. ## TurboMind: A Pure C++ Inference Engine The engine at the core of LMDeploy is TurboMind, a from-scratch C++ runtime that was forked from NVIDIA FasterTransformer and then heavily rewritten to support paged attention, persistent batching, tensor parallelism, and dynamic batch sizes. Because TurboMind is C++ rather than Python, the latency floor is meaningfully lower than the typical Python-orchestrated engine, and the v0.13.0 scheduler refinements specifically prevent prefill starvation under high decode load, which is the most common failure mode for engines serving long-context chat traffic. ## Comprehensive Quantization Story LMDeploy supports a wider range of quantization formats than almost any competitor in a single package. The toolkit offers weight-only INT4 and INT8 quantization, KV-cache quantization at INT4 and INT8 precision, 4-bit AWQ for compressed weights, and MXFP4 for Blackwell-class hardware. The v0.13.0 release adds TurboQuant with the new quant_policy=42 mode, which extends KV cache quantization to longer contexts without the quality loss that plagued earlier INT4 KV implementations. ## Multi-Accelerator and Multi-Model Coverage NVIDIA remains the primary backend, but v0.13.0 confirms LMDeploy as a serious cross-silicon engine. Huawei Ascend, AMD ROCm, Cambricon MLU, and Apple Maca on macOS are all secondary backends with active maintenance. Model coverage now includes Qwen3.5 MoE with Blackwell-specific optimization via cublasGemmGroupedBatchedEx, InternS2 in preview, the full InternLM family, Llama 2 and 3, CodeLlama, Mistral, and DeepSeek. Multimodal pipelines for InternVL and friends are integrated rather than bolted on. ## API Compatibility and Operational Polish The v0.13.0 release adds Anthropic-compatible serving endpoints alongside the existing OpenAI-compatible API, which makes LMDeploy a drop-in replacement for either provider in client code. The release also adds session identity preservation that maps user-supplied session IDs to internal ones for stable conversational memory, an improved XML tool-call parser abstraction, and configurable kernel block sizes for fine-tuning throughput on specific accelerators. Sixteen bug fixes target MTP (multi-token prediction) issues, cache sizing edge cases, and non-stream token processing errors that had bothered production users. ## Production Heritage With an Open License LMDeploy was built inside Shanghai AI Lab to serve the InternLM model family at scale, and the production-first design choices show throughout the codebase. Documentation lives at lmdeploy.readthedocs.io and covers benchmarking, distributed deployment, quantization recipes, and multi-accelerator setup with the same depth as commercial inference platforms. For teams that want vLLM-class performance on NVIDIA hardware plus the option to run on Ascend or AMD silicon without changing their serving stack, LMDeploy v0.13.0 is the strongest single-vendor-independent choice currently available under an Apache 2.0 license.

Key Features

TurboMind: pure C++ inference engine with paged attention, persistent batching, and tensor parallelism
Comprehensive quantization: weight-only INT4/INT8, KV-cache INT4/INT8, 4-bit AWQ, MXFP4 for Blackwell
v0.13.0 TurboQuant (quant_policy=42) for long-context KV cache quantization without quality loss
Multi-accelerator support: NVIDIA CUDA primary, Huawei Ascend, AMD ROCm, Cambricon MLU, Apple Maca
Qwen3.5 MoE optimization on Blackwell via cublasGemmGroupedBatchedEx
Anthropic-compatible and OpenAI-compatible serving endpoints
Session identity preservation for stable conversational memory across requests
Scheduler refinement that prevents prefill starvation under high decode load

Related Projects

TrendingInference

GitHub

165.0K15.0K

Ollama

ollama

MIT291

Open Source

LMDeploy

Key Features

Tags

Related Projects

Ollama

llama.cpp

vLLM

Unsloth