xLLM - Open Source | Evermx

xLLM

JD.com | Apache-2.0

Inference | 1.3K Stars | 209 Forks | 75 views

xLLM is JD.com's open-source, high-performance LLM inference engine, designed from the ground up for the heterogeneous accelerator landscape that has emerged outside Nvidia. With 1,297 stars, 209 forks, and 933 commits since its August 2025 release, the Apache 2.0 project is now JD's production inference layer for intelligent customer service, risk control, supply chain optimization, and ad recommendation across the retailer's online business. The codebase is 91.6 percent C++, with smaller CUDA and Python layers, and the project explicitly targets LLMs, VLMs, diffusion transformers, and recommendation models from a single runtime.

## Built for Chinese Accelerators First

xLLM's distinguishing feature is breadth of hardware support. The runtime ships with first-party kernels and schedulers for Huawei Ascend NPUs (A2 and A3, HDK Driver 25.2.0 and later), Cambricon MLUs (ILU and BI150), and MThreads MUSA (S5000), in addition to standard Nvidia CUDA paths. For Chinese enterprises navigating export controls, xLLM is one of the few production-grade inference frameworks that treats domestic silicon as a first-class deployment target rather than a forked experiment.

## Service-Engine Decoupled Architecture

The project separates the service layer (request routing, queueing, batching) from the engine layer (model execution, KV management) so each can be scaled and replaced independently. Pipeline execution orchestration uses asynchronous request scheduling to minimize idle compute, runs parallel computation and communication at the model graph layer, and pipelines work across heterogeneous units in the same node. The result is steady throughput even when request shapes vary widely, which is the dominant pattern in customer-service and recommendation traffic.

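As a rough illustration of the service-engine split, the sketch below keeps queueing and batching in one class and model execution in another, connected only by a `run_batch` call. It is a minimal Python sketch under stated assumptions, not xLLM's actual code or API: `Service`, `EchoEngine`, `run_batch`, and `max_batch` are hypothetical names, and the "model" simply upper-cases its input.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    future: asyncio.Future  # resolved by the service once the engine returns


class EchoEngine:
    """Hypothetical engine layer: executes one batch and returns one output per prompt."""

    async def run_batch(self, prompts):
        await asyncio.sleep(0)  # placeholder for accelerator work
        return [p.upper() for p in prompts]


class Service:
    """Hypothetical service layer: queues requests and forms batches.

    It only depends on the engine exposing run_batch(), so either side
    can be swapped without touching the other.
    """

    def __init__(self, engine, max_batch=4):
        self.engine = engine
        self.max_batch = max_batch
        self.queue = asyncio.Queue()
        self.batch_sizes = []  # recorded for inspection only

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put(Request(prompt, fut))
        return await fut

    async def serve(self):
        while True:
            # Block for the first request, then drain whatever else is waiting.
            batch = [await self.queue.get()]
            while len(batch) < self.max_batch and not self.queue.empty():
                batch.append(self.queue.get_nowait())
            outputs = await self.engine.run_batch([r.prompt for r in batch])
            self.batch_sizes.append(len(batch))
            for req, out in zip(batch, outputs):
                req.future.set_result(out)


async def demo():
    service = Service(EchoEngine(), max_batch=4)
    worker = asyncio.create_task(service.serve())
    results = await asyncio.gather(*(service.submit(f"req{i}") for i in range(6)))
    worker.cancel()
    return results, service.batch_sizes
```

Calling `asyncio.run(demo())` returns the six outputs in submission order along with the batch sizes the service formed. Replacing `EchoEngine` with any object that provides an async `run_batch` leaves the service code unchanged, which is the point of the decoupling.
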
## Dynamic Shape and Memory Management

xLLM implements parameterized shape adaptation with multi-graph caching so the same compiled artifact can serve a wide range of batch sizes and sequence lengths without recompilation stalls. Memory is managed through controlled tensor pools that map discrete chunks to a continuous virtual address space, allocate on demand to reduce fragmentation, and feed an intelligent page scheduler that maximizes reuse. Custom operators including PageAttention and AllReduce are integrated directly rather than pulled in from external libraries, keeping the latency profile predictable.

## Speculative Decoding and Mooncake-Style KV Cache

The engine implements speculative decoding using multi-core parallelism rather than the typical draft-model approach, dynamic MoE expert load balancing for models like DeepSeek V4 and the GLM 4.5 to 5 series, and a hybrid KV cache management system based on the Mooncake framework. The hybrid cache mixes hot in-GPU residency with offloaded host-memory tiers, which is what makes the engine practical for serving long-context VLM-R1 and DeepSeek-class models under realistic enterprise traffic.

## Production Heritage and Open Governance

Unlike many open inference projects that begin as research prototypes, xLLM started inside JD.com Retail and was open-sourced after it had already been validated on the company's largest workloads. That production heritage is visible in the documentation, which is hosted at xllm.readthedocs.io and covers deployment topology, observability, and accelerator-specific tuning rather than only benchmark numbers. The Apache 2.0 license and active issue tracker make it a credible choice for any organization that needs to deploy frontier LLMs on non-Nvidia accelerators without locking into a single vendor's proprietary stack.
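
For readers who want a concrete picture of the hybrid KV cache described earlier (hot blocks resident on the accelerator, colder blocks offloaded to host memory), here is a minimal two-tier sketch in Python. It is an illustration under assumptions, not the Mooncake or xLLM implementation: the class name `TwoTierKVCache`, the least-recently-used eviction policy, and the use of plain dictionaries in place of device and host buffers are all hypothetical choices.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Hypothetical hybrid cache: a small hot tier backed by a larger host tier."""

    def __init__(self, device_capacity):
        self.device_capacity = device_capacity
        self.device = OrderedDict()  # hot tier, ordered from least to most recently used
        self.host = {}               # offloaded tier (stands in for host memory)

    def put(self, block_id, kv):
        # A block lives in exactly one tier at a time.
        self.host.pop(block_id, None)
        self.device[block_id] = kv
        self.device.move_to_end(block_id)
        self._spill()

    def get(self, block_id):
        if block_id in self.device:
            self.device.move_to_end(block_id)  # mark as recently used
            return self.device[block_id]
        if block_id in self.host:
            kv = self.host.pop(block_id)
            self.put(block_id, kv)             # promote back to the hot tier
            return kv
        return None                            # miss: caller must recompute

    def _spill(self):
        # Offload least recently used blocks until the hot tier fits.
        while len(self.device) > self.device_capacity:
            victim, kv = self.device.popitem(last=False)
            self.host[victim] = kv
```

With a hot-tier capacity of two, inserting blocks `a`, `b`, and `c` offloads `a` to the host tier; a later `get("a")` promotes it back and offloads `b` instead. A production system would move real tensors between memory spaces and would likely overlap those transfers with compute.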

Key Features

Multi-accelerator support: Huawei Ascend NPU A2/A3, Cambricon MLU ILU/BI150, MThreads MUSA S5000, Nvidia CUDA
Service-engine decoupled architecture with asynchronous request scheduling
Parameterized shape adaptation with multi-graph caching to avoid recompilation stalls
Controlled tensor memory pools with discrete-to-continuous address mapping
Custom PageAttention and AllReduce operators integrated into the runtime
Speculative decoding via multi-core parallelism and dynamic MoE expert load balancing
Mooncake-based hybrid KV cache management across GPU and host tiers
Production-validated on JD Retail customer service, risk control, and ad recommendation workloads
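
The multi-graph caching feature listed above can be pictured as a lookup table of compiled graphs keyed by a bucketed shape. The sketch below is a minimal Python illustration under assumptions: rounding batch size and sequence length up to the next power of two is one common bucketing strategy and is not confirmed as xLLM's, and `GraphCache`, `bucket`, and `compile_fn` are hypothetical names.

```python
class GraphCache:
    """Hypothetical cache of compiled graphs, one per (batch, seq_len) bucket."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn  # called as compile_fn(batch_bucket, seq_bucket)
        self.graphs = {}
        self.compiles = 0             # counts how often compilation was needed

    @staticmethod
    def bucket(n):
        """Round n up to the next power of two (minimum 1)."""
        return 1 << max(n - 1, 0).bit_length()

    def lookup(self, batch, seq_len):
        key = (self.bucket(batch), self.bucket(seq_len))
        if key not in self.graphs:
            # Only the first request in a bucket pays the compile cost.
            self.graphs[key] = self.compile_fn(*key)
            self.compiles += 1
        return key, self.graphs[key]
```

Requests with shapes (3, 100) and (4, 120) both fall into the (4, 128) bucket and share one compiled graph, at the cost of padding inputs up to the bucket size. Coarser buckets mean fewer compilations but more wasted compute on padding.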

Related Projects

Ollama (ollama) | MIT | Inference | Trending | GitHub: 165.0K / 15.0K | 291

llama.cpp

vLLM

Unsloth