xLLM

jd-opensource · Apache-2.0

Inference · 1.2K Stars · 95 Forks · 153 views

xLLM is JD.com's open-source, high-performance LLM inference engine, optimized for Chinese AI accelerators including Ascend NPU, Cambricon MLU, Moore Threads MUSA, and Iluvatar BI150. The project reports 2.2x the throughput of vLLM-Ascend on Qwen-series models under identical TPOT (time per output token) constraints, which it attributes to a service-engine decoupled architecture, full graph pipeline execution, global KV cache management, and dynamic MoE expert load balancing. It ran in JD.com's production retail AI systems before its open-source release.

Key Features

Service-engine decoupled architecture for independent scaling of serving and compute layers
2.2x the throughput of vLLM-Ascend on Qwen-series models under identical TPOT (time per output token) constraints
Supports Ascend NPU (A2/A3), Cambricon MLU, Moore Threads MUSA (S5000), Iluvatar BI150
Global KV cache management with hierarchical offloading and on-demand allocation
Dynamic MoE expert load balancing and speculative decoding acceleration
Day-0 model support: DeepSeek V3/R1, Qwen3, GLM-5, Llama3, and more
Production-deployed at JD.com across customer service, risk control, and ad recommendation
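
To make the "dynamic MoE expert load balancing" feature more concrete, here is a minimal sketch of one generic way to rebalance experts across devices: a greedy heaviest-first placement driven by observed per-expert load. This is an illustration only, not xLLM's implementation or API; the function names `balance_experts` and `device_loads`, the expert labels, and the token counts are all hypothetical.

```python
import heapq


def balance_experts(expert_load, num_devices):
    """Greedy placement: heaviest expert first, onto the least-loaded device.

    Hypothetical helper for illustration; not part of xLLM.
    """
    # Min-heap of (total load so far, device id).
    heap = [(0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    placement = {d: [] for d in range(num_devices)}
    # Sort by descending load; break ties by name for deterministic output.
    for expert, load in sorted(expert_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, dev = heapq.heappop(heap)
        placement[dev].append(expert)
        heapq.heappush(heap, (total + load, dev))
    return placement


def device_loads(placement, expert_load):
    """Sum the observed load of the experts assigned to each device."""
    return {d: sum(expert_load[e] for e in experts) for d, experts in placement.items()}


# Assumed token counts routed to each expert over a recent window.
observed = {"e0": 900, "e1": 100, "e2": 500, "e3": 450, "e4": 50, "e5": 400}
placement = balance_experts(observed, num_devices=2)
loads = device_loads(placement, observed)
print(placement)
print(loads)
```

With these assumed counts, a naive contiguous split (e0 to e2 on one device, e3 to e5 on the other) would give loads of 1500 and 900, while the greedy placement narrows the gap to 1300 and 1100. A production system would re-run a step like this periodically as routing statistics drift.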

Related Projects

Ollama

llama.cpp

vLLM

Unsloth