
KTransformers

kvcache-ai · Apache-2.0

Inference · 17.3K Stars · 1.3K Forks · 5 views

KTransformers is a flexible open-source framework for cutting-edge LLM inference and fine-tuning optimizations built on CPU-GPU heterogeneous computing. Maintained by the kvcache-ai team and now past 17,000 GitHub stars, its core promise is making frontier-scale Mixture-of-Experts (MoE) models runnable on hardware that would otherwise be far too small for them. By splitting work between a single GPU and the host CPU and memory, KTransformers lets developers serve models that nominally require multiple high-end accelerators.

## Heterogeneous CPU-GPU Inference

The framework's signature technique is expert offloading for MoE architectures: dense, frequently used weights stay on the GPU, while the many sparse expert weights are placed in CPU RAM and computed with optimized kernels. Recent releases add CPU-GPU expert scheduling, three-layer (GPU-CPU-disk) prefix-cache reuse, and multi-concurrency serving, squeezing usable throughput out of commodity machines rather than data-center clusters.

## Day-0 Support for Frontier MoE Models

KTransformers tracks the open-model frontier aggressively, shipping day-0 or near-day-0 support for large releases such as DeepSeek-V4-Flash, Kimi-K2, GLM-5, MiniMax-M3, and Qwen3-Next. This makes it a common first stop for enthusiasts who want to run the newest giant MoE checkpoints locally before broader engine support lands.

## Beyond Inference: Fine-Tuning

The project is not inference-only. Through an integration with LLaMA-Factory, it exposes supervised fine-tuning (SFT) and RL-DPO workflows on the same heterogeneous backend, so the hardware that runs a model can also adapt it. Tutorials cover unified rent-and-run training plus inference pipelines for very large models.

## Hardware Flexibility

Supported backends are unusually broad: NVIDIA CUDA, AMD ROCm, Intel Arc XPU, and Ascend NPU, plus AMX-Int8/BF16 acceleration on capable Intel CPUs and an AVX2-only path for older processors.
It also handles low-bit and hybrid quantized weights, including Unsloth 1.58/2.51-bit and FP8 formats, to fit larger contexts and models into constrained memory.

## Considerations

KTransformers describes itself as a research project, so setup is more involved than with turnkey servers, and performance depends heavily on matching the right kernel, quantization, and offload strategy to your specific CPU and GPU. Documentation is extensive but fast-moving. For developers determined to run or fine-tune the largest open MoE models on limited hardware, however, KTransformers is one of the most capable and actively developed options available.
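To make the quantization and offload trade-off concrete, the following is a rough back-of-the-envelope sketch, not part of KTransformers. It estimates whether the dense weights fit in VRAM and the expert weights fit in host RAM at a given bit width. The `MoEModel` class, the `plan_offload` helper, and all parameter counts are hypothetical and chosen only for illustration; real checkpoints also need room for quantization scales, the KV cache, and activations, which this estimate ignores.

```python
from dataclasses import dataclass


@dataclass
class MoEModel:
    """Hypothetical size description of an MoE checkpoint (not a real model)."""
    dense_params_b: float   # billions of always-active parameters (attention, shared layers)
    expert_params_b: float  # billions of parameters across all routed experts


def weights_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring scales and other metadata."""
    total_bytes = params_b * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def plan_offload(model: MoEModel, dense_bits: float, expert_bits: float,
                 vram_gib: float, ram_gib: float) -> dict:
    """Check a simple split: dense weights on the GPU, expert weights in CPU RAM."""
    dense = weights_gib(model.dense_params_b, dense_bits)
    experts = weights_gib(model.expert_params_b, expert_bits)
    return {
        "dense_gib": round(dense, 1),
        "expert_gib": round(experts, 1),
        "dense_fits_gpu": dense <= vram_gib,
        "experts_fit_ram": experts <= ram_gib,
    }


if __name__ == "__main__":
    # Assumed sizes for illustration only: 20B dense, 650B in routed experts.
    model = MoEModel(dense_params_b=20, expert_params_b=650)
    for bits in (8, 2.51, 1.58):
        print(bits, plan_offload(model, dense_bits=8, expert_bits=bits,
                                 vram_gib=24, ram_gib=256))
```

Under these assumed sizes, lowering the expert bit width is what moves the expert weights from "does not fit" to "fits" in host RAM, which is why the choice of quantization and the offload split have to be made together.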

Key Features

CPU-GPU heterogeneous inference to run huge MoE models on limited VRAM
MoE expert offloading with CPU-GPU expert scheduling
Day-0 support for frontier models (DeepSeek-V4, Kimi-K2, GLM-5, MiniMax-M3)
Low-bit and hybrid quantization, including Unsloth 1.58/2.51-bit and FP8
Supervised fine-tuning and RL-DPO via LLaMA-Factory integration
Broad hardware backends: NVIDIA, AMD ROCm, Intel Arc XPU, Ascend NPU, and AVX2 CPUs
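The idea behind the first two features, keeping small, always-used weights on the GPU while sparse expert weights live in CPU RAM, can be illustrated with a toy MoE layer. This is a minimal pure-Python sketch, not KTransformers code or its API: the `ToyOffloadedMoE` class is hypothetical, the "gpu" and "cpu" placements are plain labels, and the experts are tiny random matrices. It only shows that each token touches just `top_k` of the experts, which is what makes holding the rest in slower memory workable.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


class ToyOffloadedMoE:
    """Toy MoE layer with a placement map; devices here are labels, not real hardware."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: small and evaluated for every token, so labeled GPU-resident.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Experts: the bulk of the parameters, sparsely used, so labeled CPU-resident.
        self.experts = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
                        for _ in range(num_experts)]
        self.placement = {"router": "gpu", "experts": "cpu"}
        self.cpu_expert_calls = 0

    def route(self, x):
        """Return (expert_index, gate_weight) pairs for the top-k experts."""
        scores = [dot(w, x) for w in self.router]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:self.top_k]
        # Softmax over the selected scores only, shifted for numerical stability.
        peak = max(scores[i] for i in top)
        exps = [math.exp(scores[i] - peak) for i in top]
        total = sum(exps)
        return [(i, e / total) for i, e in zip(top, exps)]

    def forward(self, x):
        """Mix the outputs of the selected experts; unselected experts are never read."""
        out = [0.0] * len(x)
        for idx, gate in self.route(x):
            self.cpu_expert_calls += 1  # stands in for a CPU-side expert kernel call
            y = [dot(row, x) for row in self.experts[idx]]
            out = [o + gate * v for o, v in zip(out, y)]
        return out


if __name__ == "__main__":
    moe = ToyOffloadedMoE(dim=4, num_experts=8, top_k=2)
    print(moe.forward([1.0, 0.5, -0.5, 2.0]))
    print("expert calls:", moe.cpu_expert_calls, "of", len(moe.experts), "experts")
```

In a real system the scheduling question is which experts to compute where and when; this sketch deliberately leaves that out and fixes all experts on the CPU side.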

Related Projects


Ollama

llama.cpp

vLLM

Unsloth