KTransformers

kvcache-ai · Apache-2.0

Inference · 16.9K Stars · 1.3K Forks · 143 views

KTransformers is a flexible, Python-centric framework for heterogeneous CPU-GPU LLM inference and fine-tuning. It aims to let consumer-grade hardware run ultra-large models such as DeepSeek-R1-671B. It combines Intel AMX/AVX acceleration, NUMA-aware Mixture-of-Experts (MoE) placement, and multi-GPU coordination to lower the hardware needed for deployment: the project reports fine-tuning a 671B-parameter model with 70GB of GPU memory plus 1.3TB of RAM. The framework supports INT4/INT8 quantization, prefix caching, and a range of leading models, including the DeepSeek, Qwen3, Kimi-K2, and GLM series.
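For a sense of scale, the sketch below estimates the weight footprint of a 671B-parameter model at a few precisions. It is back-of-envelope arithmetic, not a figure from the project: it assumes decimal units (1 GB = 1e9 bytes) and counts weights only, ignoring activations, optimizer state, and KV cache. The precision labels and the helper name `weight_footprint_gb` are illustrative. At 2 bytes per parameter the weights alone come to about 1.3TB, which is in the same range as the RAM figure quoted above, though the listing does not say how that figure was derived.

```python
# Back-of-envelope weight footprint for a 671B-parameter model.
# Assumptions: decimal units (1 GB = 1e9 bytes); weights only, so
# activations, optimizer state, and KV cache are not counted.
PARAMS = 671e9

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the storage needed for the weights, in GB."""
    return num_params * bytes_per_param / 1e9


footprints = {
    name: weight_footprint_gb(PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}

for name, gb in footprints.items():
    print(f"{name}: {gb:,.1f} GB")
```

Under these assumptions even the INT4 weights (about 335GB) exceed the memory of any single consumer GPU, which is why splitting work between CPU RAM and GPU memory matters for models of this size.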

Key Features

Heterogeneous CPU-GPU computing for running ultra-large LLMs on consumer hardware
Intel AMX/AVX2 acceleration for optimized CPU inference
NUMA-aware Mixture-of-Experts (MoE) expert placement
INT4/INT8 quantization with GPTQ support on GPU
Multi-GPU coordination for distributed inference
Support for DeepSeek-V3/R1, Qwen3, Kimi-K2, GLM, and Llama models
Fine-tuning of 671B-parameter models with 70GB of GPU memory plus 1.3TB of RAM
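To illustrate what INT8 quantization means in the feature list above, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic textbook scheme, not KTransformers' kernels and not GPTQ, which chooses quantized values in a more elaborate way. The function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [x * scale for x in q]


weights = [0.50, -1.27, 0.03, 1.00]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding to the nearest step bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one byte plus a shared scale, so storage drops to roughly a quarter of 32-bit floats at the cost of a bounded rounding error.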

Related Projects

Ollama

llama.cpp

vLLM

Unsloth