TokenSpeed - Open Source | Evermx

Trending

TokenSpeed

lightseekorg | MIT

Inference | 1.3K Stars | 134 Forks | 71 views

TokenSpeed is LightSeek Foundation's MIT-licensed LLM inference engine, built from the start to chase TensorRT-LLM-level performance while keeping vLLM-level usability. It drew 1,300+ GitHub stars in its first month, and a public benchmark shows roughly 9% lower min-latency and 11% higher throughput than TensorRT-LLM on Kimi K2.5 at 100 TPS/User on NVIDIA B200. That makes it one of the more aggressive new entrants in the high-performance inference space, and it has positioned itself specifically around agentic workloads rather than chat throughput.

## Why Another Inference Engine

The project's core thesis is that existing inference engines optimize either for raw throughput on long generations (vLLM) or for vendor-tuned latency on specific GPUs (TensorRT-LLM), but neither was designed around the shape of modern agent traffic: many short turns, tight reasoning loops, frequent KV cache churn, and bursty tool-call patterns. TokenSpeed rebuilds the runtime to match that shape, with kernel scheduling, KV management, and request lifecycle handling all oriented around low-latency, high-concurrency agent serving.

## Local-SPMD Modeling Layer

TokenSpeed introduces Local-SPMD modeling, where a static compiler reads per-tensor annotations and automatically generates the collective communication and sharding logic needed across GPUs. Model authors describe parallelism intent rather than hand-coding the all-reduces and all-gathers, and the compiler emits the right communication pattern for the target topology. As a result, adding tensor or pipeline parallelism to a new model becomes a configuration exercise rather than a rewrite.

## C++/Python Hybrid Scheduler

The scheduler is split into a C++ control plane and a Python execution plane. The C++ side runs a finite-state machine over the request lifecycle, enforcing KV cache safety at compile time through the type system. The Python side handles model dispatch and high-level orchestration.
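To make the finite-state-machine idea concrete, here is a minimal Python sketch of a request lifecycle with KV-block ownership tied to state. It is an illustration only, not TokenSpeed's scheduler: the state names, the `Request` class, and the `advance` method are all hypothetical, and where the text says the C++ side enforces KV safety at compile time through the type system, this sketch can only check the same invariant at runtime.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical lifecycle states; the real state set is not given in the text.
    WAITING = auto()
    PREFILL = auto()
    DECODE = auto()
    FINISHED = auto()
    ABORTED = auto()


# Allowed transitions; anything not listed here is rejected.
TRANSITIONS = {
    State.WAITING: {State.PREFILL, State.ABORTED},
    State.PREFILL: {State.DECODE, State.ABORTED},
    State.DECODE: {State.DECODE, State.FINISHED, State.ABORTED},
    State.FINISHED: set(),
    State.ABORTED: set(),
}

# States in which a request is allowed to hold KV cache blocks.
HOLDS_KV = {State.PREFILL, State.DECODE}


class Request:
    def __init__(self, request_id):
        self.request_id = request_id
        self.state = State.WAITING
        self.kv_blocks = []

    def advance(self, new_state, free_blocks):
        """Move to new_state, taking or returning KV blocks from free_blocks."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        if new_state in HOLDS_KV and not self.kv_blocks:
            if not free_blocks:
                raise RuntimeError("no free KV blocks")
            self.kv_blocks.append(free_blocks.pop())
        elif new_state not in HOLDS_KV:
            # Leaving the KV-holding states returns every block to the pool.
            free_blocks.extend(self.kv_blocks)
            self.kv_blocks.clear()
        self.state = new_state
```

Driving a request through `PREFILL`, `DECODE`, and `FINISHED` against a shared list of free block ids leaves the pool whole again, and any attempt to leave a terminal state raises `ValueError`. A typed C++ design could make such illegal transitions fail to compile instead.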
This split is what lets TokenSpeed claim low CPU-side overhead through its AsyncLLM entrypoint, which keeps the hot scheduling path off the Python GIL while still giving researchers a Python-shaped API.

## Pluggable Kernel Registry

Kernels live in a centralized registry with a portable public API, which makes it possible to ship multiple backends behind one entrypoint. The headline kernel is its MLA (Multi-head Latent Attention) implementation, which the team describes as one of the fastest on Blackwell for agentic workloads and which has already been adopted upstream by vLLM. On speculative decoding workloads, the TokenSpeed MLA kernel nearly halves decode latency versus TensorRT-LLM in the published benchmarks.

## Benchmark Results

The public numbers focus on B200 with Kimi K2.5: roughly 9% lower min-latency and 11% higher throughput at 100 TPS/User compared to TensorRT-LLM, and 580 TPS on Qwen3.5-397B-A17B under agentic workload conditions. The team has been explicit that these are point comparisons on selected configurations rather than a full sweep, and they encourage users to rerun on their own hardware.

## Supported and Planned Models

Kimi K2.5 is the primary demonstrated model. Qwen 3.6, DeepSeek V4, MiniMax M2.7, and additional Blackwell-tuned paths are listed as in progress. The kernel layer targets both Blackwell (B200) and Hopper (H100/H200) GPUs.

## Status and Practical Notes

TokenSpeed is in preview, not a production-ready release. Several major PRs are still open, and the team explicitly warns against production deployment for now. The project is best understood today as a window into where high-performance agentic inference is heading, and as a reference implementation for techniques like compiler-driven SPMD and type-safe KV management that other engines will likely absorb.

## Limitations

As a young project, model coverage is narrow and the documented benchmarks live on a small set of hardware configurations.
The C++ build adds operational complexity compared to a pure-Python engine, and the agentic-workload focus means it is not necessarily the right choice for long-form single-prompt generation, where vLLM's throughput-oriented design still wins. The MIT license is permissive, but the dependency surface (custom kernels, FSM scheduler) makes it harder to vendor selectively than a more modular library.
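Returning to the Local-SPMD modeling layer described above, the idea of deriving collectives from per-tensor annotations can be shown with a toy planner. This is a sketch under stated assumptions, not TokenSpeed's compiler: the `Linear` record, the `"column"`/`"row"`/`"none"` annotation values, and `plan_collectives` are hypothetical names, and the rules used are the standard tensor-parallel ones (a column-sharded weight yields an output sharded on its last dimension, a row-sharded weight yields partial sums that need an all-reduce).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    name: str
    weight_shard: str  # hypothetical annotation: "column", "row", or "none"


def plan_collectives(layers):
    """Walk the layers and return the collectives implied by the annotations.

    Activations start and must end replicated across ranks. Elementwise ops
    between layers are ignored because they need no communication.
    """
    plan = []
    layout = "replicated"  # or "shard_last" (split on the last dimension)
    for layer in layers:
        if layer.weight_shard not in ("column", "row", "none"):
            raise ValueError(f"unknown annotation on {layer.name}")
        # Column-sharded and unsharded weights need the full input on each rank.
        if layer.weight_shard in ("column", "none") and layout == "shard_last":
            plan.append(("all_gather", f"before {layer.name}"))
            layout = "replicated"
        if layer.weight_shard == "column":
            # Each rank computes a slice of the output features.
            layout = "shard_last"
        elif layer.weight_shard == "row":
            # Each rank holds a partial sum; a replicated input is sliced locally.
            plan.append(("all_reduce", f"after {layer.name}"))
            layout = "replicated"
    if layout == "shard_last":
        plan.append(("all_gather", "at output"))
    return plan
```

For an MLP annotated as a column-sharded up-projection followed by a row-sharded down-projection, the planner emits a single all-reduce after the second layer, which is the communication pattern a model author would otherwise write by hand.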

Key Features

Local-SPMD modeling layer with static compiler that auto-generates collective communication from annotations
C++ control plane plus Python execution plane with finite-state machine over request lifecycle
Compile-time KV cache safety enforced through the type system
AsyncLLM entrypoint designed for low CPU-side overhead in high-concurrency agentic serving
Centralized pluggable kernel registry with portable public API
MLA kernel adopted upstream by vLLM, with near-halved decode latency vs TensorRT-LLM on speculative decoding
Demonstrated ~9% lower min-latency and ~11% higher throughput vs TensorRT-LLM on Kimi K2.5 at 100 TPS/User on B200
Targets Blackwell (B200) and Hopper (H100/H200) with planned support for Qwen 3.6, DeepSeek V4, and MiniMax M2.7
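
The pluggable kernel registry listed above can be pictured as a lookup from an operation name and a backend to an implementation, with a portable fallback. The sketch below is an assumption-laden illustration, not TokenSpeed's API: `register_kernel`, `get_kernel`, and the backend labels `"reference"` and `"blackwell"` are hypothetical names chosen for the example.

```python
import math

# Maps (op name, backend name) to a callable. All names here are hypothetical.
_REGISTRY = {}


def register_kernel(op, backend):
    """Decorator that records fn as the implementation of op for backend."""
    def decorator(fn):
        key = (op, backend)
        if key in _REGISTRY:
            raise ValueError(f"kernel already registered for {key}")
        _REGISTRY[key] = fn
        return fn
    return decorator


def get_kernel(op, preferred, fallback="reference"):
    """Return the preferred backend's kernel, else the fallback backend's."""
    for backend in (preferred, fallback):
        fn = _REGISTRY.get((op, backend))
        if fn is not None:
            return fn
    raise LookupError(f"no kernel registered for op {op!r}")


@register_kernel("softmax", "reference")
def softmax_reference(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling `get_kernel("softmax", "blackwell")` here returns the reference implementation because no Blackwell-specific kernel has been registered. Registering one later changes what callers receive without changing the call site, which is the point of putting multiple backends behind one entrypoint.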

Related Projects

Ollama (ollama, MIT): Inference, Trending; 165.0K stars, 15.0K forks, 299 views
llama.cpp
vLLM
Unsloth