Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
TokenSpeed is a speed-of-light LLM inference engine developed by the LightSeek Foundation, released as open source in May 2026 under the MIT license. The project targets agentic workloads and aims to deliver TensorRT-LLM-level performance with vLLM-level usability, combining a C++ control plane with a Python execution layer to keep CPU-side overhead minimal while preserving developer ergonomics.

## Why TokenSpeed Matters

As LLM applications shift toward agentic systems that execute long chains of tool calls and reasoning steps, throughput per user at high tokens-per-second (TPS) becomes more important than aggregate throughput. TokenSpeed is engineered specifically for this regime: published benchmarks on Kimi K2.5 running on NVIDIA Blackwell B200 show roughly 9% lower minimum latency and roughly 11% higher throughput around 100 TPS/user compared to TensorRT-LLM. For coding agents running above 70 TPS/user, TokenSpeed dominates TensorRT-LLM across the entire Pareto frontier, which is significant because TensorRT-LLM has been the practical performance ceiling for many production deployments.

## Local-SPMD Modeling Layer

TokenSpeed uses a local-SPMD design with static compilation that automatically generates collective communication patterns from per-rank model definitions. This eliminates the need for engineers to manually wire up tensor or pipeline parallelism, one of the most error-prone parts of scaling LLM inference across multiple GPUs. The static compilation step also allows the engine to fuse operations and pre-plan communication, reducing runtime overhead. A toy version of the idea is sketched after the feature overview below.

## Finite-State-Machine Scheduler

The scheduler encodes request lifecycles and KV cache management as finite-state machines with compile-time type safety. The C++ control plane drives scheduling decisions while the Python execution layer handles model invocation, giving operators a clear separation between low-latency control logic and flexible Python integration. This design is what allows TokenSpeed to keep per-request CPU overhead extremely low even with many concurrent agentic sessions; a sketch of the lifecycle-as-FSM pattern also follows below.

## MLA and Pluggable Kernels

The engine ships one of the fastest Multi-head Latent Attention (MLA) kernel implementations available on NVIDIA Blackwell, which is critical for DeepSeek-style models that use MLA instead of standard multi-head attention. Kernels are organized through a pluggable system with portable APIs and a centralized registry, making it straightforward to add custom kernels for new hardware or new attention variants without forking the core engine (see the registry sketch below).

## AsyncLLM Entrypoint

TokenSpeed exposes an AsyncLLM entrypoint that provides low-overhead CPU-side request handling through SMG (Shared Memory Gateway) integration. This is the surface that agent frameworks and serving stacks talk to, and it is designed to minimize the latency cost of each round trip when an agent makes many sequential LLM calls (see the agent-loop sketch below).

## Hardware and Model Coverage

The initial release is optimized for NVIDIA Blackwell, with active development targeting Hopper and AMD MI350 platforms. Model coverage at launch includes Kimi K2.5, and the roadmap explicitly lists Qwen 3.6, DeepSeek V4, and MiniMax M2.7. This relatively narrow but deep coverage reflects the project's focus on agentic frontier models rather than serving every available architecture.
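To make the local-SPMD idea concrete, here is a minimal, CPU-only toy in plain Python/NumPy. Everything in it is invented for illustration and is not TokenSpeed's modeling API: each simulated rank computes only its local shard of a tensor-parallel MLP, and the `all_reduce` stand-in plays the role of the collective that a static compiler would insert after inspecting the sharding metadata.

```python
# Toy of the local-SPMD idea: each rank writes only its *local* shard of the
# computation, and a compile step derives the collective needed to assemble
# the full result. All names here are hypothetical illustration, not the
# TokenSpeed API.
import numpy as np

WORLD_SIZE = 2

def local_mlp(rank, x, w1_shards, w2_shards):
    """Per-rank definition: column-sharded w1, row-sharded w2.

    The return value is a *partial sum* of the full MLP output, which is
    exactly the property a compiler can use to infer that an all-reduce
    must follow this function.
    """
    h = np.maximum(x @ w1_shards[rank], 0.0)  # local column shard of hidden
    return h @ w2_shards[rank]                # partial sum of the output

def all_reduce(partials):
    """Stand-in for the collective the compiler would emit (e.g. NCCL)."""
    return sum(partials)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 8))

# Shard the weights the way a tensor-parallel plan would.
w1_shards = np.split(w1, WORLD_SIZE, axis=1)  # column parallel
w2_shards = np.split(w2, WORLD_SIZE, axis=0)  # row parallel

# "Run" every rank locally, then apply the inferred collective.
partials = [local_mlp(r, x, w1_shards, w2_shards) for r in range(WORLD_SIZE)]
y = all_reduce(partials)

# The sharded result matches the unsharded reference computation.
y_ref = np.maximum(x @ w1, 0.0) @ w2
assert np.allclose(y, y_ref)
```

The payoff of this style is that the model author writes only `local_mlp`; the communication pattern is derived from the sharding rather than hand-wired.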
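The scheduler's lifecycle tracking can be pictured as a small state machine. TokenSpeed implements this in its C++ control plane with compile-time type safety; the Python sketch below only illustrates the shape of the idea at runtime, and the state names are invented.

```python
# Minimal sketch of encoding a request lifecycle as a finite-state machine
# with an explicit transition table. State names are invented for
# illustration; the real engine enforces this in C++ at compile time.
from enum import Enum, auto

class ReqState(Enum):
    QUEUED = auto()        # waiting for KV cache blocks
    PREFILLING = auto()    # prompt is being processed
    DECODING = auto()      # generating tokens
    PREEMPTED = auto()     # KV blocks reclaimed under memory pressure
    FINISHED = auto()

# Legal transitions; anything not listed here is a scheduler bug.
TRANSITIONS = {
    ReqState.QUEUED:     {ReqState.PREFILLING},
    ReqState.PREFILLING: {ReqState.DECODING, ReqState.PREEMPTED},
    ReqState.DECODING:   {ReqState.FINISHED, ReqState.PREEMPTED},
    ReqState.PREEMPTED:  {ReqState.QUEUED},
    ReqState.FINISHED:   set(),
}

class Request:
    def __init__(self, req_id: str):
        self.req_id = req_id
        self.state = ReqState.QUEUED

    def advance(self, new_state: ReqState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise RuntimeError(
                f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

req = Request("r1")
req.advance(ReqState.PREFILLING)
req.advance(ReqState.DECODING)
req.advance(ReqState.FINISHED)    # ok
# req.advance(ReqState.DECODING)  # would raise: FINISHED is terminal
```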
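The pluggable-kernel design is, at heart, a dispatch table keyed by operation and hardware target. The decorator and function names below are hypothetical rather than TokenSpeed's actual API; the sketch shows why adding a kernel means registering it, not patching the core engine.

```python
# Sketch of a centralized kernel registry with pluggable implementations.
# Names are hypothetical; the pattern is what matters: new hardware or new
# attention variants register themselves and are found at dispatch time.
from typing import Callable, Dict, Tuple

_KERNELS: Dict[Tuple[str, str], Callable] = {}

def register_kernel(op: str, arch: str):
    """Register an implementation of `op` for hardware target `arch`."""
    def wrap(fn: Callable) -> Callable:
        _KERNELS[(op, arch)] = fn
        return fn
    return wrap

def dispatch(op: str, arch: str) -> Callable:
    try:
        return _KERNELS[(op, arch)]
    except KeyError:
        raise NotImplementedError(f"no kernel for {op} on {arch}") from None

@register_kernel("mla_attention", "sm100")  # Blackwell
def mla_attention_blackwell(*args):
    ...  # would launch the tuned Blackwell MLA kernel

@register_kernel("mla_attention", "sm90")   # Hopper (support in development)
def mla_attention_hopper(*args):
    ...  # portable fallback while the tuned kernel lands

kernel = dispatch("mla_attention", "sm100")
```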
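Finally, the workload AsyncLLM is optimized for looks roughly like the loop below: many short, strictly sequential generations, where the CPU-side cost of each round trip multiplies across steps. The `AsyncLLM.generate` class and signature here are stubs invented for the sketch; the real entrypoint and its SMG integration may differ, so consult the project's documentation.

```python
# Sketch of the sequential-call pattern AsyncLLM is built to make cheap.
# `AsyncLLM` below is a stand-in stub with an invented signature, not the
# real entrypoint (which routes requests through the SMG gateway).
import asyncio

class AsyncLLM:
    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(0.01)  # placeholder for model latency
        return f"<completion of: {prompt[:20]}...>"

async def agent_loop(llm: AsyncLLM, task: str, max_steps: int = 4) -> str:
    """Each step feeds the previous completion back in, so any per-call
    CPU overhead compounds linearly with the number of steps."""
    context = task
    for _ in range(max_steps):
        completion = await llm.generate(context)
        context = context + "\n" + completion
    return context

print(asyncio.run(agent_loop(AsyncLLM(), "Refactor the parser module")))
```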
## Limitations

TokenSpeed is currently a preview release, and the maintainers explicitly do not recommend it for production use yet. Hardware support outside Blackwell is still in progress, so users on Hopper, MI350, or older GPUs will need to wait for upcoming releases. The model zoo is also narrower than those of vLLM or SGLang, which support a much wider range of architectures out of the box. Teams that need maximum flexibility today may still prefer those engines, while teams running agentic workloads on Blackwell hardware with supported models will likely see the largest immediate benefit from TokenSpeed.