Open Source

Explore the latest AI open-source projects from GitHub and HuggingFace.

TokenSpeed - Open Source | Evermx

TokenSpeed

LightSeek Foundation | MIT

Inference | 1.1K Stars | 111 Forks | 77 views

TokenSpeed is the LightSeek Foundation's MIT-licensed LLM inference engine, released on May 6, 2026 and explicitly positioned as a speed-of-light runtime for agentic workloads. Within three weeks of its preview release the project has accumulated 1,136 stars and 111 forks, and its Multi-head Latent Attention kernel has already been upstreamed into vLLM. The pitch is uncompromising: TensorRT-LLM-level performance with vLLM-level usability, in a codebase that is 89.8 percent Python and 9.7 percent C++, all under the MIT license.

## A Four-Layer Architecture Built for Agents

TokenSpeed splits the runtime into four cleanly separated layers. The modeling layer uses a local-SPMD design with static compilation that generates collective communication automatically, removing the manual parallelism wiring that consumes weeks of engineering time in other engines. The scheduler combines a C++ control plane with a Python execution plane and encodes request lifecycle plus KV cache management as a finite-state machine with compile-time type safety. The kernel layer is pluggable, layered, and exposes a public API and registry so third-party kernels can be dropped in without forking the engine. The entrypoint integrates with SMG to give AsyncLLM minimal CPU-side overhead, which matters when an agent fires thousands of small requests per second.

## Blackwell-First Performance Numbers

The initial performance story is built around Nvidia Blackwell GPUs. On a B200 running Kimi K2.5, TokenSpeed outperforms TensorRT-LLM by roughly 9 percent in min-latency and 11 percent in throughput at 100 TPS per user, and the engine reports Pareto-superior latency-throughput curves rather than wins at only one operating point. The optimized MLA kernel nearly halves decode latency versus TensorRT-LLM on speculative decoding workloads, which is precisely the path that agentic systems exercise hardest. Hopper and AMD MI350 support is documented as ongoing work.
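To make the "Pareto-superior" claim concrete: one latency-throughput curve is Pareto-superior to another when every operating point on the second curve is matched by a point on the first that is no worse on both axes and strictly better on at least one. The sketch below checks that condition in Python. The `OperatingPoint` type, the function names, and all numbers are illustrative assumptions, not TokenSpeed or TensorRT-LLM measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """One benchmark measurement: lower latency and higher throughput are better."""
    latency_ms: float
    throughput_tps: float


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.throughput_tps >= b.throughput_tps
    strictly_better = a.latency_ms < b.latency_ms or a.throughput_tps > b.throughput_tps
    return no_worse and strictly_better


def pareto_superior(curve_a, curve_b) -> bool:
    """True if every point on curve_b is dominated by some point on curve_a."""
    return all(any(dominates(a, b) for a in curve_a) for b in curve_b)


# Made-up numbers, for illustration only.
engine_a = [OperatingPoint(10.0, 900.0), OperatingPoint(20.0, 2000.0)]
engine_b = [OperatingPoint(11.0, 880.0), OperatingPoint(22.0, 1800.0)]

print(pareto_superior(engine_a, engine_b))  # True
print(pareto_superior(engine_b, engine_a))  # False
```

A win at a single operating point would only satisfy `dominates` for one pair of points; the curve-level check is the stronger statement.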
## Designed for Agentic Workloads Specifically

Almost every existing open inference engine was originally tuned for chat-style traffic with long prompts and a single response stream. TokenSpeed's scheduler and KV reuse policies are instead built around the realities of agentic systems: bursty fan-out, many short requests, tool-call interruptions, and aggressive prefix sharing. The KV resource reuse policy is enforced with safety constraints so that aggressive sharing across requests cannot leak state, and the layered kernel system means heterogeneous accelerators can each contribute their best primitives to the same execution graph.

## Current Model Coverage and Roadmap

Kimi K2.5 is the only fully supported model in the preview release, but the topic tags and roadmap make the broader ambition clear: DeepSeek V4, Qwen 3.6, MiniMax M2.7, and the open-weight GPT-OSS family are all in active development. The project's GitHub topics deliberately call out blackwell, deepseek, gpt-oss, kimi, minimax, and qwen, signaling that the team views TokenSpeed as a Blackwell-era replacement for the current vLLM and TensorRT-LLM duopoly rather than as a niche research tool.

## Preview-Quality, Production-Bound

The maintainers are explicit that this is a preview release and not yet production-ready. Distributed inference, persistent KV storage tiers, and VLM support are all under active development, and production hardening is on the published roadmap for the coming weeks. Even at preview quality, TokenSpeed is the first new open inference engine since vLLM in 2023 to show credible benchmark numbers against TensorRT-LLM, and its MLA kernel already shipping inside vLLM is the strongest possible signal that the broader ecosystem takes the project seriously.
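The prefix sharing described under "Designed for Agentic Workloads" can be illustrated with a toy block cache. This is a generic sketch of the technique, not TokenSpeed's implementation: the `PrefixCache` class, the block size, and the eviction rule are all assumptions made for the example. Each full block is keyed by the entire token prefix that ends at it, so two requests share a block only when everything before it is identical, and a block is evicted only once no request references it.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)


@dataclass
class Block:
    tokens: tuple
    ref_count: int = 0


class PrefixCache:
    """Toy prefix-sharing cache; a real engine would key blocks by a hash, not a tuple."""

    def __init__(self):
        self.blocks = {}  # full token prefix (tuple) -> Block

    def acquire(self, tokens):
        """Pin every full block of `tokens`; return (keys held, blocks reused)."""
        held, reused = [], 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])
            block = self.blocks.get(key)
            if block is None:
                block = Block(tokens=key[-BLOCK_SIZE:])
                self.blocks[key] = block
            else:
                reused += 1
            block.ref_count += 1
            held.append(key)
        return held, reused

    def release(self, held):
        """Unpin the blocks a finished request was holding."""
        for key in held:
            self.blocks[key].ref_count -= 1

    def evict_unused(self):
        """Drop only blocks that no live request references; return how many."""
        victims = [k for k, b in self.blocks.items() if b.ref_count == 0]
        for k in victims:
            del self.blocks[k]
        return len(victims)


cache = PrefixCache()
system_prompt = list(range(8))  # an 8-token prefix shared by both requests
held_a, reused_a = cache.acquire(system_prompt + [100, 101, 102, 103])
held_b, reused_b = cache.acquire(system_prompt + [200, 201, 202, 203])
print(reused_a, reused_b)    # 0 2
cache.release(held_a)
print(cache.evict_unused())  # 1
```

After request A finishes, only its private third block is evicted; the two shared prefix blocks stay pinned because request B still holds them.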

Key Features

Four-layer architecture: local-SPMD modeling, C++/Python scheduler, pluggable kernels, SMG-integrated AsyncLLM entrypoint
Optimized Multi-head Latent Attention kernel for Blackwell that nearly halves decode latency vs TensorRT-LLM
~9% lower min-latency and ~11% higher throughput than TensorRT-LLM on Kimi K2.5 (B200, 100 TPS/user)
MLA kernel already upstreamed and adopted by vLLM
Finite-state-machine request lifecycle with compile-time type safety
Safety-constrained KV resource reuse policy for aggressive prefix sharing in agentic workflows
Public kernel registry and API for third-party Blackwell, Hopper, and MI350 contributions
Roadmap includes DeepSeek V4, Qwen 3.6, MiniMax M2.7, and GPT-OSS support
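
The finite-state-machine request lifecycle listed above can be sketched in a few lines of Python. The state names and the transition table here are hypothetical, chosen to reflect an agentic request that may pause on a tool call; they are not taken from TokenSpeed's source. The sketch also checks transitions at runtime, whereas the project describes compile-time type safety, which plain Python cannot reproduce.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical lifecycle states for illustration.
    WAITING = auto()
    PREFILL = auto()
    DECODE = auto()
    PAUSED = auto()    # e.g. blocked on a tool call
    FINISHED = auto()


# Every legal transition is listed explicitly; anything else is rejected.
ALLOWED = {
    State.WAITING: {State.PREFILL},
    State.PREFILL: {State.DECODE, State.FINISHED},
    State.DECODE: {State.PAUSED, State.FINISHED},
    State.PAUSED: {State.PREFILL, State.FINISHED},
    State.FINISHED: set(),
}


class Request:
    def __init__(self, request_id):
        self.request_id = request_id
        self.state = State.WAITING

    def transition(self, new_state):
        """Move to `new_state`, or raise if the table does not allow it."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state


req = Request("req-1")
for step in (State.PREFILL, State.DECODE, State.PAUSED, State.PREFILL):
    req.transition(step)
print(req.state.name)  # PREFILL
```

Keeping the table as the single source of truth means an invalid move, such as resuming a finished request, fails loudly instead of corrupting scheduler or KV cache state.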

Related Projects

Ollama

llama.cpp

vLLM

Unsloth