Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
TokenSpeed is the LightSeek Foundation's MIT-licensed LLM inference engine, released on May 6, 2026 and explicitly positioned as a speed-of-light runtime for agentic workloads. Within three weeks of its preview release the project has accumulated 1,136 stars and 111 forks, and its Multi-head Latent Attention kernel has already been upstreamed into vLLM. The pitch is uncompromising: TensorRT-LLM-level performance with vLLM-level usability, in a codebase that is 89.8 percent Python and 9.7 percent C++, all under the MIT license.

## A Four-Layer Architecture Built for Agents

TokenSpeed splits the runtime into four cleanly separated layers. The modeling layer uses a local-SPMD design with static compilation that generates collective communication automatically, removing the manual parallelism wiring that consumes weeks of engineering time in other engines. The scheduler combines a C++ control plane with a Python execution plane and encodes request lifecycle plus KV cache management as a finite-state machine with compile-time type safety. The kernel layer is pluggable, layered, and exposes a public API and registry so third-party kernels can be dropped in without forking the engine. The entrypoint integrates with SMG to give AsyncLLM minimal CPU-side overhead, which matters when an agent fires thousands of small requests per second.

## Blackwell-First Performance Numbers

The initial performance story is built around Nvidia Blackwell GPUs. On a B200 running Kimi K2.5, TokenSpeed outperforms TensorRT-LLM by roughly 9 percent in min-latency and 11 percent in throughput at 100 TPS per user, and the engine reports Pareto-superior latency-throughput curves rather than wins at only one operating point. The optimized MLA kernel nearly halves decode latency versus TensorRT-LLM on speculative decoding workloads, which is precisely the path that agentic systems exercise hardest. Hopper and AMD MI350 support is documented as ongoing work.
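
To make the "Pareto-superior" claim concrete, here is a minimal Python sketch of what it means for one latency-throughput curve to dominate another across all operating points, instead of winning at only one. The function names, the `(throughput, latency)` point layout, and every number below are illustrative assumptions. They are not TokenSpeed or TensorRT-LLM measurements, and this is not how the project computes its benchmarks.

```python
from typing import List, Tuple

# One operating point as (throughput, latency).
# Higher throughput is better; lower latency is better.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b on both axes and strictly better on one."""
    a_tput, a_lat = a
    b_tput, b_lat = b
    no_worse = a_tput >= b_tput and a_lat <= b_lat
    strictly_better = a_tput > b_tput or a_lat < b_lat
    return no_worse and strictly_better


def curve_is_pareto_superior(candidate: List[Point], baseline: List[Point]) -> bool:
    """Return True if every baseline point is dominated by some candidate point."""
    return all(any(dominates(c, b) for c in candidate) for b in baseline)


# Made-up points for two hypothetical engines (NOT real benchmark data).
engine_a = [(100.0, 10.0), (200.0, 14.0), (400.0, 25.0)]
engine_b = [(90.0, 11.0), (180.0, 16.0), (400.0, 27.0)]

# A hypothetical engine that wins only at the low-throughput end.
single_point_winner = [(100.0, 9.0), (150.0, 20.0), (300.0, 40.0)]

a_beats_b = curve_is_pareto_superior(engine_a, engine_b)
narrow_win_beats_b = curve_is_pareto_superior(single_point_winner, engine_b)
```

Under these made-up numbers, `engine_a` dominates every point of `engine_b`, while `single_point_winner` beats `engine_b` only at its lowest-throughput point and so does not qualify. That is the distinction the article draws between a Pareto-superior curve and a win at a single operating point.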

## Designed for Agentic Workloads Specifically

Almost every existing open inference engine was originally tuned for chat-style traffic with long prompts and a single response stream. TokenSpeed's scheduler and KV reuse policies are instead built around the realities of agentic systems: bursty fan-out, many short requests, tool-call interruptions, and aggressive prefix sharing. The KV resource reuse policy is enforced with safety constraints so that aggressive sharing across requests cannot leak state, and the layered kernel system means heterogeneous accelerators can each contribute their best primitives to the same execution graph.

## Current Model Coverage and Roadmap

Kimi K2.5 is the only fully supported model in the preview release, but the topic tags and roadmap make the broader ambition clear: DeepSeek V4, Qwen 3.6, MiniMax M2.7, and the open-weight GPT-OSS family are all in active development. The project's GitHub topics deliberately call out blackwell, deepseek, gpt-oss, kimi, minimax, and qwen, signaling that the team views TokenSpeed as a Blackwell-era replacement for the current vLLM and TensorRT-LLM duopoly rather than as a niche research tool.

## Preview-Quality, Production-Bound

The maintainers are explicit that this is a preview release and not yet production-ready. Distributed inference, persistent KV storage tiers, and VLM support are all under active development, and production hardening is on the published roadmap for the coming weeks. Even at preview quality, TokenSpeed is the first new open inference engine since vLLM in 2023 to show credible benchmark numbers against TensorRT-LLM, and its MLA kernel already shipping inside vLLM is the strongest possible signal that the broader ecosystem takes the project seriously.