Open Source

Block diffusion architecture enabling high-quality parallel token drafting for speculative decoding
Multi-backend support: vLLM v0.20.1+, SGLang (built-in), Hugging Face Transformers, and MLX for Apple Silicon
Broad model compatibility including Gemma-4, Qwen series, Llama-3.1, and MiniMax
Built-in benchmarking across GSM8K, MATH500, HumanEval, MBPP, and MT-Bench datasets
Speculative-config integration for setting the draft model and draft-token parameters
Docker support for Gemma-4 deployment with containerized inference pipelines
MIT-licensed codebase, backed by a research paper (arXiv:2602.06036)
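
For readers new to speculative decoding, the draft-then-verify step that the features above build on can be sketched in a few lines. This is a toy illustration under stated assumptions, not the project's implementation: `verify_draft` and `toy_target` are hypothetical names, tokens are plain integers, and the target model is called once per position for clarity, whereas a real engine scores the whole drafted block in a single forward pass.

```python
from typing import Callable, List


def verify_draft(
    target_next: Callable[[List[int]], int],
    prefix: List[int],
    draft: List[int],
) -> List[int]:
    """Greedy verification of a drafted block (hypothetical helper).

    Keeps the longest run of draft tokens the target model agrees with,
    then appends one token chosen by the target itself.
    """
    accepted: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # First disagreement: take the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def toy_target(context: List[int]) -> int:
    """Stand-in for a target model: always predicts last token + 1 (mod 10)."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # The drafter proposed four tokens in parallel; the third one is wrong.
    print(verify_draft(toy_target, [1, 2, 3], [4, 5, 7, 8]))
```

Each verification round emits at least one token and at most `len(draft) + 1`, which is why the quality of the drafted block directly determines the speedup.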

Related Projects

Ollama

ollama

A simple way to run LLMs locally, with one-command deployment, 100+ models, a REST API, and multi-platform support. 165K+ GitHub stars.

llama.cpp

ggml-org

Pure C/C++ LLM inference engine supporting CPUs, Apple Silicon, CUDA, and Vulkan.

vLLM

vLLM Project

A high-throughput, memory-efficient LLM inference and serving engine built around PagedAttention, with an OpenAI-compatible API and support for 200+ models.

Apache-2.0