Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Explore the latest AI open-source projects from GitHub and HuggingFace.
DS4 is Salvatore Sanfilippo's (antirez, creator of Redis) single-purpose local inference engine for DeepSeek V4 Flash, written in pure C. Released on May 6, 2026 and already at 11,700+ GitHub stars by May 25, it is the project that ignited the now-mainstream conversation about running 600B+ parameter MoE models on consumer Macs. Unlike generic GGUF runners, DS4 makes a deliberately narrow bet: it ships only the kernels, schedulers, KV-cache layout, and tool-calling glue that DeepSeek V4 Flash specifically needs, and trades model breadth for end-to-end polish. ## Asymmetric 2-Bit Quantization for 128 GB Macs The headline trick is asymmetric quantization. Routed MoE experts are compressed with IQ2_XXS and downprojections with Q2_K, while shared experts, attention projections, and decision components are kept at higher precision. The result is that the full DeepSeek V4 Flash model fits in roughly 96 GB of unified memory, which means a single M3 Ultra or M5 Max MacBook can run a frontier-class model that would otherwise require an H100 cluster. A separate Q4 variant targets 256 GB workstations for users who want closer-to-reference quality. ## Metal-First With CUDA and ROCm Backends DS4 was clearly built Metal-first, and the numbers show it. On an M5 Max with 128 GB of RAM, the engine reports 463 tokens per second prefill and 26 tokens per second generation at Q2 with a 32K context. An M3 Ultra with 512 GB hits 27 tokens per second generation at both Q2 and Q4. The CUDA backend is tuned for Nvidia DGX Spark and GB10 boxes, and a community-maintained ROCm branch covers AMD users. The CPU path is restricted to debug use because the developers hit macOS virtual memory bugs that could trigger kernel panics under heavy load. ## KV Cache as a First-Class Disk Citizen The most architecturally interesting choice is how DS4 treats the KV cache. Compressed KV state is written to disk as a first-class artifact rather than living only in RAM, which lets long-context sessions resume without re-running expensive prefill. Combined with the integrated coding agent (ds4-agent), the engine keeps persistent sessions across machine restarts, gives reproducible tool-call replay through the DSML format, and exposes OpenAI, Anthropic, and Responses-compatible HTTP APIs so existing client code drops in without modification. ## Adaptive Thinking and a 1M Token Context DS4 exposes the full one-million-token context window of DeepSeek V4 Flash and adds an adaptive thinking mode that produces shorter reasoning sections for simple problems and longer chains for complex ones. A Think Max mode forces extended reasoning when the user wants it, and a non-thinking direct-answer mode is available for latency-sensitive tool use. The included ds4-eval harness ships with embedded test questions so users can quickly verify that a given quantization and hardware combination is producing sane outputs before integrating the engine into a larger workflow. ## Honest Tradeoffs DS4 is labeled beta-quality, and the ds4-agent component is explicitly alpha. The engine is locked to one model family, so users who want to run Llama, Qwen, or Mistral are pointed at llama.cpp. The project openly credits llama.cpp and GGML for source-level pieces retained under the MIT license. For the specific job of running DeepSeek V4 Flash on a 128 GB Mac at near-paper-quality and useful tokens per second, however, DS4 has no current peer in the open ecosystem, and its release is the clearest sign yet that frontier-grade local inference is now a practical option for individual developers.