Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Rapid-MLX is an open-source local AI inference engine for Apple Silicon that positions itself as a drop-in replacement for Ollama, claiming 4.2x faster throughput on M-series hardware and a time-to-first-token as low as 80 milliseconds on cached prompts. Released under Apache 2.0 by independent developer Raullen Chai in February 2026, it has crossed 2,300 GitHub stars and 280 forks in three months, making it one of the fastest-growing MLX projects of the year.

## Why Another Local Engine

The local LLM space on Mac is crowded. Ollama, LM Studio, llama.cpp, and Apple's own MLX examples all compete for the same hardware. Rapid-MLX argues that none of them fully exploit the unified memory architecture and Neural Engine pathways of the M1, M2, M3, and M4 chips. By rewriting the inference loop on top of Apple's MLX framework with aggressive prompt caching, KV-cache reuse, and a tight tool-calling layer, the project claims a 4.2x throughput edge over Ollama on identical models, plus a cached time-to-first-token of 0.08 seconds that approaches hosted-API latency.

## OpenAI-Compatible API

The critical compatibility detail is that Rapid-MLX exposes an OpenAI-compatible REST API on localhost. Existing applications written against the OpenAI SDK, including Claude Code, Cursor, Aider, and any LangChain or LiteLLM pipeline, can point at the local endpoint with a one-line base URL change and continue to work, as the sketch below shows. The project ships a FastAPI server, a Python SDK, and a CLI, so a developer can be running Qwen, DeepSeek, or Llama models locally within a few minutes of a pip install.
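To make the one-line change concrete, here is a minimal sketch using the official OpenAI Python SDK pointed at a local server. The port, endpoint path, and model identifier are illustrative assumptions, not documented defaults; check the project's README for the real values.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Rapid-MLX server.
# Port and path below are assumptions for illustration.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical local endpoint
    api_key="not-needed-locally",         # the SDK requires a value; local servers typically ignore it
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct-4bit",  # hypothetical local model name
    messages=[{"role": "user", "content": "Summarize the MLX framework in one sentence."}],
)
print(response.choices[0].message.content)
```

Because the surface is the standard Chat Completions API, the same base-URL override is all that LangChain or LiteLLM pipelines need to switch backends.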
## Tool Calling at 100 Percent

A standout claim is 100% tool-calling reliability across 17 tool-parser implementations. Local models notoriously struggle with structured function-call output because their grammars drift mid-generation, breaking JSON. Rapid-MLX includes per-model tool parsers tuned to the known quirks of Qwen, DeepSeek, Llama, and other open-weight families, plus a reasoning-separation layer that strips chain-of-thought tokens from the function-call payload before returning it to the client. The result is local tool calling that holds up in agentic coding setups like Claude Code or Cursor instead of falling over on malformed JSON; a sketch of the client-side flow appears at the end of this piece.

## Cloud Routing and Hybrid Workflows

Rapid-MLX also ships a cloud-routing feature that lets a developer transparently fall back to a hosted API when a request exceeds local capacity or requires a model not available on-device. Routing rules can be configured by token count, model name, or latency budget, which makes the engine usable as the front door of a hybrid inference stack rather than a strictly local-only solution (see the routing sketch at the end of this piece).

## Hardware and Model Support

The project targets M1, M2, M3, and newer Apple Silicon Macs running macOS. Memory requirements scale with the chosen model: 7B and 8B quantized models run comfortably on 16GB of unified memory, while 30B-class models benefit from 32GB or more (the back-of-envelope arithmetic at the end of this piece shows why). The MLX backend automatically handles quantization formats including 4-bit, 5-bit, and 8-bit GGUF-equivalent weights converted to MLX-native layouts.

## Limitations

Rapid-MLX is Apple Silicon only. There is no Linux or Windows CUDA path and no plan to add one, so heterogeneous teams will still need a separate engine for non-Mac developers. Independent throughput benchmarks against the latest Ollama and llama.cpp builds are limited so far, and the 4.2x figure should be reproduced on your own workload before architectural decisions hinge on it. Tool-parser coverage is excellent for major open-weight models but lags for smaller niche fine-tunes, which may need manual parser registration. And as a young project with a single primary maintainer, long-term maintenance risk is real and worth weighing for production deployments.
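To ground the tool-calling section above, here is a hedged sketch of a standard OpenAI-style function call made against the local endpoint. The endpoint, model name, and `get_weather` tool are hypothetical; the point is that the per-model parser layer is what should make `tool_calls` arrive as clean, valid JSON rather than chain-of-thought-contaminated text.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # hypothetical endpoint

# A standard OpenAI-style tool schema; Rapid-MLX's per-model parsers are
# responsible for coercing the model's raw output into this structure.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct-4bit",  # hypothetical local model name
    messages=[{"role": "user", "content": "What's the weather in Cupertino?"}],
    tools=tools,
)

# If the parser did its job, arguments is valid JSON with no stray
# reasoning tokens mixed in.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```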
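The routing rules described in the cloud-routing section could take many shapes. The following is a standalone sketch of the decision logic by model name, token count, and latency budget, with made-up thresholds and model names; it is not Rapid-MLX's actual configuration API.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative assumptions, not Rapid-MLX's real configuration surface.
LOCAL_MODELS = {"qwen2.5-7b-instruct-4bit", "llama-3.1-8b-4bit"}  # assumed on-device set
MAX_LOCAL_PROMPT_TOKENS = 8_192   # assumed local context ceiling
LOCAL_DECODE_TPS = 45.0           # assumed local decode speed, tokens/second

@dataclass
class Request:
    model: str
    prompt_tokens: int
    max_new_tokens: int
    latency_budget_ms: Optional[int] = None

def choose_backend(req: Request) -> str:
    if req.model not in LOCAL_MODELS:
        return "cloud"  # model not available on-device
    if req.prompt_tokens > MAX_LOCAL_PROMPT_TOKENS:
        return "cloud"  # request exceeds local capacity
    if req.latency_budget_ms is not None:
        est_ms = req.max_new_tokens / LOCAL_DECODE_TPS * 1000
        if est_ms > req.latency_budget_ms:
            return "cloud"  # local decode would blow the latency budget
    return "local"

print(choose_backend(Request("qwen2.5-7b-instruct-4bit", 2_000, 256)))    # -> local
print(choose_backend(Request("gpt-4.1", 500, 128)))                       # -> cloud (not on-device)
print(choose_backend(Request("llama-3.1-8b-4bit", 1_000, 4_096, 1_000)))  # -> cloud (budget)
```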
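Finally, the memory guidance in the hardware section follows from simple arithmetic on weight storage. Here is the back-of-envelope version, counting weights only and ignoring KV cache and activation overhead.

```python
# Weight footprint ≈ parameters * bits-per-weight / 8 bytes. Real usage is
# higher once the KV cache, activations, and runtime overhead are added,
# which is why 30B-class models want 32GB rather than exactly 15GB.

def approx_weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8  # 1e9 params * bits/8 bytes = params_billion * bits/8 GB

for params, bits in [(7, 4), (8, 4), (30, 4), (30, 8)]:
    print(f"{params}B @ {bits}-bit ≈ {approx_weight_gb(params, bits):.1f} GB of weights")
# 7B @ 4-bit ≈ 3.5 GB and 8B @ 4-bit ≈ 4.0 GB fit comfortably in 16GB;
# 30B @ 4-bit ≈ 15.0 GB and 30B @ 8-bit ≈ 30.0 GB want 32GB or more.
```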