Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Rapid-MLX is an open-source local AI inference engine for Apple Silicon that positions itself as a drop-in replacement for Ollama, claiming 4.2x faster throughput on M-series hardware and a time-to-first-token as low as 80 milliseconds on cached prompts. Released under Apache 2.0 by independent developer Raullen Chai in February 2026, it has crossed 2,300 GitHub stars and 280 forks in three months, making it one of the fastest-growing MLX projects of the year.

## Why Another Local Engine

The local LLM space on Mac is crowded. Ollama, LM Studio, llama.cpp, and Apple's own MLX examples all compete for the same hardware. Rapid-MLX argues that none of them fully exploit the unified memory architecture and Neural Engine pathways of the M1, M2, M3, and M4 chips. By rewriting the inference loop on top of Apple's MLX framework with aggressive prompt caching, KV-cache reuse, and a tight tool-calling layer, the project claims a 4.2x throughput edge over Ollama on identical models, plus a cached time-to-first-token of 0.08 seconds that approaches hosted-API latency.

## OpenAI-Compatible API

The critical compatibility detail is that Rapid-MLX exposes an OpenAI-compatible REST API on localhost. Existing applications written against the OpenAI SDK, including Claude Code, Cursor, Aider, and any LangChain or LiteLLM pipeline, can point at the local endpoint with a one-line base URL change and continue to work, as the sketch below shows. The project ships a FastAPI server, a Python SDK, and a CLI, so a developer can be running Qwen, DeepSeek, or Llama models locally within a few minutes of a pip install.
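To make the one-line change concrete, here is a minimal sketch using the official OpenAI Python SDK pointed at a local server. The port, endpoint path, and model identifier are illustrative assumptions, not documented defaults; check the project's README for the real values.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Rapid-MLX server.
# Port and path below are assumptions for illustration.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical local endpoint
    api_key="not-needed-locally",         # the SDK requires a value; local servers typically ignore it
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct-4bit",  # hypothetical local model name
    messages=[{"role": "user", "content": "Summarize the MLX framework in one sentence."}],
)
print(response.choices[0].message.content)
```

Because the surface is the standard Chat Completions API, the same base-URL override is all that LangChain or LiteLLM pipelines need to switch backends.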
## Tool Calling at 100 Percent

A standout claim is 100% tool-calling reliability across 17 tool-parser implementations. Local models notoriously struggle with structured function-call output because their grammars drift mid-generation, breaking JSON. Rapid-MLX includes per-model tool parsers tuned to the known quirks of Qwen, DeepSeek, Llama, and other open-weight families, plus a reasoning-separation layer that strips chain-of-thought tokens from the function-call payload before returning it to the client. The result is local tool calling that holds up in agentic coding setups like Claude Code or Cursor instead of falling over on malformed JSON; a sketch of the client-side flow appears at the end of this piece.

## Cloud Routing and Hybrid Workflows

Rapid-MLX also ships a cloud-routing feature that lets a developer transparently fall back to a hosted API when a request exceeds local capacity or requires a model not available on-device. Routing rules can be configured by token count, model name, or latency budget, which makes the engine usable as the front door of a hybrid inference stack rather than a strictly local-only solution (see the routing sketch at the end of this piece).

## Hardware and Model Support

The project targets M1, M2, M3, and newer Apple Silicon Macs running macOS. Memory requirements scale with the chosen model: 7B and 8B quantized models run comfortably on 16GB of unified memory, while 30B-class models benefit from 32GB or more (the back-of-envelope arithmetic at the end of this piece shows why). The MLX backend automatically handles quantization formats including 4-bit, 5-bit, and 8-bit GGUF-equivalent weights converted to MLX-native layouts.

## Limitations

Rapid-MLX is Apple Silicon only. There is no Linux or Windows CUDA path and no plan to add one, so heterogeneous teams will still need a separate engine for non-Mac developers. Independent throughput benchmarks against the latest Ollama and llama.cpp builds are limited so far, and the 4.2x figure should be reproduced on your own workload before architectural decisions hinge on it. Tool-parser coverage is excellent for major open-weight models but lags for smaller niche fine-tunes, which may need manual parser registration. And as a young project with a single primary maintainer, long-term maintenance risk is real and worth weighing for production deployments.
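To ground the tool-calling section above, here is a hedged sketch of a standard OpenAI-style function call made against the local endpoint. The endpoint, model name, and `get_weather` tool are hypothetical; the point is that the per-model parser layer is what should make `tool_calls` arrive as clean, valid JSON rather than chain-of-thought-contaminated text.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # hypothetical endpoint

# A standard OpenAI-style tool schema; Rapid-MLX's per-model parsers are
# responsible for coercing the model's raw output into this structure.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct-4bit",  # hypothetical local model name
    messages=[{"role": "user", "content": "What's the weather in Cupertino?"}],
    tools=tools,
)

# If the parser did its job, arguments is valid JSON with no stray
# reasoning tokens mixed in.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```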
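The routing rules described in the cloud-routing section could take many shapes. The following is a standalone sketch of the decision logic by model name, token count, and latency budget, with made-up thresholds and model names; it is not Rapid-MLX's actual configuration API.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative assumptions, not Rapid-MLX's real configuration surface.
LOCAL_MODELS = {"qwen2.5-7b-instruct-4bit", "llama-3.1-8b-4bit"}  # assumed on-device set
MAX_LOCAL_PROMPT_TOKENS = 8_192   # assumed local context ceiling
LOCAL_DECODE_TPS = 45.0           # assumed local decode speed, tokens/second

@dataclass
class Request:
    model: str
    prompt_tokens: int
    max_new_tokens: int
    latency_budget_ms: Optional[int] = None

def choose_backend(req: Request) -> str:
    if req.model not in LOCAL_MODELS:
        return "cloud"  # model not available on-device
    if req.prompt_tokens > MAX_LOCAL_PROMPT_TOKENS:
        return "cloud"  # request exceeds local capacity
    if req.latency_budget_ms is not None:
        est_ms = req.max_new_tokens / LOCAL_DECODE_TPS * 1000
        if est_ms > req.latency_budget_ms:
            return "cloud"  # local decode would blow the latency budget
    return "local"

print(choose_backend(Request("qwen2.5-7b-instruct-4bit", 2_000, 256)))    # -> local
print(choose_backend(Request("gpt-4.1", 500, 128)))                       # -> cloud (not on-device)
print(choose_backend(Request("llama-3.1-8b-4bit", 1_000, 4_096, 1_000)))  # -> cloud (budget)
```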
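Finally, the memory guidance in the hardware section follows from simple arithmetic on weight storage. Here is the back-of-envelope version, counting weights only and ignoring KV cache and activation overhead.

```python
# Weight footprint ≈ parameters * bits-per-weight / 8 bytes. Real usage is
# higher once the KV cache, activations, and runtime overhead are added,
# which is why 30B-class models want 32GB rather than exactly 15GB.

def approx_weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8  # 1e9 params * bits/8 bytes = params_billion * bits/8 GB

for params, bits in [(7, 4), (8, 4), (30, 4), (30, 8)]:
    print(f"{params}B @ {bits}-bit ≈ {approx_weight_gb(params, bits):.1f} GB of weights")
# 7B @ 4-bit ≈ 3.5 GB and 8B @ 4-bit ≈ 4.0 GB fit comfortably in 16GB;
# 30B @ 4-bit ≈ 15.0 GB and 30B @ 8-bit ≈ 30.0 GB want 32GB or more.
```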