Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
MLX-VLM is a package for running and fine-tuning Vision Language Models on Apple Silicon Macs using the MLX framework. With 5,007 GitHub stars under the MIT license, it has become the standard way to serve modern VLMs (Qwen2-VL, Gemma 4, Phi-4, MiniCPM, DeepSeek-OCR, DOTS-OCR, Pixtral, Idefics, LLaVA, Molmo, PaliGemma) locally on M-series hardware, without a discrete GPU and without the round-trips and rate limits of hosted vision APIs.

## Why a Mac-Native VLM Engine Matters

The VLM ecosystem in 2026 has split into two practical deployment patterns. One uses hosted APIs from OpenAI, Anthropic, and Google for general-purpose vision tasks. The other runs models locally for document OCR, on-device assistants, privacy-sensitive image analysis, and cost-sensitive batch processing. The local path has been dominated by CUDA-only frameworks, which leaves Mac users, a substantial portion of developers and most of the ML research community, without a fast inference path. MLX-VLM closes that gap by targeting Apple's MLX framework directly, using the unified memory architecture of M-series chips to load and run VLMs that would otherwise require a discrete NVIDIA card.

## Supported Model Surface

The project supports an unusually wide model catalog:

- General VLMs: Qwen2-VL, Gemma 4 multimodal, Phi-4 multimodal, MiniCPM-V, LLaVA, Pixtral, Idefics, Molmo, PaliGemma
- Specialist OCR models: DeepSeek-OCR, DOTS-OCR, Florence-2
- Omni models that accept audio in addition to images

Each model is packaged with weight conversions in the MLX Community on HuggingFace, so users can pull a quantized checkpoint and run it with a single command rather than building their own conversion pipeline. New model families typically land within days of their original release, which is how MLX-VLM has stayed current with a vision-language space that ships new architectures almost monthly.

## Inference Modes

MLX-VLM exposes four ways to consume a model.
The CLI handles single-prompt and batch jobs. The Python API provides programmatic access for embedding into applications. A Gradio chat UI gives a local web interface that mirrors the experience of hosted chat products, which is useful for evaluation and demos. A FastAPI server exposes the same models over HTTP with continuous batching, so multiple concurrent requests can be served from one model load with the throughput characteristics expected of a real serving stack rather than a notebook script.

## Speculative Decoding and Throughput

Generation speed is the dimension where MLX-VLM has invested most heavily. Speculative decoding ships with DFlash, EAGLE-3, and Multi-Token Prediction drafters, delivering 2-4x faster generation by drafting candidate tokens with a small model and verifying them with the target model. For multi-turn conversations, vision feature caching reuses the encoded image representation across turns and reports an 11x speedup in conversations that reference the same image multiple times. Automatic prefix caching reuses computed KV state for shared prompt prefixes, which is the standard optimization for long-context vision prompts where system messages and image tokens repeat.

## Quantization for Memory-Constrained Hardware

The quantization story is the practical enabler for running large VLMs on consumer Mac hardware. Model weights can be quantized down to 4-bit or 2-bit, providing up to 8x compression. KV cache quantization is supported in both uniform and TurboQuant 3.5-bit forms, which matters because KV cache memory often becomes the binding constraint at long context lengths in vision models. Activation quantization is also supported for CUDA targets, since MLX-VLM is increasingly used in cross-platform comparisons even though its native execution path is MLX on Apple Silicon.
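
The memory arithmetic behind these figures can be sketched in a few lines of plain Python. This is a back-of-the-envelope estimate, not MLX-VLM code: the function names, the 16-bit baseline, the optional per-group metadata overhead, and the example model dimensions are all assumptions chosen for illustration.

```python
def weight_bytes(num_params, bits, group_size=None, meta_bits=32):
    """Approximate bytes needed to store `num_params` weights at `bits` each.

    If `group_size` is given, add `meta_bits` of metadata (for example a
    16-bit scale plus a 16-bit offset) per group. That overhead model is
    an assumption, not a description of any particular file format.
    """
    total_bits = num_params * bits
    if group_size:
        total_bits += (num_params / group_size) * meta_bits
    return total_bits / 8


def compression_ratio(bits, baseline_bits=16, group_size=None):
    """Ratio of 16-bit storage to quantized storage for the same weights."""
    n = 1_000_000  # any count works; the ratio does not depend on it
    return weight_bytes(n, baseline_bits) / weight_bytes(n, bits, group_size)


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Approximate KV cache size: keys and values for every layer and token."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits / 8


if __name__ == "__main__":
    # Ideal ratios, ignoring metadata: 2-bit gives the "up to 8x" figure.
    print("2-bit:", compression_ratio(2))
    print("4-bit:", compression_ratio(4))
    # With hypothetical per-group metadata the real ratio is somewhat lower.
    print("4-bit, groups of 64:", round(compression_ratio(4, group_size=64), 2))
    # Hypothetical model dimensions at a 32k-token context.
    for bits in (16, 3.5):
        size = kv_cache_bytes(32, 8, 128, 32768, bits)
        print(f"KV cache at {bits} bits: {size / 2**30:.2f} GiB")
```

The ideal ratios ignore scales and offsets, which is why "up to 8x" is an upper bound. The KV cache estimate grows linearly with context length, which is why cache precision starts to matter once prompts carry many image tokens.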
## Fine-Tuning and Distributed Inference

Beyond inference, MLX-VLM supports LoRA and QLoRA fine-tuning, which lets users adapt a base VLM to a specific document format, domain vocabulary, or visual style using a modest GPU memory footprint, entirely on a Mac. Distributed inference across multiple Macs is supported through MLX's distributed primitives, so a small studio with two or three Mac Studios can serve a model that would not fit in any single device's memory. This is a meaningful alternative to renting NVIDIA capacity for teams whose workloads fit the Apple Silicon throughput envelope.

## Position in the Local-AI Ecosystem

MLX-VLM sits in the same ecological niche as llama.cpp does for text-only LLMs: the open-source, Mac-native inference layer that turns a developer machine into a serious AI workstation. At 5,007 stars and 578 forks with an MIT license, it has the adoption pattern of infrastructure rather than a demo project, and its breadth of model support and active integration with new VLM releases suggest it will continue to be the path of least resistance for any developer who wants modern vision-language inference on a Mac without writing their own MLX kernels.
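
As a supplementary note on the LoRA fine-tuning mentioned under Fine-Tuning and Distributed Inference, the sketch below shows why the trainable footprint stays small. It is a generic illustration of the low-rank update y = Wx + (alpha / r) * B(Ax) in plain Python, not MLX-VLM's implementation; the function names, the 4096-wide layer, and the rank and alpha values are hypothetical.

```python
def full_params(d_in, d_out):
    """Trainable values when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable values for a rank-`rank` adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha, rank):
    """Frozen base projection plus the scaled low-rank correction."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, delta)]


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection with a rank-8 adapter.
    d, r = 4096, 8
    print("full:", full_params(d, d))
    print("lora:", lora_params(d, d, r))
    print("reduction:", full_params(d, d) // lora_params(d, d, r), "x")

    # Tiny numeric check: with B set to zero the adapter leaves the output unchanged.
    w = [[1, 0], [0, 1]]
    a = [[1, 1]]
    x = [3, 4]
    print(lora_forward(w, a, [[0], [0]], x, alpha=2, rank=1))
    print(lora_forward(w, a, [[1], [2]], x, alpha=2, rank=1))
```

Only A and B receive gradients, so optimizer state scales with the adapter size rather than the full matrix. QLoRA applies the same idea on top of a quantized frozen base, which reduces the memory held by W as well.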