FastVLM

By Apple. License: Apple Sample Code License


Multimodal, 7.4K stars, 553 forks, 43 views

FastVLM is Apple's open-source vision-language model (VLM). It has passed 7,400 GitHub stars by attacking the part of the multimodal stack that most projects treated as fixed: the vision encoder. The project introduces FastViTHD, a hybrid vision encoder that produces fewer visual tokens than ViT-style encoders at the same resolution. That translates into a reported 85x faster Time-to-First-Token (TTFT) than LLaVA-OneVision-0.5B and 7.9x faster TTFT than Cambrian-1-8B, while matching or beating their accuracy. For on-device vision-language inference on iPhones and Macs, this is the difference between a feature that ships and a research demo.

## What FastVLM Is For

The project is built for one specific deployment target: vision-language inference on consumer hardware, particularly Apple Silicon. The repository ships a working iOS/iPadOS demo app, Apple Silicon-optimized quantization settings, Python and Swift code, and 0.5B, 1.5B, and 7B model variants, so that there is a size for each class of target device. Everything else (the training recipe, the visual instruction tuning, the LLM backbone) looks like a standard modern VLM. The differentiator is a vision tower that is faster and smaller without giving up accuracy, which is what unlocks the on-device use cases.

## FastViTHD: A Hybrid Vision Encoder

The technical contribution is the FastViTHD encoder. At high resolution, conventional ViT vision encoders produce a long sequence of visual tokens that dominates the prefill cost of multimodal inference, which is why most open VLMs feel sluggish on the first response. FastViTHD is a hybrid convolutional/transformer design that outputs roughly 3x fewer tokens at the same resolution, with a 3.4x smaller encoder footprint, while preserving the visual features the LLM needs. The 85x TTFT speedup over LLaVA-OneVision-0.5B reported in the paper is the cumulative effect of fewer tokens, a smaller encoder, and the resulting shorter prefill.
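
To make the token-count argument concrete, here is a rough back-of-the-envelope sketch. The image size, patch size, prompt length, and the quadratic attention proxy are illustrative assumptions, not figures or methods from the FastVLM paper; only the "roughly 3x fewer tokens" factor is taken from the text above. The function names are hypothetical helpers, not part of the FastVLM codebase.

```python
def vit_token_count(height: int, width: int, patch: int) -> int:
    """Visual tokens from a plain ViT: one token per non-overlapping patch."""
    return (height // patch) * (width // patch)


def reduced_token_count(tokens: int, reduction: float) -> int:
    """Tokens left when an encoder emits `reduction`x fewer tokens."""
    return max(1, round(tokens / reduction))


def prefill_tokens(visual_tokens: int, text_tokens: int) -> int:
    """Sequence length the LLM must process before its first output token."""
    return visual_tokens + text_tokens


def relative_attention_cost(new_len: int, old_len: int) -> float:
    """Crude proxy: self-attention cost grows quadratically with length."""
    return (new_len / old_len) ** 2


# Assumed inputs, chosen only for illustration.
HEIGHT = WIDTH = 1024   # assumed input resolution
PATCH = 16              # assumed ViT patch size
TEXT_TOKENS = 32        # assumed prompt length
REDUCTION = 3.0         # "roughly 3x fewer tokens", from the article

baseline_visual = vit_token_count(HEIGHT, WIDTH, PATCH)
reduced_visual = reduced_token_count(baseline_visual, REDUCTION)

baseline_prefill = prefill_tokens(baseline_visual, TEXT_TOKENS)
reduced_prefill = prefill_tokens(reduced_visual, TEXT_TOKENS)

print(f"baseline visual tokens: {baseline_visual}")
print(f"reduced visual tokens:  {reduced_visual}")
print(f"prefill length ratio:   {baseline_prefill / reduced_prefill:.2f}x shorter")
print(
    "attention cost ratio:   "
    f"{relative_attention_cost(reduced_prefill, baseline_prefill):.3f} of baseline"
)
```

Under these assumptions the visual tokens dwarf the text prompt, so cutting them shortens the prefill by almost the same factor. The sketch does not model encoder latency, which the article names as a separate contributor to the reported TTFT gains.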

## Three Sizes Targeted at Real Devices

The 0.5B variant is sized for current-generation iPhones and produces the headline 85x TTFT result. The 1.5B variant fits comfortably on iPad Pro and recent Macs. The 7B variant pairs FastViTHD with a Qwen2-7B LLM and is the one that beats Cambrian-1-8B with 7.9x faster TTFT. This is the Mac-class checkpoint, intended for users who want desktop-grade VLM accuracy without sending images to a cloud API. Each variant ships pre-trained checkpoints, so adopters do not need to retrain from scratch.

## CVPR 2025 Paper, Production-Aimed Repository

FastVLM was presented at CVPR 2025, and the repository is structured for production adoption rather than research reproduction. The codebase is 81.6% Python (training and inference) and 17.1% Swift (iOS demo and Apple Silicon integration). That split is unusual: most research VLMs are pure Python and leave deployment to third parties. Including the Swift demo app means Apple developers can clone the repo, build the sample, and see real-time VLM inference on their device on day one.

## Practical Implications

The FastVLM result has broader implications than the Apple-specific framing suggests. Visual token count is the dominant inference cost for any high-resolution multimodal model, so the FastViTHD idea (a hybrid encoder that outputs fewer tokens at the same resolution) applies to any VLM serving stack, including server-side deployments where prefill latency limits throughput. Several open VLM releases in 2026 have started incorporating similar token-reduction strategies, and FastVLM is the reference implementation teams point to.

## Limitations

The repository has only one branch and effectively no formal release tags, which makes pinning a version harder than for more conventional open projects.
Documentation is research-paper quality rather than production-onboarding quality: getting a non-Apple-Silicon deployment running takes more reading than for Hugging Face TGI-style projects. The code is released under the Apple Sample Code License and the model weights carry their own separate license, so commercial integrators need to read both license files carefully. Finally, the iOS demo, while excellent for showing what is possible, is not a full app; productionizing the on-device experience still requires real iOS engineering.

Within those constraints, FastVLM in 2026 is the strongest open answer to the question "how do I run a VLM on a phone without it feeling broken," and a useful reference for any team that wants to cut prefill latency on a server-side VLM stack.