MiniCPM-V

OpenBMB · Apache-2.0

Multimodal · 25.7K Stars · 2.0K Forks

MiniCPM-V is OpenBMB's open-source series of multimodal large language models (MLLMs) built around a single, ambitious goal: deliver GPT-4V-class image and video understanding in a footprint small enough to run on a phone. Rather than chasing ever-larger parameter counts, the project optimizes for on-device efficiency, and that focus has resonated: the repository has gathered more than 25,000 GitHub stars, and as of June 2026 its latest model was merged into Ollama's official library for easy local use.

## A Pocket-Sized Multimodal LLM

The headline model, MiniCPM-V 4.6, packs roughly 1.3B parameters yet reportedly surpasses larger models such as Gemma4-E2B on vision-language benchmarks while running faster than even smaller competitors. It accepts images, multi-image inputs, and video alongside text, and it is designed to be deployed directly on iOS, Android, and HarmonyOS, with the edge-adaptation code open-sourced. For developers, that means a genuinely capable vision model that does not depend on a cloud endpoint or a workstation GPU.

## Efficient Visual Encoding

Much of MiniCPM-V 4.6's efficiency comes from its visual encoding pipeline. Built on the LLaVA-UHD v4 approach, it uses an intra-ViT early compression technique that cuts visual encoding compute by more than half, and supports a mixed 4x/16x visual token compression rate. That flexibility lets developers dial the performance-efficiency trade-off per task, spending more tokens on dense OCR or document understanding and fewer on simpler scenes, which is what makes high-resolution image and video understanding practical on constrained hardware.

## From V to o: Omnimodal Streaming

The sibling MiniCPM-o line extends the family from vision-language into full omnimodal interaction. MiniCPM-o 4.5, a 9B end-to-end model, approaches Gemini 2.5 Flash on vision and speech and adds full-duplex multimodal live streaming: its speech and text outputs and its real-time video and audio inputs do not block one another. In practice that means the model can see, listen, and speak at the same time during a live conversation, and even perform proactive interactions such as reminders, capabilities usually reserved for closed, hosted assistants.

## Ecosystem and Deployment

MiniCPM-V is more than model weights. The project ships a realtime web demo that can be deployed on a Mac or a local GPU, a free public API for the latest model, a cookbook of recipes, and packaged mobile apps. The June 2026 Ollama integration means users can pull and run the model locally with a single command, lowering the barrier for anyone who wants private, offline multimodal inference. Weights are distributed on HuggingFace, and the models slot into common inference stacks.

## Considerations

The trade-offs are the familiar ones for compact MLLMs. The smallest models prioritize efficiency, so the hardest reasoning, niche languages, or extremely fine-grained document tasks can still favor much larger frontier models. The most capable streaming model, MiniCPM-o 4.5, is 9B parameters and benefits from a real GPU for low-latency live interaction, so the lightest on-phone experience and the full omnimodal experience sit at different points on the hardware curve. Licensing also varies by component: the code is Apache-2.0, but anyone shipping a product should confirm the specific model-weight terms. Even so, for developers who want strong, deployable multimodal understanding without a cloud dependency, MiniCPM-V is one of the most practical open options available today.
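For readers who want to try the local Ollama route mentioned above, the following is a minimal sketch of sending one image to a locally running Ollama server through its `/api/generate` endpoint. The article does not give the model's Ollama tag, so `MODEL_TAG` is a hypothetical placeholder that must be replaced with the real tag from the Ollama library; the URL assumes Ollama's default local port, and the helper names `build_payload` and `describe_image` are illustrative only.

```python
import base64
import json
import urllib.request

# Hypothetical placeholder: the article does not state the model's Ollama tag.
MODEL_TAG = "<minicpm-v-tag>"
# Assumes Ollama's default local address; change it if your server differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, prompt: str, model: str = MODEL_TAG) -> dict:
    """Build a non-streaming generate request with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(path: str, prompt: str = "Describe this image.") -> str:
    """Send a local image file to the Ollama server and return the reply text."""
    with open(path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # With "stream": False the server returns a single JSON object.
        return json.loads(response.read())["response"]
```

Calling `describe_image("photo.jpg")` requires a running Ollama server with the model already pulled. Because the image never leaves the machine, this matches the private, offline use case the article describes.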

## Key Features

MiniCPM-V 4.6 (~1.3B params) for strong on-device image and video understanding
Runs natively on iOS, Android, and HarmonyOS with open-sourced edge adaptation code
Intra-ViT early compression cuts visual encoding compute by 50%+ (LLaVA-UHD v4)
Mixed 4x/16x visual token compression for flexible performance-efficiency trade-offs
MiniCPM-o 4.5 (9B) approaches Gemini 2.5 Flash on vision and speech
Full-duplex multimodal live streaming — see, listen, and speak simultaneously
Merged into Ollama's official library for one-command local deployment (June 2026)
Free public API, realtime web demo, cookbook, and HuggingFace weights
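To make the mixed 4x/16x visual token compression feature concrete, here is a rough back-of-the-envelope sketch of how the compression rate changes the number of visual tokens handed to the language model. The 14-pixel patch size, the 448x448 input, and the uniform-rate arithmetic are assumptions chosen for illustration, not figures from the project; `visual_tokens` is a hypothetical helper, and the real pipeline may slice and compress images differently.

```python
def visual_tokens(width: int, height: int, patch: int = 14, compression: int = 16) -> int:
    """Estimate visual tokens left after compressing a ViT patch grid.

    Assumes (for illustration only) a square patch size and a uniform
    compression rate applied to the total patch count.
    """
    patches = (width // patch) * (height // patch)
    # Ceiling division: a partially filled group still costs one token.
    return -(-patches // compression)


# A 448x448 image with 14-pixel patches yields a 32x32 grid of 1024 patches.
dense_document = visual_tokens(448, 448, compression=4)   # 256 tokens
simple_scene = visual_tokens(448, 448, compression=16)    # 64 tokens
print(dense_document, simple_scene)  # prints "256 64"
```

Under these assumptions the 16x setting passes a quarter as many visual tokens to the language model as the 4x setting, which is the kind of per-task trade-off the article describes for dense OCR versus simpler scenes.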