Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
MiniCPM-V is OpenBMB's open-source series of multimodal large language models (MLLMs) built around a single, ambitious goal: deliver GPT-4V-class image and video understanding in a footprint small enough to run on a phone. Rather than chasing ever-larger parameter counts, the project optimizes for on-device efficiency, and that focus has resonated: the repository has gathered more than 25,000 GitHub stars, and as of June 2026 its latest model was merged into Ollama's official library for easy local use.

## A Pocket-Sized Multimodal LLM

The headline model, MiniCPM-V 4.6, packs roughly 1.3B parameters yet reportedly surpasses larger models such as Gemma4-E2B on vision-language benchmarks while running faster than even smaller competitors. It accepts images, multi-image inputs, and video alongside text, and it is designed to be deployed directly on iOS, Android, and HarmonyOS, with the edge-adaptation code open-sourced. For developers, that means a genuinely capable vision model that does not depend on a cloud endpoint or a workstation GPU.

## Efficient Visual Encoding

Much of MiniCPM-V 4.6's efficiency comes from its visual encoding pipeline. Built on the LLaVA-UHD v4 approach, it uses an intra-ViT early compression technique that cuts visual encoding compute by more than half, and supports a mixed 4x/16x visual token compression rate. That flexibility lets developers dial the performance-efficiency trade-off per task, spending more tokens on dense OCR or document understanding and fewer on simpler scenes, which is what makes high-resolution image and video understanding practical on constrained hardware.

## From V to o: Omnimodal Streaming

The sibling MiniCPM-o line extends the family from vision-language into full omnimodal interaction. MiniCPM-o 4.5, a 9B end-to-end model, approaches Gemini 2.5 Flash on vision and speech and adds full-duplex multimodal live streaming: its speech and text outputs and its real-time video and audio inputs do not block one another. In practice that means the model can see, listen, and speak at the same time during a live conversation, and even perform proactive interactions such as reminders, capabilities usually reserved for closed, hosted assistants.

## Ecosystem and Deployment

MiniCPM-V is more than model weights. The project ships a realtime web demo that can be deployed on a Mac or a local GPU, a free public API for the latest model, a cookbook of recipes, and packaged mobile apps. The June 2026 Ollama integration means users can pull and run the model locally with a single command, lowering the barrier for anyone who wants private, offline multimodal inference. Weights are distributed on HuggingFace, and the models slot into common inference stacks.

## Considerations

The trade-offs are the familiar ones for compact MLLMs. The smallest models prioritize efficiency, so the hardest reasoning, niche languages, or extremely fine-grained document tasks can still favor much larger frontier models. The most capable streaming model, MiniCPM-o 4.5, is 9B parameters and benefits from a real GPU for low-latency live interaction, so the lightest on-phone experience and the full omnimodal experience sit at different points on the hardware curve. Licensing also varies by component: the code is Apache-2.0, but anyone shipping a product should confirm the specific model-weight terms. Even so, for developers who want strong, deployable multimodal understanding without a cloud dependency, MiniCPM-V is one of the most practical open options available today.
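
To make the mixed 4x/16x visual token compression described earlier more concrete, here is a rough back-of-the-envelope sketch of how the two rates change the visual token budget. It is not MiniCPM-V's actual code or API: the per-slice token count (`RAW_TOKENS_PER_SLICE`), the slice counts, and both helper functions are hypothetical values and names chosen purely for illustration.

```python
from math import ceil

# Hypothetical number of visual tokens one image slice yields before compression.
# Illustrative assumption only, not a published MiniCPM-V figure.
RAW_TOKENS_PER_SLICE = 1024


def compressed_tokens(raw_tokens: int, rate: int) -> int:
    """Return the visual token count after compressing by `rate`, rounded up."""
    if raw_tokens < 0 or rate <= 0:
        raise ValueError("raw_tokens must be >= 0 and rate must be > 0")
    return ceil(raw_tokens / rate)


def token_budget(slices: int, rate: int) -> int:
    """Total visual tokens for `slices` image slices at a single compression rate."""
    return slices * compressed_tokens(RAW_TOKENS_PER_SLICE, rate)


# A dense document page split into 9 slices (assumed), kept at the gentler 4x rate.
dense_ocr = token_budget(slices=9, rate=4)

# A simple scene held in a single slice (assumed), compressed at 16x.
simple_scene = token_budget(slices=1, rate=16)

print(f"dense OCR page: {dense_ocr} visual tokens")
print(f"simple scene:   {simple_scene} visual tokens")
```

Under these assumed numbers, the dense page keeps 2,304 visual tokens while the simple scene needs only 64, which is the kind of per-task trade-off between detail and compute that the mixed rate is meant to expose.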
hacksider
Real-time AI face swap and one-click video deepfake with only a single image
harry0703
AI-powered short video generator that automates scripting, footage sourcing, subtitles, and composition — supporting 10+ LLM providers and batch production.