Reviews AI Tools Open Source Live News AI Official

Open Source

Explore the latest AI open-source projects from GitHub and HuggingFace.

GLM-V - Open Source | Evermx | Evermx

Back to Open Source

TrendingFeatured

GLM-V

zai-orgApache-2.0

View on GitHub

Multimodal2.3K Stars168 Forks103 views

GLM-V is Z.ai's open-source vision-language model family — the GLM-4.1V-Thinking, GLM-4.5V, and the new GLM-4.6V lineup — designed around a single thesis: multimodal models should reason, not just describe. The repository, hosted under the `zai-org` GitHub organization and released under the Apache 2.0 license, has crossed 2,300 stars and 168 forks while serving as the canonical inference and fine-tuning codebase for what Z.ai calls 'versatile multimodal reasoning with scalable reinforcement learning'. It is, in 2026, one of the most credible open challengers to closed-source vision models from OpenAI and Anthropic. ## The Thinking Paradigm Applied to Vision The flagship contribution of the GLM-V series is the explicit 'thinking' mode, originally introduced in GLM-4.1V-9B-Thinking and co-developed with Tsinghua University's KEG lab. Where most vision-language models produce a one-shot answer to an image-grounded prompt, GLM-V emits a structured chain of reasoning steps before committing to a final response. The behavior is induced through a training recipe Z.ai calls Reinforcement Learning with Curriculum Sampling (RLCS), which staggers task difficulty so the model learns to allocate more reasoning tokens to harder multimodal inputs and fewer to trivial ones. The result is a vision model that behaves like a small agent rather than a captioner. ## Scaling Up to MoE: GLM-4.5V and GLM-4.6V GLM-4.5V scales the thinking recipe onto a Mixture-of-Experts backbone derived from GLM-4.5-Air, with 106B total parameters and roughly 12B active per token. The MoE design pushes peak quality on 41 published multimodal benchmarks while keeping inference cost in the same ballpark as a dense 12B-class model. GLM-4.6V, added to the same repository in spring 2026, extends context length and improves long-video understanding without changing the public API. All three checkpoints share one inference codebase, which is the central practical benefit of consolidating them under GLM-V. ## Video Understanding as a First-Class Citizen The repository tags itself with `video-understanding`, `reasoning`, `vlm`, and `image2text`, and the order is informative. Long video reasoning is a deliberate focus rather than an afterthought. The included demos cover frame-by-frame question answering on multi-minute clips, temporal grounding (where in the video does X happen), and structured event extraction. The 66K context window in GLM-4.5V is large enough to hold dense per-second descriptors for typical product-demo or surveillance-length footage, which is where most closed-source models still rely on aggressive sampling. ## Reproducible Inference and Fine-Tuning Recipes GLM-V ships with reference inference code for vLLM and Transformers, a single-GPU FP16 mode for smaller checkpoints, and a tensor-parallel path for the MoE variants. The fine-tuning recipes are the second reason the repository attracts forks: Z.ai publishes the SFT and RLCS pipelines used internally, which is unusually transparent for a flagship multimodal release. Community recipes for LoRA adapters on the dense 9B checkpoint have already appeared in the issue tracker, lowering the bar for domain-specific deployments such as medical imaging or document analysis. ## Benchmark Posture Z.ai claims state-of-the-art open-source results on 41 multimodal benchmarks spanning MMMU, MathVista, MMBench, MMVet, and a battery of OCR and document-VQA evaluations. The figures the project publishes place GLM-4.5V within a few points of GPT-4o and Gemini 2.5 Pro on most reasoning-heavy multimodal tasks, while remaining open-weight under Apache 2.0. The independent verification community has so far reproduced the headline numbers on MMMU and MathVista, with the larger benchmark sweep still in progress. ## Limitations GLM-V's strengths are also its tradeoffs. The MoE checkpoints require multi-GPU or high-VRAM single-GPU hosting that is non-trivial to provision. The 'thinking' mode increases output token counts substantially, which matters if the model is being served behind a per-token billing layer. Documentation is bilingual (English and Chinese) but tilts Chinese-first for some of the more advanced training scripts. And while the Apache 2.0 license is permissive, the model weights themselves carry an additional acceptable-use policy that downstream redistributors need to read. ## Who Should Use GLM-V GLM-V is the right starting point for teams that need an open-weight vision-language model with real reasoning capability — document AI, visual agent prototypes, video search and surveillance analytics, and education products that grade visual work. It is also a natural research target for groups studying multimodal RLHF and curriculum learning, given the published RLCS recipe. Teams that need only image captioning or simple VQA will find smaller, faster models a better fit; the value of GLM-V is realized when the workload genuinely requires step-by-step multimodal reasoning.

Key Features

GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V checkpoints under one repository
Explicit 'thinking' reasoning mode trained with Reinforcement Learning with Curriculum Sampling (RLCS)
Mixture-of-Experts backbone in 4.5V/4.6V: 106B total, ~12B active per token
Long video understanding with temporal grounding and per-frame reasoning
66K context window for dense per-second video descriptors
Reference inference code for vLLM and Transformers, single-GPU and tensor-parallel paths
Published SFT and RLCS fine-tuning recipes for reproducibility
Apache 2.0 license with state-of-the-art results on 41 multimodal benchmarks