Reviews AI Tools Open Source Live News AI Official

Open Source

Explore the latest AI open-source projects from GitHub and HuggingFace.

GLM-V - Open Source | Evermx | Evermx

Back to Open Source

Trending

GLM-V

zai-orgApache-2.0

View on GitHub

Multimodal2.3K Stars173 Forks43 views

## Introduction GLM-V is the vision-language branch of Z.ai's GLM model family, consolidating three generations of open-source VLMs into a single repository: GLM-4.1V-9B-Thinking, GLM-4.5V, and the latest GLM-4.6V series. Where many VLMs treat visual understanding as an add-on to a text backbone, GLM-V is architected around the premise that multimodal reasoning — spanning images, video, documents, GUI interfaces, and structured charts — requires dedicated training objectives and scalable reinforcement learning, not just more visual tokens. Released in late 2025 and actively updated into 2026, the project has accumulated over 2,300 GitHub stars from developers building document intelligence tools, GUI automation agents, and video analysis pipelines. ## What It Is The GLM-V repository hosts two model scales for GLM-4.6V: - **GLM-4.6V (106B)**: The full-scale model built on the GLM-4.6 language backbone with an AIMv2-Huge Vision Transformer encoder and an MLP projector that aligns visual features to the LLM's embedding space. It supports 128K token context windows, enabling long-form video summarization and multi-page document analysis in a single pass. - **GLM-4.6V-Flash (9B)**: A lightweight variant that fits on a single high-end GPU while preserving most of the 106B model's capabilities. It uses the same ViT encoder and native tool-calling interface, making it practical for edge deployment and cost-sensitive inference. Both variants support arbitrary image resolutions and aspect ratios — including panoramic inputs up to 200:1 — through bicubic interpolation of absolute positional embeddings combined with 2D-RoPE encoding. Video inputs are handled via 3D convolutions with temporal compression. ## Key Capabilities ### Native Multimodal Function Calling GLM-4.6V is among the first open-source VLMs to support native vision-grounded tool use: the model can identify a UI element in a screenshot and invoke a corresponding API or action, enabling vision-driven automation without a separate planning layer. ### Long-Context Visual Reasoning The 128K token window — covering interleaved image patches, video frames, and text — allows the model to process multi-chapter documents with embedded charts, long meeting recordings, and multi-image comparisons in a single inference call. On MMLongBench and ChartQAPro, it outperforms much larger models on long-context tasks. ### Thinking Mode Toggle GLM-4.5V and GLM-4.6V both include a thinking mode that activates extended chain-of-thought reasoning at inference time. Developers can switch between fast single-pass responses and deeper reasoning depending on task complexity, without swapping model weights. ### Document and Chart Comprehension The series performs strongly on structured visual content: PDF layout parsing, spreadsheet interpretation, flowcharts, and infographic extraction. This makes it applicable to enterprise document workflows where tables and diagrams carry as much information as prose. ### GUI Automation and Grounding Built-in grounding capabilities allow the model to localize specific visual elements — buttons, text fields, icons — and return bounding box coordinates. Combined with tool calling, this enables computer-use agents that act on screenshots. ### Scalable Reinforcement Learning All three model generations in the repository were trained with scalable RL objectives, moving beyond standard supervised fine-tuning on curated image-caption pairs. This training methodology contributes to the models' robustness on adversarial visual reasoning tasks and multi-step problem solving. ## Deployment and Integration GLM-V integrates with four inference frameworks: - **Hugging Face Transformers** for standard research workflows - **vLLM** with tensor parallel serving for production throughput - **SGLang** for optimized structured generation - **ModelScope** for the Chinese developer ecosystem The repository includes structured examples for image Q&A, video understanding, document parsing, and tool-calling workflows. ## Why It Matters Native multimodal tool use has been a capability gap in open-source VLMs. Closed-source models like GPT-4o and Gemini have offered function calling with visual grounding for some time, but open-source equivalents either required complex prompt engineering or separate grounding modules. GLM-4.6V's native tool-calling interface — where the model reasons about visual content and decides which tool to invoke based on what it sees — brings this capability into the open-source ecosystem. The 128K context window also differentiates GLM-V from competing open-source VLMs that cap at 32K or 64K tokens. For applications involving long video analysis or dense document corpora, this is a meaningful practical advantage. The availability of a capable 9B Flash variant at single-GPU memory footprints further lowers the barrier for teams who want production-grade multimodal reasoning without a multi-node inference cluster.

Key Features

Native multimodal function calling: vision-grounded tool invocation from screenshots and images
128K token context window supporting long video and multi-page document analysis
Dual model scale: 106B flagship and 9B Flash variant for single-GPU deployment
AIMv2-Huge ViT encoder with 2D-RoPE and 3D convolution for arbitrary aspect ratios and video
Thinking mode toggle: switch between fast inference and extended chain-of-thought at runtime
GUI grounding: bounding box localization of UI elements for computer-use agents
Scalable reinforcement learning training across all model generations
Multi-framework support: Transformers, vLLM, SGLang, and ModelScope