TL;DR
Zhipu releases GLM-4.6V (106B) and GLM-4.6V-Flash (9B) with native multimodal tool calling, 128K context window, and state-of-the-art performance on visual understanding benchmarks.
Key Points
- Native multimodal tool calling bridges visual perception to executable action without intermediate text conversions
- 128K token context window processes ~150 document pages, 200 slides, or a one-hour video in a single inference pass
- Achieves state-of-the-art performance among comparable open-source models on 20+ benchmarks, including MMBench, MathVista, and OCRBench
- Supports Model Context Protocol (MCP) extensions with URL-based multimodal handling and interleaved text-image output
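To illustrate what native multimodal tool calling with URL-based image handling might look like in practice, here is a minimal sketch of an OpenAI-style chat request that pairs an image URL with a callable tool. The model identifier, image URL, and `web_search` tool schema are illustrative assumptions, not confirmed details of the z.ai API.

```python
import json

def build_request(model: str, image_url: str, prompt: str) -> dict:
    """Build a chat request that combines an image input with a tool definition,
    so the model can reason over the image and then call the tool directly."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # URL-based image input (no base64 round-trip needed)
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "web_search",  # hypothetical tool for illustration
                    "description": "Search the web for a query string.",
                    "parameters": {
                        "type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"],
                    },
                },
            }
        ],
    }

payload = build_request(
    "glm-4.6v",  # assumed model identifier
    "https://example.com/chart.png",
    "What does this chart show? Search the web if needed.",
)
print(json.dumps(payload, indent=2))
```

The point of the sketch is that the image and the tool definition travel in one request: the model can ground a tool call (e.g. a search query) directly in what it sees, rather than in a lossy text description produced by a separate captioning step.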
Why It Matters
Native multimodal tool use eliminates information loss from image-to-text conversions and enables practical agentic workflows for document analysis, web search, and code generation. The 128K context window and open-source availability make this a significant advancement for developers building multimodal agents and document processing pipelines.
Source: z.ai