TL;DR
Zhipu releases GLM-4.6V (106B) and GLM-4.6V-Flash (9B) with native multimodal tool calling, 128K context window, and state-of-the-art performance on visual understanding benchmarks.
Key Points
- Native multimodal tool calling bridges visual perception to executable action without intermediate text conversions
- 128K token context window processes ~150 document pages, 200 slides, or a one-hour video in a single inference pass
- Achieves state-of-the-art performance among comparable open-source models on 20+ benchmarks, including MMBench, MathVista, and OCRBench
- Supports Model Context Protocol (MCP) extensions with URL-based multimodal handling and interleaved text-image output
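To illustrate what native multimodal tool calling with URL-based image handling might look like in practice, here is a minimal sketch of an OpenAI-style chat request that pairs an image URL with a callable tool. The model identifier, image URL, and `web_search` tool schema are illustrative assumptions, not confirmed details of the z.ai API.

```python
import json

def build_request(model: str, image_url: str, prompt: str) -> dict:
    """Build a chat request that combines an image input with a tool definition,
    so the model can reason over the image and then call the tool directly."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # URL-based image input (no base64 round-trip needed)
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "web_search",  # hypothetical tool for illustration
                    "description": "Search the web for a query string.",
                    "parameters": {
                        "type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"],
                    },
                },
            }
        ],
    }

payload = build_request(
    "glm-4.6v",  # assumed model identifier
    "https://example.com/chart.png",
    "What does this chart show? Search the web if needed.",
)
print(json.dumps(payload, indent=2))
```

The point of the sketch is that the image and the tool definition travel in one request: the model can ground a tool call (e.g. a search query) directly in what it sees, rather than in a lossy text description produced by a separate captioning step.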
Why It Matters
Native multimodal tool use eliminates information loss from image-to-text conversions and enables practical agentic workflows for document analysis, web search, and code generation. The 128K context window and open-source availability make this a significant advancement for developers building multimodal agents and document processing pipelines.
Source: z.ai