Fine-Tune GLM-4.7 with Ertas

    Z.ai's December 2025 coding-focused release — a 400-billion-parameter mixture-of-experts model with 'Preserved Thinking' multi-turn reasoning, plus a smaller GLM-4.7 Flash variant for production serving. Topped Code Arena among open-weight models at release before being succeeded by the GLM-5 series.

    ~400B (Flagship) · Flash (smaller) · Z.ai

    Overview

    GLM-4.7, released by Z.ai on December 22, 2025, is the coding-focused successor to GLM-4.6 and the model that established Z.ai's competitive position on agentic coding benchmarks before the GLM-5 series took over the family flagship role. The flagship is approximately 400 billion parameters in a mixture-of-experts architecture, paired with a Flash variant — a smaller distilled tier optimized for production serving where inference economics matter more than peak capability.

    The headline benchmark result was GLM-4.7 topping Code Arena among open-weight models at release. Code Arena measures real-world coding capability across diverse programming tasks and was notably less saturated than HumanEval-style benchmarks at the time, providing meaningful differentiation among the top-tier coding models. GLM-4.7's lead was a moment rather than a sustained position, with Qwen3-Coder-Next, MiMo V2.5 Pro, and Kimi K2.5 subsequently establishing leads on different coding benchmarks, but it was an important data point in the open-weight coding model competition through early 2026.

    The distinctive architectural innovation in GLM-4.7 is 'Preserved Thinking' — a multi-turn reasoning pattern where the model maintains its reasoning state across multiple turns of a conversation, allowing more coherent long-running agentic execution than typical hybrid-reasoning models. Where Qwen 3+ and DeepSeek V3.2/V4's thinking modes operate within a single turn, Preserved Thinking is designed for workflows that span many turns over hours of execution. This pattern was a precursor to GLM-5.1's 8-hour autonomous run capability.

    GLM-4.7 has been substantively superseded as the Z.ai flagship by GLM-5 (February 2026) and GLM-5.1 (April 2026), both of which use a different 745B base architecture rather than continuing the GLM-4 lineage. GLM-4.7 remains relevant as a documented step in the GLM family's evolution and as a production option for teams that want coding-focused capability with distinctive multi-turn reasoning behavior. Weights are available on Hugging Face under `zai-org/GLM-4.7` and `zai-org/GLM-4.7-Flash`.

    Key Features

    Code Arena leadership at release was GLM-4.7's headline benchmark result. The model briefly held the top open-weight position on Code Arena, demonstrating that coding-focused training and the Preserved Thinking architecture together produced measurable real-world capability gains over alternative open-weight options. While the lead was contested within months by newer releases, the moment validated Z.ai's strategic focus on agentic coding capability.

    Preserved Thinking is the architectural feature that distinguishes GLM-4.7 from contemporaries. Standard hybrid-reasoning models (Qwen 3+, DeepSeek V3.2/V4) compute reasoning traces within a single conversation turn — the next turn starts fresh. GLM-4.7's Preserved Thinking maintains reasoning state across turns, allowing the model to reference its prior thinking when handling subsequent queries in the same conversation. For long-running agentic workflows where context drift is a quality issue, this pattern produces measurable improvements.
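    The serving-side difference can be sketched in a few lines of Python. This is an illustrative assumption about how the two behaviors diverge; the `<think>` tag and message shapes below are placeholders, not GLM-4.7's actual wire format:

```python
# Sketch: single-turn hybrid reasoning strips the model's thinking trace before
# the next turn; Preserved Thinking keeps it visible in later context.
# Tag name and message shapes are illustrative assumptions.
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> spans, as a single-turn reasoning server would."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

def build_next_turn_context(history, preserve_thinking: bool):
    """Return the message list presented to the model for the next turn."""
    if preserve_thinking:
        return history  # reasoning traces stay visible across turns
    return [
        {**m, "content": strip_thinking(m["content"])} if m["role"] == "assistant" else m
        for m in history
    ]

history = [
    {"role": "user", "content": "Refactor the auth module."},
    {"role": "assistant",
     "content": "<think>Plan: extract token logic first, then sessions.</think>"
                "Step 1 done: token logic extracted."},
]

standard = build_next_turn_context(history, preserve_thinking=False)
preserved = build_next_turn_context(history, preserve_thinking=True)
```

    In the standard case, turn two sees only the final answer; with preservation, the plan from turn one is still in context when the model decides what to do next.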

    The Flash variant fills the production-serving niche. While the flagship 400B model is substantial enough to require multi-GPU server deployment, GLM-4.7 Flash targets single-GPU and consumer-hardware deployment with quality competitive with mid-tier dense alternatives. For teams running production coding agents at scale, the Flash variant's combination of strong coding capability and production-friendly economics is particularly attractive.

    GLM-4.7 was the model that established Z.ai as a serious open-weight coding-model contender. Prior to 4.7, Z.ai was widely viewed as a competent but second-tier Chinese-lab open-weight provider. The Code Arena result and the broader 4.7 reception positioned Z.ai for the GLM-5/5.1 successor releases that subsequently established the company's position in the top tier of open-weight model providers.

    Fine-Tuning with Ertas

    GLM-4.7 fine-tuning in Ertas Studio works through the standard MoE training pipeline. The flagship 400B variant requires multi-GPU server configurations for QLoRA — approximately 250-320GB total VRAM at typical sequence lengths. The Flash variant is substantially more accessible, fitting QLoRA training on a single 48-80GB GPU.
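    As a rough illustration of the Flash-tier training shape, a QLoRA run might be configured along the lines below. These keys are a hypothetical sketch, not Ertas Studio's actual schema; consult the Studio documentation for the real field names:

```yaml
# Hypothetical QLoRA configuration sketch -- illustrative field names only.
model: zai-org/GLM-4.7-Flash
method: qlora
quantization:
  load_in_4bit: true
  quant_type: nf4          # 4-bit NormalFloat base weights
lora:
  rank: 16
  alpha: 32
  dropout: 0.05
training:
  max_seq_len: 4096
  per_device_batch_size: 1
  gradient_accumulation: 16
```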

    For coding-specific fine-tuning, GLM-4.7 benefits from training data that includes complete agentic execution traces — task descriptions, planning, multi-turn tool use, and observed outcomes. The Preserved Thinking architecture preserves reasoning state through fine-tuning when training data appropriately exercises the multi-turn reasoning pattern. Ertas Studio supports these formats natively, including agentic conversation formats with explicit thinking traces.
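    One illustrative shape for such a trace is a single JSONL record per agentic episode. The field names `thinking` and `tool_calls` here are assumptions for the sketch; match them to Ertas Studio's documented conversation schema before preparing data:

```python
# Sketch: assemble one agentic training example as task -> plan (thinking) ->
# tool call -> observation -> outcome. Field names are illustrative assumptions.
import json

def make_agentic_example(task, plan, tool_name, tool_args, tool_result, outcome):
    """Build a multi-turn trace exercising the thinking + tool-use pattern."""
    return {
        "messages": [
            {"role": "user", "content": task},
            {"role": "assistant", "thinking": plan,
             "tool_calls": [{"name": tool_name, "arguments": tool_args}]},
            {"role": "tool", "name": tool_name, "content": tool_result},
            {"role": "assistant", "content": outcome},
        ]
    }

example = make_agentic_example(
    task="Fix the failing test in utils/date.py",
    plan="Run the test suite first to see the failure, then patch the parser.",
    tool_name="run_tests",
    tool_args={"path": "utils/date.py"},
    tool_result="1 failed: test_parse_iso - ValueError on '2025-12-22'",
    outcome="Patched parse_iso to accept extended ISO dates; tests pass.",
)
line = json.dumps(example)  # one JSONL line per complete trace
```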

    For most teams without multi-GPU server access, the recommended pattern is to use GLM-4.7 flagship as a teacher to generate synthetic training data, then fine-tune GLM-4.7 Flash or a smaller base on that data. This produces a domain-specialized coding model at production-friendly deployment cost while inheriting GLM-4.7's coding patterns and Preserved Thinking behavior.
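    The teacher side of that pattern can be sketched as follows, assuming the flagship is served behind any OpenAI-compatible endpoint. The endpoint URL and helper names are assumptions; the network call is kept at the edge so the data-shaping logic stands alone:

```python
# Sketch: use the GLM-4.7 flagship as a teacher to label domain tasks, then
# convert each completion into a student fine-tuning example.

def build_teacher_request(task: str, model: str = "zai-org/GLM-4.7"):
    """Payload for an OpenAI-compatible /chat/completions call to the teacher."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are a senior engineer. Solve the task and show your plan."},
            {"role": "user", "content": task},
        ],
        "temperature": 0.7,
    }

def to_training_pair(task: str, teacher_answer: str):
    """Convert one teacher completion into a student training example."""
    return {"messages": [{"role": "user", "content": task},
                         {"role": "assistant", "content": teacher_answer}]}

# Hypothetical usage against a locally served teacher (requires a running server):
#   import requests
#   resp = requests.post("http://localhost:8000/v1/chat/completions",
#                        json=build_teacher_request("Write a retry decorator"))
#   answer = resp.json()["choices"][0]["message"]["content"]
pair = to_training_pair("Write a retry decorator",
                        "Use exponential backoff with jitter...")
```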

    After training, Ertas Studio exports to GGUF format with full GLM-4.7 chat template preservation. Both flagship and Flash variants deploy cleanly via Ollama, llama.cpp, or vLLM with single-click integration into Claude Code, Cline, or Aider via their custom-model configuration.
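    For Ollama specifically, wiring an exported GGUF into a named local model takes a short Modelfile. The filename and model name below are placeholders for your own export:

```
FROM ./glm-4.7-flash-finetuned.Q4_K_M.gguf
PARAMETER num_ctx 8192
```

    Then `ollama create glm47-flash-ft -f Modelfile` registers the model and `ollama run glm47-flash-ft` serves it locally, at which point coding tools that accept a custom Ollama model name can point at it.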

    Use Cases

    Multi-turn agentic coding workflows benefit from GLM-4.7's Preserved Thinking architecture. Long-running implementation tasks — features that span multiple development cycles, complex refactors that proceed iteratively, codebase migrations that pause and resume across sessions — proceed more reliably with Preserved Thinking than with single-turn reasoning models. For teams building production coding agents that operate over substantial time windows, GLM-4.7 is worth specific evaluation against alternatives.

    The Flash variant targets high-throughput coding agent serving. Customer-facing coding tools, internal developer assistants, and CI-integrated code review agents all benefit from the smaller variant's combination of strong coding quality and production-friendly inference economics. For teams choosing between GLM-4.7 Flash and Qwen3-Coder-Next as self-hosted alternatives to Claude Code, both are credible options with different operational tradeoffs.

    For teams running stable production deployments adopted before the GLM-5 series became available, GLM-4.7 remains a documented and supported option. The migration to GLM-5/5.1 offers measurable capability improvements but comes with non-trivial operational change costs. GLM-4.7 fine-tuning workflows remain valid for teams with existing pipeline investments.

    Hardware Requirements

    GLM-4.7 flagship at Q4_K_M quantization requires approximately 220GB of memory, fitting on a 4x A100 80GB or 4x H100 80GB server, or a CPU inference host with 384GB+ RAM. The Flash variant requires substantially less — approximately 30-50GB depending on quantization tier — fitting on a single 48-80GB GPU.
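    These figures follow from simple arithmetic: parameter count times bits per weight, divided by eight, plus runtime overhead for KV cache and buffers. A minimal sketch, assuming rough community-average bits-per-weight values for llama.cpp quant tiers (the exact values vary by model and are not GLM-4.7-specific measurements):

```python
# Back-of-envelope GGUF file-size estimate: params x bits-per-weight / 8.
# Bits-per-weight figures are rough averages for llama.cpp quant tiers
# (an assumption, not measured GLM-4.7 values); add headroom for KV cache.
BITS_PER_WEIGHT = {"Q3_K_M": 3.4, "Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5}

def gguf_size_gb(n_params: float, quant: str) -> float:
    """Approximate model file size in decimal gigabytes."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

flagship_q4 = gguf_size_gb(400e9, "Q4_K_M")  # ~225 GB, near the ~220 GB figure above
flagship_q3 = gguf_size_gb(400e9, "Q3_K_M")  # ~170 GB
```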

    For smaller deployments, Q3_K_M quantization (approximately 165GB flagship, 22-38GB Flash) trades modest quality for reduced memory. The Flash variant Q3 deployment is genuinely accessible to consumer-hardware setups (high-end Mac Studio configurations, dual-GPU workstation builds).

    For fine-tuning in Ertas Studio: GLM-4.7 flagship QLoRA needs approximately 250-320GB total VRAM (multi-GPU server). GLM-4.7 Flash QLoRA needs 32-48GB VRAM, fitting on a single 48-80GB GPU. The Flash variant's training accessibility makes it the practical choice for most teams interested in domain specialization without server-class infrastructure.

    Supported Quantizations

    Q3_K_M · Q4_0 · Q4_K_M · Q5_K_M · Q6_K · Q8_0
