Fine-Tune GLM-4.5 with Ertas
Z.ai's July 2025 mixture-of-experts release — 355 billion total parameters with 32 billion active per token, designed to run on 8× Huawei Ascend H20 chips. The workhorse predecessor to the GLM-5 flagship.
Overview
GLM-4.5, released by Z.ai (formerly Zhipu) in July 2025, is the company's most widely deployed open-weight model and the practical workhorse of the GLM family. The 355-billion-parameter mixture-of-experts architecture with 32 billion active parameters per token gives GLM-4.5 strong inference economics — comparable to a 32B dense model — while delivering quality competitive with much larger dense models on most benchmarks.
A notable design constraint: GLM-4.5 was engineered to run on 8× Huawei Ascend H20 chips, making it one of the first frontier-scale open-weight models intentionally targeted at non-NVIDIA training and inference hardware. The model's architecture and quantization recipes are tuned to work efficiently across this alternative hardware path, though deployment on standard NVIDIA infrastructure (vLLM, TensorRT-LLM, etc.) is also fully supported.
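For the NVIDIA path, a minimal vLLM serving sketch might look like the following, assuming your vLLM build supports the GLM-4.5 architecture (the model identifier is the Hugging Face repo; `tensor_parallel_size` should match your GPU count):

```python
from vllm import LLM, SamplingParams

# Shard the 355B checkpoint across 8 GPUs; only ~32B parameters fire per token.
llm = LLM(
    model="zai-org/GLM-4.5",
    tensor_parallel_size=8,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the trade-offs of MoE inference."], params)
print(outputs[0].outputs[0].text)
```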
GLM-4.5 was succeeded as Z.ai's flagship by GLM-4.6 in late 2025 (the variant positioned as a Claude Code alternative) and then by GLM-5 in February 2026 (the 745B scale-up). For deployment-cost-sensitive teams, GLM-4.5 remains a popular choice: its 32B active parameter count delivers substantially better inference economics than GLM-5's dense 745B architecture, even if its peak benchmark scores are lower.
Weights are available on Hugging Face under `zai-org/GLM-4.5`. The model is released under Z.ai's commercial-permissive licensing terms.
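Pulling the checkpoint locally with huggingface_hub is a one-liner; expect several hundred gigabytes of weight shards for the full-precision checkpoint:

```python
from huggingface_hub import snapshot_download

# Downloads all weight shards plus tokenizer and config files.
local_dir = snapshot_download(repo_id="zai-org/GLM-4.5")
print(f"Checkpoint available at {local_dir}")
```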
Key Features
MoE architecture with 32B active parameters delivers production-friendly inference economics. Generation throughput on standard inference frameworks runs at approximately 32B-class speeds, well within the operating range of mid-tier server hardware. For high-throughput API serving where per-token cost matters, this is a meaningful advantage over dense alternatives at equivalent quality.
The 8× Huawei Ascend H20 deployment target is a notable architectural detail. GLM-4.5 is one of the few frontier-scale open-weight models with documented optimization for non-NVIDIA inference infrastructure. For teams in regions where Ascend deployment is preferred or required, this provides a clear deployment path.
GLM-4.5's core strengths hold up. It delivers competitive performance on coding (the foundation the GLM-4.6 Claude Code-alternative variant builds on), reasoning, and instruction-following workloads. While no longer at the absolute frontier of open-weight quality in 2026, GLM-4.5 remains a credible production choice for the right deployment shape.
Broad commercial-permissive licensing combined with the active 32B parameter inference profile makes GLM-4.5 well-suited for cost-sensitive production serving — particularly in scenarios where team familiarity with Z.ai's stack or regional ecosystem advantages weigh into the decision.
Fine-Tuning with Ertas
GLM-4.5's 32B active parameter MoE architecture makes it relatively accessible to fine-tune in Ertas Studio. QLoRA fine-tuning needs roughly 100-160GB of total VRAM at typical sequence lengths, which fits across two 80GB GPUs with model parallelism. This is substantially more accessible than fine-tuning GLM-5's dense 745B architecture, which requires multi-GPU server scale.
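Ertas Studio manages this setup internally; for intuition, a roughly equivalent QLoRA configuration with the open-source transformers and peft stack is sketched below. The `target_modules` names are assumptions about GLM-4.5's layer naming, not confirmed identifiers:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base weights cut the resident footprint of the 355B checkpoint
# roughly 4x versus bf16, which is what makes QLoRA feasible here.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-4.5",
    quantization_config=bnb_config,
    device_map="auto",  # shards layers across all visible GPUs
    trust_remote_code=True,
)

# Adapting only the attention projections leaves the expert router untouched,
# one common way to keep MoE routing stable under low-rank adaptation.
# Module names below are assumed, not read from the GLM-4.5 config.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapter weights are a tiny fraction of 355B
```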
For the MoE architecture specifically, Ertas Studio automatically manages expert routing stability during low-rank adaptation. Multi-turn conversations, tool-use traces, and reasoning examples are all supported natively as training data.
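As an illustration, one common messages-style record for multi-turn data with a tool-use trace is shown below; the field names are illustrative, not Ertas Studio's documented schema:

```python
import json

# One illustrative training record; write one JSON object per line (JSONL).
sample = {
    "messages": [
        {"role": "user", "content": "Convert 200GB to GiB."},
        {"role": "assistant",
         "tool_call": {"name": "calculator", "arguments": {"expr": "200e9 / 2**30"}}},
        {"role": "tool", "name": "calculator", "content": "186.26"},
        {"role": "assistant", "content": "200GB is about 186.26 GiB."},
    ]
}
print(json.dumps(sample))
```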
After training, Ertas Studio exports GLM-4.5 fine-tunes to GGUF format. The Q4_K_M quantized model is approximately 200GB, fitting on a multi-GPU server (4× A100 80GB or 4× H100 80GB) with margin. For teams running on Huawei Ascend infrastructure, alternative quantization formats optimized for that hardware are also supported.
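The exported GGUF file runs on any llama.cpp-compatible runtime; a minimal llama-cpp-python sketch, with an illustrative file name, looks like this:

```python
from llama_cpp import Llama

# Load the exported Q4_K_M fine-tune; n_gpu_layers=-1 offloads every layer to GPU.
llm = Llama(
    model_path="glm-4.5-finetune-Q4_K_M.gguf",  # illustrative path
    n_gpu_layers=-1,
    n_ctx=8192,
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello from the fine-tune."}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```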
Use Cases
GLM-4.5 is the practical workhorse for teams adopting the Z.ai ecosystem, particularly in regions where Z.ai's support and ecosystem advantages are strongest. The 32B active parameter inference economics make it well-suited to production API serving where per-token cost matters more than peak benchmark scores.
For teams running on Huawei Ascend infrastructure, GLM-4.5's documented optimization for that deployment target is a meaningful advantage over models tuned primarily for NVIDIA hardware. Alternative-accelerator deployment patterns are increasingly relevant for supply-chain diversity and regional preferences.
Production serving of customer-facing chatbots, document analysis pipelines, and content generation workloads all benefit from GLM-4.5's combination of strong quality and reasonable inference economics. While GLM-5 delivers higher peak quality, GLM-4.5 often delivers better total cost of ownership for high-throughput deployments.
Hardware Requirements
GLM-4.5 at Q4_K_M quantization requires approximately 200GB of memory, fitting on a 4× A100 80GB or 4× H100 80GB server, or a CPU inference host with 384GB+ RAM. Once loaded, the 32B active parameter count, not the 355B total, determines token generation throughput.
For smaller deployments, Q3_K_M quantization (approximately 150GB) trades modest quality for reduced memory, fitting on a 2× H100 80GB configuration (tight once KV cache is accounted for) or a 3× A100 80GB configuration.
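As a back-of-envelope check on these figures, assume effective rates of roughly 4.5 bits/weight for Q4_K_M and 3.4 bits/weight for Q3_K_M (real GGUF files vary by tensor mix, so treat these as illustrative):

```python
TOTAL_PARAMS = 355e9  # every expert stays resident in memory, not just the 32B active

def gguf_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Estimate quantized weight memory in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

for name, bpw in [("Q4_K_M", 4.5), ("Q3_K_M", 3.4)]:
    print(f"{name}: ~{gguf_footprint_gb(TOTAL_PARAMS, bpw):.0f}GB")
# Prints ~200GB and ~151GB; KV cache and activations come on top.
```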
For fine-tuning in Ertas Studio: GLM-4.5 QLoRA needs approximately 100-160GB of total VRAM, fitting across two 80GB GPUs with model parallelism at typical sequence lengths. The 32B active parameter MoE architecture makes this substantially more accessible than fine-tuning GLM-5 directly.
Supported Quantizations
Q4_K_M (GGUF): approximately 200GB; fits on a 4× A100 80GB or 4× H100 80GB server with margin.
Q3_K_M (GGUF): approximately 150GB; trades modest quality for reduced memory.
Alternative quantization formats optimized for Huawei Ascend hardware are also supported for teams on that deployment path.