Fine-Tune Qwen 3.5 with Ertas
Alibaba's February 2026 flagship reasoning release — a 397B-A17B mixture-of-experts model that currently leads open-weight models on the GPQA Diamond benchmark at 88.4, with sibling variants from 0.8B through 122B-A10B. Apache 2.0.
Overview
Qwen 3.5, released by Alibaba in February 2026, is the reasoning-focused successor to the Qwen 3 family and the version that established Alibaba's leadership on graduate-level science benchmarks. The flagship Qwen3.5-397B-A17B currently leads the open-weight GPQA Diamond leaderboard at 88.4, with strong performance across MMLU-Pro (84.9), AIME 2025, and complex code reasoning. The lineup is unusually broad, spanning eight sizes from 0.8B (mobile) to 397B (server flagship), with both dense and mixture-of-experts variants in the mid-tier.
The 35B-A3B MoE variant in particular has become a popular workhorse choice — at ~3B active parameters per token, it serves at small-model speeds while delivering quality competitive with mid-tier dense models. The smaller dense variants (0.8B, 2B, 4B, 9B) extend Qwen 3's already-strong small-model coverage. All variants ship with the unified hybrid thinking mode introduced in Qwen 3, allowing adaptive reasoning depth via a runtime control parameter.
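At request time, the mode is toggled through the chat template. A minimal sketch, assuming Qwen 3.5 keeps the `enable_thinking` flag that Qwen 3's Hugging Face chat template uses (the model path is illustrative):

```python
# Toggling hybrid thinking mode at request time. This sketch assumes
# Qwen 3.5 keeps Qwen 3's `enable_thinking` chat-template flag;
# the model path is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.5-9B"  # illustrative, per the Hugging Face paths below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Why is the sky blue?"}]

# enable_thinking=True lets the model emit a <think>...</think> trace
# before answering; False forces a direct reply for latency-sensitive calls.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```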
Qwen 3.5 was superseded as the Qwen flagship by Qwen 3.6 in April 2026 (which delivers stronger coding performance), but Qwen 3.5 remains the better choice when reasoning capability — particularly on graduate-level science questions — is the primary requirement. The 397B-A17B variant also remains the strongest option in the Qwen family for teams with multi-GPU server infrastructure that can host its larger active parameter count.
All Qwen 3.5 variants are released under Apache 2.0. Weights are available on Hugging Face under the Qwen organization with paths like `Qwen/Qwen3.5-397B-A17B`, `Qwen/Qwen3.5-122B-A10B`, and the smaller dense variants.
Key Features
GPQA Diamond leadership at 88.4 is Qwen 3.5's defining benchmark result. GPQA is a graduate-level science Q&A benchmark designed to be unsolvable through search or shallow knowledge, making strong performance a credible signal of deep reasoning capability. Qwen 3.5's lead here — ahead of every other open-weight flagship at the time of release — is driven by the unified thinking mode plus targeted post-training on graduate-level scientific reasoning data.
The family's parameter range is unusually wide. The 0.8B variant enables on-device deployment patterns that no other 2026 flagship reaches; the 397B-A17B flagship competes with top closed-source models on reasoning benchmarks. This range gives architectural flexibility — teams can use the same family across mobile, desktop, and server deployments while maintaining consistent prompting conventions and tool-use behavior.
The MoE variants (35B-A3B and 122B-A10B) use fine-grained expert routing similar to Qwen3-Next. The 35B-A3B in particular serves at 3B-class inference speeds while delivering quality closer to 14B-32B dense models — making it one of the most efficient mid-tier deployment options available.
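To see why the active parameter count stays so small, here is a schematic top-k routing sketch in PyTorch. This is the generic MoE mechanism, not Qwen's actual implementation: each token's FFN pass touches only k of the E experts, so per-token compute scales with active rather than total parameters.

```python
# Schematic top-k MoE routing (illustrative, not Qwen's real code).
import torch
import torch.nn.functional as F

def moe_forward(x, gate, experts, k=8):
    """x: (tokens, d_model); gate: Linear(d_model, num_experts)."""
    scores = F.softmax(gate(x), dim=-1)                  # (tokens, E) routing probs
    topk_scores, topk_idx = scores.topk(k, dim=-1)       # keep only k experts/token
    topk_scores = topk_scores / topk_scores.sum(-1, keepdim=True)  # renormalize

    out = torch.zeros_like(x)
    for slot in range(k):                                # naive dispatch loop
        idx = topk_idx[:, slot]
        for e, expert in enumerate(experts):
            mask = idx == e
            if mask.any():
                out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Example: 64 experts, 8 active per token -> ~1/8 of expert compute per token.
d, E = 512, 64
gate = torch.nn.Linear(d, E)
experts = [torch.nn.Sequential(torch.nn.Linear(d, 4 * d), torch.nn.GELU(),
                               torch.nn.Linear(4 * d, d)) for _ in range(E)]
y = moe_forward(torch.randn(10, d), gate, experts)
```

Production implementations batch this dispatch rather than looping, but the memory implication is the same: all E experts must be resident, even though only k run per token.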
Qwen 3.5 inherits Qwen's broad multilingual capability (119 languages) and native Qwen-Agent integration with MCP, function calling, and code interpreter support out of the box. For agentic workflows requiring strong reasoning quality, Qwen 3.5 with thinking mode enabled is among the strongest open-weight options.
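As an illustration of that integration, the sketch below wires a Qwen 3.5 model to Qwen-Agent's built-in code interpreter tool. The model name and local endpoint are assumptions; the `Assistant` interface follows Qwen-Agent's published usage.

```python
# Sketch of tool use via Qwen-Agent; assumes a Qwen 3.5 model served
# behind an OpenAI-compatible endpoint (model name and URL illustrative).
from qwen_agent.agents import Assistant

bot = Assistant(
    llm={
        "model": "Qwen/Qwen3.5-35B-A3B",             # illustrative model name
        "model_server": "http://localhost:8000/v1",  # assumed local endpoint
        "api_key": "EMPTY",
    },
    function_list=["code_interpreter"],  # one of Qwen-Agent's built-in tools
)

messages = [{"role": "user", "content": "Plot y = x**2 for x in [-5, 5]."}]
responses = []
for responses in bot.run(messages=messages):  # yields intermediate agent steps
    pass
print(responses[-1]["content"])               # final assistant message
```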
Fine-Tuning with Ertas
All Qwen 3.5 variants are well-suited to fine-tuning in Ertas Studio. The smaller dense variants (0.8B, 2B, 4B, 9B) fit on consumer GPUs with 4-12GB VRAM using QLoRA. The 27B dense variant fine-tunes on a single 48GB GPU at full sequence lengths. The 35B-A3B MoE variant is particularly efficient — QLoRA fits on a 24GB GPU thanks to the 3B active parameter count.
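Ertas Studio configures QLoRA internally; for reference, the equivalent setup in the open-source Hugging Face stack (transformers + peft + bitsandbytes) looks roughly like this, with the model path illustrative:

```python
# Standard QLoRA setup shown for illustration; Ertas Studio sets up
# the equivalent internally. Model path is illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 quantization for frozen base weights
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",                     # illustrative path
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the small adapter weights train
```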
The 122B-A10B and 397B-A17B variants require multi-GPU server configurations for QLoRA fine-tuning. For teams without that infrastructure, the recommended pattern is teacher-student distillation: use Qwen3.5-397B as a teacher to generate synthetic reasoning-trace data, then fine-tune a smaller base model (Qwen3.5-27B, Qwen3.5-9B, or even a Qwen 3.5 distilled variant) on that data.
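A minimal sketch of the data-generation half of that pattern, assuming the teacher is served behind an OpenAI-compatible endpoint (the URL, model name, and file name are placeholders):

```python
# Distillation data step: query a served teacher for reasoning traces
# and save them as fine-tuning examples. Endpoint and names are assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts = ["Explain why the sample variance uses n-1 in the denominator."]

with open("distill.jsonl", "w") as f:
    for prompt in prompts:
        reply = client.chat.completions.create(
            model="Qwen/Qwen3.5-397B-A17B",   # teacher, served elsewhere
            messages=[{"role": "user", "content": prompt}],
        )
        f.write(json.dumps({
            "messages": [
                {"role": "user", "content": prompt},
                # Keep the teacher's full output, including any
                # <think>...</think> trace, as the student target.
                {"role": "assistant",
                 "content": reply.choices[0].message.content},
            ]
        }) + "\n")
```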
When fine-tuning Qwen 3.5 for reasoning-heavy use cases, Ertas Studio supports training data formats with explicit thinking-mode traces (`<think>...</think>` tags or equivalent). This preserves the adaptive reasoning behavior in the fine-tuned model rather than letting it collapse into one mode or the other. After training, Ertas Studio exports to GGUF format with full Qwen 3.5 chat-template preservation.
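For illustration, a single traced training example might look like the following; the chat-messages schema shown is the common convention, and Ertas Studio's exact ingestion format may differ.

```python
# Illustrative training example carrying an explicit thinking trace.
# Field names follow the common chat-messages convention; Ertas Studio's
# exact schema may differ.
example = {
    "messages": [
        {"role": "user", "content": "Is 1001 prime?"},
        {
            "role": "assistant",
            "content": (
                "<think>1001 = 7 * 143 = 7 * 11 * 13, so it is composite."
                "</think>No. 1001 factors as 7 x 11 x 13."
            ),
        },
    ]
}
# Mixing traced and untraced examples in the training set helps keep
# both thinking and non-thinking modes working after fine-tuning.
```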
Use Cases
Qwen 3.5 is the strongest open-weight choice for graduate-level scientific reasoning — research assistance, scientific literature analysis, technical content generation, and STEM education applications all benefit from its GPQA Diamond-leading capability. The 397B-A17B variant is particularly well-suited to expert-level analysis tasks where reasoning depth matters more than inference speed.
The mid-tier MoE variants (35B-A3B, 122B-A10B) target production API serving where reasoning capability is needed but inference economics matter. The 35B-A3B in particular is widely deployed for customer support, document analysis, and content generation workloads where quality and speed must both be strong.
The smaller dense variants enable edge and consumer-hardware deployment for reasoning workloads — a 4B Qwen 3.5 with thinking mode enabled is more capable on hard reasoning tasks than 7B dense models without dedicated reasoning training. For mobile and embedded deployment of reasoning capability, Qwen 3.5's small variants are competitive with anything else in the open-weight ecosystem.
Hardware Requirements
Qwen 3.5 small dense variants at Q4_K_M: 0.8B ≈ 700MB, 2B ≈ 1.5GB, 4B ≈ 2.5GB, 9B ≈ 5.5GB. The 27B dense variant requires approximately 16GB at Q4_K_M, fitting on a single 24GB GPU.
The 35B-A3B MoE at Q4_K_M needs approximately 20GB (all expert weights must be loaded), runnable on a 24GB GPU. The 122B-A10B at Q4_K_M needs approximately 65GB, fitting on an 80GB GPU or split across two 48GB GPUs. The 397B-A17B at Q4_K_M needs approximately 220GB, requiring multi-GPU server deployment (4x A100 80GB or 4x H100 80GB).
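These figures follow from a simple rule of thumb: Q4_K_M averages roughly 4.5-5 bits per weight across a model's tensors, and for MoE models every expert counts toward the footprint even though few are active per token. A quick back-of-envelope check:

```python
# Rough GGUF size estimate: Q4_K_M averages ~4.5-5 bits per weight
# (a rule of thumb, not an exact figure; real files add metadata).
def q4_k_m_size_gb(params_billion, bits_per_weight=4.5):
    # billions of params * bits / 8 bits-per-byte = gigabytes
    return params_billion * bits_per_weight / 8

for name, params in [("9B", 9), ("35B-A3B", 35),
                     ("122B-A10B", 122), ("397B-A17B", 397)]:
    print(f"{name}: ~{q4_k_m_size_gb(params):.0f} GB")
# -> 9B: ~5 GB, 35B-A3B: ~20 GB, 122B-A10B: ~69 GB, 397B-A17B: ~223 GB,
#    in line with the figures above. Total, not active, parameters matter.
```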
For fine-tuning in Ertas Studio: small dense variants need 4-12GB VRAM, the 27B needs 32-40GB, the 35B-A3B MoE needs 22-28GB (thanks to the low active parameter count), the 122B-A10B needs 80-100GB (multi-GPU), and the 397B-A17B requires multi-GPU server scale similar to DeepSeek V4 Flash fine-tuning.