Fine-Tune Qwen 3 with Ertas
Alibaba's latest-generation model family featuring both dense and mixture-of-experts architectures, with sizes from 0.6B to 235B and built-in hybrid thinking modes for adaptive reasoning depth.
Overview
Qwen 3, released by Alibaba in early 2025, represents a significant architectural evolution for the Qwen family. The lineup now includes both dense models (0.6B through 32B) and mixture-of-experts models (30B-A3B and 235B-A22B), offering unprecedented flexibility for different deployment scenarios. The MoE variants provide dramatically better quality-to-compute ratios — the 30B-A3B model activates only 3B parameters per token while accessing the knowledge of a 30B model, and the flagship 235B-A22B activates 22B of its 235B total parameters.
A headline feature of Qwen 3 is its hybrid thinking mode, which lets a single model serve both fast direct responses and slower chain-of-thought reasoning. Applications can toggle the behavior per request, so the model spends additional compute only when a task genuinely requires deeper reasoning, optimizing both response quality and inference cost.
Qwen 3 was trained on over 36 trillion tokens, double the dataset size of Qwen 2.5, with expanded coverage to 119 languages. Training proceeds in stages: large-scale general pretraining, a reasoning-heavy pretraining stage, long-context extension, and a multi-stage post-training pipeline that pairs chain-of-thought supervised fine-tuning with reinforcement learning driven by both reward models and rule-based signals.
All Qwen 3 models are released under the Apache 2.0 license. The MoE variants have quickly become popular for production deployments, offering a compelling alternative to running much larger dense models.
Key Features
The hybrid thinking mode is Qwen 3's most innovative feature. When enabled, the model internally generates reasoning traces before producing its final answer on complex questions, similar to dedicated reasoning models like DeepSeek-R1. However, unlike pure reasoning models, Qwen 3 can also respond directly without thinking when the query is straightforward. Users can control this behavior through a thinking budget parameter, setting maximum reasoning token counts or disabling thinking entirely for latency-sensitive applications.
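In practice, the toggle is exposed through the model's chat template. Here is a minimal sketch using Hugging Face transformers (Ertas Studio manages this internally; the model ID and generation settings below are illustrative):

```python
# Minimal sketch: toggling Qwen 3's thinking mode via the chat template.
# Assumes the Hugging Face `transformers` library; the model ID and
# generation parameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# enable_thinking=True lets the model emit a <think>...</think> trace before
# its final answer; set it to False for latency-sensitive, direct responses.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```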
The MoE architecture in the 30B-A3B and 235B-A22B variants routes each token to 8 of 128 fine-grained expert networks per layer. The 30B-A3B model achieves remarkably efficient inference as a result: it runs at approximately the speed of a 3B dense model while delivering quality closer to models in the 14B-32B range. The 235B-A22B flagship similarly runs at roughly 22B-class inference cost while competing with the best open-weight dense models.
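For readers who want to see the mechanism, here is a simplified sketch of top-k expert routing. Real MoE layers add load-balancing losses and fused kernels; the dimensions and names below are illustrative, not Qwen 3's actual implementation:

```python
# Simplified sketch of top-k MoE routing (Qwen 3 activates 8 of 128 experts;
# this toy version uses small numbers). Illustrative only.
import torch
import torch.nn.functional as F

def moe_layer(x, router_weight, experts, k=2):
    """x: [tokens, hidden]; router_weight: [hidden, num_experts];
    experts: list of per-expert feed-forward modules."""
    logits = x @ router_weight                    # [tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)  # pick k experts per token
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e         # tokens routed to expert e
            if mask.any():
                out[mask] += topk_probs[mask, slot].unsqueeze(1) * expert(x[mask])
    return out

# Tiny usage example: 4 experts, each token routed to 2 of them.
hidden, num_experts = 16, 4
experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]
router = torch.randn(hidden, num_experts)
tokens = torch.randn(10, hidden)
y = moe_layer(tokens, router, experts, k=2)       # [10, 16]
```

Only the selected experts run a forward pass for each token, which is why compute tracks the active parameter count rather than the total.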
Multilingual support has been dramatically expanded, with 119 languages represented in the training data. This includes comprehensive support for languages with limited resources in other model families, such as Thai, Vietnamese, Indonesian, Malay, Tagalog, and Swahili.
Fine-Tuning with Ertas
Qwen 3's diverse lineup makes Ertas Studio fine-tuning accessible at every scale. The dense models from 0.6B to 8B can all be fine-tuned with QLoRA on consumer GPUs with 6-16GB VRAM, making them ideal for rapid prototyping and experimentation. The MoE variant 30B-A3B is particularly interesting for fine-tuning: all 30B parameters must still be loaded, but 4-bit quantization shrinks the frozen base weights to roughly 15-17GB and only the small LoRA adapters receive gradients, so QLoRA fine-tuning fits in approximately 18-24GB of VRAM while per-token training compute scales with the 3B active parameters.
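Ertas Studio configures all of this automatically. For readers curious what an equivalent open-source setup looks like, here is a hedged sketch with transformers, peft, and bitsandbytes (model ID, target modules, and hyperparameters are illustrative assumptions, not Ertas defaults):

```python
# Rough open-source equivalent of a QLoRA setup for Qwen3-30B-A3B.
# Ertas Studio handles this internally; everything below is illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                    # NF4-quantize the frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B", quantization_config=bnb, device_map="auto"
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only; expert FFNs stay frozen
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()        # adapters are a tiny fraction of 30B
```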
The hybrid thinking mode creates unique fine-tuning opportunities. In Ertas Studio, you can train on datasets that include explicit reasoning traces, teaching the model when and how to apply extended thinking to domain-specific problems. This is particularly powerful for technical domains like medical diagnosis, legal analysis, or scientific research where showing reasoning steps improves both accuracy and user trust.
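Qwen 3 emits its reasoning between <think> tags, so reasoning-trace training data can follow the same convention. A hypothetical record is shown below; the field names follow the common chat-messages schema, and the content is invented for illustration:

```python
# Hypothetical training sample with an explicit reasoning trace.
# The <think>...</think> convention matches Qwen 3's own output format;
# the example content is invented for illustration.
sample = {
    "messages": [
        {
            "role": "user",
            "content": "A patient presents with fever and a stiff neck. What should be ruled out first?",
        },
        {
            "role": "assistant",
            "content": (
                "<think>Fever plus nuchal rigidity is the classic meningitis "
                "presentation; bacterial meningitis is rapidly fatal if missed, "
                "so it must be excluded before any broader workup.</think>"
                "Bacterial meningitis should be ruled out first, typically via "
                "lumbar puncture after checking for contraindications."
            ),
        },
    ]
}
```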
After training, Ertas Studio exports to GGUF format with full support for MoE quantization. Both Ollama and llama.cpp handle Qwen 3 MoE inference natively, making deployment straightforward. The 30B-A3B variant with QLoRA adapter merged and quantized to Q4_K_M produces a model of approximately 17GB that runs at 3B-class speeds — an exceptional quality-to-resource ratio.
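Once exported, the GGUF file loads directly into either runtime. A minimal sketch using the llama-cpp-python bindings (the file path and settings are illustrative):

```python
# Minimal local-inference sketch for an exported GGUF model using the
# llama-cpp-python bindings. File path and settings are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-30b-a3b-finetuned.Q4_K_M.gguf",
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if available
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize our returns policy."}],
    max_tokens=256,
)
print(result["choices"][0]["message"]["content"])
```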
Use Cases
Qwen 3 is the leading choice for multilingual applications requiring broad language coverage. The 119-language support makes it suitable for global platforms, international customer support systems, and cross-language content processing. The MoE variants are particularly cost-effective for API serving, as they process tokens at small-model speeds while maintaining large-model quality.
The hybrid thinking mode makes Qwen 3 well-suited for applications where reasoning depth varies by query: educational platforms where some questions need step-by-step explanations, technical support systems where some issues require deeper analysis, and research tools where some queries benefit from extended deliberation.
The 30B-A3B variant is an excellent choice for organizations wanting to run a high-quality model on moderate hardware. With only 3B active parameters, it can serve real-time applications with low latency while providing quality that exceeds most 7B-14B dense models. The 235B-A22B flagship targets high-capability applications: complex reasoning, creative generation, expert-level analysis, and agentic workflows.
Hardware Requirements
The dense Qwen 3 models have standard requirements: the 0.6B at Q4_K_M needs about 500MB, the 4B about 2.5GB, the 8B about 5GB, the 14B about 8.5GB, and the 32B about 19GB. These are straightforward to deploy on consumer hardware at smaller sizes and server-class hardware at larger sizes.
The MoE variants require loading all expert weights even though only a subset is active per token. The 30B-A3B at Q4_K_M requires approximately 17-18GB of RAM, runnable on a single 24GB GPU or systems with 32GB RAM. Despite the 30B total parameter count, inference speed is comparable to a 3B dense model. The 235B-A22B at Q4_K_M requires approximately 130-140GB, necessitating multi-GPU setups or large-memory CPU inference.
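These figures follow a simple rule of thumb: Q4_K_M averages roughly 4.85 bits per weight across the total parameter count, MoE models included, since every expert must be resident in memory. A quick sanity-check sketch (treat the bits-per-weight constant as an approximation, not a spec):

```python
# Rough GGUF size estimator. Q4_K_M averages ~4.85 bits per weight in
# practice; the constant is an approximation, not a specification.
def gguf_size_gb(total_params_b: float, bits_per_weight: float = 4.85) -> float:
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("8B", 8), ("14B", 14), ("32B", 32),
                     ("30B-A3B", 30.5), ("235B-A22B", 235)]:
    print(f"Qwen3-{name}: ~{gguf_size_gb(params):.1f} GB at Q4_K_M")
# Prints roughly 4.8, 8.5, 19.4, 18.5, and 142.5 GB, in line with the
# ranges quoted above.
```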
For fine-tuning in Ertas Studio, the 30B-A3B MoE model with QLoRA requires approximately 18-24GB VRAM due to the efficient active parameter count. The dense 8B model needs 8-12GB VRAM, and the dense 14B needs 12-16GB VRAM.