Fine-Tune Qwen3-Omni with Ertas

    Alibaba's omni-modal model — accepting text, image, audio, and video input and producing text plus realtime speech output in a single 30B-A3B mixture-of-experts checkpoint. Apache 2.0.

    Overview

    Qwen3-Omni is Alibaba's omni-modal flagship within the Qwen 3 family — a single 30B-A3B mixture-of-experts checkpoint that accepts text, image, audio, and video as input and produces both text and realtime speech as output. This unified architecture is unusual in the open-weight ecosystem, where most multimodal models handle one or two non-text modalities and require external bolt-on TTS systems for speech output. Qwen3-Omni handles the full spectrum natively.

    The model ships in three task-specific variants: Qwen3-Omni-Instruct (general instruction following across all modalities), Qwen3-Omni-Thinking (a reasoning-focused variant for complex multimodal queries), and Qwen3-Omni-Captioner (specialized for caption generation across image, audio, and video). All three are released under Apache 2.0. Qwen3-Omni was later followed by Qwen3.5-Omni (Plus, Flash, and Light variants, released March 30, 2026), which extended the architecture to additional sizes and improved benchmark performance.

    The 3B active parameter count gives Qwen3-Omni outstanding inference economics for an omni-modal model — token generation runs at speeds comparable to a 3B dense model on standard frameworks. Combined with Apache 2.0 licensing and broad capability, Qwen3-Omni is among the strongest open-weight choices for multimodal applications without the operational overhead of stitching together separate vision, audio, and TTS systems.

    Key Features

    Native omni-modal input is the headline capability. Where most multimodal models accept one or two extra modalities (typically vision plus text), Qwen3-Omni handles text, image, audio, and video natively in the same checkpoint. This eliminates the architectural complexity of separate model deployments for each modality and produces more coherent reasoning across modalities — the model can correlate spoken language with on-screen visuals, or image content with embedded audio, in ways that fragmented pipelines handle poorly.
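
    As a concrete illustration, here is the message layout Qwen's multimodal model cards use for mixed-modality requests (a minimal sketch; the file paths are placeholders, and exact field names can vary between library versions):

        # One user turn interleaving image, audio, video, and text in the
        # list-of-content-parts format used by Qwen's multimodal models.
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": "file:///data/slide.png"},
                    {"type": "audio", "audio": "file:///data/narration.wav"},
                    {"type": "video", "video": "file:///data/clip.mp4"},
                    {"type": "text", "text": "Summarize the slide and check "
                                             "that the narration matches the video."},
                ],
            }
        ]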

    Realtime speech output is unusual in open-weight releases. Qwen3-Omni produces speech alongside text without a separate TTS deployment, simplifying the architecture for voice-interface applications. On audio-specific benchmarks, the model has been reported to beat Gemini on some tasks despite the 3B active parameter count.

    The 30B-A3B MoE architecture gives Qwen3-Omni strong inference economics. With 3B active parameters per token, generation runs at small-model speeds while the 30B total parameter capacity delivers quality competitive with larger dense multimodal models. For production omni-modal serving where token-cost matters, this is a meaningful advantage.
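
    Back-of-the-envelope arithmetic makes the point (a rough sketch; real throughput also depends on hardware, batching, and framework): memory to hold the model scales with total parameters, while per-token weight traffic scales with active parameters.

        # MoE economics in two numbers: resident memory tracks TOTAL
        # parameters; per-token compute/weight reads track ACTIVE parameters.
        total_params = 30e9    # all experts must stay loaded
        active_params = 3e9    # experts actually routed per token

        bytes_per_weight = 0.5  # ~4-bit (Q4-class) quantization
        resident_gb = total_params * bytes_per_weight / 1e9
        per_token_gb = active_params * bytes_per_weight / 1e9

        print(f"resident weights: ~{resident_gb:.0f} GB")    # ~15 GB
        print(f"read per token:   ~{per_token_gb:.1f} GB")   # 3B-dense-like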

    Apache 2.0 licensing combined with the unified architecture makes Qwen3-Omni well-suited for commercial deployment in voice-interface applications, accessibility tools, multimodal content moderation, and similar use cases where the operational simplicity of a single model checkpoint is valuable.

    Fine-Tuning with Ertas

    Qwen3-Omni is supported in Ertas Studio's fine-tuning pipeline with multimodal training data formats. QLoRA fine-tuning fits on a 24GB GPU at typical sequence lengths, since the full 30B base weights are frozen in 4-bit precision and only the small LoRA adapters are trained; longer multimodal sequences (interleaving text, image, and audio tokens) push memory requirements higher.
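
    For intuition about why this fits, the sketch below shows the standard Hugging Face peft + bitsandbytes QLoRA recipe that a pipeline like this typically wraps. This is not Ertas Studio's internal API; the Hub model ID and the LoRA target modules are assumptions, and an omni-modal checkpoint may need its dedicated multimodal model class rather than AutoModelForCausalLM.

        import torch
        from transformers import AutoModelForCausalLM, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model

        # 4-bit NF4 quantization freezes and compresses all 30B base
        # weights, so their footprint is roughly 0.5 bytes per parameter.
        bnb = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )

        model = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen3-Omni-30B-A3B-Instruct",  # assumed Hub ID
            quantization_config=bnb,
            device_map="auto",
        )

        # Only the LoRA adapters receive gradients; the target modules
        # below are a typical choice for Qwen-style attention layers.
        lora = LoraConfig(
            r=16,
            lora_alpha=32,
            lora_dropout=0.05,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora)
        model.print_trainable_parameters()  # a fraction of a percent of 30B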

    For fine-tuning, Ertas Studio supports interleaved multimodal training data: text prompts paired with images, audio clips, and video frames as needed for your specific use case. This is particularly valuable for domain-specific applications — fine-tuning on medical imaging with paired clinical notes, technical documentation with embedded diagrams and audio explanations, or industry-specific video content with transcripts.
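
    A single interleaved training record could look like the following (a hypothetical sketch; Ertas Studio's actual schema is not documented here, and every field name and path is illustrative):

        # One training example pairing a clinical image and a spoken note
        # with a target completion. All paths and keys are placeholders.
        sample = {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": "scans/chest_0142.png"},
                        {"type": "audio", "audio": "notes/chest_0142.wav"},
                        {"type": "text", "text": "Draft the findings section."},
                    ],
                },
                {
                    "role": "assistant",
                    "content": [{"type": "text", "text": "Findings: ..."}],
                },
            ]
        }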

    After training, Ertas Studio exports Qwen3-Omni fine-tunes to GGUF format with multimodal projector preservation. Deployment via vLLM (with multimodal support enabled) is recommended for production serving; Ollama also has growing support for omni-modal Qwen variants.
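
    A minimal serving sketch using vLLM's offline multimodal API follows. The checkpoint ID, the per-prompt modality limit, and the placeholder tokens in the prompt are assumptions; in practice the prompt string should be rendered with the model processor's chat template, and omni-modal support may require a recent vLLM release.

        from vllm import LLM, SamplingParams
        from PIL import Image

        llm = LLM(
            model="Qwen/Qwen3-Omni-30B-A3B-Instruct",  # assumed Hub ID
            limit_mm_per_prompt={"image": 1},          # cap images per request
            max_model_len=8192,
        )

        # The vision placeholder tokens below follow the Qwen convention;
        # render real prompts with the processor's chat template instead.
        prompt = (
            "<|im_start|>user\n"
            "<|vision_start|><|image_pad|><|vision_end|>"
            "Describe this slide.<|im_end|>\n"
            "<|im_start|>assistant\n"
        )

        outputs = llm.generate(
            {"prompt": prompt, "multi_modal_data": {"image": Image.open("slide.png")}},
            SamplingParams(max_tokens=256, temperature=0.7),
        )
        print(outputs[0].outputs[0].text)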

    Use Cases

    Voice-interface applications are a natural fit for Qwen3-Omni. Customer service chatbots that handle both voice and text, accessibility applications that combine visual and auditory input, and voice-driven productivity assistants all benefit from the unified speech input/output capability without separate TTS deployment.

    Multimodal content moderation is another strong use case. Platforms moderating user-generated content (which mixes text, images, audio, and video) can use Qwen3-Omni to apply consistent moderation logic across all modalities in a single model rather than separate vision, audio, and text moderation systems.

    For accessibility applications — transcription, captioning, image description, multimodal search — Qwen3-Omni's combination of capabilities and efficient inference makes it well-suited to deployment in browser-based or edge-deployed assistive technologies.

    Hardware Requirements

    Qwen3-Omni at Q4_K_M requires approximately 18-20GB of memory (all expert weights loaded). A 24GB GPU is the deployment sweet spot, fitting both the model and reasonable context with multimodal projectors loaded.

    For multimodal inference specifically, plan for additional memory headroom for image/audio/video preprocessing and projector activations — typically an extra 4-8GB beyond the base model footprint depending on input sequence length.
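
    The headline figures follow from simple arithmetic (a rough estimate; exact footprints vary with the quantization mix and runtime overhead):

        # Q4_K_M stores weights at roughly 4.5-5 bits each on average,
        # since K-quants keep some tensors at higher precision.
        total_params = 30e9
        bits_per_weight = 4.85   # typical effective rate for Q4_K_M
        weights_gb = total_params * bits_per_weight / 8 / 1e9
        print(f"weights: ~{weights_gb:.0f} GB")   # ~18 GB

        low, high = 4, 8   # multimodal preprocessing + projector headroom
        print(f"serving total: ~{weights_gb + low:.0f}-{weights_gb + high:.0f} GB")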

    For fine-tuning in Ertas Studio: Qwen3-Omni QLoRA needs 22-32GB VRAM at typical multimodal sequence lengths. Pure text fine-tuning fits on 24GB; mixed multimodal fine-tuning typically requires 32GB or more depending on the modality mix.

    Supported Quantizations

    Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16
