Back to blog
    Fine-Tuning Qwen 2.5 for Multilingual Applications
    qwenmultilingualfine-tuninginternationalslmsegment:developer

    Fine-Tuning Qwen 2.5 for Multilingual Applications

    Qwen 2.5 covers 29 languages with 18 trillion training tokens. Here's how to fine-tune it for multilingual classification, support, and content generation without separate models per language.

    EErtas Team·

    Most open-source language models are English-first. They handle English well, manage a few European languages acceptably, and fall apart on Arabic, Hindi, Chinese, Japanese, Korean, and most of the languages that the majority of the world actually speaks.

    Qwen 2.5 is different. Alibaba trained it on 18 trillion tokens spanning 29 languages, with genuine investment in non-Latin scripts and right-to-left languages. The result is a model family that handles multilingual tasks without the accuracy cliff you see when you push Llama or Mistral into non-English territory.

    This matters for enterprise because your users, customers, and documents aren't all in English. If you're building multilingual support systems, international e-commerce classification, or cross-language document processing, you need a model that was actually trained on those languages — not one that picked up a few million Chinese tokens as an afterthought.

    Here's how to fine-tune Qwen 2.5 for multilingual tasks, including dataset preparation, tokenizer considerations, training strategies, and deployment.

    The Qwen 2.5 Model Family

    Qwen 2.5 comes in eight sizes:

    ModelParametersVRAM (FP16)VRAM (Q5_K_M)Best For
    Qwen 2.5 0.5B500M1.5 GB0.5 GBOn-device, edge
    Qwen 2.5 1.5B1.5B4 GB1.5 GBMobile, browser
    Qwen 2.5 3B3B7 GB2.5 GBLightweight server
    Qwen 2.5 7B7B16 GB5 GBFine-tuning sweet spot
    Qwen 2.5 14B14B30 GB10 GBComplex tasks
    Qwen 2.5 32B32B68 GB22 GBMulti-GPU
    Qwen 2.5 72B72B150 GB48 GBHigh-end server

    The 7B model is the sweet spot for fine-tuning. It's large enough to handle complex multilingual tasks, small enough to fine-tune on a single GPU with QLoRA (5-6 GB VRAM for inference, 10-12 GB for training), and the multilingual capability scales well at this size.

    The 3B model works for simpler tasks (binary classification, intent detection) if you need lower latency or smaller deployment footprint. The 14B model is worth the extra resources if you're dealing with complex reasoning in non-English languages — smaller models tend to lose reasoning capability faster in low-resource languages.

    Why Qwen Wins at Multilingual

    Training Data Distribution

    Most open-source models are trained on 80-90%+ English data. Qwen 2.5's training mix is more balanced:

    • English: ~40%
    • Chinese (Simplified + Traditional): ~25%
    • European languages (German, French, Spanish, Italian, Portuguese, Dutch, Russian, Polish): ~15%
    • Asian languages (Japanese, Korean, Vietnamese, Thai, Indonesian, Malay): ~10%
    • Arabic, Hindi, Bengali, Urdu, Turkish, Persian: ~7%
    • Other (Hebrew, Swahili, Filipino, Ukrainian, Czech): ~3%

    That 40% English vs 60% non-English split means the model has genuinely learned multilingual patterns, not just memorized English with a thin veneer of other languages.

    Tokenizer Design

    Qwen 2.5 uses a 152K vocabulary tokenizer — significantly larger than Llama's 128K. The extra tokens include dedicated subword units for CJK characters, Arabic script, Devanagari, and other non-Latin writing systems. This means:

    • Chinese text uses roughly 1.5 tokens per character (vs 2-3 for Llama)
    • Arabic text uses roughly 1.8 tokens per word (vs 3-4 for Llama)
    • Japanese text uses roughly 1.4 tokens per character (vs 2.5 for Llama)

    Fewer tokens per input means faster inference, lower VRAM usage for the same context length, and more efficient fine-tuning. A 2,000-character Chinese document might be 3,000 tokens in Qwen but 5,500 tokens in Llama. That's a meaningful difference in training cost and inference speed.

    Benchmark Performance

    On multilingual benchmarks, the gap between Qwen 2.5 7B and its competitors is significant:

    BenchmarkQwen 2.5 7BLlama 3.3 8BMistral 7BGemma 3 4B
    MGSM (multilingual math)72.4%61.2%54.8%48.3%
    XL-Sum (multilingual summarization)34.2 ROUGE28.1 ROUGE25.6 ROUGE22.4 ROUGE
    XNLI (cross-lingual NLI)78.6%69.4%65.2%60.1%
    Flores-200 (translation)31.8 BLEU24.2 BLEU21.5 BLEU19.8 BLEU

    On English-only benchmarks, Llama 3.3 8B edges out Qwen 2.5 7B on some tasks. But the moment you move to non-English, the gap opens up fast.

    Fine-Tuning for Multilingual Tasks

    Dataset Preparation

    The most common mistake in multilingual fine-tuning is training on each language separately or creating language-specific adapters. Don't do this. One model, one adapter, all languages mixed together.

    Structure your training data with mixed-language examples in every batch:

    {"instruction": "Classify this support ticket.", "input": "Mi pedido #4521 no ha llegado. Han pasado 10 dias.", "output": "category: shipping\npriority: high\nlanguage: es"}
    {"instruction": "Classify this support ticket.", "input": "I can't log in to my account after the password reset.", "output": "category: authentication\npriority: medium\nlanguage: en"}
    {"instruction": "Classify this support ticket.", "input": "Ich habe zweimal fur dasselbe Produkt bezahlt.", "output": "category: billing\npriority: high\nlanguage: de"}
    {"instruction": "Classify this support ticket.", "input": "Je n'arrive pas a telecharger ma facture.", "output": "category: billing\npriority: low\nlanguage: fr"}
    

    Key principles for multilingual datasets:

    • Mix languages within batches, don't group by language. This forces the model to learn language-agnostic task patterns.
    • Use consistent output format across languages. The output should always be in the same structured format regardless of input language. This makes downstream parsing trivial.
    • Include language detection in the output if it's useful for your pipeline. Qwen handles this naturally.
    • Balance your examples roughly by language. You don't need perfect balance, but don't train on 400 English examples and 20 Arabic examples. Aim for at least 50 examples per language.
    • Total dataset size: 300-500 examples across all languages. That's roughly 30-70 examples per language if you're supporting 7-10 languages.

    Tokenizer Considerations

    When setting your max sequence length for training, account for Qwen's tokenizer efficiency:

    • If your longest input is 500 words in English (~650 tokens), the same content in Chinese might be 400 tokens, but in Arabic it might be 900 tokens.
    • Set max sequence length based on the most verbose language in your dataset, plus 20% buffer.
    • For most multilingual tasks, 2048 tokens is sufficient. Increase to 4096 if you're processing long documents.

    Training Configuration

    Recommended settings for multilingual fine-tuning on Qwen 2.5 7B:

    ParameterValueNotes
    LoRA rank32Higher than English-only (16) to capture multilingual patterns
    Learning rate1.5e-4Slightly lower than monolingual to avoid catastrophic forgetting
    Epochs4-5Multilingual needs slightly more exposure
    Batch size4Auto-adjusted by Ertas
    Max seq length2048Increase for long documents
    Warmup ratio0.05Standard

    The LoRA rank of 32 is important. For English-only tasks, rank 16 is usually sufficient. For multilingual, the adapter needs more capacity to capture language-specific patterns while maintaining cross-lingual transfer. We've seen rank 16 adapters lose 3-5% accuracy on low-resource languages compared to rank 32.

    VRAM Requirements

    ConfigurationVRAM
    QLoRA training (rank 32, 4-bit base)12 GB
    QLoRA training (rank 16, 4-bit base)10 GB
    Inference (Q5_K_M)5 GB
    Inference (Q4_K_M)4.5 GB
    Inference (FP16)16 GB

    You can fine-tune Qwen 2.5 7B on an RTX 3080 12GB or equivalent. Inference at Q5_K_M runs comfortably on any GPU with 6 GB+ VRAM, or on Apple Silicon Macs with 8 GB+ unified memory.

    Use Case: Multilingual Customer Support

    A European SaaS company was running customer support in 6 languages (English, German, French, Spanish, Italian, Portuguese) through GPT-4o. Monthly cost: $2,400 for 45,000 tickets/month. Each ticket needed classification (8 categories), priority assignment, and a suggested response.

    After fine-tuning Qwen 2.5 7B on 480 examples (80 per language):

    MetricGPT-4oFine-tuned Qwen 2.5 7B
    Classification accuracy (all languages)84%93%
    Classification accuracy (German)81%92%
    Classification accuracy (Spanish)83%94%
    Priority accuracy79%91%
    Response relevance (human-rated)4.1/54.3/5
    Average latency1,100ms160ms
    Monthly cost$2,400$44.50

    The fine-tuned model outperformed GPT-4o because it learned the company's specific ticket categories, priority rules, and response templates. GPT-4o had to infer all of this from system prompts, which worked for English but degraded for other languages where prompt interpretation is less reliable.

    Use Case: International E-commerce Classification

    An e-commerce platform listing products from 12 countries needed to classify items into a 45-category taxonomy. Product titles and descriptions arrived in the seller's language — Chinese, Korean, Japanese, English, German, French, Thai, Vietnamese, Indonesian.

    They were using a pipeline of Google Translate + GPT-4o, which introduced translation errors and cost $3,800/month for 120,000 products.

    Fine-tuning Qwen 2.5 7B on 600 examples (removing the translation step entirely):

    MetricTranslate + GPT-4oFine-tuned Qwen 2.5 7B (direct)
    Classification accuracy78%91%
    Processing time per item2.3s0.18s
    Monthly cost$3,800$44.50
    Misclassification rate (CJK languages)28%8%

    The direct approach — no translation, just classify in the original language — was both more accurate and 12x faster. Translation was introducing errors that cascaded into wrong classifications. Qwen understood the original text well enough to classify directly.

    Use Case: Cross-Language Document Processing

    A legal firm processing contracts in English, French, German, and Spanish needed to extract key clauses, dates, party names, and obligations. Documents ranged from 2 to 15 pages.

    Fine-tuning Qwen 2.5 7B on 400 examples of clause extraction across all four languages:

    • Field extraction accuracy: 90% (vs 82% with GPT-4o few-shot)
    • Cross-language consistency: when the same clause type appeared in different languages, the extracted structure was consistent 96% of the time
    • Processing speed: 4-5 pages per minute on an RTX 4070 Ti

    The key advantage was consistency. GPT-4o would sometimes use different field names or structures depending on the input language. The fine-tuned Qwen model always produced the same schema regardless of source language, which simplified the downstream data pipeline.

    Fine-Tuning Walkthrough with Ertas

    1. Prepare Mixed-Language Dataset

    Collect examples from each language you need to support. Format as instruction-input-output JSONL. Ensure rough balance across languages (minimum 40-50 examples per language).

    2. Upload to Ertas

    Upload your JSONL file. Select Qwen 2.5 7B as the base model. Ertas will detect the multilingual content and suggest appropriate training parameters.

    3. Configure Training

    Set LoRA rank to 32, learning rate to 1.5e-4, and epochs to 4. Leave other parameters at defaults unless you have specific requirements.

    4. Train

    A 500-example multilingual fine-tuning job typically completes in 25-35 minutes. Monitor the validation loss — for multilingual training, expect the loss curve to be slightly noisier than monolingual training as the model balances across languages.

    5. Evaluate

    Ertas reports per-language accuracy in the evaluation results. Check that no single language is significantly underperforming. If one language lags, add 20-30 more examples for that language and retrain.

    6. Export and Deploy

    Export as GGUF at Q5_K_M for production. Deploy via Ollama. The same model serves all languages — no need for language detection routing or multiple model instances.

    Performance: Qwen 2.5 7B vs Llama 3.3 8B on Non-English Tasks

    Both models fine-tuned on the same 500-example multilingual classification dataset (8 languages, 10 categories):

    LanguageQwen 2.5 7BLlama 3.3 8B
    English95%96%
    German93%84%
    French94%86%
    Spanish94%87%
    Chinese92%71%
    Arabic89%63%
    Japanese91%68%
    Hindi87%58%
    Average91.9%76.6%

    On English, Llama is marginally better. On every other language, Qwen wins — and the gap is enormous for CJK and Arabic. Llama's accuracy on Hindi (58%) is barely better than random for a 10-category classification task.

    If your application is English-only, either model works. If your application touches any non-English language, Qwen 2.5 is the obvious choice.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    Further Reading

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    Keep reading