
    Curating Training Data for Phi-4 and Qwen 2.5: What Enterprise Teams Need to Know

    Phi-4 and Qwen 2.5 have different tokenizers, context windows, and training data biases. Your fine-tuning dataset needs to account for these differences. Here's what to watch for with each model.

    Ertas Team

    "Just fine-tune it on our data" is a sentence that skips over a critical detail: which model you're fine-tuning determines how you should prepare the data. Phi-4 and Qwen 2.5 — two of the most capable open-weight models for enterprise fine-tuning in 2026 — have different architectures, different tokenizers, different context windows, and different training data biases. A dataset optimized for Phi-4 may perform poorly with Qwen 2.5, and vice versa.

    This is not about which model is "better." Both are excellent. It's about understanding the model-specific considerations that affect data preparation so your fine-tuning dataset matches what the model expects.

    Phi-4: What You Need to Know

    Background

    Phi-4 is Microsoft's 14B parameter model, released in late 2024 and refined through early 2025. It represents the culmination of Microsoft's "small models, high-quality data" research philosophy. Phi-4 was trained heavily on synthetic data generated by larger models and curated textbook-style content.

    Strengths to Leverage

    Reasoning and math. Phi-4 scores 80.4 on the MATH benchmark and 56.1 on GPQA, competitive with models 4-5x its size. The model was specifically trained on multi-step reasoning and mathematical problem-solving data.

    Data preparation implication: Your fine-tuning data should include chain-of-thought examples where appropriate. If your task involves any form of reasoning (document analysis, classification with explanation, structured extraction requiring inference), format your output to include the reasoning steps, not just the final answer.

    Example — instead of:

    Input: "The contract specifies 90-day payment terms with a 2% early payment discount."
    Output: {"payment_terms": "net-90", "discount": "2% early payment"}
    

    Use:

    Input: "The contract specifies 90-day payment terms with a 2% early payment discount."
    Output: {"reasoning": "The clause states '90-day payment terms' indicating net-90. The '2% early payment discount' is a standard incentive for payment before the net-90 deadline.", "payment_terms": "net-90", "discount": "2% early payment"}
    

    The reasoning field leverages Phi-4's trained strength. Including it improves the model's accuracy on the extraction task itself, even if you discard the reasoning at inference time.

    Structured output. Phi-4's synthetic training data included heavy exposure to structured formats. The model handles JSON, YAML, and tabular output well out of the box. Fine-tuning on structured output tasks builds on an existing strength.

    Limitations to Account For

    English-dominant training. Phi-4 was trained primarily on English-language data. While it can handle other languages, its performance drops noticeably for non-English text. If your enterprise documents are in German, French, Japanese, or other languages, Phi-4 is not the optimal choice — or you'll need to over-represent non-English examples in your fine-tuning data.

    Data preparation implication: If you're fine-tuning Phi-4 for multilingual tasks, include 2-3x more examples per non-English language compared to English. The model needs extra signal to overcome its English-language bias.
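    A small sketch of that check, assuming each example carries a "lang" tag (the tag name and the example list are illustrative, not a required schema):

    from collections import Counter

    # Illustrative sketch: compare per-language example counts against English.
    # The guidance above: roughly 2-3x more examples per non-English language.
    examples = [
        {"lang": "en", "input": "...", "output": "..."},
        {"lang": "de", "input": "...", "output": "..."},
        {"lang": "fr", "input": "...", "output": "..."},
    ]

    counts = Counter(ex["lang"] for ex in examples)
    for lang, n in counts.items():
        if lang != "en":
            print(f"{lang}: {n} examples, {n / counts['en']:.1f}x the English count")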

    Context window: 16K tokens. Phi-4's effective context window is 16,384 tokens. This is adequate for many enterprise tasks but constrains the length of input documents you can process in a single pass.

    Data preparation implication: Ensure no training example exceeds 16K tokens (input + output combined). If your production documents are longer, you'll need to chunk them before processing. Your training data should include chunked examples that mirror how production inputs will be formatted.
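    A minimal sketch of that length check, assuming the Hugging Face transformers library and examples stored as input/output pairs (microsoft/phi-4 is Phi-4's published checkpoint; the count below ignores the handful of extra tokens the chat template adds):

    from transformers import AutoTokenizer

    # Flag training examples whose combined input + output exceeds Phi-4's 16K window.
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
    MAX_TOKENS = 16_384

    def exceeds_context(example: dict) -> bool:
        total = len(tokenizer.encode(example["input"])) + len(tokenizer.encode(example["output"]))
        return total > MAX_TOKENS

    examples = [{"input": "The contract specifies 90-day payment terms...",
                 "output": '{"payment_terms": "net-90"}'}]
    too_long = [ex for ex in examples if exceeds_context(ex)]
    print(f"{len(too_long)} of {len(examples)} examples exceed the 16K limit")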

    Template Format

    Phi-4 uses the ChatML template format:

    <|im_start|>system
    You are a helpful assistant.<|im_end|>
    <|im_start|>user
    {input}<|im_end|>
    <|im_start|>assistant
    {output}<|im_end|>
    

    Your training data must use this exact template. Using Llama's template format ([INST]...[/INST]) or Mistral's format will confuse the model and degrade performance. This is one of the most common fine-tuning mistakes — using the wrong chat template for the target model.
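    If you assemble training files programmatically, the safest way to avoid this mistake is to let the tokenizer apply the chat template the checkpoint ships with rather than hand-writing the tags. A minimal sketch, assuming the transformers library:

    from transformers import AutoTokenizer

    # apply_chat_template formats messages with the model's own template,
    # so you can't accidentally mix in another model's tags.
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "The contract specifies 90-day payment terms..."},
        {"role": "assistant", "content": '{"payment_terms": "net-90"}'},
    ]
    formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    print(formatted)  # ChatML-style <|im_start|> ... <|im_end|> markers, not [INST] tags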

    Qwen 2.5: What You Need to Know

    Background

    Qwen 2.5 is Alibaba's model family, available in sizes from 0.5B to 72B. The most commonly fine-tuned variants are the 7B and 14B versions. Qwen 2.5 was trained on 18 trillion tokens across 29 languages, making it one of the most multilingual open models available.

    Strengths to Leverage

    Multilingual capability. Qwen 2.5 supports 29 languages with strong performance in English, Chinese, Japanese, Korean, and European languages. For enterprises operating across multiple languages or dealing with multilingual document collections, Qwen 2.5 is the stronger choice.

    Data preparation implication: Include examples in all languages your production system will encounter. Unlike Phi-4, Qwen 2.5 doesn't need extra non-English examples to compensate for training bias — it handles multilingual input natively. You can even include language-mixed examples (e.g., documents with English headings and German body text) if that matches your production data.

    Extended context with YaRN. Qwen 2.5 supports up to 128K token context through YaRN (Yet another RoPE extensioN) scaling. This means you can process much longer documents in a single pass compared to Phi-4's 16K limit.

    Data preparation implication: If your production documents are 20K-100K tokens, Qwen 2.5 lets you process them without chunking. However, training on long-context examples requires more GPU memory. A practical approach: include a mix of standard-length (2K-8K tokens) and long-context (16K-64K tokens) examples. Don't make all examples maximum length — the model needs to handle varied lengths gracefully.
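    One way to sanity-check that mix is to look at the token-length distribution of your examples. An illustrative sketch, assuming the transformers library (Qwen/Qwen2.5-7B-Instruct is one published checkpoint; swap in your target variant):

    from collections import Counter
    from transformers import AutoTokenizer

    # Bucket examples by token length so long-context examples are mixed in
    # rather than dominating the training set.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

    def length_bucket(text: str) -> str:
        n = len(tokenizer.encode(text))
        if n <= 8_000:
            return "standard (<= 8K)"
        if n <= 64_000:
            return "long-context (8K-64K)"
        return "very long (> 64K)"

    examples = ["First training document ...", "Second training document ..."]
    print(Counter(length_bucket(ex) for ex in examples))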

    CJK language support. Qwen 2.5's tokenizer is specifically optimized for Chinese, Japanese, and Korean text. It encodes CJK text at roughly 1.5 characters per token, whereas most English-centric tokenizers need one or more tokens per CJK character. In practice, this means you can fit roughly twice as much CJK text into the same token budget.

    Limitations to Account For

    Qwen's tokenizer produces different token counts. The same English text produces different token counts with Qwen's tokenizer versus Phi-4's. A 1,000-word English passage might be 1,300 tokens with Phi-4's tokenizer and 1,400 tokens with Qwen's. This affects cost estimation, training time, and input length planning.

    Data preparation implication: Tokenize your training data with Qwen's actual tokenizer (available in the transformers library) to get accurate token counts. Don't estimate based on word count or another model's tokenizer.
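    A minimal sketch of that count, assuming the transformers library and a local text file (the file name is a placeholder):

    from transformers import AutoTokenizer

    # Count tokens with Qwen's own tokenizer instead of estimating from word count.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

    with open("sample_contract.txt", encoding="utf-8") as f:
        passage = f.read()

    word_count = len(passage.split())
    token_count = len(tokenizer.encode(passage))
    print(f"{word_count} words -> {token_count} Qwen tokens")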

    Training data biases. Qwen 2.5 was trained on internet-scale data with a heavy share of Chinese-language web content. For some tasks, this manifests as a slight bias toward Chinese internet conventions: date formats (YYYY/MM/DD), number formatting (10,000 written as 1万), and certain phrase structures.

    Data preparation implication: If your output must conform to specific formatting conventions (US date format, Western number formatting), include explicit format requirements in the system prompt and ensure all training examples demonstrate the correct convention.
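    For example, a training example along these lines (the wording of the system message is illustrative, not a required phrasing):

    <|im_start|>system
    You are a contract analysis assistant. Output dates in MM/DD/YYYY format and numbers with Western thousands separators (e.g., 10,000).<|im_end|>
    <|im_start|>user
    Payment of 1万 dollars is due on 2026/03/15.<|im_end|>
    <|im_start|>assistant
    {"amount": "10,000", "due_date": "03/15/2026"}<|im_end|>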

    Template Format

    Qwen 2.5 uses the ChatML template format — the same as Phi-4:

    <|im_start|>system
    You are a helpful assistant.<|im_end|>
    <|im_start|>user
    {input}<|im_end|>
    <|im_start|>assistant
    {output}<|im_end|>
    

    This shared template format means that if you prepare data correctly for one, the template works for the other. The difference is in the tokenization and the model's handling of the content within the template, not the template structure itself.

    Tokenizer Differences: Why They Matter

    Phi-4 and Qwen 2.5 use different tokenizers with different vocabulary sizes and different subword segmentation. This creates practical differences:

    The same text = different token counts. A 10,000-word document might be 13,200 tokens with Phi-4's tokenizer and 14,100 tokens with Qwen's. When planning maximum input lengths, always tokenize with your target model's tokenizer.

    Different handling of special characters. Domain-specific symbols (§ for section references in legal text, ± for tolerances in engineering, µ for micro-units in science) may be tokenized differently. If these symbols carry meaning in your domain, verify that the tokenizer handles them correctly and that they're represented consistently in your training data.
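    A quick way to run that check, assuming the transformers library (the symbols below are examples; substitute whatever carries meaning in your domain):

    from transformers import AutoTokenizer

    # Inspect how each target model's tokenizer splits domain-specific symbols.
    phi_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
    qwen_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

    for text in ["§ 12(b)", "±0.05 mm", "µg/mL"]:
        print(f"{text!r}: Phi-4 -> {phi_tokenizer.tokenize(text)}, "
              f"Qwen -> {qwen_tokenizer.tokenize(text)}")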

    Different efficiency for different scripts. Qwen's tokenizer is more efficient for CJK text; Phi-4's is slightly more efficient for English. This means the effective context window for non-English text is larger with Qwen and smaller with Phi-4.

    The Practical Strategy: Fine-Tune Both, Pick the Winner

    For most enterprise use cases, the optimal approach is not to guess which model will work better — it's to fine-tune both on the same dataset and compare.

    Step 1: Prepare a single high-quality dataset that satisfies the stricter of the two models' requirements. Use the ChatML template format (it works for both). Ensure no example exceeds 16K tokens, since Phi-4's limit is the more restrictive constraint. Include chain-of-thought reasoning where applicable (it benefits Phi-4 and doesn't hurt Qwen).

    Step 2: Fine-tune both models on the same dataset with comparable hyperparameters. Use the same learning rate schedule, batch size (adjusted for model size), and training duration.

    Step 3: Evaluate both models on the same held-out test set using task-specific metrics. Don't evaluate on generic benchmarks — evaluate on YOUR task with YOUR data.

    Step 4: Pick the winner. For most English-only enterprise tasks, performance is close enough that inference cost and latency become the deciding factors. For multilingual tasks, Qwen 2.5 typically wins. For reasoning-heavy tasks, Phi-4 typically edges ahead.
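    To make Step 3 concrete, here's a minimal evaluation sketch for a structured-extraction task: field-level exact-match accuracy on a held-out set. The held-out examples and model outputs below are placeholders for your own data and inference setup.

    import json

    # Field-level exact-match accuracy: for each held-out example, compare every
    # expected field against the model's JSON output.
    def field_accuracy(predictions, references):
        correct = total = 0
        for pred, ref in zip(predictions, references):
            expected = json.loads(ref)
            try:
                predicted = json.loads(pred)
            except json.JSONDecodeError:
                predicted = {}
            for key, value in expected.items():
                total += 1
                correct += int(predicted.get(key) == value)
        return correct / total

    held_out = [{"input": "The contract specifies 90-day payment terms ...",
                 "output": '{"payment_terms": "net-90"}'}]
    phi4_outputs = ['{"payment_terms": "net-90"}']   # replace with real model responses
    qwen_outputs = ['{"payment_terms": "net-60"}']   # replace with real model responses

    references = [ex["output"] for ex in held_out]
    print("Phi-4:", field_accuracy(phi4_outputs, references))
    print("Qwen 2.5:", field_accuracy(qwen_outputs, references))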

    This dual fine-tuning approach costs 2x the compute for training (typically a few hundred dollars difference for SLM-sized models) but eliminates guesswork. The cost of choosing the wrong model and discovering it after deployment is far higher than the cost of two training runs.

    Common Mistakes

    Using Llama-format templates for non-Llama models. Llama's [INST]...[/INST] template is not universal. Both Phi-4 and Qwen use ChatML. Using the wrong template produces a model that works but performs 5-15% worse than it should — a subtle failure that's hard to diagnose.

    Estimating token counts with the wrong tokenizer. If you plan your input lengths using GPT-4's tokenizer but train on Qwen, your estimates are wrong. Always use the target model's tokenizer for planning.

    Ignoring model-specific strengths. Fine-tuning Phi-4 without chain-of-thought examples leaves performance on the table. Fine-tuning Qwen 2.5 with English-only data when your production data includes other languages wastes the model's multilingual capability.

    Over-optimizing for one model. If you prepare data specifically for Phi-4's strengths (English, reasoning, 16K context), you've made the dataset less portable. Unless you're certain about your model choice, prepare data that works well for both.

    Ertas Data Suite handles model-specific data formatting automatically. Select your target model (Phi-4, Qwen 2.5, Llama, Mistral, or others) and the platform applies the correct chat template, validates token counts against the model's context window, and flags examples that exceed length limits. For teams fine-tuning multiple models, the same labeled dataset can be exported in different model-specific formats without re-labeling — change the export target, and the formatting adjusts automatically.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
