    Fine-Tuning Phi-4: Microsoft's Best Small Model for Enterprise Tasks

    Phi-4 14B outperforms GPT-4 on math benchmarks while running 15x faster on local hardware. Here's how to fine-tune it for classification, extraction, and structured output tasks.

Ertas Team

Microsoft's Phi-4 is a 14B parameter model that scores 84.8% on the MATH benchmark, edging out GPT-4's 84.3% on the same test. That means a model small enough to run on a single consumer GPU is outperforming a vastly larger frontier model on mathematical reasoning.

But the real story isn't the benchmark numbers. It's what happens when you fine-tune Phi-4 for enterprise tasks: classification, extraction, structured output, and domain-specific reasoning. Phi-4 was built on a philosophy of data quality over data quantity: Microsoft trained it on carefully curated synthetic and filtered web data rather than brute-forcing with trillions of tokens. That design philosophy makes it exceptionally responsive to fine-tuning.

    Here's the complete guide to fine-tuning Phi-4 for your enterprise workloads, including VRAM requirements, quantization options, training configurations, and benchmark comparisons.

    Why Phi-4 for Enterprise

    Phi-4 sits in a unique position in the model landscape. At 14B parameters, it's larger than the 7B models that dominate the fine-tuning space but significantly smaller than the 70B+ models that require multi-GPU setups. This middle ground matters for enterprise deployments.

    The model's strengths map directly to enterprise tasks:

    • Mathematical reasoning: 84.8% on MATH, 93.2% on GSM8K. If your task involves numbers — financial calculations, statistical analysis, metric computation — Phi-4 handles it with surprising accuracy.
• Structured output: Phi-4 generates valid JSON, XML, and other structured formats more reliably than most models its size. In our testing, it achieves 96% JSON schema compliance out of the box, compared to 89% for Llama 3.1 8B and 91% for Qwen 2.5 7B.
    • Instruction following: The model tracks multi-part instructions well. When you say "extract these 5 fields, format as JSON, and flag any missing values," it does exactly that without dropping steps.
    • Code generation: Strong performance on HumanEval (82.6%) makes it useful for code-related enterprise tasks like log parsing, regex generation, and data transformation scripts.

Where Phi-4 is weaker: creative writing, very long-form generation (it starts to lose coherence past 2,000 tokens of output), and some non-English languages. For multilingual tasks, Qwen 2.5 is a better base model. For pure text generation, Llama 3.1 produces more natural prose.

    Hardware Requirements

    Inference

| Quantization | Model Size | VRAM Required | Tokens/sec (RTX 4090) | Tokens/sec (RTX 3090) |
|---|---|---|---|---|
| FP16 | 28 GB | ~32 GB | 45 t/s | 32 t/s |
| Q8_0 | 15 GB | ~18 GB | 62 t/s | 44 t/s |
| Q5_K_M | 10 GB | ~12 GB | 78 t/s | 55 t/s |
| Q4_K_M | 8.5 GB | ~10 GB | 89 t/s | 63 t/s |
| Q4_0 | 8 GB | ~9.5 GB | 94 t/s | 66 t/s |

    At Q5_K_M, you get near-FP16 quality with a 10 GB footprint. That fits comfortably on an RTX 4070 Ti or any card with 12 GB+ VRAM. For enterprise deployments where you're running inference on a dedicated server, Q5_K_M is the sweet spot — the quality loss compared to FP16 is under 1% on most benchmarks.

    At Q4_K_M, the model fits in under 10 GB VRAM. Quality drops slightly more (1.5-2% on reasoning benchmarks), but for classification and extraction tasks where accuracy doesn't depend on nuanced reasoning, Q4_K_M performs nearly identically to FP16.

    Fine-Tuning

    Fine-tuning the full model in FP16 requires about 56 GB of VRAM — that's multi-GPU territory. But you don't need full fine-tuning.

    QLoRA requirements:

| Configuration | VRAM Required | Training Speed (500 examples) |
|---|---|---|
| QLoRA (rank 16, 4-bit base) | 12 GB | ~35 minutes |
| QLoRA (rank 32, 4-bit base) | 14 GB | ~42 minutes |
| QLoRA (rank 64, 4-bit base) | 16 GB | ~55 minutes |
| LoRA (rank 16, FP16 base) | 34 GB | ~25 minutes |

    For most enterprise tasks, QLoRA with rank 16 or 32 is sufficient. Rank 16 is enough for classification and extraction. Bump to rank 32 if you're fine-tuning for generation tasks where output diversity matters.

    The 12 GB minimum for QLoRA means you can fine-tune Phi-4 on an RTX 4070 Ti, RTX 3080 12GB, or any cloud GPU with 12 GB+ VRAM. That's a $0.40/hour spot instance on most cloud providers.
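
To make the recommended setup concrete, here's a minimal QLoRA configuration sketch using the Hugging Face transformers, peft, and bitsandbytes stack. This is a sketch under stated assumptions, not what Ertas runs internally: the target_modules list is an assumption based on Phi-family layer names, and microsoft/phi-4 is the public Hugging Face checkpoint.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # 4-bit NF4 quantization of the base weights (the "Q" in QLoRA)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-4",                     # public Hugging Face checkpoint
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")

    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16,                                  # rank 16 for classification/extraction
        lora_alpha=32,
        lora_dropout=0.05,
        # Assumption: Phi-4 uses Phi-3-style fused attention/MLP projections
        target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()         # prints the small trainable fraction

Only the low-rank adapter weights train; the 4-bit base stays frozen, which is why the whole job fits in 12 GB.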

    Best Enterprise Use Cases for Phi-4

    Financial Document Processing

    Phi-4's math reasoning makes it strong for financial tasks. After fine-tuning on 400 examples of financial statement extraction, Phi-4 achieved:

    • 96% accuracy extracting line items from income statements
    • 94% accuracy on balance sheet field extraction
    • 98% accuracy on numerical calculations (totals, percentages, YoY changes)

Compare that to Llama 3.1 8B fine-tuned on the same dataset: 91%, 88%, and 89% respectively. The math reasoning gives Phi-4 a clear edge when numbers are involved.

    Classification with Complex Taxonomies

    Enterprise classification often involves 20+ categories with subtle distinctions. Phi-4 handles deep taxonomies better than 7B models because of its stronger reasoning capability.

    On a 32-category support ticket classification task:

| Model | Accuracy | F1 Score |
|---|---|---|
| GPT-4o (few-shot) | 87% | 0.85 |
| Llama 3.1 8B (fine-tuned, 500 examples) | 89% | 0.87 |
| Qwen 2.5 7B (fine-tuned, 500 examples) | 88% | 0.86 |
| Phi-4 14B (fine-tuned, 500 examples) | 94% | 0.93 |

    The gap widens as the number of categories increases. For simple 3-5 category classification, any model works. For complex taxonomies, the extra parameters in Phi-4 help.

    Structured Data Extraction

    Extracting structured data from unstructured text — invoices, contracts, emails, reports — is one of the highest-value enterprise AI tasks. Phi-4's instruction-following ability means it tracks complex extraction schemas reliably.

    After fine-tuning on 300 examples of contract clause extraction (extracting party names, dates, obligations, conditions, and penalties from legal text):

    • Phi-4: 93% field-level accuracy, 97% JSON validity
• Llama 3.1 8B: 86% field-level accuracy, 94% JSON validity
    • Qwen 2.5 7B: 85% field-level accuracy, 93% JSON validity
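
For reproducibility, metrics like field-level accuracy and JSON validity can be scored with a few lines of Python. A minimal sketch, where the field names are illustrative and exact-match comparison is the simplest possible scoring rule:

    import json

    FIELDS = ["party_name", "date", "obligation", "condition", "penalty"]  # illustrative

    def score(predictions, references):
        """predictions: raw model output strings; references: gold-label dicts."""
        valid, correct, total = 0, 0, 0
        for pred, ref in zip(predictions, references):
            try:
                parsed = json.loads(pred)
                valid += 1                       # output parsed as valid JSON
            except json.JSONDecodeError:
                parsed = {}                      # invalid JSON scores zero on all fields
            for field in FIELDS:
                total += 1
                correct += int(parsed.get(field) == ref.get(field))
        return {"json_validity": valid / len(predictions),
                "field_accuracy": correct / total}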

    Code-Adjacent Tasks

Log parsing, error classification, SQL generation from natural language, API response transformation: Phi-4's code training makes it a natural fit for all of these.

    On a log-to-structured-event extraction task (500 training examples):

    • Phi-4: 95% accuracy, 42 t/s at Q5_K_M
• Llama 3.1 8B: 88% accuracy, 58 t/s at Q5_K_M

Phi-4 is slower per token (it has almost twice the parameters) but significantly more accurate. For batch processing where latency isn't critical, the accuracy gain is worth it.

    Fine-Tuning Phi-4 with Ertas

    Step 1: Prepare Your Dataset

    Format your training data as instruction-input-output pairs. For enterprise tasks, this typically looks like:

    {
      "instruction": "Extract the following fields from this invoice text: vendor_name, invoice_number, date, line_items (array), subtotal, tax, total. Return valid JSON.",
      "input": "INVOICE #4892\nFrom: Acme Industrial Supply\nDate: February 14, 2026\n\nWidget A (qty 50) @ $12.00 = $600.00\nWidget B (qty 25) @ $8.50 = $212.50\n\nSubtotal: $812.50\nTax (8.5%): $69.06\nTotal: $881.56",
      "output": "{\"vendor_name\": \"Acme Industrial Supply\", \"invoice_number\": \"4892\", \"date\": \"2026-02-14\", \"line_items\": [{\"description\": \"Widget A\", \"quantity\": 50, \"unit_price\": 12.00, \"total\": 600.00}, {\"description\": \"Widget B\", \"quantity\": 25, \"unit_price\": 8.50, \"total\": 212.50}], \"subtotal\": 812.50, \"tax\": 69.06, \"total\": 881.56}"
    }
    

    Aim for 300-500 examples. For Phi-4 specifically, focus on quality over quantity — the model responds well to clean, consistent training data. 300 high-quality examples often outperform 1,000 noisy ones.
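
Before uploading, it's worth validating that every record has the three keys and that extraction outputs actually parse as JSON. A minimal sketch using only the standard library (the file names are illustrative):

    import json

    REQUIRED_KEYS = {"instruction", "input", "output"}

    def validate_and_write(records, path="train.jsonl"):
        with open(path, "w", encoding="utf-8") as f:
            for i, rec in enumerate(records):
                missing = REQUIRED_KEYS - rec.keys()
                if missing:
                    raise ValueError(f"record {i} is missing {sorted(missing)}")
                json.loads(rec["output"])  # extraction outputs must parse as JSON
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")

    records = [json.loads(line) for line in open("raw_examples.jsonl", encoding="utf-8")]
    validate_and_write(records)

Catching a malformed output here is much cheaper than discovering it after a 40-minute training run.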

    Step 2: Upload and Configure

    Upload your JSONL dataset to Ertas and select Phi-4 14B as your base model. Recommended training configuration:

    • LoRA rank: 16 for classification/extraction, 32 for generation
    • Learning rate: 2e-4
    • Epochs: 3-4 (Phi-4 learns fast; more than 5 epochs risks overfitting)
    • Batch size: 4 (auto-adjusted based on available VRAM)
    • Max sequence length: 2048 (increase to 4096 if your inputs are long)

    Step 3: Train and Evaluate

    Click start. A typical 500-example training job on Phi-4 completes in 35-55 minutes depending on sequence length and LoRA rank. Ertas runs evaluation on a held-out validation set automatically and reports accuracy, loss curves, and sample outputs.

    Watch for overfitting: if validation loss starts increasing after epoch 2-3 while training loss keeps decreasing, reduce epochs. Phi-4 picks up patterns quickly.
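
If you're running the training loop yourself rather than in Ertas, the Hugging Face Trainer can automate that check with early stopping. A sketch, assuming the model from the QLoRA setup earlier and tokenized train_ds/val_ds splits:

    from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir="phi4-ft",
        learning_rate=2e-4,
        num_train_epochs=4,
        per_device_train_batch_size=4,
        eval_strategy="epoch",            # "evaluation_strategy" in older versions
        save_strategy="epoch",
        load_best_model_at_end=True,      # keep the checkpoint with lowest val loss
        metric_for_best_model="eval_loss",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        # stop after one epoch with no validation improvement
        callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
    )
    trainer.train()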

    Step 4: Export to GGUF

    Export your fine-tuned model as a GGUF file. For enterprise deployment, you'll typically want two versions:

    • Q5_K_M for production use where quality matters (10 GB)
    • Q4_K_M for development/testing or lower-VRAM deployment (8.5 GB)

    Ertas handles the merge (base model + LoRA adapter) and quantization automatically.
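
For reference, the manual equivalent of that step looks roughly like this. The adapter and output paths are illustrative, and the conversion script and llama-quantize binary come from the llama.cpp repository:

    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("microsoft/phi-4", torch_dtype="bfloat16")
    merged = PeftModel.from_pretrained(base, "phi4-ft/adapter")  # illustrative path
    merged = merged.merge_and_unload()   # fold the LoRA weights into the base model
    merged.save_pretrained("phi4-merged")

    # Then, from a llama.cpp checkout:
    #   python convert_hf_to_gguf.py phi4-merged --outfile phi4-f16.gguf
    #   ./llama-quantize phi4-f16.gguf phi4-Q5_K_M.gguf Q5_K_M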

    Step 5: Deploy

    Load the GGUF into Ollama, LM Studio, or llama.cpp on your inference server. For enterprise deployments, Ollama with a simple Docker container is the most maintainable setup:

    ollama create phi4-enterprise -f Modelfile
    ollama run phi4-enterprise
    

    Point your application at the Ollama API endpoint. Your fine-tuned Phi-4 is now serving requests locally with no API dependency.
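
A minimal client call against Ollama's default local endpoint looks like this. The prompt is a stand-in, and the format parameter asks Ollama to constrain the response to valid JSON:

    import json
    import urllib.request

    payload = {
        "model": "phi4-enterprise",
        "prompt": "Extract vendor_name, invoice_number, and total from this "
                  "invoice text. Return valid JSON.\n\nINVOICE #4892 ...",
        "stream": False,
        "format": "json",   # constrain output to valid JSON
    }

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",   # Ollama's default port
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])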

    Quantization Recommendations

    For enterprise Phi-4 deployments, here's how each quantization level performs on a structured extraction task (300 test examples):

| Quantization | Accuracy | JSON Validity | Tokens/sec (RTX 4090) | Model Size |
|---|---|---|---|---|
| FP16 | 93.2% | 97.0% | 45 t/s | 28 GB |
| Q8_0 | 93.0% | 97.0% | 62 t/s | 15 GB |
| Q5_K_M | 92.8% | 96.8% | 78 t/s | 10 GB |
| Q4_K_M | 92.1% | 96.2% | 89 t/s | 8.5 GB |
| Q4_0 | 91.4% | 95.5% | 94 t/s | 8 GB |

    Q5_K_M loses only 0.4% accuracy compared to FP16 while being 73% faster and 64% smaller. That's the default recommendation for any deployment where accuracy matters.

    Q4_K_M is acceptable for most production use cases — 92.1% vs 93.2% is a marginal difference, and you save another 1.5 GB of VRAM. If you're deploying on hardware with exactly 10-12 GB VRAM, Q4_K_M gives you more headroom for context.

    Avoid Q4_0 for enterprise tasks unless you're extremely memory-constrained. The 1.8% accuracy drop from FP16 starts to add up at scale.

    Phi-4 vs the Competition

    Here's a direct comparison for enterprise fine-tuning, all models trained on the same 500-example invoice extraction dataset:

| Metric | Phi-4 14B | Llama 3.1 8B | Qwen 2.5 7B | Qwen 2.5 14B |
|---|---|---|---|---|
| Field extraction accuracy | 93% | 86% | 85% | 91% |
| JSON schema compliance | 97% | 94% | 93% | 96% |
| Numerical accuracy | 98% | 89% | 87% | 93% |
| Inference speed (Q5_K_M) | 78 t/s | 112 t/s | 118 t/s | 74 t/s |
| VRAM at Q5_K_M | 10 GB | 5.5 GB | 5 GB | 10 GB |
| Training time (QLoRA) | 42 min | 22 min | 20 min | 40 min |

    Phi-4 wins on accuracy across the board, particularly on numerical tasks. The trade-off is speed and VRAM — it's roughly 2x the size of 7B models. Qwen 2.5 14B comes close on accuracy but Phi-4 still edges it out on math-heavy tasks.

If your enterprise tasks are primarily text-based (no math), Llama 3.1 8B at half the VRAM is a reasonable choice. If numbers, calculations, or structured data with numerical fields are involved, Phi-4 is worth the extra resources.

    Deployment Sizing

    For enterprise deployments handling different request volumes:

| Daily Requests | Recommended Setup | Monthly Cost (Cloud) |
|---|---|---|
| 1,000-5,000 | Single RTX 4070 Ti (12 GB) | $30-50/mo VPS |
| 5,000-20,000 | Single RTX 4090 (24 GB) | $80-120/mo VPS |
| 20,000-100,000 | 2x RTX 4090 with load balancing | $160-240/mo |
| 100,000+ | vLLM on A100 for batched inference | $400-800/mo |

    At every tier, this is a fraction of the equivalent API cost. 20,000 requests/day through GPT-4o costs roughly $2,100-7,200/month depending on task complexity. The same workload on fine-tuned Phi-4 costs $80-120/month.
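
To make that concrete with one illustrative assumption set: 20,000 requests/day is about 600,000 requests/month. At roughly 1,000 input and 300 output tokens per request, that's 600M input and 180M output tokens. At GPT-4o's list prices at the time of writing (about $2.50 per million input tokens and $10 per million output tokens), that works out to roughly $1,500 + $1,800 = $3,300/month, squarely within the range above and more than 25x the cost of the self-hosted setup.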


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
