Fine-Tune Phi-4 with Ertas
Microsoft's 14-billion parameter small language model that emphasizes reasoning quality through synthetic data training, achieving performance competitive with models several times its size on math and logic benchmarks.
Overview
Phi-4 is Microsoft's latest entry in the Phi small language model series, released in December 2024. With 14 billion parameters, Phi-4 was specifically designed to maximize reasoning capability relative to model size. Microsoft achieved this through a training methodology that heavily emphasizes synthetic data — carefully generated training examples that target specific reasoning patterns, mathematical problem-solving, and logical deduction.
The model demonstrates remarkable benchmark performance for its size class. On mathematical reasoning benchmarks like MATH and GSM8K, Phi-4 competes with models in the 70B+ parameter range and even approaches some frontier models. This makes it particularly valuable for applications where strong reasoning is required but computational resources are limited.
Phi-4 uses a dense transformer architecture with 40 layers, a hidden dimension of 5120, and 40 attention heads. It supports a context window of 16K tokens and uses the tiktoken tokenizer with a 100K vocabulary. The architecture includes standard modern features like RoPE positional embeddings and grouped-query attention.
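The headline architecture numbers above can be collected into a small summary sketch. The field names below are illustrative for this document only, not the official Hugging Face config keys:

```python
# Illustrative summary of Phi-4's architecture as described above.
# Key names are for this sketch only, not the official config schema.
PHI4_ARCH = {
    "num_layers": 40,
    "hidden_dim": 5120,
    "num_attention_heads": 40,
    "head_dim": 5120 // 40,          # 128 dims per head
    "context_window": 16_384,        # 16K tokens
    "vocab_size": 100_000,           # ~100K tiktoken vocabulary
    "positional_encoding": "rope",
    "attention": "grouped-query",
}
```

This is just a reference card for the figures quoted in the text; consult the model's published config for authoritative values.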
The model is released under the MIT license, making it one of the most permissively licensed high-quality models available. This has encouraged broad adoption in both research and commercial applications, particularly in domains requiring structured reasoning.
Key Features
Phi-4's standout feature is its reasoning capability, achieved through Microsoft's innovative synthetic data training pipeline. Rather than relying solely on web-scraped text, the training data includes millions of synthetically generated question-answer pairs, step-by-step mathematical proofs, logic puzzles, and code reasoning traces. This targeted training approach produces a model that reasons more reliably than models trained primarily on natural text.
The model demonstrates particularly strong performance on structured tasks: mathematical problem-solving, code generation with logical constraints, scientific reasoning, and formal logic. On the MATH benchmark, Phi-4 achieves scores that rival much larger proprietary models such as GPT-4 Turbo, despite being a small fraction of their presumed size; exact parameter counts for frontier models are not public, so precise ratios are unknowable.
Phi-4 also shows improved instruction-following compared to Phi-3, with better adherence to output format requirements, more consistent handling of multi-step instructions, and reduced tendency to hallucinate. The chat-tuned variant supports system prompts and multi-turn conversations effectively.
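Multi-turn prompts for the chat-tuned variant follow a ChatML-style template. The special tokens below (`<|im_start|>`, `<|im_sep|>`, `<|im_end|>`) match the template commonly reported for Phi-4, but they are an assumption here; verify against the model's own tokenizer config before relying on them:

```python
def build_phi4_prompt(messages):
    """Render a list of {"role", "content"} dicts into a Phi-4-style
    ChatML prompt. Token names are assumptions; check the tokenizer."""
    parts = []
    for msg in messages:
        parts.append(
            f"<|im_start|>{msg['role']}<|im_sep|>{msg['content']}<|im_end|>"
        )
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)

prompt = build_phi4_prompt([
    {"role": "system", "content": "You are a careful math tutor."},
    {"role": "user", "content": "What is 12 * 13?"},
])
```

In practice you would let the tokenizer's built-in chat template do this rendering; the sketch only shows the shape of a system-plus-user turn.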
Fine-Tuning with Ertas
Phi-4 is an excellent candidate for fine-tuning in Ertas Studio, particularly for applications requiring domain-specific reasoning. At 14B parameters, it sits in a sweet spot — large enough to capture complex patterns but small enough for efficient QLoRA training on a single 24GB GPU. With 4-bit quantization, fine-tuning requires approximately 10-14GB VRAM, achievable on an RTX 4090, RTX 3090, or A5000.
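A back-of-the-envelope check of the VRAM figure, assuming 4-bit quantized base weights plus a rough fixed allowance for the adapter, optimizer state, activations, and framework buffers (the overhead term is an assumption, not a measured value):

```python
def qlora_vram_estimate_gb(n_params: float,
                           bits_per_weight: float = 4.0,
                           overhead_gb: float = 4.0) -> float:
    """Rough QLoRA VRAM estimate: quantized base weights plus a fixed
    allowance for adapter weights, optimizer state, and activations."""
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# 14B parameters in 4-bit is ~7 GB of weights; with a few GB of
# training overhead this lands in the 10-14 GB range quoted above.
print(round(qlora_vram_estimate_gb(14e9), 1))
```

Actual usage depends on sequence length, batch size, and gradient checkpointing, so treat this as a sizing sanity check rather than a guarantee.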
In Ertas Studio, upload your reasoning-focused dataset (chain-of-thought examples work particularly well with Phi-4), select the model, and configure LoRA parameters. The model responds well to relatively low LoRA ranks (8-32) for reasoning tasks, keeping adapter sizes small and training fast. A typical fine-tuning run on 10,000 examples completes in 1-2 hours on a single GPU.
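To see why low ranks keep adapters small: each adapted weight matrix gains an A (d_in x r) and a B (r x d_out) factor, adding r * (d_in + d_out) parameters. Assuming adapters on the four attention projections of every layer (a common default; Ertas Studio's actual target modules may differ) and treating all projections as 5120 x 5120 for simplicity (with grouped-query attention the K/V projections are actually smaller):

```python
def lora_param_count(rank: int, d_model: int = 5120,
                     n_layers: int = 40, mats_per_layer: int = 4) -> int:
    """Parameters added by LoRA: each adapted d x d matrix gains an
    A (d x r) and a B (r x d) factor, i.e. 2 * r * d parameters."""
    per_matrix = 2 * rank * d_model
    return per_matrix * mats_per_layer * n_layers

for r in (8, 16, 32):
    print(f"rank {r}: ~{lora_param_count(r) / 1e6:.1f}M adapter params")
```

Even at rank 32 the adapter is on the order of 50M parameters, well under 1% of the 14B base model, which is why training is fast and the exported adapters stay small.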
After training, export to GGUF format. Phi-4's 14B size quantizes efficiently — at Q4_K_M, the resulting model is approximately 8.5GB, small enough to run on a laptop. This makes Phi-4 ideal for creating specialized reasoning models that can be deployed anywhere without cloud dependencies.
Use Cases
Phi-4 excels in applications requiring structured reasoning: mathematical tutoring systems, scientific analysis tools, code review and debugging assistants, and decision-support systems. Its strong performance on logic tasks makes it particularly suitable for rule-based processing, compliance checking, and structured data extraction.
The model is an excellent choice for educational technology applications, where step-by-step problem-solving explanations are valued. Fine-tuned Phi-4 can serve as a math tutor, a science explainer, or a programming instructor, providing detailed reasoning traces that help users understand the solution process.
For enterprise deployments, Phi-4 offers a compelling combination of strong reasoning with manageable resource requirements. It is well-suited for document analysis pipelines that require logical inference, automated report generation with data-driven conclusions, and quality assurance workflows that need to verify logical consistency.
Hardware Requirements
Phi-4 at Q4_K_M quantization requires approximately 8.5GB of RAM, making it comfortable to run on systems with 16GB RAM, most modern GPUs with 10GB+ VRAM, and Apple Silicon Macs with 16GB unified memory. At Q8_0, expect approximately 15GB, still manageable on a 24GB GPU or 32GB system.
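The memory figures follow directly from each scheme's average bits per weight. Q4_K_M averages roughly 4.85 bits because it mixes precisions across tensors, and Q8_0 works out to about 8.5 bits once per-block scales are included; both bit rates are approximate:

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate in-memory model size from parameter count and the
    quantization scheme's average bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate average bits/weight for common llama.cpp schemes.
for name, bpw in [("Q4_K_M", 4.85), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"{name}: ~{quantized_size_gb(14e9, bpw):.1f} GB")
```

These estimates cover weights only; the runtime also allocates a KV cache and scratch buffers, which is why a little headroom beyond the file size is needed.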
Full FP16 inference requires approximately 28GB VRAM, fitting on a single A6000 48GB or A100 40GB. Inference speed on consumer hardware is excellent — expect 30-50 tokens per second on an RTX 4090 at Q4_K_M and 10-20 tokens per second on an M2 Pro MacBook with 32GB RAM.
For fine-tuning with QLoRA in Ertas Studio, 12-16GB VRAM is sufficient (RTX 4070 Ti, RTX 4080, RTX 4090). Full LoRA fine-tuning requires approximately 20-24GB VRAM. The model's moderate size allows for rapid iteration during the fine-tuning process.