Fine-Tune Phi-3 with Ertas

    Microsoft's family of compact yet capable language models available in 3.8B, 7B, and 14B sizes, designed for on-device and edge deployment with surprisingly strong performance on reasoning and instruction-following tasks.


    Overview

    Phi-3 is Microsoft's third-generation small language model family, released in April 2024. The lineup includes Phi-3 Mini (3.8B), Phi-3 Small (7B), and Phi-3 Medium (14B). The Phi series pioneered the concept that carefully curated training data can compensate for smaller model sizes, and Phi-3 pushes this philosophy further with a training mixture that combines filtered web data with extensive synthetic datasets generated by larger models.

    Phi-3 Mini, the flagship of the family at just 3.8B parameters, delivers performance comparable to models like Mixtral 8x7B and GPT-3.5 on many benchmarks, despite being over 10x smaller. This makes it one of the most efficient models ever released in terms of quality per parameter. The model supports a 128K context window through the LongRoPE extension, enabling long-document processing even on devices with limited compute.

    Phi-3 Small (7B) and Phi-3 Medium (14B) further improve quality while remaining efficient. Phi-3 Small uses a novel blocksparse attention mechanism that reduces memory usage during long-context inference. Phi-3 Medium matches or exceeds models such as Llama 3 8B and Mistral 7B on many benchmarks while keeping inference costs competitive for its size.

    All Phi-3 models are released under the MIT license and are available in both base and instruction-tuned variants. Microsoft also provides ONNX-optimized versions for deployment on mobile devices and browsers, and has demonstrated Phi-3 Mini running efficiently on smartphones and Raspberry Pi devices.

    Key Features

    The Phi-3 family's most distinctive feature is its training data methodology. Microsoft employs a multi-stage training pipeline that begins with web data filtered through a classifier trained to identify educational and high-quality content, then augments this with millions of synthetically generated textbook-style passages, reasoning chains, and code examples. This data quality focus enables small models to learn more effectively from each training token.
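    Microsoft has not released its filtering classifier, but the core idea — score every document with a quality classifier and keep only high scorers — can be sketched in a few lines. The `educational_score` heuristic below is a toy stand-in for the trained classifier, and the threshold is assumed:

```python
# Sketch of classifier-based data filtering. The real pipeline uses a trained
# classifier; this toy scorer just rewards explanation-like passages.

def educational_score(text: str) -> float:
    """Hypothetical stand-in for a quality classifier, returning 0..1."""
    markers = ("because", "therefore", "for example", "step")
    hits = sum(text.lower().count(m) for m in markers)
    return min(1.0, hits / 3 + min(len(text), 500) / 1000)

def filter_corpus(docs, threshold=0.6):
    """Keep only documents the scorer rates at or above the threshold."""
    return [d for d in docs if educational_score(d) >= threshold]

docs = [
    "buy now!!! click here",
    "Gradient descent works because each step moves parameters downhill; "
    "for example, with a small learning rate the loss decreases step by step.",
]
kept = filter_corpus(docs)  # only the explanatory passage survives
```

    In the real pipeline this scoring pass runs over web-scale data before the synthetic textbook-style generation stage is mixed in.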

    Phi-3 Mini supports context windows up to 128K tokens through LongRoPE, a positional encoding extension that enables efficient processing of long sequences without significant quality degradation. This is remarkable for a 3.8B model and enables use cases typically reserved for much larger models, such as analyzing entire documents or maintaining very long conversation histories.

    All models in the family support ONNX Runtime deployment, enabling hardware-accelerated inference on a wide range of devices including mobile phones (via ONNX Runtime Mobile), web browsers (via WebAssembly/WebGPU), and edge devices. This makes Phi-3 uniquely suited for on-device AI applications where cloud connectivity is unreliable or data privacy requirements prohibit cloud processing.

    Fine-Tuning with Ertas

    Phi-3 models are among the most accessible for fine-tuning in Ertas Studio due to their small sizes. Phi-3 Mini (3.8B) can be fine-tuned with QLoRA using as little as 4-6GB VRAM — this runs on virtually any modern GPU, including the RTX 3060 6GB, GTX 1660 Ti 6GB, or even integrated GPU systems with sufficient shared memory. Training is fast, with typical runs completing in under an hour for datasets of 10,000 examples.

    Phi-3 Medium (14B) requires approximately 10-14GB VRAM for QLoRA training, well within the capability of consumer GPUs like the RTX 4070 12GB or RTX 4080 16GB. The instruction-tuned variants respond well to domain adaptation, making them excellent starting points for specialized assistants.
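    The VRAM figures above can be sanity-checked with back-of-envelope arithmetic: 4-bit base weights at roughly 0.5 bytes per parameter, plus fp16 LoRA adapters with fp32 Adam moments, plus a few GB for activations. The adapter fraction and overhead constant below are assumptions, not Ertas Studio internals:

```python
# Rough QLoRA memory estimate. Gradients and activations are folded into a
# flat overhead term, so this is a rule of thumb, not an exact accounting.

def qlora_vram_gb(params_billion: float,
                  lora_frac: float = 0.01,
                  activation_overhead_gb: float = 2.0) -> float:
    params = params_billion * 1e9
    base_weights = params * 0.5              # NF4 base: ~0.5 bytes/param
    lora_params = params * lora_frac         # trainable adapter params (~1%)
    adapters = lora_params * (2 + 4 + 4)     # fp16 weights + fp32 Adam moments
    return (base_weights + adapters) / 1e9 + activation_overhead_gb

est_mini = qlora_vram_gb(3.8)    # lands inside the 4-6GB range quoted above
est_medium = qlora_vram_gb(14)   # lands inside the 10-14GB range
```

    Longer sequence lengths and larger batch sizes push the activation term up, which is why the quoted ranges span a couple of gigabytes.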

    Ertas Studio's export pipeline generates GGUF files that can be deployed through Ollama or llama.cpp. The small model sizes mean the resulting GGUF files are highly portable — a Q4_K_M quantized Phi-3 Mini is only about 2.3GB, small enough to distribute as part of a desktop application or embed in an edge computing pipeline. This makes Phi-3 ideal for creating custom, specialized models that run entirely offline.
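    GGUF file sizes follow directly from parameter count times effective bits per weight. The effective-bit figures below are assumed averages for the mixed quantization schemes, not exact spec values, but they reproduce the sizes quoted in this guide:

```python
# Approximate GGUF file size: parameters x effective bits per weight / 8.
# Effective bits are assumed averages (K-quants mix bit widths per tensor).

EFFECTIVE_BITS = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * EFFECTIVE_BITS[quant] / 8 / 1e9

size_mini = gguf_size_gb(3.8, "Q4_K_M")   # close to the ~2.3GB quoted above
```

    The same arithmetic shows why Q8_0 roughly doubles the footprint relative to Q4_K_M while staying far below FP16.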

    Use Cases

    Phi-3 Mini is the premier choice for on-device AI applications. Its 3.8B parameter size enables deployment on smartphones, tablets, embedded systems, and IoT devices where larger models simply cannot fit. Use cases include offline conversational assistants, on-device document summarization, privacy-preserving text analysis, and real-time language processing in environments without internet connectivity.

    The model family excels at structured tasks in resource-constrained settings: form processing, data extraction, classification, and simple code generation. For applications like customer support automation, FAQ answering, and content moderation, fine-tuned Phi-3 models offer an outstanding cost-to-quality ratio.
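    Fine-tuned Phi-3 instruct variants expect the Phi-3 chat format with `<|user|>`, `<|assistant|>`, and `<|end|>` markers. The exact template ships with the tokenizer; this minimal builder assumes the commonly documented single-turn layout:

```python
# Minimal Phi-3 chat prompt builder (sketch; the authoritative template is
# the one bundled with the model's tokenizer).

def phi3_prompt(user_msg, system=None):
    parts = []
    if system:  # system turn is optional and assumed supported here
        parts.append("<|system|>\n%s<|end|>" % system)
    parts.append("<|user|>\n%s<|end|>" % user_msg)
    parts.append("<|assistant|>")
    return "\n".join(parts)

prompt = phi3_prompt("Classify this ticket: 'My invoice is wrong.'")
```

    Matching the training-time template at inference is essential for fine-tuned structured-task models; a mismatched format measurably degrades output quality.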

    Phi-3 is also valuable as a component in larger systems. It can serve as a fast draft model in speculative decoding pipelines, a lightweight classifier or router that directs queries to appropriate specialized models, or a preprocessing step that extracts structured information before passing to more capable models for complex reasoning.
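    The router pattern is simple to picture: a lightweight model labels each query, and the label selects a backend. The keyword heuristic below stands in for a fine-tuned Phi-3 classifier, and the route names are illustrative, not Ertas APIs:

```python
# Toy query router. In production, replace the keyword heuristic with a
# fine-tuned Phi-3 Mini classifier returning one of these labels.

def route(query: str) -> str:
    q = query.lower()
    if any(k in q for k in ("def ", "error", "stack trace", "compile")):
        return "code-model"       # hand off to a code-specialized model
    if any(k in q for k in ("refund", "order", "invoice")):
        return "support-model"    # hand off to a support-tuned model
    return "general-model"        # default backend

label = route("Why does this compile error happen?")
```

    Because the router runs on every request, the quality-per-parameter efficiency of Phi-3 Mini directly reduces end-to-end latency and cost.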

    Hardware Requirements

    Phi-3 Mini (3.8B) at Q4_K_M quantization requires approximately 2.3GB of RAM. This is small enough to run on virtually any modern device: smartphones with 4GB+ RAM, Raspberry Pi 5 (8GB), older laptops, and even some browser-based deployments via WebAssembly. At Q8_0, the requirement is approximately 4.1GB, still remarkably portable.

    Phi-3 Small (7B) at Q4_K_M needs approximately 4.3GB, and Phi-3 Medium (14B) requires approximately 8.4GB — both comfortable on consumer hardware with 16GB RAM or GPUs with 8GB+ VRAM. Full FP16 inference for Medium requires approximately 28GB VRAM.
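    These figures make model selection on a given device mechanical: pick the largest variant whose quantized weights fit with some headroom for the OS and KV cache. A small helper using the approximate sizes quoted above (the headroom constant is an assumption):

```python
# Pick the largest Phi-3 option that fits in available RAM, using the
# approximate GGUF sizes (GB) quoted in this section. Illustrative only.

SIZES_GB = {
    ("mini", "Q4_K_M"): 2.3, ("mini", "Q8_0"): 4.1,
    ("small", "Q4_K_M"): 4.3, ("medium", "Q4_K_M"): 8.4,
}

def best_fit(ram_gb, headroom_gb=1.5):
    """Return the (model, quant) with the largest footprint that still
    leaves `headroom_gb` free, or None if nothing fits."""
    fits = [(size, key) for key, size in SIZES_GB.items()
            if size + headroom_gb <= ram_gb]
    return max(fits)[1] if fits else None

choice = best_fit(8)   # e.g. an 8GB Raspberry Pi 5
```

    Raising the headroom is advisable for long-context workloads, since the KV cache grows with sequence length on top of the weight footprint.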

    For fine-tuning in Ertas Studio, Phi-3 Mini requires just 4-6GB VRAM with QLoRA, Phi-3 Small needs 6-10GB, and Phi-3 Medium requires 10-14GB. These low requirements make the entire Phi-3 family accessible to individual developers and small teams without specialized hardware.

    Supported Quantizations

    Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16
