
Quantization for Mobile: Q4, Q5, and Q8 Across Real Devices
A practical guide to GGUF quantization levels for mobile deployment. How Q4, Q5, and Q8 affect model size, speed, quality, and memory usage on iPhones and Android devices.
Quantization is what makes LLMs fit on phones. A 3B parameter model at full precision (FP16) takes 6GB. Quantized to Q4, it takes 1.7GB. The model is roughly the same, just stored more efficiently.
But "roughly the same" hides important nuances. Different quantization levels trade size and speed against quality. Choosing the right level for your mobile app requires understanding these trade-offs on real hardware.
What Quantization Does
LLM weights are stored as floating-point numbers, 16 bits each at full precision (FP16). Quantization reduces the bits per weight, shrinking the model:
| Format | Bits per Weight | 3B Model Size | Relative Quality |
|---|---|---|---|
| FP16 | 16 | ~6GB | 100% (baseline) |
| Q8_0 | 8 | ~3.2GB | ~99.5% |
| Q6_K | 6 | ~2.5GB | ~99% |
| Q5_K_M | 5 | ~2.1GB | ~98.5% |
| Q4_K_M | 4 | ~1.7GB | ~97.5% |
| Q4_0 | 4 | ~1.6GB | ~96% |
| Q3_K_M | 3 | ~1.4GB | ~93% |
| Q2_K | 2 | ~1.1GB | ~85% |
The "_K_M" suffix indicates k-quant with medium quality, a quantization method in llama.cpp that applies different bit widths to different layers based on their importance. This preserves quality better than uniform quantization.
The Sweet Spot: Q4_K_M
For mobile deployment, Q4_K_M is the standard choice. Here is why:
Size: Roughly 3.5x smaller than FP16. A 3B model drops from ~6GB to 1.7GB, well within the storage budget for a mobile app.
Quality: At ~97.5% of FP16 quality, the degradation is barely measurable on most tasks. For domain-specific tasks where the model is fine-tuned, the fine-tuning data compensates for any quantization loss.
Speed: Token generation on mobile is typically memory-bandwidth bound: each token requires streaming the full weight set through memory. Smaller weights mean less data moved per token, so Q4 models often run faster than higher-precision versions (a back-of-the-envelope estimate follows below).
Compatibility: Fits on 4GB+ RAM devices (1B) or 6GB+ RAM devices (3B). Covers the vast majority of active smartphones.
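The speed point is worth quantifying. If every generated token streams the full weight set once, decode speed is capped at memory bandwidth divided by model size. A minimal sketch; the 50 GB/s sustained-bandwidth figure is a hypothetical round number for illustration, not a measured spec for any device:

```python
def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound workload:
    each token generation pass reads the full weight set once."""
    return bandwidth_gb_s / model_size_gb

# Assuming ~50 GB/s of sustained bandwidth (hypothetical mobile figure):
print(f"{max_tokens_per_sec(1.7, 50):.0f} tok/s")  # ~29 ceiling, 3B Q4_K_M
print(f"{max_tokens_per_sec(3.2, 50):.0f} tok/s")  # ~16 ceiling, 3B Q8_0
```

Those ceilings roughly bracket the measured ranges in the benchmark tables below.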
Quantization Levels Compared on Real Devices
3B Model, iPhone 15 Pro (A17 Pro, 8GB RAM)
| Quantization | Size | Tokens/sec | Memory Usage | Quality (perplexity) |
|---|---|---|---|---|
| Q8_0 | 3.2GB | 14-18 | 4.0GB | 8.12 |
| Q6_K | 2.5GB | 16-20 | 3.2GB | 8.18 |
| Q5_K_M | 2.1GB | 17-22 | 2.7GB | 8.25 |
| Q4_K_M | 1.7GB | 18-25 | 2.2GB | 8.38 |
| Q4_0 | 1.6GB | 19-26 | 2.1GB | 8.52 |
| Q3_K_M | 1.4GB | 20-27 | 1.9GB | 9.05 |
| Q2_K | 1.1GB | 22-29 | 1.5GB | 11.4 |
Lower perplexity is better. Q4_K_M to Q8_0 spans a narrow quality range (8.12-8.38). Below Q4, quality drops noticeably. Below Q3, it degrades significantly.
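Perplexity is the exponential of the average negative log-likelihood per token, so it reads as the model's effective "how many plausible next tokens" uncertainty, and small deltas compound over long generations. A minimal sketch of the computation, assuming your runtime already exposes per-token log-probabilities (natural log); the values below are made up for illustration:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood). Lower is better."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example with hypothetical per-token log-probs:
print(perplexity([-2.1, -1.9, -2.2, -2.0]))  # ~7.8
```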
3B Model, Galaxy S24 (Snapdragon 8 Gen 3, 8GB RAM)
| Quantization | Size | Tokens/sec (Vulkan) | Memory Usage |
|---|---|---|---|
| Q8_0 | 3.2GB | 16-20 | 4.0GB |
| Q5_K_M | 2.1GB | 20-25 | 2.7GB |
| Q4_K_M | 1.7GB | 22-28 | 2.2GB |
| Q3_K_M | 1.4GB | 24-30 | 1.9GB |
1B Model, iPhone 14 (A15, 6GB RAM)
| Quantization | Size | Tokens/sec | Memory Usage |
|---|---|---|---|
| Q8_0 | 1.1GB | 20-26 | 1.4GB |
| Q5_K_M | 750MB | 23-30 | 1.0GB |
| Q4_K_M | 600MB | 25-32 | 800MB |
| Q3_K_M | 500MB | 27-34 | 700MB |
When to Use Each Level
Q4_K_M (Default Choice)
Use this unless you have a specific reason not to. Best overall balance of size, speed, and quality. The standard for mobile deployment.
- Ideal for: All mobile apps shipping GGUF models
- Device requirement: 4GB+ (1B) or 6GB+ (3B)
Q5_K_M (Quality Priority)
Use when quality is critical and your target devices have room for the larger file. The quality improvement over Q4_K_M is small but measurable on tasks requiring precise wording.
- Ideal for: Medical, legal, or financial apps where word-level accuracy matters
- Device requirement: 6GB+ (1B) or 8GB+ (3B)
- Size increase: ~25% over Q4_K_M
Q8_0 (Maximum Quality)
Use only on flagship devices with ample RAM. Quality is nearly identical to FP16, but generation is slower and memory usage is nearly double that of Q4_K_M.
- Ideal for: Testing and benchmarking, not typical production deployment
- Device requirement: 8GB+ (1B) or 12GB+ (3B)
- Size increase: ~90% over Q4_K_M
Q3_K_M (Size Priority)
Use when you need to squeeze a model onto constrained devices. Quality drops noticeably, but for simple tasks (classification, yes/no, short answers), it may be acceptable.
- Ideal for: Budget devices, or fitting a 3B model where only 1B Q4 would normally fit
- Quality risk: Noticeable degradation on nuanced tasks
- Size savings: ~18% over Q4_K_M
Q2_K (Not Recommended)
Severe quality degradation. The model frequently generates incoherent or incorrect output. Not suitable for production.
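These guidelines collapse into a small decision rule. A sketch, with RAM thresholds taken from the device-requirement lines above; the helper name and the quality_critical flag are illustrative, not a standard API:

```python
def pick_quant(model_params_b: float, device_ram_gb: int,
               quality_critical: bool = False) -> str:
    """Map model size and device RAM to a quantization level, following
    the guidelines above. Returns a llama.cpp quant type name."""
    if model_params_b >= 3:
        if quality_critical and device_ram_gb >= 8:
            return "Q5_K_M"
        if device_ram_gb >= 6:
            return "Q4_K_M"
        return "Q3_K_M"  # squeezing a 3B model onto constrained devices
    # 1B-class models
    if quality_critical and device_ram_gb >= 6:
        return "Q5_K_M"
    return "Q4_K_M"

print(pick_quant(3, 8))                         # Q4_K_M
print(pick_quant(3, 8, quality_critical=True))  # Q5_K_M
print(pick_quant(3, 4))                         # Q3_K_M
```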
Fine-Tuning and Quantization
Fine-tuning happens at full precision (FP16 or BF16). Quantization is applied after training as an export step. The workflow:
1. Fine-tune the model with LoRA on your training data (full precision)
2. Merge the LoRA adapter into the base model
3. Quantize to GGUF at your target level (Q4_K_M for most cases)
4. Test the quantized model on your evaluation set
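A minimal sketch of steps 2 and 3 using peft and a local llama.cpp checkout. The model ID and adapter path are placeholders; the llama.cpp script and binary names (`convert_hf_to_gguf.py`, `llama-quantize`) match recent versions of the repo but may differ in older checkouts:

```python
import subprocess
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.2-3B"  # placeholder base model ID

# Step 2: merge the LoRA adapter into the base weights (full precision).
base = AutoModelForCausalLM.from_pretrained(BASE)
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(BASE).save_pretrained("merged-model")

# Step 3: convert the merged model to GGUF, then quantize to Q4_K_M.
subprocess.run(["python", "llama.cpp/convert_hf_to_gguf.py", "merged-model",
                "--outfile", "model-f16.gguf", "--outtype", "f16"], check=True)
subprocess.run(["llama.cpp/llama-quantize",
                "model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"], check=True)
```

Step 4 is then running your evaluation set against model-q4_k_m.gguf in the runtime you actually ship.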
Platforms like Ertas handle this pipeline end-to-end. You upload training data, select your base model, train, and export directly to GGUF at your chosen quantization level.
Does Quantization Hurt Fine-Tuned Models More?
Slightly. Fine-tuned models have more information packed into their weights (domain-specific knowledge). Aggressive quantization (Q3 and below) can lose some of this specific knowledge.
In practice, Q4_K_M preserves fine-tuned accuracy well. The quality difference between Q4_K_M and FP16 on domain-specific benchmarks is typically under 2 percentage points. Q3_K_M can show 3-5 points of degradation.
Practical Recommendations
- Default to Q4_K_M. It is the standard for a reason. Test with this first.
- Fine-tune before optimizing quantization. A fine-tuned Q4 model outperforms an untuned Q8 base model on domain tasks.
- Test on your task, not general benchmarks. Perplexity is a proxy. What matters is your model's accuracy on your specific evaluation set at your chosen quantization level.
- Offer one quantization level per model size. Do not make users choose between Q4 and Q5. Ship Q4_K_M for broad compatibility. If you support both 1B and 3B, that is enough optionality.
- Measure real-world performance. Run your evaluation suite on the quantized model. If accuracy is within 2% of FP16, ship it.
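The last two recommendations amount to an acceptance gate: score the FP16 and quantized models on the same evaluation set, and ship only if the drop is within budget. A minimal sketch, with hypothetical scores; the function name and threshold default are illustrative:

```python
def quantization_acceptable(fp16_accuracy: float, quant_accuracy: float,
                            max_drop: float = 0.02) -> bool:
    """Ship gate: accept the quantized model if accuracy is within
    max_drop (default 2 percentage points) of the FP16 baseline."""
    return (fp16_accuracy - quant_accuracy) <= max_drop

# Hypothetical scores from a domain evaluation set:
print(quantization_acceptable(0.91, 0.896))  # True: 1.4 pt drop, ship it
print(quantization_acceptable(0.91, 0.87))   # False: 4 pt drop, try Q5_K_M
```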
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.