
Quantization for Mobile: Q4, Q5, and Q8 Across Real Devices
A practical guide to GGUF quantization levels for mobile deployment. How Q4, Q5, and Q8 affect model size, speed, quality, and memory usage on iPhones and Android devices.
Quantization is what makes LLMs fit on phones. A 3B parameter model at full precision (FP16) takes 6GB. Quantized to Q4, it takes 1.7GB. The model is roughly the same, just stored more efficiently.
But "roughly the same" hides important nuances. Different quantization levels trade size and speed against quality. Choosing the right level for your mobile app requires understanding these trade-offs on real hardware.
What Quantization Does
LLM weights are stored as floating-point numbers, 16 bits each at full precision (FP16). Quantization reduces the bits per weight, shrinking the model:
| Format | Bits per Weight | 3B Model Size | Relative Quality |
|---|---|---|---|
| FP16 | 16 | ~6GB | 100% (baseline) |
| Q8_0 | 8 | ~3.2GB | ~99.5% |
| Q6_K | 6 | ~2.5GB | ~99% |
| Q5_K_M | 5 | ~2.1GB | ~98.5% |
| Q4_K_M | 4 | ~1.7GB | ~97.5% |
| Q4_0 | 4 | ~1.6GB | ~96% |
| Q3_K_M | 3 | ~1.4GB | ~93% |
| Q2_K | 2 | ~1.1GB | ~85% |
The "_K_M" suffix indicates k-quant with medium quality, a quantization method in llama.cpp that applies different bit widths to different layers based on their importance. This preserves quality better than uniform quantization.
The Sweet Spot: Q4_K_M
For mobile deployment, Q4_K_M is the standard choice. Here is why:
Size: Roughly 3.5x smaller than FP16. A 3B model drops from ~6GB to 1.7GB, well within the storage budget for a mobile app.
Quality: At ~97.5% of FP16 quality, the degradation is barely measurable on most tasks. For domain-specific tasks where the model is fine-tuned, the fine-tuning data compensates for any quantization loss.
Speed: Token generation on mobile is typically memory-bandwidth bound: each token requires streaming the full weight set through memory. Smaller weights mean less data moved per token, so Q4 models often run faster than higher-precision versions (a back-of-the-envelope estimate follows below).
Compatibility: Fits on 4GB+ RAM devices (1B) or 6GB+ RAM devices (3B). Covers the vast majority of active smartphones.
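The speed point is worth quantifying. If every generated token streams the full weight set once, decode speed is capped at memory bandwidth divided by model size. A minimal sketch; the 50 GB/s sustained-bandwidth figure is a hypothetical round number for illustration, not a measured spec for any device:

```python
def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound workload:
    each token generation pass reads the full weight set once."""
    return bandwidth_gb_s / model_size_gb

# Assuming ~50 GB/s of sustained bandwidth (hypothetical mobile figure):
print(f"{max_tokens_per_sec(1.7, 50):.0f} tok/s")  # ~29 ceiling, 3B Q4_K_M
print(f"{max_tokens_per_sec(3.2, 50):.0f} tok/s")  # ~16 ceiling, 3B Q8_0
```

Those ceilings roughly bracket the measured ranges in the benchmark tables below.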
Quantization Levels Compared on Real Devices
3B Model, iPhone 15 Pro (A17 Pro, 8GB RAM)
| Quantization | Size | Tokens/sec | Memory Usage | Quality (perplexity) |
|---|---|---|---|---|
| Q8_0 | 3.2GB | 14-18 | 4.0GB | 8.12 |
| Q6_K | 2.5GB | 16-20 | 3.2GB | 8.18 |
| Q5_K_M | 2.1GB | 17-22 | 2.7GB | 8.25 |
| Q4_K_M | 1.7GB | 18-25 | 2.2GB | 8.38 |
| Q4_0 | 1.6GB | 19-26 | 2.1GB | 8.52 |
| Q3_K_M | 1.4GB | 20-27 | 1.9GB | 9.05 |
| Q2_K | 1.1GB | 22-29 | 1.5GB | 11.4 |
Lower perplexity is better. Q4_K_M to Q8_0 spans a narrow quality range (8.12-8.38). Below Q4, quality drops noticeably. Below Q3, it degrades significantly.
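Perplexity is the exponential of the average negative log-likelihood per token, so it reads as the model's effective "how many plausible next tokens" uncertainty, and small deltas compound over long generations. A minimal sketch of the computation, assuming your runtime already exposes per-token log-probabilities (natural log); the values below are made up for illustration:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood). Lower is better."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example with hypothetical per-token log-probs:
print(perplexity([-2.1, -1.9, -2.2, -2.0]))  # ~7.8
```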
3B Model, Galaxy S24 (Snapdragon 8 Gen 3, 8GB RAM)
| Quantization | Size | Tokens/sec (Vulkan) | Memory Usage |
|---|---|---|---|
| Q8_0 | 3.2GB | 16-20 | 4.0GB |
| Q5_K_M | 2.1GB | 20-25 | 2.7GB |
| Q4_K_M | 1.7GB | 22-28 | 2.2GB |
| Q3_K_M | 1.4GB | 24-30 | 1.9GB |
1B Model, iPhone 14 (A15, 6GB RAM)
| Quantization | Size | Tokens/sec | Memory Usage |
|---|---|---|---|
| Q8_0 | 1.1GB | 20-26 | 1.4GB |
| Q5_K_M | 750MB | 23-30 | 1.0GB |
| Q4_K_M | 600MB | 25-32 | 800MB |
| Q3_K_M | 500MB | 27-34 | 700MB |
When to Use Each Level
Q4_K_M (Default Choice)
Use this unless you have a specific reason not to. Best overall balance of size, speed, and quality. The standard for mobile deployment.
- Ideal for: All mobile apps shipping GGUF models
- Device requirement: 4GB+ (1B) or 6GB+ (3B)
Q5_K_M (Quality Priority)
Use when quality is critical and your target devices have room for the larger file. The quality improvement over Q4_K_M is small but measurable on tasks requiring precise wording.
- Ideal for: Medical, legal, or financial apps where word-level accuracy matters
- Device requirement: 6GB+ (1B) or 8GB+ (3B)
- Size increase: ~25% over Q4_K_M
Q8_0 (Maximum Quality)
Use only on flagship devices with ample RAM. Quality is nearly identical to FP16, but generation is slower and memory usage is nearly double that of Q4_K_M.
- Ideal for: Testing and benchmarking, not typical production deployment
- Device requirement: 8GB+ (1B) or 12GB+ (3B)
- Size increase: ~90% over Q4_K_M
Q3_K_M (Size Priority)
Use when you need to squeeze a model onto constrained devices. Quality drops noticeably, but for simple tasks (classification, yes/no, short answers), it may be acceptable.
- Ideal for: Budget devices, or fitting a 3B model where only 1B Q4 would normally fit
- Quality risk: Noticeable degradation on nuanced tasks
- Size savings: ~18% over Q4_K_M
Q2_K (Not Recommended)
Severe quality degradation. The model frequently generates incoherent or incorrect output. Not suitable for production.
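These guidelines collapse into a small decision rule. A sketch, with RAM thresholds taken from the device-requirement lines above; the helper name and the quality_critical flag are illustrative, not a standard API:

```python
def pick_quant(model_params_b: float, device_ram_gb: int,
               quality_critical: bool = False) -> str:
    """Map model size and device RAM to a quantization level, following
    the guidelines above. Returns a llama.cpp quant type name."""
    if model_params_b >= 3:
        if quality_critical and device_ram_gb >= 8:
            return "Q5_K_M"
        if device_ram_gb >= 6:
            return "Q4_K_M"
        return "Q3_K_M"  # squeezing a 3B model onto constrained devices
    # 1B-class models
    if quality_critical and device_ram_gb >= 6:
        return "Q5_K_M"
    return "Q4_K_M"

print(pick_quant(3, 8))                         # Q4_K_M
print(pick_quant(3, 8, quality_critical=True))  # Q5_K_M
print(pick_quant(3, 4))                         # Q3_K_M
```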
Fine-Tuning and Quantization
Fine-tuning happens at full precision (FP16 or BF16). Quantization is applied after training as an export step. The workflow:
1. Fine-tune the model with LoRA on your training data (full precision)
2. Merge the LoRA adapter into the base model
3. Quantize to GGUF at your target level (Q4_K_M for most cases)
4. Test the quantized model on your evaluation set
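A minimal sketch of steps 2 and 3 using peft and a local llama.cpp checkout. The model ID and adapter path are placeholders; the llama.cpp script and binary names (`convert_hf_to_gguf.py`, `llama-quantize`) match recent versions of the repo but may differ in older checkouts:

```python
import subprocess
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.2-3B"  # placeholder base model ID

# Step 2: merge the LoRA adapter into the base weights (full precision).
base = AutoModelForCausalLM.from_pretrained(BASE)
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(BASE).save_pretrained("merged-model")

# Step 3: convert the merged model to GGUF, then quantize to Q4_K_M.
subprocess.run(["python", "llama.cpp/convert_hf_to_gguf.py", "merged-model",
                "--outfile", "model-f16.gguf", "--outtype", "f16"], check=True)
subprocess.run(["llama.cpp/llama-quantize",
                "model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"], check=True)
```

Step 4 is then running your evaluation set against model-q4_k_m.gguf in the runtime you actually ship.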
Platforms like Ertas handle this pipeline end-to-end. You upload training data, select your base model, train, and export directly to GGUF at your chosen quantization level.
Does Quantization Hurt Fine-Tuned Models More?
Slightly. Fine-tuned models have more information packed into their weights (domain-specific knowledge). Aggressive quantization (Q3 and below) can lose some of this specific knowledge.
In practice, Q4_K_M preserves fine-tuned accuracy well. The quality difference between Q4_K_M and FP16 on domain-specific benchmarks is typically under 2 percentage points. Q3_K_M can show 3-5 points of degradation.
Practical Recommendations
- Default to Q4_K_M. It is the standard for a reason. Test with this first.
- Fine-tune before optimizing quantization. A fine-tuned Q4 model outperforms an untuned Q8 base model on domain tasks.
- Test on your task, not general benchmarks. Perplexity is a proxy. What matters is your model's accuracy on your specific evaluation set at your chosen quantization level.
- Offer one quantization level per model size. Do not make users choose between Q4 and Q5. Ship Q4_K_M for broad compatibility. If you support both 1B and 3B, that is enough optionality.
- Measure real-world performance. Run your evaluation suite on the quantized model. If accuracy is within 2% of FP16, ship it.
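The last two recommendations amount to an acceptance gate: score the FP16 and quantized models on the same evaluation set, and ship only if the drop is within budget. A minimal sketch, with hypothetical scores; the function name and threshold default are illustrative:

```python
def quantization_acceptable(fp16_accuracy: float, quant_accuracy: float,
                            max_drop: float = 0.02) -> bool:
    """Ship gate: accept the quantized model if accuracy is within
    max_drop (default 2 percentage points) of the FP16 baseline."""
    return (fp16_accuracy - quant_accuracy) <= max_drop

# Hypothetical scores from a domain evaluation set:
print(quantization_acceptable(0.91, 0.896))  # True: 1.4 pt drop, ship it
print(quantization_acceptable(0.91, 0.87))   # False: 4 pt drop, try Q5_K_M
```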
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.