
Quantization Levels Explained: Q4 vs Q5 vs Q8 and When Each Matters
A practical guide to choosing GGUF quantization levels for local AI deployment. Covers Q4_K_M, Q5_K_M, Q8_0, and how hardware constraints, fine-tuning, and use case requirements determine the right quantization for your model.
You've fine-tuned a model. You've exported it as GGUF. Now you need to pick a quantization level — and the naming conventions look like a password generator threw up: Q4_K_M, Q5_K_S, Q6_K, Q8_0, IQ3_XS.
This guide cuts through the confusion. Here's how to pick the right quantization level for your hardware, your use case, and your quality requirements.
What Quantization Does (30-Second Version)
Neural network weights are numbers. Full-precision weights use 16 or 32 bits per number. Quantization reduces that to 8, 5, 4, 3, or even 2 bits.
Fewer bits = smaller file = less memory = faster inference.
The trade-off: lower precision means some accuracy loss. The question is how much loss, and whether it matters for your task.
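The arithmetic behind "fewer bits = smaller file" is direct: file size is roughly parameters times bits per weight. A quick sketch (real GGUF files run slightly larger, because embedding and output tensors are kept at higher precision):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage: params * bits, converted to gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

params = 8e9  # an 8B-parameter model

print(model_size_gb(params, 16.0))  # FP16 baseline: 16.0 GB
print(model_size_gb(params, 4.83))  # Q4_K_M effective bits: ~4.8 GB
```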
For a deeper dive on the GGUF format itself, see our GGUF format explainer. This guide focuses specifically on choosing between quantization levels.
The Quantization Ladder
Here's every common GGUF quantization level for an 8B-parameter model, sorted from smallest to largest:
| Quantization | Bits | File Size (8B model) | Quality | Best For |
|---|---|---|---|---|
| IQ2_XXS | 2.06 | ~2.5 GB | Poor | Extreme memory constraints only |
| IQ3_XS | 3.05 | ~3.3 GB | Fair | Mobile / IoT with tight limits |
| Q3_K_M | 3.44 | ~3.6 GB | Fair | Budget edge devices |
| Q4_K_M | 4.83 | ~4.9 GB | Good | Default for most deployments |
| Q4_K_S | 4.58 | ~4.6 GB | Good | When Q4_K_M is slightly too large |
| Q5_K_M | 5.69 | ~5.7 GB | Very good | Production where quality matters |
| Q5_K_S | 5.54 | ~5.5 GB | Very good | Slight size savings over Q5_K_M |
| Q6_K | 6.57 | ~6.6 GB | Excellent | High-quality with reasonable size |
| Q8_0 | 8.50 | ~8.5 GB | Near-lossless | Quality-critical, memory available |
| F16 | 16.0 | ~16 GB | Lossless | Reference / research only |
Three levels — Q4_K_M, Q5_K_M, and Q8_0 — cover 90% of practical use cases. Start there.
Understanding K-Quants
The "K" in Q4_K_M stands for k-quant — a technique that uses mixed precision within the model. Instead of quantizing every layer identically, k-quants identify which weights are most important to model quality and keep those at higher precision while compressing less critical weights more aggressively.
This is why Q4_K_M significantly outperforms legacy Q4_0 despite similar file sizes. The "M" suffix means medium — a balance between quality and compression. "S" (small) compresses more aggressively; "L" (large) preserves more precision.
The practical implication: always use K-quant variants. There's no reason to use legacy formats (Q4_0, Q5_0) anymore — K-quants are strictly better at similar sizes.
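To see why fewer bits cost accuracy, and why per-block scales help, here's a toy symmetric block quantizer. This illustrates the shared-scale-per-block idea only — it is not the actual llama.cpp kernel; real k-quants add super-blocks and importance-weighted mixed precision on top:

```python
import random

def quantize_block(weights, bits):
    """Quantize one block to signed `bits`-bit integers sharing a single
    FP scale, then dequantize back so we can measure the error."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

random.seed(0)
block = [random.gauss(0, 1) for _ in range(32)]  # one 32-weight block
for bits in (2, 4, 8):
    dq = quantize_block(block, bits)
    err = sum(abs(a - b) for a, b in zip(block, dq)) / len(block)
    print(f"{bits}-bit mean abs error: {err:.4f}")  # error shrinks as bits grow
```

The error drops steeply as bits increase — the same curve that makes Q8_0 "near-lossless" and IQ2 "poor."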
Choosing by Hardware Constraint
Your hardware's available memory is the hard constraint. The model must fit entirely in VRAM (GPU) or RAM (CPU inference) with room left for the KV cache (which grows with context length).
Rule of thumb: Leave 2-3 GB headroom beyond the model file size.
| Available Memory | Maximum Quantization | Recommended |
|---|---|---|
| 4 GB | IQ3_XS or smaller | Marginal — consider a smaller model |
| 8 GB | Q4_K_M (with short context) | Q4_K_M for 7B models |
| 12 GB | Q5_K_M or Q6_K | Q5_K_M for best quality/size |
| 16 GB | Q8_0 | Q5_K_M (leaves room for longer context) |
| 24 GB+ | Q8_0 or F16 | Q8_0 (near-lossless, plenty of headroom) |
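The KV-cache headroom in the rule of thumb can be estimated rather than guessed. A sketch using Llama-3-8B-like geometry (the layer and head counts are assumptions for illustration; check your model's config):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache entries
print(kv_cache_gb(32, 8, 128, 8192))   # ~1.07 GB at 8k context
print(kv_cache_gb(32, 8, 128, 32768))  # ~4.29 GB at 32k context
```

At long contexts the cache, not the weights, becomes the binding constraint — which is why the 16 GB row recommends Q5_K_M over Q8_0.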
By device type:
Smartphones / IoT (2-4 GB AI budget): IQ3_XS or Q3_K_M. Quality compromises are real. Consider using a smaller base model (3B) at Q4_K_M instead of a larger model at extreme quantization — a well-tuned 3B at Q4 often outperforms a squeezed 8B at IQ2.
Laptops / Consumer PCs (8-16 GB): Q4_K_M is the safe default. If you have 16 GB, stretch to Q5_K_M for noticeably better reasoning and coherence.
Apple Silicon Macs (16-128 GB unified memory): The unified memory architecture is uniquely suited for local LLM inference. M4 Pro (24 GB) handles 8B at Q8_0 comfortably. M4 Max (64-128 GB) can run 70B models at Q4_K_M. See our Apple Silicon deployment guide for details.
Desktop GPU (RTX 4090 — 24 GB VRAM, RTX 5090 — 32 GB): Q8_0 for 8B models with plenty of headroom. Q4_K_M for 13-14B models. Consumer GPUs are serious inference hardware.
Dedicated inference hardware (Taalas HC1): Uses proprietary 3-bit quantization baked into silicon — a different approach than GGUF quantization. The model weights are in the transistors, with LoRA adapter weights loaded separately.
Choosing by Use Case
Hardware sets the ceiling. Your use case determines where in the range you should aim.
Q4_K_M: The Default
Use when:
- You need a good balance of quality and performance
- Your task is well-defined (classification, extraction, simple Q&A)
- You've fine-tuned the model for your domain (fine-tuning compensates for quantization loss)
- Memory is a constraint
Quality impact: Slight degradation on complex reasoning and nuanced language. Minimal impact on domain-specific tasks where the model has been fine-tuned. Most users cannot distinguish Q4_K_M output from full precision on routine tasks.
Q5_K_M: The Production Sweet Spot
Use when:
- Quality matters and you have the memory headroom
- You're deploying to production where output quality directly affects users
- Tasks involve reasoning, summarization, or content generation
- You want a safety margin above Q4_K_M without paying the full Q8 cost
Quality impact: Close to imperceptible degradation for most tasks. This is the level where most "is this as good as the cloud API?" comparisons start answering "yes."
Q8_0: Quality-Critical
Use when:
- Output quality is non-negotiable (medical, legal, financial documents)
- You have ample memory (24 GB+ VRAM or 32 GB+ RAM)
- You want to minimize any risk from quantization artifacts
- You're running evaluations and need a fair comparison to full-precision performance
Quality impact: Near-lossless. Effectively identical to full precision for practical purposes. The file is roughly 1.7x the size of Q4_K_M, which means substantially more memory and slightly slower inference — but the quality delta is negligible.
Quantization and Fine-Tuning: The Interaction
If you're fine-tuning models (which you should be, if you're reading this on the Ertas blog), quantization interacts with your training pipeline in important ways.
Fine-Tune at Full Precision, Quantize for Deployment
Always fine-tune at full precision (BF16 or FP16). The small quality losses from fine-tuning plus the small quality losses from quantization can compound if you're not careful. Starting clean at full precision and quantizing only at export time gives the best results.
This is the workflow Ertas follows: training happens on cloud GPUs at full precision, then you export to your target quantization level.
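In llama.cpp terms, the export step is a single tool invocation. A hedged sketch that builds the command line for `subprocess.run` — the `llama-quantize` binary name and argument order match current llama.cpp builds, but verify against your checkout, and the file names here are hypothetical:

```python
def quantize_cmd(f16_gguf: str, quant: str = "Q4_K_M") -> list[str]:
    """Build an `llama-quantize <input> <output> <type>` command."""
    out_path = f16_gguf.replace("f16", quant)
    return ["llama-quantize", f16_gguf, out_path, quant]

print(quantize_cmd("my-model-f16.gguf", "Q5_K_M"))
# -> ['llama-quantize', 'my-model-f16.gguf', 'my-model-Q5_K_M.gguf', 'Q5_K_M']
```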
Fine-Tuning Compensates for Quantization
Here's the counterintuitive insight: a fine-tuned model at Q4_K_M often outperforms a generic model at Q8_0 on domain-specific tasks.
Why? Fine-tuning teaches the model the specific patterns, terminology, and output formats for your use case. The model doesn't need to "figure out" what you want from a general-purpose prompt — it knows, because you trained it. This focused knowledge is more resilient to quantization than general-purpose reasoning.
For a fine-tuned 8B model on a specific task, Q4_K_M is often more than sufficient. The fine-tuning does the heavy lifting; quantization is just the delivery mechanism.
LoRA Adapter Precision
If you're deploying LoRA adapters on a quantized base model, the adapter weights are typically stored at higher precision (FP16 or BF16) while the base model is quantized. This is fine — the adapter weights are small (50-200 MB) and keeping them at higher precision preserves the fine-tuning quality while the base model compression handles the bulk of the memory savings.
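That size range is easy to sanity-check: a rank-r LoRA on a d_out x d_in projection stores two matrices, A (r x d_in) and B (d_out x r). A sketch with Llama-3-8B-like dimensions — the layer count, projection shapes, and rank are illustrative assumptions:

```python
def lora_adapter_mb(rank: int, shapes: list[tuple[int, int]],
                    n_layers: int, bytes_per_param: int = 2) -> float:
    """FP16 adapter size: per layer, each (d_out, d_in) target stores
    A (rank x d_in) plus B (d_out x rank) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_out, d_in in shapes)
    return n_layers * per_layer * bytes_per_param / 1e6

# q/k/v/o projections: hidden 4096, 8 KV heads x head_dim 128 -> 1024
shapes = [(4096, 4096), (1024, 4096), (1024, 4096), (4096, 4096)]
print(lora_adapter_mb(rank=16, shapes=shapes, n_layers=32))  # ~27 MB
```

Higher ranks and adding the MLP projections as targets push adapters toward the upper end of that 50-200 MB range.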
Quick Decision Flowchart
- What's your available memory? → This sets your ceiling quantization level.
- Have you fine-tuned the model? → If yes, you can use a lower quantization (Q4_K_M is usually fine). If no, aim higher (Q5_K_M or Q8_0).
- Is the task domain-specific or general? → Domain-specific tasks tolerate quantization better. General knowledge tasks are more sensitive.
- Is output quality directly customer-facing? → If yes, aim for Q5_K_M minimum. If internal/batch processing, Q4_K_M is sufficient.
- Are you comparing models? → Use Q8_0 or higher for fair comparisons. Quantization artifacts can mask real quality differences between model variants.
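The flowchart above can be collapsed into a small function. The thresholds below encode this guide's recommendations, not hard rules — tune them against your own evals:

```python
def pick_quant(memory_gb: float, fine_tuned: bool,
               customer_facing: bool, comparing_models: bool) -> str:
    """Memory sets the ceiling; the use case sets the target; take the lower."""
    if comparing_models:
        return "Q8_0"  # fair comparisons need near-lossless weights
    ceiling = ("IQ3_XS" if memory_gb <= 4 else
               "Q4_K_M" if memory_gb <= 8 else
               "Q5_K_M" if memory_gb <= 12 else "Q8_0")
    target = "Q5_K_M" if (customer_facing or not fine_tuned) else "Q4_K_M"
    order = ["IQ3_XS", "Q4_K_M", "Q5_K_M", "Q8_0"]
    return min(ceiling, target, key=order.index)

print(pick_quant(16, fine_tuned=True, customer_facing=False,
                 comparing_models=False))
# -> Q4_K_M: a fine-tuned internal workload doesn't need more
```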
The Practical Recommendation
For most teams deploying fine-tuned models:
- Development and testing: Use Q8_0 to establish a quality baseline
- Production (plenty of memory): Q5_K_M — the best balance of quality and efficiency
- Production (memory-constrained): Q4_K_M — good enough for fine-tuned domain models
- Edge/mobile: Q4_K_M on the smallest model that meets your accuracy bar
- Evaluation and comparison: Always Q8_0 or F16 — don't let quantization artifacts cloud your judgment
Export your fine-tuned models from Ertas at your target quantization level. Fine-tune once at full precision, deploy at whatever quantization your hardware demands.
References: Practical GGUF Quantization Guide, Choosing a GGUF Model — K-Quants and I-Quants, Local AI Zone Quantization Guide.