
    Quantization Levels Explained: Q4 vs Q5 vs Q8 and When Each Matters

    A practical guide to choosing GGUF quantization levels for local AI deployment. Covers Q4_K_M, Q5_K_M, Q8_0, and how hardware constraints, fine-tuning, and use case requirements determine the right quantization for your model.

Ertas Team · Updated

    You've fine-tuned a model. You've exported it as GGUF. Now you need to pick a quantization level — and the naming conventions look like a password generator threw up: Q4_K_M, Q5_K_S, Q6_K, Q8_0, IQ3_XS.

    This guide cuts through the confusion. Here's how to pick the right quantization level for your hardware, your use case, and your quality requirements.

    What Quantization Does (30-Second Version)

    Neural network weights are numbers. Full-precision weights use 16 or 32 bits per number. Quantization reduces that to 8, 5, 4, 3, or even 2 bits.

    Fewer bits = smaller file = less memory = faster inference.

    The trade-off: lower precision means some accuracy loss. The question is how much loss, and whether it matters for your task.
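The arithmetic is simple enough to sanity-check yourself. Here's a minimal sketch, assuming you know the parameter count and the effective bits per weight (published figures already include each format's per-block metadata):

```python
# Back-of-the-envelope GGUF file size: parameters x effective bits / 8.
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model at Q4_K_M's ~4.83 effective bits per weight:
print(f"{estimate_size_gb(8e9, 4.83):.1f} GB")  # ~4.8 GB
```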

    For a deeper dive on the GGUF format itself, see our GGUF format explainer. This guide focuses specifically on choosing between quantization levels.

    The Quantization Ladder

Here's every common GGUF quantization level for an 8-billion-parameter model, sorted from smallest to largest:

| Quantization | Bits/weight | File size (8B model) | Quality | Best for |
|---|---|---|---|---|
| IQ2_XXS | 2.06 | ~2.5 GB | Poor | Extreme memory constraints only |
| IQ3_XS | 3.05 | ~3.3 GB | Fair | Mobile / IoT with tight limits |
| Q3_K_M | 3.44 | ~3.6 GB | Fair | Budget edge devices |
| Q4_K_S | 4.58 | ~4.6 GB | Good | When Q4_K_M is slightly too large |
| **Q4_K_M** | 4.83 | ~4.9 GB | Good | Default for most deployments |
| Q5_K_S | 5.54 | ~5.5 GB | Very good | Slight size savings over Q5_K_M |
| **Q5_K_M** | 5.69 | ~5.7 GB | Very good | Production where quality matters |
| Q6_K | 6.57 | ~6.6 GB | Excellent | High quality with reasonable size |
| **Q8_0** | 8.50 | ~8.5 GB | Near-lossless | Quality-critical, memory available |
| F16 | 16.0 | ~16 GB | Lossless | Reference / research only |

    The three levels in bold — Q4_K_M, Q5_K_M, and Q8_0 — cover 90% of practical use cases. Start there.
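In practice, picking a level usually means picking a file: GGUF repos on Hugging Face ship one file per quantization level, named by suffix. A sketch with huggingface_hub; the repo and filename below are placeholders, not a real published model:

```python
from huggingface_hub import hf_hub_download

# Repo and filename are hypothetical; substitute a real GGUF repo.
path = hf_hub_download(
    repo_id="your-org/your-model-GGUF",
    filename="your-model.Q4_K_M.gguf",  # the suffix selects the quant level
)
print(path)
```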

    Understanding K-Quants

    The "K" in Q4_K_M stands for k-quant — a technique that uses mixed precision within the model. Instead of quantizing every layer identically, k-quants identify which weights are most important to model quality and keep those at higher precision while compressing less critical weights more aggressively.

    This is why Q4_K_M significantly outperforms legacy Q4_0 despite similar file sizes. The "M" suffix means medium — a balance between quality and compression. "S" (small) compresses more aggressively; "L" (large) preserves more precision.

    The practical implication: always use K-quant variants. There's no reason to use legacy formats (Q4_0, Q5_0) anymore — K-quants are strictly better at similar sizes.
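To make the idea concrete, here's a toy NumPy sketch of block-wise quantization with mixed precision. This is not the real llama.cpp k-quant algorithm (which uses more elaborate block structures and importance estimates); it only illustrates the core idea of per-block scales, with more bits spent on blocks that matter more:

```python
import numpy as np

def quantize_block(block: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest quantization with one scale per block."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    scale = np.abs(block).max() / qmax
    if scale == 0:
        scale = 1.0
    q = np.round(block / scale).clip(-qmax, qmax)
    return q * scale                    # dequantized approximation

rng = np.random.default_rng(0)
weights = rng.normal(size=1024).reshape(-1, 32)   # 32 blocks of 32 weights

# Crude importance proxy: blocks with larger average magnitude get 8 bits.
importance = np.abs(weights).mean(axis=1)
cutoff = np.quantile(importance, 0.75)
out = np.vstack([quantize_block(b, 8 if imp > cutoff else 4)
                 for b, imp in zip(weights, importance)])
print("mean abs error:", np.abs(out - weights).mean())
```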

    Choosing by Hardware Constraint

    Your hardware's available memory is the hard constraint. The model must fit entirely in VRAM (GPU) or RAM (CPU inference) with room left for the KV cache (which grows with context length).

    Rule of thumb: Leave 2-3 GB headroom beyond the model file size.

| Available memory | Maximum quantization | Recommended |
|---|---|---|
| 4 GB | IQ3_XS or smaller | Marginal; consider a smaller model |
| 8 GB | Q4_K_M (with short context) | Q4_K_M for 7B models |
| 12 GB | Q5_K_M or Q6_K | Q5_K_M for best quality/size |
| 16 GB | Q8_0 | Q5_K_M (leaves room for longer context) |
| 24 GB+ | Q8_0 or F16 | Q8_0 (near-lossless, plenty of headroom) |
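The headroom rule exists mostly because of the KV cache. A rough estimate, assuming Llama-3-8B-like dimensions (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache:

```python
# Why the 2-3 GB headroom: the KV cache grows linearly with context length.
def kv_cache_gb(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # 2 tensors (K and V) per layer, one vector per token per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx / 1e9

print(f"{kv_cache_gb(8192):.2f} GB")    # ~1.07 GB at 8k context
print(f"{kv_cache_gb(32768):.2f} GB")   # ~4.29 GB at 32k context
```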

    By device type:

    Smartphones / IoT (2-4 GB AI budget): IQ3_XS or Q3_K_M. Quality compromises are real. Consider using a smaller base model (3B) at Q4_K_M instead of a larger model at extreme quantization — a well-tuned 3B at Q4 often outperforms a squeezed 8B at IQ2.

    Laptops / Consumer PCs (8-16 GB): Q4_K_M is the safe default. If you have 16 GB, stretch to Q5_K_M for noticeably better reasoning and coherence.

    Apple Silicon Macs (16-128 GB unified memory): The unified memory architecture is uniquely suited for local LLM inference. M4 Pro (24 GB) handles 8B at Q8_0 comfortably. M4 Max (64-128 GB) can run 70B models at Q4_K_M. See our Apple Silicon deployment guide for details.

Desktop GPU (RTX 4090 at 24 GB / RTX 5090 at 32 GB VRAM): Q8_0 for 8B models with plenty of headroom. Q4_K_M for 13-14B models. Consumer GPUs are serious inference hardware.

    Dedicated inference hardware (Taalas HC1): Uses proprietary 3-bit quantization baked into silicon — a different approach than GGUF quantization. The model weights are in the transistors, with LoRA adapter weights loaded separately.

    Choosing by Use Case

    Hardware sets the ceiling. Your use case determines where in the range you should aim.

    Q4_K_M: The Default

    Use when:

    • You need a good balance of quality and performance
    • Your task is well-defined (classification, extraction, simple Q&A)
    • You've fine-tuned the model for your domain (fine-tuning compensates for quantization loss)
    • Memory is a constraint

    Quality impact: Slight degradation on complex reasoning and nuanced language. Minimal impact on domain-specific tasks where the model has been fine-tuned. Most users cannot distinguish Q4_K_M output from full precision on routine tasks.

    Q5_K_M: The Production Sweet Spot

    Use when:

    • Quality matters and you have the memory headroom
    • You're deploying to production where output quality directly affects users
    • Tasks involve reasoning, summarization, or content generation
    • You want a safety margin above Q4_K_M without paying the full Q8 cost

    Quality impact: Close to imperceptible degradation for most tasks. This is the level where most "is this as good as the cloud API?" comparisons start answering "yes."

    Q8_0: Quality-Critical

    Use when:

    • Output quality is non-negotiable (medical, legal, financial documents)
    • You have ample memory (24 GB+ VRAM or 32 GB+ RAM)
    • You want to minimize any risk from quantization artifacts
    • You're running evaluations and need a fair comparison to full-precision performance

    Quality impact: Near-lossless. Effectively identical to full precision for practical purposes. The file is ~2x the size of Q4_K_M, which means twice the memory and slightly slower inference — but the quality delta is negligible.
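Whichever level you choose, the loading code doesn't change; only the file (and the memory bill) does. A minimal llama-cpp-python sketch, with a placeholder path:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/my-model.Q5_K_M.gguf",  # placeholder path
    n_ctx=8192,        # context window; the KV cache grows with this
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)
out = llm("Summarize the following report: ...", max_tokens=128)
print(out["choices"][0]["text"])
```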

    Quantization and Fine-Tuning: The Interaction

    If you're fine-tuning models (which you should be, if you're reading this on the Ertas blog), quantization interacts with your training pipeline in important ways.

    Fine-Tune at Full Precision, Quantize for Deployment

Always fine-tune at full precision (BF16 or FP16). Fine-tuning and quantization each nudge the weights away from their original values, and those errors compound if you stack them carelessly (for example, by fine-tuning weights that were already quantized). Starting clean at full precision and quantizing only at export time gives the best results.

    This is the workflow Ertas follows: training happens on cloud GPUs at full precision, then you export to your target quantization level.
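The export step itself is a two-stage pipeline in the llama.cpp tooling: convert the full-precision checkpoint to GGUF, then quantize. A sketch using recent llama.cpp script and binary names; all paths are placeholders:

```python
import subprocess

# 1. HF checkpoint (BF16/FP16) -> full-precision GGUF
subprocess.run([
    "python", "llama.cpp/convert_hf_to_gguf.py",
    "./my-finetuned-model", "--outfile", "model-f16.gguf",
], check=True)

# 2. Quantize once per deployment target
for level in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    subprocess.run([
        "llama.cpp/build/bin/llama-quantize",
        "model-f16.gguf", f"model-{level}.gguf", level,
    ], check=True)
```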

    Fine-Tuning Compensates for Quantization

    Here's the counterintuitive insight: a fine-tuned model at Q4_K_M often outperforms a generic model at Q8_0 on domain-specific tasks.

    Why? Fine-tuning teaches the model the specific patterns, terminology, and output formats for your use case. The model doesn't need to "figure out" what you want from a general-purpose prompt — it knows, because you trained it. This focused knowledge is more resilient to quantization than general-purpose reasoning.

    For a fine-tuned 8B model on a specific task, Q4_K_M is often more than sufficient. The fine-tuning does the heavy lifting; quantization is just the delivery mechanism.

    LoRA Adapter Precision

    If you're deploying LoRA adapters on a quantized base model, the adapter weights are typically stored at higher precision (FP16 or BF16) while the base model is quantized. This is fine — the adapter weights are small (50-200 MB) and keeping them at higher precision preserves the fine-tuning quality while the base model compression handles the bulk of the memory savings.
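In llama-cpp-python this split is visible at load time: the base model file is quantized, while the adapter is passed separately and stays at its stored precision. Paths below are placeholders:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/base.Q4_K_M.gguf",   # quantized base weights
    lora_path="./models/adapter-f16.gguf",    # small, higher-precision adapter
)
```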

    Quick Decision Flowchart

    1. What's your available memory? → This sets your ceiling quantization level.
    2. Have you fine-tuned the model? → If yes, you can use a lower quantization (Q4_K_M is usually fine). If no, aim higher (Q5_K_M or Q8_0).
    3. Is the task domain-specific or general? → Domain-specific tasks tolerate quantization better. General knowledge tasks are more sensitive.
    4. Is output quality directly customer-facing? → If yes, aim for Q5_K_M minimum. If internal/batch processing, Q4_K_M is sufficient.
    5. Are you comparing models? → Use Q8_0 or higher for fair comparisons. Quantization artifacts can mask real quality differences between model variants.
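As a toy summary, the flowchart collapses into a few lines of Python. Thresholds follow the tables above; treat it as a starting point, not policy:

```python
def pick_quant(memory_gb: float, fine_tuned: bool,
               customer_facing: bool, comparing_models: bool = False) -> str:
    if comparing_models:
        return "Q8_0"                      # fair baseline, no quant artifacts
    if memory_gb < 8:
        return "IQ3_XS (or use a smaller model)"
    if memory_gb < 12:
        return "Q4_K_M"
    if customer_facing or not fine_tuned:
        return "Q5_K_M" if memory_gb < 24 else "Q8_0"
    return "Q4_K_M"                        # fine-tuned, internal: Q4 is enough

print(pick_quant(16, fine_tuned=True, customer_facing=True))  # Q5_K_M
```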

    The Practical Recommendation

    For most teams deploying fine-tuned models:

    1. Development and testing: Use Q8_0 to establish a quality baseline
    2. Production (plenty of memory): Q5_K_M — the best balance of quality and efficiency
    3. Production (memory-constrained): Q4_K_M — good enough for fine-tuned domain models
    4. Edge/mobile: Q4_K_M on the smallest model that meets your accuracy bar
    5. Evaluation and comparison: Always Q8_0 or F16 — don't let quantization artifacts cloud your judgment

    Export your fine-tuned models from Ertas at your target quantization level. Fine-tune once at full precision, deploy at whatever quantization your hardware demands.



