
    Quantization Levels Explained: Q4 vs Q5 vs Q8 and When Each Matters

    A practical guide to choosing GGUF quantization levels for local AI deployment. Covers Q4_K_M, Q5_K_M, Q8_0, and how hardware constraints, fine-tuning, and use case requirements determine the right quantization for your model.

Ertas Team · Updated

    You've fine-tuned a model. You've exported it as GGUF. Now you need to pick a quantization level — and the naming conventions look like a password generator threw up: Q4_K_M, Q5_K_S, Q6_K, Q8_0, IQ3_XS.

    This guide cuts through the confusion. Here's how to pick the right quantization level for your hardware, your use case, and your quality requirements.

    What Quantization Does (30-Second Version)

    Neural network weights are numbers. Full-precision weights use 16 or 32 bits per number. Quantization reduces that to 8, 5, 4, 3, or even 2 bits.

    Fewer bits = smaller file = less memory = faster inference.

    The trade-off: lower precision means some accuracy loss. The question is how much loss, and whether it matters for your task.
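The arithmetic is simple enough to sanity-check yourself. Here's a minimal sketch, assuming you know the parameter count and the effective bits per weight (published figures already include each format's per-block metadata):

```python
# Back-of-the-envelope GGUF file size: parameters x effective bits / 8.
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model at Q4_K_M's ~4.83 effective bits per weight:
print(f"{estimate_size_gb(8e9, 4.83):.1f} GB")  # ~4.8 GB
```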

    For a deeper dive on the GGUF format itself, see our GGUF format explainer. This guide focuses specifically on choosing between quantization levels.

    The Quantization Ladder

Here's every common GGUF quantization level for an 8-billion-parameter model, sorted from smallest to largest:

| Quantization | Bits/weight | File size (8B model) | Quality | Best for |
|---|---|---|---|---|
| IQ2_XXS | 2.06 | ~2.5 GB | Poor | Extreme memory constraints only |
| IQ3_XS | 3.05 | ~3.3 GB | Fair | Mobile / IoT with tight limits |
| Q3_K_M | 3.44 | ~3.6 GB | Fair | Budget edge devices |
| Q4_K_S | 4.58 | ~4.6 GB | Good | When Q4_K_M is slightly too large |
| **Q4_K_M** | 4.83 | ~4.9 GB | Good | Default for most deployments |
| Q5_K_S | 5.54 | ~5.5 GB | Very good | Slight size savings over Q5_K_M |
| **Q5_K_M** | 5.69 | ~5.7 GB | Very good | Production where quality matters |
| Q6_K | 6.57 | ~6.6 GB | Excellent | High quality with reasonable size |
| **Q8_0** | 8.50 | ~8.5 GB | Near-lossless | Quality-critical, memory available |
| F16 | 16.0 | ~16 GB | Lossless | Reference / research only |

    The three levels in bold — Q4_K_M, Q5_K_M, and Q8_0 — cover 90% of practical use cases. Start there.
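In practice, picking a level usually means picking a file: GGUF repos on Hugging Face ship one file per quantization level, named by suffix. A sketch with huggingface_hub; the repo and filename below are placeholders, not a real published model:

```python
from huggingface_hub import hf_hub_download

# Repo and filename are hypothetical; substitute a real GGUF repo.
path = hf_hub_download(
    repo_id="your-org/your-model-GGUF",
    filename="your-model.Q4_K_M.gguf",  # the suffix selects the quant level
)
print(path)
```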

    Understanding K-Quants

    The "K" in Q4_K_M stands for k-quant — a technique that uses mixed precision within the model. Instead of quantizing every layer identically, k-quants identify which weights are most important to model quality and keep those at higher precision while compressing less critical weights more aggressively.

    This is why Q4_K_M significantly outperforms legacy Q4_0 despite similar file sizes. The "M" suffix means medium — a balance between quality and compression. "S" (small) compresses more aggressively; "L" (large) preserves more precision.

    The practical implication: always use K-quant variants. There's no reason to use legacy formats (Q4_0, Q5_0) anymore — K-quants are strictly better at similar sizes.
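To make the idea concrete, here's a toy NumPy sketch of block-wise quantization with mixed precision. This is not the real llama.cpp k-quant algorithm (which uses more elaborate block structures and importance estimates); it only illustrates the core idea of per-block scales, with more bits spent on blocks that matter more:

```python
import numpy as np

def quantize_block(block: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest quantization with one scale per block."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    scale = np.abs(block).max() / qmax
    if scale == 0:
        scale = 1.0
    q = np.round(block / scale).clip(-qmax, qmax)
    return q * scale                    # dequantized approximation

rng = np.random.default_rng(0)
weights = rng.normal(size=1024).reshape(-1, 32)   # 32 blocks of 32 weights

# Crude importance proxy: blocks with larger average magnitude get 8 bits.
importance = np.abs(weights).mean(axis=1)
cutoff = np.quantile(importance, 0.75)
out = np.vstack([quantize_block(b, 8 if imp > cutoff else 4)
                 for b, imp in zip(weights, importance)])
print("mean abs error:", np.abs(out - weights).mean())
```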

    Choosing by Hardware Constraint

    Your hardware's available memory is the hard constraint. The model must fit entirely in VRAM (GPU) or RAM (CPU inference) with room left for the KV cache (which grows with context length).

    Rule of thumb: Leave 2-3 GB headroom beyond the model file size.

| Available memory | Maximum quantization | Recommended |
|---|---|---|
| 4 GB | IQ3_XS or smaller | Marginal; consider a smaller model |
| 8 GB | Q4_K_M (with short context) | Q4_K_M for 7B models |
| 12 GB | Q5_K_M or Q6_K | Q5_K_M for best quality/size |
| 16 GB | Q8_0 | Q5_K_M (leaves room for longer context) |
| 24 GB+ | Q8_0 or F16 | Q8_0 (near-lossless, plenty of headroom) |
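The headroom rule exists mostly because of the KV cache. A rough estimate, assuming Llama-3-8B-like dimensions (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache:

```python
# Why the 2-3 GB headroom: the KV cache grows linearly with context length.
def kv_cache_gb(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # 2 tensors (K and V) per layer, one vector per token per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx / 1e9

print(f"{kv_cache_gb(8192):.2f} GB")    # ~1.07 GB at 8k context
print(f"{kv_cache_gb(32768):.2f} GB")   # ~4.29 GB at 32k context
```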

    By device type:

    Smartphones / IoT (2-4 GB AI budget): IQ3_XS or Q3_K_M. Quality compromises are real. Consider using a smaller base model (3B) at Q4_K_M instead of a larger model at extreme quantization — a well-tuned 3B at Q4 often outperforms a squeezed 8B at IQ2.

    Laptops / Consumer PCs (8-16 GB): Q4_K_M is the safe default. If you have 16 GB, stretch to Q5_K_M for noticeably better reasoning and coherence.

    Apple Silicon Macs (16-128 GB unified memory): The unified memory architecture is uniquely suited for local LLM inference. M4 Pro (24 GB) handles 8B at Q8_0 comfortably. M4 Max (64-128 GB) can run 70B models at Q4_K_M. See our Apple Silicon deployment guide for details.

Desktop GPU (RTX 4090 at 24 GB / RTX 5090 at 32 GB VRAM): Q8_0 for 8B models with plenty of headroom. Q4_K_M for 13-14B models. Consumer GPUs are serious inference hardware.

    Dedicated inference hardware (Taalas HC1): Uses proprietary 3-bit quantization baked into silicon — a different approach than GGUF quantization. The model weights are in the transistors, with LoRA adapter weights loaded separately.

    Choosing by Use Case

    Hardware sets the ceiling. Your use case determines where in the range you should aim.

    Q4_K_M: The Default

    Use when:

    • You need a good balance of quality and performance
    • Your task is well-defined (classification, extraction, simple Q&A)
    • You've fine-tuned the model for your domain (fine-tuning compensates for quantization loss)
    • Memory is a constraint

    Quality impact: Slight degradation on complex reasoning and nuanced language. Minimal impact on domain-specific tasks where the model has been fine-tuned. Most users cannot distinguish Q4_K_M output from full precision on routine tasks.

    Q5_K_M: The Production Sweet Spot

    Use when:

    • Quality matters and you have the memory headroom
    • You're deploying to production where output quality directly affects users
    • Tasks involve reasoning, summarization, or content generation
    • You want a safety margin above Q4_K_M without paying the full Q8 cost

    Quality impact: Close to imperceptible degradation for most tasks. This is the level where most "is this as good as the cloud API?" comparisons start answering "yes."

    Q8_0: Quality-Critical

    Use when:

    • Output quality is non-negotiable (medical, legal, financial documents)
    • You have ample memory (24 GB+ VRAM or 32 GB+ RAM)
    • You want to minimize any risk from quantization artifacts
    • You're running evaluations and need a fair comparison to full-precision performance

    Quality impact: Near-lossless. Effectively identical to full precision for practical purposes. The file is ~2x the size of Q4_K_M, which means twice the memory and slightly slower inference — but the quality delta is negligible.
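Whichever level you choose, the loading code doesn't change; only the file (and the memory bill) does. A minimal llama-cpp-python sketch, with a placeholder path:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/my-model.Q5_K_M.gguf",  # placeholder path
    n_ctx=8192,        # context window; the KV cache grows with this
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)
out = llm("Summarize the following report: ...", max_tokens=128)
print(out["choices"][0]["text"])
```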

    Quantization and Fine-Tuning: The Interaction

    If you're fine-tuning models (which you should be, if you're reading this on the Ertas blog), quantization interacts with your training pipeline in important ways.

    Fine-Tune at Full Precision, Quantize for Deployment

Always fine-tune at full precision (BF16 or FP16). Fine-tuning and quantization each nudge the weights away from their original values, and those errors compound if you stack them carelessly (for example, by fine-tuning weights that were already quantized). Starting clean at full precision and quantizing only at export time gives the best results.

    This is the workflow Ertas follows: training happens on cloud GPUs at full precision, then you export to your target quantization level.
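The export step itself is a two-stage pipeline in the llama.cpp tooling: convert the full-precision checkpoint to GGUF, then quantize. A sketch using recent llama.cpp script and binary names; all paths are placeholders:

```python
import subprocess

# 1. HF checkpoint (BF16/FP16) -> full-precision GGUF
subprocess.run([
    "python", "llama.cpp/convert_hf_to_gguf.py",
    "./my-finetuned-model", "--outfile", "model-f16.gguf",
], check=True)

# 2. Quantize once per deployment target
for level in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    subprocess.run([
        "llama.cpp/build/bin/llama-quantize",
        "model-f16.gguf", f"model-{level}.gguf", level,
    ], check=True)
```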

    Fine-Tuning Compensates for Quantization

    Here's the counterintuitive insight: a fine-tuned model at Q4_K_M often outperforms a generic model at Q8_0 on domain-specific tasks.

    Why? Fine-tuning teaches the model the specific patterns, terminology, and output formats for your use case. The model doesn't need to "figure out" what you want from a general-purpose prompt — it knows, because you trained it. This focused knowledge is more resilient to quantization than general-purpose reasoning.

    For a fine-tuned 8B model on a specific task, Q4_K_M is often more than sufficient. The fine-tuning does the heavy lifting; quantization is just the delivery mechanism.

    LoRA Adapter Precision

    If you're deploying LoRA adapters on a quantized base model, the adapter weights are typically stored at higher precision (FP16 or BF16) while the base model is quantized. This is fine — the adapter weights are small (50-200 MB) and keeping them at higher precision preserves the fine-tuning quality while the base model compression handles the bulk of the memory savings.
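In llama-cpp-python this split is visible at load time: the base model file is quantized, while the adapter is passed separately and stays at its stored precision. Paths below are placeholders:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/base.Q4_K_M.gguf",   # quantized base weights
    lora_path="./models/adapter-f16.gguf",    # small, higher-precision adapter
)
```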

    Quick Decision Flowchart

    1. What's your available memory? → This sets your ceiling quantization level.
    2. Have you fine-tuned the model? → If yes, you can use a lower quantization (Q4_K_M is usually fine). If no, aim higher (Q5_K_M or Q8_0).
    3. Is the task domain-specific or general? → Domain-specific tasks tolerate quantization better. General knowledge tasks are more sensitive.
    4. Is output quality directly customer-facing? → If yes, aim for Q5_K_M minimum. If internal/batch processing, Q4_K_M is sufficient.
    5. Are you comparing models? → Use Q8_0 or higher for fair comparisons. Quantization artifacts can mask real quality differences between model variants.
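As a toy summary, the flowchart collapses into a few lines of Python. Thresholds follow the tables above; treat it as a starting point, not policy:

```python
def pick_quant(memory_gb: float, fine_tuned: bool,
               customer_facing: bool, comparing_models: bool = False) -> str:
    if comparing_models:
        return "Q8_0"                      # fair baseline, no quant artifacts
    if memory_gb < 8:
        return "IQ3_XS (or use a smaller model)"
    if memory_gb < 12:
        return "Q4_K_M"
    if customer_facing or not fine_tuned:
        return "Q5_K_M" if memory_gb < 24 else "Q8_0"
    return "Q4_K_M"                        # fine-tuned, internal: Q4 is enough

print(pick_quant(16, fine_tuned=True, customer_facing=True))  # Q5_K_M
```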

    The Practical Recommendation

    For most teams deploying fine-tuned models:

    1. Development and testing: Use Q8_0 to establish a quality baseline
    2. Production (plenty of memory): Q5_K_M — the best balance of quality and efficiency
    3. Production (memory-constrained): Q4_K_M — good enough for fine-tuned domain models
    4. Edge/mobile: Q4_K_M on the smallest model that meets your accuracy bar
    5. Evaluation and comparison: Always Q8_0 or F16 — don't let quantization artifacts cloud your judgment

    Export your fine-tuned models from Ertas at your target quantization level. Fine-tune once at full precision, deploy at whatever quantization your hardware demands.



