Quantization

    What quantization does to a model, why Ertas ships Q4_K_M today, and what changes when other levels become available.

    Quantization is the step that takes a trained model from "fits on a GPU" to "fits on a phone." It rewrites every weight from 16-bit floating point into a smaller, denser representation, typically 2 to 8 bits per weight. The result is a file several times smaller and noticeably faster on consumer hardware, at the cost of a small amount of precision.

    For most fine-tunes Ertas trains, that trade-off is heavily favourable: a 4-bit quantised model is roughly 4x smaller than the fp16 original and typically loses under 1% of quality on standard benchmarks. That is why Ertas quantises every export by default.

    This page covers what quantization actually does, how the GGUF K-quant family works, why Ertas ships Q4_K_M today, and what changes when additional levels become available. For the upstream story of how the GGUF file is built, see GGUF overview. For file-size math, see File sizes and formats.

    The principle

    A transformer's weights are billions of floating-point numbers, typically stored in fp16 or bf16 (both 16 bits per weight). At inference time, every layer reads those weights, multiplies them against the activations, and writes intermediate tensors. Memory bandwidth dominates inference latency on devices without dedicated tensor cores, which is most of the world's hardware. Smaller weights mean less data to move per token, which usually translates directly into more tokens per second.
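    The bandwidth argument above can be turned into a rough ceiling on decode speed. The sketch below assumes every weight is read from memory exactly once per generated token and ignores compute, activations, and the KV cache, so real speed-ups are usually smaller than this bound. The 7B parameter count and the 100 GB/s bandwidth figure are hypothetical inputs, not measurements.

```python
def max_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if every weight is read once per token."""
    model_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes


# Hypothetical 7B model on a device with 100 GB/s of memory bandwidth.
fp16 = max_tokens_per_second(7, 16, 100)
q4 = max_tokens_per_second(7, 4, 100)
print(f"fp16: {fp16:.1f} tok/s, 4-bit: {q4:.1f} tok/s, ratio: {q4 / fp16:.1f}x")
```

    Under these assumptions the ratio equals the ratio of bits per weight, which is why shrinking the weights is the most direct lever on bandwidth-bound hardware.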

    Quantization compresses each weight by rounding it to one of 2^N possible values for an N-bit format. A 4-bit weight can take one of 16 values; an 8-bit weight, one of 256. The simplest schemes (Q4_0, Q8_0) quantise every weight to the same precision with a single scale per block of 32 weights. The K-quant family (Q4_K_M, Q5_K_S, and friends) is smarter: it uses a mix of precisions within each tensor, spending more bits on the parts of the model most sensitive to precision loss and fewer bits on the parts that absorb rounding error gracefully.

    At inference time, the runtime dequantises each block back to fp16 on the fly as it streams through. That dequantization is cheap on modern CPUs and Apple Silicon, which is why GGUF inference is fast even with no GPU at all.
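    To make the round trip concrete, here is a minimal sketch of block-wise 4-bit quantization and dequantization with one scale per block of 32 weights. It is a simplified symmetric scheme written for illustration, not the exact bit layout llama.cpp uses for Q4_0 or the K-quants, and the function names are hypothetical.

```python
import random

BLOCK_SIZE = 32  # weights per block, as in the simple schemes described above


def quantize_block(weights):
    """Map one block of floats to a shared scale plus 4-bit codes (0..15)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to a signed level, then shift into the unsigned 4-bit range.
    codes = [min(15, max(0, round(w / scale) + 8)) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the scale and the 4-bit codes."""
    return [(c - 8) * scale for c in codes]


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f}, worst rounding error={max_error:.5f}")
```

    The worst-case error per weight is half the scale, which is why blocks with one large outlier lose more precision than blocks whose weights are evenly spread.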

    The K-quant family at a glance

    The K-quant naming convention encodes three things: the nominal bit width, the family suffix, and the variant.

    • The number (2, 3, 4, 5, 6, 8) is the nominal bits per weight. Lower is smaller and faster but less precise.
    • The _K suffix marks the K-quant family, which uses mixed precision within a tensor and a more sophisticated grouping than the older _0 family.
    • The trailing _S / _M / _L picks how much of the model is kept at higher precision. _M is the balanced default; _S (when available) is smaller and slightly worse; _L (rare, e.g. Q3_K_L) is larger and slightly better.
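    The three parts of a level name can be pulled apart mechanically. The parser below is a small illustrative sketch: QuantName and parse_quant_name are hypothetical names, and the pattern only covers the _0 and _K families discussed on this page.

```python
import re
from typing import NamedTuple, Optional


class QuantName(NamedTuple):
    bits: int               # nominal bits per weight
    family: str             # "K" for K-quants, "0" for the older family
    variant: Optional[str]  # "S", "M", "L", or None when the name has no variant


# A variant letter is only accepted after the K family marker.
_PATTERN = re.compile(r"^Q(\d)_(0|K(?:_[SML])?)$")


def parse_quant_name(name: str) -> QuantName:
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised quantization level: {name!r}")
    family, _, variant = match.group(2).partition("_")
    return QuantName(int(match.group(1)), family, variant or None)


for name in ("Q4_K_M", "Q5_K_S", "Q3_K_L", "Q6_K", "Q8_0"):
    print(name, parse_quant_name(name))
```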

    The level you pick is a single point on the size-vs-quality curve. Pick a smaller level if you are memory-bound on the target device; pick a larger one if you are quality-bound on the task.

    What Ertas exports today

    Ertas quantises every successful run to Q4_K_M. The "M" variant of 4-bit K-quant is the format the llama.cpp community has converged on as the practical sweet spot for on-device deployment: small enough to run comfortably on a recent phone or laptop, with a quality penalty almost no end user can perceive in conversational use.

    Concretely, for a typical fine-tune on a 3B to 8B base model:

    Metric | fp16 baseline | Q4_K_M
    File size | 1.0x | ~0.25x
    Memory required at inference | 1.0x | ~0.25x
    Quality on standard benchmarks | baseline | typically under 1% loss
    Inference speed on CPU | baseline | typically 2 to 4x faster, varies by chip and model
    Inference speed on Apple Silicon | baseline | typically 2 to 3x faster, varies by chip and model

    The Q4_K_M choice is locked in the export pipeline today. You cannot pick a different level inside Ertas; every successful run produces a Q4_K_M GGUF. If you need a different level, the LoRA adapter download lets you re-quantise the model yourself. See Local conversion to a different level below.

    What other levels look like

    The K-quant family has well-documented levels above and below Q4_K_M. Here is how the common ones compare to Q4_K_M as the baseline:

    Level | Size vs Q4_K_M | Use case
    Q2_K | ~0.6x | Smallest possible files for very memory-constrained devices. Quality drop is visible.
    Q3_K_M | ~0.8x | A meaningful step down in size with a noticeable but not catastrophic quality cost.
    Q4_K_M (today) | 1.0x | Sweet spot. The default Ertas export.
    Q5_K_M | ~1.25x | Slightly higher fidelity for borderline-quality tasks.
    Q6_K | ~1.5x | Higher quality still; rarely worth the extra size unless you have headroom.
    Q8_0 | ~2x | Near-fp16 quality. Common in evaluation and benchmarking.
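    The multipliers in the table can be used for quick sizing before you convert anything. The sketch below copies the approximate ratios from the table and applies them to the size of an existing Q4_K_M export; the 2.0 GB file and the 2.6 GB budget are hypothetical, and real files will deviate somewhat from these estimates.

```python
# Approximate multipliers copied from the table above, relative to Q4_K_M.
SIZE_VS_Q4_K_M = {
    "Q2_K": 0.6,
    "Q3_K_M": 0.8,
    "Q4_K_M": 1.0,
    "Q5_K_M": 1.25,
    "Q6_K": 1.5,
    "Q8_0": 2.0,
}


def estimate_sizes_gb(q4_k_m_size_gb):
    """Estimate the file size of each level from a known Q4_K_M file size."""
    return {
        level: round(q4_k_m_size_gb * ratio, 2)
        for level, ratio in SIZE_VS_Q4_K_M.items()
    }


def largest_level_that_fits(q4_k_m_size_gb, budget_gb):
    """Return the biggest level whose estimated size fits the budget, or None."""
    sizes = estimate_sizes_gb(q4_k_m_size_gb)
    fitting = [level for level, size in sizes.items() if size <= budget_gb]
    return max(fitting, key=sizes.get) if fitting else None


# Hypothetical: a 2.0 GB Q4_K_M export and a 2.6 GB storage budget.
print(estimate_sizes_gb(2.0))
print(largest_level_that_fits(2.0, 2.6))
```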

    Coming soon: selectable quantization in Ertas. Q5_K_M, Q8_0, Q3_K_M, and Q2_K are on the roadmap as exportable choices inside the Training Config picker. Until they ship, the local-conversion path below lets you produce any of these levels yourself from the downloaded LoRA adapter.

    Local conversion to a different level

    If you need a non-Q4_K_M GGUF today, the path is:

    Step 1: Download the LoRA from Hub

    Open the completed run, click LoRA, and extract the PEFT-compatible bundle.

    Step 2: Merge the LoRA into the base

    In your own Python environment, load the base model with Hugging Face Transformers, attach the adapter with PeftModel.from_pretrained(), call merge_and_unload(), and save the merged checkpoint as safetensors.
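    The merge can be sketched as a single function. This assumes torch, transformers, and peft are installed, that the base model is a causal language model loadable with AutoModelForCausalLM, and that the fine-tune did not change the tokenizer. The function name merge_lora and all paths and model IDs are placeholders.

```python
def merge_lora(base_model_id: str, adapter_dir: str, output_dir: str) -> None:
    """Merge a PEFT LoRA adapter into its base model and save safetensors."""
    # Heavy dependencies are imported lazily, so defining this function
    # does not require torch, transformers, or peft to be installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base weights in fp16 to match the conversion step that follows.
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, torch_dtype=torch.float16
    )
    model = PeftModel.from_pretrained(base, adapter_dir)

    # Fold the adapter weights into the base weights and drop the PEFT wrapper.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir, safe_serialization=True)

    # Assumption: the tokenizer is unchanged, so it is copied from the base.
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)


# Example call with placeholder values:
# merge_lora("your-org/your-base-model", "./lora-adapter", "./merged")
```

    If the downloaded bundle ships its own tokenizer files, loading the tokenizer from the adapter directory instead is the safer choice.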

    Step 3: Convert to GGUF

    Clone llama.cpp and run python3 convert_hf_to_gguf.py /path/to/merged --outfile model.fp16.gguf --outtype f16. This produces an unquantised GGUF, which is the source for any of the quantization levels below.

    Step 4: Quantise to the level you want

    Build llama.cpp and run ./build/bin/llama-quantize model.fp16.gguf model.q5km.gguf Q5_K_M (substitute any other supported level name). The output is your final, deployment-ready GGUF.

    The same path also reproduces Q4_K_M, in case you want to verify Ertas's export against a local build. The two files will be close in size and behaviour but not byte-identical: Ertas's quantiser uses an importance-matrix calibration step that is not part of the public bundle, which biases the per-block precision allocation slightly differently than a vanilla local llama-quantize invocation. To get a closer match, pass --imatrix calibration.dat to llama-quantize with a calibration file of your own.
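    Steps 3 and 4, including the optional --imatrix flag, can be scripted. The sketch below only assembles the two command lines quoted above and runs them with subprocess; it assumes llama.cpp is cloned and built under a local llama.cpp directory. The helper names build_commands and run_conversion are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(merged_dir, level, llama_cpp_dir="llama.cpp", imatrix=None):
    """Return the convert and quantize argument lists for one target level."""
    root = Path(llama_cpp_dir)
    f16_path = "model.fp16.gguf"
    # Q5_K_M becomes model.q5km.gguf, matching the naming used above.
    out_path = f"model.{level.lower().replace('_', '')}.gguf"

    convert = [
        "python3", str(root / "convert_hf_to_gguf.py"), str(merged_dir),
        "--outfile", f16_path, "--outtype", "f16",
    ]
    quantize = [str(root / "build" / "bin" / "llama-quantize")]
    if imatrix is not None:
        # Optional calibration file; options go before the positional arguments.
        quantize += ["--imatrix", str(imatrix)]
    quantize += [f16_path, out_path, level]
    return convert, quantize


def run_conversion(merged_dir, level, **kwargs):
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(merged_dir, level, **kwargs):
        subprocess.run(command, check=True)


convert_cmd, quantize_cmd = build_commands("/path/to/merged", "Q5_K_M")
print(" ".join(convert_cmd))
print(" ".join(quantize_cmd))
```

    Keeping command construction separate from execution makes it easy to print and inspect the exact invocation before spending time on a conversion.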

    This is also the recommended credit-saver: the GPU minutes Ertas would have spent on conversion stay in your account as credits if you do the conversion locally. See Credits and usage for the broader cost-saving guide.

    When to worry about the quality drop

    For instruction-following on small models, Q4_K_M is almost always fine. Cases where you might want a higher-fidelity level, once those levels ship inside Ertas:

    • Math, code, and structured-output tasks where every token matters and a single wrong digit or syntax token breaks the answer. Q5_K_M or Q8_0 can recover a measurable fraction of accuracy on these.
    • Long-context summarisation where small precision errors compound over thousands of tokens. Q5_K_M is the conservative pick.
    • Multilingual fine-tunes where rare-token coverage matters. The K-quant family keeps embeddings at higher precision already; Q5_K_M adds margin.

    For everything else (chat, customer support, light reasoning, persona adoption, in-domain Q&A), Q4_K_M is the right default and is unlikely to be the bottleneck. If a Q4_K_M model is misbehaving, the dataset is much more likely to be the cause than the quantization level. See Iterating for the standard triage path.

    What's next