GGUF overview

    What GGUF is, what is inside the Ertas export bundle, and the conversion pipeline that produces it.

    GGUF stands for GPT-Generated Unified Format. It is a single-file model format created by the llama.cpp project to bundle weights, tokenizer, architecture metadata, and chat templates into one portable artifact. Every common consumer-grade local-inference runtime reads GGUF natively, which is why the Ertas export pipeline produces GGUF.

    This page covers what is inside the bundle Ertas hands you, how that bundle is produced from a trained LoRA adapter, and which base models export cleanly. For the size and quality trade-offs of the quantization step, see Quantization. For sanity-checking a download before you ship, see Verifying exports.

    Why GGUF

    GGUF was designed for on-device inference, and that is exactly where Ertas-trained models are meant to run. The format earns its keep on three properties:

    • Single file, no Python. Every weight, every config field, every tokenizer entry, and every chat-template hint lives in one binary. There is no companion config.json to forget, no separate tokenizer.model to ship, and no Python environment required to load the model at inference time.
    • Cross-runtime portability. Ollama, LM Studio, llama.cpp itself, GPT4All, and most mobile inference libraries all consume GGUF as their primary format. Producing GGUF means the same file loads in any of them without a further conversion step.
    • Built for quantization. The format was designed to carry mixed-precision weights efficiently and to expose the per-block quantization metadata that runtimes need to dequantise on the fly. Quantization is not an afterthought; it is the reason GGUF exists.
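
    To make the single-file point concrete, here is a minimal sketch that reads the fixed header at the start of a GGUF file. It assumes a little-endian file in GGUF version 2 or later, where the two counts are 64-bit; the function name read_gguf_header is ours for illustration and is not part of any Ertas or llama.cpp API.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes little-endian GGUF v2 or later (64-bit counts).
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # After the 4-byte magic: uint32 version, uint64 tensor count,
    # uint64 metadata key/value count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

    Calling read_gguf_header("model.gguf") on an extracted bundle is a quick way to confirm that the file really is GGUF before handing it to a runtime. The metadata key/value section that follows the header is where the tokenizer, architecture fields, and chat template live.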

    The trade-off worth knowing: GGUF is purpose-built for CPU and Apple-Silicon inference, plus consumer-GPU runtimes (CUDA, Metal, Vulkan, ROCm via llama.cpp's kernels). It is not the right format for high-throughput server inference on data-centre GPUs, where vLLM-friendly safetensors checkpoints win on tokens per second per dollar. If your deployment shape is closer to server-side hosting, the LoRA bundle plus the original base model in safetensors is the better source of truth; see the FAQ for the conversion path.

    What is in the export bundle

    The GGUF button on a completed run downloads a single ZIP with the following contents:

    File         Purpose
    model.gguf   The quantised model. Q4_K_M today. Loadable directly by any llama.cpp-compatible runner.
    Modelfile    An Ollama Modelfile pre-filled with the base model's chat template, stop tokens, and the sampling defaults you set in Training Config. See The Modelfile below.
    install.bat  One-click Ollama installer for Windows. Checks Ollama is on PATH, exits with a download link if not, then calls ollama create with the bundled Modelfile.
    install.sh   One-click Ollama installer for macOS and Linux. Same behaviour as install.bat.
    README.txt   Per-OS install and ollama run instructions, recommended RAM and disk-space numbers for this bundle, the manual install command, and a snapshot of every training hyperparameter and LoRA setting the run was configured with.

    Extracted, the bundle is self-contained: nothing else from Ertas is required to load the model into Ollama or any other supported runtime. The download URL is a signed, time-limited link. If it expires before you click, open the run again in Hub and the next click generates a fresh one.
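
    If you script your downloads, a quick completeness check on the ZIP can catch a truncated transfer early. The sketch below uses the five file names listed above; the helper name missing_bundle_files is hypothetical, and matching on base names is our assumption so that a top-level folder inside the archive does not matter.

```python
import zipfile

# File names taken from the bundle contents listed in this page.
EXPECTED_FILES = {"model.gguf", "Modelfile", "install.bat", "install.sh", "README.txt"}


def missing_bundle_files(zip_path):
    """Return the expected file names that are absent from the ZIP, sorted."""
    with zipfile.ZipFile(zip_path) as zf:
        # Compare base names only, so "my-run/model.gguf" counts as "model.gguf".
        present = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return sorted(EXPECTED_FILES - present)
```

    An empty list means every expected file is present. This checks names only, not contents.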

    The LoRA button downloads a separate, smaller bundle: a PEFT-compatible directory ZIP containing adapter_model.safetensors, adapter_config.json, chat_template.jinja, tokenizer.json, tokenizer_config.json, and a README.md model-card template. Total size is typically 25 to 50 MB depending on the base model's tokenizer. Use it when you want to re-merge into a different base in code, or load the adapter at inference time with PeftModel.from_pretrained().
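
    Before loading the adapter in code, it can help to confirm which base model and layers it expects. The following is a small sketch that reads adapter_config.json from the extracted LoRA bundle; the keys used are standard PEFT LoRA config fields, while the function name summarize_adapter is ours and purely illustrative.

```python
import json
from pathlib import Path


def summarize_adapter(adapter_dir):
    """Return the key LoRA settings stored in adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    targets = config.get("target_modules") or []
    # PEFT also allows a single string here; normalise to a list.
    if isinstance(targets, str):
        targets = [targets]
    return {
        "base_model": config.get("base_model_name_or_path"),
        "rank": config.get("r"),
        "alpha": config.get("lora_alpha"),
        "target_modules": sorted(targets),
    }
```

    The base_model value tells you which checkpoint to load before passing the adapter directory to PeftModel.from_pretrained().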

    How the GGUF is produced

    When you press play with Convert to GGUF enabled (the default), Ertas runs the following steps after training:

    Step 1: Save the LoRA adapter

    The trained adapter weights are written to your project storage. This is the artifact behind the LoRA download button. Everything below depends on this save succeeding.

    Step 2: Merge into the base

    The pipeline merges the LoRA adapter into a fresh copy of the base model's fp16 weights, producing a standalone merged checkpoint in memory. The adapter's target_modules setting determines which layers receive the adapter delta.
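
    The arithmetic behind the merge is small enough to sketch. For each targeted layer, the merged weight is W + (alpha / r) * B * A, where A and B are the two low-rank adapter matrices. The code below assumes the standard LoRA scaling of alpha / r and uses plain Python lists so it runs without dependencies; it illustrates the operation and is not Ertas's actual merge code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) for one targeted layer.

    weight is (out x in), lora_b is (out x rank), lora_a is (rank x in).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x rank) @ (rank x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

    Layers that are not listed in target_modules have no A and B matrices, so their weights pass through the merge unchanged.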

    Step 3: Convert to GGUF

    The merged checkpoint is converted to the GGUF container format using llama.cpp's convert_hf_to_gguf.py. Architecture, tokenizer, and chat template are written into the file's metadata so any GGUF-aware runtime can load the model without external config files.
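
    If you run this step yourself, the conversion is a single script invocation. The sketch below only builds the command line; the paths are hypothetical placeholders, and the exact arguments Ertas uses server-side are not documented here. The --outfile and --outtype options are the converter's own flags.

```python
import subprocess
import sys


def build_convert_command(script_path, merged_dir, out_path, outtype="f16"):
    """Build the argument list for llama.cpp's convert_hf_to_gguf.py."""
    return [
        sys.executable,
        str(script_path),   # path to convert_hf_to_gguf.py in a llama.cpp checkout
        str(merged_dir),    # directory holding the merged Hugging Face checkpoint
        "--outfile", str(out_path),
        "--outtype", outtype,
    ]


def convert(script_path, merged_dir, out_path):
    """Run the conversion and raise if the script exits with an error."""
    subprocess.run(build_convert_command(script_path, merged_dir, out_path), check=True)
```

    Keeping command construction separate from execution makes the arguments easy to log or inspect before a long-running conversion starts.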

    Step 4: Quantise to Q4_K_M

    llama.cpp's quantiser rewrites the weight blocks to 4-bit Q4_K_M packing, with mixed precision across blocks within each tensor to preserve quality on the parts of the model that matter most. The result is the model.gguf you download.
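
    To give a feel for what block quantisation means, here is a deliberately simplified sketch: one block of weights is stored as signed 4-bit integers plus a single floating-point scale, and the runtime multiplies them back at load time. This is not the Q4_K_M layout, which is more elaborate; the function names and the symmetric scheme are assumptions chosen for illustration.

```python
def quantize_block(values):
    """Quantise one block to 4-bit signed codes plus one float scale."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude to code 7; fall back to 1.0 for an all-zero block.
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the codes and the block scale."""
    return [scale * c for c in codes]
```

    Each recovered weight differs from the original by at most half a scale step, which is why blocks with a narrow value range lose very little precision.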

    Step 5: Bundle

    Ertas templates the Ollama Modelfile against the base model's chat template, generates install.bat and install.sh referencing the model name, writes the README.txt, and zips everything together.

    The whole conversion adds roughly 12 minutes to the run on a T4 and is billed at the same per-minute GPU rate as training itself. If you only want the raw LoRA adapter (for example, to merge into a different base later in your own code), turn off Convert to GGUF in the Training Config picker before queueing the run. You can rerun with conversion enabled later if you decide you want the export. If your own machine is powerful enough to run the conversion, the local-conversion path documented in Credits and usage saves the credits the server-side conversion would otherwise cost.

    Which models export cleanly

    Every model in the Ertas verified catalog converts to GGUF without surprises. That covers Llama 3 and 4, Mistral 7B, Mixtral, Phi-3 and 4, Gemma 3 and 4, Qwen 2.5 and 3, SmolLM, and TinyLlama, including their instruction-tuned variants. See the full catalog for the exhaustive list.

    Models you bring from a Hugging Face URL outside the catalog convert correctly if their architecture is supported by llama.cpp. The picker flags unverified architectures at attach time. Most popular open-weights families work; exotic or research-only architectures sometimes do not. Ertas surfaces the failure at the conversion stage when the architecture is unsupported, and the run falls under the standard verified-catalog rule for refunds: an unverified base means you take the credit risk. See Handling failures.
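
    A rough pre-check you can run on a model outside the catalog is to read the architectures field from its Hugging Face config.json and compare it against the architectures you know convert. The allowlist below is a short illustrative sample, not llama.cpp's real support matrix, and the function name is hypothetical; consult llama.cpp itself for the authoritative list.

```python
import json
from pathlib import Path

# Illustrative sample only. This is NOT the full set llama.cpp supports.
ASSUMED_SUPPORTED = {"LlamaForCausalLM", "MistralForCausalLM", "Qwen2ForCausalLM"}


def unsupported_architectures(model_dir, supported=ASSUMED_SUPPORTED):
    """Return architecture names from config.json that are not in the allowlist."""
    config = json.loads((Path(model_dir) / "config.json").read_text(encoding="utf-8"))
    return [arch for arch in config.get("architectures", []) if arch not in supported]
```

    An empty result suggests the conversion step is likely to recognise the model; a non-empty one is a hint to test conversion locally before spending credits on a run.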

    The Modelfile

    The bundled Modelfile is the contract between Ertas's export and Ollama's loader. It is a short text file (under 1 KB for most bundles) containing four kinds of lines:

    • FROM ./model.gguf points Ollama at the bundled weights.
    • A TEMPLATE block transcribes the base model's chat template into Ollama's template syntax, with the system, user, and model turn markers the base expects (e.g. <start_of_turn> / <end_of_turn> for Gemma, <|im_start|> / <|im_end|> for ChatML-style families).
    • PARAMETER stop lines list the base model's end-of-turn tokens (typically two: the chat-template's open and close markers), so generation halts cleanly without spilling into the next turn.
    • PARAMETER temperature, top_p, repeat_penalty, and repeat_last_n set the sampling defaults. The temperature and top-p come from the values you set in Training Config; the repeat penalty (default 1.15) and look-back window (default 128 tokens) are chosen by Ertas to suppress the kind of repetitive output that small quantised models occasionally drift into.
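
    Put together, the four kinds of lines look roughly like what the sketch below renders. It assumes a ChatML-style base; the template text, the temperature of 0.7, and the top-p of 0.9 are example values, not what Ertas necessarily writes for your run. The repeat penalty and look-back defaults are the ones stated above.

```python
def render_modelfile(template, stop_tokens, temperature, top_p,
                     repeat_penalty=1.15, repeat_last_n=128):
    """Render a minimal Ollama Modelfile as a string."""
    lines = ["FROM ./model.gguf", f'TEMPLATE """{template}"""']
    lines += [f'PARAMETER stop "{tok}"' for tok in stop_tokens]
    lines += [
        f"PARAMETER temperature {temperature}",
        f"PARAMETER top_p {top_p}",
        f"PARAMETER repeat_penalty {repeat_penalty}",
        f"PARAMETER repeat_last_n {repeat_last_n}",
    ]
    return "\n".join(lines) + "\n"


# Example ChatML-style template in Ollama's Go-template syntax (illustrative).
CHATML_TEMPLATE = (
    "{{ if .System }}<|im_start|>system\n{{ .System }}<|im_end|>\n{{ end }}"
    "<|im_start|>user\n{{ .Prompt }}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

if __name__ == "__main__":
    print(render_modelfile(CHATML_TEMPLATE, ["<|im_start|>", "<|im_end|>"], 0.7, 0.9))
```

    Comparing this shape against the Modelfile in your bundle is a quick way to see which lines are safe to edit, such as the sampling parameters, and which should stay as generated, such as the template and stop tokens.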

    You can edit the Modelfile before running install.bat or install.sh if you want different sampling defaults baked in. The install scripts derive the model name from the extracted folder name (lowercased, non-alphanumerics converted to hyphens), check for Ollama on PATH, exit with a download link if it is missing, and otherwise call ollama create <name> -f Modelfile. You can run that same command yourself if you want a custom model name.
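
    The naming rule is easy to reproduce if you want to predict the model name before installing. This sketch follows the description above literally: lowercase the folder name, then turn each non-alphanumeric character into a hyphen. Whether the real scripts collapse repeated hyphens is not stated, so this version does not; both function names are ours.

```python
import re


def derive_model_name(folder_name):
    """Lowercase the name and replace every non-alphanumeric character with '-'."""
    return re.sub(r"[^a-z0-9]", "-", folder_name.lower())


def ollama_create_command(folder_name, modelfile="Modelfile"):
    """Build the 'ollama create <name> -f Modelfile' argument list."""
    return ["ollama", "create", derive_model_name(folder_name), "-f", modelfile]
```

    For a folder named "My Support_Bot v2", derive_model_name returns "my-support-bot-v2", which is the name you would then pass to ollama run.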

    Coming soon: additional quantization levels. Q4_K_M is the practical sweet spot and is what every Ertas export uses today. Q5_K_M (slightly higher fidelity at roughly 25% more size), Q8_0 (near-fp16 quality at roughly 2x size), and lower-fidelity Q3_K_M and Q2_K for very low-end devices are on the roadmap. Until they ship, the local-conversion path in Credits and usage lets you re-quantise the merged checkpoint to whichever level you want, using llama.cpp's quantize tool (named llama-quantize in recent builds).
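
    For the local path, re-quantising comes down to one command per target level. The sketch below builds that command for the levels named above. It assumes the binary is called llama-quantize, as in recent llama.cpp builds, and that the input is the unquantised GGUF produced from the merged checkpoint; the file names and the helper itself are hypothetical.

```python
# Levels mentioned on this page; llama.cpp supports more.
KNOWN_LEVELS = {"Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M", "Q8_0"}


def build_quantize_command(src_gguf, dst_gguf, level="Q5_K_M", binary="llama-quantize"):
    """Build the argument list: <binary> <input.gguf> <output.gguf> <level>."""
    if level not in KNOWN_LEVELS:
        raise ValueError(f"unknown quantization level: {level}")
    return [binary, str(src_gguf), str(dst_gguf), level]
```

    Pass the resulting list to subprocess.run(..., check=True) to execute it. If your build still ships the older binary name, override the binary argument.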

    What's next