GGUF Format Guide
The de facto standard file format for quantized local LLM inference
GGUF (GPT-Generated Unified Format) is a binary file format designed for storing quantized large language models for efficient local inference. Developed by Georgi Gerganov as part of the llama.cpp project, GGUF succeeded the older GGML format in August 2023, adding a self-describing metadata system that embeds model architecture details, tokenizer configuration, and quantization parameters directly in the file. This makes GGUF files portable and self-contained — everything needed to load and run a model is in a single file.
The GGUF format consists of four sections: a header containing the magic number, format version, and tensor and metadata counts; a metadata key-value store with model configuration (architecture type, context length, vocabulary size, embedding dimensions, attention head count, and tokenizer data); a tensor-info table recording each tensor's name, shape, type, and data offset; and the tensor data section containing the actual model weights in their quantized representation. The metadata system uses typed key-value pairs supporting integers, floats, booleans, strings, and arrays, enabling rich model descriptions without external configuration files.
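To make the layout concrete, here is a minimal sketch that writes a toy GGUF file with the gguf Python package (pip install gguf); the key values and the single placeholder tensor are illustrative, not a loadable model:

# Write a toy GGUF file to show the header / metadata / tensor layout.
# The values below are placeholders; real conversions should go through
# convert_hf_to_gguf.py, which emits the full metadata and tensor set.
import numpy as np
from gguf import GGUFWriter

writer = GGUFWriter("tiny.gguf", arch="llama")
writer.add_architecture()                       # general.architecture (string)
writer.add_context_length(2048)                 # llama.context_length (uint32)
writer.add_embedding_length(64)                 # llama.embedding_length (uint32)
writer.add_string("general.name", "tiny-demo")  # typed string KV pair

# One F32 tensor; quantized types would be stored here the same way.
writer.add_tensor("token_embd.weight", np.zeros((64, 16), dtype=np.float32))

writer.write_header_to_file()    # header: magic, version, counts
writer.write_kv_data_to_file()   # metadata key-value section
writer.write_tensors_to_file()   # tensor info + aligned tensor data
writer.close()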
GGUF supports numerous quantization types ranging from full-precision F32 and F16 to aggressive quantizations like Q2_K and IQ1_S. The most commonly used quantization levels are Q4_K_M (offering a good balance of quality and size), Q5_K_M (slightly larger but higher quality), and Q8_0 (near-lossless). K-quant variants (Q4_K_S, Q4_K_M, Q5_K_S, etc.) use a mixed-precision approach in which different tensors are quantized to different types based on their sensitivity, producing better quality than uniform quantization at similar file sizes.
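The full set of per-tensor types is exposed by the gguf Python package's GGMLQuantizationType enum; a quick sketch for listing them (the exact names track whichever gguf release you have installed):

# List every per-tensor quantization type known to the gguf package.
from gguf import GGMLQuantizationType

for qtype in GGMLQuantizationType:
    print(qtype.value, qtype.name)  # e.g. 8 Q8_0, 12 Q4_K, ...

Note that labels like Q4_K_M and Q4_K_S are file-level recipes rather than tensor types: llama-quantize builds them by mixing per-tensor types (mostly Q4_K, with Q6_K for the most sensitive tensors), which is where the K-quant quality gains come from.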
When to Use GGUF
GGUF is the format of choice whenever you need to run LLM inference locally — on personal computers, edge devices, or on-premise servers without GPU-dependent inference frameworks. It is the native format for llama.cpp and is supported by a wide ecosystem of local inference tools including Ollama, LM Studio, GPT4All, koboldcpp, and text-generation-webui. If your goal is to deploy a fine-tuned model for local use without cloud dependencies, GGUF should be your target export format.
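As one example of ecosystem support, a GGUF file can be imported into Ollama with a Modelfile whose FROM line points at the file (ollama create my-model -f Modelfile) and then queried from the official ollama Python client; a minimal sketch, where my-model is a placeholder name:

# Query a GGUF-backed model served locally by Ollama (pip install ollama).
import ollama

response = ollama.chat(
    model="my-model",  # placeholder: whatever name you gave ollama create
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
)
print(response["message"]["content"])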
Choose GGUF when you need efficient CPU inference or when your deployment target has limited GPU memory. GGUF's quantization options allow you to trade model quality for smaller file sizes and faster inference, enabling large models to run on consumer hardware. A 7B parameter model that requires 14 GB in FP16 can be reduced to approximately 4 GB with Q4_K_M quantization while retaining most of its capability. This makes GGUF essential for privacy-sensitive deployments where data cannot leave the local machine.
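The arithmetic behind that estimate is simple: file size is roughly parameter count times bits per weight, plus a small amount of metadata. A sketch using approximate average bits-per-weight figures (assumed values; exact sizes vary by architecture and quantization mix):

# Rough GGUF size estimate: params * bits_per_weight / 8 bytes.
# Bits-per-weight values are approximate averages, not exact figures.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85}

params = 7e9  # 7B-parameter model
for name, bpw in BITS_PER_WEIGHT.items():
    gb = params * bpw / 8 / 1e9
    print(f"{name:7s} ~{gb:4.1f} GB")  # F16 -> 14.0 GB, Q4_K_M -> 4.2 GB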
GGUF is less suitable when you need maximum inference throughput on GPU clusters (use SafeTensors with vLLM or TensorRT-LLM instead) or when you need to continue training a model (GGUF is an inference-only format — use SafeTensors or PyTorch checkpoints for training). It is also not the right choice for non-transformer architectures that llama.cpp does not support.
Schema / Structure
GGUF File Structure:
┌─────────────────────────────────────┐
│ Header │
│ - Magic number: 0x46554747 "GGUF" │
│ - Version: uint32 (currently 3) │
│ - Tensor count: uint64 │
│ - Metadata KV count: uint64 │
├─────────────────────────────────────┤
│ Metadata Key-Value Pairs │
│ - general.architecture: string │
│ - general.name: string │
│ - llama.context_length: uint32 │
│ - llama.embedding_length: uint32 │
│ - llama.block_count: uint32 │
│ - llama.attention.head_count: u32 │
│ - tokenizer.ggml.model: string │
│ - tokenizer.ggml.tokens: [string] │
│ - ... (additional metadata) │
├─────────────────────────────────────┤
│ Tensor Info (for each tensor) │
│ - Name: string │
│ - N dimensions: uint32 │
│ - Dimensions: uint64[] │
│ - Type: enum (F32/F16/Q4_K/...) │
│ - Offset: uint64 │
├─────────────────────────────────────┤
│ Tensor Data (aligned, contiguous) │
│ - Raw quantized weight data │
└─────────────────────────────────────┘
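The fixed-size header fields above can be read directly with Python's struct module; a minimal sketch that validates the magic and prints the counts (assumes a GGUF v3 file named model.gguf; all fields are little-endian):

# Read the fixed-size GGUF header: magic, version, tensor and KV counts.
import struct

with open("model.gguf", "rb") as f:
    magic, version = struct.unpack("<4sI", f.read(8))
    assert magic == b"GGUF", "not a GGUF file"
    n_tensors, n_kv = struct.unpack("<QQ", f.read(16))

print(f"version={version} tensors={n_tensors} metadata_kv={n_kv}")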
Example Data
# Convert a HuggingFace model to GGUF using llama.cpp
python convert_hf_to_gguf.py ./my-fine-tuned-model \
--outfile model-f16.gguf \
--outtype f16
# Quantize to Q4_K_M for efficient local inference
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Run inference with llama.cpp
./llama-cli -m model-q4_k_m.gguf \
-p "Explain quantum computing in simple terms:" \
-n 256 --temp 0.7
# Inspect GGUF metadata
python -c "
from gguf import GGUFReader
reader = GGUFReader('model-q4_k_m.gguf')
for field in reader.fields.values():
    print(f'{field.name}: {field.parts[-1].tolist()}')"
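For programmatic inference from Python, the llama-cpp-python bindings (pip install llama-cpp-python) load GGUF files directly; a minimal sketch mirroring the llama-cli invocation above:

# Load the quantized model and run a completion with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="model-q4_k_m.gguf", n_ctx=2048)
out = llm("Explain quantum computing in simple terms:",
          max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])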
Ertas Support
GGUF is a first-class export format in Ertas Studio. After training or fine-tuning a model through the Ertas cloud training pipeline, you can export directly to GGUF with your choice of quantization level. The export process handles the conversion from training checkpoint format to GGUF automatically, including embedding the tokenizer configuration and model metadata. This produces a single, self-contained file ready for local inference.
The GGUF export capability is central to Ertas's privacy-first architecture. By exporting to GGUF, your fine-tuned model runs entirely on local hardware with no cloud inference calls, no API dependencies, and no data leaving your environment. This makes GGUF export essential for compliance-sensitive deployments where data sovereignty, HIPAA, GDPR, or air-gapped operation requirements prohibit cloud-based inference.