
    GGUF Explained: The Open Format That Runs AI Anywhere

    GGUF is the file format that made running AI models on consumer hardware practical. Here's what it is, how it works, and why every AI builder should understand it.

    Edward Yang · Updated

    GGUF (GGML Universal Format) is a single-file format for storing quantized large language model weights, designed specifically for efficient local inference on consumer hardware. It is the standard file format used by llama.cpp, Ollama, and LM Studio — if you are running AI models locally, you are almost certainly using GGUF files.

    According to data from Hugging Face, there are over 50,000 GGUF model files hosted on the platform as of early 2026, reflecting the format's dominance in the local inference ecosystem. GGUF's 4-bit quantization (Q4_K_M) compresses models by approximately 75% compared to full FP16 precision — a 14GB model becomes roughly 4GB — with only a 4-5% quality reduction on most benchmarks. According to benchmarks from llama.cpp, quantized GGUF models on Apple Silicon achieve 40-60 tokens per second for 7B parameter models, making real-time inference practical on consumer laptops.

    This guide explains what GGUF is, why it matters, and what the naming conventions mean so you can make informed decisions when downloading or deploying models.

    What GGUF Is

    GGUF (GGML Universal Format) is a file format for storing large language model weights. It was created by Georgi Gerganov (the "GG" in the name), the developer behind llama.cpp, the C++ library that is the foundation of most local inference tools.

    Before GGUF existed, there was GGML — an earlier format from the same developer. GGUF replaced GGML in August 2023 with a better-structured design that supports metadata, extensibility, and cleaner versioning.

    The key properties of GGUF:

    Single file. The entire model — weights, configuration, tokenizer — is in one file. No separate config.json or tokenizer.json files needed. This is a significant practical improvement over the typical Hugging Face distribution, where a model is spread across multiple files: safetensors weight shards plus separate config and tokenizer files.

    Self-describing. GGUF files include metadata about the model architecture, training parameters, and quantization configuration. A tool reading a GGUF file knows what it contains without external documentation.

    Memory-mapped. GGUF supports memory mapping (mmap), which lets the operating system load only the parts of the file needed for a given inference pass. This enables running models that are larger than available RAM — slowly, but without crashing.

    Quantization-first design. GGUF was built around quantized models — models where weight precision has been reduced from 16-bit or 32-bit floating point to lower-precision integers. This is why GGUF models run on consumer hardware.
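
    To see what "self-describing" means in practice, here is a minimal Python sketch that reads the fixed-size GGUF header laid out in the format specification: the magic bytes, the format version, the tensor count, and the number of metadata key/value pairs. The file name is a placeholder; for a full metadata dump you would normally reach for the gguf Python package that ships with llama.cpp.

```python
# Minimal sketch: read the fixed-size GGUF header (per the GGUF spec).
# Works for GGUF v2/v3 files, which use 64-bit counts; v1 used 32-bit counts.
import struct

def read_gguf_header(path: str) -> dict:
    with open(path, "rb") as f:
        magic = f.read(4)                                 # b"GGUF" for a valid file
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: {magic!r}")
        version, = struct.unpack("<I", f.read(4))         # format version (3 for current files)
        tensor_count, = struct.unpack("<Q", f.read(8))    # number of weight tensors
        metadata_kv_count, = struct.unpack("<Q", f.read(8))  # metadata key/value pairs
    return {"version": version, "tensors": tensor_count, "metadata_keys": metadata_kv_count}

# Placeholder file name: point this at any local GGUF file.
print(read_gguf_header("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
```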

    Quantization: The Core Concept

    A 7B parameter model in full 16-bit (FP16) precision occupies about 14GB of memory. Most consumer GPUs have 8-12GB of VRAM. The math does not work.

    Quantization compresses model weights by reducing precision. Instead of storing each weight as a 16-bit float, you store it as a 4-bit integer. This reduces memory usage by roughly 4x, making 7B models fit comfortably in 8GB VRAM.

    The quality tradeoff is surprisingly small for most tasks. A Q4 quantized model performs comparably to the original F16 model on most benchmarks — block-wise scaling keeps the most important weights close to their original values, and for inference (as opposed to training), the remaining precision loss has only a modest effect on output quality.
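
    The arithmetic behind these sizes is simple enough to sketch. The estimate below multiplies parameter count by bits per weight; the bit widths are approximate effective values (quantized formats also store per-block scale factors), so real GGUF files come out slightly larger than the nominal bit count suggests.

```python
# Rough weight-size estimate at different precisions. Bit widths are approximate
# effective values (block scales included); real files vary slightly.
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"7B at {label}: ~{weight_size_gb(7, bits):.1f} GB")
# F16 ~14.0 GB, Q8_0 ~7.4 GB, Q4_K_M ~4.2 GB, Q2_K ~2.3 GB
```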

    Reading GGUF File Names

    GGUF files follow a naming convention that encodes key information. Understanding it lets you choose the right file for your hardware and quality requirements.

    A typical file name: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

    Breaking it down:

    • Meta-Llama-3.1 — Model family and version (Meta's Llama 3.1)
    • 8B — Number of parameters (8 billion)
    • Instruct — Model variant (instruction-tuned, not raw base)
    • Q4_K_M — Quantization type (this is the important part)
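
    If you manage many model files, a small helper can pull these pieces apart automatically. The parser below is a hypothetical sketch built on the common community naming pattern; the pattern is a convention rather than a guarantee, so some uploads deviate from it.

```python
# Hypothetical parser for the common community naming pattern:
# <family>-<params>B-<variant>-<quant>.gguf. Not every upload follows it.
import re

GGUF_NAME = re.compile(
    r"(?P<family>.+?)-(?P<params>\d+(?:\.\d+)?)B-(?P<variant>.+?)-(?P<quant>(?:F16|F32|Q\d\w*))\.gguf$",
    re.IGNORECASE,
)

def parse_gguf_name(filename: str) -> dict | None:
    match = GGUF_NAME.search(filename)
    return match.groupdict() if match else None

print(parse_gguf_name("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
# {'family': 'Meta-Llama-3.1', 'params': '8', 'variant': 'Instruct', 'quant': 'Q4_K_M'}
```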

    Quantization Types

    The quantization type tells you how aggressively the model has been compressed:

    Quantization | Bits per weight (approx) | 7B model size | Quality vs F16
    F16          | 16                       | ~14 GB        | Baseline (original)
    Q8_0         | 8                        | ~7 GB         | ~99% of F16
    Q6_K         | 6                        | ~5.5 GB       | ~98% of F16
    Q5_K_M       | 5                        | ~4.8 GB       | ~97% of F16
    Q4_K_M       | 4                        | ~4.1 GB       | ~95-96% of F16
    Q4_K_S       | 4 (smaller)              | ~3.8 GB       | ~94% of F16
    Q3_K_M       | 3                        | ~3.1 GB       | ~90-93% of F16
    Q2_K         | 2                        | ~2.5 GB       | ~85% of F16

    The _K variants use k-quants, a more sophisticated quantization method that applies different precision to different parts of the model. _M is "medium" and _S is "small" within the k-quant family.
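
    The underlying mechanism is easy to illustrate: store low-precision integers plus a scale factor for each small block of weights. The toy sketch below shows plain symmetric 4-bit block quantization only; the real Q4_K_M kernels in llama.cpp use a more elaborate layout (super-blocks with their own quantized scales), so treat this as a teaching example rather than the actual algorithm.

```python
# Toy illustration of block-wise 4-bit quantization: per block of 32 weights,
# keep one float scale plus small integers in [-7, 7]. Not the real Q4_K_M layout.
import numpy as np

def quantize_block_q4(block: np.ndarray):
    scale = float(np.abs(block).max()) / 7.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_block_q4(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)     # one 32-weight block
q, scale = quantize_block_q4(block)
restored = dequantize_block_q4(q, scale)
print("max absolute error in this block:", float(np.abs(block - restored).max()))
```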

    Practical recommendation for most use cases: Q4_K_M is the most commonly used GGUF quantization. It offers a good balance of size reduction (~70% vs F16) with minimal quality loss (~4-5%). For client deployments, this is usually the right starting point.

    When to go higher: If you have the VRAM/RAM to accommodate Q5_K_M or Q6_K, these offer noticeably better quality for tasks that are sensitive to precision (code generation, math, complex reasoning). The size increase is worth it if hardware allows.

    When to go lower: If you are deploying on a machine with limited RAM (8GB total system memory or a small GPU), Q3_K_M may be necessary to run a 7B model at all. Expect some quality degradation on complex tasks.

    GGUF vs Other Formats

    Format         | Used by                          | Notes
    GGUF           | llama.cpp, Ollama, LM Studio     | Consumer local inference standard
    Safetensors    | Hugging Face ecosystem, training | Standard for model repositories
    PyTorch (.bin) | Older Hugging Face models        | Being phased out in favor of safetensors
    GGML           | Legacy llama.cpp                 | Superseded by GGUF, no longer supported
    ONNX           | Mobile, edge deployment          | Cross-platform inference, different ecosystem
    TensorRT       | NVIDIA GPU inference             | High-throughput server inference, NVIDIA only

    If you download a model from Hugging Face for training, you get safetensors. When you want to run it locally with Ollama or LM Studio, you need GGUF. The conversion happens automatically when tools like Ertas export a fine-tuned model, or you can convert manually using the llama.cpp conversion scripts.
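
    For the manual path, llama.cpp ships a Python conversion script plus a separate quantization tool, and the two-step flow looks like the sketch below. Script and binary names have changed across llama.cpp versions (older checkouts used convert.py and a quantize binary), so treat the exact names and paths here as assumptions to verify against your checkout.

```python
# Sketch of the manual safetensors -> GGUF path using llama.cpp tooling.
# Paths and tool names are assumptions for a recent llama.cpp checkout; older
# versions used convert.py and a ./quantize binary instead.
import subprocess

HF_MODEL_DIR = "models/my-model"        # directory with safetensors, config, tokenizer
F16_GGUF = "my-model-F16.gguf"
Q4_GGUF = "my-model-Q4_K_M.gguf"

# Step 1: convert the Hugging Face model to an unquantized GGUF file.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", HF_MODEL_DIR,
     "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the F16 GGUF down to Q4_K_M.
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize", F16_GGUF, Q4_GGUF, "Q4_K_M"],
    check=True,
)
```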

    How Fine-Tuned Models Use GGUF

    When you fine-tune a model with LoRA, you produce a LoRA adapter — a small set of additional weights that modify the base model's behavior. To deploy this for inference, you typically merge the adapter back into the base model weights and export the result to GGUF.

    The export process:

    1. Start with the base model weights (safetensors format)
    2. Apply the LoRA adapter to produce a merged full-precision model
    3. Quantize to your target precision (Q4_K_M recommended)
    4. Output as a single GGUF file ready for Ollama/LM Studio
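
    Steps 1 and 2 are usually handled with the transformers and peft libraries before handing the merged model to the llama.cpp tooling shown earlier. A minimal sketch, assuming a Llama-style base model and a local LoRA adapter directory (all paths and names below are placeholders):

```python
# Merge a LoRA adapter into its base model and save a full-precision copy that
# the llama.cpp conversion script can consume. Paths and model names are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"    # base model weights (safetensors)
ADAPTER = "out/my-lora-adapter"              # LoRA adapter from fine-tuning
MERGED_DIR = "models/my-merged-model"        # input directory for GGUF conversion

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
merged = model.merge_and_unload()            # fold the adapter weights into the base

merged.save_pretrained(MERGED_DIR)
AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED_DIR)
```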

    Ertas handles this export pipeline automatically — you get a GGUF file ready to load into Ollama without running conversion scripts manually. The resulting file is your client-specific model, fully self-contained.
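
    Once you have the GGUF file, whether exported automatically or produced by the manual steps above, loading it into Ollama takes a one-line Modelfile and one command. A sketch, assuming the ollama CLI is installed; the model name and file path are placeholders.

```python
# Register a local GGUF file with Ollama, then run a quick smoke test.
# Assumes the ollama CLI is installed; model name and file path are placeholders.
import pathlib
import subprocess

gguf_path = "client-model-Q4_K_M.gguf"
pathlib.Path("Modelfile").write_text(f"FROM ./{gguf_path}\n")

subprocess.run(["ollama", "create", "client-model", "-f", "Modelfile"], check=True)
subprocess.run(["ollama", "run", "client-model", "Say hello."], check=True)
```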

    Practical Hardware Requirements

    How much memory do you need for different GGUF model sizes?

    Model       | Q4_K_M size | Minimum VRAM (GPU) | Minimum RAM (CPU)
    1-3B params | ~1.5-2 GB   | 4 GB               | 8 GB
    7B params   | ~4.1 GB     | 6 GB               | 8 GB
    13B params  | ~7.5 GB     | 8 GB               | 16 GB
    30B params  | ~17 GB      | 20 GB              | 32 GB
    70B params  | ~40 GB      | 48 GB              | 64 GB

    For most agency deployments, 7B models running on a machine with 8-16GB RAM (or a GPU with 6-8GB VRAM) cover the majority of use cases. The Mac Mini M4 (16GB unified memory) runs 7B models at 40+ tokens/second — comfortably fast for production API use.

    Why GGUF Matters for AI Builders

    The existence of GGUF and tools that use it (llama.cpp, Ollama, LM Studio) is what made consumer-hardware AI inference viable. Before this ecosystem, running a 7B model locally required a datacenter GPU or a very awkward Python environment.

    For AI agencies, GGUF is the format that enables the entire local inference value proposition:

    • Models run on client hardware without cloud API dependencies
    • Fine-tuned models are portable, single-file assets that can be deployed anywhere
    • Inference costs drop to near-zero after hardware purchase

    Understanding GGUF naming conventions lets you confidently select models for client deployments and communicate the tradeoffs clearly. A Q4_K_M build of a 7B model is a different artifact from the F16 build of the same model — knowing the difference means choosing the right file for the hardware constraint.


    Frequently Asked Questions

    What is GGUF format?

    GGUF (GGML Universal Format) is a file format for storing large language model weights, created by Georgi Gerganov, the developer behind llama.cpp. It packages model weights, configuration, and tokenizer data into a single self-describing file optimized for local inference. GGUF is the standard format used by Ollama, LM Studio, and llama.cpp for running AI models on consumer hardware.

    What's the difference between GGUF and GGML?

    GGUF is the successor to GGML, replacing it in August 2023. Both were created by the same developer, but GGUF adds structured metadata, better extensibility, and cleaner versioning. GGML is no longer supported by modern versions of llama.cpp. If you encounter a GGML model file, it needs to be converted to GGUF before it can be used with current tools.

    Which quantization level should I use?

    For most use cases, Q4_K_M is the recommended starting point. It reduces model size by approximately 75% compared to full FP16 precision with only 4-5% quality loss. If you have extra VRAM or RAM, Q5_K_M or Q6_K offer noticeably better quality for precision-sensitive tasks like code generation and math. If you are constrained to 8GB of system memory, Q3_K_M may be necessary for 7B models but expect some degradation on complex tasks.

    Can I convert any model to GGUF?

    Most popular open-weight language models can be converted to GGUF using the conversion scripts included with llama.cpp. Models in safetensors or PyTorch format from Hugging Face are the typical starting point. However, the model architecture must be supported by llama.cpp — most major architectures (Llama, Mistral, Phi, Qwen, Gemma) are supported, but very new or unusual architectures may not be available immediately. Tools like Ertas Studio handle the conversion automatically when exporting fine-tuned models.

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
