llama.cpp versions

Which runtimes load Ertas-exported GGUFs, the binary renames that broke older scripts, and the version floors to publish in your app's docs.

Ertas exports GGUF files using the llama.cpp version that is current at the time of export. The GGUF format itself has been stable since v3 (2023), and Q4_K_M has been supported for nearly as long, so most runtime issues come from upstream binary renames and architecture support rather than from format incompatibility.

This page lists the version floors you can publish in your app's deployment docs, and the upstream changes that have broken third-party scripts.

The version floors

For each runtime, the version below is the floor that works with Ertas's current Q4_K_M GGUF exports across every catalogue model. Newer versions are always fine; older versions may work for some architectures and not others.

Runtime	Version floor	Notes
llama.cpp (binary)	b3479 (late July 2024)	Older builds use the legacy `main` / `quantize` binary names. b3479 is the earliest build with the Llama 3.1 rope-scaling fix; anything earlier may load but produce subtly wrong output on 3.1-class models.
Ollama	0.3.0 (July 25, 2024)	Adds Llama 3.1 and tool-calling support; the practical floor for current catalogue models.
LM Studio	Recent (within the last six months)	Check the in-app changelog for the floor that matches your catalogue picks; the project does not publish a long-tail blog history.
node-llama-cpp	v3.0.0 (September 24, 2024)	Ships llama.cpp b3808; earlier v2.x worked for older families.
wllama	v3.1.0 (May 2025)	Required for WebGPU acceleration; earlier v3.x ran on WebAssembly only.
llamadart	v0.6.x series (March 2026 and later)	Earlier versions predate the newest architecture handlers. There is no v0.6.0 tag; the series begins at v0.6.1.
llama.rn	v0.10 (January 2026)	Required for React Native New Architecture; v0.9 retained the old bridge.

Run a quick llama-cli --version (or the runtime equivalent) when in doubt. Anyone shipping a real app should pin to a known-working version of their chosen runtime rather than tracking head; runtimes update faster than apps can re-test.
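The version check above can be scripted. The following is a minimal sketch that compares a build number against the b3479 floor from the runtime table. It assumes the binary prints a line of the form `version: 3479 (abc1234)`; that output format is an assumption here, so adjust the pattern if your build prints something different. The function names are illustrative, not part of any Ertas or llama.cpp API.

```python
import re
import subprocess
from typing import Optional

LLAMA_CPP_FLOOR = 3479  # b3479, the llama.cpp floor from the runtime table


def parse_build_number(version_output: str) -> Optional[int]:
    """Extract the build number from text such as 'version: 3479 (abc1234)'.

    The 'version: <number>' shape is an assumption about the CLI output.
    """
    match = re.search(r"version:\s*b?(\d+)", version_output)
    return int(match.group(1)) if match else None


def meets_floor(version_output: str, floor: int = LLAMA_CPP_FLOOR) -> bool:
    """True only if a build number was found and it is at or above the floor."""
    build = parse_build_number(version_output)
    return build is not None and build >= floor


def installed_version_output(binary: str = "llama-cli") -> str:
    """Run '<binary> --version' and return everything it printed."""
    # Capture both streams, since the version line may go to stderr.
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr
```

Calling `meets_floor(installed_version_output())` in a deployment check gives a yes/no answer; an unparseable version string is treated as a failure rather than a pass.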

The 2024 binary rename

On June 12, 2024 (PR #7809), llama.cpp renamed its primary binaries. Scripts written before this date no longer work against new builds without changes.

Old name	New name	Used for
`main`	`llama-cli`	Interactive chat and one-shot prompts
`quantize`	`llama-quantize`	Local quantisation to a different level
`server`	`llama-server`	HTTP inference server

The convert.py → convert_hf_to_gguf.py rename was a separate change in the conversion-tooling tree; it did not share the June 12 cutover but belongs to the same modernisation effort.

Ertas's exported Modelfile and install scripts use Ollama, which is unaffected. If you are running llama.cpp directly (to verify exports, or for the local quantisation described on the Quantization page), use the new names.
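Scripts that have to run against both old and new llama.cpp builds can look for the new binary name first and fall back to the legacy one. The sketch below assumes the binaries sit directly in a build directory that you pass in; `resolve_binary` and the directory layout are hypothetical, and only the name pairs come from the rename table above.

```python
from pathlib import Path
from typing import Optional

# New name -> legacy name, taken from the rename table.
RENAMED_BINARIES = {
    "llama-cli": "main",
    "llama-quantize": "quantize",
    "llama-server": "server",
}


def resolve_binary(build_dir: Path, new_name: str) -> Optional[Path]:
    """Return the path to new_name in build_dir, or its legacy equivalent.

    The search is limited to build_dir on purpose: legacy names such as
    'main' and 'server' are too generic to look up safely on PATH.
    """
    candidates = [new_name]
    legacy = RENAMED_BINARIES.get(new_name)
    if legacy:
        candidates.append(legacy)
    for name in candidates:
        path = build_dir / name
        if path.is_file():
            return path
    return None
```

If the function returns a legacy path, the build predates the June 2024 rename and is therefore also below the b3479 floor.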

The repo move to ggml-org

llama.cpp moved from github.com/ggerganov/llama.cpp to github.com/ggml-org/llama.cpp on February 15, 2025 (discussion #11801). The old URL still redirects, but:

Any pinned git remote to the old organisation should be updated for clean fetches.
Build instructions that referenced ggerganov/llama.cpp in CMake or submodule configurations should be updated to the new org.
The Android example and SwiftUI example referenced throughout the Ship section live in the new org.
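Updating pinned remotes is a mechanical rewrite of the organisation segment in the URL. The sketch below handles the HTTPS and SSH forms of a GitHub remote; `rewrite_remote` is a hypothetical helper, and it deliberately leaves any URL it does not recognise untouched.

```python
OLD_ORG = "ggerganov"
NEW_ORG = "ggml-org"
REPO = "llama.cpp"


def rewrite_remote(url: str) -> str:
    """Point a llama.cpp remote at the new organisation; leave other URLs alone."""
    for prefix in ("https://github.com/", "git@github.com:"):
        old = f"{prefix}{OLD_ORG}/{REPO}"
        if url.startswith(old):
            suffix = url[len(old):]
            # Only rewrite exact repo URLs, not forks named e.g. 'llama.cpp-foo'.
            if suffix in ("", ".git", "/"):
                return f"{prefix}{NEW_ORG}/{REPO}{suffix}"
    return url
```

The rewritten value can then be applied with `git remote set-url origin <new-url>` in each checkout or submodule.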

Architecture support timeline

A few catalogue models rely on architecture handlers that landed in specific upstream versions. If you are targeting older runtimes, this is the table to consult.

Architecture	Landed in llama.cpp	Notes
Llama 3.1	~b3479 (July 27, 2024)	Rope-scaling fix; earlier builds load but use wrong scaling on long contexts
Llama 3.2	~b3825 (September 25, 2024)	Adds 1B / 3B architecture handlers (vision variants later); reuses Llama 3.1's tokenizer
Mistral 7B Instruct v0.3	Late May / early June 2024	The v0.3 variant landed shortly after the May 22, 2024 model release
Phi-3 mini	b2729 (April 25, 2024)	Earlier builds use the wrong rope scaling
Gemma 3	~b4860 (March 12, 2025)	Architecture support merged in PR #12343; the Gemma 3 model itself was released March 2025
Gemma 4 E2B / E4B	Recent (August 2025 builds and later)	Per-Layer Embedding handler support remains incomplete in upstream as of mid-2026 (issue #22243 is still open). Models load but per-layer signal is not injected during the forward pass on stock upstream. Ertas's exported bundles may rely on a patched fork; confirm runtime-side PLE support before pinning a Gemma 4 deployment to a non-Ertas runtime.
Qwen 2.5	September 2024 builds (post-release on September 19, 2024)	Earlier builds reject the architecture
Qwen 2.5 Coder	November 2024 builds (post the November 11 full-lineup release)	Same architecture as Qwen 2.5 with FIM tokens in the vocabulary

If a runtime version is too old to load a specific architecture, the failure mode is usually a model-load error, not a silent miscomputation. A model that loads but produces garbage is more often a chat template mismatch than a version issue.
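For the rows in the table above that name a build number, the lookup can be expressed as a small map. This is a sketch only: the dictionary keys are hypothetical labels, the build numbers marked with "~" in the table are approximate, and architectures that the table dates by month rather than by build are reported as unknown instead of guessed.

```python
from typing import Dict, Iterable

# Build floors copied from the architecture table; "~" entries are approximate.
ARCH_BUILD_FLOORS: Dict[str, int] = {
    "llama-3.1": 3479,
    "llama-3.2": 3825,
    "phi-3-mini": 2729,
    "gemma-3": 4860,
}


def check_architectures(build: int, wanted: Iterable[str]) -> Dict[str, str]:
    """Classify each wanted architecture as 'ok', 'too_old', or 'unknown'."""
    status: Dict[str, str] = {}
    for arch in wanted:
        floor = ARCH_BUILD_FLOORS.get(arch)
        if floor is None:
            # No build number in the table: do not guess one.
            status[arch] = "unknown"
        elif build >= floor:
            status[arch] = "ok"
        else:
            status[arch] = "too_old"
    return status
```

An "ok" result only says the handler should be present; it does not rule out a chat template mismatch, which still needs the probe described under Verifying exports.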

Ollama upgrade path

Ollama users are the largest single segment, and the common question is: "do I have to upgrade Ollama every time Ertas releases a new model?" Practical guidance:

For catalogue models that existed when your Ollama version shipped, no upgrade is needed.
For newly added catalogue models, upgrade to the latest Ollama. Ertas adds new catalogue models alongside upstream support; if Ertas is offering a model, Ollama's latest version supports it.
For Hugging Face URL imports, the floor is whatever Ollama version supports the architecture upstream. The Ship: desktop page documents the version floors for the most common families.

The install.sh script that Ertas ships in every GGUF bundle checks for Ollama on the user's machine before running ollama create, but it does not enforce a version floor. If a user's Ollama is too old, the create step fails with an architecture-not-recognised error, which is the user's cue to upgrade.
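An app that wants to fail earlier than the create step could add its own floor check. The sketch below is not part of install.sh; it assumes `ollama --version` prints a string containing a dotted version such as `ollama version is 0.3.0`, and it compares that against the 0.3.0 floor from the runtime table. The function names are hypothetical.

```python
import re
from typing import Optional, Tuple

OLLAMA_FLOOR = (0, 3, 0)  # Ollama floor from the runtime table


def parse_ollama_version(output: str) -> Optional[Tuple[int, ...]]:
    """Pull the first major.minor.patch triple out of the version output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if not match:
        return None
    return tuple(int(part) for part in match.groups())


def ollama_meets_floor(output: str, floor: Tuple[int, ...] = OLLAMA_FLOOR) -> bool:
    """Compare numerically, so 0.10.x correctly sorts above 0.3.x."""
    version = parse_ollama_version(output)
    return version is not None and version >= floor
```

Comparing integer tuples rather than raw strings matters here: a string comparison would rank "0.10.2" below "0.3.0".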

Verifying compatibility before you ship

Three quick checks before publishing an app version that depends on a specific runtime version:

Build the runtime at your pinned version, load your fine-tuned GGUF, and run a five-prompt probe set from your eval bench. The Verifying exports page describes this loop.
Diff the runtime's changelog between your pinned version and head. If a tokenizer fix or architecture handler change appears between them, the older version may produce subtly worse output even if it loads.
Publish the version floor in your app's documentation so users on customised builds know what to expect.
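The first check can be kept runtime-agnostic by passing the generation call in as a function. The sketch below is one possible shape for a small probe loop, not the loop from the Verifying exports page: `generate` stands in for whatever call your pinned runtime exposes, and the pass criterion (a non-empty reply containing an expected substring) is an assumption you would replace with your own eval-bench scoring.

```python
from typing import Callable, List, Tuple

# (prompt, substring expected somewhere in the reply)
Probe = Tuple[str, str]


def run_probe_set(generate: Callable[[str], str], probes: List[Probe]) -> List[str]:
    """Return the prompts whose reply is empty or lacks the expected substring."""
    failures: List[str] = []
    for prompt, expected in probes:
        reply = generate(prompt)
        # Case-insensitive match keeps the probe tolerant of formatting drift.
        if not reply.strip() or expected.lower() not in reply.lower():
            failures.append(prompt)
    return failures
```

Running the same probe list against the pinned version and against head gives a quick signal for the second check as well: a prompt that passes on head but fails on the pinned build points at a fix that landed between them.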

What's next

Verifying exports

Smoke-test the GGUF against your target runtime version.

GGUF overview

The bundle Ertas ships and how Ollama loads it.

Quantization

The local re-quantisation path uses the new llama-quantize binary.

Ship

Runtime version pins for each Ship target.