llama.cpp versions

    Which runtimes load Ertas-exported GGUFs, the binary renames that broke older scripts, and the version floors to publish in your app's docs.

    Ertas exports GGUF files using the current llama.cpp at the time of export. The GGUF format itself has been stable since v3 (2023), and Q4_K_M has been supported for nearly as long, so most runtime issues come from upstream binary renames and gaps in architecture support rather than from format incompatibility.
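    To confirm which GGUF format version a file carries, the header can be read directly. The sketch below assumes the header layout from the GGUF specification (a 4-byte `GGUF` magic followed by a little-endian uint32 version); the function names are illustrative and not part of any Ertas or llama.cpp tooling.

```python
import struct

GGUF_MAGIC = b"GGUF"


def parse_gguf_version(header: bytes) -> int:
    """Return the format version from the first 8 bytes of a GGUF file."""
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF header")
    # Assumption: the version is a little-endian uint32 right after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version


def read_gguf_version(path: str) -> int:
    """Read only the first 8 bytes of the file at `path` and parse them."""
    with open(path, "rb") as f:
        return parse_gguf_version(f.read(8))
```

    A return value of 3 matches the format version this page describes as stable.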

    This page lists the version floors you can publish in your app's deployment docs, along with the upstream changes that have broken third-party scripts.

    The version floors

    For each runtime, the version below is the floor that works with Ertas's current Q4_K_M GGUF exports across every catalogue model. Newer versions are always fine; older versions may work for some architectures and not others.

    Runtime | Version floor | Notes
    llama.cpp (binary) | b3479 (late July 2024) | Older builds use the legacy main / quantize binary names. b3479 is the earliest build with the Llama 3.1 rope-scaling fix; anything earlier may load but produce subtly wrong output on 3.1-class models.
    Ollama | 0.3.0 (July 25, 2024) | Adds Llama 3.1 and tool-calling support; the practical floor for current catalogue models.
    LM Studio | Recent (within the last six months) | Check the in-app changelog for the floor that matches your catalogue picks; the project does not publish a long-tail blog history.
    node-llama-cpp | v3.0.0 (September 24, 2024) | Ships llama.cpp b3808; earlier v2.x worked for older families.
    wllama | v3.1.0 (May 2025) | Required for WebGPU acceleration; earlier v3.x ran on WebAssembly only.
    llamadart | v0.6.x series (March 2026 and later) | Earlier versions predate the newest architecture handlers; the v0.6.0 tag does not exist, the series begins at v0.6.1.
    llama.rn | v0.10 (January 2026) | Required for React Native New Architecture; v0.9 retained the old bridge.

    Run a quick llama-cli --version (or the runtime equivalent) when in doubt. Anyone shipping a real app should pin to a known-working version of their chosen runtime rather than tracking head; runtimes update faster than apps can re-test.
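    The version check can be scripted. This is a minimal sketch that assumes `llama-cli --version` prints a line of the form `version: 3479 (<commit>)`; confirm that format against your own build before relying on it. The helper names and the `LLAMA_CPP_FLOOR` constant are illustrative.

```python
import re
import subprocess

LLAMA_CPP_FLOOR = 3479  # b3479, the llama.cpp floor listed on this page


def parse_build_number(version_output: str):
    """Extract the integer build number, or None if the text does not match."""
    match = re.search(r"version:\s*(\d+)", version_output)
    return int(match.group(1)) if match else None


def meets_floor(version_output: str, floor: int = LLAMA_CPP_FLOOR) -> bool:
    """True when the reported build is at or above the floor."""
    build = parse_build_number(version_output)
    return build is not None and build >= floor


def installed_version_output(binary: str = "llama-cli") -> str:
    """Run `<binary> --version` and return stdout and stderr combined."""
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=False
    )
    # The version line may be written to either stream, so keep both.
    return result.stdout + result.stderr
```

    Calling `meets_floor(installed_version_output())` at app start-up or in CI gives a single boolean to gate on.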

    The 2024 binary rename

    On June 12, 2024 (PR #7809), llama.cpp renamed its primary binaries. Scripts written before this date no longer work against new builds without changes.

    Old name | New name | Used for
    main | llama-cli | Interactive chat and one-shot prompts
    quantize | llama-quantize | Local quantisation to a different level
    server | llama-server | HTTP inference server

    The rename of convert.py to convert_hf_to_gguf.py was a separate change in the conversion-tooling tree; it did not share the June 12 cutover but is part of the same modernisation effort.

    Ertas's exported Modelfile and install scripts use Ollama, which is unaffected. If you are running llama.cpp directly (for verifying exports or for local quantisation in Quantization), use the new names.
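    For scripts that still invoke the old names, the rename can be applied mechanically. The sketch below covers only the three binaries listed in this section; `modernise_command` is a hypothetical helper, and it assumes POSIX-style paths with `/` separators.

```python
RENAMED_BINARIES = {
    "main": "llama-cli",
    "quantize": "llama-quantize",
    "server": "llama-server",
}


def modernise_command(argv):
    """Return a copy of argv with a legacy binary name replaced by its new name.

    Only the program name (argv[0]) is rewritten; arguments are left untouched.
    """
    if not argv:
        return []
    # Split off the directory part so "./main" and "/opt/llama.cpp/main" both work.
    head, sep, tail = argv[0].rpartition("/")
    program = head + sep + RENAMED_BINARIES.get(tail, tail)
    return [program] + list(argv[1:])
```

    For example, `modernise_command(["./main", "-m", "model.gguf"])` yields `["./llama-cli", "-m", "model.gguf"]`, while a command that already uses a new name passes through unchanged.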

    The repo move to ggml-org

    llama.cpp moved from github.com/ggerganov/llama.cpp to github.com/ggml-org/llama.cpp on February 15, 2025 (discussion #11801). The old URL still redirects, but:

    • Any pinned git remote to the old organisation should be updated for clean fetches.
    • Build instructions that referenced ggerganov/llama.cpp in CMake or submodule configurations should be updated to the new org.
    • The Android example and SwiftUI example referenced throughout the Ship section live in the new org.
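    Pinned remotes and submodule URLs can be rewritten with a small string substitution. This sketch handles the HTTPS and SSH forms of a GitHub URL; `update_remote_url` is an illustrative name, not an existing tool.

```python
import re

# Matches both "github.com/ggerganov/llama.cpp" and "github.com:ggerganov/llama.cpp".
_OLD_REPO = re.compile(r"(github\.com[:/])ggerganov/llama\.cpp")


def update_remote_url(url: str) -> str:
    """Point a llama.cpp remote at the ggml-org organisation; other URLs pass through."""
    return _OLD_REPO.sub(r"\1ggml-org/llama.cpp", url)
```

    The result can then be applied with `git remote set-url origin <new-url>` or written into `.gitmodules`.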

    Architecture support timeline

    A few catalogue models rely on architecture handlers that landed in specific upstream versions. If you are targeting older runtimes, this is the table to consult.

    Architecture | Landed in llama.cpp | Notes
    Llama 3.1 | ~b3479 (July 27, 2024) | Rope-scaling fix; earlier builds load but use wrong scaling on long contexts
    Llama 3.2 | ~b3825 (September 25, 2024) | Adds 1B / 3B architecture handlers (vision variants later); reuses Llama 3.1's tokenizer
    Mistral 7B Instruct v0.3 | Late May / early June 2024 | The v0.3 variant landed shortly after the May 22, 2024 model release
    Phi-3 mini | b2729 (April 25, 2024) | Earlier builds use the wrong rope scaling
    Gemma 3 | ~b4860 (March 12, 2025) | Architecture support merged in PR #12343; the Gemma 3 model itself was released March 2025
    Gemma 4 E2B / E4B | Recent (August 2025 builds and later) | Per-Layer Embedding handler support remains incomplete in upstream as of mid-2026 (issue #22243 is still open). Models load but per-layer signal is not injected during the forward pass on stock upstream. Ertas's exported bundles may rely on a patched fork; confirm runtime-side PLE support before pinning a Gemma 4 deployment to a non-Ertas runtime.
    Qwen 2.5 | September 2024 builds (post-release on September 19, 2024) | Earlier builds reject the architecture
    Qwen 2.5 Coder | November 2024 builds (post the November 11 full-lineup release) | Same architecture as Qwen 2.5 with FIM tokens in the vocabulary

    If a runtime version is too old to load a specific architecture, the failure mode is usually a model-load error, not a silent miscomputation. A model that loads but produces garbage is more often a chat template mismatch than a version issue.
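    The timeline can be turned into a pre-flight lookup. The sketch below includes only the rows that give a build number, and those numbers are approximate (the table marks most of them with "~"). Architectures listed by date only are reported as unknown instead of being guessed. The names are illustrative.

```python
# Approximate build floors copied from the architecture timeline on this page.
ARCH_BUILD_FLOORS = {
    "Llama 3.1": 3479,
    "Llama 3.2": 3825,
    "Phi-3 mini": 2729,
    "Gemma 3": 4860,
}


def check_architectures(build: int, wanted):
    """Split `wanted` into (too_old, unknown) for a given llama.cpp build number.

    too_old: architectures whose listed floor is above `build`.
    unknown: architectures with no numeric floor in ARCH_BUILD_FLOORS.
    """
    too_old, unknown = [], []
    for arch in wanted:
        floor = ARCH_BUILD_FLOORS.get(arch)
        if floor is None:
            unknown.append(arch)
        elif build < floor:
            too_old.append(arch)
    return too_old, unknown
```

    An empty pair of lists means every requested architecture has a listed floor at or below the build being checked.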

    Ollama upgrade path

    Ollama users are the largest single segment, and the common question is: "Do I have to upgrade Ollama every time Ertas releases a new model?" Practical guidance:

    • For catalogue models that existed when your Ollama version shipped, no upgrade needed.
    • For newly added catalogue models, upgrade to the latest Ollama. Ertas adds new catalogue models alongside upstream support; if Ertas is offering a model, Ollama's latest version supports it.
    • For Hugging Face URL imports, the floor is whatever Ollama version supports the architecture upstream. The Ship: desktop page documents the version floors for the most common families.

    Ertas's install.sh script, which ships in every GGUF bundle, checks for Ollama on the user's machine before running ollama create, but does not enforce a version floor. If a user's Ollama is too old, the create step fails with an architecture-not-recognised error, and the fix is to upgrade Ollama.
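    An app that wants a clearer failure than the architecture error can check the floor itself. This sketch assumes `ollama --version` prints a dotted version such as `ollama version is 0.3.0`; it is not part of install.sh, and the helper names are hypothetical.

```python
import re

OLLAMA_FLOOR = (0, 3, 0)  # the Ollama floor listed on this page


def parse_ollama_version(output: str):
    """Return (major, minor, patch) from the first dotted version found, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    return tuple(int(part) for part in match.groups()) if match else None


def ollama_meets_floor(output: str, floor=OLLAMA_FLOOR) -> bool:
    """Compare versions as integer tuples so 0.12.1 sorts above 0.3.0."""
    version = parse_ollama_version(output)
    return version is not None and version >= floor
```

    Comparing integer tuples avoids the string-comparison trap where "0.12.1" would sort below "0.3.0".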

    Verifying compatibility before you ship

    Three quick checks before publishing an app version that depends on a specific runtime version:

    1. Build the runtime at your pinned version, load your fine-tuned GGUF, and run a five-prompt probe set from your eval bench. The Verifying exports page describes this loop.
    2. Diff the runtime's changelog between your pinned version and head. If a tokenizer fix or architecture handler change appears between them, the older version may produce subtly worse output even if it loads.
    3. Publish the version floor in your app's documentation so users on customised builds know what to expect.
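    The probe run in step 1 can be kept runtime-agnostic. In the sketch below, `generate` is any callable you supply that sends a prompt to your pinned runtime and returns its text output. The substring check is a deliberately crude pass criterion and stands in for whatever scoring your eval bench uses.

```python
def run_probe_set(generate, probes):
    """Run each (prompt, expected_substring) pair and collect the failures.

    generate: callable taking a prompt string and returning the model's text.
    probes: iterable of (prompt, expected_substring) tuples.
    Returns a list of (prompt, output) pairs whose output lacked the expected text.
    """
    failures = []
    for prompt, expected in probes:
        output = generate(prompt)
        # Case-insensitive containment; swap in your own scoring as needed.
        if expected.lower() not in output.lower():
            failures.append((prompt, output))
    return failures
```

    An empty list from `run_probe_set` at the pinned version, using the same probes that pass at head, is a reasonable signal that the floor is safe to publish.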

    What's next