Desktop

    Ship your fine-tune to macOS, Windows, and Linux desktops via Ollama, Electron, Tauri, or llama.cpp directly.

    Desktop is the easiest target for an Ertas-trained GGUF. The exported bundle is already Ollama-ready, desktop RAM headroom is generous, and most consumer hardware released since 2020 runs Q4_K_M models comfortably on CPU alone. GPU acceleration via Metal (macOS), CUDA (Windows / Linux NVIDIA), Vulkan, or ROCm typically cuts latency by a further factor of two or more.

    This page covers four integration paths, in order of friction: Ollama (the default), llama.cpp directly, Electron, and Tauri. For cross-platform model delivery and first-run UX, see Model delivery and UX. For background on the GGUF bundle the user downloads from Ertas, see GGUF overview.

    Path 1: Ollama (the default)

    If you do not have a strong reason to do otherwise, ship via Ollama. The Ertas GGUF bundle is built for this path: the install.bat / install.sh scripts register the model with Ollama, and the bundled Modelfile carries the right chat template, stop tokens, and sampling defaults for your base model.

    End-user steps to install your fine-tune:

    Step 1: Install Ollama

    Direct users to ollama.com/download. Installers are available for macOS and Windows; Linux uses a one-line install script.

    Step 2: Extract the bundle

    Unzip the GGUF bundle to a folder. The folder name becomes the Ollama model name (lowercased, non-alphanumerics converted to hyphens).

    Step 3: Run the installer

    Double-click install.bat on Windows, or run bash install.sh on macOS or Linux. The script checks that Ollama is on PATH. If it is not, the script exits with a download link; otherwise it calls ollama create <model-name> -f Modelfile.

    Step 4: Run the model

    Run ollama run <model-name> from a terminal, or open the Ollama desktop app and pick the model from the dropdown.
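
    The naming rule from Step 2 and the command from Step 3 can be sketched as follows. This is a minimal illustration, not the bundled script: toOllamaModelName and ollamaCreateArgs are hypothetical helper names, and the sketch assumes each non-alphanumeric ASCII character maps to exactly one hyphen (the bundle documentation does not say whether runs are collapsed).

```javascript
// Derive the Ollama model name from the extracted folder name.
// Assumption: lowercase first, then replace every character outside
// a-z and 0-9 with a single hyphen, one for one.
function toOllamaModelName(folderName) {
  return folderName.toLowerCase().replace(/[^a-z0-9]/g, "-");
}

// Build the argument list the installer would pass to the ollama binary,
// i.e. `ollama create <model-name> -f Modelfile`.
function ollamaCreateArgs(folderName) {
  return ["create", toOllamaModelName(folderName), "-f", "Modelfile"];
}

console.log(toOllamaModelName("Support Bot_v2")); // support-bot-v2
console.log(["ollama", ...ollamaCreateArgs("Support Bot_v2")].join(" "));
```

    An app that needs to know the registered name (for example, to call the HTTP API) can apply the same rule to the folder it extracted.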

    For applications that want to call the model programmatically rather than via the Ollama CLI:

    # Streaming chat via the Ollama HTTP API (default port 11434)
    curl http://localhost:11434/api/chat -d '{
      "model": "<model-name>",
      "messages": [{"role": "user", "content": "Write a haiku about distillation."}]
    }'

    Ollama also has official client libraries for Python and JavaScript, and community-maintained clients exist for Go, Rust, and other languages. For most desktop app shapes, having Ollama as the inference dependency is the lowest-friction path: it manages the model storage, handles updates, and exposes a stable HTTP API.
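
    For a JavaScript app that does not use a client library, the same endpoint can be called with fetch. The sketch below assumes Node 18 or later (global fetch and an async-iterable response body) and relies on /api/chat streaming newline-delimited JSON objects that carry the text in message.content. parseNdjson and streamChat are illustrative names, not part of any Ollama library.

```javascript
// Split a buffer of newline-delimited JSON into parsed objects plus
// whatever partial line is left over at the end.
function parseNdjson(buffer) {
  const lines = buffer.split("\n");
  const rest = lines.pop(); // incomplete trailing line, or "" if none
  const objects = lines
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
  return { objects, rest };
}

// Stream a chat reply from a local Ollama server. onToken is called
// with each text fragment; the full reply is returned at the end.
async function streamChat(model, prompt, onToken, baseUrl = "http://localhost:11434") {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);

  const decoder = new TextDecoder();
  let buffer = "";
  let full = "";
  for await (const chunk of res.body) {
    buffer += decoder.decode(chunk, { stream: true });
    const { objects, rest } = parseNdjson(buffer);
    buffer = rest;
    for (const obj of objects) {
      const piece = obj.message?.content ?? "";
      full += piece;
      if (piece) onToken(piece);
    }
  }
  // Handle a final object that arrived without a trailing newline.
  if (buffer.trim() !== "") full += JSON.parse(buffer).message?.content ?? "";
  return full;
}
```

    A caller might use it as streamChat("<model-name>", "Write a haiku about distillation.", (t) => process.stdout.write(t)). Buffering the partial line matters because a network chunk can end in the middle of a JSON object.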

    When Ollama is not the right fit: if you want zero external dependencies (no separate process for users to install) or you need tight control over the inference parameters per request, embed llama.cpp directly via one of the paths below.

    Path 2: llama.cpp directly (CLI or library)

    llama.cpp is the reference implementation for running GGUF models. The shipped model.gguf from the Ertas bundle loads in any current llama.cpp build without modification.

    For testing and scripting:

    ./build/bin/llama-cli -m model.gguf -p "Hello." -n 64

    For programmatic use from another language, llama.cpp exposes a C API (llama.h) plus high-quality community bindings:

    • Python: llama-cpp-python. Most popular; tracks llama.cpp closely.
    • Node.js: node-llama-cpp. High-level chat and embeddings API; ships prebuilt binaries for major OS/arch combos.
    • Rust: llama-cpp-2. Safe wrapper around the C API.
    • Go: go-llama.cpp. Maintained by the LocalAI project.

    Choose the binding that matches the language of your existing desktop app. All of them load the same model.gguf without conversion.

    Path 3: Electron (Node + node-llama-cpp)

    For an Electron app embedding the model:

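    npm install node-llama-cpp

    // ES module with top-level await (node-llama-cpp v3 API)
    import { getLlama, LlamaChatSession } from "node-llama-cpp";

    // Detect the host hardware and load the matching llama.cpp backend
    const llama = await getLlama();
    // Load the GGUF weights; prefer an absolute path in a packaged app
    const model = await llama.loadModel({ modelPath: "./model.gguf" });
    // The context holds the KV cache; the session tracks the chat history
    const context = await model.createContext();
    const session = new LlamaChatSession({ contextSequence: context.getSequence() });

    const response = await session.prompt("Write a haiku about distillation.");
    console.log(response);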

    node-llama-cpp ships prebuilt binaries for macOS, Windows, and Linux, and adapts to the host hardware automatically (Metal on Apple Silicon, CUDA on NVIDIA, Vulkan as a fallback). Check the package's documentation for the current architecture matrix and any platform-specific install notes; the default install covers the common cases.

    Practical Electron considerations:

    • Bundle size. Embedding node-llama-cpp adds roughly 30 to 80 MB of native binaries when you bundle multiple platforms; the CUDA build sits at the high end because of the bundled CUDA runtime. Shipping a single platform is much smaller. The model itself is downloaded on first launch (see Model delivery and UX).
    • Worker threads. Run inference in a Node worker thread (or an Electron utility process) rather than on the main process's event loop, which also services IPC from the renderer. node-llama-cpp's API is async, but model loading and allocation are heavy enough to benefit from running off the main thread.
    • Auto-update. Electron Forge and Electron Builder handle binary updates; the model file is yours to manage. Plan to ship the binary update independently of the model file, since the model is much larger than the rest of the app.
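
    The worker-thread advice above can be sketched with Node's built-in worker_threads module. This is an illustration only: the worker body is an inline stand-in that upper-cases the prompt instead of running a model, and createInferenceWorker is a hypothetical helper name. In a real app the worker would be a separate file that loads the GGUF once and then answers prompts.

```javascript
// Stand-in worker body (evaluated as a CommonJS script). A real worker
// would load the model with node-llama-cpp here, once, at startup.
const workerSource = `
  const { parentPort } = require("node:worker_threads");
  // Placeholder for real inference: echoes the prompt in upper case.
  const fakeInfer = (prompt) => prompt.toUpperCase();
  parentPort.on("message", ({ id, prompt }) => {
    parentPort.postMessage({ id, text: fakeInfer(prompt) });
  });
`;

async function createInferenceWorker() {
  // Dynamic import keeps the sketch valid in CommonJS and ES modules.
  const { Worker } = await import("node:worker_threads");
  const worker = new Worker(workerSource, { eval: true });
  const pending = new Map(); // request id -> resolve callback
  let nextId = 0;

  // Route each reply back to the promise that is waiting for it.
  worker.on("message", ({ id, text }) => {
    const resolve = pending.get(id);
    pending.delete(id);
    if (resolve) resolve(text);
  });

  return {
    // Send one prompt to the worker and wait for its reply.
    prompt(text) {
      return new Promise((resolve) => {
        const id = nextId++;
        pending.set(id, resolve);
        worker.postMessage({ id, prompt: text });
      });
    },
    // Stop the worker so the process can exit cleanly.
    close: () => worker.terminate(),
  };
}
```

    The request id lets several prompts be in flight without mixing up replies. In Electron, the object returned by createInferenceWorker would typically live in the main process and be called from an ipcMain handler.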

    Path 4: Tauri (Rust + llama-cpp-2)

    For Tauri, which ships a smaller binary than Electron by using the OS-native webview:

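    # Cargo.toml
    [dependencies]
    llama-cpp-2 = "0.1"

    // Rust backend (for example src-tauri/src/main.rs)
    use llama_cpp_2::{
        llama_backend::LlamaBackend,
        model::LlamaModel,
        model::params::LlamaModelParams,
    };

    fn main() {
        // Initialise the llama.cpp backend once per process
        let backend = LlamaBackend::init().unwrap();
        // Load the GGUF weights with default model parameters
        let model = LlamaModel::load_from_file(
            &backend,
            "./model.gguf",
            &LlamaModelParams::default(),
        ).unwrap();
        // ... build a context, run inference
    }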

    Tauri's bundle without the model is typically under 10 MB on each platform, which makes the first-launch model download feel proportionally larger to users. The first-run UX guidance in Model delivery and UX matters more here than in an Electron app where the 100 MB binary already sets the user's expectation.

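    Tauri's Rust backend can spawn a long-lived inference task and expose it to the JavaScript frontend via Tauri's command system. Keep the loaded LlamaModel alive for the lifetime of the app (for example in Tauri managed state) so successive requests reuse the loaded weights instead of reloading them. In llama-cpp-2 a LlamaContext borrows from its LlamaModel, so one workable arrangement is a dedicated inference thread that owns both and receives requests over a channel.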

    Cross-platform desktop considerations

    A few concerns that apply across all four paths:

    Storage location

    Store the model in the platform's user-data directory:

    Platform   Path (Electron app.getPath('userData'))
    macOS      ~/Library/Application Support/<app-name>/
    Windows    %APPDATA%\<app-name>\
    Linux      ~/.config/<app-name>/ (or $XDG_CONFIG_HOME/<app-name>/)
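
    Outside Electron, where app.getPath('userData') is not available, the same locations can be computed by hand. The sketch below mirrors the paths listed above; userDataDir and modelPath are hypothetical helper names, and the platform and environment are passed in as parameters so the logic can be exercised on any OS.

```javascript
// Resolve the per-user data directory for an app, following the
// macOS / Windows / Linux conventions listed above.
function userDataDir(appName, platform = process.platform, env = process.env) {
  const home = env.HOME ?? env.USERPROFILE ?? "";
  switch (platform) {
    case "darwin":
      return `${home}/Library/Application Support/${appName}`;
    case "win32": {
      // %APPDATA% normally points at <home>\AppData\Roaming
      const appData = env.APPDATA ?? `${home}\\AppData\\Roaming`;
      return `${appData}\\${appName}`;
    }
    default: {
      // Linux and other Unix-likes: honour XDG_CONFIG_HOME when set
      const configHome = env.XDG_CONFIG_HOME || `${home}/.config`;
      return `${configHome}/${appName}`;
    }
  }
}

// Full path of the model file inside that directory.
function modelPath(appName, platform = process.platform, env = process.env) {
  const sep = platform === "win32" ? "\\" : "/";
  return userDataDir(appName, platform, env) + sep + "model.gguf";
}

console.log(modelPath("my-app"));
```

    In an Electron app, prefer app.getPath('userData') itself; this helper is only meant for contexts where that API is missing.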

    Ollama manages its own storage, by default under ~/.ollama/models (the OLLAMA_MODELS environment variable overrides the location). If you ship via Ollama, your app does not need to track the GGUF path; Ollama's API takes a model name string.

    GPU acceleration

    • macOS: Metal is enabled by default in current llama.cpp and Ollama builds for Apple Silicon. Intel Macs fall back to CPU.
    • Windows / Linux NVIDIA: CUDA acceleration requires a CUDA-enabled build of llama.cpp. With node-llama-cpp, the prebuilt binaries include a CUDA build that is selected automatically when a supported GPU and driver are present. Performance gain is typically 2 to 10x over CPU, depending on GPU model, context length, and model size; the larger gains tend to show up on longer outputs and more capable GPUs.
    • Windows / Linux AMD: ROCm support exists but is platform-specific and less stable. Vulkan acceleration via llama.cpp is the more portable AMD path.
    • Windows / Linux Intel iGPU: SYCL / Level Zero support exists in llama.cpp; gain is modest for small models.
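
    The list above can be condensed into a small lookup. This is only a restatement of the guidance, not a detection routine: suggestBackend is a hypothetical helper, the vendor string is assumed to come from whatever GPU probing your app already does, and the returned labels are plain strings rather than options of any particular library.

```javascript
// Map a platform / CPU architecture / GPU vendor to the acceleration
// path recommended in the list above.
function suggestBackend(platform, arch, gpuVendor) {
  if (platform === "darwin") {
    // Metal on Apple Silicon; Intel Macs fall back to CPU.
    return arch === "arm64" ? "metal" : "cpu";
  }
  switch (gpuVendor) {
    case "nvidia":
      return "cuda";
    case "amd":
      return "vulkan"; // ROCm exists, but Vulkan is the more portable path
    case "intel":
      return "sycl"; // modest gain for small models
    default:
      return "cpu";
  }
}

console.log(suggestBackend("win32", "x64", "amd")); // vulkan
```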

    For most consumer apps, CPU inference on a recent Apple Silicon Mac or a desktop with 16 GB+ RAM is fast enough that GPU offload is not the lever that matters. The bigger wins are quantization choice and KV cache management. See Performance tips.
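
    To see why KV cache management matters, it helps to estimate the cache size. The sketch below uses the standard formula for a decoder-only transformer (keys plus values, per layer, per KV head, per token). kvCacheBytes is an illustrative name, and the example shape is an assumption chosen to resemble an 8B-class model with grouped-query attention; substitute the values from your own base model.

```javascript
// Estimate KV cache size in bytes for a decoder-only transformer.
// The factor 2 covers the separate key and value tensors in each layer.
function kvCacheBytes({ layers, kvHeads, headDim, contextTokens, bytesPerElement = 2 }) {
  return 2 * layers * kvHeads * headDim * contextTokens * bytesPerElement;
}

// Assumed shape: 32 layers, 8 KV heads of dimension 128,
// 8192-token context, 16-bit cache entries.
const bytes = kvCacheBytes({ layers: 32, kvHeads: 8, headDim: 128, contextTokens: 8192 });
console.log((bytes / 2 ** 30).toFixed(2), "GiB"); // 1.00 GiB
```

    Under these assumptions the cache costs 1 GiB on top of the weights. Halving the context length, or storing the cache at 8 bits per element, halves that figure.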

    Distribution outside app stores

    If you distribute via DMG, MSI, or AppImage instead of the Mac App Store / Microsoft Store, you can bundle the model directly in the installer for a single-file install experience. The trade-off is a much larger installer download. For 7 GB+ models the first-launch download is almost always better UX than a 9 GB installer.

    If you do bundle, the installer should write the model directly into the app's user-data directory at install time. Do not place it inside the app bundle proper, since macOS code signing and notarization of a 9 GB bundle are painfully slow.

    Code signing

    • macOS: Both the app and any embedded native binaries (.dylib, .so) must be signed and notarized. node-llama-cpp's prebuilt binaries are unsigned; you need to sign them as part of your build. Electron Forge and Tauri have native-binary signing hooks for this.
    • Windows: Sign all bundled .dll and .exe files. An EV (Extended Validation) code-signing certificate is the path that has historically established SmartScreen reputation immediately; a standard OV certificate works but takes time to accumulate reputation across many downloads. Submitting through the Microsoft Store (MSIX with Partner Center signing) sidesteps SmartScreen entirely for users on Windows 10/11.
    • Linux: No code-signing requirement, but consider providing GPG signatures for AppImage or .deb / .rpm distributions.

    What's next