    GGUF + llama.cpp: Shipping a Fine-Tuned Model in Your Mobile App


    A practical guide to packaging fine-tuned AI models as GGUF files and running them on iOS and Android with llama.cpp. Includes file sizes, benchmarks, and integration patterns.

    Ertas Team

    Your fine-tuned model works great on your laptop. Responses are fast, the quality is exactly what you need, and your evals are solid. Now you need it running on 50,000 iPhones.

    This is where most mobile developers stop. Bundling an AI model into an app binary sounds daunting: binary size, memory constraints, thermal throttling, platform-specific build systems. But the tooling has matured. Projects like PocketPal AI demonstrate that a full llama.cpp-powered chat interface runs smoothly on today's flagship handsets. The path from fine-tuned weights to a shipping mobile app is defined, repeatable, and far less painful than it was even 18 months ago.

    This guide walks through every step: GGUF format, quantization choices, iOS and Android integration, performance expectations, and delivery strategies.

    What GGUF Is

    GGUF is a single-file model format maintained by the llama.cpp project, the successor to the original GGML format. Before GGUF, distributing a model meant juggling separate files for weights, configuration, tokenizer vocab, and special tokens. GGUF packs everything into one portable binary with a well-specified header structure.

    What's inside a .gguf file:

    • Model weights (quantized or full precision)
    • Architecture metadata (layer count, attention heads, context length, rope parameters)
    • Tokenizer vocabulary and merge rules
    • Special token definitions (BOS, EOS, padding tokens)

    The single-file design makes GGUF ideal for mobile deployment. You reference one path, load one file, and inference starts. No multi-file configuration needed. llama.cpp reads GGUF natively, and Ollama uses GGUF as its internal storage format.
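Because the header layout is well specified, you can sanity-check a downloaded file with a few lines of tooling code. A minimal sketch that reads only the fixed-size fields at the top of the file (the full metadata key-value encoding is more involved):

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(path):
    """Read GGUF's fixed-size header: magic, version, tensor count, metadata count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        # Little-endian: uint32 version, then two 64-bit counts.
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
        return {"version": version, "tensors": tensor_count, "metadata_keys": kv_count}
```

A check like this at download time catches truncated or corrupted files before you hand them to the inference engine.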

    Quantization Levels Explained

    Quantization compresses model weights from 32-bit or 16-bit floats to lower-precision integers. This reduces file size and memory footprint substantially, at a modest quality cost that is often imperceptible for domain-specific tasks.

    Format | Bits per weight | Quality vs FP16        | Notes
    Q4_K_M | ~4.5 bits       | -2 to 4% on benchmarks | Best size-quality tradeoff for mobile
    Q5_K_M | ~5.5 bits       | -1 to 2% on benchmarks | Noticeably better coherence, ~20% larger
    Q8_0   | 8 bits          | Negligible loss        | Near-lossless, 2x size of Q4_K_M
    F16    | 16 bits         | Baseline               | Too large for most mobile targets
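These bits-per-weight figures map directly to file size. A back-of-the-envelope sketch (the parameter count and 4.5-bit figure below are nominal; real GGUF files run somewhat larger because embedding tables are often kept at higher precision and the file carries metadata):

```python
def estimate_gguf_size_mb(n_params: float, bits_per_weight: float) -> float:
    """Rough quantized file size: parameters x bits per weight, converted to MB."""
    return n_params * bits_per_weight / 8 / 1e6

# A nominal "3B" model (~3.2e9 params) at Q4_K_M's ~4.5 bits per weight
# comes out around 1.8 GB before embedding-precision and metadata overhead.
size_mb = estimate_gguf_size_mb(3.2e9, 4.5)
```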

    Concrete file sizes at Q4_K_M:

    • Llama 3.2 1B: 808 MB
    • Llama 3.2 3B: 2.02 GB
    • Phi-3-mini 3.8B: approximately 2.3 GB
    • Llama 3.1 8B: approximately 4.9 GB

    For mobile, Q4_K_M is the default recommendation. The quality difference versus Q5_K_M is small for domain-specific fine-tuned models (which have already constrained the output distribution), and the size saving is significant. Reserve Q8_0 for desktop or offline-capable tablets where storage is less constrained.

    A 1B model at Q4_K_M (808 MB) fits comfortably on any modern handset. A 3B model (2.02 GB) works well on flagship devices. At 4.9 GB, an 8B model is only appropriate as a downloadable asset on devices with 6 GB+ RAM, not as a bundled binary.

    How llama.cpp Works on Mobile

    llama.cpp is a pure C/C++ inference engine for GGUF models. The core library has zero external dependencies beyond the C++ standard library, which is why it compiles on nearly every platform.

    For acceleration, llama.cpp uses:

    • Metal (iOS/macOS) for GPU compute via Apple's Metal Performance Shaders
    • OpenCL or Vulkan (Android) for GPU compute on Qualcomm, ARM, and MediaTek silicon
    • NEON SIMD intrinsics for ARM CPU matrix operations
    • Vendor NPU backends on some Android chipsets (e.g. Qualcomm's Hexagon), which are still experimental; llama.cpp does not route through Android's NNAPI

    The project provides example binaries, but for mobile integration you compile it as a static library and call into it through platform bindings. Both official Swift and Kotlin wrapper patterns exist in the wild, and the PocketPal AI project is a well-maintained open-source reference for both platforms.

    iOS Integration

    Build Setup

    llama.cpp uses CMake. For iOS, you cross-compile a static library targeting arm64-apple-ios:

    cmake -B build-ios \
      -DCMAKE_TOOLCHAIN_FILE=ios.toolchain.cmake \
      -DPLATFORM=OS64 \
      -DGGML_METAL=ON \
      -DBUILD_SHARED_LIBS=OFF \
      -DLLAMA_BUILD_TESTS=OFF \
      .
    cmake --build build-ios --config Release
    

    The GGML_METAL=ON flag enables Metal GPU acceleration. The resulting libllama.a and libggml.a static libraries link into your Xcode project.

    Alternatively, the llama.cpp repository at github.com/ggml-org/llama.cpp ships a Swift Package Manager manifest and a script for producing an XCFramework, either of which spares you the manual CMake step.

    Swift Bindings

    The Swift bindings expose a LlamaContext class that loads a model from a file path and streams generated tokens asynchronously. Exact method names have varied between versions of the package; the snippet below follows the pattern of the llama.swiftui example app.

    A minimal integration looks like:

    import llama
    
    class ModelRunner {
        private var context: LlamaContext?
    
        func load(modelURL: URL) async throws {
            // Factory method as in the llama.swiftui example; name may differ by version.
            context = try await LlamaContext.createContext(path: modelURL.path)
        }
    
        func stream(prompt: String) -> AsyncStream<String> {
            guard let ctx = context else { return AsyncStream { $0.finish() } }
            return AsyncStream { continuation in
                Task {
                    for await token in ctx.completionStream(text: prompt) {
                        continuation.yield(token)
                    }
                    continuation.finish()
                }
            }
        }
    }
    

    Memory Management on iOS

    iOS kills processes that exceed their memory budget without warning. Key rules:

    • Load the model once at app launch or on a dedicated background thread. Do not re-initialize per request.
    • Set n_ctx conservatively. Context length directly determines KV cache size. A 2048-token context uses significantly less memory than an 8192-token one. Most mobile use cases need fewer than 2048 tokens.
    • Monitor memory warnings. Implement applicationDidReceiveMemoryWarning and respond by freeing the KV cache (call llama_kv_cache_clear) rather than unloading the whole model.
    • Keep memory-mapped loading enabled. llama.cpp memory-maps model files by default (use_mmap in llama_model_params), which lets the OS page model weights in and out. On iOS this reduces peak RSS at the cost of slightly higher first-token latency.
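The n_ctx guidance is easy to quantify. A sketch of the KV-cache arithmetic, using illustrative architecture numbers rather than any specific model's real config (read the actual layer and head counts from your GGUF's metadata):

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Keys + values: 2 tensors per layer, each n_ctx x n_kv_heads x head_dim
    elements, stored as f16 (2 bytes) by default."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 1B-class config: 16 layers, 8 KV heads, 64-dim heads.
small = kv_cache_bytes(n_ctx=2048, n_layers=16, n_kv_heads=8, head_dim=64)  # 64 MB
large = kv_cache_bytes(n_ctx=8192, n_layers=16, n_kv_heads=8, head_dim=64)  # 256 MB
```

The cache grows linearly with context length, so an 8192-token context costs four times the memory of a 2048-token one at the same config.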

    A 1B Q4_K_M model (808 MB) plus a 2048-token KV cache runs comfortably within the memory budget of iPhones from the 12 generation forward. A 3B model needs devices with at least 6 GB RAM (iPhone 15 Pro and later).

    Android Integration

    NDK Build

    Android uses the NDK toolchain for C/C++ code. Add llama.cpp as a submodule or copy its source into app/src/main/cpp/, then configure CMakeLists.txt:

    cmake_minimum_required(VERSION 3.22)
    project(llama_android)
    
    add_subdirectory(llama.cpp)
    
    add_library(llama_jni SHARED jni_bridge.cpp)
    target_link_libraries(llama_jni llama ggml android log)
    

    In build.gradle (app module):

    android {
        defaultConfig {
            externalNativeBuild {
                cmake {
                    cppFlags "-std=c++17"
                    arguments "-DGGML_OPENMP=OFF", "-DLLAMA_BUILD_TESTS=OFF"
                }
            }
        }
        externalNativeBuild {
            cmake {
                path "src/main/cpp/CMakeLists.txt"
            }
        }
    }
    

    For GPU acceleration on Android, pass -DGGML_OPENCL=ON (requires OpenCL headers) or -DGGML_VULKAN=ON (requires the Vulkan SDK). NPU acceleration on Snapdragon devices is possible through Qualcomm's QNN-based backends, but these live in vendor forks rather than behind a single upstream flag, and they require the Qualcomm AI SDK.

    JNI Wrapper and Kotlin Bridge

    Create a thin JNI bridge in jni_bridge.cpp that wraps the llama.cpp C API:

    extern "C" JNIEXPORT jlong JNICALL
    Java_com_yourapp_LlamaWrapper_loadModel(JNIEnv *env, jobject, jstring modelPath) {
        const char *path = env->GetStringUTFChars(modelPath, nullptr);
        llama_model_params mparams = llama_model_default_params();
        llama_model *model = llama_load_model_from_file(path, mparams);
        env->ReleaseStringUTFChars(modelPath, path);
        return reinterpret_cast<jlong>(model);
    }
    

    On the Kotlin side, a thin wrapper class holds the native pointer and exposes a coroutine-based API:

    class LlamaWrapper(private val modelPath: String) {
        private var modelHandle: Long = 0
    
        fun load() {
            modelHandle = loadModel(modelPath)
        }
    
        fun complete(prompt: String, onToken: (String) -> Unit) {
            generateTokens(modelHandle, prompt, onToken)
        }
    
        fun close() {
            if (modelHandle != 0L) {
                freeModel(modelHandle)
                modelHandle = 0
            }
        }
    
        private external fun loadModel(path: String): Long
        private external fun generateTokens(handle: Long, prompt: String, cb: (String) -> Unit)
        private external fun freeModel(handle: Long)
    
        companion object {
            init { System.loadLibrary("llama_jni") }
        }
    }
    

    Keep the model loaded in a singleton tied to the Application lifecycle, not to individual Activities. Recreating the model on every screen transition will cause perceptible delays and excessive battery drain.

    Fine-tune a model, export as GGUF, ship it in your app.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    Performance Benchmarks

    Real-world throughput on current flagship hardware, running Q4 or Q8 quantized models (the NPU rows reflect vendor-accelerated runtimes rather than llama.cpp's Metal/CPU paths):

    Device                          | Model | Quantization | Tokens/sec     | Backend
    iPhone 17 Pro                   | 1.5B  | INT8         | 136            | NPU (Apple ANE)
    Galaxy S25 Ultra                | 2B    | INT8         | 91             | NPU (Qualcomm Hexagon)
    iPhone 16 Pro                   | 1.5B  | Q4           | 22 (sustained) | Metal GPU
    Snapdragon 8 Elite              | 13B   | Q4           | 20+            | Hexagon NPU
    iPhone 15 Pro                   | 1B    | Q4           | ~18            | Metal GPU
    Mid-range Android (SD 7s Gen 3) | 1B    | Q4           | 8-12           | CPU NEON

    For context: human reading speed is approximately 5-7 tokens per second. Every flagship device listed above exceeds comfortable reading pace with a 1-3B model. Even mid-range devices at 8-12 tok/s feel responsive for most use cases.

    The NPU numbers (iPhone 17 Pro at 136 tok/s, Galaxy S25 Ultra at 91 tok/s) represent a step change in capability. On NPU-accelerated paths, latency drops to roughly 1-20% of the CPU baseline, and power efficiency (useful work per watt) is significantly higher than GPU or CPU inference.

    What these numbers mean for UX: At 22 tok/s on an iPhone 16 Pro, a 200-token response renders in under 10 seconds. First-token latency (time before the stream starts) is typically 200-800ms depending on prompt length. Both are acceptable for most in-app assistant patterns.
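That arithmetic generalizes to any throughput figure. A small helper, with first-token latency as an illustrative input rather than a measured constant:

```python
def response_seconds(n_tokens: int, tokens_per_sec: float,
                     first_token_latency_s: float = 0.5) -> float:
    """Total wall time for a streamed response: time-to-first-token plus decode time."""
    return first_token_latency_s + n_tokens / tokens_per_sec

# 200 tokens at 22 tok/s with a 500 ms first-token delay lands just under 10 s.
t = response_seconds(200, 22)
```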

    Thermal Management and Battery

    Sustained LLM inference is one of the most thermally intensive workloads a mobile processor handles. The iPhone 16 Pro loses approximately 44% throughput under sustained load as the SoC throttles to protect hardware temperature. A benchmark that shows 22 tok/s may deliver closer to 12-14 tok/s after 5-10 minutes of continuous inference.

    Practical mitigation strategies:

    Cap inference duration. For most use cases, responses are complete in under 30 seconds. Set a maximum token count (n_predict) appropriate to your use case. This limits thermal impact per request.

    Add inter-request delays. For background processing jobs, insert a short pause between completions. Even a brief pause allows the SoC to shed heat before the next inference pass.

    Choose smaller models for continuous tasks. A 1B model generates tokens at higher throughput with substantially less heat than a 3B model. For classification, extraction, or formatting tasks, the smaller model often produces equivalent results.

    Monitor device temperature on Android. The thermal status API (PowerManager.addThermalStatusListener, Android 10+; AThermalManager in the NDK) exposes thermal status on a 0-6 scale from THERMAL_STATUS_NONE to THERMAL_STATUS_SHUTDOWN. Register a listener and reduce inference frequency or context length as the device warms. On iOS, ProcessInfo.thermalState offers a coarser four-level equivalent (nominal through critical).
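Pulling the mitigations together, the throttling policy can be expressed as a simple mapping from thermal status to generation limits. This is a platform-agnostic sketch with illustrative thresholds, not a tuned policy:

```python
def inference_policy(thermal_status: int) -> dict:
    """Scale back generation as the device warms.
    thermal_status follows Android's 0-6 scale (0 = none ... 6 = shutdown)."""
    if thermal_status <= 1:
        # None / light throttling: full generation budget, no pause.
        return {"max_tokens": 512, "delay_s": 0.0}
    if thermal_status <= 3:
        # Moderate / severe: shorter responses, brief inter-request pauses.
        return {"max_tokens": 256, "delay_s": 2.0}
    # Critical and above: stop generating until the device cools.
    return {"max_tokens": 0, "delay_s": 30.0}
```

The same shape works on iOS by mapping ProcessInfo's four thermal states onto the same buckets.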

    Battery guidance: A 1B model running continuously at full speed on an iPhone 16 Pro consumes roughly 2-3 watts of SoC power beyond the baseline. An active inference session reduces battery life by approximately 20-30% relative to typical app usage. For most in-app assistant patterns (short, user-initiated requests with pauses between), the impact is much smaller.

    Model Delivery Strategies

    How you get the GGUF file onto the device matters as much as how you run it.

    Bundle in the App Binary

    Pros: zero-latency first-use experience, no network requirement, no download UX needed. Cons: store size limits. Apple caps total app size at 4 GB, and cellular downloads over roughly 200 MB prompt the user for confirmation; Google Play limits the base App Bundle download to 200 MB, with larger content delivered through Play Asset Delivery.

    Works for: 1B Q4_K_M models (808 MB) as a downloadable asset via iOS On-Demand Resources or Android App Bundle asset packs. The model stays out of the main binary but downloads automatically on install.

    Download on First Launch

    The most common pattern for larger models. The app ships without the model, and on first launch (or on a "set up AI features" opt-in screen) downloads the GGUF from your CDN.

    Implementation notes:

    • Use URLSession background download tasks on iOS so the download continues if the user backgrounds the app.
    • Use WorkManager with NETWORK_NOT_ROAMING or NETWORK_UNMETERED constraints on Android.
    • Show progress with a clear explanation ("Downloading AI model, 2 GB, Wi-Fi recommended").
    • Cache the download aggressively. Do not re-download if the file is already present and passes a checksum.
    • Consider storing the GGUF in the app's Application Support directory (iOS) or filesDir (Android) rather than a shared location, to avoid OS-level cleanup on low storage.
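The checksum step is worth spelling out, since a truncated multi-gigabyte download that half-loads is a worse failure mode than a failed download. A sketch (the expected digest would come from your version manifest):

```python
import hashlib

def verify_gguf(path: str, expected_sha256: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file through SHA-256 so multi-GB models never need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```

Run this once after download, store a "verified" marker, and skip the re-hash on subsequent launches.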

    Delta Updates

    When you release a new fine-tuned version, you likely don't want to push 2 GB to every user on every update. Delta update patterns:

    • LoRA adapter only: If your update is a new fine-tune of the same base model, ship only the LoRA adapter file (typically 20-200 MB) and load it over the frozen base model at inference time. llama.cpp supports LoRA adapters via the --lora flag or equivalent API call. This is far more bandwidth-efficient than replacing the full GGUF.
    • GGUF diff: For updates that keep the same architecture and quantization, generic binary-diff tooling (bsdiff-style patching) can produce a patch much smaller than a full re-download when only a subset of weights changed; a change of quantization or architecture usually alters most bytes and defeats the diff.
    • Version-tagged CDN paths: Store models at paths like /models/v2/model.Q4_K_M.gguf and check a version endpoint on app launch. Update the local copy only when the remote version is newer.

    Platform Alternatives to llama.cpp

    Before committing to the llama.cpp integration path, evaluate whether a platform-managed API meets your needs.

    Apple Foundation Models API (announced at WWDC 2025)

    Apple shipped a public Swift API for on-device inference that targets its approximately 3B parameter on-device model. The API is high-level: you describe a task with a GenerationOptions struct and receive text, structured JSON, or tool calls.

    Pros: No model to download or maintain, hardware-optimized by Apple, trivially simple Swift API, no memory management headaches.

    Cons: You cannot load a custom fine-tuned model. You are constrained to Apple's base model and its capabilities. Not available on Android. Model quality for specialized domains may be insufficient.

    Use the Foundation Models API if: your task is general enough that Apple's base model handles it well, and you want to ship fast without managing model files.

    Use llama.cpp with a custom GGUF if: you need domain-specific quality, cross-platform behavior, or control over the model's exact outputs.

    Google Gemini Nano (Android, ML Kit GenAI APIs, Google I/O 2025)

    Google's ML Kit now exposes on-device Gemini Nano inference via a managed API. Like Apple's offering, this runs a fixed model managed by the OS, not a custom one.

    Pros: Simple API, no download required on supported Pixel and partner devices, integrates with existing ML Kit patterns.

    Cons: Available only on Pixel 9 and select other devices. No custom model support. Cross-device consistency is limited.

    For production apps targeting broad device support with a custom fine-tuned model, llama.cpp with GGUF remains the most portable approach.

    End-to-End Checklist

    Before shipping, verify:

    • GGUF exported at the right quantization for your target device tier (Q4_K_M for most mobile targets)
    • File size fits within your delivery strategy (under 200 MB for instant OTA, under 4 GB for deferred download)
    • Context length (n_ctx) is set to the minimum required, not the model's maximum
    • Model is loaded once at app startup or on a dedicated background queue, not per-request
    • Memory warnings are handled: clear KV cache before unloading full model
    • Thermal throttling tested: run inference for 10+ minutes and verify output quality under sustained load
    • Background download implemented with progress feedback for first-launch model delivery
    • Checksum validation on downloaded GGUF before attempting to load
    • Token limit (n_predict) set to a sensible cap to bound worst-case inference duration

    Getting Started with Ertas

    The integration work above assumes you already have a fine-tuned GGUF. The fine-tuning step is where Ertas comes in.

    Upload your domain data, configure training parameters visually, and export the result as a GGUF at your target quantization level. Ertas handles the cloud GPU compute, dataset formatting, and quantized export. You get back a .gguf file ready to drop into the iOS or Android integration described above.

    The mobile inference layer is open-source infrastructure. The differentiator is the model inside it: one that understands your domain, your users' language, and your product's specific output requirements. That's what fine-tuning produces, and that's the part you own.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

