
GGUF + llama.cpp: Shipping a Fine-Tuned Model in Your Mobile App
A practical guide to packaging fine-tuned AI models as GGUF files and running them on iOS and Android with llama.cpp. Includes file sizes, benchmarks, and integration patterns.
Your fine-tuned model works great on your laptop. Responses are fast, the quality is exactly what you need, and your evals are solid. Now you need it running on 50,000 iPhones.
This is where most mobile developers stop. Bundling an AI model into an app binary sounds daunting: binary size, memory constraints, thermal throttling, platform-specific build systems. But the tooling has matured. Projects like PocketPal AI demonstrate that a full llama.cpp-powered chat interface runs smoothly on today's flagship handsets. The path from fine-tuned weights to a shipping mobile app is defined, repeatable, and far less painful than it was even 18 months ago.
This guide walks through every step: GGUF format, quantization choices, iOS and Android integration, performance expectations, and delivery strategies.
What GGUF Is
GGUF is a single-file model format created and maintained by the llama.cpp/ggml project. Before GGUF, distributing a model meant juggling separate files for weights, configuration, tokenizer vocab, and special tokens. GGUF packs everything into one portable binary with a well-specified header structure.
What's inside a .gguf file:
- Model weights (quantized or full precision)
- Architecture metadata (layer count, attention heads, context length, rope parameters)
- Tokenizer vocabulary and merge rules
- Special token definitions (BOS, EOS, padding tokens)
The single-file design makes GGUF ideal for mobile deployment. You reference one path, load one file, and inference starts. No multi-file configuration needed. llama.cpp reads GGUF natively, and Ollama uses GGUF as its internal storage format.
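Because the header layout is fixed and documented in the GGUF spec, you can cheaply sanity-check a file before handing it to llama.cpp. A minimal Kotlin sketch, assuming a GGUF v2+ file (which uses the 64-bit counts read here); the function name is illustrative:

```kotlin
import java.io.DataInputStream
import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Quick GGUF sanity check: magic "GGUF", uint32 version, uint64 tensor count,
// all little-endian per the GGUF spec. Not a substitute for a full checksum.
fun looksLikeGguf(file: File): Boolean {
    if (file.length() < 24) return false
    val header = ByteArray(24)
    DataInputStream(file.inputStream()).use { it.readFully(header) }
    val buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN)
    val magic = ByteArray(4).also { buf.get(it) }
    if (!magic.contentEquals("GGUF".toByteArray(Charsets.US_ASCII))) return false
    val version = buf.int        // 3 for files produced by current tooling
    val tensorCount = buf.long
    return version >= 2 && tensorCount > 0
}
```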
Quantization Levels Explained
Quantization compresses model weights from 32-bit or 16-bit floats to lower-precision integers. This reduces file size and memory footprint substantially, at a modest quality cost that is often imperceptible for domain-specific tasks.
| Format | Bits per weight | Quality vs FP16 | Notes |
|---|---|---|---|
| Q4_K_M | ~4.5 bits | -2 to 4% on benchmarks | Best size-quality tradeoff for mobile |
| Q5_K_M | ~5.5 bits | -1 to 2% on benchmarks | Noticeably better coherence, ~20% larger |
| Q8_0 | 8 bits | Negligible loss | Near-lossless, 2x size of Q4_K_M |
| F16 | 16 bits | Baseline | Too large for most mobile targets |
Concrete file sizes at Q4_K_M:
- Llama 3.2 1B: 808 MB
- Llama 3.2 3B: 2.02 GB
- Phi-3-mini 3.8B: approximately 2.3 GB
- Llama 3.1 8B: approximately 4.9 GB
For mobile, Q4_K_M is the default recommendation. The quality difference versus Q5_K_M is small for domain-specific fine-tuned models (which have already constrained the output distribution), and the size saving is significant. Reserve Q8_0 for desktop or offline-capable tablets where storage is less constrained.
A 1B model at Q4_K_M (808 MB) fits comfortably on any modern handset. A 3B model (2.02 GB) works well on flagship devices. At 4.9 GB, an 8B model is only appropriate as a downloadable asset on devices with 6 GB+ RAM, not as a bundled binary.
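If you ship more than one variant, you can pick one at runtime based on device memory. A rough Android-side heuristic, with thresholds that simply mirror the guidance above; the model file names are placeholders:

```kotlin
import android.app.ActivityManager
import android.content.Context

// Pick a model variant based on total device RAM.
// ~6 GB+ devices can handle a 3B Q4_K_M model; everything else gets the 1B.
fun pickModelAsset(context: Context): String {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memInfo = ActivityManager.MemoryInfo()
    am.getMemoryInfo(memInfo)
    val totalGb = memInfo.totalMem / (1024.0 * 1024.0 * 1024.0)
    return when {
        totalGb >= 6.0 -> "model-3b.Q4_K_M.gguf"   // ~2.0 GB file
        else -> "model-1b.Q4_K_M.gguf"             // ~808 MB file
    }
}
```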
How llama.cpp Works on Mobile
llama.cpp is a pure C/C++ inference engine for GGUF models. The core library has zero external dependencies beyond the C++ standard library, which is why it compiles on nearly every platform.
For acceleration, llama.cpp uses:
- Metal (iOS/macOS) for GPU compute via custom Metal compute kernels
- OpenCL or Vulkan (Android) for GPU compute on Qualcomm, ARM, and MediaTek silicon
- NEON SIMD intrinsics for ARM CPU matrix operations
- Experimental NPU paths on newer Android chipsets (for example, Qualcomm's QNN/Hexagon backend work), which are less mature than the CPU and GPU backends
The project provides example binaries, but for mobile integration you compile it as a static library and call into it through platform bindings. Swift and Kotlin wrapper patterns exist in the wild (llama.cpp itself ships a SwiftUI example app), and the PocketPal AI project is a well-maintained open-source reference for both platforms.
iOS Integration
Build Setup
llama.cpp uses CMake. For iOS, you cross-compile a static library targeting arm64-apple-ios:
```bash
cmake -B build-ios \
  -DCMAKE_TOOLCHAIN_FILE=ios.toolchain.cmake \
  -DPLATFORM=OS64 \
  -DGGML_METAL=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DLLAMA_BUILD_TESTS=OFF \
  .
cmake --build build-ios --config Release
```
The GGML_METAL=ON flag enables Metal GPU acceleration. The resulting libllama.a and libggml.a static libraries link into your Xcode project.
Alternatively, a pre-built XCFramework is available from the llama.cpp Swift Package Manager package at github.com/ggml-org/llama.cpp, which handles the CMake step for you.
Swift Bindings
The SPM package exposes the C API to Swift; the project's llama.swiftui example wraps it in a LlamaContext class with an async factory for loading a model from a path and a method that streams generated tokens. Exact names vary between versions, so treat the snippet below as a pattern rather than a fixed API.
A minimal integration looks like:
```swift
import Foundation
import llama

class ModelRunner {
    private var context: LlamaContext?

    func load(modelURL: URL) async throws {
        // Factory name follows the llama.swiftui example; it may differ by version.
        context = try await LlamaContext.createContext(path: modelURL.path)
    }

    func stream(prompt: String) -> AsyncStream<String> {
        guard let ctx = context else { return AsyncStream { $0.finish() } }
        return AsyncStream { continuation in
            Task {
                for await token in ctx.completionStream(text: prompt) {
                    continuation.yield(token)
                }
                continuation.finish()
            }
        }
    }
}
```
Memory Management on iOS
iOS kills processes that exceed their memory budget without warning. Key rules:
- Load the model once at app launch or on a dedicated background thread. Do not re-initialize per request.
- Set `n_ctx` conservatively. Context length directly determines KV cache size. A 2048-token context uses significantly less memory than an 8192-token one. Most mobile use cases need fewer than 2048 tokens.
- Monitor memory warnings. Implement `applicationDidReceiveMemoryWarning` and respond by freeing the KV cache (call `llama_kv_cache_clear`) rather than unloading the whole model.
- Use memory-mapped loading. llama.cpp memory-maps model files by default (the `use_mmap` model parameter), which lets the OS page model weights in and out. On iOS this reduces peak RSS at the cost of slightly higher first-token latency.
A 1B Q4_K_M model (808 MB) plus a 2048-token KV cache runs comfortably within the memory budget of iPhones from the iPhone 12 generation onward. A 3B model needs devices with at least 6 GB of RAM (iPhone 15 Pro and later).
Android Integration
NDK Build
Android uses the NDK toolchain for C/C++ code. Add llama.cpp as a submodule or copy its source into app/src/main/cpp/, then configure CMakeLists.txt:
```cmake
cmake_minimum_required(VERSION 3.22)
project(llama_android)

add_subdirectory(llama.cpp)

add_library(llama_jni SHARED jni_bridge.cpp)
target_link_libraries(llama_jni llama ggml android log)
```
In build.gradle (app module):
```groovy
android {
    defaultConfig {
        externalNativeBuild {
            cmake {
                cppFlags "-std=c++17"
                arguments "-DGGML_OPENMP=OFF", "-DLLAMA_BUILD_TESTS=OFF"
            }
        }
    }
    externalNativeBuild {
        cmake {
            path "src/main/cpp/CMakeLists.txt"
        }
    }
}
```
For GPU acceleration on Android, pass -DGGML_OPENCL=ON (requires OpenCL headers) or -DGGML_VULKAN=ON (requires the Vulkan SDK). On Snapdragon devices, Qualcomm's QNN backend offers NPU acceleration via -DGGML_QNN=ON, though this requires the Qualcomm AI SDK.
JNI Wrapper and Kotlin Bridge
Create a thin JNI bridge in jni_bridge.cpp that wraps the llama.cpp C API:
extern "C" JNIEXPORT jlong JNICALL
Java_com_yourapp_LlamaWrapper_loadModel(JNIEnv *env, jobject, jstring modelPath) {
const char *path = env->GetStringUTFChars(modelPath, nullptr);
llama_model_params mparams = llama_model_default_params();
llama_model *model = llama_load_model_from_file(path, mparams);
env->ReleaseStringUTFChars(modelPath, path);
return reinterpret_cast<jlong>(model);
}
On the Kotlin side, a thin wrapper class holds the native pointer and exposes a coroutine-based API:
```kotlin
class LlamaWrapper(private val modelPath: String) {
    private var modelHandle: Long = 0L

    fun load() {
        modelHandle = loadModel(modelPath)
    }

    fun complete(prompt: String, onToken: (String) -> Unit) {
        generateTokens(modelHandle, prompt, onToken)
    }

    fun close() {
        if (modelHandle != 0L) {
            freeModel(modelHandle)
            modelHandle = 0L
        }
    }

    private external fun loadModel(path: String): Long
    private external fun generateTokens(handle: Long, prompt: String, cb: (String) -> Unit)
    private external fun freeModel(handle: Long)

    companion object {
        init { System.loadLibrary("llama_jni") }
    }
}
```
Keep the model loaded in a singleton tied to the Application lifecycle, not to individual Activities. Recreating the model on every screen transition will cause perceptible delays and excessive battery drain.
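One way to wire that up, assuming the LlamaWrapper class above and kotlinx.coroutines; the Application subclass name and model file name are placeholders:

```kotlin
import android.app.Application
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch
import java.io.File

// Register via android:name=".MyApp" in AndroidManifest.xml.
class MyApp : Application() {

    private val appScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

    // One wrapper for the whole process; Activities and ViewModels just reference it.
    lateinit var llama: LlamaWrapper
        private set

    override fun onCreate() {
        super.onCreate()
        val modelFile = File(filesDir, "model.Q4_K_M.gguf")  // placeholder path
        llama = LlamaWrapper(modelFile.absolutePath)
        // Load off the main thread so startup is not blocked.
        appScope.launch { llama.load() }
    }
}
```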
Performance Benchmarks
Real-world throughput on current flagship hardware, running Q4 or Q8 quantized models:
| Device | Model | Quantization | Tokens/sec | Backend |
|---|---|---|---|---|
| iPhone 17 Pro | 1.5B | INT8 | 136 | NPU (Apple ANE) |
| Galaxy S25 Ultra | 2B | INT8 | 91 | NPU (Qualcomm Hexagon) |
| iPhone 16 Pro | 1.5B | Q4 | 22 (sustained) | Metal GPU |
| Snapdragon 8 Elite | 13B | Q4 | 20+ | Hexagon NPU |
| iPhone 15 Pro | 1B | Q4 | ~18 | Metal GPU |
| Mid-range Android (SD 7s Gen 3) | 1B | Q4 | 8-12 | CPU NEON |
For context: human reading speed is approximately 5-7 tokens per second. Every flagship device listed above exceeds comfortable reading pace with a 1-3B model. Even mid-range devices at 8-12 tok/s feel responsive for most use cases.
The NPU numbers (iPhone 17 Pro at 136 tok/s, Galaxy S25 Ultra at 91 tok/s) represent a step change in capability. On NPU-accelerated paths, latency drops to roughly 1-20% of the CPU baseline, and power efficiency per trillion operations is significantly higher than GPU or CPU inference.
What these numbers mean for UX: At 22 tok/s on an iPhone 16 Pro, a 200-token response renders in under 10 seconds. First-token latency (time before the stream starts) is typically 200-800ms depending on prompt length. Both are acceptable for most in-app assistant patterns.
Thermal Management and Battery
Sustained LLM inference is one of the most thermally intensive workloads a mobile processor handles. The iPhone 16 Pro loses approximately 44% of its throughput under sustained load as the SoC throttles to keep temperatures in check. A benchmark that shows 22 tok/s may deliver closer to 12-14 tok/s after 5-10 minutes of continuous inference.
Practical mitigation strategies:
Cap inference duration. For most use cases, responses are complete in under 30 seconds. Set a maximum token count (n_predict) appropriate to your use case. This limits thermal impact per request.
Add inter-request delays. For background processing jobs, insert a short pause between completions. Even a brief pause allows the SoC to shed heat before the next inference pass.
Choose smaller models for continuous tasks. A 1B model generates tokens at higher throughput with substantially less heat than a 3B model. For classification, extraction, or formatting tasks, the smaller model often produces equivalent results.
Monitor device temperature on Android. The thermal status API (PowerManager.addThermalStatusListener, Android 10+) reports a status from 0 (none) to 6 (shutdown). Register a listener and reduce inference frequency or context length as the device warms. There is no direct equivalent on iOS, but you can measure throughput degradation as a proxy.
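A sketch of that listener using PowerManager's thermal status callbacks; what you actually throttle (context length, token cap, background jobs) is up to your app:

```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager

// Back off inference as the device heats up (Android 10+ / API 29+).
// Thermal status ranges from THERMAL_STATUS_NONE (0) to THERMAL_STATUS_SHUTDOWN (6).
fun watchThermals(context: Context, onThrottle: (reduceLoad: Boolean) -> Unit) {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.Q) return
    val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    pm.addThermalStatusListener { status ->
        // From SEVERE upward, shorten context, cap n_predict, or pause background work.
        onThrottle(status >= PowerManager.THERMAL_STATUS_SEVERE)
    }
}
```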
Battery guidance: A 1B model running continuously at full speed on an iPhone 16 Pro consumes roughly 2-3 watts of SoC power beyond the baseline. An active inference session reduces battery life by approximately 20-30% relative to typical app usage. For most in-app assistant patterns (short, user-initiated requests with pauses between), the impact is much smaller.
Model Delivery Strategies
How you get the GGUF file onto the device matters as much as how you run it.
Bundle in the App Binary
Pros: zero-latency first-use experience, no network requirement, no download UX needed. Cons: app store size limits. Apple prompts for user confirmation on cellular downloads over 200 MB and caps total app size at 4 GB; Google Play has similar constraints, with larger assets delivered through Play Asset Delivery.
Works for: 1B Q4_K_M models (808 MB) as a downloadable asset via iOS On-Demand Resources or Android App Bundle asset packs. The model stays out of the main binary but downloads automatically on install.
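If you go the asset-pack route on Android, resolving the model path at runtime looks roughly like the sketch below. It assumes the Play Asset Delivery library and an install-time asset pack named model_pack; both names are placeholders:

```kotlin
import android.content.Context
import com.google.android.play.core.assetpacks.AssetPackManagerFactory
import java.io.File

// Resolve the GGUF path from an install-time Play Asset Delivery pack.
// Assumes an asset pack module named "model_pack" containing model.Q4_K_M.gguf.
fun bundledModelPath(context: Context): File? {
    val assetPackManager = AssetPackManagerFactory.getInstance(context)
    val location = assetPackManager.getPackLocation("model_pack") ?: return null
    val assetsDir = location.assetsPath() ?: return null
    return File(assetsDir, "model.Q4_K_M.gguf").takeIf { it.exists() }
}
```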
Download on First Launch
The most common pattern for larger models. The app ships without the model, and on first launch (or on a "set up AI features" opt-in screen) downloads the GGUF from your CDN.
Implementation notes:
- Use `URLSession` background download tasks on iOS so the download continues if the user backgrounds the app.
- Use `WorkManager` with an unmetered-network constraint (`NetworkType.UNMETERED`, or `NOT_ROAMING` if cellular is acceptable) on Android; see the sketch after this list.
- Show progress with a clear explanation ("Downloading AI model, 2 GB, Wi-Fi recommended").
- Cache the download aggressively. Do not re-download if the file is already present and passes a checksum.
- Consider storing the GGUF in the app's Application Support directory (iOS) or `filesDir` (Android) rather than a shared location, to avoid OS-level cleanup on low storage.
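A condensed Android sketch of that flow using WorkManager's CoroutineWorker. The CDN URL and expected checksum are placeholders, and a production version would add progress reporting and resumable downloads:

```kotlin
import android.content.Context
import androidx.work.*
import java.io.File
import java.net.URL
import java.security.MessageDigest

class ModelDownloadWorker(ctx: Context, params: WorkerParameters) : CoroutineWorker(ctx, params) {
    override suspend fun doWork(): Result {
        val dest = File(applicationContext.filesDir, "model.Q4_K_M.gguf")
        if (dest.exists() && sha256(dest) == EXPECTED_SHA256) return Result.success()

        // Stream to a temp file first so a partial download never looks complete.
        val tmp = File(dest.parentFile, "${dest.name}.part")
        URL(MODEL_URL).openStream().use { input ->
            tmp.outputStream().use { output -> input.copyTo(output) }
        }
        if (sha256(tmp) != EXPECTED_SHA256) { tmp.delete(); return Result.retry() }
        tmp.renameTo(dest)
        return Result.success()
    }

    private fun sha256(file: File): String {
        val digest = MessageDigest.getInstance("SHA-256")
        file.inputStream().use { input ->
            val buf = ByteArray(1 shl 16)
            while (true) {
                val read = input.read(buf)
                if (read <= 0) break
                digest.update(buf, 0, read)
            }
        }
        return digest.digest().joinToString("") { "%02x".format(it) }
    }

    companion object {
        const val MODEL_URL = "https://cdn.example.com/models/v1/model.Q4_K_M.gguf" // placeholder
        const val EXPECTED_SHA256 = "replace-with-published-checksum"               // placeholder
    }
}

// Enqueue once, constrained to unmetered networks.
fun scheduleModelDownload(context: Context) {
    val request = OneTimeWorkRequestBuilder<ModelDownloadWorker>()
        .setConstraints(
            Constraints.Builder().setRequiredNetworkType(NetworkType.UNMETERED).build()
        )
        .build()
    WorkManager.getInstance(context)
        .enqueueUniqueWork("model-download", ExistingWorkPolicy.KEEP, request)
}
```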
Delta Updates
When you release a new fine-tuned version, you likely don't want to push 2 GB to every user on every update. Delta update patterns:
- LoRA adapter only: If your update is a new fine-tune of the same base model, ship only the LoRA adapter file (typically 20-200 MB) and load it over the frozen base model at inference time. llama.cpp supports LoRA adapters via the `--lora` flag or the equivalent API call. This is far more bandwidth-efficient than replacing the full GGUF.
- GGUF diff: For architecture changes that require a new base, general-purpose binary diff tools can compute a patch between GGUF versions. The patch is much smaller than a full re-download if only a subset of weights changed.
- Version-tagged CDN paths: Store models at paths like `/models/v2/model.Q4_K_M.gguf` and check a version endpoint on app launch. Update the local copy only when the remote version is newer (see the sketch after this list).
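A minimal version-check sketch; the endpoint and preference key are hypothetical, and the network call must run off the main thread:

```kotlin
import android.content.Context
import java.net.URL

// Check a tiny version endpoint on launch and download only when the remote
// model is newer than the local copy. Endpoint and preference key are placeholders.
fun modelUpdateAvailable(context: Context): Boolean {
    val remoteVersion = URL("https://cdn.example.com/models/latest.txt")
        .readText().trim()                                   // e.g. "v2"
    val prefs = context.getSharedPreferences("model", Context.MODE_PRIVATE)
    val localVersion = prefs.getString("model_version", null)
    return remoteVersion != localVersion
}
```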
Platform Alternatives to llama.cpp
Before committing to the llama.cpp integration path, evaluate whether a platform-managed API meets your needs.
Apple Foundation Models API (iOS 26+, announced at WWDC 2025)
Apple shipped a public Swift API for on-device inference that targets its approximately 3B parameter on-device model. The API is high-level: you describe a task with a GenerationOptions struct and receive text, structured JSON, or tool calls.
Pros: No model to download or maintain, hardware-optimized by Apple, trivially simple Swift API, no memory management headaches.
Cons: You cannot load a custom fine-tuned model. You are constrained to Apple's base model and its capabilities. Not available on Android. Model quality for specialized domains may be insufficient.
Use the Foundation Models API if: your task is general enough that Apple's base model handles it well, and you want to ship fast without managing model files.
Use llama.cpp with a custom GGUF if: you need domain-specific quality, cross-platform behavior, or control over the model's exact outputs.
Google Gemini Nano (Android, ML Kit GenAI APIs, Google I/O 2025)
Google's ML Kit now exposes on-device Gemini Nano inference via a managed API. Like Apple's offering, this runs a fixed model managed by the OS, not a custom one.
Pros: Simple API, no download required on supported Pixel and partner devices, integrates with existing ML Kit patterns.
Cons: Available only on Pixel 9 and select other devices. No custom model support. Cross-device consistency is limited.
For production apps targeting broad device support with a custom fine-tuned model, llama.cpp with GGUF remains the most portable approach.
End-to-End Checklist
Before shipping, verify:
- GGUF exported at the right quantization for your target device tier (Q4_K_M for most mobile targets)
- File size fits within your delivery strategy (under 200 MB for instant OTA, under 4 GB for deferred download)
- Context length (`n_ctx`) is set to the minimum required, not the model's maximum
- Model is loaded once at app startup or on a dedicated background queue, not per-request
- Memory warnings are handled: clear KV cache before unloading full model
- Thermal throttling tested: run inference for 10+ minutes and verify output quality under sustained load
- Background download implemented with progress feedback for first-launch model delivery
- Checksum validation on downloaded GGUF before attempting to load
- Token limit (`n_predict`) is set to a sensible cap to bound worst-case inference duration
Getting Started with Ertas
The integration work above assumes you already have a fine-tuned GGUF. The fine-tuning step is where Ertas comes in.
Upload your domain data, configure training parameters visually, and export the result as a GGUF at your target quantization level. Ertas handles the cloud GPU compute, dataset formatting, and quantized export. You get back a .gguf file ready to drop into the iOS or Android integration described above.
The mobile inference layer is open-source infrastructure. The differentiator is the model inside it: one that understands your domain, your users' language, and your product's specific output requirements. That's what fine-tuning produces, and that's the part you own.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.