
llama.cpp on iOS: A Swift Integration Guide
Step-by-step guide to integrating llama.cpp into an iOS app. Project setup, Metal GPU acceleration, model loading, token streaming, and memory management for production deployment.
llama.cpp is the inference engine that runs GGUF language models on Apple hardware. It uses Metal for GPU acceleration, supports all iPhone models from the A14 (iPhone 12) onward, and generates tokens at 20-50 tokens per second depending on model size and device.
This guide covers the integration from project setup through production deployment.
Integration Options
Option 1: Swift Package (Recommended)
The llama.cpp repository includes a Swift Package that you can add directly to your Xcode project:
- In Xcode, go to File > Add Package Dependencies
- Enter the llama.cpp repository URL
- Select the version or branch you want
- Import the llama module in your Swift files
This is the simplest integration path. The package compiles llama.cpp as part of your build and exposes the C API to Swift.
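If you manage the dependency through a Package.swift instead of the Xcode UI, the declaration looks roughly like this. This is a sketch: it assumes a llama.cpp revision that ships a Swift package manifest (where the product is named llama), and you should pin a specific commit rather than track master:
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.iOS(.v15)],
    dependencies: [
        // Pin a specific revision in practice
        .package(url: "https://github.com/ggerganov/llama.cpp", branch: "master")
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [.product(name: "llama", package: "llama.cpp")]
        )
    ]
)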
Option 2: Pre-Built Framework
Build llama.cpp as an XCFramework and include it as a binary dependency. This avoids compiling the C++ source in your project:
# Build the static library
mkdir build-ios && cd build-ios
cmake .. -G Xcode \
  -DCMAKE_SYSTEM_NAME=iOS \
  -DCMAKE_OSX_DEPLOYMENT_TARGET=15.0 \
  -DLLAMA_METAL=ON \
  -DBUILD_SHARED_LIBS=OFF
cmake --build . --config Release

# Package it as an XCFramework
# (library paths and flag names vary by llama.cpp version;
# newer versions use -DGGML_METAL=ON)
xcodebuild -create-xcframework \
  -library Release-iphoneos/libllama.a \
  -output llama.xcframework
Option 3: llama.swift Wrapper
Community-maintained Swift wrappers provide a more idiomatic Swift API on top of the C bindings. These handle the bridging boilerplate and expose a cleaner interface.
Project Setup
Minimum Requirements
- iOS 15.0+ (for Metal compute shaders)
- Xcode 15+
- A physical device for testing (Simulator does not support Metal compute)
Build Settings
Add Metal framework to your project:
- Link Metal.framework and MetalKit.framework
- Set METAL_COMPILER_FLAGS if needed for custom shaders
Entitlements
No special entitlements required. llama.cpp runs in the app's normal sandbox. Memory usage is the main concern (discussed below).
Loading a Model
import llama

enum LlamaError: Error {
    case modelLoadFailed
    case contextCreationFailed
}

class LlamaEngine {
    private var model: OpaquePointer?
    private var context: OpaquePointer?

    func loadModel(at path: String) throws {
        // One-time backend setup (older llama.cpp versions take a NUMA flag)
        llama_backend_init()

        // Model parameters
        var modelParams = llama_model_default_params()
        modelParams.n_gpu_layers = 99 // Offload all layers to Metal

        // Load the GGUF file
        model = llama_load_model_from_file(path, modelParams)
        guard model != nil else {
            throw LlamaError.modelLoadFailed
        }

        // Create inference context
        var ctxParams = llama_context_default_params()
        ctxParams.n_ctx = 2048   // Context window size
        ctxParams.n_threads = 4  // CPU threads (for non-Metal ops)
        ctxParams.n_batch = 512  // Batch size for prompt processing

        context = llama_new_context_with_model(model, ctxParams)
        guard context != nil else {
            throw LlamaError.contextCreationFailed
        }
    }

    func unload() {
        if let ctx = context {
            llama_free(ctx)
            context = nil
        }
        if let mdl = model {
            llama_free_model(mdl)
            model = nil
        }
    }

    deinit {
        unload()
    }
}
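A minimal usage sketch, assuming the GGUF ships in the app bundle under the hypothetical name model.gguf:
let engine = LlamaEngine()
if let path = Bundle.main.path(forResource: "model", ofType: "gguf") {
    do {
        try engine.loadModel(at: path)
    } catch {
        // Fall back gracefully; the app should still work without the model
        print("Model load failed: \(error)")
    }
}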
Key Parameters
n_gpu_layers: Set to 99 (or the model's actual layer count) to offload everything to Metal. This is the single most important performance setting.
n_ctx: The context window size in tokens. Larger windows use more memory. 2048 is practical for most mobile use cases. 4096 if you need longer conversations.
n_threads: Number of CPU threads for operations that run on CPU. Set to the device's performance core count (typically 2-4 on iPhones).
n_batch: Tokens processed per batch during prompt evaluation. Higher values speed up prompt processing but use more memory. 512 is a good default.
Generating Text
Tokenization and Prompt Processing
extension LlamaEngine {
    func generate(
        prompt: String,
        maxTokens: Int = 256,
        temperature: Float = 0.7,
        onToken: @escaping (String) -> Void
    ) -> String {
        guard let ctx = context, let mdl = model else { return "" }

        // Tokenize the prompt
        let promptTokens = tokenize(prompt)

        // Create a batch for prompt processing. llama_batch_add is a small
        // Swift helper (as in llama.cpp's llama.swiftui example), not part
        // of the C API: it appends one token to the batch.
        var batch = llama_batch_init(Int32(promptTokens.count), 0, 1)
        for (i, token) in promptTokens.enumerated() {
            // Request logits only for the final prompt token
            llama_batch_add(&batch, token, Int32(i), [0], i == promptTokens.count - 1)
        }

        // Process the prompt
        llama_decode(ctx, batch)
        llama_batch_free(batch)

        // Generate tokens one at a time
        var output = ""
        var nCur = promptTokens.count // position of the next token in the sequence
        for _ in 0..<maxTokens {
            let logits = llama_get_logits(ctx)

            // Sample next token
            let token = sampleToken(logits: logits!, temperature: temperature)

            // Check for end of sequence
            if llama_token_is_eog(mdl, token) { break }

            // Decode token to text and stream it out
            let piece = decodeToken(token)
            output += piece
            onToken(piece)

            // Feed the sampled token back in at the next position
            // (tracking the position in tokens, not characters)
            var nextBatch = llama_batch_init(1, 0, 1)
            llama_batch_add(&nextBatch, token, Int32(nCur), [0], true)
            llama_decode(ctx, nextBatch)
            llama_batch_free(nextBatch)
            nCur += 1
        }
        return output
    }

    private func tokenize(_ text: String) -> [llama_token] {
        let maxTokens = Int32(text.utf8.count + 16)
        var tokens = [llama_token](repeating: 0, count: Int(maxTokens))
        let count = llama_tokenize(model, text, Int32(text.utf8.count),
                                   &tokens, maxTokens, true, false)
        return Array(tokens.prefix(max(0, Int(count))))
    }

    private func decodeToken(_ token: llama_token) -> String {
        var buf = [CChar](repeating: 0, count: 64)
        let len = llama_token_to_piece(model, token, &buf, 64, 0, false)
        // A negative return means the buffer was too small
        return String(cString: Array(buf.prefix(max(0, Int(len)))) + [0])
    }
}
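The sampleToken helper referenced above is not part of the llama.cpp C API, so here is a minimal sketch: temperature-scaled softmax sampling, falling back to greedy decoding near temperature zero. It assumes llama_n_vocab(model) for the vocabulary size; newer llama.cpp versions ship a llama_sampler API that replaces hand-rolled sampling like this.
import Foundation

extension LlamaEngine {
    func sampleToken(logits: UnsafeMutablePointer<Float>, temperature: Float) -> llama_token {
        let nVocab = Int(llama_n_vocab(model))

        // Greedy decoding when temperature is effectively zero
        if temperature <= 0.01 {
            var best = 0
            for i in 1..<nVocab where logits[i] > logits[best] { best = i }
            return llama_token(best)
        }

        // Temperature-scaled softmax (shifted by the max logit for stability)
        let scaled = (0..<nVocab).map { Double(logits[$0]) / Double(temperature) }
        let maxLogit = scaled.max() ?? 0
        let weights = scaled.map { exp($0 - maxLogit) }

        // Draw one token proportionally to its weight
        var r = Double.random(in: 0..<weights.reduce(0, +))
        for i in 0..<nVocab {
            r -= weights[i]
            if r <= 0 { return llama_token(i) }
        }
        return llama_token(nVocab - 1)
    }
}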
Metal GPU Acceleration
Metal acceleration is automatic when n_gpu_layers is set. llama.cpp compiles Metal shaders at first load (takes 1-2 seconds, cached afterward).
Performance Impact
| Configuration | iPhone 15 Pro, 3B Q4 | iPhone 14, 3B Q4 |
|---|---|---|
| CPU only (n_gpu_layers = 0) | 8-12 tok/s | 6-10 tok/s |
| Metal (n_gpu_layers = 99) | 18-25 tok/s | 14-18 tok/s |
Metal provides a 2x speedup on average. Always enable it for production.
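To verify acceleration is actually active, compare measured throughput with n_gpu_layers = 99 against 0 on the same device; it should roughly double. A hypothetical helper built around the LlamaEngine above:
import Foundation

// Counts streamed tokens over a short fixed-length generation
func measureTokensPerSecond(engine: LlamaEngine, prompt: String) -> Double {
    var tokenCount = 0
    let start = Date()
    _ = engine.generate(prompt: prompt, maxTokens: 64) { _ in tokenCount += 1 }
    let elapsed = Date().timeIntervalSince(start)
    return Double(tokenCount) / max(elapsed, 0.001)
}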
Metal Shader Caching
The first time llama.cpp runs on a device, it compiles Metal shaders. This adds 1-2 seconds to the first model load. Subsequent loads are instant (shaders are cached by iOS).
Memory Management
Memory Budget
iOS gives apps approximately 50-70% of total device RAM before triggering jetsam (forced termination):
| Device | Total RAM | App Budget | Available for Model |
|---|---|---|---|
| iPhone 12 (4GB) | 4GB | ~2.5GB | ~1.5GB |
| iPhone 14 (6GB) | 6GB | ~3.5GB | ~2.5GB |
| iPhone 15 Pro (8GB) | 8GB | ~5GB | ~3.5GB |
A 3B Q4 model uses ~2.2GB in RAM. On a 6GB device, this leaves ~1.3GB for your app, iOS, and other processes. Tight but workable.
Best Practices
import os // for os_proc_available_memory()

// Check available memory before loading
func canLoadModel(sizeBytes: Int) -> Bool {
    let available = os_proc_available_memory()
    // Leave 500MB headroom for the app and OS
    return available > sizeBytes + 500_000_000
}

// Handle memory warnings (e.g., override in a UIViewController)
func didReceiveMemoryWarning() {
    engine.unload()
    // Show a "Model unloaded" message and offer to reload
}
- Always check available memory before loading
- Unload the model when the AI feature is not active
- Handle didReceiveMemoryWarning by unloading the model
- Never keep the model loaded while the app is backgrounded (see the sketch after this list)
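In a SwiftUI app, one way to enforce the backgrounding rule is to watch scenePhase. A minimal sketch, where ContentView stands in for your UI and engine is the LlamaEngine from earlier:
import SwiftUI

struct RootView: View {
    @Environment(\.scenePhase) private var scenePhase
    let engine: LlamaEngine

    var body: some View {
        ContentView()
            .onChange(of: scenePhase) { phase in
                // Free the model's memory before iOS audits the suspended app
                if phase == .background {
                    engine.unload()
                }
            }
    }
}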
Model Delivery
Bundled
Add the GGUF file to your Xcode project as a resource. Access via Bundle.main:
let modelPath = Bundle.main.path(forResource: "model", ofType: "gguf")!
For models over 200MB, consider using On Demand Resources to avoid bloating the initial download.
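A sketch of the On Demand Resources path, assuming the GGUF file is assigned the resource tag "model" in Xcode:
// Keep a strong reference to the request while the resource is in use
let request = NSBundleResourceRequest(tags: ["model"])
request.beginAccessingResources { error in
    guard error == nil,
          let url = Bundle.main.url(forResource: "model", withExtension: "gguf") else {
        return
    }
    // Load the model from url.path, then call
    // request.endAccessingResources() when finished
}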
Downloaded
Download the model post-install and store in the app's Documents directory:
let documentsURL = FileManager.default
    .urls(for: .documentDirectory, in: .userDomainMask)[0]
let modelURL = documentsURL.appendingPathComponent("model.gguf")
Use URLSession background downloads for large files. Support resume on interruption.
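A minimal background-download sketch. The download URL is supplied by the caller, and production code should also verify the checksum after download and handle resume data on failure:
import Foundation

final class ModelDownloader: NSObject, URLSessionDownloadDelegate {
    private lazy var session: URLSession = {
        // Background sessions keep downloading while the app is suspended
        let config = URLSessionConfiguration.background(withIdentifier: "model-download")
        return URLSession(configuration: config, delegate: self, delegateQueue: nil)
    }()

    func start(from url: URL) {
        session.downloadTask(with: url).resume()
    }

    func urlSession(_ session: URLSession, downloadTask: URLSessionDownloadTask,
                    didFinishDownloadingTo location: URL) {
        // Move the file out of the temp location before this call returns
        let documentsURL = FileManager.default
            .urls(for: .documentDirectory, in: .userDomainMask)[0]
        try? FileManager.default.moveItem(at: location,
                                          to: documentsURL.appendingPathComponent("model.gguf"))
    }
}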
Production Checklist
- Model loads without crashing on target devices (test on lowest-RAM target)
- Metal acceleration is enabled (verify with performance logging)
- Memory warning handler unloads the model gracefully
- Model file integrity is verified after download (SHA256)
- Streaming tokens display smoothly in the UI
- Generation can be cancelled by the user (interrupt the generation loop; see the sketch after this list)
- Model is unloaded on background transition
- App functions normally when model is not loaded (graceful fallback)
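Cancellation can be as simple as a thread-safe flag that generate checks at the top of its token loop; a minimal sketch:
import Foundation

// Pass an instance into generate and check it before sampling each token:
//     if flag.isCancelled { break }
final class CancellationFlag {
    private let lock = NSLock()
    private var value = false

    func cancel() {
        lock.lock(); value = true; lock.unlock()
    }

    var isCancelled: Bool {
        lock.lock(); defer { lock.unlock() }
        return value
    }
}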
The fine-tuned GGUF model is the critical ingredient. A base model generates generic responses. A model fine-tuned on your domain data (via a platform like Ertas) generates responses that match your app's purpose and style. The llama.cpp integration is the same either way.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.