    llama.cpp on iOS: A Swift Integration Guide


    Step-by-step guide to integrating llama.cpp into an iOS app. Project setup, Metal GPU acceleration, model loading, token streaming, and memory management for production deployment.

    Ertas Team

    llama.cpp is the inference engine that runs GGUF language models on Apple hardware. It uses Metal for GPU acceleration, supports all iPhone models from the A14 (iPhone 12) onward, and generates tokens at 20-50 tokens per second depending on model size and device.

    This guide covers the integration from project setup through production deployment.

    Integration Options

    Option 1: Swift Package

    The llama.cpp repository includes a Swift Package that you can add directly to your Xcode project:

    1. In Xcode, go to File > Add Package Dependencies
    2. Enter the llama.cpp repository URL
    3. Select the version or branch you want
    4. Import the llama module in your Swift files

    This is the simplest integration path. The package compiles llama.cpp as part of your build and exposes the C API to Swift.
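If your project uses a Package.swift manifest, the same dependency can be declared there. This fragment is a sketch: the branch pin is illustrative (pin a release you have validated), and "MyApp" is a hypothetical target name.

```swift
// Package.swift fragment -- branch pin and target name are placeholders
dependencies: [
    .package(url: "https://github.com/ggerganov/llama.cpp", branch: "master")
],
targets: [
    .executableTarget(
        name: "MyApp", // hypothetical target name
        dependencies: [.product(name: "llama", package: "llama.cpp")]
    )
]
```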

    Option 2: Pre-Built Framework

    Build llama.cpp as an XCFramework and include it as a binary dependency. This avoids compiling the C++ source in your project:

    # Build the framework
    mkdir build-ios && cd build-ios
    # Note: newer llama.cpp releases renamed LLAMA_METAL to GGML_METAL
    cmake .. -G Xcode \
        -DCMAKE_SYSTEM_NAME=iOS \
        -DCMAKE_OSX_DEPLOYMENT_TARGET=15.0 \
        -DLLAMA_METAL=ON \
        -DBUILD_SHARED_LIBS=OFF
    cmake --build . --config Release
    # Package the built library as an XCFramework
    # (xcodebuild -create-xcframework) before adding it to the app
    

    Option 3: llama.swift Wrapper

    Community-maintained Swift wrappers provide a more idiomatic Swift API on top of the C bindings. These handle the bridging boilerplate and expose a cleaner interface.

    Project Setup

    Minimum Requirements

    • iOS 15.0+ (for Metal compute shaders)
    • Xcode 15+
    • A physical device for testing (Simulator does not support Metal compute)

    Build Settings

    Add Metal framework to your project:

    • Link Metal.framework and MetalKit.framework
    • Set METAL_COMPILER_FLAGS if needed for custom shaders

    Entitlements

    No special entitlements required. llama.cpp runs in the app's normal sandbox. Memory usage is the main concern (discussed below).

    Loading a Model

    import llama
    
    class LlamaEngine {
        private var model: OpaquePointer?
        private var context: OpaquePointer?
    
        func loadModel(at path: String) throws {
            // Initialize the llama backend once per process
            llama_backend_init()

            // Model parameters
            var modelParams = llama_model_default_params()
            modelParams.n_gpu_layers = 99 // Offload all layers to Metal
    
            // Load the GGUF file
            model = llama_load_model_from_file(path, modelParams)
            guard model != nil else {
                throw LlamaError.modelLoadFailed
            }
    
            // Create inference context
            var ctxParams = llama_context_default_params()
            ctxParams.n_ctx = 2048      // Context window size
            ctxParams.n_threads = 4     // CPU threads (for non-Metal ops)
            ctxParams.n_batch = 512     // Batch size for prompt processing
    
            context = llama_new_context_with_model(model, ctxParams)
            guard context != nil else {
                throw LlamaError.contextCreationFailed
            }
        }
    
        func unload() {
            if let ctx = context {
                llama_free(ctx)
                context = nil
            }
            if let mdl = model {
                llama_free_model(mdl)
                model = nil
            }
        }
    
        deinit {
            unload()
        }
    }
    

    Key Parameters

    n_gpu_layers: Set to 99 (or the model's actual layer count) to offload everything to Metal. This is the single most important performance setting.

    n_ctx: The context window size in tokens. Larger windows use more memory. 2048 is practical for most mobile use cases. 4096 if you need longer conversations.

    n_threads: Number of CPU threads for operations that run on CPU. Set to the device's performance core count (typically 2-4 on iPhones).

    n_batch: Tokens processed per batch during prompt evaluation. Higher values speed up prompt processing but use more memory. 512 is a good default.
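The n_threads guidance above can be sketched as a small heuristic. The clamp values here are assumptions based on typical iPhone core counts, not llama.cpp recommendations.

```swift
import Foundation

// Derive n_threads from the active core count. iPhones typically
// report 6 cores (2 performance + 4 efficiency); using roughly half,
// clamped to 2-4, approximates the performance-core count.
func recommendedThreadCount() -> Int32 {
    let cores = ProcessInfo.processInfo.activeProcessorCount
    return Int32(min(max(cores / 2, 2), 4))
}
```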

    Generating Text

    Tokenization and Prompt Processing

    extension LlamaEngine {
        func generate(
            prompt: String,
            maxTokens: Int = 256,
            temperature: Float = 0.7,
            onToken: @escaping (String) -> Void
        ) -> String {
            guard let ctx = context, let mdl = model else { return "" }
    
            // Tokenize the prompt
            let promptTokens = tokenize(prompt)

            // Create a batch for prompt processing
            // (llama_batch_add and sampleToken are thin helpers over the
            // raw llama_batch struct and logits array)
            var batch = llama_batch_init(Int32(promptTokens.count), 0, 1)
            for (i, token) in promptTokens.enumerated() {
                llama_batch_add(&batch, token, Int32(i), [0], i == promptTokens.count - 1)
            }

            // Process the prompt
            llama_decode(ctx, batch)
            llama_batch_free(batch)

            // Generate tokens, tracking the position in tokens
            // (output.count counts characters and would drift)
            var output = ""
            var nPos = promptTokens.count
            for _ in 0..<maxTokens {
                let logits = llama_get_logits(ctx)

                // Sample next token
                let token = sampleToken(logits: logits!, temperature: temperature)

                // Check for end of sequence
                if llama_token_is_eog(mdl, token) { break }

                // Decode token to string
                let piece = decodeToken(token)
                output += piece
                onToken(piece)

                // Feed the sampled token back in at the next position
                var nextBatch = llama_batch_init(1, 0, 1)
                llama_batch_add(&nextBatch, token, Int32(nPos), [0], true)
                llama_decode(ctx, nextBatch)
                llama_batch_free(nextBatch)
                nPos += 1
            }
    
            return output
        }
    
        private func tokenize(_ text: String) -> [llama_token] {
            let maxTokens = Int32(text.utf8.count + 16)
            var tokens = [llama_token](repeating: 0, count: Int(maxTokens))
            let count = llama_tokenize(model, text, Int32(text.utf8.count),
                                       &tokens, maxTokens, true, false)
            return Array(tokens.prefix(Int(count)))
        }
    
        private func decodeToken(_ token: llama_token) -> String {
            var buf = [CChar](repeating: 0, count: 64)
            let len = llama_token_to_piece(model, token, &buf, 64, 0, false)
            return String(cString: Array(buf.prefix(Int(len))) + [0])
        }
    }
    

    Metal GPU Acceleration

    Metal acceleration is automatic when n_gpu_layers is set. llama.cpp compiles Metal shaders at first load (takes 1-2 seconds, cached afterward).

    Performance Impact

    Configuration                  iPhone 15 Pro, 3B Q4   iPhone 14, 3B Q4
    CPU only (n_gpu_layers = 0)    8-12 tok/s             6-10 tok/s
    Metal (n_gpu_layers = 99)      18-25 tok/s            14-18 tok/s

    Metal provides a 2x speedup on average. Always enable it for production.

    Metal Shader Caching

    The first time llama.cpp runs on a device, it compiles Metal shaders. This adds 1-2 seconds to the first model load. Subsequent loads are instant (shaders are cached by iOS).
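A defensive capability check before enabling GPU offload can look like the sketch below. The Apple7 GPU family corresponds to the A14 generation mentioned at the top of this guide; the Simulator is excluded because llama.cpp's Metal compute path does not run there.

```swift
import Metal

// Returns true when the GPU meets the A14-or-newer baseline
// required for llama.cpp's Metal backend.
func metalIsAvailable() -> Bool {
    #if targetEnvironment(simulator)
    return false // Metal compute is unsupported on the Simulator
    #else
    guard let device = MTLCreateSystemDefaultDevice() else { return false }
    return device.supportsFamily(.apple7) // .apple7 == A14 generation
    #endif
}
```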

    Memory Management

    Memory Budget

    iOS gives apps approximately 50-70% of total device RAM before triggering jetsam (forced termination):

    Device                 Total RAM   App Budget   Available for Model
    iPhone 12 (4GB)        4GB         ~2.5GB       ~1.5GB
    iPhone 14 (6GB)        6GB         ~3.5GB       ~2.5GB
    iPhone 15 Pro (8GB)    8GB         ~5GB         ~3.5GB

    A 3B Q4 model uses ~2.2GB in RAM. On a 6GB device, this leaves ~1.3GB for your app, iOS, and other processes. Tight but workable.

    Best Practices

    import os // exposes os_proc_available_memory() (iOS 13+)

    // Check available memory before loading
    func canLoadModel(sizeBytes: Int) -> Bool {
        let available = os_proc_available_memory()
        // Leave 500MB headroom for app and OS
        return available > sizeBytes + 500_000_000
    }
    
    // Handle memory warnings
    func didReceiveMemoryWarning() {
        engine.unload()
        // Show "Model unloaded" message, offer to reload
    }
    
    • Always check available memory before loading
    • Unload the model when the AI feature is not active
    • Handle didReceiveMemoryWarning by unloading the model
    • Never keep the model loaded while the app is backgrounded
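One way to wire up the last two bullets, assuming the LlamaEngine class from earlier. The observer wrapper here is a sketch, not the only option.

```swift
import UIKit

// Unloads the model whenever the app leaves the foreground, so the
// ~2GB weights are freed before iOS measures the app's footprint.
final class ModelLifecycleObserver {
    private var token: NSObjectProtocol?

    init(engine: LlamaEngine) {
        token = NotificationCenter.default.addObserver(
            forName: UIApplication.didEnterBackgroundNotification,
            object: nil,
            queue: .main
        ) { [weak engine] _ in
            engine?.unload()
        }
    }

    deinit {
        if let token { NotificationCenter.default.removeObserver(token) }
    }
}
```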

    Model Delivery

    Bundled

    Add the GGUF file to your Xcode project as a resource. Access via Bundle.main:

    let modelPath = Bundle.main.path(forResource: "model", ofType: "gguf")!
    

    For models over 200MB, consider using On Demand Resources to avoid bloating the initial download.
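A minimal On Demand Resources sketch; "ModelPack" is a hypothetical tag you would assign to the GGUF resource in Xcode.

```swift
import Foundation

// Fetch a GGUF shipped as an On Demand Resource, then resolve its
// URL from the main bundle once the resources are available.
func fetchBundledModel(completion: @escaping (URL?) -> Void) {
    let request = NSBundleResourceRequest(tags: ["ModelPack"]) // hypothetical tag
    request.loadingPriority = NSBundleResourceRequestLoadingPriorityUrgent
    request.beginAccessingResources { error in
        guard error == nil,
              let url = Bundle.main.url(forResource: "model", withExtension: "gguf")
        else { completion(nil); return }
        completion(url)
    }
}
```

Note that ODR resources can be purged by iOS; keep the NSBundleResourceRequest alive while you need the file, or copy it somewhere you control.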

    Downloaded

    Download the model post-install and store in the app's Documents directory:

    let documentsURL = FileManager.default
        .urls(for: .documentDirectory, in: .userDomainMask)[0]
    let modelURL = documentsURL.appendingPathComponent("model.gguf")
    

    Use URLSession background downloads for large files. Support resume on interruption.
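A background-download sketch under those guidelines. The session identifier is a placeholder, and progress/resume-data delegate callbacks are omitted for brevity (resume would use downloadTask(withResumeData:)).

```swift
import Foundation

// Downloads a large model file via a background URLSession and moves
// it into Documents when the transfer finishes.
final class ModelDownloader: NSObject, URLSessionDownloadDelegate {
    private lazy var session: URLSession = {
        let config = URLSessionConfiguration.background(
            withIdentifier: "com.example.model-download") // placeholder ID
        config.isDiscretionary = false
        return URLSession(configuration: config, delegate: self, delegateQueue: nil)
    }()

    func start(from url: URL) {
        session.downloadTask(with: url).resume()
    }

    func urlSession(_ session: URLSession,
                    downloadTask: URLSessionDownloadTask,
                    didFinishDownloadingTo location: URL) {
        // The temp file must be moved before this method returns.
        let dest = FileManager.default
            .urls(for: .documentDirectory, in: .userDomainMask)[0]
            .appendingPathComponent("model.gguf")
        try? FileManager.default.moveItem(at: location, to: dest)
    }
}
```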

    Production Checklist

    1. Model loads without crashing on target devices (test on lowest-RAM target)
    2. Metal acceleration is enabled (verify with performance logging)
    3. Memory warning handler unloads the model gracefully
    4. Model file integrity is verified after download (SHA256)
    5. Streaming tokens display smoothly in the UI
    6. Generation can be cancelled by the user (interrupt the generation loop)
    7. Model is unloaded on background transition
    8. App functions normally when model is not loaded (graceful fallback)
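Item 6 can be implemented with a thread-safe flag checked on each iteration of the token loop, a minimal sketch:

```swift
import Foundation

// Cancellation flag set from the UI thread and polled by the
// generation loop. NSLock keeps the flag safe across threads.
final class GenerationController {
    private let lock = NSLock()
    private var _cancelled = false

    var isCancelled: Bool {
        lock.lock(); defer { lock.unlock() }
        return _cancelled
    }

    func cancel() {
        lock.lock(); _cancelled = true; lock.unlock()
    }
}

// Inside the generation loop:
// for _ in 0..<maxTokens {
//     if controller.isCancelled { break }
//     // ... sample and decode the next token ...
// }
```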

    The fine-tuned GGUF model is the critical ingredient. A base model generates generic responses. A model fine-tuned on your domain data (via a platform like Ertas) generates responses that match your app's purpose and style. The llama.cpp integration is the same either way.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

