    llama.cpp on iOS: A Swift Integration Guide


    Step-by-step guide to integrating llama.cpp into an iOS app. Project setup, Metal GPU acceleration, model loading, token streaming, and memory management for production deployment.

    Ertas Team

    llama.cpp is the inference engine that runs GGUF language models on Apple hardware. It uses Metal for GPU acceleration, supports all iPhone models from the A14 (iPhone 12) onward, and generates tokens at 20-50 tokens per second depending on model size and device.

    This guide covers the integration from project setup through production deployment.

    Integration Options

    Option 1: Swift Package

    The llama.cpp repository includes a Swift Package that you can add directly to your Xcode project:

    1. In Xcode, go to File > Add Package Dependencies
    2. Enter the llama.cpp repository URL
    3. Select the version or branch you want
    4. Import the llama module in your Swift files

    This is the simplest integration path. The package compiles llama.cpp as part of your build and exposes the C API to Swift.
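If your project uses a Package.swift manifest, the same dependency can be declared there. This fragment is a sketch: the branch pin is illustrative (pin a release you have validated), and "MyApp" is a hypothetical target name.

```swift
// Package.swift fragment -- branch pin and target name are placeholders
dependencies: [
    .package(url: "https://github.com/ggerganov/llama.cpp", branch: "master")
],
targets: [
    .executableTarget(
        name: "MyApp", // hypothetical target name
        dependencies: [.product(name: "llama", package: "llama.cpp")]
    )
]
```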

    Option 2: Pre-Built Framework

    Build llama.cpp as an XCFramework and include it as a binary dependency. This avoids compiling the C++ source in your project:

    # Build the framework
    mkdir build-ios && cd build-ios
    # Note: newer llama.cpp releases renamed LLAMA_METAL to GGML_METAL
    cmake .. -G Xcode \
        -DCMAKE_SYSTEM_NAME=iOS \
        -DCMAKE_OSX_DEPLOYMENT_TARGET=15.0 \
        -DLLAMA_METAL=ON \
        -DBUILD_SHARED_LIBS=OFF
    cmake --build . --config Release
    # Package the built library as an XCFramework
    # (xcodebuild -create-xcframework) before adding it to the app
    

    Option 3: llama.swift Wrapper

    Community-maintained Swift wrappers provide a more idiomatic Swift API on top of the C bindings. These handle the bridging boilerplate and expose a cleaner interface.

    Project Setup

    Minimum Requirements

    • iOS 15.0+ (for Metal compute shaders)
    • Xcode 15+
    • A physical device for testing (Simulator does not support Metal compute)

    Build Settings

    Add Metal framework to your project:

    • Link Metal.framework and MetalKit.framework
    • Set METAL_COMPILER_FLAGS if needed for custom shaders

    Entitlements

    No special entitlements required. llama.cpp runs in the app's normal sandbox. Memory usage is the main concern (discussed below).

    Loading a Model

    import llama
    
    class LlamaEngine {
        private var model: OpaquePointer?
        private var context: OpaquePointer?
    
        func loadModel(at path: String) throws {
            // Initialize the llama backend once per process
            llama_backend_init()

            // Model parameters
            var modelParams = llama_model_default_params()
            modelParams.n_gpu_layers = 99 // Offload all layers to Metal
    
            // Load the GGUF file
            model = llama_load_model_from_file(path, modelParams)
            guard model != nil else {
                throw LlamaError.modelLoadFailed
            }
    
            // Create inference context
            var ctxParams = llama_context_default_params()
            ctxParams.n_ctx = 2048      // Context window size
            ctxParams.n_threads = 4     // CPU threads (for non-Metal ops)
            ctxParams.n_batch = 512     // Batch size for prompt processing
    
            context = llama_new_context_with_model(model, ctxParams)
            guard context != nil else {
                throw LlamaError.contextCreationFailed
            }
        }
    
        func unload() {
            if let ctx = context {
                llama_free(ctx)
                context = nil
            }
            if let mdl = model {
                llama_free_model(mdl)
                model = nil
            }
        }
    
        deinit {
            unload()
        }
    }
    

    Key Parameters

    n_gpu_layers: Set to 99 (or the model's actual layer count) to offload everything to Metal. This is the single most important performance setting.

    n_ctx: The context window size in tokens. Larger windows use more memory. 2048 is practical for most mobile use cases. 4096 if you need longer conversations.

    n_threads: Number of CPU threads for operations that run on CPU. Set to the device's performance core count (typically 2-4 on iPhones).

    n_batch: Tokens processed per batch during prompt evaluation. Higher values speed up prompt processing but use more memory. 512 is a good default.
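The n_threads guidance above can be sketched as a small heuristic. The clamp values here are assumptions based on typical iPhone core counts, not llama.cpp recommendations.

```swift
import Foundation

// Derive n_threads from the active core count. iPhones typically
// report 6 cores (2 performance + 4 efficiency); using roughly half,
// clamped to 2-4, approximates the performance-core count.
func recommendedThreadCount() -> Int32 {
    let cores = ProcessInfo.processInfo.activeProcessorCount
    return Int32(min(max(cores / 2, 2), 4))
}
```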

    Generating Text

    Tokenization and Prompt Processing

    extension LlamaEngine {
        func generate(
            prompt: String,
            maxTokens: Int = 256,
            temperature: Float = 0.7,
            onToken: @escaping (String) -> Void
        ) -> String {
            guard let ctx = context, let mdl = model else { return "" }
    
            // Tokenize the prompt
            let promptTokens = tokenize(prompt)

            // Create a batch for prompt processing
            // (llama_batch_add and sampleToken are thin helpers over the
            // raw llama_batch struct and logits array)
            var batch = llama_batch_init(Int32(promptTokens.count), 0, 1)
            for (i, token) in promptTokens.enumerated() {
                llama_batch_add(&batch, token, Int32(i), [0], i == promptTokens.count - 1)
            }

            // Process the prompt
            llama_decode(ctx, batch)
            llama_batch_free(batch)

            // Generate tokens, tracking the position in tokens
            // (output.count counts characters and would drift)
            var output = ""
            var nPos = promptTokens.count
            for _ in 0..<maxTokens {
                let logits = llama_get_logits(ctx)

                // Sample next token
                let token = sampleToken(logits: logits!, temperature: temperature)

                // Check for end of sequence
                if llama_token_is_eog(mdl, token) { break }

                // Decode token to string
                let piece = decodeToken(token)
                output += piece
                onToken(piece)

                // Feed the sampled token back in at the next position
                var nextBatch = llama_batch_init(1, 0, 1)
                llama_batch_add(&nextBatch, token, Int32(nPos), [0], true)
                llama_decode(ctx, nextBatch)
                llama_batch_free(nextBatch)
                nPos += 1
            }
    
            return output
        }
    
        private func tokenize(_ text: String) -> [llama_token] {
            let maxTokens = Int32(text.utf8.count + 16)
            var tokens = [llama_token](repeating: 0, count: Int(maxTokens))
            let count = llama_tokenize(model, text, Int32(text.utf8.count),
                                       &tokens, maxTokens, true, false)
            return Array(tokens.prefix(Int(count)))
        }
    
        private func decodeToken(_ token: llama_token) -> String {
            var buf = [CChar](repeating: 0, count: 64)
            let len = llama_token_to_piece(model, token, &buf, 64, 0, false)
            return String(cString: Array(buf.prefix(Int(len))) + [0])
        }
    }
    

    Metal GPU Acceleration

    Metal acceleration is automatic when n_gpu_layers is set. llama.cpp compiles Metal shaders at first load (takes 1-2 seconds, cached afterward).

    Performance Impact

    Configuration                  iPhone 15 Pro, 3B Q4   iPhone 14, 3B Q4
    CPU only (n_gpu_layers = 0)    8-12 tok/s             6-10 tok/s
    Metal (n_gpu_layers = 99)      18-25 tok/s            14-18 tok/s

    Metal provides a 2x speedup on average. Always enable it for production.

    Metal Shader Caching

    The first time llama.cpp runs on a device, it compiles Metal shaders. This adds 1-2 seconds to the first model load. Subsequent loads are instant (shaders are cached by iOS).
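A defensive capability check before enabling GPU offload can look like the sketch below. The Apple7 GPU family corresponds to the A14 generation mentioned at the top of this guide; the Simulator is excluded because llama.cpp's Metal compute path does not run there.

```swift
import Metal

// Returns true when the GPU meets the A14-or-newer baseline
// required for llama.cpp's Metal backend.
func metalIsAvailable() -> Bool {
    #if targetEnvironment(simulator)
    return false // Metal compute is unsupported on the Simulator
    #else
    guard let device = MTLCreateSystemDefaultDevice() else { return false }
    return device.supportsFamily(.apple7) // .apple7 == A14 generation
    #endif
}
```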

    Memory Management

    Memory Budget

    iOS gives apps approximately 50-70% of total device RAM before triggering jetsam (forced termination):

    Device                 Total RAM   App Budget   Available for Model
    iPhone 12 (4GB)        4GB         ~2.5GB       ~1.5GB
    iPhone 14 (6GB)        6GB         ~3.5GB       ~2.5GB
    iPhone 15 Pro (8GB)    8GB         ~5GB         ~3.5GB

    A 3B Q4 model uses ~2.2GB in RAM. On a 6GB device, this leaves ~1.3GB for your app, iOS, and other processes. Tight but workable.

    Best Practices

    import os // exposes os_proc_available_memory() (iOS 13+)

    // Check available memory before loading
    func canLoadModel(sizeBytes: Int) -> Bool {
        let available = os_proc_available_memory()
        // Leave 500MB headroom for app and OS
        return available > sizeBytes + 500_000_000
    }
    
    // Handle memory warnings
    func didReceiveMemoryWarning() {
        engine.unload()
        // Show "Model unloaded" message, offer to reload
    }
    
    • Always check available memory before loading
    • Unload the model when the AI feature is not active
    • Handle didReceiveMemoryWarning by unloading the model
    • Never keep the model loaded while the app is backgrounded
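One way to wire up the last two bullets, assuming the LlamaEngine class from earlier. The observer wrapper here is a sketch, not the only option.

```swift
import UIKit

// Unloads the model whenever the app leaves the foreground, so the
// ~2GB weights are freed before iOS measures the app's footprint.
final class ModelLifecycleObserver {
    private var token: NSObjectProtocol?

    init(engine: LlamaEngine) {
        token = NotificationCenter.default.addObserver(
            forName: UIApplication.didEnterBackgroundNotification,
            object: nil,
            queue: .main
        ) { [weak engine] _ in
            engine?.unload()
        }
    }

    deinit {
        if let token { NotificationCenter.default.removeObserver(token) }
    }
}
```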

    Model Delivery

    Bundled

    Add the GGUF file to your Xcode project as a resource. Access via Bundle.main:

    let modelPath = Bundle.main.path(forResource: "model", ofType: "gguf")!
    

    For models over 200MB, consider using On Demand Resources to avoid bloating the initial download.
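A minimal On Demand Resources sketch; "ModelPack" is a hypothetical tag you would assign to the GGUF resource in Xcode.

```swift
import Foundation

// Fetch a GGUF shipped as an On Demand Resource, then resolve its
// URL from the main bundle once the resources are available.
func fetchBundledModel(completion: @escaping (URL?) -> Void) {
    let request = NSBundleResourceRequest(tags: ["ModelPack"]) // hypothetical tag
    request.loadingPriority = NSBundleResourceRequestLoadingPriorityUrgent
    request.beginAccessingResources { error in
        guard error == nil,
              let url = Bundle.main.url(forResource: "model", withExtension: "gguf")
        else { completion(nil); return }
        completion(url)
    }
}
```

Note that ODR resources can be purged by iOS; keep the NSBundleResourceRequest alive while you need the file, or copy it somewhere you control.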

    Downloaded

    Download the model post-install and store in the app's Documents directory:

    let documentsURL = FileManager.default
        .urls(for: .documentDirectory, in: .userDomainMask)[0]
    let modelURL = documentsURL.appendingPathComponent("model.gguf")
    

    Use URLSession background downloads for large files. Support resume on interruption.
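A background-download sketch under those guidelines. The session identifier is a placeholder, and progress/resume-data delegate callbacks are omitted for brevity (resume would use downloadTask(withResumeData:)).

```swift
import Foundation

// Downloads a large model file via a background URLSession and moves
// it into Documents when the transfer finishes.
final class ModelDownloader: NSObject, URLSessionDownloadDelegate {
    private lazy var session: URLSession = {
        let config = URLSessionConfiguration.background(
            withIdentifier: "com.example.model-download") // placeholder ID
        config.isDiscretionary = false
        return URLSession(configuration: config, delegate: self, delegateQueue: nil)
    }()

    func start(from url: URL) {
        session.downloadTask(with: url).resume()
    }

    func urlSession(_ session: URLSession,
                    downloadTask: URLSessionDownloadTask,
                    didFinishDownloadingTo location: URL) {
        // The temp file must be moved before this method returns.
        let dest = FileManager.default
            .urls(for: .documentDirectory, in: .userDomainMask)[0]
            .appendingPathComponent("model.gguf")
        try? FileManager.default.moveItem(at: location, to: dest)
    }
}
```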

    Production Checklist

    1. Model loads without crashing on target devices (test on lowest-RAM target)
    2. Metal acceleration is enabled (verify with performance logging)
    3. Memory warning handler unloads the model gracefully
    4. Model file integrity is verified after download (SHA256)
    5. Streaming tokens display smoothly in the UI
    6. Generation can be cancelled by the user (interrupt the generation loop)
    7. Model is unloaded on background transition
    8. App functions normally when model is not loaded (graceful fallback)
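Item 6 can be implemented with a thread-safe flag checked on each iteration of the token loop, a minimal sketch:

```swift
import Foundation

// Cancellation flag set from the UI thread and polled by the
// generation loop. NSLock keeps the flag safe across threads.
final class GenerationController {
    private let lock = NSLock()
    private var _cancelled = false

    var isCancelled: Bool {
        lock.lock(); defer { lock.unlock() }
        return _cancelled
    }

    func cancel() {
        lock.lock(); _cancelled = true; lock.unlock()
    }
}

// Inside the generation loop:
// for _ in 0..<maxTokens {
//     if controller.isCancelled { break }
//     // ... sample and decode the next token ...
// }
```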

    The fine-tuned GGUF model is the critical ingredient. A base model generates generic responses. A model fine-tuned on your domain data (via a platform like Ertas) generates responses that match your app's purpose and style. The llama.cpp integration is the same either way.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

