iOS

Integrate an Ertas-trained GGUF into an iOS app via Flutter, native Swift, or React Native, plus the App Store specifics that bite first-time shippers.

iOS is the most opinionated platform to ship an on-device model into. Apple Silicon gives you Metal-accelerated inference for free, but you trade that for a tighter set of build, signing, and review requirements than any other platform. The good news: once you have one fine-tune through review, the next one is mostly mechanical.

This page covers the integration paths (Flutter via llamadart, native Swift, React Native), iOS-specific build requirements, the privacy manifest, and the App Store review surprises that tend to catch a first submission. For cross-platform download UX, see Model delivery and UX. For the GGUF bundle itself, see GGUF overview.

Integration paths

Flutter + llamadart (the reference path)

llamadart gives you a Flutter FFI binding to llama.cpp that ships prebuilt binaries for iOS arm64 (and Android). This is the integration the Ertas reference POC uses on iOS, with the same Dart code running unchanged on both platforms.

# pubspec.yaml
dependencies:
  llamadart: ^0.6.9
  path_provider: ^2.1.5
  http: ^1.2.2

import 'package:llamadart/llamadart.dart';

class LlamaService {
  LlamaEngine? _engine;
  ChatSession? _session;
  bool _isLoaded = false;

  /// Whether a model is currently loaded; check this before generate().
  bool get isLoaded => _isLoaded;

  Future<void> loadModel(String modelPath) async {
    _engine = LlamaEngine(LlamaBackend());
    await _engine!.loadModel(modelPath);
    _session = ChatSession(_engine!);
    _isLoaded = true;
  }

  Future<String> generate(String prompt, {int maxTokens = 256}) async {
    if (!_isLoaded) throw StateError('Model not loaded');
    try {
      final response = await _session!.generate(prompt, maxTokens: maxTokens);
      return response.trim();
    } finally {
      // Critical, see below. In finally so it also runs if generate() throws.
      _session!.reset();
    }
  }

  void dispose() {
    _engine?.dispose(); // frees native memory the Dart GC cannot see
    _engine = null;
    _session = null;
    _isLoaded = false;
  }
}

Call session.reset() after every generate(). Without it, prior prompts and outputs accumulate in the session context and contaminate the next generation. This is the single most common bug when integrating llamadart, and the failure mode is subtle: outputs get progressively worse instead of erroring.

A few llamadart-specific iOS considerations:

A minimum deployment target of iOS 16.4 is required for the Metal GPU backend in llama.cpp builds. Set this in both ios/Podfile (platform :ios, '16.4') and Xcode's Deployment Target. On older iOS versions inference falls back to the CPU, which is dramatically slower than Metal on Apple Silicon.
The flutter build ios pipeline currently hardcodes MinimumOSVersion: 12.0 on native asset frameworks like llamadart.framework. App Store Connect rejects the build with ITMS-90208. The fix is to patch flutter_tools/lib/src/isolated/native_assets/ios/native_assets.dart (lines 14 and 131 of the relevant function) to 16 / '16.4', then delete flutter_tools.snapshot and .stamp so the patch takes effect. This must be re-applied after every flutter upgrade.
Native memory is invisible to Dart's garbage collector. A loaded Q4_K_M model holds native RAM roughly equal to the GGUF file size (around 700 MB for a 1B model), plus an extra 50 to 200 MB for the KV cache at typical context lengths. The Dart GC cannot see any of this. Skipping dispose() leaks the entire allocation. Wire the call into your app lifecycle.
Always check isLoaded before each generate(). iOS can kill the app process while backgrounded; on resume, the Dart object exists but the native pointer is dead. Reload defensively rather than letting the first post-resume call crash.
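The last two points (dispose on lifecycle changes, reload defensively) can be combined in a thin wrapper. The sketch below is one possible shape, not llamadart API: ModelRuntime is a hypothetical interface you would implement with your own service class (for example the LlamaService above), GuardedModel and release() are illustrative names, and FakeRuntime exists only to exercise the wrapper without a real model.

```dart
/// Hypothetical minimal interface; not part of llamadart.
abstract class ModelRuntime {
  bool get isLoaded;
  Future<void> loadModel(String modelPath);
  Future<String> generate(String prompt, {int maxTokens = 256});
  void dispose();
}

/// Reloads the model on demand and lets the app release it explicitly.
class GuardedModel {
  GuardedModel(this._runtime, this._modelPath);

  final ModelRuntime _runtime;
  final String _modelPath;

  /// Generates text, loading the model first if it is not loaded.
  Future<String> generate(String prompt, {int maxTokens = 256}) async {
    if (!_runtime.isLoaded) {
      await _runtime.loadModel(_modelPath);
    }
    return _runtime.generate(prompt, maxTokens: maxTokens);
  }

  /// Call from your lifecycle hook when the app is backgrounded or torn down.
  void release() {
    if (_runtime.isLoaded) _runtime.dispose();
  }
}

/// In-memory stand-in used only to exercise the wrapper.
class FakeRuntime implements ModelRuntime {
  int loadCount = 0;
  bool _loaded = false;

  @override
  bool get isLoaded => _loaded;

  @override
  Future<void> loadModel(String modelPath) async {
    loadCount++;
    _loaded = true;
  }

  @override
  Future<String> generate(String prompt, {int maxTokens = 256}) async {
    if (!_loaded) throw StateError('Model not loaded');
    return 'echo: $prompt';
  }

  @override
  void dispose() {
    _loaded = false;
  }
}

Future<void> main() async {
  final runtime = FakeRuntime();
  final model = GuardedModel(runtime, '/path/to/model.gguf');

  await model.generate('first'); // loads, then generates
  model.release(); // e.g. the app moved to the background
  await model.generate('second'); // reloads before generating
  print('loads: ${runtime.loadCount}');
}
```

In a Flutter app, release() would typically be called from a lifecycle observer when the app is paused or detached. The trade-off is a reload delay on the first generation after resume in exchange for not holding several hundred megabytes of native memory in the background.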

Native Swift (llama.cpp as an XCFramework)

For native iOS apps without a cross-platform framework, build llama.cpp as an XCFramework and import it via Swift Package Manager or CocoaPods.

The most accessible reference implementation is llama.cpp's own SwiftUI example in the repo's examples/llama.swiftui directory. It demonstrates loading a GGUF, running inference, and streaming tokens; you can fork or adapt it directly into your project.

Beyond the official example, third-party packages on Swift Package Index (search for llama-cpp or llama.cpp) wrap llama.cpp at higher levels; pick one that is actively maintained and ships prebuilt XCFramework binaries.

The minimum interface you need:

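import Foundation
import Llama // or your chosen package; type and method names vary by wrapper

// Avoid force-unwrapping: a missing resource should fail with a clear message.
guard let modelPath = Bundle.main.path(forResource: "model", ofType: "gguf") else {
    fatalError("model.gguf is not in the app bundle")
}
let model = try LlamaModel(modelPath: modelPath)
let context = try model.createContext()
let response = try context.complete(prompt: "Hello.", maxTokens: 64)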

Apple's Metal backend is enabled by default in current llama.cpp builds, so a 3B-class model on an iPhone 14 or newer typically generates at 15 to 30 tokens per second (varies by chip and model).

React Native

For React Native apps, llama.rn wraps llama.cpp via the JSI bridge and loads model.gguf directly without conversion. It ships a prebuilt rnllama.xcframework with Metal acceleration. The iOS-side configuration is similar to llamadart's (min iOS 16.4, code signing for the bundled framework, explicit dispose to release native memory), with two extra constraints worth knowing: the React Native New Architecture is required (starting with llama.rn v0.10), and Metal acceleration is unavailable in the iOS simulator (it requires an Apple7 GPU and a physical device). Other React Native llama.cpp wrappers exist, but llama.rn is the most actively maintained at the time of writing.

MLC LLM Swift

MLC LLM is an alternative inference engine that targets TVM-compiled kernels. It produces fast inference on Apple Silicon but does not load GGUF directly; models must be re-quantised to MLC's own format. Useful if your team already has an MLC pipeline; otherwise the GGUF + llama.cpp path is simpler.

iOS build and signing

Prerequisites

Apple Developer Program membership ($99 per year).
A Mac with a current version of Xcode (most developers use Apple Silicon, but Intel Macs still work).
Apple ID signed into Xcode's Settings → Accounts.
The Apple Developer Program License Agreement accepted in the developer portal; signing fails until it has been accepted.

Xcode project setup

For a Flutter project, run flutter create --platforms=ios . from the project root if the ios/ folder does not exist yet, then:

Open ios/Runner.xcworkspace (always the workspace, never .xcodeproj, once CocoaPods is involved).
In Runner's General tab: set Deployment Target to iOS 16.4, device family to iPhone (or iPhone + iPad).
In Signing & Capabilities: tick Automatically manage signing and select your team.
In ios/Podfile: set platform :ios, '16.4'.
Run cd ios && pod install.

Build command

flutter build ios --release --obfuscate --split-debug-info=build/debug-info-ios/

First build takes 5 to 10 minutes (CocoaPods compiles llama.cpp from source). Incremental builds land around 22 seconds. The produced Runner.app is 30 to 50 MB without the model; the model itself is downloaded on first launch.

For native iOS apps, the standard Xcode archive flow applies. No special configuration is needed for llama.cpp beyond the Metal-enabled framework you imported.

Privacy manifest (iOS 17+)

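Apple introduced the PrivacyInfo.xcprivacy privacy manifest alongside iOS 17, and App Store Connect has enforced it at submission since May 2024 for apps that call required-reason APIs or bundle commonly used third-party SDKs. For an on-device AI app it declares: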

NSPrivacyTracking: false (no cross-app tracking).
Data types collected. For an on-device app this is typically just crash and performance data (if you use Crashlytics or similar).
Accessed API "reason codes" for any platform APIs that require justification: UserDefaults (CA92.1), DiskSpace (85F4.1), FileTimestamp (C617.1), and so on.

Place PrivacyInfo.xcprivacy in ios/Runner/, then drag it into Xcode with Target Membership set to Runner. Plugin dependencies may bundle their own manifests, which merge automatically; watch the first archive build for warnings.
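To make the shape of the file concrete, the sketch below assembles a minimal manifest covering the tracking flag and the three reason codes listed above. It is an illustration only: the manifest is normally a hand-edited plist, buildPrivacyManifest is a hypothetical helper rather than part of any toolchain, and the NSPrivacyCollectedDataTypes array is omitted for brevity, so add it if you collect crash or performance data.

```dart
/// Builds PrivacyInfo.xcprivacy contents for an app that does no tracking.
/// [accessedApis] maps an accessed-API category to its declared reason codes.
/// Hypothetical helper for illustration; collected data types are not covered.
String buildPrivacyManifest(Map<String, List<String>> accessedApis) {
  final buffer = StringBuffer()
    ..writeln('<?xml version="1.0" encoding="UTF-8"?>')
    ..writeln('<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" '
        '"http://www.apple.com/DTDs/PropertyList-1.0.dtd">')
    ..writeln('<plist version="1.0">')
    ..writeln('<dict>')
    ..writeln('  <key>NSPrivacyTracking</key>')
    ..writeln('  <false/>')
    ..writeln('  <key>NSPrivacyAccessedAPITypes</key>')
    ..writeln('  <array>');

  // One <dict> per API category, each with its list of reason codes.
  accessedApis.forEach((category, reasons) {
    buffer
      ..writeln('    <dict>')
      ..writeln('      <key>NSPrivacyAccessedAPIType</key>')
      ..writeln('      <string>$category</string>')
      ..writeln('      <key>NSPrivacyAccessedAPITypeReasons</key>')
      ..writeln('      <array>');
    for (final reason in reasons) {
      buffer.writeln('        <string>$reason</string>');
    }
    buffer
      ..writeln('      </array>')
      ..writeln('    </dict>');
  });

  buffer
    ..writeln('  </array>')
    ..writeln('</dict>')
    ..writeln('</plist>');
  return buffer.toString();
}

void main() {
  // Category names as used in Apple's privacy manifest format.
  final manifest = buildPrivacyManifest({
    'NSPrivacyAccessedAPICategoryUserDefaults': ['CA92.1'],
    'NSPrivacyAccessedAPICategoryDiskSpace': ['85F4.1'],
    'NSPrivacyAccessedAPICategoryFileTimestamp': ['C617.1'],
  });
  print(manifest);
}
```

Only declare the categories your app and its plugins actually touch; the three shown here are simply the ones named above.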

App Store review specifics

A handful of iOS-only review surprises that catch first-time shippers:

Surprise	What to do
Export Compliance questionnaire at submit time	For a standard HTTPS-only AI app: Uses encryption = Yes, Qualifies for Cat 5 Pt 2 exemption = Yes, Uses non-Apple-standard crypto = No. Verify the current question phrasing at submission time, as Apple updates the questionnaire periodically. Wrong answers block submission.
Apple rejects icons with alpha channel	Use `remove_alpha_ios: true` in `flutter_launcher_icons`, or strip alpha manually. App Store icon must be exactly 1024×1024, opaque RGB, no rounded corners (iOS auto-rounds).
Unused permission strings trigger reviewer questions	Only add `Info.plist` usage-description strings (`NSCameraUsageDescription`, `NSMicrophoneUsageDescription`, etc.) for permissions the app actually requests.
Crashlytics dSYM upload	Add a Run Script Phase `"${PODS_ROOT}/FirebaseCrashlytics/run"` below "Embed Frameworks", with the standard input file list. Newer versions of `firebase_crashlytics` auto-inject this.
"Consider Wi-Fi" warning at ~200 MB	The store presents this to users on cellular when an in-app download exceeds Apple's threshold. Not a block, but surface a Wi-Fi-only consent screen in-app before triggering the download.
Privacy category wording differs from Google Play	Crash data is "App Functionality" on iOS but "Analytics" on Google Play, even though it is the same data. Mind the categorization.

Apple's review turnaround is typically 24 to 48 hours, faster than most teams expect.
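The Wi-Fi consent row in the table above comes down to a small decision before the model download starts. The sketch below assumes the approximate 200 MB figure from that row as the threshold and leaves network detection to the caller (a connectivity plugin, for instance); decideDownload, DownloadDecision, and wifiConsentThresholdBytes are illustrative names, not an existing API.

```dart
/// Approximate size above which to ask before downloading on cellular.
/// Assumption: taken from the ~200 MB figure in the table; tune as needed.
const int wifiConsentThresholdBytes = 200 * 1024 * 1024;

enum DownloadDecision { startNow, askForConsent }

/// Decides whether to show a consent screen before a model download.
/// The caller supplies [onCellular]; this function does no network detection.
DownloadDecision decideDownload({
  required int modelBytes,
  required bool onCellular,
  bool userAlreadyConsented = false,
}) {
  final isLarge = modelBytes >= wifiConsentThresholdBytes;
  if (isLarge && onCellular && !userAlreadyConsented) {
    return DownloadDecision.askForConsent;
  }
  return DownloadDecision.startNow;
}

void main() {
  const modelBytes = 700 * 1024 * 1024; // roughly a 1B Q4_K_M model

  // Large download on cellular without prior consent: ask first.
  print(decideDownload(modelBytes: modelBytes, onCellular: true));

  // Same download on Wi-Fi: start immediately.
  print(decideDownload(modelBytes: modelBytes, onCellular: false));
}
```

Persisting the user's answer and passing it back as userAlreadyConsented avoids asking again when a download is retried.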

Universal Links (Associated Domains)

If you want a URL on your domain to open your app, set up an Associated Domains entitlement:

In Xcode → Signing & Capabilities, add the Associated Domains capability and add applinks:yourdomain.com.
Host an apple-app-site-association file at https://yourdomain.com/.well-known/apple-app-site-association with no .json extension and Content-Type: application/json.

{
  "applinks": {
    "details": [
      {
        "appIDs": ["TEAMID.com.yourcompany.yourapp"],
        "components": [{ "/": "/share/*" }]
      }
    ]
  }
}

The missing-extension requirement is the part that breaks most static hosts. If your host defaults to application/octet-stream for extensionless files (Cloudflare Pages and similar static hosts, for example), add a Content-Type override rule at the host level. iOS's verifier does not follow redirects and refuses anything other than application/json.
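Before submitting, it can help to check the hosted file against the requirements above. The script below is a rough sketch using only dart:io and dart:convert: it requests the file without following redirects and reports a non-200 status, a Content-Type other than application/json, or a body with no applinks object. checkAasaResponse is a hypothetical helper, and the checks cover only what this page describes, not everything Apple's verifier does.

```dart
import 'dart:convert';
import 'dart:io';

/// Returns a list of problems; an empty list means the response looks usable.
List<String> checkAasaResponse(int status, String? contentType, String body) {
  final problems = <String>[];
  if (status != 200) {
    problems.add('Expected HTTP 200 without redirects, got $status.');
  }
  // Ignore parameters such as "; charset=utf-8" when comparing the MIME type.
  final mime = contentType?.split(';').first.trim().toLowerCase();
  if (mime != 'application/json') {
    final shown = contentType ?? 'none';
    problems.add('Expected Content-Type application/json, got $shown.');
  }
  try {
    final decoded = jsonDecode(body);
    if (decoded is! Map || decoded['applinks'] is! Map) {
      problems.add('Body has no "applinks" object.');
    }
  } on FormatException {
    problems.add('Body is not valid JSON.');
  }
  return problems;
}

Future<void> main(List<String> args) async {
  // Pass your domain as the first argument; the default is a placeholder.
  final domain = args.isNotEmpty ? args.first : 'yourdomain.com';
  final uri = Uri.https(domain, '/.well-known/apple-app-site-association');
  final client = HttpClient();
  try {
    final request = await client.getUrl(uri);
    request.followRedirects = false; // mirror the no-redirect behaviour
    final response = await request.close();
    final body = await response.transform(utf8.decoder).join();
    final problems = checkAasaResponse(
      response.statusCode,
      response.headers.contentType?.mimeType,
      body,
    );
    if (problems.isEmpty) {
      print('OK: $uri');
    } else {
      problems.forEach(print);
      exitCode = 1;
    }
  } finally {
    client.close();
  }
}
```

Keeping the checks in a pure function makes them easy to run against saved responses as well as the live host.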

What's next

Android

If your Flutter or React Native app also ships to Android.

Model delivery and UX

First-run download patterns and storage location guidance.

Performance tips

Context length, threading, KV cache, Metal GPU offload.

Verifying exports

Smoke-test the GGUF before integrating it into the app.