
AI in iOS Apps: CoreML, Cloud APIs, and On-Device LLMs Compared
Three paths to AI in your iOS app. CoreML for Apple's ecosystem, cloud APIs for capability, and on-device LLMs via llama.cpp for cost and privacy. A practical comparison for Swift developers.
As an iOS developer, you have three distinct paths to adding AI features to your app. Each uses different technology, has different cost characteristics, and is suited to different tasks. Choosing the wrong path wastes either money or time.
This guide compares the three approaches from a Swift developer's perspective: what each can do, what it costs, and when to use it.
Path 1: CoreML
Apple's native machine learning framework. CoreML runs models directly on the device using Apple's Neural Engine, GPU, and CPU. It is deeply integrated into the Apple ecosystem and optimized for Apple silicon.
What CoreML Can Do
CoreML excels at vision and traditional NLP tasks that Apple has specifically optimized:
- Image classification and object detection via Vision framework
- Text classification and sentiment analysis via Natural Language framework
- Sound classification via SoundAnalysis framework
- Hand pose, body pose, and face detection
- On-device translation (limited language pairs)
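Several of these tasks need only a few lines of Swift. As one example, on-device sentiment scoring via the Natural Language framework (the `sentimentScore` tag scheme requires iOS 13+; the sample text is illustrative):

```swift
import NaturalLanguage

// On-device sentiment analysis; no network, no API key.
let text = "The new update is fantastic and much faster."
let tagger = NLTagger(tagSchemes: [.sentimentScore])
tagger.string = text

// The tag's raw value is a score string in the range -1.0...1.0,
// where positive values indicate positive sentiment.
let (sentiment, _) = tagger.tag(at: text.startIndex,
                                unit: .paragraph,
                                scheme: .sentimentScore)
let score = Double(sentiment?.rawValue ?? "0") ?? 0
print("Sentiment score: \(score)")
```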
Apple provides pre-trained models through Create ML and the Apple Developer documentation. You can also convert models from PyTorch or TensorFlow using coremltools.
What CoreML Cannot Do
CoreML does not support running large language models for text generation, chat, or complex reasoning. While coremltools can technically convert some transformer architectures, there is no practical, supported path to running a GPT-style model through CoreML that produces conversational responses. Apple's on-device language features are limited to specific, narrow tasks.
Integration Pattern
```swift
import CoreML
import Vision

// Image classification example: wrap a CoreML model for use with Vision
let model = try VNCoreMLModel(for: MobileNetV2().model)
let request = VNCoreMLRequest(model: model) { request, error in
    guard let results = request.results as? [VNClassificationObservation],
          let topResult = results.first else { return }
    print("\(topResult.identifier): \(topResult.confidence)")
}

// Run the request against a CGImage
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
```
Cost
Zero. CoreML inference runs locally on the device with no API calls and no per-request charges.
Best For
Vision tasks (photo categorization, barcode scanning, face detection), text classification, sound analysis. Tasks where Apple provides optimized models or where you can train a custom classifier with Create ML.
Path 2: Cloud APIs
Call an external API (OpenAI, Anthropic, Google) from your iOS app. The model runs on the provider's servers. Your app sends the request and receives the response.
What Cloud APIs Can Do
Everything. Frontier models like GPT-4o, Claude 3.5 Sonnet, and Gemini can handle complex reasoning, creative generation, multi-turn conversation, code generation, and tasks that require broad world knowledge.
Integration Pattern
```swift
func chat(_ message: String) async throws -> String {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")

    let body: [String: Any] = [
        "model": "gpt-4o-mini",
        "messages": [["role": "user", "content": message]]
    ]
    request.httpBody = try JSONSerialization.data(withJSONObject: body)

    let (data, _) = try await URLSession.shared.data(for: request)

    // Extract the assistant's reply from choices[0].message.content
    let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    let choices = json?["choices"] as? [[String: Any]]
    let reply = choices?.first?["message"] as? [String: Any]
    return reply?["content"] as? String ?? ""
}
```
Cost
Per-token pricing. GPT-4o-mini costs $0.15/$0.60 per million input/output tokens. At 10K MAU with 3 daily interactions, expect $300-$1,000+/month depending on your system prompt and conversation history.
Best For
Prototyping and validation. Tasks requiring frontier reasoning on novel inputs. Very low volume features. Features needing access to current world knowledge.
Drawbacks for iOS
Network dependency (fails offline, subway, airplane mode). Latency (500ms-3s for each response). Privacy (user data sent to third-party servers, must be disclosed in App Store privacy labels). Cost scales with every user.
Path 3: On-Device LLMs via llama.cpp
Run a full language model locally on the iPhone using llama.cpp. This gives you GPT-style capabilities (chat, generation, classification, summarization) entirely on-device.
What On-Device LLMs Can Do
Any text-in, text-out task that a small language model can handle: conversational AI, content drafting, classification, summarization, translation, structured data extraction, and function/tool calling. Fine-tuned on your domain data, a 3B model can approach frontier-model accuracy on narrow, domain-specific tasks.
How It Works on iOS
llama.cpp is a C/C++ library that runs GGUF model files. On iOS, it uses Apple's Metal API for GPU acceleration on Apple silicon. The library provides Swift-compatible interfaces through its C API or community Swift wrappers.
```swift
// Conceptual pattern using llama.cpp Swift bindings
// (type and method names are illustrative, not a specific published API)
let model = try LlamaModel(path: modelPath, params: .default)
let context = try model.createContext(contextLength: 2048)

// Streaming inference: append tokens to the UI as they arrive
for await token in context.generate(prompt: userMessage) {
    await MainActor.run { responseText += token }
}
```
Performance on Apple Silicon
| iPhone | Chip | RAM | 1B Model (tok/s) | 3B Model (tok/s) |
|---|---|---|---|---|
| iPhone 12 | A14 | 4GB | 20-30 | Not recommended |
| iPhone 13 | A15 | 4-6GB | 30-40 | 12-18 |
| iPhone 14 | A15/A16 | 6GB | 30-40 | 15-22 |
| iPhone 15 | A16/A17 | 6-8GB | 35-50 | 20-30 |
| iPhone 16 Pro | A18 Pro | 8GB | 45-60 | 25-35 |
Anything above 10 tokens per second is usable for chat. Above 20 feels responsive. Modern iPhones (A15 and later) comfortably run 1-3B models.
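You can sanity-check these numbers in your own app by timing a fixed generation. This sketch reuses the illustrative binding names from the conceptual snippet above (not a real published API):

```swift
import Foundation

// Rough tokens-per-second measurement for one generation pass.
// `context.generate` is the hypothetical streaming API shown earlier.
let start = Date()
var tokenCount = 0
for await _ in context.generate(prompt: "Summarize this paragraph: ...") {
    tokenCount += 1
}
let elapsed = Date().timeIntervalSince(start)
print(String(format: "%.1f tok/s", Double(tokenCount) / elapsed))
```

Measure on your oldest supported device, not just the simulator: Metal acceleration behaves very differently on real hardware.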
Cost
One-time fine-tuning cost ($5-50). Model distribution via CDN (~$0.08/GB). Then zero per-inference cost. Permanently.
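The CDN line item can be held to one download per install by caching the model file locally. A minimal sketch using `URLSession` (iOS 15+; the CDN URL and filename are placeholders, not a real endpoint):

```swift
import Foundation

// Download a GGUF model once and cache it in Application Support.
func ensureModel() async throws -> URL {
    let support = try FileManager.default.url(for: .applicationSupportDirectory,
                                              in: .userDomainMask,
                                              appropriateFor: nil,
                                              create: true)
    let destination = support.appendingPathComponent("model-q4_k_m.gguf")

    // Already downloaded: zero marginal cost from here on.
    if FileManager.default.fileExists(atPath: destination.path) {
        return destination
    }

    // Placeholder CDN URL; substitute your own distribution endpoint.
    let remote = URL(string: "https://cdn.example.com/models/model-q4_k_m.gguf")!
    let (temp, _) = try await URLSession.shared.download(from: remote)
    try FileManager.default.moveItem(at: temp, to: destination)
    return destination
}
```

Multi-gigabyte downloads are best deferred to Wi-Fi and surfaced to the user with progress UI; a background `URLSession` configuration handles interruptions more gracefully than the shared session shown here.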
Best For
High-volume AI features (chat, search, classification). Privacy-sensitive data. Offline-required features. Domain-specific tasks. Any app where AI costs need to stay flat as users grow.
The Comparison
| Factor | CoreML | Cloud API | On-Device LLM |
|---|---|---|---|
| Text generation / chat | No | Yes | Yes |
| Image classification | Yes (optimized) | Yes | No (text only) |
| Offline support | Yes | No | Yes |
| Cost per inference | $0 | $0.0001-$0.01 | $0 |
| Setup complexity | Low | Low | Medium |
| Latency | Milliseconds (on-device) | 500ms-3,000ms | 50-200ms first token |
| Privacy | On-device | Third-party servers | On-device |
| Model flexibility | Apple or converted models | Any provider model | Any GGUF model |
| Fine-tuning | Create ML (limited) | Some providers | Full LoRA/QLoRA |
The Practical Decision
Use CoreML when you need image classification, object detection, text classification, or sound analysis. Apple's optimized models are hard to beat for these specific tasks on iOS hardware.
Use a cloud API when you are prototyping, when task volume is very low, or when you genuinely need frontier-model reasoning that a 3B model cannot match.
Use on-device LLMs when you need text generation, chat, summarization, translation, or any high-volume text task. The cost, latency, privacy, and offline advantages are significant. A fine-tuned model on your domain data will outperform generic cloud API prompting for your specific use case.
Many apps combine approaches. CoreML for the camera features. On-device LLM for the chat assistant. Cloud API as a fallback for the occasional complex query. This hybrid approach gives you the best of each technology.
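One way to wire up that hybrid is a small routing function that prefers the on-device model and falls back to the cloud only when it helps. The thresholds below are illustrative, not tuned values:

```swift
// Hypothetical backend router for a hybrid setup.
enum AIBackend {
    case onDevice  // llama.cpp model: free, private, offline-capable
    case cloud     // frontier API: stronger reasoning, needs network
}

func pickBackend(prompt: String, isOnline: Bool) -> AIBackend {
    // Heuristic: long or multi-document prompts may exceed what a
    // small on-device model handles well; send those to the cloud
    // when the network allows. Everything else stays local.
    let looksComplex = prompt.count > 2_000
    if looksComplex && isOnline { return .cloud }
    return .onDevice
}
```

Pair this with `NWPathMonitor` (Network framework) to keep `isOnline` current, and the app degrades gracefully: offline users simply always get the local model.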
For the fine-tuning step, tools like Ertas provide a visual pipeline that takes you from training data to a GGUF file ready for iOS deployment. No ML expertise required. The model runs on-device via llama.cpp with Metal acceleration, giving you production-grade inference performance on any modern iPhone.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.