
AI in Android Apps: ML Kit, Cloud APIs, and On-Device LLMs Compared
Three paths to AI in your Android app. Google ML Kit for common tasks, cloud APIs for full LLM capability, and on-device models via llama.cpp for cost and privacy. A practical comparison for Kotlin developers.
Android developers have three distinct approaches for adding AI to their apps. Google ML Kit handles common tasks. Cloud APIs provide frontier-model capability. On-device LLMs via llama.cpp give you full text generation without API costs or network dependency.
Each serves a different purpose. This guide compares them from a Kotlin developer's perspective.
Path 1: Google ML Kit
ML Kit is Google's mobile SDK for on-device machine learning. It provides production-ready APIs for common ML tasks without requiring any ML expertise.
What ML Kit Can Do
ML Kit offers pre-built models for specific tasks:
- Text recognition (OCR) across Latin, Chinese, Japanese, Korean, and Devanagari scripts
- Face detection with landmark tracking and classification
- Barcode scanning for all common barcode formats
- Image labeling with 400+ categories
- Object detection and tracking in real-time camera feeds
- Pose detection for body landmark tracking
- Digital ink recognition for handwriting
- Translation across 59 languages with models downloaded on-demand
- Smart reply suggestions for conversational contexts
- Entity extraction for dates, addresses, phone numbers in text
Integration Pattern
```kotlin
import android.util.Log
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions

// Text recognition (OCR) with the Latin-script model
val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
val image = InputImage.fromBitmap(bitmap, 0) // second argument is rotation in degrees

recognizer.process(image)
    .addOnSuccessListener { text ->
        Log.d("MLKit", "Recognized: ${text.text}")
    }
    .addOnFailureListener { e ->
        Log.e("MLKit", "Recognition failed", e)
    }
```
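The same Task-based pattern extends to the other APIs. On-device translation, for example, needs its language model downloaded before first use. A sketch with the ML Kit Translate API (the Wi-Fi-only download condition is a choice, not a requirement):

```kotlin
import com.google.mlkit.common.model.DownloadConditions
import com.google.mlkit.nl.translate.TranslateLanguage
import com.google.mlkit.nl.translate.Translation
import com.google.mlkit.nl.translate.TranslatorOptions

val options = TranslatorOptions.Builder()
    .setSourceLanguage(TranslateLanguage.ENGLISH)
    .setTargetLanguage(TranslateLanguage.SPANISH)
    .build()
val translator = Translation.getClient(options)

// Each language model is roughly 30MB; download it on Wi-Fi before first use
val conditions = DownloadConditions.Builder().requireWifi().build()
translator.downloadModelIfNeeded(conditions)
    .addOnSuccessListener {
        translator.translate("Hello, world")
            .addOnSuccessListener { translated -> Log.d("MLKit", translated) }
    }
```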
What ML Kit Cannot Do
ML Kit does not support running large language models. There is no built-in capability for open-ended text generation, conversational AI, content drafting, or complex reasoning. Smart Reply provides short suggested responses for messages, but it is not a general-purpose chat model.
Cost
Free. ML Kit runs entirely on-device. No API calls, no per-request charges, no usage limits.
Best For
OCR, barcode scanning, face detection, pose estimation, image labeling. Tasks where Google provides an optimized, pre-built solution.
Path 2: Cloud APIs
Call OpenAI, Anthropic, Google Gemini, or another provider from your Android app. The model runs on remote servers.
Integration Pattern
```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody

private val client = OkHttpClient() // reuse a single client instead of building one per call

suspend fun chat(message: String): String = withContext(Dispatchers.IO) {
    // Hand-built JSON breaks if the message contains quotes or newlines;
    // use a JSON library (kotlinx.serialization, Moshi) in production
    val body = """
        {"model": "gpt-4o-mini",
         "messages": [{"role": "user", "content": "$message"}]}
    """.trimIndent()
    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .post(body.toRequestBody("application/json".toMediaType()))
        // Keys embedded in an APK can be extracted; route through your backend in production
        .addHeader("Authorization", "Bearer $apiKey")
        .build()
    val response = client.newCall(request).execute()
    parseResponse(response.body!!.string()) // extract choices[0].message.content from the JSON
}
```
For Google Gemini, Android has a dedicated SDK (Google AI Client SDK) that provides a more native integration:
```kotlin
// generateContent is a suspend function; call it from a coroutine
val model = GenerativeModel(modelName = "gemini-2.0-flash", apiKey = apiKey)
val response = model.generateContent("Your prompt here")
println(response.text)
```
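The SDK also supports streaming, which matters for chat UIs where users watch tokens arrive. A minimal sketch, assuming the generateContentStream API from the same SDK:

```kotlin
// Collect partial responses as they stream in instead of waiting for the full reply
model.generateContentStream("Your prompt here").collect { chunk ->
    chunk.text?.let { print(it) }
}
```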
Cost
Per-token pricing. Gemini 2.0 Flash is the cheapest major option at $0.10 input / $0.40 output per million tokens; GPT-4o-mini runs $0.15/$0.60. At scale, costs range from hundreds to thousands of dollars per month.
Gemini Nano: The In-Between
Google offers Gemini Nano for on-device inference, but it is heavily restricted. It runs only on specific devices (Pixel 8/9 series, Samsung Galaxy S24/S25 series) and only through the AICore system service. You cannot use your own models. You cannot fine-tune it. The capabilities are limited to specific tasks Google has approved.
For developers who need on-device AI across the full Android device spectrum, Gemini Nano is not a general solution.
Best For
As on iOS: prototyping, validating a feature, very low volume, or tasks that require frontier-model reasoning.
Path 3: On-Device LLMs via llama.cpp
Run a full language model locally on the Android device. llama.cpp provides the inference engine. Your fine-tuned GGUF model provides the intelligence.
How It Works on Android
The llama.cpp project includes llama.android, an example Android module that exposes the inference engine to Kotlin through JNI bindings. It supports CPU inference on all devices and GPU acceleration via the Vulkan backend on supported hardware.
```kotlin
import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.channelFlow
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.launch

// Conceptual pattern using llama.android; LlamaModel stands in for your JNI binding class
class AiViewModel : ViewModel() {
    private val llamaModel = LlamaModel()

    fun loadModel(modelPath: String) {
        viewModelScope.launch(Dispatchers.Default) {
            // nGpuLayers > 0 offloads layers to the GPU where Vulkan is available
            llamaModel.load(modelPath, nThreads = 4, nGpuLayers = 32)
        }
    }

    // channelFlow bridges the per-token callback into a Flow;
    // emit() cannot be called directly from a non-suspending callback
    fun generate(prompt: String): Flow<String> = channelFlow {
        llamaModel.generate(prompt) { token ->
            trySend(token)
        }
    }.flowOn(Dispatchers.Default)
}
```
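Collecting that Flow from the UI is then ordinary coroutine code; for example, from an Activity (outputTextView is a placeholder):

```kotlin
// Append tokens to the screen as they stream out of the model
lifecycleScope.launch {
    viewModel.generate("Summarize my meeting notes").collect { token ->
        outputTextView.append(token)
    }
}
```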
Performance Across Chipsets
| Chipset | Devices | RAM | 1B model (tok/s) | 3B model (tok/s) |
|---|---|---|---|---|
| Snapdragon 8 Gen 2 | Galaxy S23, OnePlus 11 | 8-12GB | 25-35 | 12-18 |
| Snapdragon 8 Gen 3 | Galaxy S24, OnePlus 12 | 8-12GB | 35-45 | 18-25 |
| Tensor G3 | Pixel 8/8 Pro | 12GB | 25-35 | 12-18 |
| Tensor G4 | Pixel 9/9 Pro | 12-16GB | 30-40 | 15-22 |
| Snapdragon 7 Gen 3 | Mid-range 2024+ | 6-8GB | 18-25 | 8-12 |
Anything above 10 tokens per second is usable for a chat interface. Flagship devices from the last two years handle 1-3B models comfortably.
The Android Fragmentation Factor
Android has more device diversity than iOS. This is both a challenge and an advantage:
Challenge: You need to test across chipsets and RAM configurations. A model that runs well on a Galaxy S24 (12GB) might struggle on a budget phone with 4GB.
Advantage: Many Android devices have 8-12GB RAM, which is generous for on-device models. The mid-range 6-8GB segment can still run 1B models effectively.
Practical approach: target 1B models for broad compatibility (they run on 4GB+ devices), offer 3B models as an upgrade for devices with 8GB+ RAM, and detect available memory at runtime to choose between them, as sketched below.
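A minimal sketch of that runtime check, using ActivityManager.getMemoryInfo(); the model file names and the 8GB threshold are illustrative:

```kotlin
import android.app.ActivityManager
import android.content.Context

// Choose a model size based on total device RAM
fun chooseModelAsset(context: Context): String {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memoryInfo = ActivityManager.MemoryInfo()
    am.getMemoryInfo(memoryInfo)
    val totalGb = memoryInfo.totalMem / (1024.0 * 1024.0 * 1024.0)
    return if (totalGb >= 8.0) "model-3b-q4.gguf" else "model-1b-q4.gguf"
}
```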
Memory Management
Android's memory management is more aggressive than iOS's. The system will kill background processes to reclaim memory. Key practices:
- Load the model in a foreground Service or when the AI feature is active
- Release model memory when the user navigates away from the AI feature
- Handle `onTrimMemory()` callbacks to release resources under pressure (see the sketch after this list)
- Use `ActivityManager.getMemoryInfo()` to check available RAM before loading
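A sketch of the onTrimMemory handling, assuming the hypothetical llamaModel wrapper from the conceptual code above exposes an unload() method:

```kotlin
import android.content.ComponentCallbacks2

// In the Activity or Service that hosts the model
override fun onTrimMemory(level: Int) {
    super.onTrimMemory(level)
    // RUNNING_LOW and higher levels signal real memory pressure or a backgrounded app
    if (level >= ComponentCallbacks2.TRIM_MEMORY_RUNNING_LOW) {
        llamaModel.unload() // hypothetical: free weights and KV cache, reload on next use
    }
}
```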
Cost
As on iOS: a one-time fine-tuning run ($5-$50), CDN distribution of the GGUF file, then zero per-inference cost.
The Comparison
| Factor | ML Kit | Cloud API | On-Device LLM |
|---|---|---|---|
| Text generation / chat | No | Yes | Yes |
| OCR / barcode scanning | Yes (optimized) | Yes | No |
| Offline support | Yes | No | Yes |
| Cost per inference | $0 | $0.0001-$0.01 | $0 |
| Device coverage | All Android 5.0+ | All with internet | 4GB+ RAM |
| Privacy | On-device | Third-party servers | On-device |
| Custom model support | No | Via API selection | Any GGUF model |
| Fine-tuning | No | Some providers | Full LoRA/QLoRA |
Practical Decision Framework
Use ML Kit when you need OCR, barcode scanning, face detection, pose estimation, or image labeling. Google's implementations are production-grade and free.
Use cloud APIs when you are validating a feature, serving very low volume, or need frontier reasoning. Gemini's Android SDK makes this especially easy for Android developers.
Use on-device LLMs when you need conversational AI, content generation, classification, or any text-heavy AI feature at scale. The zero-cost scaling, offline support, and privacy guarantees are decisive advantages for production mobile apps.
The fine-tuning pipeline (dataset preparation, LoRA training, GGUF export) is where tools like Ertas save time. The visual interface handles the full workflow, and the exported GGUF runs on any Android device via llama.cpp with no additional configuration.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.