
AI in Android Apps: ML Kit, Cloud APIs, and On-Device LLMs Compared
Three paths to AI in your Android app. Google ML Kit for common tasks, cloud APIs for full LLM capability, and on-device models via llama.cpp for cost and privacy. A practical comparison for Kotlin developers.
Android developers have three distinct approaches for adding AI to their apps. Google ML Kit handles common tasks. Cloud APIs provide frontier-model capability. On-device LLMs via llama.cpp give you full text generation without API costs or network dependency.
Each serves a different purpose. This guide compares them from a Kotlin developer's perspective.
Path 1: Google ML Kit
ML Kit is Google's mobile SDK for on-device machine learning. It provides production-ready APIs for common ML tasks without requiring any ML expertise.
What ML Kit Can Do
ML Kit offers pre-built models for specific tasks:
- Text recognition (OCR) across Latin, Chinese, Japanese, Korean, and Devanagari scripts
- Face detection with landmark tracking and classification
- Barcode scanning for all common barcode formats
- Image labeling with 400+ categories
- Object detection and tracking in real-time camera feeds
- Pose detection for body landmark tracking
- Digital ink recognition for handwriting
- Translation across 59 languages with models downloaded on-demand
- Smart reply suggestions for conversational contexts
- Entity extraction for dates, addresses, phone numbers in text
Integration Pattern
```kotlin
import android.util.Log
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions

// Text recognition (OCR) with the Latin-script model
val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
val image = InputImage.fromBitmap(bitmap, 0) // second argument is rotation in degrees

recognizer.process(image)
    .addOnSuccessListener { text ->
        Log.d("MLKit", "Recognized: ${text.text}")
    }
    .addOnFailureListener { e ->
        Log.e("MLKit", "Recognition failed", e)
    }
```
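The same Task-based pattern extends to the other APIs. On-device translation, for example, needs its language model downloaded before first use. A sketch with the ML Kit Translate API (the Wi-Fi-only download condition is a choice, not a requirement):

```kotlin
import com.google.mlkit.common.model.DownloadConditions
import com.google.mlkit.nl.translate.TranslateLanguage
import com.google.mlkit.nl.translate.Translation
import com.google.mlkit.nl.translate.TranslatorOptions

val options = TranslatorOptions.Builder()
    .setSourceLanguage(TranslateLanguage.ENGLISH)
    .setTargetLanguage(TranslateLanguage.SPANISH)
    .build()
val translator = Translation.getClient(options)

// Each language model is roughly 30MB; download it on Wi-Fi before first use
val conditions = DownloadConditions.Builder().requireWifi().build()
translator.downloadModelIfNeeded(conditions)
    .addOnSuccessListener {
        translator.translate("Hello, world")
            .addOnSuccessListener { translated -> Log.d("MLKit", translated) }
    }
```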
What ML Kit Cannot Do
ML Kit does not support running large language models. There is no built-in capability for open-ended text generation, conversational AI, content drafting, or complex reasoning. Smart Reply provides short suggested responses for messages, but it is not a general-purpose chat model.
Cost
Free. ML Kit runs entirely on-device. No API calls, no per-request charges, no usage limits.
Best For
OCR, barcode scanning, face detection, pose estimation, image labeling. Tasks where Google provides an optimized, pre-built solution.
Path 2: Cloud APIs
Call OpenAI, Anthropic, Google Gemini, or another provider from your Android app. The model runs on remote servers.
Integration Pattern
```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody

private val client = OkHttpClient() // reuse a single client instead of building one per call

suspend fun chat(message: String): String = withContext(Dispatchers.IO) {
    // Hand-built JSON breaks if the message contains quotes or newlines;
    // use a JSON library (kotlinx.serialization, Moshi) in production
    val body = """
        {"model": "gpt-4o-mini",
         "messages": [{"role": "user", "content": "$message"}]}
    """.trimIndent()
    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .post(body.toRequestBody("application/json".toMediaType()))
        // Keys embedded in an APK can be extracted; route through your backend in production
        .addHeader("Authorization", "Bearer $apiKey")
        .build()
    val response = client.newCall(request).execute()
    parseResponse(response.body!!.string()) // extract choices[0].message.content from the JSON
}
```
For Google Gemini, Android has a dedicated SDK (Google AI Client SDK) that provides a more native integration:
```kotlin
// generateContent is a suspend function; call it from a coroutine
val model = GenerativeModel(modelName = "gemini-2.0-flash", apiKey = apiKey)
val response = model.generateContent("Your prompt here")
println(response.text)
```
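The SDK also supports streaming, which matters for chat UIs where users watch tokens arrive. A minimal sketch, assuming the generateContentStream API from the same SDK:

```kotlin
// Collect partial responses as they stream in instead of waiting for the full reply
model.generateContentStream("Your prompt here").collect { chunk ->
    chunk.text?.let { print(it) }
}
```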
Cost
Per-token pricing. Gemini 2.0 Flash is the cheapest major option at $0.10 input / $0.40 output per million tokens; GPT-4o-mini runs $0.15/$0.60. At scale, costs range from hundreds to thousands of dollars per month.
Gemini Nano: The In-Between
Google offers Gemini Nano for on-device inference, but it is heavily restricted. It runs only on specific devices (Pixel 8/9 series, Samsung Galaxy S24/S25 series) and only through the AICore system service. You cannot use your own models. You cannot fine-tune it. The capabilities are limited to specific tasks Google has approved.
For developers who need on-device AI across the full Android device spectrum, Gemini Nano is not a general solution.
Best For
As on iOS: prototyping, validating a feature, very low volume, or tasks that require frontier-model reasoning.
Path 3: On-Device LLMs via llama.cpp
Run a full language model locally on the Android device. llama.cpp provides the inference engine. Your fine-tuned GGUF model provides the intelligence.
How It Works on Android
The llama.cpp project includes llama.android, an example Android module that exposes the inference engine to Kotlin through JNI bindings. It supports CPU inference on all devices and GPU acceleration via the Vulkan backend on supported hardware.
```kotlin
import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.channelFlow
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.launch

// Conceptual pattern using llama.android; LlamaModel stands in for your JNI binding class
class AiViewModel : ViewModel() {
    private val llamaModel = LlamaModel()

    fun loadModel(modelPath: String) {
        viewModelScope.launch(Dispatchers.Default) {
            // nGpuLayers > 0 offloads layers to the GPU where Vulkan is available
            llamaModel.load(modelPath, nThreads = 4, nGpuLayers = 32)
        }
    }

    // channelFlow bridges the per-token callback into a Flow;
    // emit() cannot be called directly from a non-suspending callback
    fun generate(prompt: String): Flow<String> = channelFlow {
        llamaModel.generate(prompt) { token ->
            trySend(token)
        }
    }.flowOn(Dispatchers.Default)
}
```
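Collecting that Flow from the UI is then ordinary coroutine code; for example, from an Activity (outputTextView is a placeholder):

```kotlin
// Append tokens to the screen as they stream out of the model
lifecycleScope.launch {
    viewModel.generate("Summarize my meeting notes").collect { token ->
        outputTextView.append(token)
    }
}
```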
Performance Across Chipsets
| Chipset | Devices | RAM | 1B model (tok/s) | 3B model (tok/s) |
|---|---|---|---|---|
| Snapdragon 8 Gen 2 | Galaxy S23, OnePlus 11 | 8-12GB | 25-35 | 12-18 |
| Snapdragon 8 Gen 3 | Galaxy S24, OnePlus 12 | 8-12GB | 35-45 | 18-25 |
| Tensor G3 | Pixel 8/8 Pro | 12GB | 25-35 | 12-18 |
| Tensor G4 | Pixel 9/9 Pro | 12-16GB | 30-40 | 15-22 |
| Snapdragon 7 Gen 3 | Mid-range 2024+ | 6-8GB | 18-25 | 8-12 |
Anything above 10 tokens per second is usable for a chat interface. Flagship devices from the last two years handle 1-3B models comfortably.
The Android Fragmentation Factor
Android has more device diversity than iOS. This is both a challenge and an advantage:
Challenge: You need to test across chipsets and RAM configurations. A model that runs well on a Galaxy S24 (12GB) might struggle on a budget phone with 4GB.
Advantage: Many Android devices have 8-12GB RAM, which is generous for on-device models. The mid-range 6-8GB segment can still run 1B models effectively.
Practical approach: target 1B models for broad compatibility (they run on 4GB+ devices), offer 3B models as an upgrade for devices with 8GB+ RAM, and detect available memory at runtime to choose between them, as sketched below.
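A minimal sketch of that runtime check, using ActivityManager.getMemoryInfo(); the model file names and the 8GB threshold are illustrative:

```kotlin
import android.app.ActivityManager
import android.content.Context

// Choose a model size based on total device RAM
fun chooseModelAsset(context: Context): String {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memoryInfo = ActivityManager.MemoryInfo()
    am.getMemoryInfo(memoryInfo)
    val totalGb = memoryInfo.totalMem / (1024.0 * 1024.0 * 1024.0)
    return if (totalGb >= 8.0) "model-3b-q4.gguf" else "model-1b-q4.gguf"
}
```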
Memory Management
Android's memory management is more aggressive than iOS's. The system will kill background processes to reclaim memory. Key practices:
- Load the model in a foreground Service or when the AI feature is active
- Release model memory when the user navigates away from the AI feature
- Handle `onTrimMemory()` callbacks to release resources under pressure (see the sketch after this list)
- Use `ActivityManager.getMemoryInfo()` to check available RAM before loading
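A sketch of the onTrimMemory handling, assuming the hypothetical llamaModel wrapper from the conceptual code above exposes an unload() method:

```kotlin
import android.content.ComponentCallbacks2

// In the Activity or Service that hosts the model
override fun onTrimMemory(level: Int) {
    super.onTrimMemory(level)
    // RUNNING_LOW and higher levels signal real memory pressure or a backgrounded app
    if (level >= ComponentCallbacks2.TRIM_MEMORY_RUNNING_LOW) {
        llamaModel.unload() // hypothetical: free weights and KV cache, reload on next use
    }
}
```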
Cost
As on iOS: a one-time fine-tuning run ($5-$50), CDN distribution of the GGUF file, then zero per-inference cost.
The Comparison
| Factor | ML Kit | Cloud API | On-Device LLM |
|---|---|---|---|
| Text generation / chat | No | Yes | Yes |
| OCR / barcode scanning | Yes (optimized) | Yes | No |
| Offline support | Yes | No | Yes |
| Cost per inference | $0 | $0.0001-$0.01 | $0 |
| Device coverage | All Android 5.0+ | All with internet | 4GB+ RAM |
| Privacy | On-device | Third-party servers | On-device |
| Custom model support | No | Via API selection | Any GGUF model |
| Fine-tuning | No | Some providers | Full LoRA/QLoRA |
Practical Decision Framework
Use ML Kit when you need OCR, barcode scanning, face detection, pose estimation, or image labeling. Google's implementations are production-grade and free.
Use cloud APIs when you are validating a feature, serving very low volume, or need frontier reasoning. Gemini's Android SDK makes this especially easy for Android developers.
Use on-device LLMs when you need conversational AI, content generation, classification, or any text-heavy AI feature at scale. The zero-cost scaling, offline support, and privacy guarantees are decisive advantages for production mobile apps.
The fine-tuning pipeline (dataset preparation, LoRA training, GGUF export) is where tools like Ertas save time. The visual interface handles the full workflow, and the exported GGUF runs on any Android device via llama.cpp with no additional configuration.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.