    AI in Android Apps: ML Kit, Cloud APIs, and On-Device LLMs Compared


    Three paths to AI in your Android app. Google ML Kit for common tasks, cloud APIs for full LLM capability, and on-device models via llama.cpp for cost and privacy. A practical comparison for Kotlin developers.

    Ertas Team

    Android developers have three distinct approaches for adding AI to their apps. Google ML Kit handles common tasks. Cloud APIs provide frontier-model capability. On-device LLMs via llama.cpp give you full text generation without API costs or network dependency.

    Each serves a different purpose. This guide compares them from a Kotlin developer's perspective.

    Path 1: Google ML Kit

    ML Kit is Google's mobile SDK for on-device machine learning. It provides production-ready APIs for common ML tasks without requiring any ML expertise.

    What ML Kit Can Do

    ML Kit offers pre-built models for specific tasks:

    • Text recognition (OCR) across Latin, Chinese, Japanese, Korean, and Devanagari scripts
    • Face detection with landmark tracking and classification
    • Barcode scanning for all common barcode formats
    • Image labeling with 400+ categories
    • Object detection and tracking in real-time camera feeds
    • Pose detection for body landmark tracking
    • Digital ink recognition for handwriting
    • Translation across 59 languages with models downloaded on-demand
    • Smart reply suggestions for conversational contexts
    • Entity extraction for dates, addresses, phone numbers in text

    Integration Pattern

    // Text recognition example
    val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
    val image = InputImage.fromBitmap(bitmap, 0)
    
    recognizer.process(image)
        .addOnSuccessListener { text ->
            Log.d("MLKit", "Recognized: ${text.text}")
        }
        .addOnFailureListener { e ->
            Log.e("MLKit", "Recognition failed", e)
        }
    

    What ML Kit Cannot Do

    ML Kit does not support running large language models. There is no built-in capability for open-ended text generation, conversational AI, content drafting, or complex reasoning. Smart Reply provides short suggested responses for messages, but it is not a general-purpose chat model.

    Cost

    Free. ML Kit runs entirely on-device. No API calls, no per-request charges, no usage limits.

    Best For

    OCR, barcode scanning, face detection, pose estimation, image labeling. Tasks where Google provides an optimized, pre-built solution.

    Path 2: Cloud APIs

    Call OpenAI, Anthropic, Google Gemini, or another provider from your Android app. The model runs on remote servers.

    Integration Pattern

    suspend fun chat(message: String): String = withContext(Dispatchers.IO) {
        val client = OkHttpClient()
        // Build the body with JSONObject so user text is escaped correctly;
        // interpolating raw input into a JSON string breaks on quotes and newlines.
        val body = JSONObject()
            .put("model", "gpt-4o-mini")
            .put("messages", JSONArray().put(
                JSONObject().put("role", "user").put("content", message)))
            .toString()
    
        val request = Request.Builder()
            .url("https://api.openai.com/v1/chat/completions")
            .post(body.toRequestBody("application/json".toMediaType()))
            .addHeader("Authorization", "Bearer $apiKey")
            .build()
    
        // use {} closes the response even if parsing throws
        client.newCall(request).execute().use { response ->
            parseResponse(response.body!!.string())
        }
    }
    

    For Google Gemini, Android has a dedicated SDK (Google AI Client SDK) that provides a more native integration:

    val model = GenerativeModel(modelName = "gemini-2.0-flash", apiKey = apiKey)
    val response = model.generateContent("Your prompt here")
    println(response.text)
    

    Cost

    Per-token pricing. Gemini Flash is the cheapest major option at $0.10 input / $0.40 output per million tokens; GPT-4o-mini runs $0.15/$0.60. At scale, costs range from hundreds to thousands of dollars per month.
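    To make those per-token rates concrete, here is a rough monthly-cost estimator. The request profile in the usage note (calls per day, tokens per call) is a hypothetical example, not a benchmark:

```kotlin
// Rough monthly cost for per-token cloud pricing.
// Prices are USD per million tokens (e.g. 0.15 / 0.60 for GPT-4o-mini).
fun monthlyCostUsd(
    requestsPerDay: Long,
    inputTokensPerRequest: Long,
    outputTokensPerRequest: Long,
    inputPricePerMillion: Double,
    outputPricePerMillion: Double,
): Double {
    // Monthly token volume, in millions of tokens
    val monthlyInputM = requestsPerDay * 30 * inputTokensPerRequest / 1_000_000.0
    val monthlyOutputM = requestsPerDay * 30 * outputTokensPerRequest / 1_000_000.0
    return monthlyInputM * inputPricePerMillion + monthlyOutputM * outputPricePerMillion
}
```

    At a hypothetical 10,000 requests/day with 500 input and 300 output tokens each, GPT-4o-mini pricing works out to roughly $76/month; the same traffic at ten times the volume lands in the hundreds, which is where on-device inference starts to pay for itself.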

    Gemini Nano: The In-Between

    Google offers Gemini Nano for on-device inference, but it is heavily restricted. It runs only on specific devices (Pixel 8/9 series, Samsung Galaxy S24/S25 series) and only through the AICore system service. You cannot use your own models. You cannot fine-tune it. The capabilities are limited to specific tasks Google has approved.

    For developers who need on-device AI across the full Android device spectrum, Gemini Nano is not a general solution.

    Best For

    Same as iOS: prototyping, validation, very low volume, or tasks requiring frontier reasoning.

    Path 3: On-Device LLMs via llama.cpp

    Run a full language model locally on the Android device. llama.cpp provides the inference engine. Your fine-tuned GGUF model provides the intelligence.

    How It Works on Android

    The llama.cpp project includes llama.android, a pre-built Android library that provides Kotlin bindings through JNI. It supports CPU inference on all devices and GPU acceleration via Vulkan on supported hardware.

    // Conceptual pattern using llama.android
    class AiViewModel : ViewModel() {
        private val llamaModel = LlamaModel()
    
        fun loadModel(modelPath: String) {
            viewModelScope.launch(Dispatchers.Default) {
                llamaModel.load(modelPath, nThreads = 4, nGpuLayers = 32)
            }
        }
    
        // channelFlow lets the non-suspending token callback send values;
        // flow { emit(token) } would not compile because emit is a suspend call.
        fun generate(prompt: String): Flow<String> = channelFlow {
            llamaModel.generate(prompt) { token ->
                trySend(token)
            }
        }.flowOn(Dispatchers.Default)
    }
    

    Performance Across Chipsets

    | Chipset | Devices | RAM | 1B (tok/s) | 3B (tok/s) |
    |---|---|---|---|---|
    | Snapdragon 8 Gen 2 | Galaxy S23, OnePlus 11 | 8-12GB | 25-35 | 12-18 |
    | Snapdragon 8 Gen 3 | Galaxy S24, OnePlus 12 | 8-12GB | 35-45 | 18-25 |
    | Tensor G3 | Pixel 8/8 Pro | 12GB | 25-35 | 12-18 |
    | Tensor G4 | Pixel 9/9 Pro | 12-16GB | 30-40 | 15-22 |
    | Snapdragon 7 Gen 3 | Mid-range 2024+ | 6-8GB | 18-25 | 8-12 |

    Above 10 tokens per second is usable for chat interfaces. Flagship devices from the last two years handle 1-3B models comfortably.
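    To put the usability bar in concrete terms, a quick estimate (the reply length below is a hypothetical example):

```kotlin
// Time for a reply to finish streaming at a given decode speed.
fun streamSeconds(replyTokens: Int, tokensPerSecond: Double): Double =
    replyTokens / tokensPerSecond

// The usability bar discussed above: ~10 tokens/second for chat interfaces.
fun usableForChat(tokensPerSecond: Double): Boolean = tokensPerSecond >= 10.0
```

    A 120-token reply at 12 tok/s (mid-range for a 3B model on Snapdragon 8 Gen 2) finishes in 10 seconds, but because tokens stream from the first second, the interface feels responsive well before the reply completes.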

    The Android Fragmentation Factor

    Android has more device diversity than iOS. This is both a challenge and an advantage:

    Challenge: You need to test across chipsets and RAM configurations. A model that runs well on a Galaxy S24 (12GB) might struggle on a budget phone with 4GB.

    Advantage: Many Android devices have 8-12GB RAM, which is generous for on-device models. The mid-range 6-8GB segment can still run 1B models effectively.

    Practical approach: Target 1B models for broad compatibility (supports 4GB+ devices). Offer 3B models as an upgrade for devices with 8GB+ RAM. Detect available memory at runtime and adjust.
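    The runtime check might look like the sketch below. On a device you would feed it `ActivityManager.MemoryInfo.availMem`; the gigabyte thresholds here are illustrative assumptions, not platform constants:

```kotlin
// Hypothetical model-tier selector driven by available RAM at runtime.
enum class ModelTier { NONE, SMALL_1B, MEDIUM_3B }

fun chooseModelTier(availMemBytes: Long): ModelTier {
    val availGb = availMemBytes / (1024.0 * 1024.0 * 1024.0)
    return when {
        availGb >= 4.0 -> ModelTier.MEDIUM_3B // typically free on 8GB+ devices
        availGb >= 1.5 -> ModelTier.SMALL_1B  // broad 4GB+ compatibility
        else -> ModelTier.NONE                // fall back to cloud or disable the feature
    }
}
```

    Checking available memory rather than total memory matters on Android: a 8GB device with heavy background usage may have less headroom than a lightly loaded 6GB one.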

    Memory Management

    Android's memory management is more aggressive than iOS's. The system kills background processes to reclaim memory. Key practices:

    • Load the model in a foreground Service or when the AI feature is active
    • Release model memory when the user navigates away from the AI feature
    • Handle onTrimMemory callbacks to release resources under pressure
    • Use ActivityManager.getMemoryInfo() to check available RAM before loading
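    The onTrimMemory decision from the list above can be sketched as a small predicate. The trim-level values are the real `ComponentCallbacks2` constants, inlined here so the sketch is self-contained; the release policy itself is an assumption you would tune per app:

```kotlin
// Android ComponentCallbacks2 trim levels (framework values, inlined).
const val TRIM_MEMORY_RUNNING_MODERATE = 5
const val TRIM_MEMORY_RUNNING_CRITICAL = 15
const val TRIM_MEMORY_UI_HIDDEN = 20

// Release multi-GB model weights when the UI is hidden (user navigated away)
// or the system reports critical pressure while we are in the foreground.
fun shouldReleaseModel(trimLevel: Int): Boolean =
    trimLevel >= TRIM_MEMORY_UI_HIDDEN || trimLevel == TRIM_MEMORY_RUNNING_CRITICAL
```

    In an Activity or Service, `override fun onTrimMemory(level: Int)` would call this and unload the model when it returns true; reloading on next use costs a few seconds but avoids the process being killed outright.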

    Cost

    Same as iOS: one-time fine-tuning ($5-50), CDN distribution, then zero per-inference cost.

    The Comparison

    | Factor | ML Kit | Cloud API | On-Device LLM |
    |---|---|---|---|
    | Text generation / chat | No | Yes | Yes |
    | OCR / barcode scanning | Yes (optimized) | Yes | No |
    | Offline support | Yes | No | Yes |
    | Cost per inference | $0 | $0.0001-$0.01 | $0 |
    | Device coverage | All Android 5.0+ | All with internet | 4GB+ RAM |
    | Privacy | On-device | Third-party servers | On-device |
    | Custom model support | No | Via API selection | Any GGUF model |
    | Fine-tuning | No | Some providers | Full LoRA/QLoRA |

    Practical Decision Framework

    Use ML Kit when you need OCR, barcode scanning, face detection, pose estimation, or image labeling. Google's implementations are production-grade and free.

    Use cloud APIs when you are validating a feature, serving very low volume, or need frontier reasoning. Gemini's Android SDK makes this especially easy for Android developers.

    Use on-device LLMs when you need conversational AI, content generation, classification, or any text-heavy AI feature at scale. The zero-cost scaling, offline support, and privacy guarantees are decisive advantages for production mobile apps.
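    The framework condenses into a small when-expression. The enum and parameter names are illustrative, not from any SDK:

```kotlin
// Hypothetical encoding of the decision framework above.
enum class AiTask { OCR, BARCODE, FACE_DETECTION, POSE, IMAGE_LABELING, TEXT_GENERATION }
enum class Approach { ML_KIT, CLOUD_API, ON_DEVICE_LLM }

fun chooseApproach(
    task: AiTask,
    prototyping: Boolean,
    needsFrontierReasoning: Boolean,
): Approach = when {
    // Google ships an optimized, free model for these tasks
    task != AiTask.TEXT_GENERATION -> Approach.ML_KIT
    // Validate fast, or buy top-end reasoning per token
    prototyping || needsFrontierReasoning -> Approach.CLOUD_API
    // Text-heavy features at scale: zero marginal cost, offline, private
    else -> Approach.ON_DEVICE_LLM
}
```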

    The fine-tuning pipeline (dataset preparation, LoRA training, GGUF export) is where tools like Ertas save time. The visual interface handles the full workflow, and the exported GGUF runs on any Android device via llama.cpp with no additional configuration.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

