Android

    Integrate an Ertas-trained GGUF into an Android app via Flutter, React Native, or native Kotlin, with Play Store delivery patterns that survive review.

    Android is the friendliest platform to iterate on. Sideloading during development is one adb install away, debug builds are signed automatically with a debug key, and Google Play Console's turnaround is fast once you have the privacy and permissions story straight. The trade-off is that you own more of the delivery story than on iOS: Android has no direct equivalent of iOS's system-managed URLSession background downloads, and WorkManager plus DownloadManager is the only path that survives doze mode reliably.

    This page covers integration paths (Flutter via llamadart, React Native, native Kotlin) and the Play Store specifics that affect submission. For Android-specific delivery mechanisms (Play Asset Delivery, sizing) and the cross-platform download UX, see Model delivery and UX. For the GGUF bundle, see GGUF overview.

    Integration paths

    Flutter + llamadart (the reference path)

    llamadart gives you a Flutter FFI binding to llama.cpp that compiles and bundles llama.cpp via the package's NDK build system. The Ertas reference Flutter POC ships on Android with this path; no CMakeLists.txt, no manual .so management, and no custom Gradle tasks are required.

    # pubspec.yaml
    dependencies:
      llamadart: ^0.6.9
      path_provider: ^2.1.5
      http: ^1.2.2

    import 'package:llamadart/llamadart.dart';
    
    class LlamaService {
      LlamaEngine? _engine;
      ChatSession? _session;
      bool _isLoaded = false;
    
      Future<void> loadModel(String modelPath) async {
        _engine = LlamaEngine(LlamaBackend());
        await _engine!.loadModel(modelPath);
        _session = ChatSession(_engine!);
        _isLoaded = true;
      }
    
      Future<String> generate(String prompt, {int maxTokens = 256}) async {
        if (!_isLoaded) throw StateError('Model not loaded');
        try {
          final response = await _session!.generate(prompt, maxTokens: maxTokens);
          return response.trim();
        } finally {
          // critical, see below; finally ensures the reset also runs if generate() throws
          _session!.reset();
        }
      }
    
      void dispose() {
        _engine?.dispose();
        _engine = null;
        _session = null;
        _isLoaded = false;
      }
    }

    Call session.reset() after every generate(). Without it, prior prompts and outputs accumulate in the session context and contaminate the next generation. This is the single most common bug when integrating llamadart, and the failure mode is subtle: outputs get progressively worse instead of erroring.

    Android-specific llamadart considerations:

    • minSdk = 24 (Android 7.0) is what the Ertas reference Flutter POC uses; llamadart itself does not document a strict minimum. The 6 GB device-RAM requirement is the practical floor regardless, and every device meeting it ships Android 9 or later. Lower minSdk values may work in theory but are not validated.
    • 6 GB total device RAM minimum for a 1B-class model. A Q4_K_M model holds native RAM roughly equal to the GGUF file size (around 700 MB for a 1B model, scaling up with model size), plus 50 to 200 MB for the KV cache. On a 4 GB device, the OS will kill the app under memory pressure. The 6 GB floor is consistent with how the industry has positioned on-device LLM features more broadly: Apple Intelligence requires iPhone 15 Pro and up (8 GB), and Google's Gemini Nano on the Pixel 8 (8 GB) and Pixel 8 Pro (12 GB) sits in similar territory. Surface RAM requirements in your store listing.
    • ARM64 NEON SIMD is always on. Newer chips also enable SME2. You do not configure thread counts manually; llamadart auto-tunes based on the device.
    • No special permissions needed. Inference is fully local; no INTERNET or storage permissions are required beyond what your download flow uses.
    • Always check isLoaded before each generate(). Android can kill the process while backgrounded under memory pressure; on resume, the Dart object exists but the native pointer is dead. Reload defensively.
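
    The RAM guidance above reduces to simple arithmetic. As a rough sketch in Kotlin (the function names are hypothetical, the 200 MiB default is the upper KV-cache bound quoted above, and the 5 GiB threshold is an assumption that leaves slack because Android usually reports less total memory than the marketed capacity), an app could gate the feature before it downloads anything:

```kotlin
const val MIB: Long = 1024L * 1024L
const val GIB: Long = 1024L * MIB

// Worst-case native footprint: GGUF file size plus the KV cache.
// The 200 MiB default is the upper bound quoted in the text, not a measured value.
fun estimatedFootprintBytes(ggufBytes: Long, kvCacheBytes: Long = 200 * MIB): Long =
    ggufBytes + kvCacheBytes

// Hypothetical gate for a nominal 6 GB device. The 5 GiB default is an assumption:
// the reported total is usually somewhat below the marketed figure.
fun meetsRamFloor(totalRamBytes: Long, floorBytes: Long = 5 * GIB): Boolean =
    totalRamBytes >= floorBytes
```

    On a device, totalRamBytes would typically come from ActivityManager.MemoryInfo.totalMem. For a 700 MiB model the estimate comes to 900 MiB, which is the figure to weigh against what the OS is likely to leave your process.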

    React Native

    For React Native, llama.rn wraps llama.cpp via JSI and ships prebuilt .so libraries with OpenCL and Hexagon NPU acceleration paths. Configuration is similar in spirit to llamadart's: bundle the model post-install, dispose the native engine when done, and run inference off the JS thread. Two extra constraints are worth knowing: React Native New Architecture is required (starting with llama.rn v0.10), and only the arm64-v8a and x86_64 ABIs are supported, so drop armeabi-v7a from your build config when adding the package. Other RN llama.cpp wrappers exist, but llama.rn is the most actively maintained at the time of writing.

    Native Kotlin (JNI to llama.cpp)

    For native Android apps without a cross-platform framework, the path is to build llama.cpp as a native library and call it via JNI. The llama.cpp repo ships an Android example that demonstrates the full bridge; the snippets below are the minimum scaffolding.

    1. Add llama.cpp as a submodule and a CMakeLists.txt:

    git submodule add https://github.com/ggerganov/llama.cpp.git app/src/main/cpp/llama.cpp

    # app/src/main/cpp/CMakeLists.txt
    cmake_minimum_required(VERSION 3.22)
    project("llama-android")
    
    set(LLAMA_NATIVE OFF CACHE BOOL "" FORCE)
    add_subdirectory(llama.cpp build-llama)
    
    add_library(llama-android SHARED llama-android.cpp)
    target_link_libraries(llama-android PRIVATE llama log android)

    2. Write a thin JNI wrapper (llama-android.cpp):

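    #include <jni.h>
    #include "llama.h" // found via the include directories the llama CMake target exports
    
    extern "C" JNIEXPORT jlong JNICALL
    Java_com_example_llama_Llm_loadModel(JNIEnv* env, jobject, jstring path) {
        const char* p = env->GetStringUTFChars(path, nullptr);
        if (p == nullptr) return 0; // JVM could not allocate the string; an exception is pending
        llama_backend_init();
        llama_model_params mp = llama_model_default_params();
        // Returns nullptr on failure, which reaches Kotlin as a 0 handle.
        // Recent llama.cpp releases rename this call to llama_model_load_from_file.
        llama_model* model = llama_load_model_from_file(p, mp);
        env->ReleaseStringUTFChars(path, p);
        return reinterpret_cast<jlong>(model);
    }
    
    extern "C" JNIEXPORT void JNICALL
    Java_com_example_llama_Llm_freeModel(JNIEnv*, jobject, jlong handle) {
        // Recent llama.cpp releases rename this call to llama_model_free.
        llama_free_model(reinterpret_cast<llama_model*>(handle));
    }
    
    // Add `generate(handle, prompt, maxTokens)` similarly:
    // llama_context, llama_decode, llama_token_to_piece, plus a sampler
    // (llama_sample_token in older releases, the llama_sampler_* API in newer ones)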

    3. Configure Gradle (app/build.gradle.kts):

    android {
        defaultConfig {
            ndk { abiFilters += listOf("arm64-v8a", "x86_64") }
            externalNativeBuild {
                cmake { arguments += "-DLLAMA_NATIVE=OFF" }
            }
        }
        externalNativeBuild {
            cmake {
                path = file("src/main/cpp/CMakeLists.txt")
                version = "3.22.1"
            }
        }
    }

    4. Kotlin loader (Llm.kt):

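    package com.example.llama // must match the Java_com_example_llama_Llm_* JNI symbol names
    
    class Llm {
        private var handle: Long = 0
    
        fun load(path: String) {
            close() // release any previously loaded model first
            handle = loadModel(path)
            // The native side returns 0 when llama.cpp fails to load the file.
            check(handle != 0L) { "Failed to load model at $path" }
        }
    
        fun close() {
            if (handle != 0L) freeModel(handle)
            handle = 0
        }
    
        private external fun loadModel(path: String): Long
        private external fun freeModel(handle: Long)
    
        companion object {
            init { System.loadLibrary("llama-android") }
        }
    }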

    The native Kotlin path gives you tight control over inference parameters and avoids the FFI overhead, but adds a build-time dependency on the NDK and a maintenance burden on your team. For most apps where AI is a feature rather than the product, Flutter or React Native with a prebuilt FFI binding is the better trade-off. Reach for native Kotlin only when you have a clear reason: latency budget, special tokenizer handling, or an existing C++ codebase you want to integrate alongside the inference.

    MLC LLM

    MLC LLM is an alternative inference engine that targets TVM-compiled kernels. It produces fast inference on Android but does not load GGUF directly; models must be re-quantised to MLC's own format. Useful if your team already has an MLC pipeline; otherwise the GGUF + llama.cpp path is simpler.

    Model delivery on Android

    The default delivery path is direct HTTPS download from a public artifact host (Hugging Face works well for public GGUFs) on first launch. Play Asset Delivery (PAD) is the Play Store-managed alternative when your model is app-internal and you distribute only through Google Play.

    For the full delivery decision tree (PAD vs Hugging Face vs custom CDN), background download mechanics with WorkManager and DownloadManager, storage locations, integrity verification, and first-run UX patterns, see Model delivery and UX. It is the cross-platform reference; the Android-specific bits there cover everything you would otherwise need to repeat here.

    A short Android-specific summary:

    • Storage: Context.getFilesDir(). Never getCacheDir() (the OS evicts it under pressure).
    • Download: WorkManager for orchestration, DownloadManager for the transfer. Do not roll your own HTTP client.
    • Integrity: SHA-256 against an expected hash, atomic rename from .tmp to the active path.
    • UX: Wi-Fi-only default with a visible toggle. Show size, free disk space, and a time estimate before the download starts.
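
    The integrity step in the summary above can be sketched in a few lines of Kotlin. This is a minimal illustration, not the reference implementation: sha256Hex and promoteIfValid are hypothetical helper names, and the expected hash is assumed to come from your own release metadata.

```kotlin
import java.io.File
import java.security.MessageDigest

// Streams the file through SHA-256 so a multi-gigabyte GGUF never sits in memory.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(64 * 1024)
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

// Hypothetical helper: promotes the .tmp download to the active path only when
// the hash matches. A mismatched file is deleted so the next attempt starts clean.
fun promoteIfValid(tmp: File, active: File, expectedSha256: String): Boolean {
    if (!sha256Hex(tmp).equals(expectedSha256, ignoreCase = true)) {
        tmp.delete()
        return false
    }
    // A rename is only atomic when both paths are on the same filesystem,
    // so keep the .tmp file next to the active file (for example, both under filesDir).
    return tmp.renameTo(active)
}
```

    Calling promoteIfValid(tmpFile, modelFile, expectedHash) once the download completes means the loader only ever sees a fully verified file at the active path, and an interrupted or corrupted download never replaces a working model.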

    Play Store submission

    Google Play's review is less invasive than Apple's, but a few items are worth knowing:

    | Concern | What to do |
    | --- | --- |
    | Privacy declaration | Declare crash and performance data collection in the Data Safety section. Crash data is "Analytics" on Google Play (unlike "App Functionality" on iOS for the same data). |
    | Target SDK | Google requires apps to target a recent Android API level; the floor moves up every year. Set targetSdk to the latest stable each release cycle. |
    | AAB submission | Google Play has required Android App Bundles (.aab) for new apps since August 2021; APKs are accepted only for updates to apps published before that date. Play generates per-device APK splits automatically. |
    | 64-bit requirement | All native code must be 64-bit. arm64-v8a is the floor; armeabi-v7a is optional. The llamadart and llama.rn bindings already ship 64-bit-only by default. |
    | Crashlytics symbol upload | Unlike iOS where dSYM upload runs in the Xcode build phase, Android requires a manual CLI upload of Flutter's debug symbols (firebase crashlytics:symbols:upload) for the obfuscated build to produce useful stack traces. |
    | Play Console review turnaround | Typically a few hours for established apps; the first submission can take several days. |

    Google does not require an annual developer fee renewal (the Google Play Developer Console fee is one-time, $25). Compare to Apple's $99 per year.

    If you want a URL on your domain to open your app, use Android App Links:

    1. In AndroidManifest.xml, add an intent filter with android:autoVerify="true" and the host you want to claim.
    2. Host an assetlinks.json file at https://yourdomain.com/.well-known/assetlinks.json with Content-Type: application/json.

    [{
      "relation": ["delegate_permission/common.handle_all_urls"],
      "target": {
        "namespace": "android_app",
        "package_name": "com.yourcompany.yourapp",
        "sha256_cert_fingerprints": ["<your-cert-fingerprint>"]
      }
    }]

    Unlike iOS's AASA file (which has no extension and requires a Content-Type rewrite on most hosts), assetlinks.json is a literal .json file at the well-known path. Most static hosts serve it correctly with no extra configuration.
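
    The sha256_cert_fingerprints entries are colon-separated, uppercase hex digests of the signing certificate. In practice the value is usually copied from Play Console or from keytool output rather than computed in code; the Kotlin sketch below (certFingerprint is a hypothetical helper, and it assumes you already hold the certificate's DER-encoded bytes) only illustrates the expected format:

```kotlin
import java.security.MessageDigest

// Formats the SHA-256 digest of a DER-encoded certificate the way
// assetlinks.json expects it: uppercase hex pairs joined by colons.
fun certFingerprint(certDer: ByteArray): String =
    MessageDigest.getInstance("SHA-256")
        .digest(certDer)
        .joinToString(":") { "%02X".format(it) }
```

    The result is 32 hex pairs (95 characters including the colons). If the app is enrolled in Play App Signing, the fingerprint to publish is that of the app signing key shown in Play Console, not the upload key.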

    What's next