Android
Integrate an Ertas-trained GGUF into an Android app via Flutter, React Native, or native Kotlin, with Play Store delivery patterns that survive review.
Android is the friendliest platform to iterate on. Sideloading during development is one adb install away, debug builds skip code signing review, and Google Play Console's turnaround is fast once you have the privacy and permissions story straight. The trade-off is that you own more of the delivery story than on iOS: Android has no equivalent of the system-managed URLSession background downloads that iOS provides, and WorkManager plus DownloadManager is the only path that survives doze mode reliably.
This page covers integration paths (Flutter via llamadart, React Native, native Kotlin) and the Play Store specifics that affect submission. For Android-specific delivery mechanisms (Play Asset Delivery, sizing) and the cross-platform download UX, see Model delivery and UX. For the GGUF bundle, see GGUF overview.
Integration paths
Flutter + llamadart (the reference path)
llamadart gives you a Flutter FFI binding to llama.cpp that compiles and bundles llama.cpp via the package's NDK build system. The Ertas reference Flutter POC ships on Android with this path; no CMakeLists.txt, no manual .so management, and no custom Gradle tasks are required.
# pubspec.yaml
dependencies:
  llamadart: ^0.6.9
  path_provider: ^2.1.5
  http: ^1.2.2
import 'package:llamadart/llamadart.dart';

class LlamaService {
  LlamaEngine? _engine;
  ChatSession? _session;
  bool _isLoaded = false;

  Future<void> loadModel(String modelPath) async {
    _engine = LlamaEngine(LlamaBackend());
    await _engine!.loadModel(modelPath);
    _session = ChatSession(_engine!);
    _isLoaded = true;
  }

  Future<String> generate(String prompt, {int maxTokens = 256}) async {
    if (!_isLoaded) throw StateError('Model not loaded');
    final response = await _session!.generate(prompt, maxTokens: maxTokens);
    _session!.reset(); // critical, see below
    return response.trim();
  }

  void dispose() {
    _engine?.dispose();
    _engine = null;
    _session = null;
    _isLoaded = false;
  }
}
Call session.reset() after every generate(). Without it, prior prompts and outputs accumulate in the session context and contaminate the next generation. This is the single most common bug when integrating llamadart, and the failure mode is subtle: outputs get progressively worse instead of erroring.
Android-specific llamadart considerations:
- minSdk = 24 (Android 7.0) is what the Ertas reference Flutter POC uses; llamadart itself does not document a strict minimum. The 6 GB device-RAM requirement is the practical floor regardless, and every device meeting it ships Android 9 or later. Lower minSdk values may work in theory but are not validated.
- 6 GB total device RAM minimum for a 1B-class model. A Q4_K_M model holds native RAM roughly equal to the GGUF file size (around 700 MB for a 1B model, scaling up with model size), plus 50 to 200 MB for the KV cache. On a 4 GB device, the OS will kill the app under memory pressure. The 6 GB floor is consistent with how the industry has positioned on-device LLM features more broadly: Apple Intelligence requires iPhone 15 Pro and up (8 GB), and Google's Gemini Nano on the Pixel 8 (8 GB) and Pixel 8 Pro (12 GB) sits in similar territory. Surface RAM requirements in your store listing.
- ARM64 NEON SIMD is always on. Newer chips also enable SME2. You do not configure thread counts manually; llamadart auto-tunes based on the device.
- No special permissions needed. Inference is fully local; no INTERNET or storage permissions are required beyond what your download flow uses.
- Always check isLoaded before each generate(). Android can kill the process while backgrounded under memory pressure; on resume, the Dart object exists but the native pointer is dead. Reload defensively.
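The RAM guidance above can be turned into a pre-load check. The following is a minimal Kotlin sketch, not part of llamadart: `MemoryBudget`, `estimateBudget`, and the default thresholds are hypothetical names and values chosen for illustration. It assumes the caller reads total device RAM from `ActivityManager.MemoryInfo.totalMem`, and it sets the default floor slightly below 6 GiB on the assumption that devices marketed as 6 GB report somewhat less than that to apps; tune it against real hardware.

```kotlin
const val MIB: Long = 1024L * 1024L
const val GIB: Long = 1024L * MIB

/** Result of the estimate: bytes the model is expected to hold, and whether the device clears the floor. */
data class MemoryBudget(val requiredBytes: Long, val fitsDevice: Boolean)

/**
 * Hypothetical helper that applies the sizing rule from the text:
 * native RAM is roughly the GGUF file size plus a KV-cache allowance.
 *
 * @param ggufBytes        size of the GGUF file on disk
 * @param deviceTotalBytes total device RAM (e.g. ActivityManager.MemoryInfo.totalMem)
 * @param kvCacheBytes     assumed KV-cache allowance; 200 MiB is the upper end cited above
 * @param floorBytes       assumed minimum total RAM; 5.5 GiB approximates a "6 GB" device
 */
fun estimateBudget(
    ggufBytes: Long,
    deviceTotalBytes: Long,
    kvCacheBytes: Long = 200 * MIB,
    floorBytes: Long = 5 * GIB + GIB / 2,
): MemoryBudget = MemoryBudget(
    requiredBytes = ggufBytes + kvCacheBytes,
    fitsDevice = deviceTotalBytes >= floorBytes,
)
```

An app could call `estimateBudget(modelFile.length(), memoryInfo.totalMem)` before offering the download and hide the feature when `fitsDevice` is false, rather than letting the OS kill the process later.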
React Native
For React Native, llama.rn wraps llama.cpp via JSI and ships prebuilt .so libraries with OpenCL and Hexagon NPU acceleration paths. Configuration is similar in spirit to llamadart's: download the model after install rather than bundling it, dispose the native engine when done, and run inference off the JS thread. Two extra constraints are worth knowing: React Native New Architecture is required (starting with llama.rn v0.10), and only the arm64-v8a and x86_64 ABIs are supported, so drop armeabi-v7a from your build config when adding the package. Other RN llama.cpp wrappers exist, but llama.rn is the most actively maintained at the time of writing.
Native Kotlin (JNI to llama.cpp)
For native Android apps without a cross-platform framework, the path is to build llama.cpp as a native library and call it via JNI. The llama.cpp repo ships an Android example that demonstrates the full bridge; the snippets below are the minimum scaffolding.
1. Add llama.cpp as a submodule and a CMakeLists.txt:
git submodule add https://github.com/ggerganov/llama.cpp.git app/src/main/cpp/llama.cpp
# app/src/main/cpp/CMakeLists.txt
cmake_minimum_required(VERSION 3.22)
project("llama-android")
set(LLAMA_NATIVE OFF CACHE BOOL "" FORCE)
add_subdirectory(llama.cpp build-llama)
add_library(llama-android SHARED llama-android.cpp)
target_link_libraries(llama-android PRIVATE llama log android)
2. Write a thin JNI wrapper (llama-android.cpp):
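#include <jni.h>
// Linking against the `llama` CMake target exports its include directories,
// so the header is found without a repo-relative path.
#include "llama.h"

extern "C" JNIEXPORT jlong JNICALL
Java_com_example_llama_Llm_loadModel(JNIEnv* env, jobject, jstring path) {
    const char* p = env->GetStringUTFChars(path, nullptr);
    if (p == nullptr) return 0;  // JVM could not allocate the UTF-8 copy
    llama_backend_init();
    llama_model_params mp = llama_model_default_params();
    llama_model* model = llama_load_model_from_file(p, mp);
    env->ReleaseStringUTFChars(path, p);
    // A failed load yields nullptr, which reaches Kotlin as 0.
    return reinterpret_cast<jlong>(model);
}

extern "C" JNIEXPORT void JNICALL
Java_com_example_llama_Llm_freeModel(JNIEnv*, jobject, jlong handle) {
    llama_free_model(reinterpret_cast<llama_model*>(handle));
}

// Add `generate(handle, prompt, maxTokens)` similarly:
// llama_context, llama_decode, llama_sample_token, llama_token_to_piece
// Note: these function names match older llama.cpp releases; newer releases
// rename several of them, so check llama.h in the revision you pin.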
3. Configure Gradle (app/build.gradle.kts):
android {
    defaultConfig {
        ndk { abiFilters += listOf("arm64-v8a", "x86_64") }
        externalNativeBuild {
            cmake { arguments += "-DLLAMA_NATIVE=OFF" }
        }
    }
    externalNativeBuild {
        cmake {
            path = file("src/main/cpp/CMakeLists.txt")
            version = "3.22.1"
        }
    }
}
4. Kotlin loader (Llm.kt):
class Llm {
    private var handle: Long = 0

    fun load(path: String) {
        handle = loadModel(path)
    }

    fun close() {
        if (handle != 0L) freeModel(handle)
        handle = 0
    }

    private external fun loadModel(path: String): Long
    private external fun freeModel(handle: Long)

    companion object {
        init { System.loadLibrary("llama-android") }
    }
}
The native Kotlin path gives you tight control over inference parameters and avoids the FFI overhead, but adds a build-time dependency on the NDK and a maintenance burden on your team. For most apps where AI is a feature rather than the product, Flutter or React Native with a prebuilt FFI binding is the better trade-off. Reach for native Kotlin only when you have a clear reason: latency budget, special tokenizer handling, or an existing C++ codebase you want to integrate alongside the inference.
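Part of that maintenance burden is owning the native pointer's lifecycle: a double free or a call after close crashes the process instead of throwing. The sketch below shows one way to guard the handle. `ModelHandle` and `withPointer` are hypothetical names, and the `load` and `free` lambdas stand in for the `external` JNI functions so the class can be exercised on a plain JVM; it is an illustration under those assumptions, not the llama.cpp Android example's approach.

```kotlin
import java.util.concurrent.atomic.AtomicLong

/**
 * Hypothetical wrapper around a native model pointer.
 * `load` and `free` are placeholders for the JNI functions (loadModel / freeModel).
 */
class ModelHandle(
    path: String,
    load: (String) -> Long,
    private val free: (Long) -> Unit,
) : AutoCloseable {
    private val handle = AtomicLong(load(path))

    init {
        // A failed native load returns a null pointer, which arrives here as 0.
        check(handle.get() != 0L) { "Failed to load model at $path" }
    }

    /** Runs [block] with the raw pointer, failing fast if the model was already closed. */
    fun <T> withPointer(block: (Long) -> T): T {
        val ptr = handle.get()
        check(ptr != 0L) { "Model is closed" }
        return block(ptr)
    }

    /** Frees the native model at most once, even if close() is called repeatedly. */
    override fun close() {
        val ptr = handle.getAndSet(0L)
        if (ptr != 0L) free(ptr)
    }
}
```

Implementing `AutoCloseable` lets callers write `ModelHandle(path, ::loadModel, ::freeModel).use { ... }`. The sketch does not stop another thread from closing the handle while `withPointer` is running; confining inference and close to a single thread or adding a lock would be needed for that.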
MLC LLM
MLC LLM is an alternative inference engine that targets TVM-compiled kernels. It produces fast inference on Android but does not load GGUF directly; models must be re-quantised to MLC's own format. Useful if your team already has an MLC pipeline; otherwise the GGUF + llama.cpp path is simpler.
Model delivery on Android
The default delivery path is direct HTTPS download from a public artifact host (Hugging Face works well for public GGUFs) on first launch. Play Asset Delivery (PAD) is the Play Store-managed alternative when your model is app-internal and you distribute only through Google Play.
For the full delivery decision tree (PAD vs Hugging Face vs custom CDN), background download mechanics with WorkManager and DownloadManager, storage locations, integrity verification, and first-run UX patterns, see Model delivery and UX. It is the cross-platform reference; the Android-specific bits there cover everything you would otherwise need to repeat here.
A short Android-specific summary:
- Storage: Context.getFilesDir(). Never getCacheDir() (the OS evicts it under pressure).
- Download: WorkManager for orchestration, DownloadManager for the transfer. Do not roll your own HTTP client.
- Integrity: SHA-256 against an expected hash, atomic rename from .tmp to the active path.
- UX: Wi-Fi-only default with a visible toggle. Show size, free disk space, and a time estimate before the download starts.
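The integrity step in the list above can be sketched in a few lines of Kotlin. The function names `sha256Hex` and `verifyAndPromote` are hypothetical, and the sketch assumes the temporary file and the final path sit in the same directory (for example under getFilesDir()) so the rename stays on one filesystem. It uses `File.renameTo` rather than `java.nio.file.Files.move` because the latter is not available on Android at API 24.

```kotlin
import java.io.File
import java.security.MessageDigest

/** Streams [file] through SHA-256 and returns the digest as lowercase hex. */
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(1 shl 16)
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

/**
 * Hypothetical helper: checks [tmp] against [expectedSha256] and, on a match,
 * renames it to [target]. On a mismatch the temp file is deleted.
 * Returns true only when the verified file is in place at [target].
 */
fun verifyAndPromote(tmp: File, target: File, expectedSha256: String): Boolean {
    if (!sha256Hex(tmp).equals(expectedSha256, ignoreCase = true)) {
        tmp.delete() // never leave a corrupt download where a retry could pick it up
        return false
    }
    return tmp.renameTo(target)
}
```

Because the app only ever opens the target path, a crash mid-download leaves at most a stale .tmp file, never a half-written model at the active path.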
Play Store submission
Google Play's review is less invasive than Apple's, but a few items are worth knowing:
| Concern | What to do |
|---|---|
| Privacy declaration | Declare crash and performance data collection in the Data Safety section. Crash data is "Analytics" on Google Play (unlike "App Functionality" on iOS for the same data). |
| Target SDK | Google requires apps to target a recent Android API level; the floor moves up every year. Set targetSdk to the latest stable each release cycle. |
| AAB submission | Google Play has required Android App Bundles (.aab) for new apps since August 2021; APKs are accepted only for updates to apps published before that date. Play generates per-device APK splits automatically. |
| 64-bit requirement | Apps that include native code must ship 64-bit libraries. arm64-v8a is the floor; armeabi-v7a is optional. The llamadart and llama.rn bindings already ship 64-bit-only by default. |
| Crashlytics symbol upload | Unlike iOS where dSYM upload runs in the Xcode build phase, Android requires a manual CLI upload of Flutter's debug symbols (firebase crashlytics:symbols:upload) for the obfuscated build to produce useful stack traces. |
| Play Console review turnaround | Typically a few hours for established apps; the first submission can take several days. |
Google does not require an annual developer fee renewal (the Google Play Developer Console fee is one-time, $25). Compare to Apple's $99 per year.
Deep Links (App Links)
If you want a URL on your domain to open your app, use Android App Links:
- In AndroidManifest.xml, add an intent filter with android:autoVerify="true" and the host you want to claim.
- Host an assetlinks.json file at https://yourdomain.com/.well-known/assetlinks.json with Content-Type: application/json.
[{
  "relation": ["delegate_permission/common.handle_all_urls"],
  "target": {
    "namespace": "android_app",
    "package_name": "com.yourcompany.yourapp",
    "sha256_cert_fingerprints": ["<your-cert-fingerprint>"]
  }
}]
Unlike iOS's AASA file (which has no extension and requires a Content-Type rewrite on most hosts), assetlinks.json is a literal .json file at the well-known path. Most static hosts serve it correctly with no extra configuration.
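The sha256_cert_fingerprints entry is the SHA-256 digest of the app's signing certificate, written as colon-separated uppercase hex. In practice you would copy it from keytool output or the Play Console rather than compute it yourself; the Kotlin sketch below only illustrates the expected format. `certFingerprint` is a hypothetical helper, and its input is assumed to be the DER-encoded certificate bytes.

```kotlin
import java.security.MessageDigest

/**
 * Hypothetical helper: formats the SHA-256 digest of [certDer] the way
 * assetlinks.json expects it, e.g. "AB:CD:...". The result is always 32 byte pairs.
 */
fun certFingerprint(certDer: ByteArray): String =
    MessageDigest.getInstance("SHA-256")
        .digest(certDer)
        .joinToString(":") { "%02X".format(it) }
```

A fingerprint in any other shape (lowercase, no colons, or SHA-1) is a common reason for autoVerify to fail silently, so comparing against this format is a quick sanity check.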
What's next
iOS
If your Flutter or React Native app also ships to iPhone or iPad.
Model delivery and UX
Play Asset Delivery, WorkManager / DownloadManager, atomic verify.
Performance tips
Context length, threading, KV cache, Android-side battery considerations.
Verifying exports
Smoke-test the GGUF before integrating it into the app.