
llama.cpp on Android: A Kotlin Integration Guide
Step-by-step guide to integrating llama.cpp into an Android app with Kotlin. JNI bindings, Vulkan GPU acceleration, model loading, and memory management across the Android device spectrum.
llama.cpp runs GGUF language models on Android devices using CPU multi-threading and Vulkan GPU acceleration. The llama.android project provides pre-built Kotlin bindings through JNI, making integration straightforward for Kotlin developers.
This guide covers the full integration path from project setup to production deployment.
Integration Options
Option 1: llama.android Library (Recommended)
The llama.cpp repository includes llama.android, a pre-built Android library with Kotlin bindings. This is the fastest path to working on-device AI.
Add it to your project:
- Clone or download the llama.android module from the llama.cpp repository
- Include it as a module in your Android project
- Add the dependency in your app's build.gradle.kts:

dependencies {
    implementation(project(":llama"))
}
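If you copied the module into your project checkout as llama/, also register it in settings.gradle.kts so the :llama project reference above resolves (the module name is an assumption; match whatever path you used):

// settings.gradle.kts
include(":llama")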
Option 2: Build from Source
For more control, build llama.cpp as a native library using the Android NDK:
mkdir build-android && cd build-android
cmake .. \
  -DCMAKE_TOOLCHAIN_FILE=$NDK_PATH/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-26 \
  -DGGML_VULKAN=ON \
  -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release
This produces libllama.so (recent llama.cpp versions also emit the libggml*.so libraries it links against), which you include in your jniLibs directory. Note that older llama.cpp revisions named the Vulkan flag -DLLAMA_VULKAN=ON; current versions use -DGGML_VULKAN=ON.
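A typical placement, assuming the default jniLibs source set (exact library names vary by llama.cpp version):

app/src/main/jniLibs/
└── arm64-v8a/
    ├── libllama.so
    └── libggml.so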
Option 3: Pre-Built AAR
Some community projects publish llama.cpp as an AAR (Android Archive) that you can include as a Maven dependency. Check for recent, maintained versions.
Project Setup
Minimum Requirements
- Android API 26+ (Android 8.0)
- ARM64 (arm64-v8a) target architecture
- NDK r25+ for building native code
- 4GB+ RAM on target device (for 1B models)
Build Configuration
// app/build.gradle.kts
android {
    defaultConfig {
        ndk {
            abiFilters += "arm64-v8a" // 64-bit ARM only
        }
    }
}
Permissions
No special permissions required for inference. For model download:
<uses-permission android:name="android.permission.INTERNET" />
Loading a Model
class LlamaEngine(private val context: Context) {

    private var model: Long = 0 // Native model pointer
    private var ctx: Long = 0   // Native context pointer

    suspend fun loadModel(modelPath: String) = withContext(Dispatchers.Default) {
        // Load model with GPU acceleration
        model = LlamaNative.loadModel(
            modelPath = modelPath,
            nGpuLayers = 99, // Offload all layers to Vulkan
        )
        require(model != 0L) { "Failed to load model" }

        // Create inference context
        ctx = LlamaNative.createContext(
            model = model,
            nCtx = 2048,  // Context window
            nThreads = 4, // CPU threads
            nBatch = 512, // Batch size
        )
        require(ctx != 0L) { "Failed to create context" }
    }

    fun unload() {
        if (ctx != 0L) {
            LlamaNative.freeContext(ctx)
            ctx = 0
        }
        if (model != 0L) {
            LlamaNative.freeModel(model)
            model = 0
        }
    }
}
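The LlamaNative object used here is a thin JNI facade over the native library. The exact signatures depend on your bindings; this is a minimal sketch of what such a facade might declare, not the shipped llama.android API:

object LlamaNative {
    init {
        // Must match the native library name (libllama.so)
        System.loadLibrary("llama")
    }

    external fun loadModel(modelPath: String, nGpuLayers: Int): Long
    external fun createContext(model: Long, nCtx: Int, nThreads: Int, nBatch: Int): Long
    external fun generate(
        context: Long,
        prompt: String,
        maxTokens: Int,
        temperature: Float,
        onToken: (String) -> Unit,
    )
    external fun freeContext(context: Long)
    external fun freeModel(model: Long)
}

Each external function needs a matching JNI implementation in C++ that forwards to the llama.cpp API.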
Vulkan GPU Acceleration
Vulkan is Android's GPU compute API. llama.cpp uses it to accelerate matrix operations during inference. Enable it by setting nGpuLayers to the model's layer count.
Vulkan support depends on the device:
- Snapdragon 8 Gen 2+: Full Vulkan compute support, best performance
- Tensor G3/G4: Good Vulkan support
- Snapdragon 7 Gen 3+: Vulkan supported, moderate acceleration
- Older/budget devices: May lack Vulkan compute. Falls back to CPU automatically.
Check Vulkan availability at runtime:
fun isVulkanAvailable(context: Context): Boolean {
    // FEATURE_VULKAN_HARDWARE_VERSION (API 24+) reports the highest
    // Vulkan version the device driver supports; 0x401000 encodes
    // Vulkan 1.1.
    return context.packageManager.hasSystemFeature(
        PackageManager.FEATURE_VULKAN_HARDWARE_VERSION,
        0x401000
    )
}
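You can gate GPU offload on the result when loading the model (reusing loadModel from above; zero layers means pure CPU inference):

val nGpuLayers = if (isVulkanAvailable(context)) 99 else 0
model = LlamaNative.loadModel(modelPath = modelPath, nGpuLayers = nGpuLayers)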
Generating Text
class LlamaEngine(private val context: Context) {
    // ... loadModel and unload from above

    private val mainHandler = Handler(Looper.getMainLooper())

    suspend fun generate(
        prompt: String,
        maxTokens: Int = 256,
        temperature: Float = 0.7f,
        onToken: (String) -> Unit = {}
    ): String = withContext(Dispatchers.Default) {
        val result = StringBuilder()
        LlamaNative.generate(
            context = ctx,
            prompt = prompt,
            maxTokens = maxTokens,
            temperature = temperature,
        ) { token ->
            result.append(token)
            // The native callback is not a suspend context, so post
            // UI updates to the main thread with a Handler
            mainHandler.post { onToken(token) }
        }
        result.toString()
    }
}
ViewModel Integration
class AiViewModel(application: Application) : AndroidViewModel(application) {

    private val engine = LlamaEngine(application)

    private val _response = MutableStateFlow("")
    val response: StateFlow<String> = _response

    private val _isGenerating = MutableStateFlow(false)
    val isGenerating: StateFlow<Boolean> = _isGenerating

    fun loadModel(path: String) {
        viewModelScope.launch {
            engine.loadModel(path)
        }
    }

    fun generate(prompt: String) {
        viewModelScope.launch {
            _isGenerating.value = true
            _response.value = ""
            try {
                engine.generate(prompt, maxTokens = 256) { token ->
                    _response.value += token
                }
            } finally {
                // Reset the flag even if generation throws
                _isGenerating.value = false
            }
        }
    }

    override fun onCleared() {
        engine.unload()
        super.onCleared()
    }
}
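The production checklist below also calls for cancellable generation. One sketch: keep the Job returned by launch and expose a cancel function. Cancelling the coroutine alone will not interrupt the native decode loop, so this assumes a hypothetical requestStop() hook on the JNI layer that sets a flag the loop checks between tokens:

private var generationJob: Job? = null

fun startGeneration(prompt: String) {
    generationJob = viewModelScope.launch {
        // ... same body as generate() above
    }
}

fun cancelGeneration() {
    LlamaNative.requestStop() // hypothetical native stop flag
    generationJob?.cancel()
    _isGenerating.value = false
}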
Jetpack Compose UI
@Composable
fun AiChatScreen(viewModel: AiViewModel = viewModel()) {
    val response by viewModel.response.collectAsState()
    val isGenerating by viewModel.isGenerating.collectAsState()
    var input by remember { mutableStateOf("") }

    Column(modifier = Modifier.fillMaxSize().padding(16.dp)) {
        // Response area
        Text(
            text = response,
            modifier = Modifier.weight(1f).verticalScroll(rememberScrollState())
        )

        // Input area
        Row(modifier = Modifier.fillMaxWidth()) {
            TextField(
                value = input,
                onValueChange = { input = it },
                modifier = Modifier.weight(1f),
                enabled = !isGenerating
            )
            Button(
                onClick = {
                    viewModel.generate(input)
                    input = ""
                },
                enabled = !isGenerating
            ) {
                Text("Send")
            }
        }
    }
}
Memory Management
Android's memory management is aggressive. The system kills background processes to free RAM and can terminate your app under memory pressure.
Available Memory Check
fun getAvailableMemoryMb(): Long {
    val memInfo = ActivityManager.MemoryInfo()
    val activityManager = getSystemService(ACTIVITY_SERVICE) as ActivityManager
    activityManager.getMemoryInfo(memInfo)
    return memInfo.availMem / (1024 * 1024)
}

fun canLoadModel(modelSizeMb: Long): Boolean {
    val available = getAvailableMemoryMb()
    // Reserve 500MB of headroom for the app and OS
    return available > modelSizeMb + 500
}
Lifecycle Management
Use DefaultLifecycleObserver; the annotation-based @OnLifecycleEvent API is deprecated:

class AiService : DefaultLifecycleObserver {

    private var engine: LlamaEngine? = null

    override fun onResume(owner: LifecycleOwner) {
        // Consider loading the model if the AI screen is active
    }

    override fun onPause(owner: LifecycleOwner) {
        // Unload the model to free memory
        engine?.unload()
    }
}
onTrimMemory Handler
override fun onTrimMemory(level: Int) {
    super.onTrimMemory(level)
    if (level >= ComponentCallbacks2.TRIM_MEMORY_RUNNING_LOW) {
        engine?.unload()
    }
}
Model Delivery
Asset Delivery (Play Feature Delivery)
For models over 150MB, use Play Asset Delivery to avoid the APK size limit:
// app/build.gradle.kts
android {
    assetPacks += listOf(":model_pack")
}

// model_pack/build.gradle.kts
plugins {
    id("com.android.asset-pack")
}

assetPack {
    packName.set("model_pack")
    dynamicDelivery {
        deliveryType.set("install-time") // or "fast-follow" or "on-demand"
    }
}
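Install-time asset packs are exposed through the app's regular AssetManager, but llama.cpp opens models by file path, so copy the GGUF out to app storage before loading (the model.gguf filename is an assumption):

suspend fun ensureModelOnDisk(context: Context): File = withContext(Dispatchers.IO) {
    val target = File(context.filesDir, "model.gguf")
    if (!target.exists()) {
        // Install-time packs merge into the normal assets namespace
        context.assets.open("model.gguf").use { input ->
            target.outputStream().use { output -> input.copyTo(output) }
        }
    }
    target
}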
Post-Install Download
Download the model after installation:
suspend fun downloadModel(url: String, destination: File): Boolean {
    return withContext(Dispatchers.IO) {
        val client = OkHttpClient()
        val request = Request.Builder().url(url).build()
        client.newCall(request).execute().use { response ->
            if (!response.isSuccessful) return@use false
            val body = response.body ?: return@use false
            val totalBytes = body.contentLength()
            destination.outputStream().use { output ->
                body.byteStream().use { input ->
                    val buffer = ByteArray(8192)
                    var bytesRead = 0L
                    var read: Int
                    while (input.read(buffer).also { read = it } != -1) {
                        output.write(buffer, 0, read)
                        bytesRead += read
                        val progress = bytesRead.toFloat() / totalBytes
                        // Report progress to the UI here
                    }
                }
            }
            true
        }
    }
}
Use WorkManager for background downloads that survive app restarts.
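A minimal CoroutineWorker wrapping the helper above might look like this (class name and input-data key are illustrative):

class ModelDownloadWorker(
    context: Context,
    params: WorkerParameters,
) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        val url = inputData.getString("model_url") ?: return Result.failure()
        val destination = File(applicationContext.filesDir, "model.gguf")
        // downloadModel is the suspend helper defined above
        return if (downloadModel(url, destination)) Result.success() else Result.retry()
    }
}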
Testing Across Devices
The Android device spectrum is wide. Test on:
- Flagship (SD 8 Gen 3, 12GB): Verify best-case performance
- Mid-range (SD 7 Gen 3, 6-8GB): Verify 1B model runs smoothly
- Budget (SD 6 Gen 3, 4GB): Verify graceful fallback if model cannot load
- Older flagship (SD 8 Gen 1, 8GB): Verify 2-year-old flagships work
Use Firebase Test Lab or BrowserStack for device coverage testing without owning every device.
Production Checklist
- Model loads and generates correctly on minimum-spec device
- Vulkan acceleration is detected and used when available
- CPU fallback works when Vulkan is unavailable
- Memory check prevents loading on low-RAM devices
- Model unloads on onPause/onTrimMemory
- Download progress shows correctly for post-install delivery
- Model integrity verified after download (SHA256; see the sketch after this list)
- Generation can be cancelled by the user
- App functions normally when model is not available
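For the integrity item, a straightforward SHA-256 check using java.security.MessageDigest (expectedHex would come from your release pipeline):

fun verifySha256(file: File, expectedHex: String): Boolean {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(8192)
        var read: Int
        while (input.read(buffer).also { read = it } != -1) {
            digest.update(buffer, 0, read)
        }
    }
    val actualHex = digest.digest().joinToString("") { "%02x".format(it) }
    return actualHex.equals(expectedHex, ignoreCase = true)
}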
The GGUF model is what determines quality. A model fine-tuned on your domain data (via Ertas or similar) will produce responses specific to your app's purpose. The llama.cpp integration is identical regardless of which model you use.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.