
llama.cpp on Android: A Kotlin Integration Guide
Step-by-step guide to integrating llama.cpp into an Android app with Kotlin. JNI bindings, Vulkan GPU acceleration, model loading, and memory management across the Android device spectrum.
llama.cpp runs GGUF language models on Android devices using CPU multi-threading and Vulkan GPU acceleration. The llama.android project provides pre-built Kotlin bindings through JNI, making integration straightforward for Kotlin developers.
This guide covers the full integration path from project setup to production deployment.
Integration Options
Option 1: llama.android Library (Recommended)
The llama.cpp repository includes llama.android, a pre-built Android library with Kotlin bindings. This is the fastest path to working on-device AI.
Add it to your project:
- Clone or download the llama.android module from the llama.cpp repository
- Include it as a module in your Android project
- Add the dependency in your app's build.gradle.kts:

dependencies {
    implementation(project(":llama"))
}
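If you copied the module into your project checkout as llama/, also register it in settings.gradle.kts so the :llama project reference above resolves (the module name is an assumption; match whatever path you used):

// settings.gradle.kts
include(":llama")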
Option 2: Build from Source
For more control, build llama.cpp as a native library using the Android NDK:
mkdir build-android && cd build-android
cmake .. \
  -DCMAKE_TOOLCHAIN_FILE=$NDK_PATH/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-26 \
  -DGGML_VULKAN=ON \
  -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release
This produces libllama.so (recent llama.cpp versions also emit the libggml*.so libraries it links against), which you include in your jniLibs directory. Note that older llama.cpp revisions named the Vulkan flag -DLLAMA_VULKAN=ON; current versions use -DGGML_VULKAN=ON.
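A typical placement, assuming the default jniLibs source set (exact library names vary by llama.cpp version):

app/src/main/jniLibs/
└── arm64-v8a/
    ├── libllama.so
    └── libggml.so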
Option 3: Pre-Built AAR
Some community projects publish llama.cpp as an AAR (Android Archive) that you can include as a Maven dependency. Check for recent, maintained versions.
Project Setup
Minimum Requirements
- Android API 26+ (Android 8.0)
- ARM64 (arm64-v8a) target architecture
- NDK r25+ for building native code
- 4GB+ RAM on target device (for 1B models)
Build Configuration
// app/build.gradle.kts
android {
    defaultConfig {
        ndk {
            abiFilters += "arm64-v8a" // 64-bit ARM only
        }
    }
}
Permissions
No special permissions required for inference. For model download:
<uses-permission android:name="android.permission.INTERNET" />
Loading a Model
class LlamaEngine(private val context: Context) {

    private var model: Long = 0 // Native model pointer
    private var ctx: Long = 0   // Native context pointer

    suspend fun loadModel(modelPath: String) = withContext(Dispatchers.Default) {
        // Load model with GPU acceleration
        model = LlamaNative.loadModel(
            modelPath = modelPath,
            nGpuLayers = 99, // Offload all layers to Vulkan
        )
        require(model != 0L) { "Failed to load model" }

        // Create inference context
        ctx = LlamaNative.createContext(
            model = model,
            nCtx = 2048,  // Context window
            nThreads = 4, // CPU threads
            nBatch = 512, // Batch size
        )
        require(ctx != 0L) { "Failed to create context" }
    }

    fun unload() {
        if (ctx != 0L) {
            LlamaNative.freeContext(ctx)
            ctx = 0
        }
        if (model != 0L) {
            LlamaNative.freeModel(model)
            model = 0
        }
    }
}
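The LlamaNative object used here is a thin JNI facade over the native library. The exact signatures depend on your bindings; this is a minimal sketch of what such a facade might declare, not the shipped llama.android API:

object LlamaNative {
    init {
        // Must match the native library name (libllama.so)
        System.loadLibrary("llama")
    }

    external fun loadModel(modelPath: String, nGpuLayers: Int): Long
    external fun createContext(model: Long, nCtx: Int, nThreads: Int, nBatch: Int): Long
    external fun generate(
        context: Long,
        prompt: String,
        maxTokens: Int,
        temperature: Float,
        onToken: (String) -> Unit,
    )
    external fun freeContext(context: Long)
    external fun freeModel(model: Long)
}

Each external function needs a matching JNI implementation in C++ that forwards to the llama.cpp API.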
Vulkan GPU Acceleration
Vulkan is Android's GPU compute API. llama.cpp uses it to accelerate matrix operations during inference. Enable it by setting nGpuLayers to the model's layer count.
Vulkan support depends on the device:
- Snapdragon 8 Gen 2+: Full Vulkan compute support, best performance
- Tensor G3/G4: Good Vulkan support
- Snapdragon 7 Gen 3+: Vulkan supported, moderate acceleration
- Older/budget devices: May lack Vulkan compute. Falls back to CPU automatically.
Check Vulkan availability at runtime:
fun isVulkanAvailable(context: Context): Boolean {
    // FEATURE_VULKAN_HARDWARE_VERSION (API 24+) reports the highest
    // Vulkan version the device driver supports; 0x401000 encodes
    // Vulkan 1.1.
    return context.packageManager.hasSystemFeature(
        PackageManager.FEATURE_VULKAN_HARDWARE_VERSION,
        0x401000
    )
}
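You can gate GPU offload on the result when loading the model (reusing loadModel from above; zero layers means pure CPU inference):

val nGpuLayers = if (isVulkanAvailable(context)) 99 else 0
model = LlamaNative.loadModel(modelPath = modelPath, nGpuLayers = nGpuLayers)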
Generating Text
class LlamaEngine(private val context: Context) {
    // ... loadModel and unload from above

    private val mainHandler = Handler(Looper.getMainLooper())

    suspend fun generate(
        prompt: String,
        maxTokens: Int = 256,
        temperature: Float = 0.7f,
        onToken: (String) -> Unit = {}
    ): String = withContext(Dispatchers.Default) {
        val result = StringBuilder()
        LlamaNative.generate(
            context = ctx,
            prompt = prompt,
            maxTokens = maxTokens,
            temperature = temperature,
        ) { token ->
            result.append(token)
            // The native callback is not a suspend context, so post
            // UI updates to the main thread with a Handler
            mainHandler.post { onToken(token) }
        }
        result.toString()
    }
}
ViewModel Integration
class AiViewModel(application: Application) : AndroidViewModel(application) {

    private val engine = LlamaEngine(application)

    private val _response = MutableStateFlow("")
    val response: StateFlow<String> = _response

    private val _isGenerating = MutableStateFlow(false)
    val isGenerating: StateFlow<Boolean> = _isGenerating

    fun loadModel(path: String) {
        viewModelScope.launch {
            engine.loadModel(path)
        }
    }

    fun generate(prompt: String) {
        viewModelScope.launch {
            _isGenerating.value = true
            _response.value = ""
            try {
                engine.generate(prompt, maxTokens = 256) { token ->
                    _response.value += token
                }
            } finally {
                // Reset the flag even if generation throws
                _isGenerating.value = false
            }
        }
    }

    override fun onCleared() {
        engine.unload()
        super.onCleared()
    }
}
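The production checklist below also calls for cancellable generation. One sketch: keep the Job returned by launch and expose a cancel function. Cancelling the coroutine alone will not interrupt the native decode loop, so this assumes a hypothetical requestStop() hook on the JNI layer that sets a flag the loop checks between tokens:

private var generationJob: Job? = null

fun startGeneration(prompt: String) {
    generationJob = viewModelScope.launch {
        // ... same body as generate() above
    }
}

fun cancelGeneration() {
    LlamaNative.requestStop() // hypothetical native stop flag
    generationJob?.cancel()
    _isGenerating.value = false
}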
Jetpack Compose UI
@Composable
fun AiChatScreen(viewModel: AiViewModel = viewModel()) {
    val response by viewModel.response.collectAsState()
    val isGenerating by viewModel.isGenerating.collectAsState()
    var input by remember { mutableStateOf("") }

    Column(modifier = Modifier.fillMaxSize().padding(16.dp)) {
        // Response area
        Text(
            text = response,
            modifier = Modifier.weight(1f).verticalScroll(rememberScrollState())
        )

        // Input area
        Row(modifier = Modifier.fillMaxWidth()) {
            TextField(
                value = input,
                onValueChange = { input = it },
                modifier = Modifier.weight(1f),
                enabled = !isGenerating
            )
            Button(
                onClick = {
                    viewModel.generate(input)
                    input = ""
                },
                enabled = !isGenerating
            ) {
                Text("Send")
            }
        }
    }
}
Memory Management
Android's memory management is aggressive. The system kills background processes to free RAM and can terminate your app under memory pressure.
Available Memory Check
fun getAvailableMemoryMb(): Long {
    val memInfo = ActivityManager.MemoryInfo()
    val activityManager = getSystemService(ACTIVITY_SERVICE) as ActivityManager
    activityManager.getMemoryInfo(memInfo)
    return memInfo.availMem / (1024 * 1024)
}

fun canLoadModel(modelSizeMb: Long): Boolean {
    val available = getAvailableMemoryMb()
    // Reserve 500MB of headroom for the app and OS
    return available > modelSizeMb + 500
}
Lifecycle Management
Use DefaultLifecycleObserver; the annotation-based @OnLifecycleEvent API is deprecated:

class AiService : DefaultLifecycleObserver {

    private var engine: LlamaEngine? = null

    override fun onResume(owner: LifecycleOwner) {
        // Consider loading the model if the AI screen is active
    }

    override fun onPause(owner: LifecycleOwner) {
        // Unload the model to free memory
        engine?.unload()
    }
}
onTrimMemory Handler
override fun onTrimMemory(level: Int) {
    super.onTrimMemory(level)
    if (level >= ComponentCallbacks2.TRIM_MEMORY_RUNNING_LOW) {
        engine?.unload()
    }
}
Model Delivery
Asset Delivery (Play Feature Delivery)
For models over 150MB, use Play Asset Delivery to avoid the APK size limit:
// app/build.gradle.kts
android {
    assetPacks += listOf(":model_pack")
}

// model_pack/build.gradle.kts
plugins {
    id("com.android.asset-pack")
}

assetPack {
    packName.set("model_pack")
    dynamicDelivery {
        deliveryType.set("install-time") // or "fast-follow" or "on-demand"
    }
}
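Install-time asset packs are exposed through the app's regular AssetManager, but llama.cpp opens models by file path, so copy the GGUF out to app storage before loading (the model.gguf filename is an assumption):

suspend fun ensureModelOnDisk(context: Context): File = withContext(Dispatchers.IO) {
    val target = File(context.filesDir, "model.gguf")
    if (!target.exists()) {
        // Install-time packs merge into the normal assets namespace
        context.assets.open("model.gguf").use { input ->
            target.outputStream().use { output -> input.copyTo(output) }
        }
    }
    target
}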
Post-Install Download
Download the model after installation:
suspend fun downloadModel(url: String, destination: File): Boolean {
    return withContext(Dispatchers.IO) {
        val client = OkHttpClient()
        val request = Request.Builder().url(url).build()
        client.newCall(request).execute().use { response ->
            if (!response.isSuccessful) return@use false
            val body = response.body ?: return@use false
            val totalBytes = body.contentLength()
            destination.outputStream().use { output ->
                body.byteStream().use { input ->
                    val buffer = ByteArray(8192)
                    var bytesRead = 0L
                    var read: Int
                    while (input.read(buffer).also { read = it } != -1) {
                        output.write(buffer, 0, read)
                        bytesRead += read
                        val progress = bytesRead.toFloat() / totalBytes
                        // Report progress to the UI here
                    }
                }
            }
            true
        }
    }
}
Use WorkManager for background downloads that survive app restarts.
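A minimal CoroutineWorker wrapping the helper above might look like this (class name and input-data key are illustrative):

class ModelDownloadWorker(
    context: Context,
    params: WorkerParameters,
) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        val url = inputData.getString("model_url") ?: return Result.failure()
        val destination = File(applicationContext.filesDir, "model.gguf")
        // downloadModel is the suspend helper defined above
        return if (downloadModel(url, destination)) Result.success() else Result.retry()
    }
}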
Testing Across Devices
The Android device spectrum is wide. Test on:
- Flagship (SD 8 Gen 3, 12GB): Verify best-case performance
- Mid-range (SD 7 Gen 3, 6-8GB): Verify 1B model runs smoothly
- Budget (SD 6 Gen 3, 4GB): Verify graceful fallback if model cannot load
- Older flagship (SD 8 Gen 1, 8GB): Verify 2-year-old flagships work
Use Firebase Test Lab or BrowserStack for device coverage testing without owning every device.
Production Checklist
- Model loads and generates correctly on minimum-spec device
- Vulkan acceleration is detected and used when available
- CPU fallback works when Vulkan is unavailable
- Memory check prevents loading on low-RAM devices
- Model unloads on onPause/onTrimMemory
- Download progress shows correctly for post-install delivery
- Model integrity verified after download (SHA256; see the sketch after this list)
- Generation can be cancelled by the user
- App functions normally when model is not available
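For the integrity item, a straightforward SHA-256 check using java.security.MessageDigest (expectedHex would come from your release pipeline):

fun verifySha256(file: File, expectedHex: String): Boolean {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(8192)
        var read: Int
        while (input.read(buffer).also { read = it } != -1) {
            digest.update(buffer, 0, read)
        }
    }
    val actualHex = digest.digest().joinToString("") { "%02x".format(it) }
    return actualHex.equals(expectedHex, ignoreCase = true)
}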
The GGUF model is what determines quality. A model fine-tuned on your domain data (via Ertas or similar) will produce responses specific to your app's purpose. The llama.cpp integration is identical regardless of which model you use.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.