llama.cpp on Android: A Kotlin Integration Guide

    Step-by-step guide to integrating llama.cpp into an Android app with Kotlin. JNI bindings, Vulkan GPU acceleration, model loading, and memory management across the Android device spectrum.

Ertas Team

    llama.cpp runs GGUF language models on Android devices using CPU multi-threading and Vulkan GPU acceleration. The llama.android project provides pre-built Kotlin bindings through JNI, making integration straightforward for Kotlin developers.

    This guide covers the full integration path from project setup to production deployment.

Integration Options

Option 1: The llama.android Module

The llama.cpp repository includes llama.android, a pre-built Android library with Kotlin bindings. This is the fastest path to working on-device AI.

    Add it to your project:

    1. Clone or download the llama.android module from the llama.cpp repository
    2. Include it as a module in your Android project
    3. Add the dependency in your app's build.gradle.kts
    dependencies {
        implementation(project(":llama"))
    }
    

    Option 2: Build from Source

    For more control, build llama.cpp as a native library using the Android NDK:

    mkdir build-android && cd build-android
    cmake .. \
        -DCMAKE_TOOLCHAIN_FILE=$NDK_PATH/build/cmake/android.toolchain.cmake \
        -DANDROID_ABI=arm64-v8a \
        -DANDROID_PLATFORM=android-26 \
        -DGGML_VULKAN=ON \
        -DBUILD_SHARED_LIBS=ON
    cmake --build . --config Release
    

    This produces libllama.so (plus the ggml shared libraries it depends on) to place in your app's src/main/jniLibs/arm64-v8a directory. Note that older llama.cpp revisions named the Vulkan flag -DLLAMA_VULKAN=ON.

    Option 3: Pre-Built AAR

    Some community projects publish llama.cpp as an AAR (Android Archive) that you can include as a Maven dependency. Check for recent, maintained versions.

    Project Setup

    Minimum Requirements

    • Android API 26+ (Android 8.0)
    • ARM64 (arm64-v8a) target architecture
    • NDK r25+ for building native code
    • 4GB+ RAM on target device (for 1B models)
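
Those RAM requirements come from the model's runtime footprint: the weights file plus a KV cache that grows with context length. A back-of-envelope sketch (the layer count, context size, and KV dimension below are illustrative, not tied to any specific model):

```kotlin
// Rough footprint: weight file size plus the f16 KV cache.
// KV cache bytes = 2 (K and V) * layers * context length * kvDim * 2 bytes (f16)
fun estimateFootprintMb(fileSizeMb: Long, layers: Int, nCtx: Int, kvDim: Int): Long {
    val kvCacheBytes = 2L * layers * nCtx * kvDim * 2
    return fileSizeMb + kvCacheBytes / (1024 * 1024)
}

// Illustrative numbers: a ~700 MB quantized model with 16 layers,
// a 2048-token context, and KV dimension 512 adds a 64 MB KV cache.
val footprintMb = estimateFootprintMb(700, 16, 2048, 512) // 764
```

Real footprints also include llama.cpp's compute buffers, so treat this as a lower bound when deciding whether a device can load a model.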

    Build Configuration

    // app/build.gradle.kts
    android {
        defaultConfig {
            ndk {
                abiFilters += "arm64-v8a" // 64-bit ARM only
            }
        }
    }
    

    Permissions

    No special permissions required for inference. For model download:

    <uses-permission android:name="android.permission.INTERNET" />
    

    Loading a Model

    class LlamaEngine(private val context: Context) {
        private var model: Long = 0 // Native pointer
        private var ctx: Long = 0   // Native context pointer
    
        suspend fun loadModel(modelPath: String) = withContext(Dispatchers.Default) {
            // Load model with GPU acceleration
            model = LlamaNative.loadModel(
                modelPath = modelPath,
                nGpuLayers = 99,    // Offload all layers to Vulkan
            )
            require(model != 0L) { "Failed to load model" }
    
            // Create inference context
            ctx = LlamaNative.createContext(
                model = model,
                nCtx = 2048,        // Context window
                nThreads = 4,       // CPU threads
                nBatch = 512,       // Batch size
            )
            require(ctx != 0L) { "Failed to create context" }
        }
    
        fun unload() {
            if (ctx != 0L) {
                LlamaNative.freeContext(ctx)
                ctx = 0
            }
            if (model != 0L) {
                LlamaNative.freeModel(model)
                model = 0
            }
        }
    }
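
A failed native load reports little beyond a null pointer, so it is worth sanity-checking the file before the path ever reaches JNI. GGUF files start with the 4-byte ASCII magic "GGUF"; a minimal pre-flight check:

```kotlin
import java.io.File

// GGUF files start with the ASCII magic "GGUF"; reject anything else
// (truncated downloads, HTML error pages saved as .gguf, and so on).
fun looksLikeGguf(file: File): Boolean {
    if (!file.exists() || file.length() < 4) return false
    val magic = ByteArray(4)
    file.inputStream().use { it.read(magic) }
    return magic.decodeToString() == "GGUF"
}
```

Call this before loadModel and surface a clear error to the user instead of a cryptic native failure.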
    

    Vulkan GPU Acceleration

Vulkan is the cross-platform GPU compute API that Android exposes on most modern devices. llama.cpp uses it to accelerate the matrix multiplications that dominate inference. Enable it by setting nGpuLayers to the model's layer count (or any larger value, such as 99, to offload everything).

    Vulkan support depends on the device:

    • Snapdragon 8 Gen 2+: Full Vulkan compute support, best performance
    • Tensor G3/G4: Good Vulkan support
    • Snapdragon 7 Gen 3+: Vulkan supported, moderate acceleration
    • Older/budget devices: May lack Vulkan compute. Falls back to CPU automatically.

    Check Vulkan availability at runtime:

    fun isVulkanAvailable(context: Context): Boolean {
        // hasSystemFeature(name, version) is available from API 24.
        // 0x401000 encodes Vulkan 1.1; use 0x400003 to check for Vulkan 1.0.
        return context.packageManager.hasSystemFeature(
            PackageManager.FEATURE_VULKAN_HARDWARE_VERSION, 0x401000
        )
    }
    

    Generating Text

    class LlamaEngine(private val context: Context) {
        // ... loadModel and unload from above
    
    suspend fun generate(
        prompt: String,
        maxTokens: Int = 256,
        temperature: Float = 0.7f,
        onToken: (String) -> Unit = {}
    ): String = withContext(Dispatchers.Default) {
        val result = StringBuilder()
        val mainHandler = Handler(Looper.getMainLooper())

        LlamaNative.generate(
            context = ctx,
            prompt = prompt,
            maxTokens = maxTokens,
            temperature = temperature,
        ) { token ->
            result.append(token)
            // The native callback is not a suspend context, so withContext
            // cannot be called here; post to the main thread instead.
            mainHandler.post { onToken(token) }
        }

        result.toString()
    }
    }
    

    ViewModel Integration

    class AiViewModel(application: Application) : AndroidViewModel(application) {
        private val engine = LlamaEngine(application)
        private val _response = MutableStateFlow("")
        val response: StateFlow<String> = _response
        private val _isGenerating = MutableStateFlow(false)
        val isGenerating: StateFlow<Boolean> = _isGenerating
    
        fun loadModel(path: String) {
            viewModelScope.launch {
                engine.loadModel(path)
            }
        }
    
        fun generate(prompt: String) {
            viewModelScope.launch {
                _isGenerating.value = true
                _response.value = ""
    
                engine.generate(prompt, maxTokens = 256) { token ->
                    _response.value += token
                }
    
                _isGenerating.value = false
            }
        }
    
        override fun onCleared() {
            engine.unload()
            super.onCleared()
        }
    }
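
One gap in the ViewModel above: cancelling the coroutine does not interrupt a native token loop that is already running. A common pattern is an atomic stop flag that the native side polls between tokens (the JNI wrapper in this guide would need to expose such a hook; that part is assumed):

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Stop flag polled by the native generation loop between tokens.
// The native side checks shouldStop() after emitting each token and
// returns early once it is set.
class GenerationController {
    private val stopRequested = AtomicBoolean(false)

    fun requestStop() = stopRequested.set(true)

    fun shouldStop(): Boolean = stopRequested.get()

    fun reset() = stopRequested.set(false)
}
```

Wire requestStop() to a Stop button, call reset() before each new generation, and have the native loop bail out when shouldStop() returns true.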
    

    Jetpack Compose UI

    @Composable
    fun AiChatScreen(viewModel: AiViewModel = viewModel()) {
        val response by viewModel.response.collectAsState()
        val isGenerating by viewModel.isGenerating.collectAsState()
        var input by remember { mutableStateOf("") }
    
        Column(modifier = Modifier.fillMaxSize().padding(16.dp)) {
            // Response area
            Text(
                text = response,
                modifier = Modifier.weight(1f).verticalScroll(rememberScrollState())
            )
    
            // Input area
            Row(modifier = Modifier.fillMaxWidth()) {
                TextField(
                    value = input,
                    onValueChange = { input = it },
                    modifier = Modifier.weight(1f),
                    enabled = !isGenerating
                )
                Button(
                    onClick = {
                        viewModel.generate(input)
                        input = ""
                    },
                    enabled = !isGenerating
                ) {
                    Text("Send")
                }
            }
        }
    }
    

    Memory Management

    Android's memory management is aggressive. The system kills background processes to free RAM and can terminate your app under memory pressure.

    Available Memory Check

    fun getAvailableMemoryMb(): Long {
        val memInfo = ActivityManager.MemoryInfo()
        val activityManager = getSystemService(ACTIVITY_SERVICE) as ActivityManager
        activityManager.getMemoryInfo(memInfo)
        return memInfo.availMem / (1024 * 1024)
    }
    
    fun canLoadModel(modelSizeMb: Long): Boolean {
        val available = getAvailableMemoryMb()
        // Reserve 500MB for app and OS
        return available > modelSizeMb + 500
    }
    

    Lifecycle Management

    class AiService : DefaultLifecycleObserver {
        private var engine: LlamaEngine? = null

        override fun onResume(owner: LifecycleOwner) {
            // Consider loading the model if the AI screen is active
        }

        override fun onPause(owner: LifecycleOwner) {
            // Unload the model to free memory
            engine?.unload()
        }
    }
    

    onTrimMemory Handler

    override fun onTrimMemory(level: Int) {
        super.onTrimMemory(level)
        if (level >= ComponentCallbacks2.TRIM_MEMORY_RUNNING_LOW) {
            engine?.unload()
        }
    }
    

    Model Delivery

    Asset Delivery (Play Feature Delivery)

For models larger than Play's base-module size limit (historically 150MB), use Play Asset Delivery:

    // app/build.gradle.kts
    android {
        assetPacks += listOf(":model_pack")
    }
    
    // model_pack/build.gradle.kts
    plugins {
        id("com.android.asset-pack")
    }
    assetPack {
        packName.set("model_pack")
        dynamicDelivery {
            deliveryType.set("install-time") // or "fast-follow" or "on-demand"
        }
    }
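
Install-time asset packs are read through the normal AssetManager, but llama.cpp needs a real filesystem path, so copy the model into app storage on first launch. A sketch with the stream source injected so it runs off-device (in an app, pass context.assets::open):

```kotlin
import java.io.File
import java.io.InputStream

// Copy the model out of the asset pack into app storage once, then reuse it.
// `open` abstracts context.assets::open so the logic is testable off-device.
fun ensureModelOnDisk(dir: File, name: String, open: (String) -> InputStream): File {
    val out = File(dir, name)
    if (!out.exists() || out.length() == 0L) {
        open(name).use { input ->
            out.outputStream().use { output -> input.copyTo(output) }
        }
    }
    return out
}
```

The copy doubles the model's disk footprint, so consider deleting the asset pack copy is not an option; for very large models, post-install download avoids the duplication entirely.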
    

    Post-Install Download

    Download the model after installation:

    suspend fun downloadModel(url: String, destination: File): Boolean {
        return withContext(Dispatchers.IO) {
            val client = OkHttpClient()
            val request = Request.Builder().url(url).build()

            // use {} closes the response even if copying fails partway
            client.newCall(request).execute().use { response ->
                if (!response.isSuccessful) return@withContext false
                val body = response.body ?: return@withContext false

                val totalBytes = body.contentLength() // -1 if unknown
                destination.outputStream().use { output ->
                    body.byteStream().use { input ->
                        val buffer = ByteArray(8192)
                        var bytesRead = 0L
                        var read: Int

                        while (input.read(buffer).also { read = it } != -1) {
                            output.write(buffer, 0, read)
                            bytesRead += read
                            if (totalBytes > 0) {
                                val progress = bytesRead.toFloat() / totalBytes
                                // Update progress UI
                            }
                        }
                    }
                }
                true
            }
        }
    }
    

    Use WorkManager for background downloads that survive app restarts.
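
Before trusting a downloaded model, verify its SHA-256 against a digest you publish alongside the file; this catches truncated or corrupted downloads. Plain JVM code, no Android dependencies:

```kotlin
import java.io.File
import java.security.MessageDigest

// Stream the file through SHA-256 so large models never need to fit in memory.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(8192)
        var read: Int
        while (input.read(buffer).also { read = it } != -1) {
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

fun verifyModel(file: File, expectedSha256: String): Boolean =
    sha256Hex(file).equals(expectedSha256, ignoreCase = true)
```

Delete the file and retry the download when verification fails rather than passing a corrupt path to the native loader.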

    Testing Across Devices

    The Android device spectrum is wide. Test on:

    1. Flagship (SD 8 Gen 3, 12GB): Verify best-case performance
    2. Mid-range (SD 7 Gen 3, 6-8GB): Verify 1B model runs smoothly
    3. Budget (SD 6 Gen 3, 4GB): Verify graceful fallback if model cannot load
    4. Older flagship (SD 8 Gen 1, 8GB): Verify 2-year-old flagships work

    Use Firebase Test Lab or BrowserStack for device coverage testing without owning every device.

    Production Checklist

    1. Model loads and generates correctly on minimum-spec device
    2. Vulkan acceleration is detected and used when available
    3. CPU fallback works when Vulkan is unavailable
    4. Memory check prevents loading on low-RAM devices
    5. Model unloads on onPause/onTrimMemory
    6. Download progress shows correctly for post-install delivery
    7. Model integrity verified after download (SHA256)
    8. Generation can be cancelled by the user
    9. App functions normally when model is not available

    The GGUF model is what determines quality. A model fine-tuned on your domain data (via Ertas or similar) will produce responses specific to your app's purpose. The llama.cpp integration is identical regardless of which model you use.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.