
    LLM Benchmarks on Android: Snapdragon, Tensor, and Exynos Compared

    Real benchmark data for running LLMs on Android via llama.cpp. Token speeds across Snapdragon 8 Gen 2/3, Tensor G3/G4, Exynos 2400, and mid-range chipsets with practical deployment guidance.

    Ertas Team

    Android's chipset diversity is both a challenge and an opportunity for on-device AI. Unlike iOS where you target a handful of A-series chips, Android spans Qualcomm Snapdragon, Google Tensor, Samsung Exynos, and MediaTek Dimensity across hundreds of device models.

    The good news: flagship and recent mid-range Android devices run 1-3B parameter models at usable speeds. The fragmentation is manageable if you target the right tiers.

    The Chipset Landscape

    Flagship (2023-2026)

    Chipset             | Example Devices        | RAM     | GPU
    Snapdragon 8 Gen 3  | Galaxy S24, OnePlus 12 | 8-12GB  | Adreno 750
    Snapdragon 8 Elite  | Galaxy S25, OnePlus 13 | 12-16GB | Adreno 830
    Tensor G3           | Pixel 8, 8 Pro         | 12GB    | Mali-G715
    Tensor G4           | Pixel 9, 9 Pro         | 12-16GB | Mali-G715
    Exynos 2400         | Galaxy S24 (intl)      | 8-12GB  | Xclipse 940
    Dimensity 9300      | Various flagships      | 8-16GB  | Immortalis-G720

    Mid-Range (2024-2026)

    Chipset              | Example Devices  | RAM    | GPU
    Snapdragon 7+ Gen 3  | Mid-range 2024+  | 8-12GB | Adreno 732
    Snapdragon 7 Gen 3   | Mid-range 2024+  | 6-8GB  | Adreno 720
    Dimensity 8300       | Mid-range 2024+  | 8-12GB | Mali-G615
    Tensor G2            | Pixel 7 series   | 8GB    | Mali-G710

    Budget (2024-2026)

    Chipset             | Example Devices | RAM   | GPU
    Snapdragon 6 Gen 3  | Budget 2024+    | 4-6GB | Adreno 710
    Dimensity 7300      | Budget 2024+    | 6-8GB | Mali-G615
    Helio G99           | Budget devices  | 4-6GB | Mali-G57

    Benchmark Results

    All benchmarks use llama.cpp with CPU inference (multi-threaded) and Vulkan GPU acceleration where available. GGUF Q4_K_M quantization, 2048 context length.

    1B Parameter Models (~600MB GGUF Q4)

    Chipset         | CPU (tok/s) | GPU/Vulkan (tok/s) | Memory
    SD 8 Elite      | 35-45       | 45-55              | ~800MB
    SD 8 Gen 3      | 30-40       | 40-50              | ~800MB
    SD 8 Gen 2      | 25-35       | 35-45              | ~800MB
    Tensor G4       | 28-35       | 35-42              | ~800MB
    Tensor G3      | 25-32       | 30-38              | ~800MB
    Exynos 2400     | 25-35       | 32-42              | ~800MB
    SD 7+ Gen 3     | 22-28       | 28-35              | ~800MB
    SD 7 Gen 3      | 18-25       | 22-30              | ~800MB
    Dimensity 8300  | 20-28       | 25-33              | ~800MB
    SD 6 Gen 3      | 12-18       | 15-22              | ~800MB

    Every flagship and mid-range chipset from the last 2-3 years runs 1B models at 20+ tokens per second with Vulkan, and most manage it on CPU alone. Even the Snapdragon 6 Gen 3 budget chip delivers usable performance.

    3B Parameter Models (~1.7GB GGUF Q4)

    Chipset      | CPU (tok/s) | GPU/Vulkan (tok/s) | Memory
    SD 8 Elite   | 18-25       | 22-30              | ~2.2GB
    SD 8 Gen 3   | 15-22       | 20-28              | ~2.2GB
    SD 8 Gen 2   | 12-18       | 16-22              | ~2.2GB
    Tensor G4    | 14-20       | 18-24              | ~2.2GB
    Tensor G3    | 12-16       | 15-20              | ~2.2GB
    Exynos 2400  | 12-18       | 16-22              | ~2.2GB
    SD 7+ Gen 3  | 10-14       | 13-18              | ~2.2GB
    SD 7 Gen 3   | 7-11        | 9-14               | ~2.2GB
    SD 6 Gen 3   | 4-7         | 5-9                | ~2.2GB

    3B models run well on flagships (15+ tok/s with GPU). Upper mid-range devices (SD 7+ Gen 3, Dimensity 8300) are usable. Lower mid-range and budget devices struggle to reach the 10 tok/s threshold for comfortable chat.

    Vulkan GPU Acceleration

    Vulkan GPU acceleration is the key to fast on-device inference on Android. The improvement over CPU-only inference is typically 20-40%:

    • Snapdragon 8 Gen 3: +30-35% with Vulkan
    • Tensor G4: +25-30%
    • Exynos 2400: +20-30%
    • Mid-range Snapdragon 7: +20-25%

    llama.cpp exposes Vulkan offload through the n_gpu_layers parameter (the -ngl flag in its CLI tools). Setting it to at least the model's layer count offloads all transformer layers to the GPU.
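
    What this looks like from an app depends on your binding, since llama.cpp ships no official Android API. A minimal Kotlin sketch, assuming a hypothetical JNI wrapper (LlamaBridge, backed by a Vulkan-enabled libllama_android.so); only n_gpu_layers itself comes from llama.cpp:

    // Hypothetical JNI surface over a Vulkan-enabled llama.cpp build;
    // the library and function names are placeholders for your binding.
    object LlamaBridge {
        init { System.loadLibrary("llama_android") } // assumed native library
        external fun loadModel(path: String, nGpuLayers: Int): Long
        external fun generate(model: Long, prompt: String, maxTokens: Int): String
    }

    // A 1-3B model has a few dozen transformer layers, so any generous
    // value makes llama.cpp offload every layer to the Vulkan backend.
    fun loadWithFullOffload(modelPath: String): Long =
        LlamaBridge.loadModel(modelPath, nGpuLayers = 999)

    Keep the CPU path working regardless: Vulkan driver quality varies across Android GPUs, and some devices will need the fallback.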

    The Fragmentation Strategy

    Android fragmentation is manageable with a tiered approach:

    Tier 1: 1B Model (4GB+ RAM)

    Covers 85%+ of active Android devices: effectively all smartphones from the last 3-4 years and most budget devices from the last 2 years.

    • Model size: ~600MB (Q4_K_M)
    • RAM requirement: 800MB during inference
    • Speed: 12-55 tok/s depending on chipset
    • Suitable for: classification, autocomplete, smart suggestions, short responses

    Tier 2: 3B Model (8GB+ RAM)

    Covers flagship and upper mid-range devices from the last 2-3 years. Roughly 40-50% of active Android devices in developed markets, growing each year.

    • Model size: ~1.7GB (Q4_K_M)
    • RAM requirement: 2.2GB during inference
    • Speed: 10-30 tok/s on supported devices
    • Suitable for: chat, summarization, content generation, complex tasks

    Runtime Detection

    Detect available RAM at runtime to select the appropriate model tier:

    import android.app.ActivityManager
    import android.content.Context

    enum class ModelTier { NONE, ONE_B, THREE_B }

    fun Context.selectModelTier(): ModelTier {
        val activityManager = getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
        val memInfo = ActivityManager.MemoryInfo()
        activityManager.getMemoryInfo(memInfo)

        // totalMem reports less than the marketed capacity (an "8GB" device
        // typically reports ~7.5GB), so use fractional GB and lowered thresholds.
        val totalRamGb = memInfo.totalMem / (1024.0 * 1024.0 * 1024.0)

        return when {
            totalRamGb >= 7.0 -> ModelTier.THREE_B
            totalRamGb >= 3.5 -> ModelTier.ONE_B
            else -> ModelTier.NONE // device too constrained for local inference
        }
    }
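
    A possible call site, with hypothetical asset names, choosing which GGUF to fetch at startup:

    import android.app.Application

    class App : Application() {
        override fun onCreate() {
            super.onCreate()
            val modelAsset = when (selectModelTier()) {
                ModelTier.THREE_B -> "model-3b-q4_k_m.gguf" // placeholder names
                ModelTier.ONE_B -> "model-1b-q4_k_m.gguf"
                ModelTier.NONE -> null // fall back to a server-side model
            }
            // download or load modelAsset before first inference
        }
    }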
    

    Thermal and Battery Impact

    Thermal Throttling

    Android devices are more prone to thermal throttling than iPhones during sustained inference. Throttling behavior varies by manufacturer (a monitoring sketch follows this list):

    • Samsung: Aggressive throttling, 20-40% speed reduction after 3-5 minutes of sustained load
    • Pixel: Moderate throttling, 15-25% reduction after 5-7 minutes
    • OnePlus/gaming phones: More lenient, 10-20% reduction
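
    Rather than guessing at throttling, you can watch the platform thermal status (available since API 29). A minimal sketch, where pauseGeneration() is a hypothetical hook into your inference loop:

    import android.content.Context
    import android.os.Build
    import android.os.PowerManager

    // Back off when the OS reports thermal pressure instead of generating
    // through a throttled, overheating SoC.
    fun watchThermals(context: Context, pauseGeneration: () -> Unit) {
        if (Build.VERSION.SDK_INT < Build.VERSION_CODES.Q) return
        val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
        pm.addThermalStatusListener { status ->
            // SEVERE and above means the device is already throttling hard
            if (status >= PowerManager.THERMAL_STATUS_SEVERE) pauseGeneration()
        }
    }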

    Battery Consumption

    Running inference consumes roughly:

    • 1B model: 2-3W during generation
    • 3B model: 3-5W during generation

    For context, a typical phone battery holds 4,000-5,500mAh, roughly 15-21Wh. At 3-5W, continuous 3B generation drains about 15-30% of the battery per hour from inference alone, and closer to 1% per minute once screen and system overhead are included. For typical usage (a few short interactions per hour), the battery impact is negligible.

    Optimization

    • Use a CPU thread count that matches the device's performance cores (typically 4); a detection sketch follows this list
    • Unload the model when not in use to eliminate idle power draw
    • For background tasks (classification, tagging), batch processing is more power-efficient than individual calls
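
    The SDK does not expose the performance-core count directly. A heuristic sketch that reads each core's maximum frequency from sysfs and counts cores within ~20% of the fastest cluster (the paths are standard on Android, but reads can fail on some devices, hence the fallback):

    import java.io.File

    // Count "big" cores by max frequency; fall back to 4 threads if sysfs
    // is unreadable on this device.
    fun performanceCoreCount(): Int = runCatching {
        val maxFreqs = (0 until Runtime.getRuntime().availableProcessors())
            .mapNotNull { cpu ->
                File("/sys/devices/system/cpu/cpu$cpu/cpufreq/cpuinfo_max_freq")
                    .readText().trim().toLongOrNull()
            }
        val fastest = maxFreqs.maxOrNull() ?: return@runCatching 4
        maxFreqs.count { it * 10 >= fastest * 8 } // within ~20% of the fastest
    }.getOrDefault(4)

    Pass the result as llama.cpp's thread count; using more threads than big cores usually hurts, because the slower efficiency cores stall each generation step.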

    What This Means for Developers

    1. 1B models are universally viable. Target 1B for broad reach. Fine-tune for your domain to maximize quality at this size.

    2. 3B models are flagship-ready. If your user base skews toward newer devices (common in paid apps), 3B delivers meaningfully better generation quality.

    3. Vulkan matters. Always enable GPU acceleration. The 20-40% speed improvement is free performance.

    4. Detect and adapt. Use runtime RAM detection to offer the right model tier. Do not force a 3B model on a 4GB device.

    5. Fine-tune, do not just shrink. A fine-tuned 1B model on your domain data outperforms a general-purpose 3B on your specific tasks. Platforms like Ertas make this accessible: upload data, train with LoRA, export GGUF, deploy.

    The Android ecosystem has the hardware. The inference engine (llama.cpp) handles the chipset diversity. The missing piece is the right model for your use case.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
