
    LLM Benchmarks on Android: Snapdragon, Tensor, and Exynos Compared

    Real benchmark data for running LLMs on Android via llama.cpp. Token speeds across Snapdragon 8 Gen 2/3, Tensor G3/G4, Exynos 2400, and mid-range chipsets with practical deployment guidance.

    Ertas Team

    Android's chipset diversity is both a challenge and an opportunity for on-device AI. Unlike iOS where you target a handful of A-series chips, Android spans Qualcomm Snapdragon, Google Tensor, Samsung Exynos, and MediaTek Dimensity across hundreds of device models.

    The good news: flagship and recent mid-range Android devices run 1-3B parameter models at usable speeds. The fragmentation is manageable if you target the right tiers.

    The Chipset Landscape

    Flagship (2023-2026)

    Chipset             | Example Devices        | RAM     | GPU
    Snapdragon 8 Gen 3  | Galaxy S24, OnePlus 12 | 8-12GB  | Adreno 750
    Snapdragon 8 Elite  | Galaxy S25, OnePlus 13 | 12-16GB | Adreno 830
    Tensor G3           | Pixel 8, 8 Pro         | 12GB    | Mali-G715
    Tensor G4           | Pixel 9, 9 Pro         | 12-16GB | Mali-G715
    Exynos 2400         | Galaxy S24 (intl)      | 8-12GB  | Xclipse 940
    Dimensity 9300      | Various flagships      | 8-16GB  | Immortalis-G720

    Mid-Range (2024-2026)

    Chipset              | Example Devices  | RAM    | GPU
    Snapdragon 7+ Gen 3  | Mid-range 2024+  | 8-12GB | Adreno 732
    Snapdragon 7 Gen 3   | Mid-range 2024+  | 6-8GB  | Adreno 720
    Dimensity 8300       | Mid-range 2024+  | 8-12GB | Mali-G615
    Tensor G2            | Pixel 7 series   | 8GB    | Mali-G710

    Budget (2024-2026)

    Chipset             | Example Devices | RAM   | GPU
    Snapdragon 6 Gen 3  | Budget 2024+    | 4-6GB | Adreno 710
    Dimensity 7300      | Budget 2024+    | 6-8GB | Mali-G615
    Helio G99           | Budget devices  | 4-6GB | Mali-G57

    Benchmark Results

    All benchmarks use llama.cpp with CPU inference (multi-threaded) and Vulkan GPU acceleration where available. GGUF Q4_K_M quantization, 2048 context length.

    1B Parameter Models (~600MB GGUF Q4)

    Chipset         | CPU (tok/s) | GPU/Vulkan (tok/s) | Memory
    SD 8 Elite      | 35-45       | 45-55              | ~800MB
    SD 8 Gen 3      | 30-40       | 40-50              | ~800MB
    SD 8 Gen 2      | 25-35       | 35-45              | ~800MB
    Tensor G4       | 28-35       | 35-42              | ~800MB
    Tensor G3      | 25-32       | 30-38              | ~800MB
    Exynos 2400     | 25-35       | 32-42              | ~800MB
    SD 7+ Gen 3     | 22-28       | 28-35              | ~800MB
    SD 7 Gen 3      | 18-25       | 22-30              | ~800MB
    Dimensity 8300  | 20-28       | 25-33              | ~800MB
    SD 6 Gen 3      | 12-18       | 15-22              | ~800MB

    Every flagship and mid-range chipset from the last 2-3 years runs 1B models at 20+ tokens per second with Vulkan, and most manage it on CPU alone. Even the Snapdragon 6 Gen 3 budget chip delivers usable performance.

    3B Parameter Models (~1.7GB GGUF Q4)

    Chipset      | CPU (tok/s) | GPU/Vulkan (tok/s) | Memory
    SD 8 Elite   | 18-25       | 22-30              | ~2.2GB
    SD 8 Gen 3   | 15-22       | 20-28              | ~2.2GB
    SD 8 Gen 2   | 12-18       | 16-22              | ~2.2GB
    Tensor G4    | 14-20       | 18-24              | ~2.2GB
    Tensor G3    | 12-16       | 15-20              | ~2.2GB
    Exynos 2400  | 12-18       | 16-22              | ~2.2GB
    SD 7+ Gen 3  | 10-14       | 13-18              | ~2.2GB
    SD 7 Gen 3   | 7-11        | 9-14               | ~2.2GB
    SD 6 Gen 3   | 4-7         | 5-9                | ~2.2GB

    3B models run well on flagships (15+ tok/s with GPU). Upper mid-range devices (SD 7+ Gen 3, Dimensity 8300) are usable. Lower mid-range and budget devices struggle to reach the 10 tok/s threshold for comfortable chat.

    Vulkan GPU Acceleration

    Vulkan GPU acceleration is the key to fast on-device inference on Android. The improvement over CPU-only inference is typically 20-40%:

    • Snapdragon 8 Gen 3: +30-35% with Vulkan
    • Tensor G4: +25-30%
    • Exynos 2400: +20-30%
    • Mid-range Snapdragon 7: +20-25%

    llama.cpp exposes Vulkan offload through the n_gpu_layers parameter (the -ngl flag in its CLI tools). Setting it to at least the model's layer count offloads all transformer layers to the GPU.
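
    What this looks like from an app depends on your binding, since llama.cpp ships no official Android API. A minimal Kotlin sketch, assuming a hypothetical JNI wrapper (LlamaBridge, backed by a Vulkan-enabled libllama_android.so); only n_gpu_layers itself comes from llama.cpp:

    // Hypothetical JNI surface over a Vulkan-enabled llama.cpp build;
    // the library and function names are placeholders for your binding.
    object LlamaBridge {
        init { System.loadLibrary("llama_android") } // assumed native library
        external fun loadModel(path: String, nGpuLayers: Int): Long
        external fun generate(model: Long, prompt: String, maxTokens: Int): String
    }

    // A 1-3B model has a few dozen transformer layers, so any generous
    // value makes llama.cpp offload every layer to the Vulkan backend.
    fun loadWithFullOffload(modelPath: String): Long =
        LlamaBridge.loadModel(modelPath, nGpuLayers = 999)

    Keep the CPU path working regardless: Vulkan driver quality varies across Android GPUs, and some devices will need the fallback.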

    The Fragmentation Strategy

    Android fragmentation is manageable with a tiered approach:

    Tier 1: 1B Model (4GB+ RAM)

    Covers 85%+ of active Android devices: effectively all smartphones from the last 3-4 years and most budget devices from the last 2 years.

    • Model size: ~600MB (Q4_K_M)
    • RAM requirement: 800MB during inference
    • Speed: 12-55 tok/s depending on chipset
    • Suitable for: classification, autocomplete, smart suggestions, short responses

    Tier 2: 3B Model (8GB+ RAM)

    Covers flagship and upper mid-range devices from the last 2-3 years. Roughly 40-50% of active Android devices in developed markets, growing each year.

    • Model size: ~1.7GB (Q4_K_M)
    • RAM requirement: 2.2GB during inference
    • Speed: 10-30 tok/s on supported devices
    • Suitable for: chat, summarization, content generation, complex tasks

    Runtime Detection

    Detect available RAM at runtime to select the appropriate model tier:

    import android.app.ActivityManager
    import android.content.Context

    enum class ModelTier { NONE, ONE_B, THREE_B }

    fun Context.selectModelTier(): ModelTier {
        val activityManager = getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
        val memInfo = ActivityManager.MemoryInfo()
        activityManager.getMemoryInfo(memInfo)

        // totalMem reports less than the marketed capacity (an "8GB" device
        // typically reports ~7.5GB), so use fractional GB and lowered thresholds.
        val totalRamGb = memInfo.totalMem / (1024.0 * 1024.0 * 1024.0)

        return when {
            totalRamGb >= 7.0 -> ModelTier.THREE_B
            totalRamGb >= 3.5 -> ModelTier.ONE_B
            else -> ModelTier.NONE // device too constrained for local inference
        }
    }
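
    A possible call site, with hypothetical asset names, choosing which GGUF to fetch at startup:

    import android.app.Application

    class App : Application() {
        override fun onCreate() {
            super.onCreate()
            val modelAsset = when (selectModelTier()) {
                ModelTier.THREE_B -> "model-3b-q4_k_m.gguf" // placeholder names
                ModelTier.ONE_B -> "model-1b-q4_k_m.gguf"
                ModelTier.NONE -> null // fall back to a server-side model
            }
            // download or load modelAsset before first inference
        }
    }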
    

    Thermal and Battery Impact

    Thermal Throttling

    Android devices are more prone to thermal throttling than iPhones during sustained inference. Throttling behavior varies by manufacturer (a monitoring sketch follows this list):

    • Samsung: Aggressive throttling, 20-40% speed reduction after 3-5 minutes of sustained load
    • Pixel: Moderate throttling, 15-25% reduction after 5-7 minutes
    • OnePlus/gaming phones: More lenient, 10-20% reduction
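
    Rather than guessing at throttling, you can watch the platform thermal status (available since API 29). A minimal sketch, where pauseGeneration() is a hypothetical hook into your inference loop:

    import android.content.Context
    import android.os.Build
    import android.os.PowerManager

    // Back off when the OS reports thermal pressure instead of generating
    // through a throttled, overheating SoC.
    fun watchThermals(context: Context, pauseGeneration: () -> Unit) {
        if (Build.VERSION.SDK_INT < Build.VERSION_CODES.Q) return
        val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
        pm.addThermalStatusListener { status ->
            // SEVERE and above means the device is already throttling hard
            if (status >= PowerManager.THERMAL_STATUS_SEVERE) pauseGeneration()
        }
    }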

    Battery Consumption

    Running inference consumes roughly:

    • 1B model: 2-3W during generation
    • 3B model: 3-5W during generation

    For context, a typical phone battery holds 4,000-5,500mAh, roughly 15-21Wh. At 3-5W, continuous 3B generation drains about 15-30% of the battery per hour from inference alone, and closer to 1% per minute once screen and system overhead are included. For typical usage (a few short interactions per hour), the battery impact is negligible.

    Optimization

    • Use a CPU thread count that matches the device's performance cores (typically 4); a detection sketch follows this list
    • Unload the model when not in use to eliminate idle power draw
    • For background tasks (classification, tagging), batch processing is more power-efficient than individual calls
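
    The SDK does not expose the performance-core count directly. A heuristic sketch that reads each core's maximum frequency from sysfs and counts cores within ~20% of the fastest cluster (the paths are standard on Android, but reads can fail on some devices, hence the fallback):

    import java.io.File

    // Count "big" cores by max frequency; fall back to 4 threads if sysfs
    // is unreadable on this device.
    fun performanceCoreCount(): Int = runCatching {
        val maxFreqs = (0 until Runtime.getRuntime().availableProcessors())
            .mapNotNull { cpu ->
                File("/sys/devices/system/cpu/cpu$cpu/cpufreq/cpuinfo_max_freq")
                    .readText().trim().toLongOrNull()
            }
        val fastest = maxFreqs.maxOrNull() ?: return@runCatching 4
        maxFreqs.count { it * 10 >= fastest * 8 } // within ~20% of the fastest
    }.getOrDefault(4)

    Pass the result as llama.cpp's thread count; using more threads than big cores usually hurts, because the slower efficiency cores stall each generation step.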

    What This Means for Developers

    1. 1B models are universally viable. Target 1B for broad reach. Fine-tune for your domain to maximize quality at this size.

    2. 3B models are flagship-ready. If your user base skews toward newer devices (common in paid apps), 3B delivers meaningfully better generation quality.

    3. Vulkan matters. Always enable GPU acceleration. The 20-40% speed improvement is free performance.

    4. Detect and adapt. Use runtime RAM detection to offer the right model tier. Do not force a 3B model on a 4GB device.

    5. Fine-tune, do not just shrink. A fine-tuned 1B model on your domain data outperforms a general-purpose 3B on your specific tasks. Platforms like Ertas make this accessible: upload data, train with LoRA, export GGUF, deploy.

    The Android ecosystem has the hardware. The inference engine (llama.cpp) handles the chipset diversity. The missing piece is the right model for your use case.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
