
LLM Benchmarks on Android: Snapdragon, Tensor, and Exynos Compared
Real benchmark data for running LLMs on Android via llama.cpp. Token speeds across Snapdragon 8 Gen 2/3, Tensor G3/G4, Exynos 2400, and mid-range chipsets with practical deployment guidance.
Android's chipset diversity is both a challenge and an opportunity for on-device AI. Unlike iOS where you target a handful of A-series chips, Android spans Qualcomm Snapdragon, Google Tensor, Samsung Exynos, and MediaTek Dimensity across hundreds of device models.
The good news: flagship and recent mid-range Android devices run 1-3B parameter models at usable speeds. The fragmentation is manageable if you target the right tiers.
The Chipset Landscape
Flagship (2023-2026)
| Chipset | Example Devices | RAM | GPU |
|---|---|---|---|
| Snapdragon 8 Gen 3 | Galaxy S24, OnePlus 12 | 8-12GB | Adreno 750 |
| Snapdragon 8 Elite | Galaxy S25, OnePlus 13 | 12-16GB | Adreno 830 |
| Tensor G3 | Pixel 8, 8 Pro | 12GB | Mali-G715 |
| Tensor G4 | Pixel 9, 9 Pro | 12-16GB | Mali-G715 |
| Exynos 2400 | Galaxy S24 (intl) | 8-12GB | Xclipse 940 |
| Dimensity 9300 | Various flagships | 8-16GB | Immortalis-G720 |
Mid-Range (2024-2026)
| Chipset | Example Devices | RAM | GPU |
|---|---|---|---|
| Snapdragon 7+ Gen 3 | Mid-range 2024+ | 8-12GB | Adreno 732 |
| Snapdragon 7 Gen 3 | Mid-range 2024+ | 6-8GB | Adreno 720 |
| Dimensity 8300 | Mid-range 2024+ | 8-12GB | Mali-G615 |
| Tensor G2 | Pixel 7 series | 8GB | Mali-G710 |
Budget (2024-2026)
| Chipset | Example Devices | RAM | GPU |
|---|---|---|---|
| Snapdragon 6 Gen 3 | Budget 2024+ | 4-6GB | Adreno 710 |
| Dimensity 7300 | Budget 2024+ | 6-8GB | Mali-G615 |
| Helio G99 | Budget devices | 4-6GB | Mali-G57 |
Benchmark Results
All benchmarks use llama.cpp with CPU inference (multi-threaded) and Vulkan GPU acceleration where available. GGUF Q4_K_M quantization, 2048 context length.
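The tok/s figures below are decode throughput: generated tokens divided by wall-clock time. As a rough sketch of how such a number is measured in-app (the generate callback is a placeholder for whatever your llama.cpp binding exposes, not a real API):

```kotlin
// Decode throughput = generated tokens / wall-clock seconds.
// `generate` stands in for your llama.cpp binding's decode loop.
fun measureTokensPerSecond(nTokens: Int = 128, generate: (Int) -> Unit): Double {
    val start = System.nanoTime()
    generate(nTokens) // decode nTokens tokens
    val elapsedSec = (System.nanoTime() - start) / 1e9
    return nTokens / elapsedSec
}
```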
1B Parameter Models (~600MB GGUF Q4)
| Chipset | CPU (tok/s) | GPU/Vulkan (tok/s) | Memory |
|---|---|---|---|
| SD 8 Elite | 35-45 | 45-55 | ~800MB |
| SD 8 Gen 3 | 30-40 | 40-50 | ~800MB |
| SD 8 Gen 2 | 25-35 | 35-45 | ~800MB |
| Tensor G4 | 28-35 | 35-42 | ~800MB |
| Tensor G3 | 25-32 | 30-38 | ~800MB |
| Exynos 2400 | 25-35 | 32-42 | ~800MB |
| SD 7+ Gen 3 | 22-28 | 28-35 | ~800MB |
| SD 7 Gen 3 | 18-25 | 22-30 | ~800MB |
| Dimensity 8300 | 20-28 | 25-33 | ~800MB |
| SD 6 Gen 3 | 12-18 | 15-22 | ~800MB |
Every flagship and mid-range chipset from the last 2-3 years runs 1B models at 20+ tokens per second. Even the Snapdragon 6 Gen 3 budget chip delivers usable performance.
3B Parameter Models (~1.7GB GGUF Q4)
| Chipset | CPU (tok/s) | GPU/Vulkan (tok/s) | Memory |
|---|---|---|---|
| SD 8 Elite | 18-25 | 22-30 | ~2.2GB |
| SD 8 Gen 3 | 15-22 | 20-28 | ~2.2GB |
| SD 8 Gen 2 | 12-18 | 16-22 | ~2.2GB |
| Tensor G4 | 14-20 | 18-24 | ~2.2GB |
| Tensor G3 | 12-16 | 15-20 | ~2.2GB |
| Exynos 2400 | 12-18 | 16-22 | ~2.2GB |
| SD 7+ Gen 3 | 10-14 | 13-18 | ~2.2GB |
| SD 7 Gen 3 | 7-11 | 9-14 | ~2.2GB |
| SD 6 Gen 3 | 4-7 | 5-9 | ~2.2GB |
3B models run well on flagships (15+ tok/s with GPU). Upper mid-range devices (SD 7+ Gen 3, Dimensity 8300) are usable. Lower mid-range and budget devices struggle to reach the 10 tok/s threshold for comfortable chat.
Vulkan GPU Acceleration
Vulkan GPU acceleration is the key to fast on-device inference on Android. The throughput gain over CPU-only inference is typically 20-40%:
- Snapdragon 8 Gen 3: +30-35% with Vulkan
- Tensor G4: +25-30%
- Exynos 2400: +20-30%
- Mid-range Snapdragon 7: +20-25%
llama.cpp must be built with its Vulkan backend; the n_gpu_layers parameter then controls how many layers are offloaded. Setting it to the model's full layer count (or any larger value) moves all layer computation to the GPU.
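As an illustration, suppose you maintain your own JNI bindings. The LlamaBridge object and its loadModel function below are hypothetical names, not llama.cpp exports; only the nGpuLayers argument maps onto a real concept, the n_gpu_layers field of llama_model_params in the C API:

```kotlin
// Hypothetical JNI wrapper around llama.cpp; only n_gpu_layers itself is a
// real llama.cpp concept (llama_model_params.n_gpu_layers in the C API).
object LlamaBridge {
    init { System.loadLibrary("llama-android") } // your own native library
    external fun loadModel(path: String, nGpuLayers: Int): Long
}

// llama.cpp treats an oversized value (e.g. 999) as "offload all layers".
fun loadOnGpu(modelPath: String): Long = LlamaBridge.loadModel(modelPath, 999)
```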
The Fragmentation Strategy
Android fragmentation is manageable with a tiered approach:
Tier 1: 1B Model (4GB+ RAM)
Covers 85%+ of active Android devices: virtually all phones from the last 3-4 years and most budget devices from the last 2 years.
- Model size: ~600MB (Q4_K_M)
- RAM requirement: ~800MB during inference
- Speed: 12-55 tok/s depending on chipset
- Suitable for: classification, autocomplete, smart suggestions, short responses
Tier 2: 3B Model (8GB+ RAM)
Covers flagship and upper mid-range devices from the last 2-3 years. Roughly 40-50% of active Android devices in developed markets, growing each year.
- Model size: ~1.7GB (Q4_K_M)
- RAM requirement: ~2.2GB during inference
- Speed: 10-30 tok/s on supported devices
- Suitable for: chat, summarization, content generation, complex tasks
Runtime Detection
Detect available RAM and chipset at runtime to select the appropriate model:
```kotlin
import android.app.ActivityManager
import android.content.Context

enum class ModelTier { THREE_B, ONE_B, NONE }

fun Context.selectModelTier(): ModelTier {
    val activityManager = getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memInfo)
    // totalMem reports below the nominal spec (an "8GB" phone shows ~7.5GiB),
    // so divide as Double and compare against slightly lower thresholds.
    val totalRamGb = memInfo.totalMem / (1024.0 * 1024 * 1024)
    return when {
        totalRamGb >= 7.0 -> ModelTier.THREE_B
        totalRamGb >= 3.5 -> ModelTier.ONE_B
        else -> ModelTier.NONE // device too constrained for on-device inference
    }
}
```
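Wiring the tier to a bundled model might look like this; the GGUF filenames are placeholders for whatever models you ship:

```kotlin
// Map the detected tier to a bundled model file (filenames are placeholders).
fun Context.pickModelPath(): String? = when (selectModelTier()) {
    ModelTier.THREE_B -> "models/model-3b-q4_k_m.gguf"
    ModelTier.ONE_B -> "models/model-1b-q4_k_m.gguf"
    ModelTier.NONE -> null // disable the feature or fall back to a cloud API
}
```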
Thermal and Battery Impact
Thermal Throttling
Android devices are more prone to thermal throttling than iPhones during sustained inference. The throttling behavior varies by manufacturer:
- Samsung: Aggressive throttling, 20-40% speed reduction after 3-5 minutes of sustained load
- Pixel: Moderate throttling, 15-25% reduction after 5-7 minutes
- OnePlus/gaming phones: More lenient, 10-20% reduction
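Rather than hard-coding manufacturer behavior, an app can react to throttling directly: Android 10+ reports thermal status through PowerManager. A minimal sketch, where onThrottle is your own backoff hook (for example, pausing generation):

```kotlin
import android.os.Build
import android.os.PowerManager

// Invoke onThrottle when the OS reports severe thermal pressure (API 29+).
fun registerThermalBackoff(pm: PowerManager, onThrottle: () -> Unit) {
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q) {
        pm.addThermalStatusListener { status ->
            if (status >= PowerManager.THERMAL_STATUS_SEVERE) onThrottle()
        }
    }
}
```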
Battery Consumption
Running inference consumes roughly:
- 1B model: 2-3W during generation
- 3B model: 3-5W during generation
For context, a typical phone battery holds 4,000-5,500mAh, roughly 15-21Wh at a nominal 3.85V. At 3-5W, continuous 3B generation therefore drains on the order of 15-30% of the battery per hour, or roughly 0.3-0.5% per minute. For typical usage (a few short interactions per hour), the battery impact is negligible.
Optimization
- Use a CPU thread count matching the device's performance cores (typically 4); see the detection sketch after this list
- Unload the model when not in use to eliminate idle power draw
- For background tasks (classification, tagging), batch processing is more power-efficient than individual calls
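One heuristic for the performance-core count is to read each core's maximum frequency from sysfs and count the fast cluster. This is a sketch under the assumption that the cpufreq entries are readable (true on most devices, but not guaranteed):

```kotlin
import java.io.File

// Cores within ~20% of the fastest max frequency are treated as performance
// cores; falls back to 4 (the typical count) if sysfs is unreadable.
fun performanceCoreCount(): Int {
    val maxFreqs = (0 until Runtime.getRuntime().availableProcessors())
        .mapNotNull { i ->
            runCatching {
                File("/sys/devices/system/cpu/cpu$i/cpufreq/cpuinfo_max_freq")
                    .readText().trim().toLong()
            }.getOrNull()
        }
    val top = maxFreqs.maxOrNull() ?: return 4
    return maxFreqs.count { it >= top * 8 / 10 }
}
```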
What This Means for Developers
- 1B models are universally viable. Target 1B for broad reach, and fine-tune for your domain to maximize quality at this size.
- 3B models are flagship-ready. If your user base skews toward newer devices (common in paid apps), 3B delivers meaningfully better generation quality.
- Vulkan matters. Always enable GPU acceleration; the 20-40% speed improvement is free performance.
- Detect and adapt. Use runtime RAM detection to offer the right model tier. Do not force a 3B model on a 4GB device.
- Fine-tune, do not just shrink. A 1B model fine-tuned on your domain data outperforms a general-purpose 3B on your specific tasks. Platforms like Ertas make this accessible: upload data, train with LoRA, export GGUF, deploy.
The Android ecosystem has the hardware. The inference engine (llama.cpp) handles the chipset diversity. The missing piece is the right model for your use case.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.