
Gemma 3 for Mobile: Fine-Tuning and On-Device Deployment
How to use Google's Gemma 3 models for on-device mobile AI. Model selection, fine-tuning with LoRA, GGUF export, and deployment via llama.cpp on iOS and Android.
Google's Gemma 3 represents a significant step forward from Gemma 2. The 1B model is practical for mobile classification tasks, and the 4B model offers reasoning capability that competes with larger models from other families.
For mobile developers already in the Google ecosystem (Android, Firebase, Google Cloud), Gemma is a natural choice with good tooling support.
Gemma 3 Model Lineup for Mobile
| Model | Parameters | GGUF Q4 Size | RAM Needed | Mobile Viability |
|---|---|---|---|---|
| Gemma 3 1B | 1B | ~600MB | ~800MB | Excellent (4GB+ devices) |
| Gemma 3 4B | 4B | ~2.3GB | ~3GB | Good (8GB+ devices) |
| Gemma 3 12B | 12B | ~7GB | ~9GB | Not viable for mobile |
| Gemma 3 27B | 27B | ~15GB | ~18GB | Not viable for mobile |
The 1B and 4B models are the mobile-relevant sizes. The 4B is slightly larger than the typical 3B target but runs within budget on 8GB devices.
Gemma 3 vs Gemma 2
| Improvement | Gemma 2 | Gemma 3 |
|---|---|---|
| Instruction following (IFEval) | 51.2 (2B) | 54.2 (1B) |
| General knowledge (MMLU) | 51.3 (2B) | 46.8 (1B), 67.2 (4B) |
| Multilingual support | 20 languages | 35+ languages |
| Context window (1B) | 8K | 32K |
| Context window (4B) | 8K | 128K |
Gemma 3's 4B model is a standout. It approaches the capability of 8B-class models such as Llama 3.1 8B (which are not mobile-viable) while fitting on flagship mobile devices.
When Gemma 3 Is the Right Choice
Google ecosystem integration: If you already use Firebase, Android Studio, and Google Cloud, Gemma has the smoothest tooling path. Google provides Keras integration, Vertex AI fine-tuning, and Android-specific documentation.
4B quality on flagships: If your app targets flagship devices and you need stronger reasoning than a 3B model provides, Gemma 3 4B fills a gap. It sits between the typical 3B and 7B categories.
Multilingual requirements: Gemma 3's 35+ language support is broader than Llama 3.2 (though narrower than Qwen). For European and South Asian language apps, Gemma is a strong choice.
Fine-Tuning Gemma 3
Training Data Format
Gemma uses a specific chat template with `<start_of_turn>` and `<end_of_turn>` tokens:

    <start_of_turn>user
    What's the return policy for electronics?<end_of_turn>
    <start_of_turn>model
    Electronics purchased within the last 30 days can be returned with receipt for a full refund. Items must be in original packaging.<end_of_turn>
For fine-tuning, structure your data as conversations following this template. Most training frameworks (Hugging Face, Axolotl, Unsloth) handle the template automatically when you specify Gemma as the model type.
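To make the template concrete, here is a minimal formatter sketch. The function name and structure are this article's own illustration, not a library API; in a real pipeline you would let the framework apply the template through the model's tokenizer (for example, `tokenizer.apply_chat_template` in Hugging Face Transformers).

```python
# Minimal sketch of the Gemma 3 chat template. In practice, let your
# training framework apply the template via the model's tokenizer.

def format_gemma_turns(messages):
    """Render a list of {"role", "content"} dicts as Gemma turn markup.

    Gemma's template uses the roles "user" and "model" (not "assistant"),
    so assistant messages are remapped here.
    """
    parts = []
    for msg in messages:
        role = "model" if msg["role"] == "assistant" else msg["role"]
        parts.append(f"<start_of_turn>{role}\n{msg['content']}<end_of_turn>\n")
    return "".join(parts)

example = format_gemma_turns([
    {"role": "user", "content": "What's the return policy for electronics?"},
    {"role": "assistant", "content": "Electronics purchased within the last "
     "30 days can be returned with receipt for a full refund."},
])
print(example)
```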
LoRA Configuration
| Parameter | 1B | 4B |
|---|---|---|
| LoRA rank (r) | 16-32 | 16-64 |
| LoRA alpha | 32-64 | 32-128 |
| Learning rate | 2e-4 | 1e-4 |
| Epochs | 3-5 | 2-4 |
| Target modules | q_proj, v_proj, k_proj, o_proj | Same |
| Adapter size | 30-80MB | 50-150MB |
Training Data Requirements
The same guidelines apply as for other model families:
| Task | Minimum Examples | Recommended |
|---|---|---|
| Classification | 200 | 500-1,000 |
| Q&A | 300 | 1,000-2,000 |
| Chat | 500 | 2,000-5,000 |
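As an illustration of what the training data itself looks like, here is a sketch that writes conversations in the widely used JSONL "messages" schema, one conversation per line. The field names follow the common convention; check your framework's documentation for its exact expected schema.

```python
import json

# Write training conversations as JSONL, one conversation per line, in the
# "messages" schema most fine-tuning frameworks accept. Field names follow
# the common convention; verify against your framework's expected schema.

conversations = [
    [
        {"role": "user", "content": "What's the return policy for electronics?"},
        {"role": "assistant", "content": "Electronics purchased within the "
         "last 30 days can be returned with receipt for a full refund."},
    ],
    # ...hundreds more examples, per the table above
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for convo in conversations:
        f.write(json.dumps({"messages": convo}) + "\n")
```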
Quality After Fine-Tuning
Gemma 3 responds well to fine-tuning. The 1B model jumps from general-purpose mediocrity to domain-specific competence with as few as 500 examples. The 4B model fine-tunes to quality levels that rival prompted GPT-4o on narrow tasks.
Expected accuracy ranges (domain-specific classification):
- 1B base: 65-72%
- 1B fine-tuned (500 examples): 88-92%
- 4B base: 75-80%
- 4B fine-tuned (500 examples): 92-96%
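Ranges like these should be verified on your own held-out data rather than taken on faith. A minimal accuracy check might look like this; `predict` is a placeholder for whatever inference call you use (a llama.cpp run, a server endpoint, etc.), stubbed here purely for illustration.

```python
# Minimal held-out accuracy check for a classification fine-tune.
# `predict` is a placeholder for your real inference call; the stub
# below exists only so the sketch is self-contained.

def accuracy(predict, eval_set):
    """Fraction of held-out (text, label) pairs the model classifies correctly."""
    correct = sum(1 for text, label in eval_set if predict(text) == label)
    return correct / len(eval_set)

eval_set = [
    ("What's the return policy?", "returns"),
    ("I need to reset my password", "account"),
]
stub_predict = lambda text: "returns" if "return" in text else "account"
print(f"held-out accuracy: {accuracy(stub_predict, eval_set):.0%}")
```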
GGUF Export
Gemma 3 models convert to GGUF format using the standard llama.cpp conversion tools. The process:
- Fine-tune with LoRA
- Merge the LoRA adapter into the base weights
- Convert to GGUF using convert_hf_to_gguf.py
- Quantize to Q4_K_M with llama-quantize
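The conversion and quantization commands can be sketched as follows. Paths and output filenames are illustrative; `convert_hf_to_gguf.py` and `llama-quantize` ship with llama.cpp, and the LoRA merge must already have produced full-precision weights in `merged_dir`.

```python
from pathlib import Path

# Assemble the two llama.cpp commands for GGUF conversion and quantization.
# Paths and filenames here are illustrative, not fixed by any tool.

def gguf_export_commands(merged_dir: str, out_dir: str, quant: str = "Q4_K_M"):
    f16 = Path(out_dir) / "gemma3-f16.gguf"
    quantized = Path(out_dir) / f"gemma3-{quant.lower()}.gguf"
    # Step 1: convert merged HF weights to an f16 GGUF.
    convert = ["python", "convert_hf_to_gguf.py", merged_dir,
               "--outfile", str(f16), "--outtype", "f16"]
    # Step 2: quantize the f16 GGUF down to the target level.
    quantize = ["./llama-quantize", str(f16), str(quantized), quant]
    return convert, quantize

convert_cmd, quantize_cmd = gguf_export_commands("./gemma3-merged", "./out")
print(" ".join(convert_cmd))
print(" ".join(quantize_cmd))
```

With llama.cpp cloned and built, each command can be run via `subprocess.run(cmd, check=True)` from the repository root.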
Platforms like Ertas automate this pipeline: select Gemma 3 as the base model, upload training data, train, and export directly to GGUF at your desired quantization level.
Deployment on iOS and Android
Gemma 3 GGUF models run on llama.cpp identically to Llama or any other GGUF model. The deployment process is the same:
iOS: Load the GGUF via llama.cpp with Metal acceleration. No Gemma-specific configuration needed.
Android: Load via llama.android with Vulkan GPU acceleration. Same API as any other GGUF model.
The advantage of GGUF as a universal format is that your deployment infrastructure works with any model family. Switching from Llama to Gemma (or vice versa) requires only swapping the model file.
Performance on Mobile Devices
Gemma 3 1B (Q4_K_M, ~600MB)
| Device | Tokens/sec | Memory |
|---|---|---|
| iPhone 16 Pro | 38-48 | ~800MB |
| iPhone 15 | 26-34 | ~800MB |
| Galaxy S24 (Vulkan) | 38-48 | ~800MB |
| Mid-range Android | 18-25 | ~800MB |
Gemma 3 4B (Q4_K_M, ~2.3GB)
| Device | Tokens/sec | Memory |
|---|---|---|
| iPhone 16 Pro | 16-22 | ~3.0GB |
| iPhone 15 Pro | 14-20 | ~3.0GB |
| Galaxy S24 (Vulkan) | 18-24 | ~3.0GB |
| Galaxy S25 (Vulkan) | 20-28 | ~3.0GB |
The 4B model is somewhat slower than a 3B model, but on flagship devices it remains well above the 10 tok/s usability threshold.
Gemma vs Gemini Nano
Google offers both Gemma (open model for self-deployment) and Gemini Nano (on-device via Android AICore). They serve different purposes:
| Factor | Gemma 3 (GGUF) | Gemini Nano |
|---|---|---|
| Custom fine-tuning | Yes | No |
| Device coverage | Any 4GB+ device | Pixel 8+, Galaxy S24+ only |
| Model control | Full | None |
| Tasks | Any text generation | Limited pre-defined tasks |
| Platform | iOS and Android | Android only |
| Cost | Free (on-device) | Free (on-device) |
If you need custom AI behavior, domain-specific knowledge, or cross-platform deployment, Gemma via GGUF is the right path. Gemini Nano is only appropriate for pre-defined tasks on a narrow device set.
Licensing
Gemma 3 uses the Gemma Terms of Use:
- Commercial use: Allowed
- Fine-tuning and modification: Allowed
- Distribution: Allowed
- No MAU threshold (unlike Llama's 700M limit)
- Cannot use outputs to train models that compete with Gemini
The license is practical for most mobile app use cases. The restriction on competitive model training is unlikely to affect mobile developers.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.