    Gemma 3 for Mobile: Fine-Tuning and On-Device Deployment

    How to use Google's Gemma 3 models for on-device mobile AI. Model selection, fine-tuning with LoRA, GGUF export, and deployment via llama.cpp on iOS and Android.

    Ertas Team

    Google's Gemma 3 represents a significant step forward from Gemma 2. The 1B model is practical for mobile classification tasks, and the 4B model offers reasoning capability that competes with larger models from other families.

    For mobile developers already in the Google ecosystem (Android, Firebase, Google Cloud), Gemma is a natural choice with good tooling support.

    Gemma 3 Model Lineup for Mobile

    Model | Parameters | GGUF Q4 Size | RAM Needed | Mobile Viability
    Gemma 3 1B | 1B | ~600MB | ~800MB | Excellent (4GB+ devices)
    Gemma 3 4B | 4B | ~2.3GB | ~3GB | Good (8GB+ devices)
    Gemma 3 12B | 12B | ~7GB | ~9GB | Not viable for mobile
    Gemma 3 27B | 27B | ~15GB | ~18GB | Not viable for mobile

    The 1B and 4B models are the mobile-relevant sizes. The 4B is slightly larger than the typical 3B target but runs within budget on 8GB devices.

    Gemma 3 vs Gemma 2

    Improvement | Gemma 2 | Gemma 3
    Instruction following (IFEval) | 51.2 (2B) | 54.2 (1B)
    General knowledge (MMLU) | 51.3 (2B) | 46.8 (1B), 67.2 (4B)
    Multilingual support | 20 languages | 35+ languages
    Context window (1B) | 8K | 32K
    Context window (4B) | 8K | 128K

    Gemma 3's 4B model is a standout. It approaches the capability of the Llama family's 8B models (which are not mobile-viable) while fitting on flagship mobile devices.

    When Gemma 3 Is the Right Choice

    Google ecosystem integration: If you already use Firebase, Android Studio, and Google Cloud, Gemma has the smoothest tooling path. Google provides Keras integration, Vertex AI fine-tuning, and Android-specific documentation.

    4B quality on flagships: If your app targets flagship devices and you need stronger reasoning than a 3B model provides, Gemma 3 4B fills a gap. It sits between the typical 3B and 7B categories.

    Multilingual requirements: Gemma 3's 35+ language support is broader than Llama 3.2 (though narrower than Qwen). For European and South Asian language apps, Gemma is a strong choice.

    Fine-Tuning Gemma 3

    Training Data Format

    Gemma uses a specific chat template with <start_of_turn> and <end_of_turn> tokens:

    <start_of_turn>user
    What's the return policy for electronics?<end_of_turn>
    <start_of_turn>model
    Electronics purchased within the last 30 days can be returned with receipt for a full refund. Items must be in original packaging.<end_of_turn>
    

    For fine-tuning, structure your data as conversations following this template. Most training frameworks (Hugging Face, Axolotl, Unsloth) handle the template automatically when you specify Gemma as the model type.
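A minimal sketch of applying the template by hand, in case you need to prepare data outside those frameworks. The `<start_of_turn>`/`<end_of_turn>` tokens come from the template above; the function name and example strings are illustrative:

```python
# Format one (user, model) exchange into Gemma's chat template by hand.
# Hugging Face tokenizers do this automatically via apply_chat_template
# when the model type is Gemma; this is only a manual illustration.

def to_gemma_turns(user_text: str, model_text: str) -> str:
    """Render one user/model exchange in Gemma's <start_of_turn> format."""
    return (
        f"<start_of_turn>user\n{user_text}<end_of_turn>\n"
        f"<start_of_turn>model\n{model_text}<end_of_turn>\n"
    )

example = to_gemma_turns(
    "What's the return policy for electronics?",
    "Electronics purchased within the last 30 days can be returned "
    "with receipt for a full refund.",
)
```

Getting the template exactly right matters: a model fine-tuned on a mismatched template will produce malformed turns at inference time.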

    LoRA Configuration

    Parameter | 1B | 4B
    LoRA rank (r) | 16-32 | 16-64
    LoRA alpha | 32-64 | 32-128
    Learning rate | 2e-4 | 1e-4
    Epochs | 3-5 | 2-4
    Target modules | q_proj, v_proj, k_proj, o_proj | Same
    Adapter size | 30-80MB | 50-150MB

    Training Data Requirements

    The same guidelines apply as other model families:

    Task | Minimum Examples | Recommended
    Classification | 200 | 500-1,000
    Q&A | 300 | 1,000-2,000
    Chat | 500 | 2,000-5,000

    Quality After Fine-Tuning

    Gemma 3 responds well to fine-tuning. The 1B model jumps from general-purpose mediocrity to domain-specific competence with as few as 500 examples. The 4B model fine-tunes to quality levels that rival prompted GPT-4o on narrow tasks.

    Expected accuracy ranges (domain-specific classification):

    • 1B base: 65-72%
    • 1B fine-tuned (500 examples): 88-92%
    • 4B base: 75-80%
    • 4B fine-tuned (500 examples): 92-96%
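Accuracy figures like these come from running the model over a held-out labeled set and counting exact label matches. A minimal sketch of that measurement, with dummy labels standing in for real model output:

```python
# Compute classification accuracy over a held-out set: the fraction of
# predictions that exactly match the gold label.

def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    assert len(predictions) == len(labels), "mismatched lengths"
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)

# Illustrative check with dummy data (2 of 3 correct):
acc = accuracy(["refund", "shipping", "refund"],
               ["refund", "shipping", "other"])
```

Hold the evaluation set out of training entirely, or the fine-tuned numbers will be inflated by memorization.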

    GGUF Export

    Gemma 3 models convert to GGUF format using the standard llama.cpp conversion tools. The process:

    1. Fine-tune with LoRA
    2. Merge the LoRA adapter into the base weights
    3. Convert to GGUF using convert_hf_to_gguf.py
    4. Quantize to Q4_K_M with llama-quantize
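Steps 3 and 4 can be sketched as two commands, assuming a merged Hugging Face model directory at `./gemma3-merged` and a local llama.cpp checkout. Script and binary names reflect recent llama.cpp releases; verify them against your checkout:

```bash
# Convert the merged HF checkpoint to an f16 GGUF, then quantize.
python convert_hf_to_gguf.py ./gemma3-merged \
    --outfile gemma3-ft-f16.gguf --outtype f16

./llama-quantize gemma3-ft-f16.gguf gemma3-ft-q4_k_m.gguf Q4_K_M
```

Keep the f16 intermediate around: re-quantizing to a different level later is much faster than re-converting from the HF checkpoint.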

    Platforms like Ertas automate this pipeline: select Gemma 3 as the base model, upload training data, train, and export directly to GGUF at your desired quantization level.

    Deployment on iOS and Android

    Gemma 3 GGUF models run on llama.cpp identically to Llama or any other GGUF model. The deployment process is the same:

    iOS: Load the GGUF via llama.cpp with Metal acceleration. No Gemma-specific configuration needed.

    Android: Load via llama.android with Vulkan GPU acceleration. Same API as any other GGUF model.

    The advantage of GGUF as a universal format is that your deployment infrastructure works with any model family. Switching from Llama to Gemma (or vice versa) requires only swapping the model file.

    Performance on Mobile Devices

    Gemma 3 1B (Q4_K_M, ~600MB)

    Device | Tokens/sec | Memory
    iPhone 16 Pro | 38-48 | ~800MB
    iPhone 15 | 26-34 | ~800MB
    Galaxy S24 (Vulkan) | 38-48 | ~800MB
    Mid-range Android | 18-25 | ~800MB

    Gemma 3 4B (Q4_K_M, ~2.3GB)

    Device | Tokens/sec | Memory
    iPhone 16 Pro | 16-22 | ~3.0GB
    iPhone 15 Pro | 14-20 | ~3.0GB
    Galaxy S24 (Vulkan) | 18-24 | ~3.0GB
    Galaxy S25 (Vulkan) | 20-28 | ~3.0GB

    The 4B model is slightly slower than a 3B model but the difference is small. On flagship devices, it is still well above the 10 tok/s usability threshold.
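To translate tokens/sec into perceived latency: a response of n tokens takes roughly n divided by the generation rate, ignoring prompt-processing time. A quick sketch with illustrative numbers:

```python
# Approximate decode time for a response, ignoring prompt processing
# (prefill), which adds a one-time delay before the first token.

def response_seconds(n_tokens: int, tok_per_sec: float) -> float:
    """Approximate decode time for n_tokens at a given generation rate."""
    return n_tokens / tok_per_sec

# A 100-token answer at the ~16 tok/s low end of Gemma 3 4B
# on an iPhone 16 Pro takes about 6.25 seconds to stream out.
t = response_seconds(100, 16.0)
```

With streaming output, users see the first words almost immediately, so even multi-second total times feel responsive.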

    Gemma vs Gemini Nano

    Google offers both Gemma (open model for self-deployment) and Gemini Nano (on-device via Android AICore). They serve different purposes:

    Factor | Gemma 3 (GGUF) | Gemini Nano
    Custom fine-tuning | Yes | No
    Device coverage | Any 4GB+ device | Pixel 8+, Galaxy S24+ only
    Model control | Full | None
    Tasks | Any text generation | Limited pre-defined tasks
    Platform | iOS and Android | Android only
    Cost | Free (on-device) | Free (on-device)

    If you need custom AI behavior, domain-specific knowledge, or cross-platform deployment, Gemma via GGUF is the right path. Gemini Nano is only appropriate for pre-defined tasks on a narrow device set.

    Licensing

    Gemma 3 uses the Gemma Terms of Use:

    • Commercial use: Allowed
    • Fine-tuning and modification: Allowed
    • Distribution: Allowed
    • No MAU threshold (unlike Llama's 700M limit)
    • Cannot use outputs to train models that compete with Gemini

    The license is practical for most mobile app use cases. The restriction on competitive model training is unlikely to affect mobile developers.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
