
    Fine-Tuning Gemma 3: Google's Lightweight Model for On-Device Deployment

    Gemma 3 is optimized for on-device inference — phones, tablets, edge hardware. Here's how to fine-tune it for mobile AI features and IoT applications that run without a server.

    Ertas Team

    Running AI on a phone, a Raspberry Pi, or an IoT gateway — without ever hitting a server — changes what's possible. No latency from network round-trips. No API costs that scale with users. No dependency on internet connectivity. Complete data privacy because nothing leaves the device.

    Gemma 3 is Google's open model family built specifically for this. While Llama and Qwen were designed for server-side inference and then squeezed onto smaller hardware, Gemma 3 was architected from the ground up for resource-constrained environments. The result is a model that runs faster, uses less memory, and handles on-device constraints more gracefully than its competitors at equivalent parameter counts.

    The 4B model is the target for most on-device deployments. At Q4_K_M quantization the weights come in around 2.5 GB, with peak RAM near 3.5 GB, which is well within the capabilities of a modern smartphone, a Raspberry Pi 5, or a browser tab. And when you fine-tune it for a specific task, it can do that task as well as models 5x its size.

    Gemma 3 Model Sizes

    | Model | Parameters | Size (Q4_K_M) | RAM Required | Target Deployment |
    | --- | --- | --- | --- | --- |
    | Gemma 3 1B | 1B | 0.7 GB | 1.2 GB | Microcontrollers, wearables, ultra-constrained |
    | Gemma 3 4B | 4B | 2.5 GB | 3.5 GB | Phones, tablets, Raspberry Pi, browser |
    | Gemma 3 12B | 12B | 7.5 GB | 9 GB | Laptops, desktop apps, edge servers |
    | Gemma 3 27B | 27B | 16 GB | 19 GB | Workstations, GPU servers |

    The 4B model hits the sweet spot for on-device. It's large enough to handle meaningful tasks (classification, extraction, simple generation, intent detection) while small enough to run on hardware people already own.

    The 1B model is for extreme constraints — wearables, embedded systems, or scenarios where you need the absolute smallest footprint. It handles simple classification and short-form tasks but struggles with anything requiring more than basic pattern matching.

    Why Gemma for On-Device

    Architecture Optimizations

    Google designed Gemma 3 with on-device inference in mind:

    • Sliding window attention on alternating layers reduces memory usage by 30-40% compared to full attention at equivalent context lengths. For on-device, this means you can process longer inputs without running out of RAM.
    • Grouped query attention (GQA) with a 1:4 ratio compresses the KV cache, reducing memory allocation during inference. On a phone with 6 GB RAM, this is the difference between running and crashing.
    • RMSNorm with learnable scale instead of LayerNorm — marginally faster per layer, which adds up over billions of operations on CPU/NPU hardware.
    • Logit soft-capping stabilizes output probabilities, reducing the chance of degenerate outputs on quantized models. When you're running at Q4_0 on a phone NPU, this stability matters.

    Inference Speed Comparison

    Gemma 3 4B vs comparable models on different hardware, all at Q4_K_M:

    | Hardware | Gemma 3 4B | Qwen 2.5 3B | Phi-3.5 Mini 3.8B | Llama 3.2 3B |
    | --- | --- | --- | --- | --- |
    | iPhone 15 Pro (ANE) | 28 t/s | 22 t/s | 19 t/s | 24 t/s |
    | Pixel 8 Pro (GPU) | 22 t/s | 17 t/s | 15 t/s | 19 t/s |
    | Raspberry Pi 5 (8GB, CPU) | 6.4 t/s | 5.1 t/s | 4.2 t/s | 5.5 t/s |
    | M2 MacBook Air (GPU) | 48 t/s | 38 t/s | 33 t/s | 41 t/s |
    | Browser (WebLLM, Chrome) | 12 t/s | 9 t/s | 8 t/s | 10 t/s |

    Gemma 3 is 15-30% faster than equivalent-size models across all on-device targets. On the iPhone's Apple Neural Engine, the advantage is particularly pronounced — Google optimized the weight layout for Apple's ML hardware.

    Memory Footprint

    | Model | Q4_K_M Size | Peak RAM (2K context) | Peak RAM (4K context) |
    | --- | --- | --- | --- |
    | Gemma 3 4B | 2.5 GB | 3.2 GB | 3.8 GB |
    | Qwen 2.5 3B | 2.0 GB | 2.8 GB | 3.5 GB |
    | Phi-3.5 Mini 3.8B | 2.3 GB | 3.4 GB | 4.3 GB |
    | Llama 3.2 3B | 2.0 GB | 2.9 GB | 3.6 GB |

    Gemma 3 4B has a slightly larger base footprint than the 3B models (it simply has more parameters), but its KV cache efficiency means the gap narrows at longer context lengths. At 4K context, Gemma uses less RAM than Phi-3.5 Mini despite having roughly 200M more parameters.

    Fine-Tuning the 4B Model

    VRAM Requirements

    | Configuration | VRAM |
    | --- | --- |
    | QLoRA (rank 16, 4-bit base) | 6 GB |
    | QLoRA (rank 32, 4-bit base) | 8 GB |
    | LoRA (rank 16, FP16 base) | 12 GB |
    | Full fine-tuning (FP16) | 18 GB |

    You can fine-tune Gemma 3 4B with QLoRA on an RTX 3060 12GB, an RTX 4060 8GB, or an M1 MacBook Pro with 16 GB unified memory. Training is fast: a run over 500 examples typically completes in 12-18 minutes at rank 16.
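    For orientation, here's a minimal QLoRA setup sketch using the Hugging Face stack (transformers, bitsandbytes, peft). The model ID, Auto class, and LoRA hyperparameters are assumptions for illustration and may differ depending on your transformers version and which Gemma 3 variant you pull:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "google/gemma-3-4b-it"  # assumed Hub ID for the instruction-tuned 4B checkpoint

    # 4-bit NF4 quantization of the base weights (the "QLoRA, rank 16, 4-bit base" row above)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )

    # Rank-16 LoRA adapters on the attention projections (illustrative choice of target modules)
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()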

    Dataset Strategies for On-Device Tasks

    On-device tasks have specific constraints that should shape your training data:

    Short inputs, concise outputs. On-device, every token costs latency and memory. Train the model to produce minimal, structured outputs:

    {"instruction": "Classify intent", "input": "Where's my order?", "output": "order_status"}
    {"instruction": "Classify intent", "input": "I want to cancel", "output": "cancellation"}
    {"instruction": "Classify intent", "input": "How do I change my payment method?", "output": "account_settings"}
    

    Not this:

    {"instruction": "Classify the customer's intent from the following message", "input": "Where's my order?", "output": "The customer's intent is to check their order status. Category: order_status"}
    

    The verbose version wastes tokens on both input (longer instruction) and output (explanation nobody asked for). On a phone generating at 28 t/s, every unnecessary token adds 36ms of latency.

    Optimize for latency-critical patterns. If the model needs to respond within 200ms, your training outputs should be under 15 tokens. Design your task format accordingly. Single-label classification (1 token output) is ideal for on-device. Short JSON objects (5-10 tokens) are acceptable. Paragraph-length generation is not a good fit for latency-sensitive on-device use.
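    One way to enforce that budget is to check output lengths while assembling the dataset. A small sketch, with hypothetical file names:

    import json
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")  # assumed Hub ID
    MAX_OUTPUT_TOKENS = 15

    kept, dropped = [], 0
    with open("train.jsonl") as f:  # hypothetical raw dataset
        for line in f:
            example = json.loads(line)
            n_tokens = len(tokenizer(example["output"], add_special_tokens=False)["input_ids"])
            if n_tokens <= MAX_OUTPUT_TOKENS:
                kept.append(example)
            else:
                dropped += 1  # rewrite or discard anything over budget

    with open("train_short.jsonl", "w") as f:
        for example in kept:
            f.write(json.dumps(example) + "\n")

    print(f"kept {len(kept)} examples, dropped {dropped} over the {MAX_OUTPUT_TOKENS}-token budget")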

    Include edge cases from real device usage. Mobile users type differently than desktop users — more typos, more abbreviations, more informal language. Include messy real-world inputs in your training data:

    {"instruction": "Classify intent", "input": "cant login pls help", "output": "authentication"}
    {"instruction": "Classify intent", "input": "where tf is my package", "output": "order_status"}
    {"instruction": "Classify intent", "input": "refudn pls", "output": "refund"}
    

    Training Configuration

    Recommended settings for Gemma 3 4B on-device fine-tuning:

    | Parameter | Value | Notes |
    | --- | --- | --- |
    | LoRA rank | 16 | Sufficient for classification/extraction |
    | Learning rate | 2e-4 | Standard |
    | Epochs | 5-6 | Small models need more passes |
    | Batch size | 8 | Smaller model allows larger batches |
    | Max seq length | 512 | Keep short for on-device tasks |
    | Warmup ratio | 0.1 | Slightly higher for stability |

    Setting max sequence length to 512 instead of the default 2048 significantly speeds up training and produces a model optimized for the short inputs typical of on-device use. If your on-device task involves longer documents, increase to 1024 or 2048 as needed.
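    Wired into trl's SFTTrainer, those settings look roughly like the sketch below (continuing from the QLoRA sketch above). Argument names such as max_seq_length and processing_class have shifted between trl releases, so treat this as a sketch rather than copy-paste configuration:

    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    dataset = load_dataset("json", data_files="train_short.jsonl", split="train")

    # Flatten instruction/input/output into a single text field for supervised fine-tuning
    def to_text(example):
        return {
            "text": f"Instruction: {example['instruction']}\n"
                    f"Input: {example['input']}\n"
                    f"Output: {example['output']}"
        }

    dataset = dataset.map(to_text)

    training_args = SFTConfig(
        output_dir="gemma3-4b-intent-lora",
        learning_rate=2e-4,
        num_train_epochs=5,
        per_device_train_batch_size=8,
        warmup_ratio=0.1,
        max_seq_length=512,   # renamed in some trl releases; check your version
        logging_steps=10,
        bf16=True,
    )

    trainer = SFTTrainer(
        model=model,                  # the PEFT-wrapped model from the earlier sketch
        args=training_args,
        train_dataset=dataset,
        processing_class=tokenizer,   # older trl versions call this argument `tokenizer`
    )
    trainer.train()
    trainer.save_model("gemma3-4b-intent-lora")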

    GGUF Export and Quantization for Mobile

    After fine-tuning, export to GGUF format. The quantization level you choose depends on your deployment target:

    | Quantization | Model Size | Quality Loss | Best For |
    | --- | --- | --- | --- |
    | Q4_0 | 2.1 GB | 3-4% | Smallest footprint, memory-constrained devices |
    | Q4_K_M | 2.5 GB | 1.5-2% | Good balance, most mobile deployments |
    | Q5_K_M | 2.9 GB | 0.5-1% | Higher quality, devices with 4+ GB available RAM |
    | Q8_0 | 4.2 GB | Less than 0.5% | Near-lossless, laptops and desktops |

    For mobile apps on phones from the last 2-3 years (6+ GB RAM), Q4_K_M is the default recommendation. It's small enough to fit alongside the OS and other apps, fast enough for real-time responses, and the quality loss is negligible for classification and extraction tasks.

    For Raspberry Pi or other memory-constrained edge devices, Q4_0 saves 400 MB of RAM, which can be the difference between running and not running. The quality drop is acceptable for simple classification tasks.

    For browser deployment via WebLLM, Q4_0 or Q4_K_M — the model needs to download to the browser and fit in GPU memory (which is shared with the rest of the browser). Smaller is better.
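    A rough sketch of the export path: merge the LoRA adapter, convert with llama.cpp's conversion script, then quantize. Script locations and output file names are assumptions based on a local llama.cpp checkout:

    import subprocess
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # 1. Merge the LoRA adapter back into the base weights
    merged = AutoPeftModelForCausalLM.from_pretrained("gemma3-4b-intent-lora").merge_and_unload()
    merged.save_pretrained("gemma3-4b-intent-merged")
    AutoTokenizer.from_pretrained("gemma3-4b-intent-lora").save_pretrained("gemma3-4b-intent-merged")

    # 2. Convert the merged Hugging Face checkpoint to an FP16 GGUF file
    subprocess.run([
        "python", "llama.cpp/convert_hf_to_gguf.py", "gemma3-4b-intent-merged",
        "--outfile", "gemma3-4b-intent-f16.gguf", "--outtype", "f16",
    ], check=True)

    # 3. Quantize to Q4_K_M for mobile (swap in Q4_0 for tighter memory budgets)
    subprocess.run([
        "llama.cpp/build/bin/llama-quantize",
        "gemma3-4b-intent-f16.gguf", "gemma3-4b-intent-q4_k_m.gguf", "Q4_K_M",
    ], check=True)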

    Integration Patterns

    React Native + Local Model

    Use llama.rn for React Native integration. The library wraps llama.cpp and provides a JavaScript API:

    1. Bundle the GGUF file with your app (or download on first launch)
    2. Initialize the model on app startup (takes 2-4 seconds on iPhone 15)
    3. Run inference through the JS bridge
    4. Model stays loaded in memory until the app is backgrounded

    Typical classification latency: 80-150ms including bridge overhead. For comparison, an API call takes 600-1,200ms minimum.

    Native iOS with Core ML

    Convert the fine-tuned model to Core ML format from the merged checkpoint (not the GGUF); Apple's coremltools handles the conversion from PyTorch. Core ML models can run on the Apple Neural Engine, which is faster and more power-efficient than GPU inference:

    • iPhone 15 Pro: 32-35 t/s on ANE vs 22-25 t/s on GPU
    • Battery impact: ANE uses roughly 40% less power than GPU for the same workload
    • Memory: Core ML manages model memory more efficiently than llama.cpp

    The trade-off: Core ML conversion is an extra step, and you lose some flexibility (no dynamic quantization, fixed batch size). Worth it for production iOS apps; not worth it for prototyping.

    Native Android with NNAPI

    On Android, use llama.cpp's NNAPI backend for hardware-accelerated inference on Qualcomm, MediaTek, and Samsung NPUs:

    • Pixel 8 Pro (Tensor G3): 22-25 t/s on NPU
    • Samsung S24 (Snapdragon 8 Gen 3): 26-30 t/s on NPU
    • Battery: NPU inference uses 50-60% less power than CPU

    NNAPI support varies by device. Always include a CPU fallback path.

    Browser via WebLLM

    WebLLM runs quantized models in the browser using WebGPU:

    • Chrome on M2 MacBook: 15-18 t/s
    • Chrome on gaming laptop (RTX 4060): 20-25 t/s
    • Chrome on iPhone 15 Pro: 8-10 t/s

    The model downloads once and caches in the browser's storage (IndexedDB). Subsequent loads are near-instant. Good for web apps that need offline capability or data privacy.

    Raspberry Pi via llama.cpp

    For IoT and edge deployments, run llama.cpp directly on the Pi:

    • Raspberry Pi 5 (8 GB): Gemma 3 4B at Q4_K_M, 6-7 t/s
    • Raspberry Pi 5 (4 GB): Gemma 3 4B at Q4_0, 5-6 t/s (tight fit)
    • Raspberry Pi 4 (8 GB): Gemma 3 4B at Q4_0, 2-3 t/s (usable for batch processing)

    For the Pi 4, consider the Gemma 3 1B model instead — it runs at 8-10 t/s at Q4_K_M and handles simple classification reliably.
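    For a Pi deployment, the llama-cpp-python bindings are a straightforward way to serve the quantized model from Python. A minimal sketch, with file names continuing from the export sketch above:

    from llama_cpp import Llama

    llm = Llama(
        model_path="gemma3-4b-intent-q4_k_m.gguf",
        n_ctx=512,      # matches the short-context training setup
        n_threads=4,    # one thread per core on the Pi 5
    )

    def classify(message: str) -> str:
        # Prompt format must mirror the training data layout
        prompt = f"Instruction: Classify intent\nInput: {message}\nOutput:"
        result = llm(prompt, max_tokens=8, temperature=0.0, stop=["\n"])
        return result["choices"][0]["text"].strip()

    print(classify("where tf is my package"))  # expected: order_status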

    Performance Benchmarks on Real Devices

    Fine-tuned Gemma 3 4B (Q4_K_M) on a 12-category intent classification task (500 training examples):

    | Device | Accuracy | Latency (avg) | Latency (P99) | Tokens/sec |
    | --- | --- | --- | --- | --- |
    | iPhone 15 Pro | 94% | 65ms | 120ms | 28 t/s |
    | Pixel 8 Pro | 94% | 85ms | 160ms | 22 t/s |
    | Samsung S24 | 94% | 72ms | 135ms | 26 t/s |
    | M2 MacBook Air | 94% | 32ms | 55ms | 48 t/s |
    | Raspberry Pi 5 | 94% | 280ms | 420ms | 6.4 t/s |
    | Browser (Chrome, M2) | 94% | 110ms | 180ms | 12 t/s |

    Accuracy is identical across devices because they all run the same Q4_K_M weights; the differences are purely in latency and throughput. Every device delivers sub-second responses for classification, which is fast enough for real-time user-facing features.

    For context, a GPT-4o API call for the same classification task averages 850ms with a P99 of 3,200ms. On the phones tested, the local model beats the hosted API on latency by roughly 10x.
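    A harness along these lines can measure accuracy and latency for your own fine-tune; the held-out file name is hypothetical, and classify() comes from the Raspberry Pi sketch above:

    import json
    import statistics
    import time

    examples = [json.loads(line) for line in open("eval.jsonl")]  # hypothetical held-out set

    latencies, correct = [], 0
    for example in examples:
        start = time.perf_counter()
        prediction = classify(example["input"])
        latencies.append((time.perf_counter() - start) * 1000)
        correct += prediction == example["output"]

    latencies.sort()
    p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
    print(f"accuracy: {correct / len(examples):.1%}")
    print(f"avg latency: {statistics.mean(latencies):.0f}ms, P99: {p99:.0f}ms")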

    Real Use Cases

    Offline Form Validation

    A field service app used Gemma 3 4B to validate and correct technician notes entered on tablets in areas with no cell coverage. The model checks spelling, flags missing required fields, and classifies the work type — all offline. Fine-tuned on 300 examples of technician notes, deployed at Q4_K_M on Android tablets. Accuracy: 92%. Latency: 120ms.

    On-Device Intent Classification

    A banking app uses Gemma 3 4B to classify customer messages into intents before routing to the appropriate service. Running on-device means the message text never leaves the phone — a compliance requirement in several jurisdictions. Fine-tuned on 400 examples, deployed via Core ML on iOS. Accuracy: 95%. Latency: 55ms.

    IoT Sensor Analysis

    An industrial monitoring system runs Gemma 3 4B on Raspberry Pi 5 gateways to classify sensor readings as normal, warning, or critical. Each gateway processes data from 12 sensors. Fine-tuned on 500 examples of sensor data patterns, deployed at Q4_0. Accuracy: 97% on the three-way classification. Processing: 4 readings/second per gateway.

    Privacy-First Text Processing

    A healthcare note-taking app uses Gemma 3 4B to structure and categorize clinical notes on the provider's iPad. No patient data leaves the device. Fine-tuned on 350 de-identified examples. HIPAA compliance is simplified because the AI processing happens entirely on-device — no cloud, no API, no BAA required for the inference step.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
