
Fine-Tuning Gemma 3: Google's Lightweight Model for On-Device Deployment
Gemma 3 is optimized for on-device inference — phones, tablets, edge hardware. Here's how to fine-tune it for mobile AI features and IoT applications that run without a server.
Running AI on a phone, a Raspberry Pi, or an IoT gateway — without ever hitting a server — changes what's possible. No latency from network round-trips. No API costs that scale with users. No dependency on internet connectivity. Complete data privacy because nothing leaves the device.
Gemma 3 is Google's open model family built specifically for this. While Llama and Qwen were designed for server-side inference and then squeezed onto smaller hardware, Gemma 3 was architected from the ground up for resource-constrained environments. The result is a model that runs faster, uses less memory, and handles on-device constraints more gracefully than its competitors at equivalent parameter counts.
The 4B model is the target for most on-device deployments. At Q4_K_M quantization the weights come in around 2.5 GB and peak RAM stays under 4 GB, well within the capabilities of a modern smartphone, a Raspberry Pi 5, or a browser tab. And when you fine-tune it for a specific task, it can do that task as well as models 5x its size.
Gemma 3 Model Sizes
| Model | Parameters | Size (Q4_K_M) | RAM Required | Target Deployment |
|---|---|---|---|---|
| Gemma 3 1B | 1B | 0.7 GB | 1.2 GB | Microcontrollers, wearables, ultra-constrained |
| Gemma 3 4B | 4B | 2.5 GB | 3.5 GB | Phones, tablets, Raspberry Pi, browser |
| Gemma 3 12B | 12B | 7.5 GB | 9 GB | Laptops, desktop apps, edge servers |
| Gemma 3 27B | 27B | 16 GB | 19 GB | Workstations, GPU servers |
The 4B model hits the sweet spot for on-device. It's large enough to handle meaningful tasks (classification, extraction, simple generation, intent detection) while small enough to run on hardware people already own.
The 1B model is for extreme constraints — wearables, embedded systems, or scenarios where you need the absolute smallest footprint. It handles simple classification and short-form tasks but struggles with anything requiring more than basic pattern matching.
Why Gemma for On-Device
Architecture Optimizations
Google designed Gemma 3 with on-device inference in mind:
- Sliding window attention on alternating layers reduces memory usage by 30-40% compared to full attention at equivalent context lengths. For on-device, this means you can process longer inputs without running out of RAM.
- Grouped query attention (GQA) with a 1:4 ratio compresses the KV cache, reducing memory allocation during inference. On a phone with 6 GB RAM, this is the difference between running and crashing.
- RMSNorm with learnable scale instead of LayerNorm — marginally faster per layer, which adds up over billions of operations on CPU/NPU hardware.
- QK-norm, which replaces Gemma 2's logit soft-capping, stabilizes attention scores and reduces the chance of degenerate outputs on quantized models. When you're running at Q4_0 on a phone NPU, this stability matters.
Inference Speed Comparison
Gemma 3 4B vs comparable models on different hardware, all at Q4_K_M:
| Hardware | Gemma 3 4B | Qwen 2.5 3B | Phi-3.5 Mini 3.8B | Llama 3.2 3B |
|---|---|---|---|---|
| iPhone 15 Pro (ANE) | 28 t/s | 22 t/s | 19 t/s | 24 t/s |
| Pixel 8 Pro (GPU) | 22 t/s | 17 t/s | 15 t/s | 19 t/s |
| Raspberry Pi 5 (8GB, CPU) | 6.4 t/s | 5.1 t/s | 4.2 t/s | 5.5 t/s |
| M2 MacBook Air (GPU) | 48 t/s | 38 t/s | 33 t/s | 41 t/s |
| Browser (WebLLM, Chrome) | 12 t/s | 9 t/s | 8 t/s | 10 t/s |
Gemma 3 is 15-30% faster than equivalent-size models across all on-device targets. On the iPhone's Apple Neural Engine, the advantage is particularly pronounced — Google optimized the weight layout for Apple's ML hardware.
Memory Footprint
| Model | Q4_K_M Size | Peak RAM (2K context) | Peak RAM (4K context) |
|---|---|---|---|
| Gemma 3 4B | 2.5 GB | 3.2 GB | 3.8 GB |
| Qwen 2.5 3B | 2.0 GB | 2.8 GB | 3.5 GB |
| Phi-3.5 Mini 3.8B | 2.3 GB | 3.4 GB | 4.3 GB |
| Llama 3.2 3B | 2.0 GB | 2.9 GB | 3.6 GB |
Gemma 3 4B has a slightly larger base size than the 3B models (it simply has more parameters), but its KV-cache efficiency means the gap narrows at longer context lengths. At 4K context, Gemma uses less RAM than Phi-3.5 Mini despite having 200M more parameters.
Fine-Tuning the 4B Model
VRAM Requirements
| Configuration | VRAM |
|---|---|
| QLoRA (rank 16, 4-bit base) | 6 GB |
| QLoRA (rank 32, 4-bit base) | 8 GB |
| LoRA (rank 16, FP16 base) | 12 GB |
| Full fine-tuning (FP16) | 18 GB |
You can fine-tune Gemma 3 4B with QLoRA on an RTX 3060 12GB, an RTX 4060 8GB, or an M1 MacBook Pro with 16 GB of unified memory. Training is fast: a 500-example run at rank 16 typically completes in 12-18 minutes.
Dataset Strategies for On-Device Tasks
On-device tasks have specific constraints that should shape your training data:
Short inputs, concise outputs. On-device, every token costs latency and memory. Train the model to produce minimal, structured outputs:
{"instruction": "Classify intent", "input": "Where's my order?", "output": "order_status"}
{"instruction": "Classify intent", "input": "I want to cancel", "output": "cancellation"}
{"instruction": "Classify intent", "input": "How do I change my payment method?", "output": "account_settings"}
Not this:
{"instruction": "Classify the customer's intent from the following message", "input": "Where's my order?", "output": "The customer's intent is to check their order status. Category: order_status"}
The verbose version wastes tokens on both input (longer instruction) and output (explanation nobody asked for). On a phone generating at 28 t/s, every unnecessary token adds 36ms of latency.
Optimize for latency-critical patterns. If the model needs to respond within 200ms, your training outputs should be under 15 tokens. Design your task format accordingly. Single-label classification (1 token output) is ideal for on-device. Short JSON objects (5-10 tokens) are acceptable. Paragraph-length generation is not a good fit for latency-sensitive on-device use.
Include edge cases from real device usage. Mobile users type differently than desktop users — more typos, more abbreviations, more informal language. Include messy real-world inputs in your training data:
{"instruction": "Classify intent", "input": "cant login pls help", "output": "authentication"}
{"instruction": "Classify intent", "input": "where tf is my package", "output": "order_status"}
{"instruction": "Classify intent", "input": "refudn pls", "output": "refund"}
Training Configuration
Recommended settings for Gemma 3 4B on-device fine-tuning:
| Parameter | Value | Notes |
|---|---|---|
| LoRA rank | 16 | Sufficient for classification/extraction |
| Learning rate | 2e-4 | Standard |
| Epochs | 5-6 | Small models need more passes |
| Batch size | 8 | Smaller model allows larger batches |
| Max seq length | 512 | Keep short for on-device tasks |
| Warmup ratio | 0.1 | Slightly higher for stability |
Setting max sequence length to 512 instead of the default 2048 significantly speeds up training and produces a model optimized for the short inputs typical of on-device use. If your on-device task involves longer documents, increase to 1024 or 2048 as needed.
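As a concrete starting point, here is a hedged QLoRA sketch that wires the settings from the table above into Hugging Face transformers, peft, and bitsandbytes. It reuses the hypothetical train_texts list from the preprocessing sketch earlier; the model ID and output paths are illustrative, and because the 4B checkpoint is multimodal, the exact loading class may differ across transformers versions:

```python
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "google/gemma-3-4b-it"  # illustrative; Hub access to Gemma is assumed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# train_texts: list[str] from the preprocessing sketch above (hypothetical).
dataset = Dataset.from_dict({"text": train_texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# 4-bit base model (QLoRA) to stay inside the VRAM budget from the table above.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gemma3-intent-lora",   # hypothetical output path
        per_device_train_batch_size=8,
        num_train_epochs=5,
        learning_rate=2e-4,
        warmup_ratio=0.1,
        bf16=True,                         # assumes a GPU with bf16 support
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("gemma3-intent-lora")
```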
GGUF Export and Quantization for Mobile
After fine-tuning, export to GGUF format. The quantization level you choose depends on your deployment target:
| Quantization | Model Size | Quality Loss | Best For |
|---|---|---|---|
| Q4_0 | 2.1 GB | 3-4% | Smallest footprint, memory-constrained devices |
| Q4_K_M | 2.5 GB | 1.5-2% | Good balance, most mobile deployments |
| Q5_K_M | 2.9 GB | 0.5-1% | Higher quality, devices with 4+ GB available RAM |
| Q8_0 | 4.2 GB | Less than 0.5% | Near-lossless, laptops and desktops |
For mobile apps on phones from the last 2-3 years (6+ GB RAM), Q4_K_M is the default recommendation. It's small enough to fit alongside the OS and other apps, fast enough for real-time responses, and the quality loss is negligible for classification and extraction tasks.
For Raspberry Pi or other memory-constrained edge devices, Q4_0 saves 400 MB of RAM, which can be the difference between running and not running. The quality drop is acceptable for simple classification tasks.
For browser deployment via WebLLM, use Q4_0 or Q4_K_M: the model has to download to the browser and fit in GPU memory, which is shared with the rest of the browser. Smaller is better.
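The export itself is two steps: merge the LoRA adapter back into the base weights, then convert and quantize with llama.cpp's tooling. The sketch below assumes illustrative paths, a local llama.cpp checkout built with CMake, and the adapter directory from the training sketch above:

```python
import subprocess
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Merge the LoRA adapter into the base weights (GGUF conversion wants a plain checkpoint).
#    Model ID, adapter path, and loading class are assumptions for illustration.
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "gemma3-intent-lora").merge_and_unload()
merged.save_pretrained("gemma3-intent-merged")
AutoTokenizer.from_pretrained("google/gemma-3-4b-it").save_pretrained("gemma3-intent-merged")

# 2. Convert to GGUF, then quantize to Q4_K_M, assuming a local llama.cpp checkout.
subprocess.run(["python", "llama.cpp/convert_hf_to_gguf.py", "gemma3-intent-merged",
                "--outfile", "gemma3-intent-f16.gguf", "--outtype", "f16"], check=True)
subprocess.run(["llama.cpp/build/bin/llama-quantize",
                "gemma3-intent-f16.gguf", "gemma3-intent-q4_k_m.gguf", "Q4_K_M"], check=True)
```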
Integration Patterns
React Native + Local Model
Use llama.rn for React Native integration. The library wraps llama.cpp and provides a JavaScript API:
- Bundle the GGUF file with your app (or download on first launch)
- Initialize the model on app startup (takes 2-4 seconds on iPhone 15)
- Run inference through the JS bridge
- Model stays loaded in memory until the app is backgrounded
Typical classification latency: 80-150ms including bridge overhead. For comparison, an API call takes 600-1,200ms minimum.
Native iOS with Core ML
Convert your fine-tuned model to Core ML format (typically from the original checkpoint via Apple's coremltools, rather than from the GGUF). Core ML models run on the Apple Neural Engine, which is faster and more power-efficient than GPU inference:
- iPhone 15 Pro: 32-35 t/s on ANE vs 22-25 t/s on GPU
- Battery impact: ANE uses roughly 40% less power than GPU for the same workload
- Memory: Core ML manages model memory more efficiently than llama.cpp
The trade-off: Core ML conversion is an extra step, and you lose some flexibility (no dynamic quantization, fixed batch size). Worth it for production iOS apps; not worth it for prototyping.
Native Android with NNAPI
On Android, use llama.cpp's NNAPI backend for hardware-accelerated inference on Qualcomm, MediaTek, and Samsung NPUs:
- Pixel 8 Pro (Tensor G3): 22-25 t/s on NPU
- Samsung S24 (Snapdragon 8 Gen 3): 26-30 t/s on NPU
- Battery: NPU inference uses 50-60% less power than CPU
NNAPI support varies by device. Always include a CPU fallback path.
Browser via WebLLM
WebLLM runs quantized models in the browser using WebGPU:
- Chrome on M2 MacBook: 15-18 t/s
- Chrome on gaming laptop (RTX 4060): 20-25 t/s
- Chrome on iPhone 15 Pro: 8-10 t/s
The model downloads once and caches in the browser's storage (IndexedDB). Subsequent loads are near-instant. Good for web apps that need offline capability or data privacy.
Raspberry Pi via llama.cpp
For IoT and edge deployments, run llama.cpp directly on the Pi:
- Raspberry Pi 5 (8 GB): Gemma 3 4B at Q4_K_M, 6-7 t/s
- Raspberry Pi 5 (4 GB): Gemma 3 4B at Q4_0, 5-6 t/s (tight fit)
- Raspberry Pi 4 (8 GB): Gemma 3 4B at Q4_0, 2-3 t/s (usable for batch processing)
For the Pi 4, consider the Gemma 3 1B model instead — it runs at 8-10 t/s at Q4_K_M and handles simple classification reliably.
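For a sense of what the gateway code looks like, here is a minimal classification loop using the llama-cpp-python bindings (the GGUF path, thread count, and prompt format are assumptions carried over from the earlier examples, and the chat formatting relies on the template embedded in the GGUF):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Illustrative file name and thread count for a Raspberry Pi 5.
llm = Llama(model_path="gemma3-intent-q4_k_m.gguf", n_ctx=512, n_threads=4)

def classify(message: str) -> str:
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": f"Classify intent\n{message}"}],
        max_tokens=8,      # single-label outputs need only a few tokens
        temperature=0.0,   # deterministic classification
    )
    return out["choices"][0]["message"]["content"].strip()

print(classify("where tf is my package"))  # expected label: order_status
```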
Performance Benchmarks on Real Devices
Fine-tuned Gemma 3 4B (Q4_K_M) on a 12-category intent classification task (500 training examples):
| Device | Accuracy | Latency (avg) | Latency (P99) | Tokens/sec |
|---|---|---|---|---|
| iPhone 15 Pro | 94% | 65ms | 120ms | 28 t/s |
| Pixel 8 Pro | 94% | 85ms | 160ms | 22 t/s |
| Samsung S24 | 94% | 72ms | 135ms | 26 t/s |
| M2 MacBook Air | 94% | 32ms | 55ms | 48 t/s |
| Raspberry Pi 5 | 94% | 280ms | 420ms | 6.4 t/s |
| Browser (Chrome, M2) | 94% | 110ms | 180ms | 12 t/s |
Accuracy is identical across devices because every platform runs the same Q4_K_M weights; the differences are purely latency and throughput. Every device delivers sub-second responses for classification, which is fast enough for real-time user-facing features.
For context, a GPT-4o API call for the same classification task averages 850ms with a P99 of 3,200ms. The on-device model running on a phone beats the cloud API on latency by roughly 10x.
Real Use Cases
Offline Form Validation
A field service app used Gemma 3 4B to validate and correct technician notes entered on tablets in areas with no cell coverage. The model checks spelling, flags missing required fields, and classifies the work type — all offline. Fine-tuned on 300 examples of technician notes, deployed at Q4_K_M on Android tablets. Accuracy: 92%. Latency: 120ms.
On-Device Intent Classification
A banking app uses Gemma 3 4B to classify customer messages into intents before routing to the appropriate service. Running on-device means the message text never leaves the phone — a compliance requirement in several jurisdictions. Fine-tuned on 400 examples, deployed via Core ML on iOS. Accuracy: 95%. Latency: 55ms.
IoT Sensor Analysis
An industrial monitoring system runs Gemma 3 4B on Raspberry Pi 5 gateways to classify sensor readings as normal, warning, or critical. Each gateway processes data from 12 sensors. Fine-tuned on 500 examples of sensor data patterns, deployed at Q4_0. Accuracy: 97% on the three-way classification. Processing: 4 readings/second per gateway.
Privacy-First Text Processing
A healthcare note-taking app uses Gemma 3 4B to structure and categorize clinical notes on the provider's iPad. No patient data leaves the device. Fine-tuned on 350 de-identified examples. HIPAA compliance is simplified because the AI processing happens entirely on-device — no cloud, no API, no BAA required for the inference step.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Further Reading
- Edge AI and Local Inference in 2026 — The landscape of on-device AI deployment, including hardware, frameworks, and real-world performance data.
- AI Agents Offline: Edge Fine-Tuned — How to build AI agents that work without internet connectivity using fine-tuned edge models.
- LoRA Adapter Edge Deployment Optimization — Optimizing LoRA adapters for minimal footprint on edge devices.