
    Fine-Tuning Gemma 3: Google's Lightweight Model for On-Device Deployment

    Gemma 3 is optimized for on-device inference — phones, tablets, edge hardware. Here's how to fine-tune it for mobile AI features and IoT applications that run without a server.

    Ertas Team

    Running AI on a phone, a Raspberry Pi, or an IoT gateway — without ever hitting a server — changes what's possible. No latency from network round-trips. No API costs that scale with users. No dependency on internet connectivity. Complete data privacy because nothing leaves the device.

    Gemma 3 is Google's open model family built specifically for this. While Llama and Qwen were designed for server-side inference and then squeezed onto smaller hardware, Gemma 3 was architected from the ground up for resource-constrained environments. The result is a model that runs faster, uses less memory, and handles on-device constraints more gracefully than its competitors at equivalent parameter counts.

    The 4B model is the target for most on-device deployments. At Q4_K_M quantization the weights come in around 2.5 GB, with peak RAM near 3.5 GB, which is well within the capabilities of a modern smartphone, a Raspberry Pi 5, or a browser tab. And when you fine-tune it for a specific task, it can do that task as well as models 5x its size.

    Gemma 3 Model Sizes

    | Model | Parameters | Size (Q4_K_M) | RAM Required | Target Deployment |
    | --- | --- | --- | --- | --- |
    | Gemma 3 1B | 1B | 0.7 GB | 1.2 GB | Microcontrollers, wearables, ultra-constrained |
    | Gemma 3 4B | 4B | 2.5 GB | 3.5 GB | Phones, tablets, Raspberry Pi, browser |
    | Gemma 3 12B | 12B | 7.5 GB | 9 GB | Laptops, desktop apps, edge servers |
    | Gemma 3 27B | 27B | 16 GB | 19 GB | Workstations, GPU servers |

    The 4B model hits the sweet spot for on-device. It's large enough to handle meaningful tasks (classification, extraction, simple generation, intent detection) while small enough to run on hardware people already own.

    The 1B model is for extreme constraints — wearables, embedded systems, or scenarios where you need the absolute smallest footprint. It handles simple classification and short-form tasks but struggles with anything requiring more than basic pattern matching.

    Why Gemma for On-Device

    Architecture Optimizations

    Google designed Gemma 3 with on-device inference in mind:

    • Sliding window attention on alternating layers reduces memory usage by 30-40% compared to full attention at equivalent context lengths. For on-device, this means you can process longer inputs without running out of RAM.
    • Grouped query attention (GQA) with a 1:4 ratio compresses the KV cache, reducing memory allocation during inference. On a phone with 6 GB RAM, this is the difference between running and crashing.
    • RMSNorm with learnable scale instead of LayerNorm — marginally faster per layer, which adds up over billions of operations on CPU/NPU hardware.
    • Logit soft-capping stabilizes output probabilities, reducing the chance of degenerate outputs on quantized models. When you're running at Q4_0 on a phone NPU, this stability matters.

    Inference Speed Comparison

    Gemma 3 4B vs comparable models on different hardware, all at Q4_K_M:

    | Hardware | Gemma 3 4B | Qwen 2.5 3B | Phi-3.5 Mini 3.8B | Llama 3.2 3B |
    | --- | --- | --- | --- | --- |
    | iPhone 15 Pro (ANE) | 28 t/s | 22 t/s | 19 t/s | 24 t/s |
    | Pixel 8 Pro (GPU) | 22 t/s | 17 t/s | 15 t/s | 19 t/s |
    | Raspberry Pi 5 (8GB, CPU) | 6.4 t/s | 5.1 t/s | 4.2 t/s | 5.5 t/s |
    | M2 MacBook Air (GPU) | 48 t/s | 38 t/s | 33 t/s | 41 t/s |
    | Browser (WebLLM, Chrome) | 12 t/s | 9 t/s | 8 t/s | 10 t/s |

    Gemma 3 is 15-30% faster than equivalent-size models across all on-device targets. On the iPhone's Apple Neural Engine, the advantage is particularly pronounced — Google optimized the weight layout for Apple's ML hardware.

    Memory Footprint

    | Model | Q4_K_M Size | Peak RAM (2K context) | Peak RAM (4K context) |
    | --- | --- | --- | --- |
    | Gemma 3 4B | 2.5 GB | 3.2 GB | 3.8 GB |
    | Qwen 2.5 3B | 2.0 GB | 2.8 GB | 3.5 GB |
    | Phi-3.5 Mini 3.8B | 2.3 GB | 3.4 GB | 4.3 GB |
    | Llama 3.2 3B | 2.0 GB | 2.9 GB | 3.6 GB |

    Gemma 3 4B has a slightly larger base footprint than the 3B models (it simply has more parameters), but its KV cache efficiency means the gap narrows at longer context lengths. At 4K context, Gemma uses less RAM than Phi-3.5 Mini despite having roughly 200M more parameters.

    Fine-Tuning the 4B Model

    VRAM Requirements

    | Configuration | VRAM |
    | --- | --- |
    | QLoRA (rank 16, 4-bit base) | 6 GB |
    | QLoRA (rank 32, 4-bit base) | 8 GB |
    | LoRA (rank 16, FP16 base) | 12 GB |
    | Full fine-tuning (FP16) | 18 GB |

    You can fine-tune Gemma 3 4B with QLoRA on an RTX 3060 12GB, an RTX 4060 8GB, or an M1 MacBook Pro with 16 GB unified memory. Training is fast: a run over 500 examples typically completes in 12-18 minutes at rank 16.
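    For orientation, here's a minimal QLoRA setup sketch using the Hugging Face stack (transformers, bitsandbytes, peft). The model ID, Auto class, and LoRA hyperparameters are assumptions for illustration and may differ depending on your transformers version and which Gemma 3 variant you pull:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "google/gemma-3-4b-it"  # assumed Hub ID for the instruction-tuned 4B checkpoint

    # 4-bit NF4 quantization of the base weights (the "QLoRA, rank 16, 4-bit base" row above)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )

    # Rank-16 LoRA adapters on the attention projections (illustrative choice of target modules)
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()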

    Dataset Strategies for On-Device Tasks

    On-device tasks have specific constraints that should shape your training data:

    Short inputs, concise outputs. On-device, every token costs latency and memory. Train the model to produce minimal, structured outputs:

    {"instruction": "Classify intent", "input": "Where's my order?", "output": "order_status"}
    {"instruction": "Classify intent", "input": "I want to cancel", "output": "cancellation"}
    {"instruction": "Classify intent", "input": "How do I change my payment method?", "output": "account_settings"}
    

    Not this:

    {"instruction": "Classify the customer's intent from the following message", "input": "Where's my order?", "output": "The customer's intent is to check their order status. Category: order_status"}
    

    The verbose version wastes tokens on both input (longer instruction) and output (explanation nobody asked for). On a phone generating at 28 t/s, every unnecessary token adds 36ms of latency.

    Optimize for latency-critical patterns. If the model needs to respond within 200ms, your training outputs should be under 15 tokens. Design your task format accordingly. Single-label classification (1 token output) is ideal for on-device. Short JSON objects (5-10 tokens) are acceptable. Paragraph-length generation is not a good fit for latency-sensitive on-device use.
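    One way to enforce that budget is to check output lengths while assembling the dataset. A small sketch, with hypothetical file names:

    import json
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")  # assumed Hub ID
    MAX_OUTPUT_TOKENS = 15

    kept, dropped = [], 0
    with open("train.jsonl") as f:  # hypothetical raw dataset
        for line in f:
            example = json.loads(line)
            n_tokens = len(tokenizer(example["output"], add_special_tokens=False)["input_ids"])
            if n_tokens <= MAX_OUTPUT_TOKENS:
                kept.append(example)
            else:
                dropped += 1  # rewrite or discard anything over budget

    with open("train_short.jsonl", "w") as f:
        for example in kept:
            f.write(json.dumps(example) + "\n")

    print(f"kept {len(kept)} examples, dropped {dropped} over the {MAX_OUTPUT_TOKENS}-token budget")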

    Include edge cases from real device usage. Mobile users type differently than desktop users — more typos, more abbreviations, more informal language. Include messy real-world inputs in your training data:

    {"instruction": "Classify intent", "input": "cant login pls help", "output": "authentication"}
    {"instruction": "Classify intent", "input": "where tf is my package", "output": "order_status"}
    {"instruction": "Classify intent", "input": "refudn pls", "output": "refund"}
    

    Training Configuration

    Recommended settings for Gemma 3 4B on-device fine-tuning:

    | Parameter | Value | Notes |
    | --- | --- | --- |
    | LoRA rank | 16 | Sufficient for classification/extraction |
    | Learning rate | 2e-4 | Standard |
    | Epochs | 5-6 | Small models need more passes |
    | Batch size | 8 | Smaller model allows larger batches |
    | Max seq length | 512 | Keep short for on-device tasks |
    | Warmup ratio | 0.1 | Slightly higher for stability |

    Setting max sequence length to 512 instead of the default 2048 significantly speeds up training and produces a model optimized for the short inputs typical of on-device use. If your on-device task involves longer documents, increase to 1024 or 2048 as needed.
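    Wired into trl's SFTTrainer, those settings look roughly like the sketch below (continuing from the QLoRA sketch above). Argument names such as max_seq_length and processing_class have shifted between trl releases, so treat this as a sketch rather than copy-paste configuration:

    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    dataset = load_dataset("json", data_files="train_short.jsonl", split="train")

    # Flatten instruction/input/output into a single text field for supervised fine-tuning
    def to_text(example):
        return {
            "text": f"Instruction: {example['instruction']}\n"
                    f"Input: {example['input']}\n"
                    f"Output: {example['output']}"
        }

    dataset = dataset.map(to_text)

    training_args = SFTConfig(
        output_dir="gemma3-4b-intent-lora",
        learning_rate=2e-4,
        num_train_epochs=5,
        per_device_train_batch_size=8,
        warmup_ratio=0.1,
        max_seq_length=512,   # renamed in some trl releases; check your version
        logging_steps=10,
        bf16=True,
    )

    trainer = SFTTrainer(
        model=model,                  # the PEFT-wrapped model from the earlier sketch
        args=training_args,
        train_dataset=dataset,
        processing_class=tokenizer,   # older trl versions call this argument `tokenizer`
    )
    trainer.train()
    trainer.save_model("gemma3-4b-intent-lora")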

    GGUF Export and Quantization for Mobile

    After fine-tuning, export to GGUF format. The quantization level you choose depends on your deployment target:

    | Quantization | Model Size | Quality Loss | Best For |
    | --- | --- | --- | --- |
    | Q4_0 | 2.1 GB | 3-4% | Smallest footprint, memory-constrained devices |
    | Q4_K_M | 2.5 GB | 1.5-2% | Good balance, most mobile deployments |
    | Q5_K_M | 2.9 GB | 0.5-1% | Higher quality, devices with 4+ GB available RAM |
    | Q8_0 | 4.2 GB | Less than 0.5% | Near-lossless, laptops and desktops |

    For mobile apps on phones from the last 2-3 years (6+ GB RAM), Q4_K_M is the default recommendation. It's small enough to fit alongside the OS and other apps, fast enough for real-time responses, and the quality loss is negligible for classification and extraction tasks.

    For Raspberry Pi or other memory-constrained edge devices, Q4_0 saves 400 MB of RAM, which can be the difference between running and not running. The quality drop is acceptable for simple classification tasks.

    For browser deployment via WebLLM, Q4_0 or Q4_K_M — the model needs to download to the browser and fit in GPU memory (which is shared with the rest of the browser). Smaller is better.
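    A rough sketch of the export path: merge the LoRA adapter, convert with llama.cpp's conversion script, then quantize. Script locations and output file names are assumptions based on a local llama.cpp checkout:

    import subprocess
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # 1. Merge the LoRA adapter back into the base weights
    merged = AutoPeftModelForCausalLM.from_pretrained("gemma3-4b-intent-lora").merge_and_unload()
    merged.save_pretrained("gemma3-4b-intent-merged")
    AutoTokenizer.from_pretrained("gemma3-4b-intent-lora").save_pretrained("gemma3-4b-intent-merged")

    # 2. Convert the merged Hugging Face checkpoint to an FP16 GGUF file
    subprocess.run([
        "python", "llama.cpp/convert_hf_to_gguf.py", "gemma3-4b-intent-merged",
        "--outfile", "gemma3-4b-intent-f16.gguf", "--outtype", "f16",
    ], check=True)

    # 3. Quantize to Q4_K_M for mobile (swap in Q4_0 for tighter memory budgets)
    subprocess.run([
        "llama.cpp/build/bin/llama-quantize",
        "gemma3-4b-intent-f16.gguf", "gemma3-4b-intent-q4_k_m.gguf", "Q4_K_M",
    ], check=True)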

    Integration Patterns

    React Native + Local Model

    Use llama.rn for React Native integration. The library wraps llama.cpp and provides a JavaScript API:

    1. Bundle the GGUF file with your app (or download on first launch)
    2. Initialize the model on app startup (takes 2-4 seconds on iPhone 15)
    3. Run inference through the JS bridge
    4. Model stays loaded in memory until the app is backgrounded

    Typical classification latency: 80-150ms including bridge overhead. For comparison, an API call takes 600-1,200ms minimum.

    Native iOS with Core ML

    Convert the fine-tuned model to Core ML format from the merged checkpoint (not the GGUF); Apple's coremltools handles the conversion from PyTorch. Core ML models can run on the Apple Neural Engine, which is faster and more power-efficient than GPU inference:

    • iPhone 15 Pro: 32-35 t/s on ANE vs 22-25 t/s on GPU
    • Battery impact: ANE uses roughly 40% less power than GPU for the same workload
    • Memory: Core ML manages model memory more efficiently than llama.cpp

    The trade-off: Core ML conversion is an extra step, and you lose some flexibility (no dynamic quantization, fixed batch size). Worth it for production iOS apps; not worth it for prototyping.

    Native Android with NNAPI

    On Android, use llama.cpp's NNAPI backend for hardware-accelerated inference on Qualcomm, MediaTek, and Samsung NPUs:

    • Pixel 8 Pro (Tensor G3): 22-25 t/s on NPU
    • Samsung S24 (Snapdragon 8 Gen 3): 26-30 t/s on NPU
    • Battery: NPU inference uses 50-60% less power than CPU

    NNAPI support varies by device. Always include a CPU fallback path.

    Browser via WebLLM

    WebLLM runs quantized models in the browser using WebGPU:

    • Chrome on M2 MacBook: 15-18 t/s
    • Chrome on gaming laptop (RTX 4060): 20-25 t/s
    • Chrome on iPhone 15 Pro: 8-10 t/s

    The model downloads once and caches in the browser's storage (IndexedDB). Subsequent loads are near-instant. Good for web apps that need offline capability or data privacy.

    Raspberry Pi via llama.cpp

    For IoT and edge deployments, run llama.cpp directly on the Pi:

    • Raspberry Pi 5 (8 GB): Gemma 3 4B at Q4_K_M, 6-7 t/s
    • Raspberry Pi 5 (4 GB): Gemma 3 4B at Q4_0, 5-6 t/s (tight fit)
    • Raspberry Pi 4 (8 GB): Gemma 3 4B at Q4_0, 2-3 t/s (usable for batch processing)

    For the Pi 4, consider the Gemma 3 1B model instead — it runs at 8-10 t/s at Q4_K_M and handles simple classification reliably.
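    For a Pi deployment, the llama-cpp-python bindings are a straightforward way to serve the quantized model from Python. A minimal sketch, with file names continuing from the export sketch above:

    from llama_cpp import Llama

    llm = Llama(
        model_path="gemma3-4b-intent-q4_k_m.gguf",
        n_ctx=512,      # matches the short-context training setup
        n_threads=4,    # one thread per core on the Pi 5
    )

    def classify(message: str) -> str:
        # Prompt format must mirror the training data layout
        prompt = f"Instruction: Classify intent\nInput: {message}\nOutput:"
        result = llm(prompt, max_tokens=8, temperature=0.0, stop=["\n"])
        return result["choices"][0]["text"].strip()

    print(classify("where tf is my package"))  # expected: order_status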

    Performance Benchmarks on Real Devices

    Fine-tuned Gemma 3 4B (Q4_K_M) on a 12-category intent classification task (500 training examples):

    | Device | Accuracy | Latency (avg) | Latency (P99) | Tokens/sec |
    | --- | --- | --- | --- | --- |
    | iPhone 15 Pro | 94% | 65ms | 120ms | 28 t/s |
    | Pixel 8 Pro | 94% | 85ms | 160ms | 22 t/s |
    | Samsung S24 | 94% | 72ms | 135ms | 26 t/s |
    | M2 MacBook Air | 94% | 32ms | 55ms | 48 t/s |
    | Raspberry Pi 5 | 94% | 280ms | 420ms | 6.4 t/s |
    | Browser (Chrome, M2) | 94% | 110ms | 180ms | 12 t/s |

    Accuracy is identical across devices because they all run the same Q4_K_M weights; the differences are purely in latency and throughput. Every device delivers sub-second responses for classification, which is fast enough for real-time user-facing features.

    For context, a GPT-4o API call for the same classification task averages 850ms with a P99 of 3,200ms. On the phones tested, the local model beats the hosted API on latency by roughly 10x.
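    A harness along these lines can measure accuracy and latency for your own fine-tune; the held-out file name is hypothetical, and classify() comes from the Raspberry Pi sketch above:

    import json
    import statistics
    import time

    examples = [json.loads(line) for line in open("eval.jsonl")]  # hypothetical held-out set

    latencies, correct = [], 0
    for example in examples:
        start = time.perf_counter()
        prediction = classify(example["input"])
        latencies.append((time.perf_counter() - start) * 1000)
        correct += prediction == example["output"]

    latencies.sort()
    p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
    print(f"accuracy: {correct / len(examples):.1%}")
    print(f"avg latency: {statistics.mean(latencies):.0f}ms, P99: {p99:.0f}ms")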

    Real Use Cases

    Offline Form Validation

    A field service app used Gemma 3 4B to validate and correct technician notes entered on tablets in areas with no cell coverage. The model checks spelling, flags missing required fields, and classifies the work type — all offline. Fine-tuned on 300 examples of technician notes, deployed at Q4_K_M on Android tablets. Accuracy: 92%. Latency: 120ms.

    On-Device Intent Classification

    A banking app uses Gemma 3 4B to classify customer messages into intents before routing to the appropriate service. Running on-device means the message text never leaves the phone — a compliance requirement in several jurisdictions. Fine-tuned on 400 examples, deployed via Core ML on iOS. Accuracy: 95%. Latency: 55ms.

    IoT Sensor Analysis

    An industrial monitoring system runs Gemma 3 4B on Raspberry Pi 5 gateways to classify sensor readings as normal, warning, or critical. Each gateway processes data from 12 sensors. Fine-tuned on 500 examples of sensor data patterns, deployed at Q4_0. Accuracy: 97% on the three-way classification. Processing: 4 readings/second per gateway.

    Privacy-First Text Processing

    A healthcare note-taking app uses Gemma 3 4B to structure and categorize clinical notes on the provider's iPad. No patient data leaves the device. Fine-tuned on 350 de-identified examples. HIPAA compliance is simplified because the AI processing happens entirely on-device — no cloud, no API, no BAA required for the inference step.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
