
Llama 3.2 for Mobile Apps: Fine-Tuning and On-Device Deployment
A complete guide to using Meta's Llama 3.2 1B and 3B models in mobile apps. Fine-tuning with LoRA, exporting to GGUF, and deploying on iOS and Android via llama.cpp.
Meta's Llama 3.2 includes 1B and 3B models designed specifically for mobile and edge deployment. These are not scaled-down afterthoughts. They were purpose-built for on-device inference, distilled from the larger Llama 3.1 models to retain capability while fitting in mobile memory budgets.
This guide covers the full pipeline: selecting the right size, fine-tuning on your data, exporting to GGUF, and deploying on iOS and Android.
Why Llama 3.2 for Mobile
Llama 3.2 1B and 3B have several advantages for mobile deployment:
Designed for mobile: Unlike larger models that are squeezed down, these were trained from the start with mobile constraints in mind. The architecture is optimized for fast inference on limited hardware.
Strong base capability: Trained on 9 trillion tokens. The 3B model scores 63.4 on MMLU and 77.4 on IFEval (instruction following), making it competitive with models 2-3x its size from just a year earlier.
Largest ecosystem: Llama has the biggest community of any open model family. More fine-tuning guides, more GGUF conversions, more tooling support, and more production deployment examples than any alternative.
128K context: Both 1B and 3B support 128K token context windows. For mobile, you will rarely use more than 2-4K, but the long context is there if needed.
Choosing 1B vs 3B
| Factor | 1B | 3B |
|---|---|---|
| GGUF Q4 size | ~600MB | ~1.7GB |
| RAM during inference | ~800MB | ~2.2GB |
| Device coverage | 4GB+ (90% of phones) | 6GB+ (65% of phones) |
| Generation speed (flagship) | 35-50 tok/s | 18-30 tok/s |
| Classification accuracy (fine-tuned) | 90-94% | 93-96% |
| Chat quality (fine-tuned) | Good for short responses | Good for multi-turn |
| Summarization | Adequate | Good |
Choose 1B when: Your task is classification, tagging, autocomplete, smart suggestions, or short-form generation. You want maximum device coverage.
Choose 3B when: Your task is conversational chat, summarization, content drafting, or complex instruction following. Your users have newer devices.
Fine-Tuning with LoRA
LoRA (Low-Rank Adaptation) is the standard fine-tuning method for mobile models. Instead of modifying all model weights, LoRA trains small adapter matrices that adjust the model's behavior. The adapter is 50-200MB, merged into the base model before GGUF export.
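For reference, attaching a LoRA adapter to Llama 3.2 with Hugging Face's peft library takes only a few lines. The sketch below is a minimal illustration, not a full training script; the model ID, rank, and target modules are reasonable defaults you should adjust for your task:

```python
# Minimal sketch: attach a LoRA adapter to Llama 3.2 1B with Hugging Face peft.
# Model ID, rank, and target modules are illustrative defaults, not requirements.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

lora = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,              # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # the adapter is a small fraction of the base weights
```

Printing the trainable parameter count makes the point concrete: only the adapter matrices are updated during training, while the base weights stay frozen.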
Training Data Format
Llama 3.2 uses a specific chat template. Your training data should follow this format:
```json
{
  "messages": [
    {"role": "system", "content": "You are a travel assistant for TripHelper app."},
    {"role": "user", "content": "What's the best time to visit Kyoto?"},
    {"role": "assistant", "content": "March-April for cherry blossoms or November for autumn foliage. Both are peak seasons, so book 2-3 months ahead."}
  ]
}
```
Each training example is a conversation. Include the system prompt if your app uses one. Include multi-turn examples if your feature is conversational.
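Most training frameworks apply the Llama 3.2 chat template for you, but it is worth checking what the model actually sees. A minimal sketch using the transformers tokenizer (the model ID is an assumption; use whichever Llama 3.2 variant you fine-tune):

```python
# Sketch: render one training example through the Llama 3.2 chat template
# to inspect the exact string the model is trained on.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

messages = [
    {"role": "system", "content": "You are a travel assistant for TripHelper app."},
    {"role": "user", "content": "What's the best time to visit Kyoto?"},
    {"role": "assistant", "content": "March-April for cherry blossoms or November for autumn foliage."},
]

rendered = tokenizer.apply_chat_template(messages, tokenize=False)
print(rendered)  # shows the header and end-of-turn markers the model expects
```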
Data Requirements
| Task | Minimum Examples | Recommended | Training Time (LoRA) |
|---|---|---|---|
| Classification | 200 | 500-1,000 | 15-30 min |
| Short Q&A | 300 | 1,000-2,000 | 30-60 min |
| Chat | 500 | 2,000-5,000 | 1-3 hours |
| Summarization | 300 | 1,000-3,000 | 1-2 hours |
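Before committing GPU time, run a quick sanity check that your file parses and you actually have enough examples. A small sketch, assuming one JSON object per line in a file named train.jsonl (the path and the last-turn check are assumptions about your setup):

```python
# Sketch: count examples and catch malformed records in a JSONL training file.
# Assumes the messages format shown above; "train.jsonl" is a placeholder path.
import json

valid, bad = 0, 0
with open("train.jsonl") as f:
    for line_no, line in enumerate(f, 1):
        try:
            example = json.loads(line)
            roles = [m["role"] for m in example["messages"]]
            assert roles and roles[-1] == "assistant"  # every example ends with a model turn
            valid += 1
        except (ValueError, KeyError, TypeError, AssertionError):
            bad += 1
            print(f"line {line_no}: malformed example")

print(f"{valid} valid examples, {bad} malformed")
```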
LoRA Hyperparameters
Standard settings for Llama 3.2 mobile fine-tuning:
| Parameter | 1B | 3B |
|---|---|---|
| LoRA rank (r) | 16-32 | 16-64 |
| LoRA alpha | 32-64 | 32-128 |
| Learning rate | 2e-4 | 1e-4 |
| Epochs | 3-5 | 2-4 |
| Batch size | 4-8 | 2-4 |
| Target modules | q_proj, v_proj, k_proj, o_proj | Same |
Higher rank captures more domain-specific knowledge but increases adapter size and training time. For most mobile use cases, rank 16-32 is sufficient.
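As a rough translation of the 1B column into code, here is what those settings might look like as transformers TrainingArguments. The values are starting points rather than tuned results, and the output directory is a placeholder; pass the arguments to whatever supervised fine-tuning trainer you use (for example trl's SFTTrainer) together with the LoraConfig shown earlier:

```python
# Sketch: 1B-column hyperparameters from the table as transformers TrainingArguments.
# Values are starting points, not tuned results; output_dir is a placeholder.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama32-1b-lora",
    learning_rate=2e-4,              # drop to 1e-4 for the 3B model
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size of 8
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,                       # assumes an Ampere-or-newer training GPU
)
```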
Exporting to GGUF
After training, the pipeline is:
- Merge the LoRA adapter into the base model weights
- Convert to GGUF format
- Quantize to Q4_K_M (or your target level)
- Validate the quantized model on your evaluation set
The GGUF file is the final artifact you ship to devices. It contains the complete model in a single file that llama.cpp can load directly.
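If you run the pipeline yourself, the steps map to a short script. The sketch below assumes a local llama.cpp checkout and uses placeholder paths; convert_hf_to_gguf.py and the llama-quantize binary both ship with llama.cpp:

```python
# Sketch: merge the LoRA adapter, then convert and quantize with llama.cpp tooling.
# Paths are placeholders; adjust to where your adapter and llama.cpp checkout live.
import subprocess
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
merged = PeftModel.from_pretrained(base, "llama32-1b-lora").merge_and_unload()
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct").save_pretrained("merged-model")

# Convert the merged Hugging Face checkpoint to GGUF (FP16), then quantize to Q4_K_M.
subprocess.run(["python", "llama.cpp/convert_hf_to_gguf.py", "merged-model",
                "--outfile", "model-f16.gguf", "--outtype", "f16"], check=True)
subprocess.run(["llama.cpp/build/bin/llama-quantize",
                "model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"], check=True)
```

Run your evaluation set against the Q4_K_M file rather than the FP16 intermediate; quantization can shave a point or two off accuracy, and the quantized file is what your users will actually run.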
Platforms like Ertas handle this end-to-end:
Upload training data, select Llama 3.2 1B or 3B as the base model, configure LoRA parameters (or use defaults), train on cloud GPUs, and export directly to GGUF. No command-line tools, no GPU setup, no conversion scripts.
Deploying on iOS
Integration with llama.cpp
Add llama.cpp to your iOS project via Swift Package Manager or as a compiled framework. Load the GGUF model and run inference:
```swift
import llama

// Locate the bundled GGUF file and load the model
let modelPath = Bundle.main.path(forResource: "model", ofType: "gguf")!
let params = llama_model_default_params()
let model = llama_load_model_from_file(modelPath, params)

// Configure inference: 2K context, 4 CPU threads
var contextParams = llama_context_default_params()
contextParams.n_ctx = 2048
contextParams.n_threads = 4
let context = llama_new_context_with_model(model, contextParams)
```
Metal GPU Acceleration
llama.cpp automatically uses Metal on iOS for GPU-accelerated inference. Set n_gpu_layers to at least the model's layer count (Llama 3.2 1B has 16 transformer layers, 3B has 28) to offload all computation to the GPU:

```swift
var modelParams = llama_model_default_params()
modelParams.n_gpu_layers = 32 // covers both 1B (16 layers) and 3B (28 layers), so everything runs on Metal
```

Metal is enabled by default in recent llama.cpp builds for Apple platforms and typically provides a 30-50% speed improvement over CPU-only inference.
Deploying on Android
Integration with llama.android
Use the llama.android module from the llama.cpp project, which provides Kotlin bindings through JNI. A minimal usage sketch:
```kotlin
val model = LlamaModel()
model.load(modelPath, nThreads = 4, nGpuLayers = 32)

// Generate with streaming
model.generate(prompt) { token ->
    runOnUiThread { appendToUI(token) }
}
```
Vulkan GPU Acceleration
On Android, llama.cpp supports Vulkan for GPU acceleration on Snapdragon, Tensor, and other chipsets. The Vulkan backend is opt-in at build time, so compile llama.cpp with it enabled, then offload by setting nGpuLayers to the model's layer count, just as with Metal on iOS.
Licensing
Llama 3.2 uses Meta's Llama Community License Agreement. Key points:
- Commercial use: Allowed
- Modification: Allowed (including fine-tuning)
- Distribution: Allowed
- 700M MAU threshold: If your product or service has over 700 million monthly active users, you need a special license from Meta
- Attribution: Required
For the vast majority of mobile apps, the license is fully permissive. The 700M MAU clause only applies to the largest platforms.
End-to-End Timeline
| Step | Duration | Notes |
|---|---|---|
| Prepare training data | 1-5 days | Depends on data availability |
| Fine-tune with LoRA | 30 min - 3 hours | GPU-dependent, cloud training recommended |
| Export to GGUF | 10-30 minutes | Automated in most platforms |
| Integrate llama.cpp | 1-2 days | One-time setup |
| Testing and evaluation | 1-3 days | Test across target devices |
| Total | 3-10 days | First deployment; subsequent iterations are faster |
The first deployment takes the longest because of the llama.cpp integration. After that, model updates (re-training, re-exporting GGUF) take hours, not days.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.