    Llama 3.2 for Mobile Apps: Fine-Tuning and On-Device Deployment

    A complete guide to using Meta's Llama 3.2 1B and 3B models in mobile apps. Fine-tuning with LoRA, exporting to GGUF, and deploying on iOS and Android via llama.cpp.

    Ertas Team

    Meta's Llama 3.2 includes 1B and 3B models designed specifically for mobile and edge deployment. These are not scaled-down afterthoughts. They were purpose-built for on-device inference, distilled from the larger Llama 3.1 models to retain capability while fitting in mobile memory budgets.

    This guide covers the full pipeline: selecting the right size, fine-tuning on your data, exporting to GGUF, and deploying on iOS and Android.

    Why Llama 3.2 for Mobile

    Llama 3.2 1B and 3B have several advantages for mobile deployment:

    Designed for mobile: Unlike larger models that are squeezed down, these were trained from the start with mobile constraints in mind. The architecture is optimized for fast inference on limited hardware.

    Strong base capability: Trained on 9 trillion tokens. The 3B model scores 63.4 on MMLU and 77.4 on IFEval (instruction following), making it competitive with models 2-3x its size from just a year earlier.

    Largest ecosystem: Llama has the biggest community of any open model family. More fine-tuning guides, more GGUF conversions, more tooling support, and more production deployment examples than any alternative.

    128K context: Both 1B and 3B support 128K token context windows. For mobile, you will rarely use more than 2-4K, but the long context is there if needed.

    Choosing 1B vs 3B

    Factor                               | 1B                        | 3B
    GGUF Q4 size                         | ~600MB                    | ~1.7GB
    RAM during inference                 | ~800MB                    | ~2.2GB
    Device coverage                      | 4GB+ RAM (90% of phones)  | 6GB+ RAM (65% of phones)
    Generation speed (flagship)          | 35-50 tok/s               | 18-30 tok/s
    Classification accuracy (fine-tuned) | 90-94%                    | 93-96%
    Chat quality (fine-tuned)            | Good for short responses  | Good for multi-turn
    Summarization                        | Adequate                  | Good

    Choose 1B when: Your task is classification, tagging, autocomplete, smart suggestions, or short-form generation. You want maximum device coverage.

    Choose 3B when: Your task is conversational chat, summarization, content drafting, or complex instruction following. Your users have newer devices.

    Fine-Tuning with LoRA

    LoRA (Low-Rank Adaptation) is the standard fine-tuning method for mobile models. Instead of modifying all of the model's weights, LoRA trains small adapter matrices that adjust the model's behavior. The resulting adapter is typically 50-200MB and is merged back into the base model before GGUF export.

    Training Data Format

    Llama 3.2 uses a specific chat template. Your training data should follow this format:

    {
      "messages": [
        {"role": "system", "content": "You are a travel assistant for TripHelper app."},
        {"role": "user", "content": "What's the best time to visit Kyoto?"},
        {"role": "assistant", "content": "March-April for cherry blossoms or November for autumn foliage. Both are peak seasons, so book 2-3 months ahead."}
      ]
    }
    

    Each training example is a conversation. Include the system prompt if your app uses one. Include multi-turn examples if your feature is conversational.
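
    If you build the dataset yourself, you do not need to hand-write Llama's special tokens. The Hugging Face tokenizer can render the chat template from the messages list above. A minimal sketch, assuming the transformers library and an access token for the gated meta-llama/Llama-3.2-3B-Instruct repository:

    # Render one training example with the Llama 3.2 chat template.
    # Assumes the `transformers` library and access to the gated
    # meta-llama/Llama-3.2-3B-Instruct tokenizer.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

    messages = [
        {"role": "system", "content": "You are a travel assistant for TripHelper app."},
        {"role": "user", "content": "What's the best time to visit Kyoto?"},
        {"role": "assistant", "content": "March-April for cherry blossoms or November for autumn foliage."},
    ]

    # Produces the <|start_header_id|> / <|eot_id|> structure the model was trained on;
    # most fine-tuning tools apply this automatically when given the messages format.
    print(tokenizer.apply_chat_template(messages, tokenize=False))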

    Data Requirements

    Task           | Minimum Examples | Recommended   | Training Time (LoRA)
    Classification | 200              | 500-1,000     | 15-30 min
    Short Q&A      | 300              | 1,000-2,000   | 30-60 min
    Chat           | 500              | 2,000-5,000   | 1-3 hours
    Summarization  | 300              | 1,000-3,000   | 1-2 hours

    LoRA Hyperparameters

    Standard settings for Llama 3.2 mobile fine-tuning:

    Parameter       | 1B                             | 3B
    LoRA rank (r)   | 16-32                          | 16-64
    LoRA alpha      | 32-64                          | 32-128
    Learning rate   | 2e-4                           | 1e-4
    Epochs          | 3-5                            | 2-4
    Batch size      | 4-8                            | 2-4
    Target modules  | q_proj, v_proj, k_proj, o_proj | Same

    Higher rank captures more domain-specific knowledge but increases adapter size and training time. For most mobile use cases, rank 16-32 is sufficient.
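
    For reference, here is roughly how those settings translate into code. This is a minimal sketch assuming the Hugging Face transformers and peft libraries, not tied to any particular training stack; pick the concrete values from the table above.

    # Attach a LoRA adapter to Llama 3.2 3B using settings in the ranges above.
    # Assumes the `transformers` and `peft` libraries; dropout and exact values are illustrative.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

    lora_config = LoraConfig(
        r=32,                  # LoRA rank: 16-32 for 1B, 16-64 for 3B
        lora_alpha=64,         # commonly set to 2x the rank
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()  # the adapter is a small fraction of the 3B weights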

    Exporting to GGUF

    After training, the pipeline is:

    1. Merge the LoRA adapter into the base model weights
    2. Convert to GGUF format
    3. Quantize to Q4_K_M (or your target level)
    4. Validate the quantized model on your evaluation set

    The GGUF file is the final artifact you ship to devices. It contains the complete model in a single file that llama.cpp can load directly.
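
    If you run this pipeline yourself rather than through a platform, it maps to a short script plus two llama.cpp tools. A minimal sketch, assuming the transformers and peft libraries and a local llama.cpp checkout; the script and binary names (convert_hf_to_gguf.py, llama-quantize) and all paths vary by llama.cpp version and are illustrative here.

    # Merge the LoRA adapter, convert to GGUF, and quantize to Q4_K_M.
    import subprocess
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    BASE = "meta-llama/Llama-3.2-3B-Instruct"
    ADAPTER = "out/lora-adapter"   # illustrative paths
    MERGED = "out/merged-model"

    # 1. Merge the LoRA adapter into the base weights and save a plain HF checkpoint.
    base = AutoModelForCausalLM.from_pretrained(BASE)
    merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()
    merged.save_pretrained(MERGED)
    AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)

    # 2. Convert the merged checkpoint to an f16 GGUF file.
    subprocess.run(
        ["python", "llama.cpp/convert_hf_to_gguf.py", MERGED,
         "--outfile", "out/model-f16.gguf", "--outtype", "f16"],
        check=True,
    )

    # 3. Quantize to Q4_K_M for on-device use.
    subprocess.run(
        ["llama.cpp/build/bin/llama-quantize",
         "out/model-f16.gguf", "out/model-q4_k_m.gguf", "Q4_K_M"],
        check=True,
    )

    # 4. Re-run your evaluation set against out/model-q4_k_m.gguf before shipping.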

    Platforms like Ertas handle this end-to-end:

    Upload training data, select Llama 3.2 1B or 3B as the base model, configure LoRA parameters (or use defaults), train on cloud GPUs, and export directly to GGUF. No command-line tools, no GPU setup, no conversion scripts.

    Deploying on iOS

    Integration with llama.cpp

    Add llama.cpp to your iOS project via Swift Package Manager or as a compiled framework. Load the GGUF model and run inference:

    import llama
    
    let modelPath = Bundle.main.path(forResource: "model", ofType: "gguf")!
    let params = llama_model_default_params()
    let model = llama_load_model_from_file(modelPath, params)
    
    // Configure inference
    var contextParams = llama_context_default_params()
    contextParams.n_ctx = 2048
    contextParams.n_threads = 4
    let context = llama_new_context_with_model(model, contextParams)
    

    Metal GPU Acceleration

    llama.cpp automatically uses Metal on iOS for GPU-accelerated inference. Set n_gpu_layers to the model's total layer count to offload all computation to the GPU:

    var modelParams = llama_model_default_params()
    modelParams.n_gpu_layers = 32 // any value >= the layer count (16 for 1B, 28 for 3B) offloads all layers to Metal
    

    This is enabled by default in recent llama.cpp builds and provides a 30-50% speed improvement over CPU-only inference.

    Deploying on Android

    Integration with llama.android

    Use the llama.android library from the llama.cpp project. It provides Kotlin bindings through JNI:

    val model = LlamaModel()
    model.load(modelPath, nThreads = 4, nGpuLayers = 32)
    
    // Generate with streaming
    model.generate(prompt) { token ->
        runOnUiThread { appendToUI(token) }
    }
    

    Vulkan GPU Acceleration

    On Android, llama.cpp supports Vulkan for GPU acceleration on Snapdragon, Tensor, and other chipsets. Build llama.cpp with the Vulkan backend enabled, then set nGpuLayers to the model's layer count, the same way as with Metal on iOS.

    Licensing

    Llama 3.2 uses Meta's Llama Community License Agreement. Key points:

    • Commercial use: Allowed
    • Modification: Allowed (including fine-tuning)
    • Distribution: Allowed
    • 700M MAU threshold: If your product or service has over 700 million monthly active users, you need a special license from Meta
    • Attribution: Required

    For the vast majority of mobile apps, the license is fully permissive. The 700M MAU clause only applies to the largest platforms.

    End-to-End Timeline

    Step                   | Duration         | Notes
    Prepare training data  | 1-5 days         | Depends on data availability
    Fine-tune with LoRA    | 30 min - 3 hours | GPU-dependent, cloud training recommended
    Export to GGUF         | 10-30 minutes    | Automated in most platforms
    Integrate llama.cpp    | 1-2 days         | One-time setup
    Testing and evaluation | 1-3 days         | Test across target devices
    Total                  | 3-10 days        | First deployment; subsequent iterations are faster

    The first deployment takes the longest because of the llama.cpp integration. After that, model updates (re-training, re-exporting GGUF) take hours, not days.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

