    Llama 3.2 for Mobile Apps: Fine-Tuning and On-Device Deployment

    A complete guide to using Meta's Llama 3.2 1B and 3B models in mobile apps. Fine-tuning with LoRA, exporting to GGUF, and deploying on iOS and Android via llama.cpp.

    Ertas Team

    Meta's Llama 3.2 includes 1B and 3B models designed specifically for mobile and edge deployment. These are not scaled-down afterthoughts. They were purpose-built for on-device inference, distilled from the larger Llama 3.1 models to retain capability while fitting in mobile memory budgets.

    This guide covers the full pipeline: selecting the right size, fine-tuning on your data, exporting to GGUF, and deploying on iOS and Android.

    Why Llama 3.2 for Mobile

    Llama 3.2 1B and 3B have several advantages for mobile deployment:

    Designed for mobile: Unlike larger models that are squeezed down, these were trained from the start with mobile constraints in mind. The architecture is optimized for fast inference on limited hardware.

    Strong base capability: Trained on 9 trillion tokens. The 3B model scores 63.4 on MMLU and 77.4 on IFEval (instruction following), making it competitive with models 2-3x its size from just a year earlier.

    Largest ecosystem: Llama has the biggest community of any open model family. More fine-tuning guides, more GGUF conversions, more tooling support, and more production deployment examples than any alternative.

    128K context: Both 1B and 3B support 128K token context windows. For mobile, you will rarely use more than 2-4K, but the long context is there if needed.

    Choosing 1B vs 3B

    Factor                               | 1B                        | 3B
    GGUF Q4 size                         | ~600MB                    | ~1.7GB
    RAM during inference                 | ~800MB                    | ~2.2GB
    Device coverage                      | 4GB+ RAM (90% of phones)  | 6GB+ RAM (65% of phones)
    Generation speed (flagship)          | 35-50 tok/s               | 18-30 tok/s
    Classification accuracy (fine-tuned) | 90-94%                    | 93-96%
    Chat quality (fine-tuned)            | Good for short responses  | Good for multi-turn
    Summarization                        | Adequate                  | Good

    Choose 1B when: Your task is classification, tagging, autocomplete, smart suggestions, or short-form generation. You want maximum device coverage.

    Choose 3B when: Your task is conversational chat, summarization, content drafting, or complex instruction following. Your users have newer devices.

    Fine-Tuning with LoRA

    LoRA (Low-Rank Adaptation) is the standard fine-tuning method for mobile models. Instead of modifying all of the model's weights, LoRA trains small adapter matrices that adjust the model's behavior. The resulting adapter is typically 50-200MB and is merged back into the base model before GGUF export.

    Training Data Format

    Llama 3.2 uses a specific chat template. Your training data should follow this format:

    {
      "messages": [
        {"role": "system", "content": "You are a travel assistant for TripHelper app."},
        {"role": "user", "content": "What's the best time to visit Kyoto?"},
        {"role": "assistant", "content": "March-April for cherry blossoms or November for autumn foliage. Both are peak seasons, so book 2-3 months ahead."}
      ]
    }
    

    Each training example is a conversation. Include the system prompt if your app uses one. Include multi-turn examples if your feature is conversational.
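
    If you build the dataset yourself, you do not need to hand-write Llama's special tokens. The Hugging Face tokenizer can render the chat template from the messages list above. A minimal sketch, assuming the transformers library and an access token for the gated meta-llama/Llama-3.2-3B-Instruct repository:

    # Render one training example with the Llama 3.2 chat template.
    # Assumes the `transformers` library and access to the gated
    # meta-llama/Llama-3.2-3B-Instruct tokenizer.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

    messages = [
        {"role": "system", "content": "You are a travel assistant for TripHelper app."},
        {"role": "user", "content": "What's the best time to visit Kyoto?"},
        {"role": "assistant", "content": "March-April for cherry blossoms or November for autumn foliage."},
    ]

    # Produces the <|start_header_id|> / <|eot_id|> structure the model was trained on;
    # most fine-tuning tools apply this automatically when given the messages format.
    print(tokenizer.apply_chat_template(messages, tokenize=False))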

    Data Requirements

    Task           | Minimum Examples | Recommended   | Training Time (LoRA)
    Classification | 200              | 500-1,000     | 15-30 min
    Short Q&A      | 300              | 1,000-2,000   | 30-60 min
    Chat           | 500              | 2,000-5,000   | 1-3 hours
    Summarization  | 300              | 1,000-3,000   | 1-2 hours

    LoRA Hyperparameters

    Standard settings for Llama 3.2 mobile fine-tuning:

    Parameter       | 1B                             | 3B
    LoRA rank (r)   | 16-32                          | 16-64
    LoRA alpha      | 32-64                          | 32-128
    Learning rate   | 2e-4                           | 1e-4
    Epochs          | 3-5                            | 2-4
    Batch size      | 4-8                            | 2-4
    Target modules  | q_proj, v_proj, k_proj, o_proj | Same

    Higher rank captures more domain-specific knowledge but increases adapter size and training time. For most mobile use cases, rank 16-32 is sufficient.
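
    For reference, here is roughly how those settings translate into code. This is a minimal sketch assuming the Hugging Face transformers and peft libraries, not tied to any particular training stack; pick the concrete values from the table above.

    # Attach a LoRA adapter to Llama 3.2 3B using settings in the ranges above.
    # Assumes the `transformers` and `peft` libraries; dropout and exact values are illustrative.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

    lora_config = LoraConfig(
        r=32,                  # LoRA rank: 16-32 for 1B, 16-64 for 3B
        lora_alpha=64,         # commonly set to 2x the rank
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()  # the adapter is a small fraction of the 3B weights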

    Exporting to GGUF

    After training, the pipeline is:

    1. Merge the LoRA adapter into the base model weights
    2. Convert to GGUF format
    3. Quantize to Q4_K_M (or your target level)
    4. Validate the quantized model on your evaluation set

    The GGUF file is the final artifact you ship to devices. It contains the complete model in a single file that llama.cpp can load directly.
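
    If you run this pipeline yourself rather than through a platform, it maps to a short script plus two llama.cpp tools. A minimal sketch, assuming the transformers and peft libraries and a local llama.cpp checkout; the script and binary names (convert_hf_to_gguf.py, llama-quantize) and all paths vary by llama.cpp version and are illustrative here.

    # Merge the LoRA adapter, convert to GGUF, and quantize to Q4_K_M.
    import subprocess
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    BASE = "meta-llama/Llama-3.2-3B-Instruct"
    ADAPTER = "out/lora-adapter"   # illustrative paths
    MERGED = "out/merged-model"

    # 1. Merge the LoRA adapter into the base weights and save a plain HF checkpoint.
    base = AutoModelForCausalLM.from_pretrained(BASE)
    merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()
    merged.save_pretrained(MERGED)
    AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)

    # 2. Convert the merged checkpoint to an f16 GGUF file.
    subprocess.run(
        ["python", "llama.cpp/convert_hf_to_gguf.py", MERGED,
         "--outfile", "out/model-f16.gguf", "--outtype", "f16"],
        check=True,
    )

    # 3. Quantize to Q4_K_M for on-device use.
    subprocess.run(
        ["llama.cpp/build/bin/llama-quantize",
         "out/model-f16.gguf", "out/model-q4_k_m.gguf", "Q4_K_M"],
        check=True,
    )

    # 4. Re-run your evaluation set against out/model-q4_k_m.gguf before shipping.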

    Platforms like Ertas handle this end-to-end:

    Upload training data, select Llama 3.2 1B or 3B as the base model, configure LoRA parameters (or use defaults), train on cloud GPUs, and export directly to GGUF. No command-line tools, no GPU setup, no conversion scripts.

    Deploying on iOS

    Integration with llama.cpp

    Add llama.cpp to your iOS project via Swift Package Manager or as a compiled framework. Load the GGUF model and run inference:

    import llama
    
    let modelPath = Bundle.main.path(forResource: "model", ofType: "gguf")!
    let params = llama_model_default_params()
    let model = llama_load_model_from_file(modelPath, params)
    
    // Configure inference
    var contextParams = llama_context_default_params()
    contextParams.n_ctx = 2048
    contextParams.n_threads = 4
    let context = llama_new_context_with_model(model, contextParams)
    

    Metal GPU Acceleration

    llama.cpp automatically uses Metal on iOS for GPU-accelerated inference. Set n_gpu_layers to the model's total layer count to offload all computation to the GPU:

    var modelParams = llama_model_default_params()
    modelParams.n_gpu_layers = 32 // any value >= the layer count (16 for 1B, 28 for 3B) offloads all layers to Metal
    

    This is enabled by default in recent llama.cpp builds and provides a 30-50% speed improvement over CPU-only inference.

    Deploying on Android

    Integration with llama.android

    Use the llama.android library from the llama.cpp project. It provides Kotlin bindings through JNI:

    val model = LlamaModel()
    model.load(modelPath, nThreads = 4, nGpuLayers = 32)
    
    // Generate with streaming
    model.generate(prompt) { token ->
        runOnUiThread { appendToUI(token) }
    }
    

    Vulkan GPU Acceleration

    On Android, llama.cpp supports Vulkan for GPU acceleration on Snapdragon, Tensor, and other chipsets. Build llama.cpp with the Vulkan backend enabled, then set nGpuLayers to the model's layer count, the same way as with Metal on iOS.

    Licensing

    Llama 3.2 uses Meta's Llama Community License Agreement. Key points:

    • Commercial use: Allowed
    • Modification: Allowed (including fine-tuning)
    • Distribution: Allowed
    • 700M MAU threshold: If your product or service has over 700 million monthly active users, you need a special license from Meta
    • Attribution: Required

    For the vast majority of mobile apps, the license is fully permissive. The 700M MAU clause only applies to the largest platforms.

    End-to-End Timeline

    Step                   | Duration         | Notes
    Prepare training data  | 1-5 days         | Depends on data availability
    Fine-tune with LoRA    | 30 min - 3 hours | GPU-dependent, cloud training recommended
    Export to GGUF         | 10-30 minutes    | Automated in most platforms
    Integrate llama.cpp    | 1-2 days         | One-time setup
    Testing and evaluation | 1-3 days         | Test across target devices
    Total                  | 3-10 days        | First deployment; subsequent iterations are faster

    The first deployment takes the longest because of the llama.cpp integration. After that, model updates (re-training, re-exporting GGUF) take hours, not days.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

