
Fine-Tuning for App Developers: A Non-ML-Engineer's Guide
A practical guide to fine-tuning AI models for mobile app developers. Learn LoRA, QLoRA, and GGUF export without needing an ML background.
You build great apps. You know Swift, Kotlin, or React Native. You can ship a polished UI, wire up a REST API, and debug a race condition at 11pm. Now you want to add an AI feature: something genuinely useful, not a thin wrapper around GPT-4.
The problem: every fine-tuning tutorial assumes you know what a "gradient" is. They open with PyTorch imports. They reference "attention heads" like it's common knowledge. You close the tab.
This guide is different. It explains fine-tuning the way an app developer actually needs to understand it: focused on the pipeline, the practical decisions, and what things cost. You will not need to understand backpropagation. You will need to understand JSON.
What Fine-Tuning Actually Is
The easiest way to understand fine-tuning is to think about how autocomplete works on your phone.
Your keyboard's autocomplete started as a general model trained on billions of words. Over time, it adapted to you. It learned that you type "lgtm" at the end of code reviews, that you always follow "Hey" with "so", and that you misspell "necessary" in a specific way. That adaptation happened because the model saw your patterns and adjusted.
Fine-tuning an AI model is the same idea, applied deliberately. You take a general-purpose model (something like Llama 3.2, which can write essays, answer questions, and generate code) and show it hundreds of examples of exactly how you want it to behave in your app. After training, it responds in that specific way, reliably, every time.
The result is a model that does your task better than any amount of prompting could achieve, runs on-device with no API calls, and costs you nothing per inference.
LoRA: Why You Do Not Retrain the Whole Model
Before fine-tuning became practical for non-ML teams, the only option was full fine-tuning: updating every single parameter in the model. For a 7 billion parameter model, that means updating 7 billion numbers. The compute cost was enormous. A full fine-tuning run required expensive hardware and days of training time. Only large labs could afford it.
In 2021, researchers at Microsoft published LoRA (Low-Rank Adaptation, arXiv:2106.09685), later presented at ICLR 2022. The key insight: you do not need to update all the parameters to change how a model behaves. You can freeze the original model weights entirely and add a small set of new, trainable layers alongside them. These new layers are called a LoRA adapter.
Here is the practical outcome:
- LoRA trains only 0.1-1% of the parameters that full fine-tuning would update
- Training is dramatically faster and cheaper
- The adapter file is small: typically 50-200MB for a 7B model
- The original base model is untouched and reusable
Think of it like a plugin for the model. The base model is the app. Your LoRA adapter is the extension that makes it behave exactly the way your use case requires.
In 2023, QLoRA (arXiv:2305.14314, NeurIPS 2023) pushed this further. QLoRA combines LoRA with 4-bit quantization, which compresses the model's numbers to take up less memory during training. The result: you can fine-tune a 7 billion parameter model on a consumer GPU with 6-10GB of VRAM, or on a cloud instance that costs $3-$10 per run. That is within the budget of a solo app developer.
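The parameter savings are easy to sanity-check with arithmetic. For one 4096x4096 projection matrix (a typical size in a 7B-class model; the numbers here are illustrative), a rank-8 LoRA adapter swaps 16.8 million trainable numbers for about 65 thousand:

```python
# Back-of-the-envelope LoRA parameter arithmetic (illustrative sizes).
d_in, d_out, r = 4096, 4096, 8

full = d_in * d_out        # parameters full fine-tuning would update
lora = r * (d_in + d_out)  # LoRA trains two small matrices: A (r x d_in) and B (d_out x r)

print(f"full fine-tuning: {full:,} params")    # 16,777,216
print(f"LoRA adapter:     {lora:,} params")    # 65,536
print(f"fraction trained: {lora / full:.2%}")  # 0.39%
```

That per-matrix 0.39% is where the "0.1-1% of the parameters" figure comes from; the exact fraction depends on the rank you choose and which layers you attach adapters to.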
The 4-Step Pipeline
Fine-tuning a model for your app follows four steps. The rest of this guide walks through each one.
- Prepare your training data as a JSONL file
- Fine-tune on a cloud GPU using LoRA or QLoRA
- Export to GGUF with quantization for mobile
- Integrate in your iOS or Android app using llama.cpp bindings
That is it. No ML theory required at any step.
Step 1: Preparing Your Training Data
Your training data is the most important part of the pipeline. A small, high-quality dataset beats a large, messy one every time. Stanford's Alpaca project demonstrated this in 2023: 52,000 synthetic instruction-following examples, generated for roughly $500 in API costs, were enough to make LLaMA 7B follow instructions comparably to text-davinci-003 (a GPT-3.5-era model) on many tasks. Quality and format matter more than volume.
How Many Examples Do You Need?
OpenAI's own guidance recommends starting with 50-100 examples when fine-tuning their models. For open-source models with LoRA, 100-500 examples is a practical starting range for a focused task. More data helps, but diminishing returns set in quickly once the model has seen enough variation to generalize.
For your first fine-tuning run, aim for 100-300 examples. You can always add more later.
The Format
Training examples are stored in a JSONL file (JSON Lines), where each line is a separate JSON object. The standard format uses chat-style messages:
{"messages": [{"role": "system", "content": "You are a cooking assistant that suggests recipes based on available ingredients."}, {"role": "user", "content": "I have chicken, garlic, lemon, and rosemary."}, {"role": "assistant", "content": "Here are two recipes you can make right now:\n\n1. Lemon Garlic Roasted Chicken..."}]}
{"messages": [{"role": "system", "content": "You are a cooking assistant that suggests recipes based on available ingredients."}, {"role": "user", "content": "I have eggs, spinach, and feta."}, {"role": "assistant", "content": "Three quick options:\n\n1. Greek Scrambled Eggs..."}]}
Each line is one training example. The system message defines the model's role. The user message is the input. The assistant message is the ideal output you want the model to learn.
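Hand-writing JSON lines invites escaping mistakes (newlines and quotes inside responses). A safer pattern is to build the examples as data and let a JSON serializer handle the escaping; a minimal Python sketch using the cooking-assistant pairs from above:

```python
import json

SYSTEM = "You are a cooking assistant that suggests recipes based on available ingredients."

# (user input, ideal assistant output) pairs from the examples above
pairs = [
    ("I have chicken, garlic, lemon, and rosemary.",
     "Here are two recipes you can make right now:\n\n1. Lemon Garlic Roasted Chicken..."),
    ("I have eggs, spinach, and feta.",
     "Three quick options:\n\n1. Greek Scrambled Eggs..."),
]

with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for user, assistant in pairs:
        example = {"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]}
        # json.dumps escapes newlines and quotes, so each example stays on one line
        f.write(json.dumps(example) + "\n")
```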
Tips for High-Quality Training Data
Match your real inputs. If users in your app send short, casual messages, your training inputs should look like short, casual messages. If they send structured queries, train on structured queries.
Be consistent in your outputs. If some assistant responses are two sentences and others are ten paragraphs, the model will not learn a reliable pattern. Pick a format and stick to it.
Cover the edge cases. Include examples where the user asks something outside the model's scope. Show the model how to respond gracefully to off-topic requests, not just the happy path.
Vary the phrasing. Ten examples that all ask the same question with slightly different wording teach the model less than ten examples that each cover a different scenario.
Save your file as training_data.jsonl. That is all you need for the next step.
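Before paying for a training run, it is worth a ten-line sanity check that every line parses and follows the expected shape. A minimal sketch (the checks are suggestions, not a platform requirement):

```python
import json

def validate_jsonl(path: str) -> int:
    """Return the number of valid examples; raise on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate a trailing blank line
            obj = json.loads(line)  # raises if the line is not valid JSON
            messages = obj["messages"]
            roles = [m["role"] for m in messages]
            assert roles[-1] == "assistant", f"line {lineno}: last message must be the assistant output"
            assert all(m.get("content") for m in messages), f"line {lineno}: empty or missing content"
            count += 1
    return count
```

Run `validate_jsonl("training_data.jsonl")` and eyeball the returned count against the number of examples you think you wrote; off-by-a-few usually means stray blank or broken lines.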
Step 2: Fine-Tuning on a Cloud GPU
You do not need to own any GPU hardware. Cloud GPU providers rent compute by the hour or by the run.
What Happens During Fine-Tuning
You upload your JSONL file and pick a base model. The training process runs your examples through the model repeatedly, measuring how far off the model's outputs are from your ideal outputs, and adjusting the LoRA adapter weights to reduce that gap. Each complete pass through your dataset is called a training epoch; a typical run makes several passes.
For 100-300 examples with LoRA, expect:
- Training time: 10-30 minutes on a single GPU
- Cost with LoRA (12-16GB VRAM): $5-$15 per run
- Cost with QLoRA (6-10GB VRAM): $3-$10 per run
You will run training a few times as you refine your dataset. Total fine-tuning cost for a new feature is typically $20-$50 before you have a model you are happy with.
Choosing a Base Model for Mobile
For on-device deployment on iOS and Android, model size is the primary constraint. The practical sweet spot:
- Llama 3.2 1B Q4_K_M: 808MB on disk, runs at roughly 22 tokens per second on an iPhone 16 Pro, under 100MB RAM overhead
- Llama 3.2 3B Q4_K_M: 2.02GB on disk, meaningfully more capable, still fits on most modern devices
For most app features (classification, short-form generation, Q&A, summarization), the 1B model after fine-tuning on your specific task will outperform a general-purpose 3B model on that same task. Specialization through fine-tuning is more effective than raw parameter count for narrow tasks.
Apple's Foundation Models API (announced at WWDC 2025) gives third-party apps access to an on-device model of roughly 3B parameters with no download required. If your use case fits within what Apple's model supports, that is worth exploring. For cases where you need custom behavior or output format, fine-tuning your own model gives you control that a platform API cannot.
What to Watch During Training
You do not need to deeply understand the training metrics, but two numbers matter:
Training loss should decrease over time. If it stays flat, your data might have quality issues.
Validation loss is measured on a small held-out portion of your data. If training loss decreases but validation loss increases, the model is memorizing your training examples rather than learning to generalize. This is called overfitting. The fix is to add more diverse examples or reduce the number of training epochs.
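As a concrete illustration, this is the pattern to look for in the per-epoch numbers your training job reports (the loss values below are made up):

```python
def overfitting_epoch(train_loss, val_loss):
    """Return the first epoch index where validation loss starts rising
    while training loss keeps falling, or None if that never happens."""
    for i in range(1, len(val_loss)):
        if val_loss[i] > val_loss[i - 1] and train_loss[i] < train_loss[i - 1]:
            return i
    return None

# Hypothetical loss curves from a 5-epoch run
train = [2.10, 1.40, 0.95, 0.60, 0.35]  # keeps decreasing: good
val   = [2.20, 1.55, 1.30, 1.45, 1.70]  # rises after epoch 2: overfitting

print(overfitting_epoch(train, val))  # 3 -> consider stopping around epoch 2-3
```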
Step 3: Export to GGUF with Quantization
After fine-tuning, you have a LoRA adapter. Before you can ship it in a mobile app, you need to:
- Merge the adapter into the base model
- Quantize the merged model to reduce its size
- Export as a GGUF file
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp, the inference library that powers on-device AI on iOS and Android. It is a single file that contains everything needed to run the model.
Quantization reduces the precision of the model's numbers. Instead of storing each weight as a 16-bit float (the precision open models typically ship in), quantization stores it in roughly 4 bits. This cuts file size by roughly 75% and speeds up inference, with a small quality tradeoff.
Understanding Q4_K_M
When you see a model listed as Q4_K_M, each part means something:
- Q4: 4-bit quantization (vs 8-bit Q8 or full-precision F16)
- K: uses a quantization algorithm called K-quant that is more accurate than simpler methods
- M: medium variant (balanced between size and quality; there is also S for small and L for large)
For mobile deployment, Q4_K_M is the recommended quantization level. It balances size, speed, and quality well for consumer hardware.
The resulting file sizes after quantization with Q4_K_M:
- Llama 3.2 1B: 808MB
- Llama 3.2 3B: 2.02GB
Both fit in an app download, though you may want to download the 3B model after first launch rather than bundle it in the initial install.
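You can roughly estimate these file sizes from bits per weight. Q4_K_M stores most weights in 4 bits but keeps some tensors at higher precision, so its effective rate works out to roughly 5 bits per weight; that figure, and the 1.24B actual parameter count for the "1B" model, are approximations, not exact specs:

```python
# Back-of-the-envelope GGUF size estimate (all figures approximate).
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB for a quantized model."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

llama_1b = 1.24  # Llama 3.2 "1B" actually has ~1.24B parameters

print(f"F16:    {gguf_size_gb(llama_1b, 16.0):.2f} GB")  # ~2.5 GB
print(f"Q8_0:   {gguf_size_gb(llama_1b, 8.5):.2f} GB")   # ~1.3 GB
print(f"Q4_K_M: {gguf_size_gb(llama_1b, 5.2):.2f} GB")   # ~0.8 GB
```

At ~5.2 effective bits per weight, the estimate lands close to the 808MB Q4_K_M file quoted above.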
Step 4: Integrating in iOS and Android
You have a .gguf file. Now you need to run it inside your app.
iOS (Swift)
The llama.cpp project provides Swift bindings through its Swift Package Manager package. Add it to your Package.swift:
.package(url: "https://github.com/ggerganov/llama.cpp", from: "b3000.0.0")
Bundle your GGUF file in your app's resources or download it on first launch to the app's Documents directory. Then load and run the model:
import llama

// Locate the bundled GGUF file
let modelPath = Bundle.main.path(forResource: "my-model.Q4_K_M", ofType: "gguf")!

// Create a context with default parameters and load the model
let params = LlamaContextParams.default
let context = try LlamaContext(modelPath: modelPath, params: params)

// Run a completion against the user's input
let response = try await context.complete(prompt: "Summarize: \(userInput)")
On iPhone 16 Pro, a 1B model runs at roughly 22 tokens per second via the CPU path. iPhone 17 Pro, with its improved NPU (INT8 operations), benchmarks at approximately 136 tokens per second. For most app features, 22 tok/s is fast enough for streaming responses that feel natural.
Android (Kotlin/NDK)
On Android, llama.cpp integrates via JNI (Java Native Interface) and the NDK. If you would rather not manage JNI directly, Google's MediaPipe LLM Inference API provides a higher-level abstraction for on-device inference; note that it expects models converted to its own format with MediaPipe's tooling, rather than raw GGUF files:
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/my-model.task") // converted with MediaPipe's model tooling
    .setMaxTokens(512)
    .build()
val llmInference = LlmInference.createFromOptions(context, options)
val result = llmInference.generateResponse(userPrompt)
MediaPipe handles GPU acceleration automatically on Qualcomm and Mali GPUs when available.
React Native
For React Native apps, the llama.rn library provides a cross-platform wrapper around llama.cpp for both iOS and Android:
npm install llama.rn
import { initLlama } from 'llama.rn';

// modelPath: absolute path to the GGUF file on device
// (copy it out of the app bundle or download it on first launch;
// Metro's require() cannot bundle .gguf assets directly)
const context = await initLlama({
  model: modelPath,
  n_ctx: 2048,
});
const result = await context.completion({
  prompt: userInput,
  n_predict: 256,
});
The model runs entirely on-device. No API keys, no network requests, no per-inference cost.
What Fine-Tuning Can and Cannot Do
This is the section that will save you from a wasted fine-tuning run. Fine-tuning is a powerful tool, but it solves a specific set of problems.
Fine-Tuning Is Great For
Learning a format. If you need the model to always respond in JSON, always follow a specific structure, or always apply a consistent style, fine-tuning is the right tool. A few hundred examples of the target format and the model will follow it reliably.
Domain-specific language. If your app is in a specialized domain (medical, legal, a particular industry), fine-tuning teaches the model the terminology and conventions of that domain.
Tone and personality. If you want your AI feature to respond in a specific voice that matches your app's brand, fine-tuning is more reliable than lengthy system prompts.
Task specialization. Classifying support tickets, extracting structured data from user input, generating short-form content in a specific style. Any narrow, well-defined task benefits from fine-tuning.
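For instance, a ticket-classification dataset (the last use case above) could consist of lines like this, where the assistant output is always the same strict JSON shape; the labels and wording here are hypothetical:

```json
{"messages": [{"role": "system", "content": "Classify the support ticket. Reply with JSON only: {\"category\": ..., \"urgency\": ...}."}, {"role": "user", "content": "App crashes every time I open the camera screen on my Pixel 8."}, {"role": "assistant", "content": "{\"category\": \"crash\", \"urgency\": \"high\"}"}]}
```

A few hundred lines in exactly this shape, covering every category and a handful of off-topic inputs, is the whole dataset for a feature like this.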
Fine-Tuning Cannot Add New Facts
This is the most common misunderstanding. Fine-tuning does not teach the model new information about the world. It teaches the model how to behave, not what to know.
If you want the model to answer questions about your product catalog, your pricing page, or your documentation, fine-tuning is the wrong tool. Fine-tuning on your docs will not reliably make the model recall specific facts from those docs. It will just learn the format and style of your docs.
For factual recall of specific, changing information, use RAG (Retrieval-Augmented Generation) instead. RAG lets the model search a database of documents at inference time and use the retrieved text as context. Fine-tuning and RAG complement each other: fine-tune for behavior and format, use RAG for facts.
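To make that division of labor concrete, here is a toy sketch of the RAG side: retrieve the most relevant document, then hand it to the model as context. A real system would use embeddings and a vector store instead of keyword overlap, and the documents here are invented:

```python
# Hypothetical document store for a product-support feature
DOCS = [
    "Pro plan: $29/mo, includes unlimited projects and priority support.",
    "Free plan: 3 projects, community support only.",
    "Refunds: full refund within 14 days of purchase, no questions asked.",
]

def retrieve(query: str, docs, k: int = 1):
    """Rank docs by shared words with the query (stand-in for embedding search)."""
    words = set(query.lower().split())
    return sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, DOCS))
    # The fine-tuned model supplies format and tone; retrieval supplies the facts.
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How much is the Pro plan?"))
```

Because the facts live in the document store, updating your pricing page is a data change, not a retraining run.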
Fine-Tuning Cannot Expand the Model's Reasoning Ability
A 1B model fine-tuned on your task will perform better at your task than a general-purpose 3B model. But it cannot perform complex multi-step reasoning, generate long coherent documents, or solve problems that require the kind of capability that only comes from more parameters. Know where your model's ceiling is and design your feature within it.
Bringing It Together
Here is the full pipeline as a checklist:
- Write 100-300 training examples in JSONL chat format
- Review for consistency: matching format, varied inputs, edge cases covered
- Upload to a fine-tuning platform or cloud GPU service
- Train with LoRA on Llama 3.2 1B or 3B base model
- Check training and validation loss; re-run with adjustments if needed
- Merge adapter and export as model.Q4_K_M.gguf
- Add llama.cpp Swift package (iOS), MediaPipe or llama.rn (Android/React Native)
- Bundle or download the GGUF on first launch
- Test on real device hardware at your target token throughput
The toolchain has matured to the point where a React Native developer with no ML background can complete this pipeline in a weekend. The hard part is not the training or the integration. It is collecting and cleaning a good dataset.
Start with your training data. Everything else follows from there.