
Fine-Tuning for App Developers: A Non-ML-Engineer's Guide
A practical guide to fine-tuning AI models for mobile app developers. Learn LoRA, QLoRA, and GGUF export without needing an ML background.
You build great apps. You know Swift, Kotlin, or React Native. You can ship a polished UI, wire up a REST API, and debug a race condition at 11pm. Now you want to add an AI feature: something genuinely useful, not a thin wrapper around GPT-4.
The problem: every fine-tuning tutorial assumes you know what a "gradient" is. They open with PyTorch imports. They reference "attention heads" like it's common knowledge. You close the tab.
This guide is different. It explains fine-tuning the way an app developer actually needs to understand it: focused on the pipeline, the practical decisions, and what things cost. You will not need to understand backpropagation. You will need to understand JSON.
What Fine-Tuning Actually Is
The easiest way to understand fine-tuning is to think about how autocomplete works on your phone.
Your keyboard's autocomplete started as a general model trained on billions of words. Over time, it adapted to you. It learned that you type "lgtm" at the end of code reviews, that you always follow "Hey" with "so", and that you misspell "necessary" in a specific way. That adaptation happened because the model saw your patterns and adjusted.
Fine-tuning an AI model is the same idea, applied deliberately. You take a general-purpose model (something like Llama 3.2, which can write essays, answer questions, and generate code) and show it hundreds of examples of exactly how you want it to behave in your app. After training, it responds in that specific way, reliably, every time.
The result is a model that does your task better than any amount of prompting could achieve, runs on-device with no API calls, and costs you nothing per inference.
LoRA: Why You Do Not Retrain the Whole Model
Before fine-tuning became practical for non-ML teams, the only option was full fine-tuning: updating every single parameter in the model. For a 7 billion parameter model, that means updating 7 billion numbers. The compute cost was enormous. A full fine-tuning run required expensive hardware and days of training time. Only large labs could afford it.
In 2021, researchers at Microsoft published LoRA (Low-Rank Adaptation, arXiv:2106.09685), later presented at ICLR 2022. The key insight: you do not need to update all the parameters to change how a model behaves. You can freeze the original model weights entirely and add a small set of new, trainable layers alongside them. These new layers are called a LoRA adapter.
Here is the practical outcome:
- LoRA trains only 0.1-1% of the parameters that full fine-tuning would update
- Training is dramatically faster and cheaper
- The adapter file is small: typically 50-200MB for a 7B model
- The original base model is untouched and reusable
Think of it like a plugin for the model. The base model is the app. Your LoRA adapter is the extension that makes it behave exactly the way your use case requires.
In 2023, QLoRA (arXiv:2305.14314, NeurIPS 2023) pushed this further. QLoRA combines LoRA with 4-bit quantization, which compresses the model's numbers to take up less memory during training. The result: you can fine-tune a 7 billion parameter model on a consumer GPU with 6-10GB of VRAM, or on a cloud instance that costs $3-$10 per run. That is within the budget of a solo app developer.
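The parameter savings are easy to sanity-check with arithmetic. For one 4096x4096 projection matrix (a typical size in a 7B-class model; the numbers here are illustrative), a rank-8 LoRA adapter swaps 16.8 million trainable numbers for about 65 thousand:

```python
# Back-of-the-envelope LoRA parameter arithmetic (illustrative sizes).
d_in, d_out, r = 4096, 4096, 8

full = d_in * d_out        # parameters full fine-tuning would update
lora = r * (d_in + d_out)  # LoRA trains two small matrices: A (r x d_in) and B (d_out x r)

print(f"full fine-tuning: {full:,} params")    # 16,777,216
print(f"LoRA adapter:     {lora:,} params")    # 65,536
print(f"fraction trained: {lora / full:.2%}")  # 0.39%
```

That per-matrix 0.39% is where the "0.1-1% of the parameters" figure comes from; the exact fraction depends on the rank you choose and which layers you attach adapters to.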
The 4-Step Pipeline
Fine-tuning a model for your app follows four steps. The rest of this guide walks through each one.
- Prepare your training data as a JSONL file
- Fine-tune on a cloud GPU using LoRA or QLoRA
- Export to GGUF with quantization for mobile
- Integrate in your iOS or Android app using llama.cpp bindings
That is it. No ML theory required at any step.
Step 1: Preparing Your Training Data
Your training data is the most important part of the pipeline. A small, high-quality dataset beats a large, messy one every time. Stanford's Alpaca project demonstrated this in 2023: 52,000 synthetic instruction-following examples, generated for roughly $500 in API costs, were enough to make LLaMA 7B follow instructions comparably to text-davinci-003 (a GPT-3.5-era model) on many tasks. Quality and format matter more than volume.
How Many Examples Do You Need?
OpenAI's own guidance recommends starting with 50-100 examples when fine-tuning their models. For open-source models with LoRA, 100-500 examples is a practical starting range for a focused task. More data helps, but diminishing returns set in quickly once the model has seen enough variation to generalize.
For your first fine-tuning run, aim for 100-300 examples. You can always add more later.
The Format
Training examples are stored in a JSONL file (JSON Lines), where each line is a separate JSON object. The standard format uses chat-style messages:
{"messages": [{"role": "system", "content": "You are a cooking assistant that suggests recipes based on available ingredients."}, {"role": "user", "content": "I have chicken, garlic, lemon, and rosemary."}, {"role": "assistant", "content": "Here are two recipes you can make right now:\n\n1. Lemon Garlic Roasted Chicken..."}]}
{"messages": [{"role": "system", "content": "You are a cooking assistant that suggests recipes based on available ingredients."}, {"role": "user", "content": "I have eggs, spinach, and feta."}, {"role": "assistant", "content": "Three quick options:\n\n1. Greek Scrambled Eggs..."}]}
Each line is one training example. The system message defines the model's role. The user message is the input. The assistant message is the ideal output you want the model to learn.
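Hand-writing JSON lines invites escaping mistakes (newlines and quotes inside responses). A safer pattern is to build the examples as data and let a JSON serializer handle the escaping; a minimal Python sketch using the cooking-assistant pairs from above:

```python
import json

SYSTEM = "You are a cooking assistant that suggests recipes based on available ingredients."

# (user input, ideal assistant output) pairs from the examples above
pairs = [
    ("I have chicken, garlic, lemon, and rosemary.",
     "Here are two recipes you can make right now:\n\n1. Lemon Garlic Roasted Chicken..."),
    ("I have eggs, spinach, and feta.",
     "Three quick options:\n\n1. Greek Scrambled Eggs..."),
]

with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for user, assistant in pairs:
        example = {"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]}
        # json.dumps escapes newlines and quotes, so each example stays on one line
        f.write(json.dumps(example) + "\n")
```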
Tips for High-Quality Training Data
Match your real inputs. If users in your app send short, casual messages, your training inputs should look like short, casual messages. If they send structured queries, train on structured queries.
Be consistent in your outputs. If some assistant responses are two sentences and others are ten paragraphs, the model will not learn a reliable pattern. Pick a format and stick to it.
Cover the edge cases. Include examples where the user asks something outside the model's scope. Show the model how to respond gracefully to off-topic requests, not just the happy path.
Vary the phrasing. Ten examples that all ask the same question with slightly different wording teach the model less than ten examples that each cover a different scenario.
Save your file as training_data.jsonl. That is all you need for the next step.
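Before paying for a training run, it is worth a ten-line sanity check that every line parses and follows the expected shape. A minimal sketch (the checks are suggestions, not a platform requirement):

```python
import json

def validate_jsonl(path: str) -> int:
    """Return the number of valid examples; raise on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate a trailing blank line
            obj = json.loads(line)  # raises if the line is not valid JSON
            messages = obj["messages"]
            roles = [m["role"] for m in messages]
            assert roles[-1] == "assistant", f"line {lineno}: last message must be the assistant output"
            assert all(m.get("content") for m in messages), f"line {lineno}: empty or missing content"
            count += 1
    return count
```

Run `validate_jsonl("training_data.jsonl")` and eyeball the returned count against the number of examples you think you wrote; off-by-a-few usually means stray blank or broken lines.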
Step 2: Fine-Tuning on a Cloud GPU
You do not need to own any GPU hardware. Cloud GPU providers rent compute by the hour or by the run.
What Happens During Fine-Tuning
You upload your JSONL file and pick a base model. The training process runs your examples through the model repeatedly, measuring how far off the model's outputs are from your ideal outputs, and adjusting the LoRA adapter weights to reduce that gap. Each complete pass through your dataset is called a training epoch; a typical run makes several passes.
For 100-300 examples with LoRA, expect:
- Training time: 10-30 minutes on a single GPU
- Cost with LoRA (12-16GB VRAM): $5-$15 per run
- Cost with QLoRA (6-10GB VRAM): $3-$10 per run
You will run training a few times as you refine your dataset. Total fine-tuning cost for a new feature is typically $20-$50 before you have a model you are happy with.
Choosing a Base Model for Mobile
For on-device deployment on iOS and Android, model size is the primary constraint. The practical sweet spot:
- Llama 3.2 1B Q4_K_M: 808MB on disk, runs at roughly 22 tokens per second on an iPhone 16 Pro, under 100MB RAM overhead
- Llama 3.2 3B Q4_K_M: 2.02GB on disk, meaningfully more capable, still fits on most modern devices
For most app features (classification, short-form generation, Q&A, summarization), the 1B model after fine-tuning on your specific task will outperform a general-purpose 3B model on that same task. Specialization through fine-tuning is more effective than raw parameter count for narrow tasks.
Apple's Foundation Models API (announced at WWDC 2025) gives third-party apps access to an on-device model of roughly 3B parameters with no download required. If your use case fits within what Apple's model supports, that is worth exploring. For cases where you need custom behavior or output format, fine-tuning your own model gives you control that a platform API cannot.
What to Watch During Training
You do not need to deeply understand the training metrics, but two numbers matter:
Training loss should decrease over time. If it stays flat, your data might have quality issues.
Validation loss is measured on a small held-out portion of your data. If training loss decreases but validation loss increases, the model is memorizing your training examples rather than learning to generalize. This is called overfitting. The fix is to add more diverse examples or reduce the number of training epochs.
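As a concrete illustration, this is the pattern to look for in the per-epoch numbers your training job reports (the loss values below are made up):

```python
def overfitting_epoch(train_loss, val_loss):
    """Return the first epoch index where validation loss starts rising
    while training loss keeps falling, or None if that never happens."""
    for i in range(1, len(val_loss)):
        if val_loss[i] > val_loss[i - 1] and train_loss[i] < train_loss[i - 1]:
            return i
    return None

# Hypothetical loss curves from a 5-epoch run
train = [2.10, 1.40, 0.95, 0.60, 0.35]  # keeps decreasing: good
val   = [2.20, 1.55, 1.30, 1.45, 1.70]  # rises after epoch 2: overfitting

print(overfitting_epoch(train, val))  # 3 -> consider stopping around epoch 2-3
```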
Step 3: Export to GGUF with Quantization
After fine-tuning, you have a LoRA adapter. Before you can ship it in a mobile app, you need to:
- Merge the adapter into the base model
- Quantize the merged model to reduce its size
- Export as a GGUF file
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp, the inference library that powers on-device AI on iOS and Android. It is a single file that contains everything needed to run the model.
Quantization reduces the precision of the model's numbers. Instead of storing each weight as a 16-bit float (the precision open models typically ship in), quantization stores it in roughly 4 bits. This cuts file size by roughly 75% and speeds up inference, with a small quality tradeoff.
Understanding Q4_K_M
When you see a model listed as Q4_K_M, each part means something:
- Q4: 4-bit quantization (vs 8-bit Q8 or full-precision F16)
- K: uses a quantization algorithm called K-quant that is more accurate than simpler methods
- M: medium variant (balanced between size and quality; there is also S for small and L for large)
For mobile deployment, Q4_K_M is the recommended quantization level. It balances size, speed, and quality well for consumer hardware.
The resulting file sizes after quantization with Q4_K_M:
- Llama 3.2 1B: 808MB
- Llama 3.2 3B: 2.02GB
Both fit in an app download, though you may want to download the 3B model after first launch rather than bundle it in the initial install.
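You can roughly estimate these file sizes from bits per weight. Q4_K_M stores most weights in 4 bits but keeps some tensors at higher precision, so its effective rate works out to roughly 5 bits per weight; that figure, and the 1.24B actual parameter count for the "1B" model, are approximations, not exact specs:

```python
# Back-of-the-envelope GGUF size estimate (all figures approximate).
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB for a quantized model."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

llama_1b = 1.24  # Llama 3.2 "1B" actually has ~1.24B parameters

print(f"F16:    {gguf_size_gb(llama_1b, 16.0):.2f} GB")  # ~2.5 GB
print(f"Q8_0:   {gguf_size_gb(llama_1b, 8.5):.2f} GB")   # ~1.3 GB
print(f"Q4_K_M: {gguf_size_gb(llama_1b, 5.2):.2f} GB")   # ~0.8 GB
```

At ~5.2 effective bits per weight, the estimate lands close to the 808MB Q4_K_M file quoted above.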
Step 4: Integrating in iOS and Android
You have a .gguf file. Now you need to run it inside your app.
iOS (Swift)
The llama.cpp project provides Swift bindings through its Swift Package Manager package. Add it to your Package.swift:
.package(url: "https://github.com/ggerganov/llama.cpp", from: "b3000.0.0")
Bundle your GGUF file in your app's resources or download it on first launch to the app's Documents directory. Then load and run the model:
import llama

// Locate the bundled GGUF file
let modelPath = Bundle.main.path(forResource: "my-model.Q4_K_M", ofType: "gguf")!

// Create a context with default parameters and load the model
let params = LlamaContextParams.default
let context = try LlamaContext(modelPath: modelPath, params: params)

// Run a completion against the user's input
let response = try await context.complete(prompt: "Summarize: \(userInput)")
On iPhone 16 Pro, a 1B model runs at roughly 22 tokens per second via the CPU path. iPhone 17 Pro, with its improved NPU (INT8 operations), benchmarks at approximately 136 tokens per second. For most app features, 22 tok/s is fast enough for streaming responses that feel natural.
Android (Kotlin/NDK)
On Android, llama.cpp integrates via JNI (Java Native Interface) and the NDK. If you would rather not manage JNI directly, Google's MediaPipe LLM Inference API provides a higher-level abstraction for on-device inference; note that it expects models converted to its own format with MediaPipe's tooling, rather than raw GGUF files:
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/my-model.task") // converted with MediaPipe's model tooling
    .setMaxTokens(512)
    .build()
val llmInference = LlmInference.createFromOptions(context, options)
val result = llmInference.generateResponse(userPrompt)
MediaPipe handles GPU acceleration automatically on Qualcomm and Mali GPUs when available.
React Native
For React Native apps, the llama.rn library provides a cross-platform wrapper around llama.cpp for both iOS and Android:
npm install llama.rn
import { initLlama } from 'llama.rn';

// modelPath: absolute path to the GGUF file on device
// (copy it out of the app bundle or download it on first launch;
// Metro's require() cannot bundle .gguf assets directly)
const context = await initLlama({
  model: modelPath,
  n_ctx: 2048,
});
const result = await context.completion({
  prompt: userInput,
  n_predict: 256,
});
The model runs entirely on-device. No API keys, no network requests, no per-inference cost.
What Fine-Tuning Can and Cannot Do
This is the section that will save you from a wasted fine-tuning run. Fine-tuning is a powerful tool, but it solves a specific set of problems.
Fine-Tuning Is Great For
Learning a format. If you need the model to always respond in JSON, always follow a specific structure, or always apply a consistent style, fine-tuning is the right tool. A few hundred examples of the target format and the model will follow it reliably.
Domain-specific language. If your app is in a specialized domain (medical, legal, a particular industry), fine-tuning teaches the model the terminology and conventions of that domain.
Tone and personality. If you want your AI feature to respond in a specific voice that matches your app's brand, fine-tuning is more reliable than lengthy system prompts.
Task specialization. Classifying support tickets, extracting structured data from user input, generating short-form content in a specific style. Any narrow, well-defined task benefits from fine-tuning.
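For instance, a ticket-classification dataset (the last use case above) could consist of lines like this, where the assistant output is always the same strict JSON shape; the labels and wording here are hypothetical:

```json
{"messages": [{"role": "system", "content": "Classify the support ticket. Reply with JSON only: {\"category\": ..., \"urgency\": ...}."}, {"role": "user", "content": "App crashes every time I open the camera screen on my Pixel 8."}, {"role": "assistant", "content": "{\"category\": \"crash\", \"urgency\": \"high\"}"}]}
```

A few hundred lines in exactly this shape, covering every category and a handful of off-topic inputs, is the whole dataset for a feature like this.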
Fine-Tuning Cannot Add New Facts
This is the most common misunderstanding. Fine-tuning does not teach the model new information about the world. It teaches the model how to behave, not what to know.
If you want the model to answer questions about your product catalog, your pricing page, or your documentation, fine-tuning is the wrong tool. Fine-tuning on your docs will not reliably make the model recall specific facts from those docs. It will just learn the format and style of your docs.
For factual recall of specific, changing information, use RAG (Retrieval-Augmented Generation) instead. RAG lets the model search a database of documents at inference time and use the retrieved text as context. Fine-tuning and RAG complement each other: fine-tune for behavior and format, use RAG for facts.
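To make that division of labor concrete, here is a toy sketch of the RAG side: retrieve the most relevant document, then hand it to the model as context. A real system would use embeddings and a vector store instead of keyword overlap, and the documents here are invented:

```python
# Hypothetical document store for a product-support feature
DOCS = [
    "Pro plan: $29/mo, includes unlimited projects and priority support.",
    "Free plan: 3 projects, community support only.",
    "Refunds: full refund within 14 days of purchase, no questions asked.",
]

def retrieve(query: str, docs, k: int = 1):
    """Rank docs by shared words with the query (stand-in for embedding search)."""
    words = set(query.lower().split())
    return sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, DOCS))
    # The fine-tuned model supplies format and tone; retrieval supplies the facts.
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How much is the Pro plan?"))
```

Because the facts live in the document store, updating your pricing page is a data change, not a retraining run.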
Fine-Tuning Cannot Expand the Model's Reasoning Ability
A 1B model fine-tuned on your task will perform better at your task than a general-purpose 3B model. But it cannot perform complex multi-step reasoning, generate long coherent documents, or solve problems that require the kind of capability that only comes from more parameters. Know where your model's ceiling is and design your feature within it.
Bringing It Together
Here is the full pipeline as a checklist:
- Write 100-300 training examples in JSONL chat format
- Review for consistency: matching format, varied inputs, edge cases covered
- Upload to a fine-tuning platform or cloud GPU service
- Train with LoRA on Llama 3.2 1B or 3B base model
- Check training and validation loss; re-run with adjustments if needed
- Merge adapter and export as model.Q4_K_M.gguf
- Add llama.cpp Swift package (iOS), MediaPipe or llama.rn (Android/React Native)
- Bundle or download the GGUF on first launch
- Test on real device hardware at your target token throughput
The toolchain has matured to the point where a React Native developer with no ML background can complete this pipeline in a weekend. The hard part is not the training or the integration. It is collecting and cleaning a good dataset.
Start with your training data. Everything else follows from there.