
Fine-Tuning vs Prompt Engineering for Mobile Apps
Prompt engineering is fast and flexible. Fine-tuning is accurate and cheap at scale. Here is the practical comparison for mobile developers deciding between the two approaches.
Prompt engineering is the first tool every developer reaches for. Write a system prompt that tells the model how to behave, what to output, and what to avoid. It works surprisingly well for prototyping.
Fine-tuning is the second tool, used when prompting hits its limits. Train the model on examples of the exact behavior you want. It takes more upfront work but delivers better results at lower cost.
For mobile apps, the choice has implications beyond accuracy. Prompt engineering requires long system prompts sent on every API call (cost). Fine-tuning bakes the instructions into the model weights (free at inference time).
Prompt Engineering: The Fast Path
How It Works
You write a system prompt that instructs the model:
```
You are a cooking assistant for the RecipeApp. When users ask about
recipes, provide step-by-step instructions. Always include prep time
and cooking time. Format ingredients as a bulleted list. Keep
responses under 200 words. Never suggest recipes that include
allergens without a warning. If the user asks about non-cooking
topics, politely redirect to cooking.
```
This prompt is sent with every API call. The model follows (most of) these instructions most of the time.
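In code, the overhead is easy to see: the full instruction block is attached to every single request. A minimal sketch using the OpenAI Python client (the model name and the truncated prompt are placeholders for whatever your app actually uses):
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The full system prompt is resent, and billed as input tokens, on every call.
SYSTEM_PROMPT = (
    "You are a cooking assistant for the RecipeApp. "
    "Always include prep time and cooking time. "
    "Format ingredients as a bulleted list. Keep responses under 200 words. "
    # ...the rest of the 800-1,500 token instruction block...
)

def ask_recipe_assistant(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

print(ask_recipe_assistant("Quick pasta recipe for one person"))
```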
Strengths
- Speed: Minutes to write and test
- Flexibility: Change behavior by editing text
- No training data needed: Works with zero examples
- No infrastructure: Just API calls
- Iteration speed: Try a new prompt in seconds
Weaknesses
- Token cost: Your 800-1,500 token system prompt is billed on every request. At 10K MAU with 90K daily requests, that is 72-135 million extra tokens per day.
- Inconsistency: The model follows instructions probabilistically. 70-85% adherence is typical. Edge cases slip through.
- Limited domain knowledge: The model only knows what is in the prompt. You cannot fit your entire product catalog into a system prompt.
- Prompt injection vulnerability: Users can sometimes override system prompt instructions.
- Ceiling on accuracy: No amount of prompt engineering can teach a general model your specific domain terminology, style, or edge cases.
Fine-Tuning: The Accurate Path
How It Works
You create examples of the exact input-output behavior you want:
{"messages": [
{"role": "user", "content": "Quick pasta recipe for one person"},
{"role": "assistant", "content": "**One-Pot Garlic Pasta**\n\nPrep: 5 min | Cook: 12 min\n\n- 100g spaghetti\n- 2 cloves garlic, minced\n- 2 tbsp olive oil\n- Red pepper flakes\n- Parmesan\n\n1. Boil pasta in salted water\n2. Saute garlic in olive oil (2 min)\n3. Toss drained pasta with garlic oil\n4. Top with pepper flakes and parmesan"}
]}
Train the model on 500-5,000 such examples. The model learns your format, style, domain knowledge, and edge case handling.
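Before kicking off a run, it is worth sanity-checking the file: one valid JSON object per line, the expected roles in each example, and enough examples to matter. A quick check, assuming the chat-format JSONL shown above and a placeholder filename:
```python
import json

# Sanity-check a chat-format JSONL training file before fine-tuning.
count = 0
with open("train.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)  # raises if the line is not valid JSON
        roles = [message["role"] for message in record["messages"]]
        assert "user" in roles and "assistant" in roles, f"line {line_no}: missing role"
        count += 1

print(f"{count} examples")  # aim for at least a few hundred
```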
Strengths
- Accuracy: 90-96% on domain tasks vs 70-85% with prompting
- No system prompt needed: Instructions are in the weights. Zero extra tokens per request.
- Domain knowledge: The model knows your product, terminology, and style
- Consistency: Responses follow the trained format reliably
- On-device deployment: Fine-tuned models run locally. No API cost, no latency, no network dependency.
- Prompt injection resistance: Behavior is in the weights, not an overridable text instruction
Weaknesses
- Upfront time: Training data preparation takes hours to days
- Training cost: $5-50 per fine-tuning run (one-time)
- Less flexible: Changing behavior requires retraining
- Data requirement: Minimum 200-500 quality examples
Head-to-Head Comparison
Accuracy on Domain Tasks
| Metric | Prompted GPT-4o | Prompted GPT-4o-mini | Fine-Tuned 3B | Fine-Tuned 1B |
|---|---|---|---|---|
| Classification accuracy | 80-85% | 71-78% | 93-96% | 90-94% |
| Format adherence | 85-90% | 75-85% | 95-98% | 92-96% |
| Domain term usage | 60-70% | 50-60% | 95%+ | 90%+ |
| Edge case handling | 65-75% | 55-65% | 85-92% | 80-88% |
Fine-tuning consistently outperforms prompting on domain-specific metrics. The gap is largest on format adherence and domain terminology, where fine-tuning locks in the exact patterns you need.
Cost Per Month (10K MAU, 90K daily requests)
| Approach | Token Cost | Infrastructure | Total Monthly |
|---|---|---|---|
| Prompted GPT-4o | $5,625+ | API only | $5,625+ |
| Prompted GPT-4o-mini | $338+ | API only | $338+ |
| Prompted Gemini Flash | $225+ | API only | $225+ |
| Fine-tuned 3B (on-device) | $0 | CDN for model delivery | ~$10-50 |
| Fine-tuned 1B (on-device) | $0 | CDN for model delivery | ~$10-50 |
Fine-tuning has a one-time cost ($5-50 per training run). After deployment, per-inference cost is zero. The monthly cost is just CDN bandwidth for new user model downloads.
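The arithmetic behind the table is simple enough to reproduce. A rough estimate of the prompted approach's input-token bill, where the per-request token count and per-million-token prices are assumptions chosen to match the table and should be checked against current provider pricing:
```python
# Rough monthly input-token cost for the prompted approach (assumed figures).
DAILY_REQUESTS = 90_000      # from the 10K MAU scenario above
TOKENS_PER_REQUEST = 833     # ~800-token system prompt plus a short user message
PRICE_PER_MILLION = {        # USD per 1M input tokens; verify current pricing
    "Prompted GPT-4o": 2.50,
    "Prompted GPT-4o-mini": 0.15,
    "Prompted Gemini Flash": 0.10,
}

monthly_tokens = DAILY_REQUESTS * 30 * TOKENS_PER_REQUEST  # ~2.25 billion
for approach, price in PRICE_PER_MILLION.items():
    print(f"{approach}: ~${monthly_tokens / 1_000_000 * price:,.0f}/month")
```
Swap in your own request volume and token counts; the shape of the result does not change. The prompted bill scales linearly with usage, and the fine-tuned on-device bill does not.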
Latency
| Approach | Time to First Token |
|---|---|
| Cloud API (any model) | 500-2,000ms |
| Fine-tuned on-device 1B | 80-150ms |
| Fine-tuned on-device 3B | 150-300ms |
When Each Wins
Prompt engineering wins when:
- You are prototyping and do not know if users want the feature yet
- The task is general (not domain-specific)
- You have zero training data
- Behavior needs to change weekly
- User count is very small (under 500 MAU)
Fine-tuning wins when:
- You have validated the feature and are scaling
- The task is domain-specific (your product, your terminology, your format)
- Accuracy matters (classification, extraction, compliance-sensitive content)
- You have 500+ examples of desired behavior (or can create them)
- Cost, latency, offline support, or privacy matter
The Migration Path
The two approaches are not mutually exclusive. They are sequential:
1. Start with prompt engineering. Build the feature fast. Validate user interest. Ship with a cloud API.
2. Collect training data. Every API call with your prompts generates an input-output pair. Your prompt-engineered API logs become your fine-tuning dataset.
3. Fine-tune when the signal is clear. When you know users want the feature, when your prompt is stable, when cost or latency matters, fine-tune a small model on your collected data.
4. Deploy on-device. Export GGUF, ship to users. The system prompt disappears. The accuracy improves. The cost drops to zero. (See the sketch after this list.)
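Once the GGUF file is on the device, inference is a local call with no system prompt and no network round trip. A minimal sketch using the llama-cpp-python binding (on iOS or Android you would use the platform's llama.cpp wrapper instead; the model filename is a placeholder):
```python
from llama_cpp import Llama

# Load the fine-tuned, quantized model shipped with (or downloaded by) the app.
llm = Llama(model_path="recipe-assistant-3b-q4.gguf", n_ctx=2048)

# No system prompt needed: the behavior lives in the weights.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Quick pasta recipe for one person"}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```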
Platforms like Ertas make the fine-tuning step accessible: upload your training data (which can come directly from your API logs), select a base model, train with LoRA, export GGUF. The fine-tuning infrastructure is handled for you.
The Prompt-to-Fine-Tune Pipeline
Your API logs are a goldmine. Each log entry contains:
- The user input (training input)
- The system prompt (implicitly encoded in the expected output)
- The model output (training output, after quality filtering)
Filter for high-quality outputs (where the model followed your instructions correctly), format as training examples, and you have a fine-tuning dataset. The better your prompt engineering was, the better your fine-tuning data will be.
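A minimal sketch of that conversion, assuming your logs are JSON lines with user_input, model_output, and a quality signal such as a thumbs_up flag (all three field names are placeholders for whatever your logging actually records):
```python
import json

def logs_to_training_data(log_path: str, out_path: str) -> int:
    """Convert quality-filtered API logs into chat-format fine-tuning examples."""
    kept = 0
    with open(log_path, encoding="utf-8") as logs, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in logs:
            entry = json.loads(line)
            # Keep only responses that followed the prompt correctly, e.g.
            # flagged by user feedback or an automated format check.
            if not entry.get("thumbs_up"):
                continue
            record = {
                "messages": [
                    {"role": "user", "content": entry["user_input"]},
                    {"role": "assistant", "content": entry["model_output"]},
                ]
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept

print(logs_to_training_data("api_logs.jsonl", "train.jsonl"), "examples written")
```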
This is why the two approaches complement each other. Good prompts create good training data. Good training data creates a model that no longer needs prompts.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.