
Fine-Tuning vs Prompt Engineering for Mobile Apps
Prompt engineering is fast and flexible. Fine-tuning is accurate and cheap at scale. Here is the practical comparison for mobile developers deciding between the two approaches.
Prompt engineering is the first tool every developer reaches for. Write a system prompt that tells the model how to behave, what to output, and what to avoid. It works surprisingly well for prototyping.
Fine-tuning is the second tool, used when prompting hits its limits. Train the model on examples of the exact behavior you want. It takes more upfront work but delivers better results at lower cost.
For mobile apps, the choice has implications beyond accuracy. Prompt engineering requires long system prompts sent on every API call (cost). Fine-tuning bakes the instructions into the model weights (free at inference time).
Prompt Engineering: The Fast Path
How It Works
You write a system prompt that instructs the model:
```
You are a cooking assistant for the RecipeApp. When users ask about
recipes, provide step-by-step instructions. Always include prep time
and cooking time. Format ingredients as a bulleted list. Keep
responses under 200 words. Never suggest recipes that include
allergens without a warning. If the user asks about non-cooking
topics, politely redirect to cooking.
```
This prompt is sent with every API call. The model follows (most of) these instructions most of the time.
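In code, the overhead is easy to see: the full instruction block is attached to every single request. A minimal sketch using the OpenAI Python client (the model name and the truncated prompt are placeholders for whatever your app actually uses):
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The full system prompt is resent, and billed as input tokens, on every call.
SYSTEM_PROMPT = (
    "You are a cooking assistant for the RecipeApp. "
    "Always include prep time and cooking time. "
    "Format ingredients as a bulleted list. Keep responses under 200 words. "
    # ...the rest of the 800-1,500 token instruction block...
)

def ask_recipe_assistant(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

print(ask_recipe_assistant("Quick pasta recipe for one person"))
```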
Strengths
- Speed: Minutes to write and test
- Flexibility: Change behavior by editing text
- No training data needed: Works with zero examples
- No infrastructure: Just API calls
- Iteration speed: Try a new prompt in seconds
Weaknesses
- Token cost: Your 800-1,500 token system prompt is billed on every request. At 10K MAU with 90K daily requests, that is 72-135 million extra tokens per day.
- Inconsistency: The model follows instructions probabilistically. 70-85% adherence is typical. Edge cases slip through.
- Limited domain knowledge: The model only knows what is in the prompt. You cannot fit your entire product catalog into a system prompt.
- Prompt injection vulnerability: Users can sometimes override system prompt instructions.
- Ceiling on accuracy: No amount of prompt engineering can teach a general model your specific domain terminology, style, or edge cases.
Fine-Tuning: The Accurate Path
How It Works
You create examples of the exact input-output behavior you want:
{"messages": [
{"role": "user", "content": "Quick pasta recipe for one person"},
{"role": "assistant", "content": "**One-Pot Garlic Pasta**\n\nPrep: 5 min | Cook: 12 min\n\n- 100g spaghetti\n- 2 cloves garlic, minced\n- 2 tbsp olive oil\n- Red pepper flakes\n- Parmesan\n\n1. Boil pasta in salted water\n2. Saute garlic in olive oil (2 min)\n3. Toss drained pasta with garlic oil\n4. Top with pepper flakes and parmesan"}
]}
Train the model on 500-5,000 such examples. The model learns your format, style, domain knowledge, and edge case handling.
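Before kicking off a run, it is worth sanity-checking the file: one valid JSON object per line, the expected roles in each example, and enough examples to matter. A quick check, assuming the chat-format JSONL shown above and a placeholder filename:
```python
import json

# Sanity-check a chat-format JSONL training file before fine-tuning.
count = 0
with open("train.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)  # raises if the line is not valid JSON
        roles = [message["role"] for message in record["messages"]]
        assert "user" in roles and "assistant" in roles, f"line {line_no}: missing role"
        count += 1

print(f"{count} examples")  # aim for at least a few hundred
```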
Strengths
- Accuracy: 90-96% on domain tasks vs 70-85% with prompting
- No system prompt needed: Instructions are in the weights. Zero extra tokens per request.
- Domain knowledge: The model knows your product, terminology, and style
- Consistency: Responses follow the trained format reliably
- On-device deployment: Fine-tuned models run locally. No API cost, no latency, no network dependency.
- Prompt injection resistance: Behavior is in the weights, not an overridable text instruction
Weaknesses
- Upfront time: Training data preparation takes hours to days
- Training cost: $5-50 per fine-tuning run (one-time)
- Less flexible: Changing behavior requires retraining
- Data requirement: Minimum 200-500 quality examples
Head-to-Head Comparison
Accuracy on Domain Tasks
| Metric | Prompted GPT-4o | Prompted GPT-4o-mini | Fine-Tuned 3B | Fine-Tuned 1B |
|---|---|---|---|---|
| Classification accuracy | 80-85% | 71-78% | 93-96% | 90-94% |
| Format adherence | 85-90% | 75-85% | 95-98% | 92-96% |
| Domain term usage | 60-70% | 50-60% | 95%+ | 90%+ |
| Edge case handling | 65-75% | 55-65% | 85-92% | 80-88% |
Fine-tuning consistently outperforms prompting on domain-specific metrics. The gap is largest on format adherence and domain terminology, where fine-tuning locks in the exact patterns you need.
Cost Per Month (10K MAU, 90K daily requests)
| Approach | Token Cost | Infrastructure | Total Monthly |
|---|---|---|---|
| Prompted GPT-4o | $5,625+ | API only | $5,625+ |
| Prompted GPT-4o-mini | $338+ | API only | $338+ |
| Prompted Gemini Flash | $225+ | API only | $225+ |
| Fine-tuned 3B (on-device) | $0 | CDN for model delivery | ~$10-50 |
| Fine-tuned 1B (on-device) | $0 | CDN for model delivery | ~$10-50 |
Fine-tuning has a one-time cost ($5-50 per training run). After deployment, per-inference cost is zero. The monthly cost is just CDN bandwidth for new user model downloads.
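The arithmetic behind the table is simple enough to reproduce. A rough estimate of the prompted approach's input-token bill, where the per-request token count and per-million-token prices are assumptions chosen to match the table and should be checked against current provider pricing:
```python
# Rough monthly input-token cost for the prompted approach (assumed figures).
DAILY_REQUESTS = 90_000      # from the 10K MAU scenario above
TOKENS_PER_REQUEST = 833     # ~800-token system prompt plus a short user message
PRICE_PER_MILLION = {        # USD per 1M input tokens; verify current pricing
    "Prompted GPT-4o": 2.50,
    "Prompted GPT-4o-mini": 0.15,
    "Prompted Gemini Flash": 0.10,
}

monthly_tokens = DAILY_REQUESTS * 30 * TOKENS_PER_REQUEST  # ~2.25 billion
for approach, price in PRICE_PER_MILLION.items():
    print(f"{approach}: ~${monthly_tokens / 1_000_000 * price:,.0f}/month")
```
Swap in your own request volume and token counts; the shape of the result does not change. The prompted bill scales linearly with usage, and the fine-tuned on-device bill does not.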
Latency
| Approach | Time to First Token |
|---|---|
| Cloud API (any model) | 500-2,000ms |
| Fine-tuned on-device 1B | 80-150ms |
| Fine-tuned on-device 3B | 150-300ms |
When Each Wins
Prompt engineering wins when:
- You are prototyping and do not know if users want the feature yet
- The task is general (not domain-specific)
- You have zero training data
- Behavior needs to change weekly
- User count is very small (under 500 MAU)
Fine-tuning wins when:
- You have validated the feature and are scaling
- The task is domain-specific (your product, your terminology, your format)
- Accuracy matters (classification, extraction, compliance-sensitive content)
- You have 500+ examples of desired behavior (or can create them)
- Cost, latency, offline support, or privacy matter
The Migration Path
The two approaches are not mutually exclusive. They are sequential:
1. Start with prompt engineering. Build the feature fast. Validate user interest. Ship with a cloud API.
2. Collect training data. Every API call with your prompts generates an input-output pair. Your prompt-engineered API logs become your fine-tuning dataset.
3. Fine-tune when the signal is clear. When you know users want the feature, when your prompt is stable, when cost or latency matters, fine-tune a small model on your collected data.
4. Deploy on-device. Export GGUF, ship to users. The system prompt disappears. The accuracy improves. The cost drops to zero. (See the sketch after this list.)
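Once the GGUF file is on the device, inference is a local call with no system prompt and no network round trip. A minimal sketch using the llama-cpp-python binding (on iOS or Android you would use the platform's llama.cpp wrapper instead; the model filename is a placeholder):
```python
from llama_cpp import Llama

# Load the fine-tuned, quantized model shipped with (or downloaded by) the app.
llm = Llama(model_path="recipe-assistant-3b-q4.gguf", n_ctx=2048)

# No system prompt needed: the behavior lives in the weights.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Quick pasta recipe for one person"}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```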
Platforms like Ertas make the fine-tuning step accessible: upload your training data (which can come directly from your API logs), select a base model, train with LoRA, export GGUF. The fine-tuning infrastructure is handled for you.
The Prompt-to-Fine-Tune Pipeline
Your API logs are a goldmine. Each log entry contains:
- The user input (training input)
- The system prompt (implicitly encoded in the expected output)
- The model output (training output, after quality filtering)
Filter for high-quality outputs (where the model followed your instructions correctly), format as training examples, and you have a fine-tuning dataset. The better your prompt engineering was, the better your fine-tuning data will be.
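A minimal sketch of that conversion, assuming your logs are JSON lines with user_input, model_output, and a quality signal such as a thumbs_up flag (all three field names are placeholders for whatever your logging actually records):
```python
import json

def logs_to_training_data(log_path: str, out_path: str) -> int:
    """Convert quality-filtered API logs into chat-format fine-tuning examples."""
    kept = 0
    with open(log_path, encoding="utf-8") as logs, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in logs:
            entry = json.loads(line)
            # Keep only responses that followed the prompt correctly, e.g.
            # flagged by user feedback or an automated format check.
            if not entry.get("thumbs_up"):
                continue
            record = {
                "messages": [
                    {"role": "user", "content": entry["user_input"]},
                    {"role": "assistant", "content": entry["model_output"]},
                ]
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept

print(logs_to_training_data("api_logs.jsonl", "train.jsonl"), "examples written")
```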
This is why the two approaches complement each other. Good prompts create good training data. Good training data creates a model that no longer needs prompts.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.