
Migrating from Cloud API to On-Device AI: The Complete Guide
A step-by-step migration plan for moving your mobile app from cloud AI APIs to on-device inference. Data extraction, fine-tuning, integration, testing, rollout, and monitoring.
You have a mobile app with AI features powered by a cloud API. It works, but the costs are growing, the latency is noticeable, and you want offline support. This guide walks through the complete migration from cloud API to on-device inference.
The migration is not a rewrite. It is a gradual transition where on-device AI replaces the cloud API for specific features, validated at each step.
Phase 1: Assess (Week 1)
Inventory Your AI Features
List every feature that calls a cloud AI API:
| Feature | API Calls/Day | Avg Tokens/Call | Monthly Cost | Offline Needed? |
|---|---|---|---|---|
| Chat assistant | 5,000 | 2,500 | $450 | Nice to have |
| Content categorization | 12,000 | 800 | $180 | Yes |
| Smart replies | 8,000 | 600 | $120 | Yes |
| Summarization | 2,000 | 3,000 | $270 | No |
Prioritize by Migration Value
Score each feature:
- Cost impact: High-volume features save the most money
- Latency sensitivity: Features where speed matters most benefit most from on-device
- Offline value: Features that users need without connectivity
- Complexity: Simple tasks (classification, short generation) migrate more easily
Start with the highest-value, lowest-complexity feature. Classification and smart replies are typical first candidates.
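If you want the ranking to be explicit rather than gut feel, a simple weighted score works. A sketch with illustrative 1-5 ratings and weights (the specific values are not from this guide; tune them to your own priorities):

```swift
// Rate each criterion 1-5, weight the upside, penalize complexity.
// Weights and ratings are illustrative, not prescriptive.
struct FeatureScore {
    let name: String
    let costImpact: Int          // high-volume features save the most
    let latencySensitivity: Int  // speed-sensitive features gain the most
    let offlineValue: Int        // needed without connectivity?
    let complexity: Int          // higher = harder to migrate

    var migrationScore: Double {
        0.35 * Double(costImpact)
            + 0.25 * Double(latencySensitivity)
            + 0.25 * Double(offlineValue)
            - 0.15 * Double(complexity)
    }
}
```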
Set Quality Baselines
Before migrating, measure your cloud AI's performance on each feature:
```python
# Evaluation set: 200-500 examples with human-verified correct answers
evaluation_set = load_evaluation_data("eval_set.jsonl")

for example in evaluation_set:
    cloud_response = call_cloud_api(example.input)
    cloud_score = evaluate(cloud_response, example.expected)
    log_baseline(example.id, cloud_score)
```
The baseline is what your on-device model must match or exceed.
Phase 2: Prepare Training Data (Weeks 1-3)
Extract from API Logs
Your existing API call logs are your primary training data source:
- Export API logs (input/output pairs) from your logging infrastructure
- Filter for quality (successful responses, user-accepted outputs)
- Anonymize (remove PII)
- Format into the standard chat training format
```python
# Extract training data from API logs, keeping only successful calls
# the user did not reject. Inputs should already be anonymized (PII
# stripped) per the steps above.
training_examples = []
for log in api_logs:
    if log.status == "success" and log.user_feedback != "negative":
        training_examples.append({
            "messages": [
                {"role": "user", "content": log.user_input},
                {"role": "assistant", "content": log.model_output}
            ]
        })
```
Supplement with Synthetic Data
If your API logs are insufficient (fewer than 500 high-quality examples), augment with synthetic data:
- Use the cloud API to generate variations of your best examples
- Create examples for edge cases not well-represented in logs
- Validate synthetic examples manually (sample 10-20%)
Target Dataset Sizes
| Feature | Minimum | Recommended |
|---|---|---|
| Classification | 500 | 1,000 |
| Smart replies | 500 | 2,000 |
| Chat | 1,000 | 3,000 |
| Summarization | 500 | 2,000 |
Phase 3: Fine-Tune (Weeks 3-4)
Select Base Model
| Feature Type | Recommended Model | Size |
|---|---|---|
| Classification, tagging | Llama 3.2 1B | 600MB (Q4) |
| Smart replies, suggestions | Llama 3.2 1B | 600MB (Q4) |
| Chat, conversation | Llama 3.2 3B | 1.7GB (Q4) |
| Summarization | Llama 3.2 3B | 1.7GB (Q4) |
Train with LoRA
Upload your training data to a fine-tuning platform. Configure LoRA parameters:
- Rank: 16-32 (1B) or 32-64 (3B)
- Learning rate: 1e-4 to 2e-4
- Epochs: 3-5
- Evaluation split: 10%
Platforms like Ertas handle the training infrastructure: select your base model, upload data, configure parameters, train on cloud GPUs, and export GGUF.
Evaluate Against Baseline
Run your evaluation set through the fine-tuned model:
```python
# Same evaluation set and scoring as the cloud baseline from Phase 1
for example in evaluation_set:
    on_device_response = run_local_model(example.input)
    on_device_score = evaluate(on_device_response, example.expected)
    compare_to_baseline(example.id, on_device_score)
```
Pass criteria: On-device accuracy within 3% of cloud baseline on primary metrics.
Iterate if Needed
If the model does not meet your quality bar:
- Add more training examples for failing categories
- Increase LoRA rank for more capacity
- Try a larger base model (1B to 3B)
- Adjust hyperparameters (lower learning rate, more epochs)
Re-evaluate after each change. Most models reach production quality within 2-3 training iterations.
Phase 4: Integrate (Weeks 4-5)
Add llama.cpp to Your Project
Follow the platform-specific integration guides:
- iOS: Swift Package or pre-built framework
- Android: llama.android library or NDK build
- React Native: llama.rn package
Build the AI Provider Abstraction
Create a common interface that both cloud and on-device providers implement:
```swift
protocol AiProvider {
    func generate(prompt: String, maxTokens: Int) async throws -> AiResponse
    var isAvailable: Bool { get }
}

class CloudProvider: AiProvider { /* existing implementation */ }
class OnDeviceProvider: AiProvider { /* new llama.cpp implementation */ }
```
Implement Model Delivery
Choose bundling or post-install download. Set up CDN hosting for the GGUF file. Implement download progress UI, SHA256 verification, and storage management.
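Checksum verification is worth wiring up before anything else depends on the downloaded file. A minimal sketch using CryptoKit, assuming the expected hash ships in your own release manifest:

```swift
import CryptoKit
import Foundation

// Verify a downloaded GGUF against the SHA256 published in your release
// manifest, hashing in chunks so a multi-GB model never sits in memory.
func verifyModel(at fileURL: URL, expectedSHA256: String) throws -> Bool {
    let handle = try FileHandle(forReadingFrom: fileURL)
    defer { try? handle.close() }

    var hasher = SHA256()
    while let chunk = try handle.read(upToCount: 4 * 1024 * 1024), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    let hex = hasher.finalize().map { String(format: "%02x", $0) }.joined()
    return hex == expectedSHA256.lowercased()
}
```

Delete the file and retry the download on a mismatch; never hand an unverified file to llama.cpp.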
Add Routing Logic
Route requests to cloud or on-device based on model availability and A/B test cohort:
```swift
func getProvider(for userId: String) -> AiProvider {
    let cohort = getCohort(userId)
    if cohort == .onDevice && onDeviceProvider.isAvailable {
        return onDeviceProvider
    }
    return cloudProvider
}
```
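The getCohort call above needs to be deterministic so a user stays in the same bucket across sessions and ramp-up stages. A sketch using a stable hash of the user ID (FNV-1a here, because Swift's built-in hashValue is randomly seeded per launch); the extra rolloutPercent parameter is an assumption added to support the gradual ramp in Phase 5:

```swift
enum Cohort { case onDevice, cloud }

// Deterministic bucketing: hash the user ID into 0-99 and compare against
// the current rollout percentage (10, 50, 75, 100 as the phases advance).
func getCohort(_ userId: String, rolloutPercent: UInt64) -> Cohort {
    var hash: UInt64 = 0xcbf29ce484222325  // FNV-1a offset basis
    for byte in userId.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3       // FNV-1a prime, wrapping multiply
    }
    return (hash % 100) < rolloutPercent ? .onDevice : .cloud
}
```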
Phase 5: A/B Test (Weeks 5-7)
Gradual Rollout
- Week 5: 10% on-device (monitor for crashes, errors)
- Week 6: 50/50 split (gather comparison metrics)
- Week 7: Analyze results
Key Metrics to Watch
- Task completion rate (primary)
- Feature engagement (D7 retention)
- Latency (TTFT and total)
- Crash rate
- User feedback (if you have a feedback mechanism)
Decision Criteria
- Migrate if task completion is within 3% of the cloud baseline and crash rates have not increased.
- Iterate if quality is lower but close (under a 5% gap): re-train with more data.
- Hold if the quality gap is significant (over 5%): investigate root causes before proceeding.
Phase 6: Migrate (Weeks 7-8)
Full Rollout
Ramp to 100% on-device over 1-2 weeks:
- 75% on-device, 25% cloud (1 week)
- 100% on-device (keep cloud as fallback)
Keep Cloud as Fallback
Do not remove the cloud API integration immediately. Keep it as a fallback for:
- Devices that cannot run the model (under 4GB RAM)
- Model download failures
- Emergency rollback if a model quality issue is discovered
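One way to keep the fallback path honest is to fold it into the provider abstraction itself, so call sites never branch. A sketch built on the AiProvider protocol from Phase 4:

```swift
// Wraps the on-device provider and falls back to cloud when the model
// is unavailable or local generation throws.
class FallbackProvider: AiProvider {
    private let primary: AiProvider    // on-device
    private let fallback: AiProvider   // cloud

    init(primary: AiProvider, fallback: AiProvider) {
        self.primary = primary
        self.fallback = fallback
    }

    var isAvailable: Bool { primary.isAvailable || fallback.isAvailable }

    func generate(prompt: String, maxTokens: Int) async throws -> AiResponse {
        if primary.isAvailable {
            do {
                return try await primary.generate(prompt: prompt, maxTokens: maxTokens)
            } catch {
                // Swallow the local failure but log it: fallback usage is the
                // metric that decides when cloud can be decommissioned.
            }
        }
        return try await fallback.generate(prompt: prompt, maxTokens: maxTokens)
    }
}
```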
Decommission Cloud (Gradually)
After 30 days at 100% on-device with stable metrics:
- Reduce cloud API tier (lower your rate limit commitment)
- After 90 days: evaluate whether to keep cloud as fallback or remove entirely
- Track cloud fallback usage. If it is under 1% of requests, it is safe to remove.
Phase 7: Operate (Ongoing)
Model Update Pipeline
Establish a regular update cycle:
- Collect new training data from user interactions
- Re-train monthly or quarterly
- Evaluate against the previous model version
- Push via OTA model updates
- Monitor quality metrics after each update
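The OTA step can be as simple as polling a versioned manifest on your CDN. A minimal sketch, assuming a hypothetical manifest JSON like {"version": 3, "url": "...", "sha256": "..."} published alongside each GGUF release:

```swift
import Foundation

// Hypothetical manifest published next to each GGUF on the CDN.
struct ModelManifest: Decodable {
    let version: Int
    let url: URL
    let sha256: String
}

// Returns the manifest to download, or nil if the installed model is current.
func checkForModelUpdate(manifestURL: URL, installedVersion: Int) async throws -> ModelManifest? {
    let (data, _) = try await URLSession.shared.data(from: manifestURL)
    let manifest = try JSONDecoder().decode(ModelManifest.self, from: data)
    return manifest.version > installedVersion ? manifest : nil
}
```

Download, verify the checksum with the same routine from Phase 4, and only swap the active model after verification passes.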
Monitoring
Track on-device AI health metrics:
- Inference latency (P50, P95)
- Model load success rate
- Generation quality (via user feedback/actions)
- Memory usage and crash rates
- Model download success rate
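Percentiles matter more than averages for latency, since the slow tail is what users notice. A client-side sketch using the nearest-rank method, assuming you flush these samples to your analytics backend periodically:

```swift
// Collects inference latencies (in seconds) and reports percentiles via
// nearest rank. Reset after each analytics flush.
struct LatencyTracker {
    private var samples: [Double] = []

    mutating func record(_ seconds: Double) {
        samples.append(seconds)
    }

    // p in 0...100; e.g. percentile(50) for P50, percentile(95) for P95.
    func percentile(_ p: Double) -> Double? {
        guard !samples.isEmpty else { return nil }
        let sorted = samples.sorted()
        let rank = Int((p / 100.0 * Double(sorted.count)).rounded(.up)) - 1
        return sorted[max(0, min(rank, sorted.count - 1))]
    }
}
```

Track TTFT and total generation time as separate trackers, since they regress for different reasons.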
Cost Tracking
Document the cost savings:
- Cloud API bill before migration
- Cloud API bill after migration (fallback only)
- CDN costs for model distribution
- Fine-tuning costs (per re-training run)
- Net savings per month
As a worked example using the Phase 1 inventory: $450 + $180 + $120 + $270 = $1,020/month in cloud API spend before migration is the number your savings are measured against.
Timeline Summary
| Phase | Duration | Key Deliverable |
|---|---|---|
| Assess | Week 1 | Feature inventory, quality baselines |
| Prepare data | Weeks 1-3 | Cleaned, formatted training dataset |
| Fine-tune | Weeks 3-4 | GGUF model meeting quality bar |
| Integrate | Weeks 4-5 | llama.cpp in app, model delivery working |
| A/B test | Weeks 5-7 | Statistical evidence of quality parity |
| Migrate | Weeks 7-8 | 100% on-device with cloud fallback |
| Operate | Ongoing | Regular model updates, monitoring |
Total: 8 weeks from start to full migration. Subsequent model updates (re-train and deploy) take days, not weeks.
The fine-tuning step is where platforms like Ertas save the most time. Upload your API log data, select your base model, train, export GGUF. The platform handles the GPU infrastructure, training optimization, and GGUF conversion. Your focus stays on data quality and app integration.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.