
Migrating from Cloud API to On-Device AI: The Complete Guide
A step-by-step migration plan for moving your mobile app from cloud AI APIs to on-device inference. Data extraction, fine-tuning, integration, testing, rollout, and monitoring.
You have a mobile app with AI features powered by a cloud API. It works, but the costs are growing, the latency is noticeable, and you want offline support. This guide walks through the complete migration from cloud API to on-device inference.
The migration is not a rewrite. It is a gradual transition where on-device AI replaces the cloud API for specific features, validated at each step.
Phase 1: Assess (Week 1)
Inventory Your AI Features
List every feature that calls a cloud AI API:
| Feature | API Calls/Day | Avg Tokens/Call | Monthly Cost | Offline Needed? |
|---|---|---|---|---|
| Chat assistant | 5,000 | 2,500 | $450 | Nice to have |
| Content categorization | 12,000 | 800 | $180 | Yes |
| Smart replies | 8,000 | 600 | $120 | Yes |
| Summarization | 2,000 | 3,000 | $270 | No |
Prioritize by Migration Value
Score each feature:
- Cost impact: High-volume features save the most money
- Latency sensitivity: Features where speed matters most benefit most from on-device
- Offline value: Features that users need without connectivity
- Complexity: Simple tasks (classification, short generation) migrate more easily
Start with the highest-value, lowest-complexity feature. Classification and smart replies are typical first candidates.
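If you want the ranking to be explicit rather than gut feel, a simple weighted score works. A sketch with illustrative 1-5 ratings and weights (the specific values are not from this guide; tune them to your own priorities):

```swift
// Rate each criterion 1-5, weight the upside, penalize complexity.
// Weights and ratings are illustrative, not prescriptive.
struct FeatureScore {
    let name: String
    let costImpact: Int          // high-volume features save the most
    let latencySensitivity: Int  // speed-sensitive features gain the most
    let offlineValue: Int        // needed without connectivity?
    let complexity: Int          // higher = harder to migrate

    var migrationScore: Double {
        0.35 * Double(costImpact)
            + 0.25 * Double(latencySensitivity)
            + 0.25 * Double(offlineValue)
            - 0.15 * Double(complexity)
    }
}
```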
Set Quality Baselines
Before migrating, measure your cloud AI's performance on each feature:
```python
# Evaluation set: 200-500 examples with human-verified correct answers
evaluation_set = load_evaluation_data("eval_set.jsonl")

for example in evaluation_set:
    cloud_response = call_cloud_api(example.input)
    cloud_score = evaluate(cloud_response, example.expected)
    log_baseline(example.id, cloud_score)
```
The baseline is what your on-device model must match or exceed.
Phase 2: Prepare Training Data (Weeks 1-3)
Extract from API Logs
Your existing API call logs are your primary training data source:
- Export API logs (input/output pairs) from your logging infrastructure
- Filter for quality (successful responses, user-accepted outputs)
- Anonymize (remove PII)
- Format into the standard chat training format
```python
# Extract training data from API logs, keeping only successful calls
# the user did not reject. Inputs should already be anonymized (PII
# stripped) per the steps above.
training_examples = []
for log in api_logs:
    if log.status == "success" and log.user_feedback != "negative":
        training_examples.append({
            "messages": [
                {"role": "user", "content": log.user_input},
                {"role": "assistant", "content": log.model_output}
            ]
        })
```
Supplement with Synthetic Data
If your API logs are insufficient (fewer than 500 high-quality examples), augment with synthetic data:
- Use the cloud API to generate variations of your best examples
- Create examples for edge cases not well-represented in logs
- Validate synthetic examples manually (sample 10-20%)
Target Dataset Sizes
| Feature | Minimum | Recommended |
|---|---|---|
| Classification | 500 | 1,000 |
| Smart replies | 500 | 2,000 |
| Chat | 1,000 | 3,000 |
| Summarization | 500 | 2,000 |
Phase 3: Fine-Tune (Weeks 3-4)
Select Base Model
| Feature Type | Recommended Model | Size |
|---|---|---|
| Classification, tagging | Llama 3.2 1B | 600MB (Q4) |
| Smart replies, suggestions | Llama 3.2 1B | 600MB (Q4) |
| Chat, conversation | Llama 3.2 3B | 1.7GB (Q4) |
| Summarization | Llama 3.2 3B | 1.7GB (Q4) |
Train with LoRA
Upload your training data to a fine-tuning platform. Configure LoRA parameters:
- Rank: 16-32 (1B) or 32-64 (3B)
- Learning rate: 1e-4 to 2e-4
- Epochs: 3-5
- Evaluation split: 10%
Platforms like Ertas handle the training infrastructure: select your base model, upload data, configure parameters, train on cloud GPUs, and export GGUF.
Evaluate Against Baseline
Run your evaluation set through the fine-tuned model:
```python
# Same evaluation set and scoring as the cloud baseline from Phase 1
for example in evaluation_set:
    on_device_response = run_local_model(example.input)
    on_device_score = evaluate(on_device_response, example.expected)
    compare_to_baseline(example.id, on_device_score)
```
Pass criteria: On-device accuracy within 3% of cloud baseline on primary metrics.
Iterate if Needed
If the model does not meet your quality bar:
- Add more training examples for failing categories
- Increase LoRA rank for more capacity
- Try a larger base model (1B to 3B)
- Adjust hyperparameters (lower learning rate, more epochs)
Re-evaluate after each change. Most models reach production quality within 2-3 training iterations.
Phase 4: Integrate (Weeks 4-5)
Add llama.cpp to Your Project
Follow the platform-specific integration guides:
- iOS: Swift Package or pre-built framework
- Android: llama.android library or NDK build
- React Native: llama.rn package
Build the AI Provider Abstraction
Create a common interface that both cloud and on-device providers implement:
```swift
protocol AiProvider {
    func generate(prompt: String, maxTokens: Int) async throws -> AiResponse
    var isAvailable: Bool { get }
}

class CloudProvider: AiProvider { /* existing implementation */ }
class OnDeviceProvider: AiProvider { /* new llama.cpp implementation */ }
```
Implement Model Delivery
Choose bundling or post-install download. Set up CDN hosting for the GGUF file. Implement download progress UI, SHA256 verification, and storage management.
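Checksum verification is worth wiring up before anything else depends on the downloaded file. A minimal sketch using CryptoKit, assuming the expected hash ships in your own release manifest:

```swift
import CryptoKit
import Foundation

// Verify a downloaded GGUF against the SHA256 published in your release
// manifest, hashing in chunks so a multi-GB model never sits in memory.
func verifyModel(at fileURL: URL, expectedSHA256: String) throws -> Bool {
    let handle = try FileHandle(forReadingFrom: fileURL)
    defer { try? handle.close() }

    var hasher = SHA256()
    while let chunk = try handle.read(upToCount: 4 * 1024 * 1024), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    let hex = hasher.finalize().map { String(format: "%02x", $0) }.joined()
    return hex == expectedSHA256.lowercased()
}
```

Delete the file and retry the download on a mismatch; never hand an unverified file to llama.cpp.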
Add Routing Logic
Route requests to cloud or on-device based on model availability and A/B test cohort:
```swift
func getProvider(for userId: String) -> AiProvider {
    let cohort = getCohort(userId)
    if cohort == .onDevice && onDeviceProvider.isAvailable {
        return onDeviceProvider
    }
    return cloudProvider
}
```
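The getCohort call above needs to be deterministic so a user stays in the same bucket across sessions and ramp-up stages. A sketch using a stable hash of the user ID (FNV-1a here, because Swift's built-in hashValue is randomly seeded per launch); the extra rolloutPercent parameter is an assumption added to support the gradual ramp in Phase 5:

```swift
enum Cohort { case onDevice, cloud }

// Deterministic bucketing: hash the user ID into 0-99 and compare against
// the current rollout percentage (10, 50, 75, 100 as the phases advance).
func getCohort(_ userId: String, rolloutPercent: UInt64) -> Cohort {
    var hash: UInt64 = 0xcbf29ce484222325  // FNV-1a offset basis
    for byte in userId.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3       // FNV-1a prime, wrapping multiply
    }
    return (hash % 100) < rolloutPercent ? .onDevice : .cloud
}
```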
Phase 5: A/B Test (Weeks 5-7)
Gradual Rollout
- Week 5: 10% on-device (monitor for crashes, errors)
- Week 6: 50/50 split (gather comparison metrics)
- Week 7: Analyze results
Key Metrics to Watch
- Task completion rate (primary)
- Feature engagement (D7 retention)
- Latency (TTFT and total)
- Crash rate
- User feedback (if you have a feedback mechanism)
Decision Criteria
- Migrate if task completion is within 3% of the cloud baseline and crash rates have not increased.
- Iterate if quality is lower but close (under a 5% gap): re-train with more data.
- Hold if the quality gap is significant (over 5%): investigate root causes before proceeding.
Phase 6: Migrate (Weeks 7-8)
Full Rollout
Ramp to 100% on-device over 1-2 weeks:
- 75% on-device, 25% cloud (1 week)
- 100% on-device (keep cloud as fallback)
Keep Cloud as Fallback
Do not remove the cloud API integration immediately. Keep it as a fallback for:
- Devices that cannot run the model (under 4GB RAM)
- Model download failures
- Emergency rollback if a model quality issue is discovered
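One way to keep the fallback path honest is to fold it into the provider abstraction itself, so call sites never branch. A sketch built on the AiProvider protocol from Phase 4:

```swift
// Wraps the on-device provider and falls back to cloud when the model
// is unavailable or local generation throws.
class FallbackProvider: AiProvider {
    private let primary: AiProvider    // on-device
    private let fallback: AiProvider   // cloud

    init(primary: AiProvider, fallback: AiProvider) {
        self.primary = primary
        self.fallback = fallback
    }

    var isAvailable: Bool { primary.isAvailable || fallback.isAvailable }

    func generate(prompt: String, maxTokens: Int) async throws -> AiResponse {
        if primary.isAvailable {
            do {
                return try await primary.generate(prompt: prompt, maxTokens: maxTokens)
            } catch {
                // Swallow the local failure but log it: fallback usage is the
                // metric that decides when cloud can be decommissioned.
            }
        }
        return try await fallback.generate(prompt: prompt, maxTokens: maxTokens)
    }
}
```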
Decommission Cloud (Gradually)
After 30 days at 100% on-device with stable metrics:
- Reduce cloud API tier (lower your rate limit commitment)
- After 90 days: evaluate whether to keep cloud as fallback or remove entirely
- Track cloud fallback usage. If it is under 1% of requests, it is safe to remove.
Phase 7: Operate (Ongoing)
Model Update Pipeline
Establish a regular update cycle:
- Collect new training data from user interactions
- Re-train monthly or quarterly
- Evaluate against the previous model version
- Push via OTA model updates
- Monitor quality metrics after each update
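The OTA step can be as simple as polling a versioned manifest on your CDN. A minimal sketch, assuming a hypothetical manifest JSON like {"version": 3, "url": "...", "sha256": "..."} published alongside each GGUF release:

```swift
import Foundation

// Hypothetical manifest published next to each GGUF on the CDN.
struct ModelManifest: Decodable {
    let version: Int
    let url: URL
    let sha256: String
}

// Returns the manifest to download, or nil if the installed model is current.
func checkForModelUpdate(manifestURL: URL, installedVersion: Int) async throws -> ModelManifest? {
    let (data, _) = try await URLSession.shared.data(from: manifestURL)
    let manifest = try JSONDecoder().decode(ModelManifest.self, from: data)
    return manifest.version > installedVersion ? manifest : nil
}
```

Download, verify the checksum with the same routine from Phase 4, and only swap the active model after verification passes.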
Monitoring
Track on-device AI health metrics:
- Inference latency (P50, P95)
- Model load success rate
- Generation quality (via user feedback/actions)
- Memory usage and crash rates
- Model download success rate
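Percentiles matter more than averages for latency, since the slow tail is what users notice. A client-side sketch using the nearest-rank method, assuming you flush these samples to your analytics backend periodically:

```swift
// Collects inference latencies (in seconds) and reports percentiles via
// nearest rank. Reset after each analytics flush.
struct LatencyTracker {
    private var samples: [Double] = []

    mutating func record(_ seconds: Double) {
        samples.append(seconds)
    }

    // p in 0...100; e.g. percentile(50) for P50, percentile(95) for P95.
    func percentile(_ p: Double) -> Double? {
        guard !samples.isEmpty else { return nil }
        let sorted = samples.sorted()
        let rank = Int((p / 100.0 * Double(sorted.count)).rounded(.up)) - 1
        return sorted[max(0, min(rank, sorted.count - 1))]
    }
}
```

Track TTFT and total generation time as separate trackers, since they regress for different reasons.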
Cost Tracking
Document the cost savings:
- Cloud API bill before migration
- Cloud API bill after migration (fallback only)
- CDN costs for model distribution
- Fine-tuning costs (per re-training run)
- Net savings per month
As a worked example using the Phase 1 inventory: $450 + $180 + $120 + $270 = $1,020/month in cloud API spend before migration is the number your savings are measured against.
Timeline Summary
| Phase | Duration | Key Deliverable |
|---|---|---|
| Assess | Week 1 | Feature inventory, quality baselines |
| Prepare data | Weeks 1-3 | Cleaned, formatted training dataset |
| Fine-tune | Weeks 3-4 | GGUF model meeting quality bar |
| Integrate | Weeks 4-5 | llama.cpp in app, model delivery working |
| A/B test | Weeks 5-7 | Statistical evidence of quality parity |
| Migrate | Weeks 7-8 | 100% on-device with cloud fallback |
| Operate | Ongoing | Regular model updates, monitoring |
Total: 8 weeks from start to full migration. Subsequent model updates (re-train and deploy) take days, not weeks.
The fine-tuning step is where platforms like Ertas save the most time. Upload your API log data, select your base model, train, export GGUF. The platform handles the GPU infrastructure, training optimization, and GGUF conversion. Your focus stays on data quality and app integration.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.