
    Migrating from Cloud API to On-Device AI: The Complete Guide

    A step-by-step migration plan for moving your mobile app from cloud AI APIs to on-device inference. Data extraction, fine-tuning, integration, testing, rollout, and monitoring.

    Ertas Team

    You have a mobile app with AI features powered by a cloud API. It works, but the costs are growing, the latency is noticeable, and you want offline support. This guide walks through the complete migration from cloud API to on-device inference.

    The migration is not a rewrite. It is a gradual transition where on-device AI replaces the cloud API for specific features, validated at each step.

    Phase 1: Assess (Week 1)

    Inventory Your AI Features

    List every feature that calls a cloud AI API:

    Feature | API Calls/Day | Avg Tokens | Monthly Cost | Offline Needed?
    Chat assistant | 5,000 | 2,500 | $450 | Nice to have
    Content categorization | 12,000 | 800 | $180 | Yes
    Smart replies | 8,000 | 600 | $120 | Yes
    Summarization | 2,000 | 3,000 | $270 | No

    Prioritize by Migration Value

    Score each feature:

    • Cost impact: High-volume features save the most money
    • Latency sensitivity: Features where speed matters most benefit most from on-device
    • Offline value: Features that users need without connectivity
    • Complexity: Simple tasks (classification, short generation) migrate more easily

    Start with the highest-value, lowest-complexity feature. Classification and smart replies are typical first candidates.
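
    To make the scoring concrete, a simple weighted sum is usually enough. The sketch below is illustrative only: the 1-5 scale and the weights are assumptions to tune for your own app, not a prescribed formula.

    // Illustrative scoring sketch; the scale and weights are assumptions, not a standard.
    struct FeatureCandidate {
        let name: String
        let costImpact: Int          // 1-5: how much money migration would save
        let latencySensitivity: Int  // 1-5: how much users feel the network round-trip
        let offlineValue: Int        // 1-5: how important the feature is without connectivity
        let complexity: Int          // 1-5: how hard the task is for a small on-device model

        // Higher is better; complexity counts against the feature.
        var migrationScore: Double {
            return Double(costImpact) + 0.8 * Double(latencySensitivity) + 0.8 * Double(offlineValue) - 1.2 * Double(complexity)
        }
    }

    let candidates = [
        FeatureCandidate(name: "Content categorization", costImpact: 4, latencySensitivity: 3, offlineValue: 5, complexity: 1),
        FeatureCandidate(name: "Summarization", costImpact: 4, latencySensitivity: 2, offlineValue: 1, complexity: 4),
    ]
    let prioritized = candidates.sorted { $0.migrationScore > $1.migrationScore }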

    Set Quality Baselines

    Before migrating, measure your cloud AI's performance on each feature:

    # Evaluation set: 200-500 examples with human-verified correct answers
    evaluation_set = load_evaluation_data("eval_set.jsonl")
    
    for example in evaluation_set:
        cloud_response = call_cloud_api(example.input)
        cloud_score = evaluate(cloud_response, example.expected)
        log_baseline(example.id, cloud_score)
    

    The baseline is what your on-device model must match or exceed.

    Phase 2: Prepare Training Data (Weeks 1-3)

    Extract from API Logs

    Your existing API call logs are your primary training data source:

    1. Export API logs (input/output pairs) from your logging infrastructure
    2. Filter for quality (successful responses, user-accepted outputs)
    3. Anonymize (remove PII)
    4. Format into the standard chat training format:

    # Extract training data from API logs
    training_examples = []
    for log in api_logs:
        if log.status == "success" and log.user_feedback != "negative":
            training_examples.append({
                "messages": [
                    {"role": "user", "content": log.user_input},
                    {"role": "assistant", "content": log.model_output}
                ]
            })
    

    Supplement with Synthetic Data

    If your API logs are insufficient (under 500 quality examples), augment with synthetic data:

    • Use the cloud API to generate variations of your best examples
    • Create examples for edge cases not well-represented in logs
    • Validate synthetic examples manually (sample 10-20%)

    Target Dataset Sizes

    Feature | Minimum | Recommended
    Classification | 500 | 1,000
    Smart replies | 500 | 2,000
    Chat | 1,000 | 3,000
    Summarization | 500 | 2,000

    Phase 3: Fine-Tune (Weeks 3-4)

    Select Base Model

    Feature Type | Recommended Model | Size
    Classification, tagging | Llama 3.2 1B | 600MB (Q4)
    Smart replies, suggestions | Llama 3.2 1B | 600MB (Q4)
    Chat, conversation | Llama 3.2 3B | 1.7GB (Q4)
    Summarization | Llama 3.2 3B | 1.7GB (Q4)

    Train with LoRA

    Upload your training data to a fine-tuning platform. Configure LoRA parameters:

    • Rank: 16-32 (1B) or 32-64 (3B)
    • Learning rate: 1e-4 to 2e-4
    • Epochs: 3-5
    • Evaluation split: 10%

    Platforms like Ertas handle the training infrastructure: select your base model, upload data, configure parameters, train on cloud GPUs, and export GGUF.

    Evaluate Against Baseline

    Run your evaluation set through the fine-tuned model:

    for example in evaluation_set:
        on_device_response = run_local_model(example.input)
        on_device_score = evaluate(on_device_response, example.expected)
        compare_to_baseline(example.id, on_device_score)
    

    Pass criteria: On-device accuracy within 3% of cloud baseline on primary metrics.

    Iterate if Needed

    If the model does not meet your quality bar:

    1. Add more training examples for failing categories
    2. Increase LoRA rank for more capacity
    3. Try a larger base model (1B to 3B)
    4. Adjust hyperparameters (lower learning rate, more epochs)

    Re-evaluate after each change. Most models reach production quality within 2-3 training iterations.

    Phase 4: Integrate (Weeks 4-5)

    Add llama.cpp to Your Project

    Follow the platform-specific integration guides:

    • iOS: Swift Package or pre-built framework
    • Android: llama.android library or NDK build
    • React Native: llama.rn package

    Build the AI Provider Abstraction

    Create a common interface that both cloud and on-device providers implement:

    protocol AiProvider {
        func generate(prompt: String, maxTokens: Int) async throws -> AiResponse
        var isAvailable: Bool { get }
    }
    
    class CloudProvider: AiProvider { /* existing implementation */ }
    class OnDeviceProvider: AiProvider { /* new llama.cpp implementation */ }
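
    Expanding the OnDeviceProvider placeholder: the exact API depends on which llama.cpp binding you pick in the next step, so LlamaContext and its completion call below are hypothetical stand-ins for your binding's real types, and AiResponse's shape is whatever you defined for the protocol.

    enum AiError: Error {
        case modelNotLoaded
    }

    class OnDeviceProvider: AiProvider {
        // Hypothetical wrapper around your llama.cpp binding's context object.
        private var llama: LlamaContext?

        var isAvailable: Bool {
            // True once the GGUF is on disk and the context loaded successfully.
            llama != nil
        }

        func generate(prompt: String, maxTokens: Int) async throws -> AiResponse {
            guard let llama else { throw AiError.modelNotLoaded }
            // Most bindings expose some prompt-in, text-out completion call; the name is illustrative.
            let text = try await llama.complete(prompt: prompt, maxTokens: maxTokens)
            return AiResponse(text: text)
        }
    }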
    

    Implement Model Delivery

    Choose bundling or post-install download. Set up CDN hosting for the GGUF file. Implement download progress UI, SHA256 verification, and storage management.
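
    As a rough sketch of the download-and-verify step (assuming the expected SHA256 and model URL come from your own model manifest; names and paths here are placeholders):

    import CryptoKit
    import Foundation

    // Downloads a GGUF file and verifies its SHA256 before accepting it.
    func downloadAndVerifyModel(from modelURL: URL, expectedSHA256: String) async throws -> URL {
        let (tempURL, _) = try await URLSession.shared.download(from: modelURL)

        // Hash in chunks so a multi-gigabyte model never sits in memory at once.
        var hasher = SHA256()
        let file = try FileHandle(forReadingFrom: tempURL)
        while let chunk = try file.read(upToCount: 1_048_576), !chunk.isEmpty {
            hasher.update(data: chunk)
        }
        try file.close()
        let hex = hasher.finalize().map { String(format: "%02x", $0) }.joined()

        guard hex == expectedSHA256.lowercased() else {
            throw URLError(.cannotDecodeContentData)  // corrupted or tampered download
        }

        // Keep the model in Application Support so the OS does not purge it like a cache.
        let destination = try FileManager.default
            .url(for: .applicationSupportDirectory, in: .userDomainMask, appropriateFor: nil, create: true)
            .appendingPathComponent("model.gguf")
        try? FileManager.default.removeItem(at: destination)
        try FileManager.default.moveItem(at: tempURL, to: destination)
        return destination
    }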

    Add Routing Logic

    Route requests to cloud or on-device based on model availability and A/B test cohort:

    func getProvider(for userId: String) -> AiProvider {
        let cohort = getCohort(userId)
    
        if cohort == .onDevice && onDeviceProvider.isAvailable {
            return onDeviceProvider
        }
        return cloudProvider
    }
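
    getCohort is left undefined above. The important properties are that assignment is stable per user and easy to ramp; one common approach hashes the user ID against a rollout percentage served by your remote config (the rolloutPercentage parameter is an assumption about where that number comes from):

    import CryptoKit
    import Foundation

    enum Cohort {
        case onDevice
        case cloud
    }

    // Stable bucketing: the same userId always lands in the same bucket, so raising
    // the rollout percentage (10 -> 50 -> 100) only ever moves users toward on-device.
    func getCohort(_ userId: String, rolloutPercentage: Int) -> Cohort {
        let digest = Array(SHA256.hash(data: Data(userId.utf8)))
        let bucket = (Int(digest[0]) << 8 | Int(digest[1])) % 100  // 0-99
        return bucket < rolloutPercentage ? .onDevice : .cloud
    }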
    

    Phase 5: A/B Test (Weeks 5-7)

    Gradual Rollout

    • Week 5: 10% on-device (monitor for crashes, errors)
    • Week 6: 50/50 split (gather comparison metrics)
    • Week 7: Analyze results

    Key Metrics to Watch

    • Task completion rate (primary)
    • Feature engagement (D7 retention)
    • Latency (TTFT and total)
    • Crash rate
    • User feedback (if you have a feedback mechanism)

    Decision Criteria

    • Migrate: task completion is within 3% of cloud and there is no increase in crashes.
    • Iterate: quality is lower but close (under a 5% gap). Re-train with more data.
    • Hold: the quality gap is significant (over 5%). Investigate root causes.

    Phase 6: Migrate (Weeks 7-8)

    Full Rollout

    Ramp to 100% on-device over 1-2 weeks:

    • 75% on-device, 25% cloud (1 week)
    • 100% on-device (keep cloud as fallback)

    Keep Cloud as Fallback

    Do not remove the cloud API integration immediately. Keep it as a fallback for:

    • Devices that cannot run the model (under 4GB RAM)
    • Model download failures
    • Emergency rollback if a model quality issue is discovered
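
    Because both providers share the Phase 4 interface, the fallback can stay invisible to feature code; a minimal sketch:

    // Tries on-device first and falls back to the cloud API on failure.
    class FallbackProvider: AiProvider {
        private let onDevice: AiProvider
        private let cloud: AiProvider

        init(onDevice: AiProvider, cloud: AiProvider) {
            self.onDevice = onDevice
            self.cloud = cloud
        }

        var isAvailable: Bool { onDevice.isAvailable || cloud.isAvailable }

        func generate(prompt: String, maxTokens: Int) async throws -> AiResponse {
            if onDevice.isAvailable {
                do {
                    return try await onDevice.generate(prompt: prompt, maxTokens: maxTokens)
                } catch {
                    // Load or generation failures land here; fall through to the cloud path.
                }
            }
            return try await cloud.generate(prompt: prompt, maxTokens: maxTokens)
        }
    }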

    Decommission Cloud (Gradually)

    After 30 days at 100% on-device with stable metrics:

    • Reduce cloud API tier (lower your rate limit commitment)
    • After 90 days: evaluate whether to keep cloud as fallback or remove entirely
    • Track cloud fallback usage. If it is under 1% of requests, it is safe to remove.

    Phase 7: Operate (Ongoing)

    Model Update Pipeline

    Establish a regular update cycle:

    1. Collect new training data from user interactions
    2. Re-train monthly or quarterly
    3. Evaluate against the previous model version
    4. Push via OTA model updates
    5. Monitor quality metrics after each update
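
    On the app side, step 4 usually boils down to checking a small version manifest before downloading anything. A sketch, where the manifest URL and field names are assumptions about your own CDN layout:

    import Foundation

    // Hypothetical manifest served next to the model files on your CDN.
    struct ModelManifest: Decodable {
        let version: Int
        let url: URL
        let sha256: String
    }

    // Returns the manifest to download if the server has a newer model than the one installed.
    func checkForModelUpdate(manifestURL: URL, installedVersion: Int) async throws -> ModelManifest? {
        let (data, _) = try await URLSession.shared.data(from: manifestURL)
        let manifest = try JSONDecoder().decode(ModelManifest.self, from: data)
        return manifest.version > installedVersion ? manifest : nil
    }

    The download itself can reuse the same SHA256 verification as the initial model delivery in Phase 4.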

    Monitoring

    Track on-device AI health metrics:

    • Inference latency (P50, P95)
    • Model load success rate
    • Generation quality (via user feedback/actions)
    • Memory usage and crash rates
    • Model download success rate
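
    A lightweight way to capture the latency numbers is to time each generation at the provider boundary and hand the figures to whatever analytics you already use (logMetric below is a placeholder, and TTFT needs a streaming API, so only total latency is shown):

    import Foundation

    // Placeholder for your analytics pipeline.
    func logMetric(_ name: String, value: Double) { /* send to analytics */ }

    // Wraps any AiProvider call with timing and failure counting.
    func timedGenerate(_ provider: AiProvider, prompt: String, maxTokens: Int) async throws -> AiResponse {
        let start = Date()
        do {
            let response = try await provider.generate(prompt: prompt, maxTokens: maxTokens)
            logMetric("on_device_latency_ms", value: Date().timeIntervalSince(start) * 1000)
            return response
        } catch {
            logMetric("on_device_generation_failure", value: 1)
            throw error
        }
    }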

    Cost Tracking

    Document the cost savings:

    • Cloud API bill before migration
    • Cloud API bill after migration (fallback only)
    • CDN costs for model distribution
    • Fine-tuning costs (per re-training run)
    • Net savings per month

    Timeline Summary

    Phase | Duration | Key Deliverable
    Assess | Week 1 | Feature inventory, quality baselines
    Prepare data | Weeks 1-3 | Cleaned, formatted training dataset
    Fine-tune | Weeks 3-4 | GGUF model meeting quality bar
    Integrate | Weeks 4-5 | llama.cpp in app, model delivery working
    A/B test | Weeks 5-7 | Statistical evidence of quality parity
    Migrate | Weeks 7-8 | 100% on-device with cloud fallback
    Operate | Ongoing | Regular model updates, monitoring

    Total: 8 weeks from start to full migration. Subsequent model updates (re-train and deploy) take days, not weeks.

    The fine-tuning step is where platforms like Ertas save the most time. Upload your API log data, select your base model, train, export GGUF. The platform handles the GPU infrastructure, training optimization, and GGUF conversion. Your focus stays on data quality and app integration.
