    A/B Testing Cloud API vs On-Device AI in Production
    A/B testing · migration · cloud API · on-device AI · production · segment:mobile-builder


    How to run a fair A/B test between your cloud API and on-device model in a live mobile app: cohort design, statistical significance, and the metrics that actually matter.

    Ertas Team

    You have a cloud AI feature in production. You have a fine-tuned on-device model ready to deploy. Before migrating all users, you need evidence that the on-device model meets or exceeds the cloud model's quality.

    An A/B test in production gives you that evidence on real users, real queries, and real behavior.

    What to Test

    The goal is not to prove the on-device model is "as good" as the cloud model. The goal is to measure whether users notice or care about any difference, and to quantify the operational improvements (latency, cost, offline capability).

    Primary Metrics

    | Metric | Why It Matters | How to Measure |
    | --- | --- | --- |
    | Task completion rate | Did users get what they needed? | % of AI interactions that resulted in user action (send, save, accept) |
    | Feature engagement (D7/D30) | Do users keep using the AI feature? | Return rate to AI feature over 7/30 days |
    | Time to first action | Is the UX faster or slower? | Time from query to user's next action |
    | Error/retry rate | Does the AI fail or frustrate? | % of interactions where user retries or abandons |

    Secondary Metrics

    | Metric | Why It Matters |
    | --- | --- |
    | Latency (TTFT, full response) | On-device should win, but verify |
    | Cost per user | Cloud cohort has API costs; on-device has ~$0 |
    | Offline usage | On-device cohort should show AI usage in offline conditions |
    | App crash rate | On-device model loading can cause memory issues |
    | Battery impact | On-device inference uses device resources |

    Metrics to Avoid

    Model quality scores (perplexity, BLEU, ROUGE): Users do not care about perplexity. They care about whether the feature solved their problem. Automated quality metrics are useful during development but not as A/B test primary metrics.

    Response length: Longer is not better. Shorter is not worse. Length is a proxy for nothing useful.

    Cohort Design

    Random Assignment

    Assign users to cohorts on first AI interaction:

    import kotlin.math.absoluteValue

    // Cohort is derived from a deterministic hash of the user ID, so the same
    // user always gets the same cohort across sessions.
    enum class AiCohort { CLOUD, ON_DEVICE }

    fun getAiCohort(userId: String): AiCohort {
        val hash = userId.hashCode().absoluteValue
        return if (hash % 100 < 50) AiCohort.CLOUD else AiCohort.ON_DEVICE
    }
    

    Use the user ID hash (not random) so each user stays in their cohort across sessions.
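    If cohort assignment lives in a TypeScript layer instead (the provider and routing code later in this post is TypeScript), a minimal equivalent might look like the sketch below; `stableHash` is an illustrative 31-based string hash, not a specific library function.

    // Deterministic 32-bit string hash (Java-style 31*h + c) so a user's
    // bucket is stable across sessions and app restarts.
    function stableHash(s: string): number {
      let h = 0;
      for (let i = 0; i < s.length; i++) {
        h = (Math.imul(31, h) + s.charCodeAt(i)) | 0;
      }
      return Math.abs(h);
    }

    type AiCohort = "cloud" | "on_device";

    function getAiCohort(userId: string): AiCohort {
      // 50/50 split; which half maps to on-device is arbitrary, but it never changes
      return stableHash(userId) % 100 < 50 ? "on_device" : "cloud";
    }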

    Cohort Sizes

    | Test Duration | Users Per Cohort | Detectable Effect Size |
    | --- | --- | --- |
    | 1 week | 500 | 10%+ difference |
    | 2 weeks | 1,000 | 5-7% difference |
    | 4 weeks | 2,500 | 3-5% difference |

    For a 50/50 split at 10K MAU, you have 5,000 users per cohort. Two weeks gives you statistically significant results for differences of 5% or more.
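    To sanity-check those cohort sizes against your own baseline rates, a rough per-cohort estimate for comparing two proportions (95% confidence, 80% power) is sketched below; the baseline rate and minimum detectable difference are assumptions you supply for your own feature.

    // n per cohort ≈ (z_alpha/2 + z_beta)^2 * 2 * p * (1 - p) / delta^2
    // with z_alpha/2 = 1.96 (95% confidence) and z_beta = 0.84 (80% power).
    function requiredUsersPerCohort(baselineRate: number, minDetectableDiff: number): number {
      const zAlpha = 1.96;
      const zBeta = 0.84;
      const n = ((zAlpha + zBeta) ** 2 * 2 * baselineRate * (1 - baselineRate)) / minDetectableDiff ** 2;
      return Math.ceil(n);
    }

    // Example: 60% baseline task completion, 5-point minimum detectable difference
    // requiredUsersPerCohort(0.60, 0.05) -> about 1,500 users per cohort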

    Gradual Rollout

    Start conservative (a sketch of the configurable split follows this list):

    1. Week 1: 10% on-device, 90% cloud (catch crashes and critical issues)
    2. Week 2: 25% on-device, 75% cloud (gather initial metrics)
    3. Week 3-4: 50/50 (full A/B test with statistical power)
    4. After results: Ramp to 100% on-device if metrics pass
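    To implement the ramp without reshuffling users, keep each user's bucket fixed and move only the cutoff. A sketch building on the `stableHash` helper above; the on-device percentage is assumed to come from remote config or a server-side flag.

    // rolloutPercent is 10, 25, or 50 depending on the week. Because the bucket is
    // deterministic, raising the cutoff only adds users to the on-device cohort and
    // never flips an existing on-device user back to cloud.
    function getAiCohortWithRollout(userId: string, rolloutPercent: number): AiCohort {
      const bucket = stableHash(userId) % 100;  // 0-99, stable per user
      return bucket < rolloutPercent ? "on_device" : "cloud";
    }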

    Implementation Architecture

    The AI Provider Abstraction

    Both variants should go through the same interface:

    interface AiResponse {
      text: string;
      source: "cloud" | "on_device";
      latencyMs: number;
    }

    interface AiProvider {
      generate(prompt: string, options: GenerateOptions): Promise<AiResponse>;
      isAvailable(): boolean;
    }

    class CloudProvider implements AiProvider {
      async generate(prompt: string, options: GenerateOptions): Promise<AiResponse> {
        const response = await callCloudAPI(prompt, options);  // existing cloud client
        return { text: response.text, source: "cloud", latencyMs: response.latency };
      }
      isAvailable() { return navigator.onLine; }
    }

    class OnDeviceProvider implements AiProvider {
      private modelLoaded = false;  // set true once the model file is loaded

      async generate(prompt: string, options: GenerateOptions): Promise<AiResponse> {
        const response = await llamaGenerate(prompt, options);  // on-device inference runtime
        return { text: response.text, source: "on_device", latencyMs: response.latency };
      }
      isAvailable() { return this.modelLoaded; }
    }
    

    Routing

    function getProvider(userId: string): AiProvider {
      const cohort = getAiCohort(userId);
    
      if (cohort === "on_device" && onDeviceProvider.isAvailable()) {
        return onDeviceProvider;
      }
    
      // Fallback to cloud if on-device model not loaded yet
      return cloudProvider;
    }
    

    Event Logging

    Log every AI interaction with cohort and metrics:

    function logAiInteraction(event: AiInteractionEvent) {
      analytics.track("ai_interaction", {
        cohort: event.cohort,           // "cloud" or "on_device"
        action: event.action,           // "generate", "accept", "retry", "abandon"
        latency_ttft_ms: event.ttft,    // Time to first token
        latency_total_ms: event.total,  // Total response time
        tokens_generated: event.tokens,
        user_action: event.userAction,  // What user did after (send, edit, dismiss)
        offline: !navigator.onLine,
        device_model: getDeviceModel(),
        timestamp: Date.now(),
      });
    }
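    Putting the routing and logging together, one request path might look like the following sketch; `handleAiRequest` is illustrative, and the timing and token fields should come from your runtime's instrumentation rather than the rough stand-ins shown here.

    // Sketch: route by cohort, generate, then log the interaction for analysis.
    async function handleAiRequest(userId: string, prompt: string): Promise<AiResponse> {
      const provider = getProvider(userId);
      const started = Date.now();
      const response = await provider.generate(prompt, { /* your GenerateOptions */ });
      logAiInteraction({
        cohort: getAiCohort(userId),
        action: "generate",
        ttft: response.latencyMs,   // substitute real time-to-first-token if available
        total: Date.now() - started,
        tokens: 0,                  // fill from the runtime's token count
        userAction: "pending",      // update when the user sends, edits, or dismisses
      });
      return response;
    }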
    

    Analyzing Results

    Statistical Significance

    Use a chi-squared test for rate metrics (completion rate, retry rate) and a t-test for continuous metrics (latency, time to action).

    Minimum confidence level: 95% (p < 0.05). For critical metrics, use 99%.
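    As a concrete sketch of the rate-metric comparison: the 2x2 chi-squared statistic on completed vs not-completed counts per cohort, compared against the standard 1-degree-of-freedom critical values (3.841 for p < 0.05, 6.635 for p < 0.01). The counts in the example are made up for illustration.

    // 2x2 chi-squared test for a rate metric such as task completion rate.
    //              completed   not completed
    //   cloud          a             b
    //   on-device      c             d
    function chiSquared2x2(a: number, b: number, c: number, d: number): number {
      const n = a + b + c + d;
      return (n * (a * d - b * c) ** 2) / ((a + b) * (c + d) * (a + c) * (b + d));
    }

    // Illustrative counts: 1,180/2,000 cloud completions vs 1,150/2,000 on-device.
    const chi2 = chiSquared2x2(1180, 820, 1150, 850);
    const significantAt95 = chi2 > 3.841;  // 1 df critical value for p < 0.05
    const significantAt99 = chi2 > 6.635;  // 1 df critical value for p < 0.01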

    Expected Results

    Based on typical on-device vs cloud migrations:

    | Metric | Expected Cloud | Expected On-Device | Direction |
    | --- | --- | --- | --- |
    | Latency (TTFT) | 500-2,000 ms | 50-200 ms | On-device wins significantly |
    | Task completion rate | Baseline | -2% to +3% | Usually comparable |
    | Feature engagement (D7) | Baseline | +0% to +10% | On-device often wins (faster = more usage) |
    | Retry rate | Baseline | -5% to +2% | Usually comparable or better |
    | Offline AI usage | 0% | 5-15% of sessions | New capability |
    | Cost per user | $0.05-0.10 | ~$0.00 | On-device wins |

    When to Ship On-Device

    Ship if:

    • Task completion rate is within 3% of cloud (not statistically significantly worse)
    • No increase in crash rate
    • Latency is measurably better (expected)
    • Feature engagement is stable or improved

    Do not ship if:

    • Task completion rate is significantly lower (more than 5% drop)
    • Crash rate increases (memory issues on some devices)
    • Users in the on-device cohort have measurably lower satisfaction

    When to Iterate

    If the on-device model underperforms on task completion but wins on latency and engagement, the model quality needs improvement. Options:

    • Add more training data and re-train
    • Switch to a larger model (1B to 3B)
    • Improve fine-tuning (more epochs, different hyperparameters)
    • Expand training data coverage for the failing cases

    Re-run the A/B test with the improved model.

    Edge Cases

    On-Device Model Not Yet Downloaded

    Users in the on-device cohort who have not downloaded the model yet should fall back to cloud. Track how long it takes for the on-device cohort to fully activate (all users have the model).
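    One way to make that visible in the data, reusing the analytics call from the logging section: record both the assigned cohort and the provider that actually served the request, so activation lag (and cloud fallbacks inside the on-device cohort) can be excluded or segmented during analysis.

    // The gap between "assigned on_device" and "served on_device" is the share
    // of the cohort still waiting on the model download (or falling back).
    function logServedProvider(userId: string, response: AiResponse) {
      analytics.track("ai_provider_served", {
        assigned_cohort: getAiCohort(userId),
        served_source: response.source,   // "cloud" or "on_device"
        timestamp: Date.now(),
      });
    }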

    Device Capability Mismatch

    Some users' devices may not support the on-device model (insufficient RAM). These users should stay on cloud regardless of cohort assignment. Track what percentage of the on-device cohort falls back to cloud and on which devices.
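    A sketch of that capability gate; the RAM threshold and the `getTotalRamMb` helper are illustrative placeholders, since the real check depends on your platform APIs and on the model's actual memory footprint.

    // Devices below the threshold stay on cloud regardless of cohort assignment.
    const MIN_RAM_MB = 6 * 1024;  // illustrative cutoff; tune to your model's footprint

    function isDeviceEligible(): boolean {
      return getTotalRamMb() >= MIN_RAM_MB;  // getTotalRamMb: hypothetical platform helper
    }

    function getProviderWithCapabilityGate(userId: string): AiProvider {
      if (!isDeviceEligible()) return cloudProvider;  // capability gate before cohort routing
      return getProvider(userId);
    }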

    Offline Comparison

    The on-device cohort has a capability the cloud cohort does not: offline AI. Track offline AI usage separately. This is incremental value that does not appear in a direct quality comparison.

    The Business Case

    The A/B test produces the data for the migration decision:

    • Quality delta: Is the user experience equivalent?
    • Cost delta: How much does the cloud cohort cost per month?
    • Engagement delta: Do users interact more with faster AI?
    • New capability: How much offline AI usage exists?

    For most well-fine-tuned models, the A/B test confirms what the engineering team expects: equivalent quality, better latency, zero cost, plus offline capability. The data makes the migration decision straightforward.

    Fine-tuning quality is the key variable. Platforms like Ertas enable rapid iteration: re-train with improved data, export GGUF, deploy to the on-device cohort, and measure again.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
