    A/B Testing Cloud API vs On-Device AI in Production
    A/B testing · migration · cloud API · on-device AI · production · segment:mobile-builder


    How to run a fair A/B test between your cloud API and on-device model in a live mobile app: cohort design, statistical significance, and the metrics that actually matter.

    Ertas Team

    You have a cloud AI feature in production. You have a fine-tuned on-device model ready to deploy. Before migrating all users, you need evidence that the on-device model meets or exceeds the cloud model's quality.

    An A/B test in production gives you that evidence on real users, real queries, and real behavior.

    What to Test

    The goal is not to prove the on-device model is "as good" as the cloud model. The goal is to measure whether users notice or care about any difference, and to quantify the operational improvements (latency, cost, offline capability).

    Primary Metrics

    | Metric | Why It Matters | How to Measure |
    | --- | --- | --- |
    | Task completion rate | Did users get what they needed? | % of AI interactions that resulted in user action (send, save, accept) |
    | Feature engagement (D7/D30) | Do users keep using the AI feature? | Return rate to AI feature over 7/30 days |
    | Time to first action | Is the UX faster or slower? | Time from query to user's next action |
    | Error/retry rate | Does the AI fail or frustrate? | % of interactions where user retries or abandons |

    Secondary Metrics

    | Metric | Why It Matters |
    | --- | --- |
    | Latency (TTFT, full response) | On-device should win, but verify |
    | Cost per user | Cloud cohort has API costs; on-device has ~$0 |
    | Offline usage | On-device cohort should show AI usage in offline conditions |
    | App crash rate | On-device model loading can cause memory issues |
    | Battery impact | On-device inference uses device resources |

    Metrics to Avoid

    Model quality scores (perplexity, BLEU, ROUGE): Users do not care about perplexity. They care about whether the feature solved their problem. Automated quality metrics are useful during development but not as A/B test primary metrics.

    Response length: Longer is not better. Shorter is not worse. Length is a proxy for nothing useful.

    Cohort Design

    Random Assignment

    Assign users to cohorts on first AI interaction:

    import kotlin.math.absoluteValue

    // Cohort is derived from a deterministic hash of the user ID, so the same
    // user always gets the same cohort across sessions.
    enum class AiCohort { CLOUD, ON_DEVICE }

    fun getAiCohort(userId: String): AiCohort {
        val hash = userId.hashCode().absoluteValue
        return if (hash % 100 < 50) AiCohort.CLOUD else AiCohort.ON_DEVICE
    }
    

    Use the user ID hash (not random) so each user stays in their cohort across sessions.
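    If cohort assignment lives in a TypeScript layer instead (the provider and routing code later in this post is TypeScript), a minimal equivalent might look like the sketch below; `stableHash` is an illustrative 31-based string hash, not a specific library function.

    // Deterministic 32-bit string hash (Java-style 31*h + c) so a user's
    // bucket is stable across sessions and app restarts.
    function stableHash(s: string): number {
      let h = 0;
      for (let i = 0; i < s.length; i++) {
        h = (Math.imul(31, h) + s.charCodeAt(i)) | 0;
      }
      return Math.abs(h);
    }

    type AiCohort = "cloud" | "on_device";

    function getAiCohort(userId: string): AiCohort {
      // 50/50 split; which half maps to on-device is arbitrary, but it never changes
      return stableHash(userId) % 100 < 50 ? "on_device" : "cloud";
    }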

    Cohort Sizes

    | Test Duration | Users Per Cohort | Detectable Effect Size |
    | --- | --- | --- |
    | 1 week | 500 | 10%+ difference |
    | 2 weeks | 1,000 | 5-7% difference |
    | 4 weeks | 2,500 | 3-5% difference |

    For a 50/50 split at 10K MAU, you have 5,000 users per cohort. Two weeks gives you statistically significant results for differences of 5% or more.
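    To sanity-check those cohort sizes against your own baseline rates, a rough per-cohort estimate for comparing two proportions (95% confidence, 80% power) is sketched below; the baseline rate and minimum detectable difference are assumptions you supply for your own feature.

    // n per cohort ≈ (z_alpha/2 + z_beta)^2 * 2 * p * (1 - p) / delta^2
    // with z_alpha/2 = 1.96 (95% confidence) and z_beta = 0.84 (80% power).
    function requiredUsersPerCohort(baselineRate: number, minDetectableDiff: number): number {
      const zAlpha = 1.96;
      const zBeta = 0.84;
      const n = ((zAlpha + zBeta) ** 2 * 2 * baselineRate * (1 - baselineRate)) / minDetectableDiff ** 2;
      return Math.ceil(n);
    }

    // Example: 60% baseline task completion, 5-point minimum detectable difference
    // requiredUsersPerCohort(0.60, 0.05) -> about 1,500 users per cohort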

    Gradual Rollout

    Start conservative (a sketch of the configurable split follows this list):

    1. Week 1: 10% on-device, 90% cloud (catch crashes and critical issues)
    2. Week 2: 25% on-device, 75% cloud (gather initial metrics)
    3. Week 3-4: 50/50 (full A/B test with statistical power)
    4. After results: Ramp to 100% on-device if metrics pass
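    To implement the ramp without reshuffling users, keep each user's bucket fixed and move only the cutoff. A sketch building on the `stableHash` helper above; the on-device percentage is assumed to come from remote config or a server-side flag.

    // rolloutPercent is 10, 25, or 50 depending on the week. Because the bucket is
    // deterministic, raising the cutoff only adds users to the on-device cohort and
    // never flips an existing on-device user back to cloud.
    function getAiCohortWithRollout(userId: string, rolloutPercent: number): AiCohort {
      const bucket = stableHash(userId) % 100;  // 0-99, stable per user
      return bucket < rolloutPercent ? "on_device" : "cloud";
    }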

    Implementation Architecture

    The AI Provider Abstraction

    Both variants should go through the same interface:

    interface AiResponse {
      text: string;
      source: "cloud" | "on_device";
      latencyMs: number;
    }

    interface AiProvider {
      generate(prompt: string, options: GenerateOptions): Promise<AiResponse>;
      isAvailable(): boolean;
    }

    class CloudProvider implements AiProvider {
      async generate(prompt: string, options: GenerateOptions): Promise<AiResponse> {
        const response = await callCloudAPI(prompt, options);  // existing cloud client
        return { text: response.text, source: "cloud", latencyMs: response.latency };
      }
      isAvailable() { return navigator.onLine; }
    }

    class OnDeviceProvider implements AiProvider {
      private modelLoaded = false;  // set true once the model file is loaded

      async generate(prompt: string, options: GenerateOptions): Promise<AiResponse> {
        const response = await llamaGenerate(prompt, options);  // on-device inference runtime
        return { text: response.text, source: "on_device", latencyMs: response.latency };
      }
      isAvailable() { return this.modelLoaded; }
    }
    

    Routing

    function getProvider(userId: string): AiProvider {
      const cohort = getAiCohort(userId);
    
      if (cohort === "on_device" && onDeviceProvider.isAvailable()) {
        return onDeviceProvider;
      }
    
      // Fallback to cloud if on-device model not loaded yet
      return cloudProvider;
    }
    

    Event Logging

    Log every AI interaction with cohort and metrics:

    function logAiInteraction(event: AiInteractionEvent) {
      analytics.track("ai_interaction", {
        cohort: event.cohort,           // "cloud" or "on_device"
        action: event.action,           // "generate", "accept", "retry", "abandon"
        latency_ttft_ms: event.ttft,    // Time to first token
        latency_total_ms: event.total,  // Total response time
        tokens_generated: event.tokens,
        user_action: event.userAction,  // What user did after (send, edit, dismiss)
        offline: !navigator.onLine,
        device_model: getDeviceModel(),
        timestamp: Date.now(),
      });
    }
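    Putting the routing and logging together, one request path might look like the following sketch; `handleAiRequest` is illustrative, and the timing and token fields should come from your runtime's instrumentation rather than the rough stand-ins shown here.

    // Sketch: route by cohort, generate, then log the interaction for analysis.
    async function handleAiRequest(userId: string, prompt: string): Promise<AiResponse> {
      const provider = getProvider(userId);
      const started = Date.now();
      const response = await provider.generate(prompt, { /* your GenerateOptions */ });
      logAiInteraction({
        cohort: getAiCohort(userId),
        action: "generate",
        ttft: response.latencyMs,   // substitute real time-to-first-token if available
        total: Date.now() - started,
        tokens: 0,                  // fill from the runtime's token count
        userAction: "pending",      // update when the user sends, edits, or dismisses
      });
      return response;
    }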
    

    Analyzing Results

    Statistical Significance

    Use a chi-squared test for rate metrics (completion rate, retry rate) and a t-test for continuous metrics (latency, time to action).

    Minimum confidence level: 95% (p < 0.05). For critical metrics, use 99%.
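    As a concrete sketch of the rate-metric comparison: the 2x2 chi-squared statistic on completed vs not-completed counts per cohort, compared against the standard 1-degree-of-freedom critical values (3.841 for p < 0.05, 6.635 for p < 0.01). The counts in the example are made up for illustration.

    // 2x2 chi-squared test for a rate metric such as task completion rate.
    //              completed   not completed
    //   cloud          a             b
    //   on-device      c             d
    function chiSquared2x2(a: number, b: number, c: number, d: number): number {
      const n = a + b + c + d;
      return (n * (a * d - b * c) ** 2) / ((a + b) * (c + d) * (a + c) * (b + d));
    }

    // Illustrative counts: 1,180/2,000 cloud completions vs 1,150/2,000 on-device.
    const chi2 = chiSquared2x2(1180, 820, 1150, 850);
    const significantAt95 = chi2 > 3.841;  // 1 df critical value for p < 0.05
    const significantAt99 = chi2 > 6.635;  // 1 df critical value for p < 0.01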

    Expected Results

    Based on typical on-device vs cloud migrations:

    | Metric | Expected Cloud | Expected On-Device | Direction |
    | --- | --- | --- | --- |
    | Latency (TTFT) | 500-2,000 ms | 50-200 ms | On-device wins significantly |
    | Task completion rate | Baseline | -2% to +3% | Usually comparable |
    | Feature engagement (D7) | Baseline | +0% to +10% | On-device often wins (faster = more usage) |
    | Retry rate | Baseline | -5% to +2% | Usually comparable or better |
    | Offline AI usage | 0% | 5-15% of sessions | New capability |
    | Cost per user | $0.05-0.10 | ~$0.00 | On-device wins |

    When to Ship On-Device

    Ship if:

    • Task completion rate is within 3% of cloud (not statistically significantly worse)
    • No increase in crash rate
    • Latency is measurably better (expected)
    • Feature engagement is stable or improved

    Do not ship if:

    • Task completion rate is significantly lower (more than 5% drop)
    • Crash rate increases (memory issues on some devices)
    • Users in the on-device cohort have measurably lower satisfaction

    When to Iterate

    If the on-device model underperforms on task completion but wins on latency and engagement, the model quality needs improvement. Options:

    • Add more training data and re-train
    • Switch to a larger model (1B to 3B)
    • Improve fine-tuning (more epochs, different hyperparameters)
    • Expand training data coverage for the failing cases

    Re-run the A/B test with the improved model.

    Edge Cases

    On-Device Model Not Yet Downloaded

    Users in the on-device cohort who have not downloaded the model yet should fall back to cloud. Track how long it takes for the on-device cohort to fully activate (all users have the model).
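    One way to make that visible in the data, reusing the analytics call from the logging section: record both the assigned cohort and the provider that actually served the request, so activation lag (and cloud fallbacks inside the on-device cohort) can be excluded or segmented during analysis.

    // The gap between "assigned on_device" and "served on_device" is the share
    // of the cohort still waiting on the model download (or falling back).
    function logServedProvider(userId: string, response: AiResponse) {
      analytics.track("ai_provider_served", {
        assigned_cohort: getAiCohort(userId),
        served_source: response.source,   // "cloud" or "on_device"
        timestamp: Date.now(),
      });
    }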

    Device Capability Mismatch

    Some users' devices may not support the on-device model (insufficient RAM). These users should stay on cloud regardless of cohort assignment. Track what percentage of the on-device cohort falls back to cloud and on which devices.
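    A sketch of that capability gate; the RAM threshold and the `getTotalRamMb` helper are illustrative placeholders, since the real check depends on your platform APIs and on the model's actual memory footprint.

    // Devices below the threshold stay on cloud regardless of cohort assignment.
    const MIN_RAM_MB = 6 * 1024;  // illustrative cutoff; tune to your model's footprint

    function isDeviceEligible(): boolean {
      return getTotalRamMb() >= MIN_RAM_MB;  // getTotalRamMb: hypothetical platform helper
    }

    function getProviderWithCapabilityGate(userId: string): AiProvider {
      if (!isDeviceEligible()) return cloudProvider;  // capability gate before cohort routing
      return getProvider(userId);
    }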

    Offline Comparison

    The on-device cohort has a capability the cloud cohort does not: offline AI. Track offline AI usage separately. This is incremental value that does not appear in a direct quality comparison.

    The Business Case

    The A/B test produces the data for the migration decision:

    • Quality delta: Is the user experience equivalent?
    • Cost delta: How much does the cloud cohort cost per month?
    • Engagement delta: Do users interact more with faster AI?
    • New capability: How much offline AI usage exists?

    For most well-fine-tuned models, the A/B test confirms what the engineering team expects: equivalent quality, better latency, zero cost, plus offline capability. The data makes the migration decision straightforward.

    Fine-tuning quality is the key variable. Platforms like Ertas enable rapid iteration: re-train with improved data, export GGUF, deploy to the on-device cohort, and measure again.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
