
A/B Testing Cloud API vs On-Device AI in Production
How to run a fair A/B test between your cloud API and on-device model in a live mobile app. Cohort design, statistical significance, and the metrics that actually matter.
You have a cloud AI feature in production. You have a fine-tuned on-device model ready to deploy. Before migrating all users, you need evidence that the on-device model meets or exceeds the cloud model's quality.
An A/B test in production gives you that evidence on real users, real queries, and real behavior.
What to Test
The goal is not to prove the on-device model is "as good" as the cloud model. The goal is to measure whether users notice or care about any difference, and to quantify the operational improvements (latency, cost, offline capability).
Primary Metrics
| Metric | Why It Matters | How to Measure |
|---|---|---|
| Task completion rate | Did users get what they needed? | % of AI interactions that resulted in user action (send, save, accept) |
| Feature engagement (D7/D30) | Do users keep using the AI feature? | Return rate to AI feature over 7/30 days |
| Time to first action | Is the UX faster or slower? | Time from query to user's next action |
| Error/retry rate | Does the AI fail or frustrate? | % of interactions where user retries or abandons |
Secondary Metrics
| Metric | Why It Matters |
|---|---|
| Latency (TTFT, full response) | On-device should win, but verify |
| Cost per user | Cloud cohort has API costs; on-device has ~$0 |
| Offline usage | On-device cohort should show AI usage in offline conditions |
| App crash rate | On-device model loading can cause memory issues |
| Battery impact | On-device inference uses device resources |
Metrics to Avoid
Model quality scores (perplexity, BLEU, ROUGE): Users do not care about perplexity. They care about whether the feature solved their problem. Automated quality metrics are useful during development but not as A/B test primary metrics.
Response length: Longer is not better. Shorter is not worse. Length is a proxy for nothing useful.
Cohort Design
Random Assignment
Assign users to cohorts on first AI interaction:
```kotlin
import kotlin.math.absoluteValue

enum class AiCohort { CLOUD, ON_DEVICE }

fun getAiCohort(userId: String): AiCohort {
    // Deterministic hash ensures the same user always lands in the same cohort
    val hash = userId.hashCode().absoluteValue
    return if (hash % 100 < 50) AiCohort.CLOUD else AiCohort.ON_DEVICE
}
```
Use the user ID hash (not random) so each user stays in their cohort across sessions.
Cohort Sizes
| Test Duration | Users Per Cohort | Detectable Effect Size |
|---|---|---|
| 1 week | 500 | 10%+ difference |
| 2 weeks | 1,000 | 5-7% difference |
| 4 weeks | 2,500 | 3-5% difference |
For a 50/50 split at 10K MAU, you have 5,000 users per cohort. Two weeks gives you statistically significant results for differences of 5% or more.
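The cohort sizes above follow from the standard two-proportion sample-size formula. A minimal sketch, assuming 95% confidence, 80% power, and an illustrative 80% baseline completion rate:

```typescript
// Two-proportion sample size per cohort. z-values are fixed for
// alpha = 0.05 (two-sided) and power = 0.80.
function usersPerCohort(baseline: number, minDetectableDelta: number): number {
  const zAlpha = 1.96;
  const zBeta = 0.84;
  const p1 = baseline;
  const p2 = baseline + minDetectableDelta;
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (p2 - p1) ** 2);
}

usersPerCohort(0.80, -0.05); // ~1,090 -- in line with the two-week row above
```

The required cohort size depends on your baseline rate, so plug in your own numbers rather than reading the table as universal.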
Gradual Rollout
Start conservative (a sketch of the ramp mechanism follows the list):
- Week 1: 10% on-device, 90% cloud (catch crashes and critical issues)
- Week 2: 25% on-device, 75% cloud (gather initial metrics)
- Week 3-4: 50/50 (full A/B test with statistical power)
- After results: Ramp to 100% on-device if metrics pass
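One way to run this ramp without reshuffling users is to widen the on-device bucket range as the rollout percentage grows. A TypeScript variant of the assignment function above; FNV-1a is an arbitrary choice, any stable hash works:

```typescript
// Stable 0-99 bucket per user (FNV-1a over the user ID, then modulo 100).
function hashToBucket(userId: string): number {
  let hash = 2166136261;
  for (let i = 0; i < userId.length; i++) {
    hash = Math.imul(hash ^ userId.charCodeAt(i), 16777619);
  }
  return Math.abs(hash) % 100;
}

// Ramp by raising onDevicePercent (10 -> 25 -> 50). Raising it only moves
// cloud users to on-device; users already on-device stay there.
function getAiCohort(userId: string, onDevicePercent: number): "cloud" | "on_device" {
  return hashToBucket(userId) < onDevicePercent ? "on_device" : "cloud";
}
```

Because buckets are stable, each ramp step is a single config change and no user ever flips from on-device back to cloud.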
Implementation Architecture
The AI Provider Abstraction
Both variants should go through the same interface:
```typescript
type GenerateOptions = { maxTokens?: number; temperature?: number };
type AiResponse = { text: string; source: "cloud" | "on_device"; latencyMs: number };

interface AiProvider {
  generate(prompt: string, options: GenerateOptions): Promise<AiResponse>;
  isAvailable(): boolean;
}

class CloudProvider implements AiProvider {
  async generate(prompt: string, options: GenerateOptions): Promise<AiResponse> {
    const response = await callCloudAPI(prompt, options); // your existing API client
    return { text: response.text, source: "cloud", latencyMs: response.latency };
  }
  isAvailable(): boolean { return navigator.onLine; }
}

class OnDeviceProvider implements AiProvider {
  private modelLoaded = false; // flipped once the model file is loaded into memory

  async generate(prompt: string, options: GenerateOptions): Promise<AiResponse> {
    const response = await llamaGenerate(prompt, options); // llama.cpp binding
    return { text: response.text, source: "on_device", latencyMs: response.latency };
  }
  isAvailable(): boolean { return this.modelLoaded; }
}
```
Routing
```typescript
function getProvider(userId: string): AiProvider {
  const cohort = getAiCohort(userId);
  if (cohort === "on_device" && onDeviceProvider.isAvailable()) {
    return onDeviceProvider;
  }
  // Fall back to cloud if the on-device model is not loaded yet
  return cloudProvider;
}
```
Event Logging
Log every AI interaction with cohort and metrics:
```typescript
function logAiInteraction(event: AiInteractionEvent) {
  analytics.track("ai_interaction", {
    cohort: event.cohort,            // "cloud" or "on_device"
    action: event.action,            // "generate", "accept", "retry", "abandon"
    latency_ttft_ms: event.ttft,     // time to first token
    latency_total_ms: event.total,   // total response time
    tokens_generated: event.tokens,
    user_action: event.userAction,   // what the user did next (send, edit, dismiss)
    offline: !navigator.onLine,
    device_model: getDeviceModel(),
    timestamp: Date.now(),
  });
}
```
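Total latency falls out of a single timed await, but TTFT does not. A sketch, assuming a hypothetical streaming variant of the provider interface with an onToken callback (not part of AiProvider above):

```typescript
// Assumed extension; generateStream and onToken are illustrative, not part of
// the AiProvider interface defined earlier.
interface StreamingAiProvider extends AiProvider {
  generateStream(
    prompt: string,
    options: GenerateOptions & { onToken: () => void }
  ): Promise<AiResponse>;
}

async function timedGenerate(provider: StreamingAiProvider, prompt: string, options: GenerateOptions) {
  const start = performance.now();
  let ttftMs: number | undefined;
  const response = await provider.generateStream(prompt, {
    ...options,
    onToken: () => { ttftMs ??= performance.now() - start; }, // record first token only
  });
  return { response, ttftMs, totalMs: performance.now() - start };
}
```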
Analyzing Results
Statistical Significance
Use a chi-squared test for rate metrics (completion rate, retry rate) and a t-test for continuous metrics (latency, time to action).
Minimum confidence level: 95% (p < 0.05). For critical metrics, use 99% (p < 0.01).
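For rate metrics, the 2x2 chi-squared statistic is simple enough to compute without a stats package. A sketch with illustrative counts:

```typescript
// Chi-squared for a 2x2 table (cohort x completed/not-completed), 1 degree of
// freedom. Critical values: 3.841 for p < 0.05, 6.635 for p < 0.01.
function chiSquared2x2(a: number, b: number, c: number, d: number): number {
  const n = a + b + c + d;
  return (n * (a * d - b * c) ** 2) / ((a + b) * (c + d) * (a + c) * (b + d));
}

// Illustrative counts: cloud completed 2,840 of 3,500; on-device 2,790 of 3,500.
const stat = chiSquared2x2(2840, 660, 2790, 710); // ~2.27
const worseAt95 = stat > 3.841; // false: no significant difference between cohorts
```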
Expected Results
Based on typical on-device vs cloud migrations:
| Metric | Expected Cloud | Expected On-Device | Direction |
|---|---|---|---|
| Latency (TTFT) | 500-2,000ms | 50-200ms | On-device wins significantly |
| Task completion rate | Baseline | -2% to +3% | Usually comparable |
| Feature engagement (D7) | Baseline | +0% to +10% | On-device often wins (faster = more usage) |
| Retry rate | Baseline | -5% to +2% | Usually comparable or better |
| Offline AI usage | 0% | 5-15% of sessions | New capability |
| Cost per user | $0.05-0.10 | ~$0.00 | On-device wins |
When to Ship On-Device
Ship if (encoded in the sketch after these lists):
- Task completion rate is within 3% of cloud (not statistically significantly worse)
- No increase in crash rate
- Latency is measurably better (expected)
- Feature engagement is stable or improved
Do not ship if:
- Task completion rate is significantly lower (more than 5% drop)
- Crash rate increases (memory issues on some devices)
- Users in the on-device cohort have measurably lower satisfaction
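These gates are mechanical enough to encode directly. A sketch; the field names are illustrative and the thresholds come from the lists above:

```typescript
// Deltas are on-device minus cloud, in percentage points.
type AbTestResults = {
  completionDeltaPp: number;
  completionSignificantlyWorse: boolean; // e.g. from the chi-squared test above
  crashRateDeltaPp: number;
  engagementDeltaPp: number;
};

function shouldShipOnDevice(r: AbTestResults): boolean {
  return (
    r.completionDeltaPp >= -3 &&
    !r.completionSignificantlyWorse &&
    r.crashRateDeltaPp <= 0 &&
    r.engagementDeltaPp >= 0
  );
}
```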
When to Iterate
If the on-device model underperforms on task completion but wins on latency and engagement, the model quality needs improvement. Options:
- Add more training data and re-train
- Switch to a larger model (1B to 3B)
- Improve fine-tuning (more epochs, different hyperparameters)
- Expand training data coverage for the failing cases
Re-run the A/B test with the improved model.
Edge Cases
On-Device Model Not Yet Downloaded
Users in the on-device cohort who have not downloaded the model yet should fall back to cloud. Track how long it takes for the on-device cohort to fully activate (all users have the model).
Device Capability Mismatch
Some users' devices may not support the on-device model (insufficient RAM). These users should stay on cloud regardless of cohort assignment. Track what percentage of the on-device cohort falls back to cloud and on which devices.
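A minimal capability gate, assuming a hypothetical getDeviceRamGB() helper exposed by the app's native layer; the RAM threshold is an assumption, not a benchmark:

```typescript
declare function getAiCohort(userId: string): "cloud" | "on_device"; // assignment from earlier
declare function getDeviceRamGB(): number; // hypothetical native-layer helper

const MIN_RAM_GB = 6; // assumed comfortable floor for a small quantized model plus the app

function resolveCohort(userId: string): "cloud" | "on_device" {
  const assigned = getAiCohort(userId);
  if (assigned === "on_device" && getDeviceRamGB() < MIN_RAM_GB) {
    // Log the forced fallback so you can report what share of the on-device
    // cohort it affects, and on which devices.
    analytics.track("on_device_fallback", { reason: "insufficient_ram", device: getDeviceModel() });
    return "cloud";
  }
  return assigned;
}
```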
Offline Comparison
The on-device cohort has a capability the cloud cohort does not: offline AI. Track offline AI usage separately. This is incremental value that does not appear in a direct quality comparison.
The Business Case
The A/B test produces the data for the migration decision:
- Quality delta: Is the user experience equivalent?
- Cost delta: How much does the cloud cohort cost per month? (worked example below)
- Engagement delta: Do users interact more with faster AI?
- New capability: How much offline AI usage exists?
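As a worked example using the numbers above: serving all 10K MAU from the cloud at $0.05-0.10 per user runs roughly $500-1,000 per month, a line item that drops to approximately zero after a full on-device rollout.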
For most well-fine-tuned models, the A/B test confirms what the engineering team expects: equivalent quality, better latency, zero cost, plus offline capability. The data makes the migration decision straightforward.
Fine-tuning quality is the key variable. Platforms like Ertas enable rapid iteration: re-train with improved data, export GGUF, deploy to the on-device cohort, and measure again.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.