
On-Device vs Cloud API: The Real Math at 10K, 50K, and 100K MAU
A no-fluff cost breakdown of cloud API pricing vs on-device inference at scale. See exactly when on-device fine-tuning pays for itself, with tables, real pricing data, and the hidden costs nobody puts in the README.
Your AI feature works great in testing. Responses are fast, the model is capable, costs are negligible. Then you hit 10K monthly active users and the invoice arrives.
This is the moment that separates apps that scale from apps that quietly get rebuilt. Seventy percent of CIOs cite AI cost unpredictability as their top adoption barrier, according to a 2026 Forrester report. Menlo Ventures found that average monthly organizational AI spend jumped from $63K in 2024 to $85.5K in 2025, a 36% increase in a single year. Replit's gross margins reportedly swung from +36% to -14% as AI inference costs scaled with usage (Sacra).
The good news: you can model this before it happens. This article shows the math.
The Pricing Landscape
First, let's establish the actual numbers. All prices are per 1 million tokens as of early 2026.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| OpenAI GPT-4o | $2.50 | $10.00 |
| OpenAI GPT-4.1-mini | $0.40 | $1.60 |
| OpenAI GPT-4o-mini | $0.15 | $0.60 |
| Anthropic Claude 3.5 Haiku | $0.80 | $4.00 |
| Google Gemini 2.0 Flash | $0.10 | $0.40 |
Output tokens cost significantly more than input tokens across every provider. This matters because most cost estimates focus on input length and undercount the output side.
The Cost Model: Assumptions
To make this concrete, we need a baseline usage assumption. Here is a reasonable model for a mobile app with an AI assistant feature:
- 3 interactions per user per day (conservative for a daily-use app)
- 500 input tokens per interaction (a short system prompt plus user message)
- 500 output tokens per interaction (a paragraph-length response)
- Monthly active users at 10K, 50K, and 100K
That gives us 90 interactions per user per month (3 per day over a 30-day month), and 1,000 tokens total per interaction (split evenly between input and output).
Total tokens per user per month: 90,000 (45K input + 45K output).
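If you want to reproduce the tables below, here is the model as a minimal Python sketch. The constants mirror the assumptions above; the printed figure matches the GPT-4o-mini row in the first table.

```python
# Minimal cost model for the assumptions above.
# Prices are USD per 1M tokens.

INTERACTIONS_PER_DAY = 3
DAYS_PER_MONTH = 30
INPUT_TOKENS = 500   # per interaction
OUTPUT_TOKENS = 500  # per interaction

def monthly_cost(mau: int, input_price: float, output_price: float) -> float:
    """Monthly API bill in USD for a given MAU and per-1M-token prices."""
    interactions = INTERACTIONS_PER_DAY * DAYS_PER_MONTH  # 90 per user
    in_tokens = mau * interactions * INPUT_TOKENS
    out_tokens = mau * interactions * OUTPUT_TOKENS
    return (in_tokens * input_price + out_tokens * output_price) / 1_000_000

print(monthly_cost(10_000, 0.15, 0.60))  # GPT-4o-mini at 10K MAU -> 337.5
```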
Cloud API Costs at Scale
Here is what that math produces at three MAU milestones.
10,000 MAU
| Model | Monthly Cost |
|---|---|
| Gemini 2.0 Flash | $225.00 |
| GPT-4o-mini | $337.50 |
| GPT-4.1-mini | $900.00 |
| Claude 3.5 Haiku | $2,160.00 |
| GPT-4o | $5,625.00 |
50,000 MAU
| Model | Monthly Cost |
|---|---|
| Gemini 2.0 Flash | $1,125.00 |
| GPT-4o-mini | $1,687.50 |
| GPT-4.1-mini | $4,500.00 |
| Claude 3.5 Haiku | $10,800.00 |
| GPT-4o | $28,125.00 |
100,000 MAU
| Model | Monthly Cost |
|---|---|
| Gemini 2.0 Flash | $2,250.00 |
| GPT-4o-mini | $3,375.00 |
| GPT-4.1-mini | $9,000.00 |
| Claude 3.5 Haiku | $21,600.00 |
| GPT-4o | $56,250.00 |
These are bare-minimum estimates. They do not include retry logic, streaming overhead, context window growth as conversations extend, or the cost of embedding calls if you are running RAG. Real-world token usage is typically 1.5-2x higher than estimates.
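To put a number on that overhead, here is a rough sketch that applies the 1.5-2x multiplier to the baseline, reusing `monthly_cost` from the sketch above:

```python
# Bracket the realistic bill with the 1.5-2x real-world overhead factor.
baseline = monthly_cost(100_000, 0.15, 0.60)  # GPT-4o-mini at 100K MAU
low, high = 1.5 * baseline, 2.0 * baseline
print(f"${baseline:,.0f} baseline -> ${low:,.0f}-${high:,.0f} realistic")
# $3,375 baseline -> $5,062-$6,750 realistic
```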
The On-Device Alternative
On-device inference runs the model on the user's hardware. After the model is distributed, each inference costs you nothing. No per-token fees, no API calls, no egress costs.
The two cost components you actually pay are:
- Fine-tuning (one-time): Training a LoRA adapter on a cloud GPU service runs approximately $5-$50 depending on dataset size and base model. This is a one-time cost per model version, not per user or per inference.
- Model distribution (one-time per install): You are shipping a GGUF file with your app. GGUF model sizes for practical mobile-capable models: Llama 3.2 1B at Q4_K_M quantization is 808MB; the 3B variant is 2.02GB. CDN egress for a 1GB file at standard rates is under $0.10 per install. For 10K users, that is roughly $1,000 in total distribution cost, paid at install time rather than monthly.
Ongoing monthly cost: $0.
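As a sketch of the total one-time outlay, assuming the $50 fine-tuning upper bound and a $0.08/GB egress rate (an assumption for illustration; check your CDN's actual pricing):

```python
# One-time on-device costs. Both rates are assumptions from the text above.
FINE_TUNE_COST = 50.0    # upper bound of the $5-$50 range, per model version
MODEL_SIZE_GB = 1.0      # roughly a 1B model at Q4_K_M quantization
CDN_RATE_PER_GB = 0.08   # assumed standard egress rate, "under $0.10" per GB

def one_time_cost(installs: int) -> float:
    """Total one-time cost: fine-tuning plus model distribution."""
    return FINE_TUNE_COST + installs * MODEL_SIZE_GB * CDN_RATE_PER_GB

print(one_time_cost(10_000))  # 850.0 -- one-time, against $0/month ongoing
```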
The Break-Even Point
Using GPT-4o-mini as a baseline (a common choice for cost-conscious teams):
| MAU | GPT-4o-mini Monthly | On-Device Monthly | Time to Break Even |
|---|---|---|---|
| 10K | $337.50 | $0 | Less than 1 month after setup |
| 50K | $1,687.50 | $0 | Less than 1 month after setup |
| 100K | $3,375.00 | $0 | Less than 1 month after setup |
The one-time fine-tuning cost of $5-$50 is recovered within the first month at virtually any MAU above a few hundred users. The only real cost is engineering time for integration and the initial model distribution.
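The arithmetic behind "less than 1 month" is simple enough to sketch. Distribution cost is excluded here because it is paid per install rather than monthly, as noted above:

```python
# Months until the one-time fine-tuning cost is recovered by dropping
# the cloud bill.
FINE_TUNE_COST = 50.0  # upper bound of the $5-$50 range

def break_even_months(monthly_api_bill: float) -> float:
    return FINE_TUNE_COST / monthly_api_bill

print(break_even_months(337.50))    # ~0.15 months at 10K MAU
print(break_even_months(3_375.00))  # ~0.015 months at 100K MAU
```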
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
The Hidden Costs of Cloud APIs
The pricing table is not the full story. Cloud API dependencies carry a set of costs that do not show up on your monthly invoice.
Rate Limits and Latency Spikes
Every major provider imposes rate limits: tokens per minute, requests per minute, and daily caps. These are tiered by account level and can require weeks of usage history to increase. During a spike (a viral moment, a product launch, a feature suddenly trending), you will hit limits exactly when you need reliability most. Rate limit errors require client-side retry logic, which adds complexity and can cascade into user-facing failures.
Latency also varies. Cloud model endpoints are shared infrastructure, and P99 latencies can reach 5-10 seconds during peak load. On-device inference latency, by contrast, is predictable: the model runs on hardware dedicated to one user, with no network round-trip.
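For illustration, here is a minimal sketch of that client-side retry logic. `RateLimitError` is a stand-in for whatever 429 exception your provider's SDK actually raises:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for a provider SDK's 429 error type."""

def with_backoff(call_api, max_retries: int = 5):
    """Retry call_api with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call_api()
        except RateLimitError:
            # Waits of 1s, 2s, 4s, 8s, 16s, each with up to 1s of jitter
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("still rate-limited after all retries")
```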
Vendor Lock-In and Deprecation Risk
Model APIs are not stable contracts. OpenAI has deprecated GPT-3, GPT-3.5, and multiple fine-tuned endpoints. Anthropic, Google, and others have followed similar patterns. When a model is deprecated, you have a migration window, often 6-12 months, to update your prompts, retest, and redeploy. Prompt engineering that works well on GPT-4o-mini does not always transfer directly to a new model.
On-device models do not deprecate on a provider's schedule. You control when you update and can support older app versions indefinitely without paying for an API endpoint you no longer control.
Network Dependency
Mobile apps that require an active internet connection for every AI feature have a hard constraint. On-device models work offline. For note-taking apps, productivity tools, local-first apps, or any app targeting regions with unreliable connectivity, offline capability is a genuine competitive advantage, not just a nice-to-have.
Privacy and Data Residency
Every API call sends user input to a third-party server. For apps handling sensitive data (health, finance, legal, HR), this creates compliance surface area. On-device inference keeps user data on the device. It never leaves.
When Cloud APIs Still Make Sense
On-device is not the right answer for every use case. Be honest about these scenarios:
Prototyping and early-stage development. When you have fewer than a few hundred MAU, the economics favor cloud. You are still validating the feature. Use GPT-4o-mini or Gemini Flash, instrument your token usage carefully (a minimal instrumentation sketch follows these scenarios), and revisit the model architecture at 1K-5K MAU.
Tasks requiring frontier model capability. On-device models in the 1B-7B parameter range handle summarization, classification, extraction, simple Q&A, and short-form generation well. They are not suitable for complex multi-step reasoning, code generation across large codebases, or tasks that genuinely benefit from 100B+ parameter models. If your feature requires GPT-4o-level reasoning, on-device is not a substitute.
Low-volume B2B tools. If you have 200 enterprise users each doing 10 interactions per week, your GPT-4o bill is under $100/month. The engineering investment to implement on-device is not worth it at that volume.
Tasks with rapidly changing requirements. If your system prompt changes weekly and you are iterating fast on the model behavior, the cloud iteration loop is much faster. Re-fine-tuning and redistributing an on-device model takes more time than pushing a new system prompt.
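On the instrumentation point above, a minimal sketch of a per-feature token ledger. It assumes an OpenAI-style `usage` object on each response; field names vary by SDK:

```python
from collections import defaultdict

# Running token totals per feature, so cost per feature is visible
# before the invoice arrives.
usage_by_feature = defaultdict(lambda: {"input": 0, "output": 0})

def record_usage(feature: str, response) -> None:
    u = response.usage  # OpenAI-style usage object (assumed shape)
    usage_by_feature[feature]["input"] += u.prompt_tokens
    usage_by_feature[feature]["output"] += u.completion_tokens
```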
A Practical Decision Framework
| Factor | Cloud API | On-Device |
|---|---|---|
| MAU under 2,000 | Preferred | Overhead not justified |
| MAU over 10,000 | Expensive | Cost-effective |
| Offline required | No | Yes |
| Privacy-sensitive data | Risky | Safe by default |
| Complex reasoning tasks | Better capability | Limited |
| Rapid prompt iteration | Easy | Requires re-deploy |
| Deterministic latency | No | Yes |
| Vendor deprecation risk | High | None |
The decision is not binary. A common hybrid architecture uses on-device for the core features (summarization, tagging, quick responses) and routes specific high-complexity requests to a cloud API. This keeps 80-90% of your inference volume on-device at zero per-token cost while preserving access to frontier capability for edge cases.
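A sketch of what that router can look like. The task set, length threshold, and backend functions are all illustrative placeholders, not a prescribed design:

```python
def run_on_device(prompt: str) -> str:
    ...  # placeholder: local GGUF inference via llama.cpp bindings

def call_cloud_api(prompt: str) -> str:
    ...  # placeholder: provider SDK call

# High-volume, low-complexity tasks stay local.
ON_DEVICE_TASKS = {"summarize", "tag", "classify", "quick_reply"}

def route(task: str, prompt: str) -> str:
    """Keep routine work on-device; escalate edge cases to the cloud."""
    if task in ON_DEVICE_TASKS and len(prompt) < 4_000:
        return run_on_device(prompt)   # zero marginal cost
    return call_cloud_api(prompt)      # per-token cost, frontier capability
```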
The Engineering Path to On-Device
The practical barrier to on-device AI has historically been the toolchain. Fine-tuning requires ML infrastructure, exporting to GGUF requires model conversion tooling, and integrating inference into a mobile app requires platform-specific bindings.
This is where Ertas fits. The platform handles fine-tuning (LoRA adapters on your dataset), quantization, and GGUF export in a single pipeline. You provide your training data and target use case. You get back a GGUF file ready for mobile deployment, along with integration guides for iOS (via llama.cpp bindings) and Android.
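Before wiring the exported file into a mobile build, you can sanity-check it on a workstation with the llama-cpp-python bindings. The filename here is hypothetical; on device you would use llama.cpp's iOS or Android bindings instead:

```python
from llama_cpp import Llama

# Load the exported GGUF and run one test prompt.
llm = Llama(model_path="my-finetune-q4_k_m.gguf", n_ctx=2048)
out = llm("Summarize: the meeting moved to Thursday at 3pm.", max_tokens=64)
print(out["choices"][0]["text"])
```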
The one-time fine-tuning cost of $5-$50 versus a monthly API bill that grows linearly with every user you acquire: the math resolves itself quickly.
Conclusion
At 10K MAU using GPT-4o-mini, you are paying $337/month. At 50K MAU, that is $1,687. At 100K MAU, it is $3,375 per month, and that is with a cheap model and conservative usage assumptions. GPT-4o at 100K MAU is $56,250 per month.
On-device inference costs $0 after a one-time fine-tuning investment of under $50 and model distribution costs that are amortized at install time.
The break-even is not months away. For almost any app above a few hundred active users, the API bill exceeds the fine-tuning cost within the first billing cycle after launch. The question is not whether on-device is cheaper. The question is when you build it.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Keep reading

Fine-Tuning vs RAG for Mobile: Why RAG Still Needs a Server
RAG is the go-to solution for giving AI domain knowledge. But on mobile, RAG reintroduces the server dependency you are trying to eliminate. Fine-tuning bakes the knowledge into the model itself.

Fine-Tuning vs Prompt Engineering for Mobile Apps
Prompt engineering is fast and flexible. Fine-tuning is accurate and cheap at scale. Here is the practical comparison for mobile developers deciding between the two approaches.

On-Device AI Unit Economics: The Math That Makes Mobile AI Profitable
The complete unit economics breakdown for on-device AI vs cloud APIs. Fixed costs, variable costs, break-even analysis, and the financial model for scaling mobile AI features profitably.