
On-Device vs Cloud API: The Real Math at 10K, 50K, and 100K MAU
A no-fluff cost breakdown of cloud API pricing vs on-device inference at scale. See exactly when on-device fine-tuning pays for itself, with tables, real pricing data, and the hidden costs nobody puts in the README.
Your AI feature works great in testing. Responses are fast, the model is capable, costs are negligible. Then you hit 10K monthly active users and the invoice arrives.
This is the moment that separates apps that scale from apps that quietly get rebuilt. Seventy percent of CIOs cite AI cost unpredictability as their top adoption barrier, according to a 2026 Forrester report. Menlo Ventures found that average monthly organizational AI spend jumped from $63K in 2024 to $85.5K in 2025, a 36% increase in a single year. Replit's gross margins reportedly swung from +36% to -14% as AI inference costs scaled with usage (Sacra).
The good news: you can model this before it happens. This article shows the math.
The Pricing Landscape
First, let's establish the actual numbers. All prices are per 1 million tokens as of early 2026.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| OpenAI GPT-4o | $2.50 | $10.00 |
| OpenAI GPT-4.1-mini | $0.40 | $1.60 |
| OpenAI GPT-4o-mini | $0.15 | $0.60 |
| Anthropic Claude 3.5 Haiku | $0.80 | $4.00 |
| Google Gemini 2.0 Flash | $0.10 | $0.40 |
Output tokens cost significantly more than input tokens across every provider. This matters because most cost estimates focus on input length and undercount the output side.
The Cost Model: Assumptions
To make this concrete, we need a baseline usage assumption. Here is a reasonable model for a mobile app with an AI assistant feature:
- 3 interactions per user per day (conservative for a daily-use app)
- 500 input tokens per interaction (a short system prompt plus user message)
- 500 output tokens per interaction (a paragraph-length response)
- Monthly active users at 10K, 50K, and 100K
That gives us 90 interactions per user per month (3 per day over a 30-day month), and 1,000 tokens total per interaction (split evenly between input and output).
Total tokens per user per month: 90,000 (45K input + 45K output).
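If you want to reproduce the tables below, here is the model as a minimal Python sketch. The constants mirror the assumptions above; the printed figure matches the GPT-4o-mini row in the first table.

```python
# Minimal cost model for the assumptions above.
# Prices are USD per 1M tokens.

INTERACTIONS_PER_DAY = 3
DAYS_PER_MONTH = 30
INPUT_TOKENS = 500   # per interaction
OUTPUT_TOKENS = 500  # per interaction

def monthly_cost(mau: int, input_price: float, output_price: float) -> float:
    """Monthly API bill in USD for a given MAU and per-1M-token prices."""
    interactions = INTERACTIONS_PER_DAY * DAYS_PER_MONTH  # 90 per user
    in_tokens = mau * interactions * INPUT_TOKENS
    out_tokens = mau * interactions * OUTPUT_TOKENS
    return (in_tokens * input_price + out_tokens * output_price) / 1_000_000

print(monthly_cost(10_000, 0.15, 0.60))  # GPT-4o-mini at 10K MAU -> 337.5
```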
Cloud API Costs at Scale
Here is what that math produces at three MAU milestones.
10,000 MAU
| Model | Monthly Cost |
|---|---|
| Gemini 2.0 Flash | $225.00 |
| GPT-4o-mini | $337.50 |
| GPT-4.1-mini | $900.00 |
| Claude 3.5 Haiku | $2,160.00 |
| GPT-4o | $5,625.00 |
50,000 MAU
| Model | Monthly Cost |
|---|---|
| Gemini 2.0 Flash | $1,125.00 |
| GPT-4o-mini | $1,687.50 |
| GPT-4.1-mini | $4,500.00 |
| Claude 3.5 Haiku | $10,800.00 |
| GPT-4o | $28,125.00 |
100,000 MAU
| Model | Monthly Cost |
|---|---|
| Gemini 2.0 Flash | $2,250.00 |
| GPT-4o-mini | $3,375.00 |
| GPT-4.1-mini | $9,000.00 |
| Claude 3.5 Haiku | $21,600.00 |
| GPT-4o | $56,250.00 |
These are bare-minimum estimates. They do not include retry logic, streaming overhead, context window growth as conversations extend, or the cost of embedding calls if you are running RAG. Real-world token usage is typically 1.5-2x higher than estimates.
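To put a number on that overhead, here is a rough sketch that applies the 1.5-2x multiplier to the baseline, reusing `monthly_cost` from the sketch above:

```python
# Bracket the realistic bill with the 1.5-2x real-world overhead factor.
baseline = monthly_cost(100_000, 0.15, 0.60)  # GPT-4o-mini at 100K MAU
low, high = 1.5 * baseline, 2.0 * baseline
print(f"${baseline:,.0f} baseline -> ${low:,.0f}-${high:,.0f} realistic")
# $3,375 baseline -> $5,062-$6,750 realistic
```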
The On-Device Alternative
On-device inference runs the model on the user's hardware. After the model is distributed, each inference costs you nothing. No per-token fees, no API calls, no egress costs.
The two cost components you actually pay are:
- Fine-tuning (one-time): Training a LoRA adapter on a cloud GPU service runs approximately $5-$50 depending on dataset size and base model. This is a one-time cost per model version, not per user or per inference.
- Model distribution (one-time per install): You are shipping a GGUF file with your app. GGUF model sizes for practical mobile-capable models: Llama 3.2 1B at Q4_K_M quantization is 808MB; the 3B variant is 2.02GB. CDN egress for a 1GB file at standard rates is under $0.10 per install. For 10K users, that is roughly $1,000 in total distribution cost, paid at install time rather than monthly.
Ongoing monthly cost: $0.
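As a sketch of the total one-time outlay, assuming the $50 fine-tuning upper bound and a $0.08/GB egress rate (an assumption for illustration; check your CDN's actual pricing):

```python
# One-time on-device costs. Both rates are assumptions from the text above.
FINE_TUNE_COST = 50.0    # upper bound of the $5-$50 range, per model version
MODEL_SIZE_GB = 1.0      # roughly a 1B model at Q4_K_M quantization
CDN_RATE_PER_GB = 0.08   # assumed standard egress rate, "under $0.10" per GB

def one_time_cost(installs: int) -> float:
    """Total one-time cost: fine-tuning plus model distribution."""
    return FINE_TUNE_COST + installs * MODEL_SIZE_GB * CDN_RATE_PER_GB

print(one_time_cost(10_000))  # 850.0 -- one-time, against $0/month ongoing
```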
The Break-Even Point
Using GPT-4o-mini as a baseline (a common choice for cost-conscious teams):
| MAU | GPT-4o-mini Monthly | On-Device Monthly | Time to Break Even |
|---|---|---|---|
| 10K | $337.50 | $0 | Less than 1 month after setup |
| 50K | $1,687.50 | $0 | Less than 1 month after setup |
| 100K | $3,375.00 | $0 | Less than 1 month after setup |
The one-time fine-tuning cost of $5-$50 is recovered within the first month at virtually any MAU above a few hundred users. The only real cost is engineering time for integration and the initial model distribution.
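The arithmetic behind "less than 1 month" is simple enough to sketch. Distribution cost is excluded here because it is paid per install rather than monthly, as noted above:

```python
# Months until the one-time fine-tuning cost is recovered by dropping
# the cloud bill.
FINE_TUNE_COST = 50.0  # upper bound of the $5-$50 range

def break_even_months(monthly_api_bill: float) -> float:
    return FINE_TUNE_COST / monthly_api_bill

print(break_even_months(337.50))    # ~0.15 months at 10K MAU
print(break_even_months(3_375.00))  # ~0.015 months at 100K MAU
```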
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
The Hidden Costs of Cloud APIs
The pricing table is not the full story. Cloud API dependencies carry a set of costs that do not show up on your monthly invoice.
Rate Limits and Latency Spikes
Every major provider imposes rate limits: tokens per minute, requests per minute, and daily caps. These are tiered by account level and can require weeks of usage history to increase. During a spike (a viral moment, a product launch, a feature suddenly trending), you will hit limits exactly when you need reliability most. Rate limit errors require client-side retry logic, which adds complexity and can cascade into user-facing failures.
Latency also varies. Cloud model endpoints are shared infrastructure, and P99 latencies can reach 5-10 seconds during peak load. On-device inference latency, by contrast, is predictable: the model runs on hardware dedicated to one user, with no network round-trip.
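For illustration, here is a minimal sketch of that client-side retry logic. `RateLimitError` is a stand-in for whatever 429 exception your provider's SDK actually raises:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for a provider SDK's 429 error type."""

def with_backoff(call_api, max_retries: int = 5):
    """Retry call_api with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call_api()
        except RateLimitError:
            # Waits of 1s, 2s, 4s, 8s, 16s, each with up to 1s of jitter
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("still rate-limited after all retries")
```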
Vendor Lock-In and Deprecation Risk
Model APIs are not stable contracts. OpenAI has deprecated GPT-3, GPT-3.5, and multiple fine-tuned endpoints. Anthropic, Google, and others have followed similar patterns. When a model is deprecated, you have a migration window, often 6-12 months, to update your prompts, retest, and redeploy. Prompt engineering that works well on GPT-4o-mini does not always transfer directly to a new model.
On-device models do not deprecate on a provider's schedule. You control when you update and can support older app versions indefinitely without paying for an API endpoint you no longer control.
Network Dependency
Mobile apps that require an active internet connection for every AI feature have a hard constraint. On-device models work offline. For note-taking apps, productivity tools, local-first apps, or any app targeting regions with unreliable connectivity, offline capability is a genuine competitive advantage, not just a nice-to-have.
Privacy and Data Residency
Every API call sends user input to a third-party server. For apps handling sensitive data (health, finance, legal, HR), this creates compliance surface area. On-device inference keeps user data on the device. It never leaves.
When Cloud APIs Still Make Sense
On-device is not the right answer for every use case. Be honest about these scenarios:
Prototyping and early-stage development. When you have fewer than a few hundred MAU, the economics favor cloud. You are still validating the feature. Use GPT-4o-mini or Gemini Flash, instrument your token usage carefully (a minimal instrumentation sketch follows these scenarios), and revisit the model architecture at 1K-5K MAU.
Tasks requiring frontier model capability. On-device models in the 1B-7B parameter range handle summarization, classification, extraction, simple Q&A, and short-form generation well. They are not suitable for complex multi-step reasoning, code generation across large codebases, or tasks that genuinely benefit from 100B+ parameter models. If your feature requires GPT-4o-level reasoning, on-device is not a substitute.
Low-volume B2B tools. If you have 200 enterprise users each doing 10 interactions per week, your GPT-4o bill is under $100/month. The engineering investment to implement on-device is not worth it at that volume.
Tasks with rapidly changing requirements. If your system prompt changes weekly and you are iterating fast on the model behavior, the cloud iteration loop is much faster. Re-fine-tuning and redistributing an on-device model takes more time than pushing a new system prompt.
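On the instrumentation point above, a minimal sketch of a per-feature token ledger. It assumes an OpenAI-style `usage` object on each response; field names vary by SDK:

```python
from collections import defaultdict

# Running token totals per feature, so cost per feature is visible
# before the invoice arrives.
usage_by_feature = defaultdict(lambda: {"input": 0, "output": 0})

def record_usage(feature: str, response) -> None:
    u = response.usage  # OpenAI-style usage object (assumed shape)
    usage_by_feature[feature]["input"] += u.prompt_tokens
    usage_by_feature[feature]["output"] += u.completion_tokens
```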
A Practical Decision Framework
| Factor | Cloud API | On-Device |
|---|---|---|
| MAU under 2,000 | Preferred | Overhead not justified |
| MAU over 10,000 | Expensive | Cost-effective |
| Offline required | No | Yes |
| Privacy-sensitive data | Risky | Safe by default |
| Complex reasoning tasks | Better capability | Limited |
| Rapid prompt iteration | Easy | Requires re-deploy |
| Deterministic latency | No | Yes |
| Vendor deprecation risk | High | None |
The decision is not binary. A common hybrid architecture uses on-device for the core features (summarization, tagging, quick responses) and routes specific high-complexity requests to a cloud API. This keeps 80-90% of your inference volume on-device at zero per-token cost while preserving access to frontier capability for edge cases.
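A sketch of what that router can look like. The task set, length threshold, and backend functions are all illustrative placeholders, not a prescribed design:

```python
def run_on_device(prompt: str) -> str:
    ...  # placeholder: local GGUF inference via llama.cpp bindings

def call_cloud_api(prompt: str) -> str:
    ...  # placeholder: provider SDK call

# High-volume, low-complexity tasks stay local.
ON_DEVICE_TASKS = {"summarize", "tag", "classify", "quick_reply"}

def route(task: str, prompt: str) -> str:
    """Keep routine work on-device; escalate edge cases to the cloud."""
    if task in ON_DEVICE_TASKS and len(prompt) < 4_000:
        return run_on_device(prompt)   # zero marginal cost
    return call_cloud_api(prompt)      # per-token cost, frontier capability
```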
The Engineering Path to On-Device
The practical barrier to on-device AI has historically been the toolchain. Fine-tuning requires ML infrastructure, exporting to GGUF requires model conversion tooling, and integrating inference into a mobile app requires platform-specific bindings.
This is where Ertas fits. The platform handles fine-tuning (LoRA adapters on your dataset), quantization, and GGUF export in a single pipeline. You provide your training data and target use case. You get back a GGUF file ready for mobile deployment, along with integration guides for iOS (via llama.cpp bindings) and Android.
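Before wiring the exported file into a mobile build, you can sanity-check it on a workstation with the llama-cpp-python bindings. The filename here is hypothetical; on device you would use llama.cpp's iOS or Android bindings instead:

```python
from llama_cpp import Llama

# Load the exported GGUF and run one test prompt.
llm = Llama(model_path="my-finetune-q4_k_m.gguf", n_ctx=2048)
out = llm("Summarize: the meeting moved to Thursday at 3pm.", max_tokens=64)
print(out["choices"][0]["text"])
```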
The one-time fine-tuning cost of $5-$50 versus a monthly API bill that grows linearly with every user you acquire: the math resolves itself quickly.
Conclusion
At 10K MAU using GPT-4o-mini, you are paying $337/month. At 50K MAU, that is $1,687. At 100K MAU, it is $3,375 per month, and that is with a cheap model and conservative usage assumptions. GPT-4o at 100K MAU is $56,250 per month.
On-device inference costs $0 after a one-time fine-tuning investment of under $50 and model distribution costs that are amortized at install time.
The break-even is not months away. For almost any app above a few hundred active users, the API bill exceeds the fine-tuning cost within the first billing cycle after launch. The question is not whether on-device is cheaper. The question is when you build it.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Keep reading

Fine-Tuning vs RAG for Mobile: Why RAG Still Needs a Server
RAG is the go-to solution for giving AI domain knowledge. But on mobile, RAG reintroduces the server dependency you are trying to eliminate. Fine-tuning bakes the knowledge into the model itself.

Fine-Tuning vs Prompt Engineering for Mobile Apps
Prompt engineering is fast and flexible. Fine-tuning is accurate and cheap at scale. Here is the practical comparison for mobile developers deciding between the two approaches.

On-Device AI Unit Economics: The Math That Makes Mobile AI Profitable
The complete unit economics breakdown for on-device AI vs cloud APIs. Fixed costs, variable costs, break-even analysis, and the financial model for scaling mobile AI features profitably.