    On-Device vs Cloud API: The Real Math at 10K, 50K, and 100K MAU


    A no-fluff cost breakdown of cloud API pricing vs on-device inference at scale. See exactly when on-device fine-tuning pays for itself, with tables, real pricing data, and the hidden costs nobody puts in the README.

    Ertas Team

    Your AI feature works great in testing. Responses are fast, the model is capable, costs are negligible. Then you hit 10K monthly active users and the invoice arrives.

    This is the moment that separates apps that scale from apps that quietly get rebuilt. Seventy percent of CIOs cite AI cost unpredictability as their top adoption barrier, according to a 2026 Forrester report. Menlo Ventures found that average monthly organizational AI spend jumped from $63K in 2024 to $85.5K in 2025, a 36% increase in a single year. Replit's gross margins reportedly swung from +36% to -14% as AI inference costs scaled with usage (Sacra).

    The good news: you can model this before it happens. This article shows the math.

    The Pricing Landscape

    First, let's establish the actual numbers. All prices are per 1 million tokens as of early 2026.

    | Model | Input (per 1M tokens) | Output (per 1M tokens) |
    |---|---|---|
    | OpenAI GPT-4o | $2.50 | $10.00 |
    | OpenAI GPT-4.1-mini | $0.40 | $1.60 |
    | OpenAI GPT-4o-mini | $0.15 | $0.60 |
    | Anthropic Claude 3.5 Haiku | $0.80 | $4.00 |
    | Google Gemini 2.0 Flash | $0.10 | $0.40 |

    Output tokens cost significantly more than input tokens across every provider. This matters because most cost estimates focus on input length and undercount the output side.

    The Cost Model: Assumptions

    To make this concrete, we need a baseline usage assumption. Here is a reasonable model for a mobile app with an AI assistant feature:

    • 3 interactions per user per day (conservative for a daily-use app)
    • 500 input tokens per interaction (a short system prompt plus user message)
    • 500 output tokens per interaction (a paragraph-length response)
    • Monthly active users at 10K, 50K, and 100K

    That gives us 90 interactions per user per month (3 per day over 30 days), and 1,000 tokens total per interaction (split evenly between input and output).

    Total tokens per user per month: 90,000 (45K input + 45K output).
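    The assumptions above reduce to a one-line formula, sketched here as a small calculator. The `PRICES` table is just the pricing figures from the earlier section, not a provider SDK:

```python
# Minimal sketch of the cost model: 3 interactions/day over 30 days,
# 500 input + 500 output tokens each. Prices are USD per 1M tokens.
PRICES = {  # (input, output), early-2026 list prices
    "gemini-2.0-flash": (0.10, 0.40),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4.1-mini": (0.40, 1.60),
    "claude-3.5-haiku": (0.80, 4.00),
    "gpt-4o": (2.50, 10.00),
}

def monthly_cost(mau: int, model: str,
                 interactions_per_day: int = 3,
                 tokens_in: int = 500, tokens_out: int = 500) -> float:
    """Monthly API cost in USD for a given MAU count."""
    in_price, out_price = PRICES[model]
    interactions = interactions_per_day * 30 * mau
    cost_in = interactions * tokens_in / 1_000_000 * in_price
    cost_out = interactions * tokens_out / 1_000_000 * out_price
    return cost_in + cost_out

print(monthly_cost(10_000, "gpt-4o-mini"))  # 337.5, matching the table below
```

    Swapping in your own measured token counts is the point: instrument real usage, then rerun the same formula before committing to an architecture.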

    Cloud API Costs at Scale

    Here is what that math produces at three MAU milestones.

    10,000 MAU

    | Model | Monthly Cost |
    |---|---|
    | Gemini 2.0 Flash | $225.00 |
    | GPT-4o-mini | $337.50 |
    | GPT-4.1-mini | $900.00 |
    | Claude 3.5 Haiku | $2,160.00 |
    | GPT-4o | $5,625.00 |

    50,000 MAU

    | Model | Monthly Cost |
    |---|---|
    | Gemini 2.0 Flash | $1,125.00 |
    | GPT-4o-mini | $1,687.50 |
    | GPT-4.1-mini | $4,500.00 |
    | Claude 3.5 Haiku | $10,800.00 |
    | GPT-4o | $28,125.00 |

    100,000 MAU

    | Model | Monthly Cost |
    |---|---|
    | Gemini 2.0 Flash | $2,250.00 |
    | GPT-4o-mini | $3,375.00 |
    | GPT-4.1-mini | $9,000.00 |
    | Claude 3.5 Haiku | $21,600.00 |
    | GPT-4o | $56,250.00 |

    These are bare-minimum estimates. They do not include retry logic, streaming overhead, context window growth as conversations extend, or the cost of embedding calls if you are running RAG. Real-world token usage is typically 1.5-2x higher than estimates.

    The On-Device Alternative

    On-device inference runs the model on the user's hardware. After the model is distributed, each inference costs you nothing. No per-token fees, no API calls, no egress costs.

    The two cost components you actually pay are:

    1. Fine-tuning (one-time): Training a LoRA adapter on a cloud GPU service runs approximately $5-$50 depending on dataset size and base model. This is a one-time cost per model version, not per user or per inference.

    2. Model distribution (one-time per install): You are shipping a GGUF file with your app. GGUF model sizes for practical mobile-capable models: Llama 3.2 1B at Q4_K_M quantization is 808MB; the 3B variant is 2.02GB. CDN egress for a 1GB file at standard rates is under $0.10 per install. For 10K users, that is roughly $1,000 total distribution cost amortized at install time, not monthly.

    Ongoing monthly cost: $0.
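    Those two one-time components can be sketched the same way. The default figures here are the article's own estimates ($50 fine-tune ceiling, $0.10 egress per ~1GB install), not quotes from any CDN:

```python
def one_time_cost(installs: int,
                  egress_usd_per_install: float = 0.10,  # "under $0.10"/GB
                  finetune_usd: float = 50.0) -> float:
    """One-time on-device cost: a single fine-tune plus CDN egress
    for shipping the GGUF file to each install."""
    return finetune_usd + installs * egress_usd_per_install

print(one_time_cost(10_000))  # 1050.0 — roughly the $1,000 figure above
```

    Note that egress scales with installs while fine-tuning does not, and neither recurs monthly.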

    The Break-Even Point

    Using GPT-4o-mini as a baseline (a common choice for cost-conscious teams):

    | MAU | GPT-4o-mini Monthly | On-Device Monthly | Break-Even |
    |---|---|---|---|
    | 10K | $337.50 | $0 | Less than 1 month after setup |
    | 50K | $1,687.50 | $0 | Less than 1 month after setup |
    | 100K | $3,375.00 | $0 | Less than 1 month after setup |

    The one-time fine-tuning cost of $5-$50 is recovered within the first month at virtually any MAU above a few hundred users. The only real cost is engineering time for integration and the initial model distribution.
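    The break-even claim is easy to verify: divide the one-time cost by the monthly bill it replaces. A minimal sketch, using the $50 worst-case fine-tune figure:

```python
def months_to_breakeven(one_time_usd: float, monthly_api_usd: float) -> float:
    """How many months of API spend the one-time cost equals."""
    return one_time_usd / monthly_api_usd

# Worst-case $50 fine-tune vs the GPT-4o-mini bill at 10K MAU
print(months_to_breakeven(50.0, 337.50))  # ~0.15 months
```

    Even adding the ~$1,000 of install-time egress at 10K users only pushes break-even to a few months, and that cost is paid once per install, not per month.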

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    The Hidden Costs of Cloud APIs

    The pricing table is not the full story. Cloud API dependencies carry a set of costs that do not show up on your monthly invoice.

    Rate Limits and Latency Spikes

    Every major provider imposes rate limits: tokens per minute, requests per minute, and daily caps. These are tiered by account level and can require weeks of usage history to increase. During a spike (a viral moment, a product launch, a feature suddenly trending), you will hit limits exactly when you need reliability most. Rate limit errors require client-side retry logic, which adds complexity and can cascade into user-facing failures.
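    That retry logic is typically exponential backoff with jitter. A minimal sketch; `RateLimitError` here is a stand-in for whatever HTTP 429 exception your provider's SDK actually raises:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider SDK's 429 exception (name is illustrative)."""

def call_with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a rate-limited call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up and surface the failure to the caller
            # 1s, 2s, 4s, ... plus jitter to avoid thundering-herd retries
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

    The jitter matters: if every client retries on the same schedule, the retries themselves arrive as a synchronized spike and re-trigger the limit.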

    Latency also varies. Cloud model endpoints are shared infrastructure. P99 latencies can reach 5-10 seconds during peak load periods. On-device inference, by contrast, is deterministic. It runs on dedicated hardware with no network round-trip.

    Vendor Lock-In and Deprecation Risk

    Model APIs are not stable contracts. OpenAI has deprecated GPT-3, GPT-3.5, and multiple fine-tuned endpoints. Anthropic, Google, and others have followed similar patterns. When a model is deprecated, you have a migration window, often 6-12 months, to update your prompts, retest, and redeploy. Prompt engineering that works well on GPT-4o-mini does not always transfer directly to a new model.

    On-device models do not deprecate on a provider's schedule. You control when you update and can support older app versions indefinitely without paying for an API endpoint you no longer control.

    Network Dependency

    Mobile apps that require an active internet connection for every AI feature have a hard constraint. On-device models work offline. For note-taking apps, productivity tools, local-first apps, or any app targeting regions with unreliable connectivity, offline capability is a genuine competitive advantage, not just a nice-to-have.

    Privacy and Data Residency

    Every API call sends user input to a third-party server. For apps handling sensitive data (health, finance, legal, HR), this creates compliance surface area. On-device inference keeps user data on the device. It never leaves.

    When Cloud APIs Still Make Sense

    On-device is not the right answer for every use case. Be honest about these scenarios:

    Prototyping and early-stage development. When you have fewer than a few hundred MAU, the economics favor cloud. You are still validating the feature. Use GPT-4o-mini or Gemini Flash, instrument your token usage carefully, and revisit the model architecture at 1K-5K MAU.

    Tasks requiring frontier model capability. On-device models in the 1B-7B parameter range are capable at summarization, classification, extraction, simple Q&A, and short-form generation. They are not suitable for complex multi-step reasoning, code generation across large codebases, or tasks that genuinely benefit from 100B+ parameter models. If your feature requires GPT-4o level reasoning, on-device is not a substitute.

    Low-volume B2B tools. If you have 200 enterprise users each doing 10 interactions per week, your GPT-4o bill is under $100/month. The engineering investment to implement on-device is not worth it at that volume.

    Tasks with rapidly changing requirements. If your system prompt changes weekly and you are iterating fast on the model behavior, the cloud iteration loop is much faster. Re-fine-tuning and redistributing an on-device model takes more time than pushing a new system prompt.

    A Practical Decision Framework

    | Factor | Cloud API | On-Device |
    |---|---|---|
    | MAU under 2,000 | Preferred | Overhead not justified |
    | MAU over 10,000 | Expensive | Cost-effective |
    | Offline required | No | Yes |
    | Privacy-sensitive data | Risky | Safe by default |
    | Complex reasoning tasks | Better capability | Limited |
    | Rapid prompt iteration | Easy | Requires re-deploy |
    | Deterministic latency | No | Yes |
    | Vendor deprecation risk | High | None |

    The decision is not binary. A common hybrid architecture: use on-device for the core features (summarization, tagging, quick responses) and route specific high-complexity requests to a cloud API. This keeps the 80-90% of your inference volume on-device at zero per-token cost while preserving access to frontier capability for edge cases.
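    The routing layer in that hybrid can be very thin. A sketch; the task names and the two callables are placeholders for your own on-device and cloud integrations:

```python
# High-volume, simple tasks stay on-device; everything else escalates.
ON_DEVICE_TASKS = {"summarize", "tag", "classify", "quick_reply"}

def route(task: str, prompt: str, run_on_device, call_cloud_api) -> str:
    """Dispatch a request to local inference or the cloud API by task type."""
    if task in ON_DEVICE_TASKS:
        return run_on_device(prompt)   # zero marginal cost, works offline
    return call_cloud_api(prompt)      # per-token cost, frontier capability
```

    Because the split is by task type rather than by user, the cheap path absorbs the bulk of the volume automatically, and the cloud bill only tracks the rare complex requests.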

    The Engineering Path to On-Device

    The practical barrier to on-device AI has historically been the toolchain. Fine-tuning requires ML infrastructure, exporting to GGUF requires model conversion tooling, and integrating inference into a mobile app requires platform-specific bindings.

    This is where Ertas fits. The platform handles fine-tuning (LoRA adapters on your dataset), quantization, and GGUF export in a single pipeline. You provide your training data and target use case. You get back a GGUF file ready for mobile deployment, along with integration guides for iOS (via llama.cpp bindings) and Android.

    The one-time fine-tuning cost of $5-$50 versus a monthly API bill that grows linearly with every user you acquire: the math resolves itself quickly.

    Conclusion

    At 10K MAU using GPT-4o-mini, you are paying $337/month. At 50K MAU, that is $1,687. At 100K MAU, it is $3,375 per month, and that is with a cheap model and conservative usage assumptions. GPT-4o at 100K MAU is $56,250 per month.

    On-device inference costs $0 after a one-time fine-tuning investment of under $50 and model distribution costs that are amortized at install time.

    The break-even is not months away. For almost any app above a few hundred active users, the API bill exceeds the fine-tuning cost within the first billing cycle after launch. The question is not whether on-device is cheaper. The question is when you build it.
