    AI API Rate Limits Will Throttle Your Mobile App at Scale
    rate limits · API scaling · mobile AI · reliability


    Rate limits from OpenAI, Anthropic, and Google are designed for controlled usage, not mobile apps with thousands of concurrent users. Here is where the limits hit and what happens when they do.

    Ertas Team

    Your app gets featured on the App Store. Downloads spike. 5,000 users open the app in the same hour. Each one triggers an AI feature. Your backend fires 5,000 API calls to OpenAI, and they do not arrive evenly spread across those 60 minutes.

    OpenAI's Tier 1 allows 500 requests per minute. When those 5,000 calls cluster into a few-minute burst, as launch traffic does, you blow past that limit several times over. The API returns HTTP 429 (Too Many Requests). Your users see error messages or loading spinners that never resolve.

    This is not a hypothetical. It is the predictable result of combining mobile app distribution patterns with API rate limits designed for controlled, enterprise usage.

    Rate Limits by Provider

    OpenAI

    Tier | Requirement | RPM | TPM
    --- | --- | --- | ---
    Free | API key | 3 | 40,000
    Tier 1 | $5 payment | 500 | 30,000-200,000
    Tier 2 | $50+ spent, 7+ days | 5,000 | 450,000-2,000,000
    Tier 3 | $100+ spent, 7+ days | 5,000 | 800,000-4,000,000
    Tier 4 | $250+ spent, 14+ days | 10,000 | 2,000,000-10,000,000
    Tier 5 | $1,000+ spent, 30+ days | 30,000 | 10,000,000-150,000,000

    You start at Tier 1 (500 RPM). Getting to Tier 5 requires $1,000 in cumulative spend and 30 days of account history. There is no way to skip ahead.

    Anthropic

    Tier | Requirement | RPM | TPM
    --- | --- | --- | ---
    Build | Default | 1,000 | 80,000
    Scale | After review | 4,000 | 400,000

    Anthropic requires a manual tier upgrade. You apply, they review, they decide. There is no automatic scaling.

    Google Gemini

    Tier | RPM | TPM
    --- | --- | ---
    Free | 15 | 1,000,000
    Pay-as-you-go | 2,000 | 4,000,000
    Enterprise | Custom | Custom

    Gemini's free tier is extremely limited (15 RPM). Pay-as-you-go is better but still has hard caps.

    How Mobile Apps Hit Rate Limits

    Concurrent Usage Spikes

    Mobile apps have bursty usage patterns. A feature on the App Store, a viral social media post, or a product launch can drive thousands of simultaneous first-time users. Unlike web SaaS where usage ramps gradually, mobile app downloads can spike 10-100x in a single day.

    Peak Hours

    Mobile usage peaks between 7-9 PM local time. If your users are concentrated in one timezone, 60-70% of daily usage compresses into a 3-hour window. Your daily average may be well within limits, but your peak hour exceeds them.

    Feature Engagement Bursts

    When a user opens an AI feature for the first time, they often make 5-10 rapid requests exploring it. This "exploration burst" means new users generate 3-5x more requests than steady-state users. During a download spike, this compounds.

    The Math

    1,000 MAU, 3 requests/user/day = 3,000 requests/day = ~125 requests/hour average.

    But compress 60% of usage into 3 peak hours: 1,800 requests in 3 hours = 600 requests/hour = 10 RPM. Comfortable at Tier 1.

    10,000 MAU with the same pattern: 100 RPM during peak. Still okay at Tier 1.

    50,000 MAU: 500 RPM during peak. At the Tier 1 limit. Any spike exceeds it.

    Now add an App Store feature that drives 5,000 downloads in one hour, each making 3 exploration requests: 15,000 additional requests in one hour = 250 RPM on top of your baseline. You need Tier 2 minimum, which requires $50 in prior spend and 7 days of account history.
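
    If you want to plug in your own numbers, here is a minimal sketch of that arithmetic (the function name and defaults are illustrative, not part of any provider's API):

```typescript
// Rough peak-RPM estimator using the assumptions from the example above:
// 3 requests per user per day, 60% of traffic compressed into a 3-hour peak.
function peakRpm(mau: number, requestsPerUserPerDay = 3, peakShare = 0.6, peakHours = 3): number {
  const dailyRequests = mau * requestsPerUserPerDay;
  const peakWindowRequests = dailyRequests * peakShare;
  return peakWindowRequests / (peakHours * 60); // requests per minute during the peak window
}

console.log(peakRpm(50_000)); // ~500 RPM: right at OpenAI's Tier 1 limit

// An App Store feature: 5,000 new users in one hour, 3 exploration requests each.
const spikeRpm = (5_000 * 3) / 60; // ~250 RPM on top of the baseline
console.log(spikeRpm);
```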

    What Happens When You Hit the Limit

    HTTP 429 Responses

    The API returns a 429 status code with a retry-after header. Your app receives no AI response. Without proper error handling, the user sees a crash, a blank response, or an infinite loading state.
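
    Here is a rough sketch of what handling a 429 gracefully might look like on your backend; the endpoint and payload follow OpenAI's chat completions API, while the model name and the error copy shown to users are placeholders:

```typescript
// Sketch: call the provider, and turn a 429 into a real user-facing state
// instead of a crash or an endless spinner. Model name and copy are placeholders.
async function askAi(prompt: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
  });

  if (res.status === 429) {
    const retryAfter = res.headers.get("retry-after"); // seconds, when the provider sends it
    return retryAfter
      ? `We're at capacity right now. Try again in about ${retryAfter} seconds.`
      : "We're at capacity right now. Please try again in a moment.";
  }

  const data = await res.json();
  return data.choices[0].message.content;
}
```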

    Exponential Backoff

    The standard retry strategy is exponential backoff: wait 1 second, retry, wait 2 seconds, retry, wait 4 seconds, retry. This adds latency on top of already-slow API calls.

    For a user waiting 1-2 seconds for an AI response, adding 1-4 seconds of backoff retries means 3-6 seconds total. Most users give up.
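
    A minimal backoff wrapper might look like this; `callApi` is a placeholder for whatever function issues the provider request, and the retry counts and jitter are illustrative:

```typescript
// Exponential backoff sketch: retry on 429 with 1s, 2s, 4s waits plus jitter,
// then give up. `callApi` stands in for whatever function issues the request.
async function withBackoff(callApi: () => Promise<Response>, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await callApi();
    if (res.status !== 429 || attempt >= maxRetries) return res;

    const delayMs = 1000 * 2 ** attempt + Math.random() * 250; // 1s, 2s, 4s (+ jitter)
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```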

    Queue Congestion

    If you implement a server-side queue for rate-limited requests, the queue grows during spikes. A 10-minute spike at 2x your rate limit creates a backlog that takes 10 minutes to clear. Users at the back of the queue wait 10+ minutes for a response.

    Degraded Experience for All Users

    Rate limits are per-organization, not per-user. When one usage spike triggers throttling, every user of your app is affected. The user who has been using the feature daily for months gets the same 429 error as the new user who just downloaded.

    Mitigation Strategies

    Request Throttling

    Implement client-side rate limiting. Cap requests per user per minute. This protects against individual abuse but does not solve the concurrent-user problem.
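
    A sketch of such a per-user cap, assuming an in-memory map keyed by a hypothetical user ID (in production this could live in the app itself or at your API gateway):

```typescript
// Per-user cap sketch: at most `limit` requests per user per minute, tracked in
// an in-memory map of recent request timestamps. Numbers are illustrative.
const WINDOW_MS = 60_000;
const recentRequests = new Map<string, number[]>(); // userId -> timestamps

function allowRequest(userId: string, limit = 5): boolean {
  const now = Date.now();
  const recent = (recentRequests.get(userId) ?? []).filter((t) => now - t < WINDOW_MS);

  if (recent.length >= limit) {
    recentRequests.set(userId, recent);
    return false; // over the cap: show a "slow down" state instead of calling the API
  }

  recent.push(now);
  recentRequests.set(userId, recent);
  return true;
}
```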

    Server-Side Queue

    Route all AI requests through your own server. The server manages a queue and dispatches to the AI API within rate limits. This smooths spikes but adds latency and server infrastructure costs.
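
    A minimal version of that queue might look like the sketch below; `sendToProvider` is a placeholder for your actual API call, and the dispatch rate is an illustrative number just under a Tier 1 limit:

```typescript
// Server-side queue sketch: enqueue incoming AI requests and drain them at a
// fixed rate below your tier's limit. Backlog (and wait time) grows during spikes.
type Job = { prompt: string; resolve: (text: string) => void };

const queue: Job[] = [];
const dispatchPerMinute = 450; // stay under a 500 RPM Tier 1 limit

function enqueue(prompt: string): Promise<string> {
  return new Promise((resolve) => queue.push({ prompt, resolve }));
}

async function sendToProvider(prompt: string): Promise<string> {
  // Placeholder: call OpenAI / Anthropic / Gemini here.
  return `response to: ${prompt}`;
}

setInterval(async () => {
  const job = queue.shift();
  if (job) job.resolve(await sendToProvider(job.prompt));
}, 60_000 / dispatchPerMinute);
```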

    Multiple API Keys

    Distribute requests across multiple API keys or provider accounts. This multiplies your effective rate limit, but it violates most providers' Terms of Service and risks account suspension if the provider detects it.

    Model Fallback Chain

    If your primary provider is rate-limited, fall back to a secondary provider. OpenAI rate limited? Route to Gemini. This adds complexity and requires maintaining multiple integrations.
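
    One way to sketch such a chain, assuming you already have per-provider call functions (the names in the usage note are placeholders):

```typescript
// Fallback-chain sketch: try providers in order and move on when one is rate-limited.
// The per-provider call functions are placeholders for your existing integrations.
type ProviderCall = (prompt: string) => Promise<{ status: number; text?: string }>;

async function withFallback(prompt: string, providers: ProviderCall[]): Promise<string> {
  for (const call of providers) {
    const res = await call(prompt);
    if (res.status === 429) continue; // rate-limited: try the next provider
    if (res.text !== undefined) return res.text;
  }
  throw new Error("All providers rate-limited or failed");
}

// Usage (assuming you have callOpenAI and callGemini wrappers):
// const reply = await withFallback(prompt, [callOpenAI, callGemini]);
```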

    Caching

    For identical or similar requests, cache responses. This reduces API calls but only helps if users ask similar things. Unique user inputs (the majority of chat interactions) cannot be cached.
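
    A minimal cache keyed on the normalized prompt might look like this (the normalization here is deliberately naive; real similarity matching is harder):

```typescript
// Response-cache sketch: key on the normalized prompt so repeated questions
// skip the API entirely. Only helps when user inputs actually repeat.
const responseCache = new Map<string, string>();

async function cachedAsk(prompt: string, ask: (p: string) => Promise<string>): Promise<string> {
  const key = prompt.trim().toLowerCase();
  const hit = responseCache.get(key);
  if (hit !== undefined) return hit; // cache hit: no API call, no rate-limit impact

  const answer = await ask(prompt);
  responseCache.set(key, answer);
  return answer;
}
```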

    The Structural Solution

    Rate limits exist because cloud providers share finite GPU capacity across all customers. More users on the platform means tighter limits for everyone.

    On-device inference has no rate limits. The "server" is the user's phone. Each user has their own inference capacity. 1,000 concurrent users means 1,000 parallel inference instances, each running independently.

    Factor | Cloud API | On-Device
    --- | --- | ---
    Rate limit | 500-30,000 RPM (shared) | None (per-device)
    Concurrent users | Limited by provider tier | Unlimited
    Spike handling | Throttled | No change
    Infrastructure needed | Queue server + retry logic | None
    Reliability | Depends on provider | Depends on device

    The scaling model is fundamentally different. Cloud APIs share a pool. On-device gives each user their own pool.

    Planning for Scale

    If you are building with cloud APIs today:

    1. Know your tier. Check your current rate limits and how close you are to them.
    2. Monitor 429 rates. Track how often your users hit rate limits. If it is over 0.5%, you have a problem.
    3. Estimate your ceiling. At what MAU does your peak-hour RPM exceed your tier limit? That is your scaling cliff.
    4. Build the fallback. Queue, retry, and graceful degradation are table stakes for production apps.
    5. Plan the exit. On-device inference is the long-term answer. Fine-tune a model on your domain data with a platform like Ertas, export GGUF, and deploy to user devices. No rate limits, no shared infrastructure, no scaling cliffs.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
