
    Why Your AI App Feels Slow: Network Latency Is the Bottleneck

    AI API calls add 500-3,000ms of latency to every interaction. On mobile, that is the difference between a feature users love and one they abandon. Here is where the time goes and how to fix it.

Ertas Team

    Your AI feature works. The model is good. The prompts are tuned. But users describe it as "laggy." They wait 2-3 seconds staring at a loading spinner before seeing any response. On mobile, that is an eternity.

    The problem is not the model. It is the network round trip between the user's phone and the cloud server running inference.

    Where the Time Goes

    A typical cloud AI API call from a mobile device involves these steps:

| Step | Time | Notes |
| --- | --- | --- |
| DNS resolution | 10-50ms | Cached after first call |
| TCP + TLS handshake | 50-150ms | Per connection (connection reuse helps) |
| Request upload | 20-100ms | Depends on payload size and bandwidth |
| Server queue wait | 50-500ms | Varies by provider load |
| Model inference (TTFT) | 200-1,500ms | Time to first token, depends on model |
| Response download (first token) | 20-50ms | Network transit |
| Total time to first token | 350-2,350ms | |

    On a good connection, you might see 500ms. On cellular, 1-2 seconds. On a weak connection (elevator, subway, rural area), 3+ seconds or a timeout.
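You can see this breakdown in your own app with OkHttp's EventListener, which exposes a callback for each network phase. A minimal sketch (the timing fields and log lines are ours, not a prescribed pattern):

```kotlin
import okhttp3.*
import java.net.InetAddress
import java.net.InetSocketAddress
import java.net.Proxy

// Logs how long each phase takes for a call, mirroring the table above:
// DNS, TCP + TLS handshake, and time to first byte. One listener instance
// per client, so this sketch assumes one call at a time.
class LatencyBreakdownListener : EventListener() {
    private var callStartNs = 0L
    private var dnsStartNs = 0L
    private var connectStartNs = 0L

    private fun ms(from: Long) = (System.nanoTime() - from) / 1_000_000

    override fun callStart(call: Call) { callStartNs = System.nanoTime() }

    override fun dnsStart(call: Call, domainName: String) { dnsStartNs = System.nanoTime() }
    override fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {
        println("DNS resolution: ${ms(dnsStartNs)}ms")
    }

    override fun connectStart(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy) {
        connectStartNs = System.nanoTime()
    }
    override fun connectEnd(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy, protocol: Protocol?) {
        println("TCP + TLS handshake: ${ms(connectStartNs)}ms")
    }

    // First response bytes = queue wait + model TTFT + network transit.
    override fun responseHeadersStart(call: Call) {
        println("Time to first byte: ${ms(callStartNs)}ms")
    }
}

val client = OkHttpClient.Builder()
    .eventListener(LatencyBreakdownListener())
    .build()
```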

    The Compounding Effect

    These latencies hit on every single interaction. A 5-turn conversation with a cloud API means the user waits 5 times. Each wait reinforces the perception that the feature is slow. After 3-4 interactions, many users stop using the feature entirely.

    Research from Google found that 53% of mobile users abandon a page that takes longer than 3 seconds to load. AI features face the same threshold, but the bar is even higher because users compare AI response times to the instant feedback of every other UI element in the app.

    Why Mobile Is Worse Than Desktop

    Desktop apps calling AI APIs have a structural advantage: they typically run on stable WiFi or ethernet connections. Mobile adds several layers of latency:

    Cellular variability: 4G latency averages 50-100ms but spikes to 300-500ms during congestion. 5G is better in ideal conditions but inconsistent in practice.

    Connection switching: Moving between WiFi and cellular (entering/leaving a building) can cause 1-2 second interruptions while the connection re-establishes.

    Background/foreground transitions: iOS and Android suspend network connections when apps are backgrounded. When the user returns, the connection may need to be re-established.

    Geographic distance: API servers are typically in US-East or US-West. Users in Southeast Asia, Africa, or South America add 100-300ms of pure network transit time.
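On Android you can at least detect connection switches rather than letting requests silently time out. A sketch using ConnectivityManager (what you do on a switch is app-specific):

```kotlin
import android.content.Context
import android.net.ConnectivityManager
import android.net.Network

// Observes WiFi <-> cellular handoffs so in-flight AI requests can be
// retried or queued instead of hanging until a timeout.
fun watchNetworkSwitches(context: Context, onSwitch: () -> Unit) {
    val cm = context.getSystemService(Context.CONNECTIVITY_SERVICE) as ConnectivityManager
    cm.registerDefaultNetworkCallback(object : ConnectivityManager.NetworkCallback() {
        override fun onAvailable(network: Network) {
            // Fires when the default network changes (e.g. WiFi -> cellular).
            onSwitch()
        }
        override fun onLost(network: Network) {
            // Connection dropped; pause or fail fast instead of waiting.
        }
    })
}
```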

    The UX Impact

    Loading Spinners Kill Engagement

    Every loading spinner is a moment where the user can decide to do something else. On mobile, "something else" is one swipe away. The app switcher is always available.

A/B testing by multiple mobile teams has shown that AI features with more than one second of latency see 30-40% lower completion rates than sub-second features. The feature works identically. The only difference is perceived speed.

    Streaming Helps, But Has Limits

    Token-by-token streaming (Server-Sent Events) reduces perceived latency by showing output as it generates. The user sees the first few words quickly, which gives the impression of responsiveness.

    Streaming improves the experience but does not eliminate the problem:

    • Time to first token is still 500-2,000ms
    • On weak connections, SSE streams buffer, creating visible stutter
    • Each streamed chunk adds its own network overhead
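To make the first point concrete, here is a sketch that reads an SSE stream with OkHttp and logs time to first token. The endpoint and payload are placeholders, not any specific provider's API:

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody

// Streams an SSE response line by line and measures time to first token.
fun streamWithTtft(client: OkHttpClient) {
    val body = """{"prompt": "hello", "stream": true}"""
        .toRequestBody("application/json".toMediaType())
    val request = Request.Builder()
        .url("https://api.example.com/v1/generate") // placeholder endpoint
        .post(body)
        .build()

    val start = System.nanoTime()
    var firstToken = true

    client.newCall(request).execute().use { response ->
        val source = response.body!!.source()
        while (!source.exhausted()) {
            val line = source.readUtf8Line() ?: break
            if (!line.startsWith("data: ")) continue // SSE frames are "data: <payload>"
            if (firstToken) {
                val ttftMs = (System.nanoTime() - start) / 1_000_000
                println("Time to first token: ${ttftMs}ms")
                firstToken = false
            }
            // Append line.removePrefix("data: ") to the UI here.
        }
    }
}
```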

    The Offline Gap

    The worst latency is infinite. When the user has no connection, the cloud API returns nothing. The feature breaks completely.

    This is not an edge case on mobile. Users are regularly in subways, elevators, airplanes, rural areas, and international locations without data. A feature that fails in these moments trains users not to rely on it.

    The On-Device Alternative

    On-device inference eliminates network latency entirely. The model runs on the user's phone. The path from input to output never touches the network.

| Metric | Cloud API | On-Device |
| --- | --- | --- |
| Time to first token | 500-2,000ms | 50-200ms |
| Full response (100 tokens) | 2-5 seconds | 1-3 seconds |
| Offline | Fails | Works |
| Latency on weak connection | 3-10+ seconds | Same as strong connection |
| Consistency | Variable | Consistent |

    The difference is most dramatic on the first token. Users perceive 50ms as instant. The response appears to begin "immediately" after tapping send.

    Token Generation Speed

    Modern mobile hardware generates tokens fast enough for a responsive chat experience:

| Device | 1B Model | 3B Model |
| --- | --- | --- |
| iPhone 15 Pro | 35-45 tok/s | 18-25 tok/s |
| Galaxy S24 | 35-45 tok/s | 18-25 tok/s |
| iPhone 13 | 20-30 tok/s | 10-15 tok/s |
| Mid-range Android | 18-25 tok/s | 8-12 tok/s |

    At 20+ tokens per second, text appears to stream smoothly. At 10+ tokens per second, it is readable and usable. Both exceed the threshold for a comfortable chat experience.
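These numbers are easy to verify on your own hardware. The sketch below assumes a hypothetical Kotlin binding around llama.cpp; `LlamaModel` and `generate` are illustrative stand-ins, since real bindings differ in shape:

```kotlin
// HYPOTHETICAL interface standing in for a real llama.cpp binding;
// generate() is assumed to invoke the callback once per generated token.
interface LlamaModel {
    fun generate(prompt: String, onToken: (String) -> Unit)
}

// Measures time to first token and decode speed for one generation.
fun benchmarkOnDevice(model: LlamaModel, prompt: String) {
    val start = System.nanoTime()
    var firstTokenAt = 0L
    var tokenCount = 0

    model.generate(prompt) { _ ->
        if (tokenCount == 0) firstTokenAt = System.nanoTime()
        tokenCount++
    }

    if (tokenCount < 2) return // not enough tokens to estimate decode speed
    val ttftMs = (firstTokenAt - start) / 1_000_000
    val decodeSecs = (System.nanoTime() - firstTokenAt) / 1e9
    println("TTFT: ${ttftMs}ms, decode: ${"%.1f".format((tokenCount - 1) / decodeSecs)} tok/s")
}
```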

    Measuring the Impact

    Track these metrics to quantify the latency problem in your app:

    P50/P95 time to first token: The median tells you the typical experience. The P95 tells you the worst 5% of users' experience. If P95 exceeds 3 seconds, 5% of your users are having a bad time on every interaction.
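A minimal sketch of computing both from logged samples, using the nearest-rank method:

```kotlin
import kotlin.math.ceil

// Computes a percentile from raw TTFT samples (in ms) using the
// nearest-rank method; fine for dashboard-level monitoring.
fun percentile(samples: List<Long>, p: Double): Long {
    require(samples.isNotEmpty())
    val sorted = samples.sorted()
    val rank = ceil(p / 100.0 * sorted.size).toInt()
    return sorted[(rank - 1).coerceIn(0, sorted.size - 1)]
}

val ttftMs = listOf<Long>(420, 610, 550, 2900, 480, 700, 3400, 530)
println("P50: ${percentile(ttftMs, 50.0)}ms")
println("P95: ${percentile(ttftMs, 95.0)}ms")
```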

    Feature completion rate: What percentage of users who start an AI interaction complete it (wait for and read the response)? Compare this to non-AI features in your app.

    Retry rate: How often do users tap "send" again because they thought the first request failed? High retry rates indicate perceived timeouts.
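A simple client-side approximation (the five-second window is a judgment call, not a standard):

```kotlin
// Flags a send as a likely retry when the user re-submits the same
// input shortly after the previous attempt.
class RetryDetector(private val windowMs: Long = 5_000) {
    private var lastPrompt: String? = null
    private var lastSentAt = 0L

    fun onSend(prompt: String): Boolean {
        val now = System.currentTimeMillis()
        val isRetry = prompt == lastPrompt && now - lastSentAt < windowMs
        lastPrompt = prompt
        lastSentAt = now
        return isRetry // log this to your analytics as a retry event
    }
}
```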

    AI feature retention (D7/D30): Are users who try the AI feature coming back to use it again? Low retention despite high initial trial suggests a UX problem, and latency is the most common cause.

    The Fix

    The architectural fix is to move inference to the device. Fine-tune a small model (1-3B parameters) on your specific task, export as GGUF, and run it locally via llama.cpp.

    The path:

    1. Measure your current cloud API latency (P50 and P95)
    2. Identify which AI features are latency-sensitive (any feature where users wait for a response)
    3. Collect training data from your existing API logs
    4. Fine-tune a domain-specific model using a platform like Ertas
    5. Deploy on-device and measure the latency improvement
    6. A/B test user engagement between cloud and on-device

    The latency improvement alone often drives a measurable increase in feature engagement. When you add offline support and cost elimination, the case becomes definitive.
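For steps 5 and 6, a sketch of the per-arm comparison, assuming each interaction is logged with its engine and outcome (the field names are illustrative, not a specific analytics schema):

```kotlin
// Aggregates logged interactions into per-arm TTFT and completion rate
// so cloud vs. on-device can be compared directly.
data class Interaction(val engine: String, val ttftMs: Long, val completed: Boolean)

fun compareArms(log: List<Interaction>) {
    log.groupBy { it.engine }.forEach { (engine, events) ->
        val medianTtft = events.map { it.ttftMs }.sorted()[events.size / 2]
        val completion = events.count { it.completed } * 100.0 / events.size
        println("$engine: median TTFT ${medianTtft}ms, completion ${"%.1f".format(completion)}%")
    }
}
```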

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
