
Why Your AI App Feels Slow: Network Latency Is the Bottleneck
AI API calls add 500-3,000ms of latency to every interaction. On mobile, that is the difference between a feature users love and one they abandon. Here is where the time goes and how to fix it.
Your AI feature works. The model is good. The prompts are tuned. But users describe it as "laggy." They wait 2-3 seconds staring at a loading spinner before seeing any response. On mobile, that is an eternity.
The problem is not the model. It is the network round trip between the user's phone and the cloud server running inference.
Where the Time Goes
A typical cloud AI API call from a mobile device involves these steps:
| Step | Time | Notes |
|---|---|---|
| DNS resolution | 10-50ms | Cached after first call |
| TCP + TLS handshake | 50-150ms | Per connection (connection reuse helps) |
| Request upload | 20-100ms | Depends on payload size and bandwidth |
| Server queue wait | 50-500ms | Varies by provider load |
| Model inference (TTFT) | 200-1,500ms | Time to first token, depends on model |
| Response download (first token) | 20-50ms | Network transit |
| Total time to first token | 350-2,350ms | Sum of the above |
On a good connection, you might see 500ms. On cellular, 1-2 seconds. On a weak connection (elevator, subway, rural area), 3+ seconds or a timeout.
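The number that matters most in that breakdown is time to first token, and it is easy to measure on the client. A minimal sketch, assuming your API client exposes the response as an iterator of streamed chunks (the `slow_stream` style generator is just a stand-in for that iterator):

```python
import time
from typing import Iterable, Iterator, Tuple

def measure_ttft(chunks: Iterable[str]) -> Tuple[float, str]:
    """Time from request start to the first streamed chunk.

    `chunks` is any iterator of response chunks (e.g. an SSE stream).
    Returns (ttft_seconds, full_response_text).
    """
    start = time.perf_counter()
    it: Iterator[str] = iter(chunks)
    first = next(it)  # blocks until the first token arrives over the network
    ttft = time.perf_counter() - start
    return ttft, first + "".join(it)
```

Wrap your real streaming client in this and log the result per request; that gives you the raw samples for the P50/P95 analysis discussed later.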
The Compounding Effect
These latencies hit on every single interaction. A 5-turn conversation with a cloud API means the user waits 5 times. Each wait reinforces the perception that the feature is slow. After 3-4 interactions, many users stop using the feature entirely.
Research from Google found that 53% of mobile users abandon a page that takes longer than 3 seconds to load. AI features face the same threshold, but the bar is even higher because users compare AI response times to the instant feedback of every other UI element in the app.
Why Mobile Is Worse Than Desktop
Desktop apps calling AI APIs have a structural advantage: they typically run on stable WiFi or ethernet connections. Mobile adds several layers of latency:
Cellular variability: 4G latency averages 50-100ms but spikes to 300-500ms during congestion. 5G is better in ideal conditions but inconsistent in practice.
Connection switching: Moving between WiFi and cellular (entering/leaving a building) can cause 1-2 second interruptions while the connection re-establishes.
Background/foreground transitions: iOS and Android suspend network connections when apps are backgrounded. When the user returns, the connection may need to be re-established.
Geographic distance: API servers are typically in US-East or US-West. Users in Southeast Asia, Africa, or South America add 100-300ms of pure network transit time.
The UX Impact
Loading Spinners Kill Engagement
Every loading spinner is a moment where the user can decide to do something else. On mobile, "something else" is one swipe away. The app switcher is always available.
A/B tests run by multiple mobile teams show that AI features with latencies over 1 second see 30-40% lower completion rates than sub-second features. The feature works identically. The only difference is perceived speed.
Streaming Helps, But Has Limits
Token-by-token streaming (Server-Sent Events) reduces perceived latency by showing output as it generates. The user sees the first few words quickly, which gives the impression of responsiveness.
Streaming improves the experience but does not eliminate the problem:
- Time to first token is still 500-2,000ms
- On weak connections, SSE streams buffer, creating visible stutter
- Each streamed chunk adds its own network overhead
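The stutter problem has a common client-side mitigation: instead of rendering chunks the instant they arrive, release tokens at a steady rate so network bursts do not flash onto the screen all at once. A minimal sketch of that pacing idea (the rate and the synchronous generator style are illustrative assumptions, not a specific library's API):

```python
import time
from typing import Iterable, Iterator

def paced(chunks: Iterable[str], tokens_per_second: float = 20.0) -> Iterator[str]:
    """Re-emit streamed chunks at a capped, steady rate to hide network jitter."""
    interval = 1.0 / tokens_per_second
    for chunk in chunks:
        yield chunk
        # Sleeping between yields caps the display rate; a burst of buffered
        # chunks now renders smoothly instead of appearing in one jump.
        time.sleep(interval)
```

Pacing trades a little extra total display time for a smoother perception; it does nothing for time to first token, which is why it complements rather than replaces the architectural fixes below.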
The Offline Gap
The worst latency is infinite. When the user has no connection, the cloud API returns nothing. The feature breaks completely.
This is not an edge case on mobile. Users are regularly in subways, elevators, airplanes, rural areas, and international locations without data. A feature that fails in these moments trains users not to rely on it.
The On-Device Alternative
On-device inference eliminates network latency entirely. The model runs on the user's phone. The path from input to output never touches the network.
| Metric | Cloud API | On-Device |
|---|---|---|
| Time to first token | 500-2,000ms | 50-200ms |
| Full response (100 tokens) | 2-5 seconds | 1-3 seconds |
| Offline | Fails | Works |
| Latency on weak connection | 3-10+ seconds | Same as strong connection |
| Consistency | Variable | Consistent |
The difference is most dramatic on the first token. Users perceive 50ms as instant. The response appears to begin "immediately" after tapping send.
Token Generation Speed
Modern mobile hardware generates tokens fast enough for a responsive chat experience:
| Device | 1B Model | 3B Model |
|---|---|---|
| iPhone 15 Pro | 35-45 tok/s | 18-25 tok/s |
| Galaxy S24 | 35-45 tok/s | 18-25 tok/s |
| iPhone 13 | 20-30 tok/s | 10-15 tok/s |
| Mid-range Android | 18-25 tok/s | 8-12 tok/s |
At 20+ tokens per second, text appears to stream smoothly. At 10+ tokens per second, it is readable and usable. Both exceed the threshold for a comfortable chat experience.
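These rates translate into end-to-end times with simple arithmetic: total time is roughly TTFT plus token count divided by generation speed. A quick sketch, using midpoint numbers consistent with the tables above (the specific TTFT and tok/s values are illustrative assumptions):

```python
def response_time(n_tokens: int, tok_per_s: float, ttft_s: float) -> float:
    """Estimated seconds for a full response: first-token wait + generation."""
    return ttft_s + n_tokens / tok_per_s

# 100-token reply on a recent phone running a 1B model (~40 tok/s, ~0.1s TTFT)
on_device = response_time(100, tok_per_s=40.0, ttft_s=0.1)  # ~2.6s
# Same reply through a cloud API on cellular (~50 tok/s stream, ~1.0s TTFT)
cloud = response_time(100, tok_per_s=50.0, ttft_s=1.0)      # ~3.0s
```

Note that most of the cloud total is spent before the first token appears, which is exactly the part of the wait users experience as a dead spinner.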
Measuring the Impact
Track these metrics to quantify the latency problem in your app:
P50/P95 time to first token: The median tells you the typical experience. The P95 tells you the worst 5% of users' experience. If P95 exceeds 3 seconds, 5% of your users are having a bad time on every interaction.
Feature completion rate: What percentage of users who start an AI interaction complete it (wait for and read the response)? Compare this to non-AI features in your app.
Retry rate: How often do users tap "send" again because they thought the first request failed? High retry rates indicate perceived timeouts.
AI feature retention (D7/D30): Are users who try the AI feature coming back to use it again? Low retention despite high initial trial suggests a UX problem, and latency is the most common cause.
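P50 and P95 fall straight out of the raw latency samples your client logs. A minimal sketch using only the Python standard library (the sample values are illustrative):

```python
import statistics
from typing import List, Tuple

def latency_percentiles(samples_ms: List[float]) -> Tuple[float, float]:
    """Return (P50, P95) of time-to-first-token samples in milliseconds."""
    # quantiles(n=100) yields the 1st..99th percentile cut points, so
    # index 49 is the 50th percentile and index 94 is the 95th.
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return q[49], q[94]

# Illustrative TTFT samples: mostly fast, with two weak-connection outliers
samples = [420, 510, 480, 650, 700, 530, 2900, 490, 560, 3100]
p50, p95 = latency_percentiles(samples)
```

On data like this the median looks healthy while the P95 lands in the multi-second range, which is the pattern that hides a bad experience for your worst-connected users.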
The Fix
The architectural fix is to move inference to the device. Fine-tune a small model (1-3B parameters) on your specific task, export as GGUF, and run it locally via llama.cpp.
The path:
- Measure your current cloud API latency (P50 and P95)
- Identify which AI features are latency-sensitive (any feature where users wait for a response)
- Collect training data from your existing API logs
- Fine-tune a domain-specific model using a platform like Ertas
- Deploy on-device and measure the latency improvement
- A/B test user engagement between cloud and on-device
The latency improvement alone often drives a measurable increase in feature engagement. When you add offline support and cost elimination, the case becomes definitive.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.