
    How to Add AI to Your Mobile App: A Developer's Decision Guide

    A comprehensive guide covering every approach to adding AI features to iOS and Android apps. Cloud APIs, on-device models, and hybrid architectures compared with real cost and performance data.

    Ertas Team

    You want to add AI features to your mobile app. Maybe an in-app assistant, smart search, content drafting, or classification. The question is not whether to add AI. The question is how.

    There are three fundamentally different approaches, each with different cost structures, performance characteristics, and trade-offs. Choosing the wrong one will cost you either money or months of rework. This guide helps you choose the right one before you write any code.

    The Three Approaches

    1. Cloud APIs (OpenAI, Anthropic, Google)

    The fastest way to add AI to your app. Make an HTTP request to a cloud endpoint, get a response back. OpenAI's GPT-4o, Anthropic's Claude, and Google's Gemini are the most popular options.

    How it works: Your app sends the user's input to a cloud server. The server runs inference on a large model. The response comes back over the network. Your app displays it.
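    To make that concrete, here is a minimal sketch of the round trip from Android in Kotlin, using OkHttp and the platform's org.json classes against OpenAI's chat completions endpoint. Error handling, streaming, and key management are stripped down for illustration, and the call must run off the main thread.

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject

// One round trip to a cloud LLM. Every call is metered per token,
// so cost grows with every user and every interaction.
fun askCloudModel(userInput: String, apiKey: String): String {
    val client = OkHttpClient()
    val payload = JSONObject()
        .put("model", "gpt-4o-mini")
        .put("messages", JSONArray().put(
            JSONObject().put("role", "user").put("content", userInput)
        ))
        .toString()
    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .header("Authorization", "Bearer $apiKey")
        .post(payload.toRequestBody("application/json".toMediaType()))
        .build()
    client.newCall(request).execute().use { response ->
        val json = JSONObject(response.body!!.string())
        return json.getJSONArray("choices")
            .getJSONObject(0)
            .getJSONObject("message")
            .getString("content")
    }
}
```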

    What it costs: Per-token pricing. Every request, every user, every interaction costs money. GPT-4o charges $2.50 per million input tokens and $10.00 per million output tokens. GPT-4o-mini is cheaper at $0.15/$0.60. Gemini Flash is the most affordable at $0.10/$0.40.

    At 10,000 monthly active users with a typical AI assistant (3 interactions per day, 1,000 tokens per interaction, split evenly between input and output), monthly costs range from $225 (Gemini Flash) to $5,625 (GPT-4o). These costs scale linearly with every user you add.

    When to use it: Prototyping and validation. Tasks requiring frontier-model reasoning. Very low volume apps (fewer than 1,000 MAU). Features needing real-time access to live data.

    When to avoid it: Any app where AI is a core feature used frequently. Privacy-sensitive use cases. Apps requiring offline functionality. Cost-sensitive scaling scenarios.

    2. On-Device Models (Fine-Tuned + llama.cpp)

    Run AI inference directly on the user's phone. A fine-tuned model stored on the device handles requests locally. No network required. No per-request cost.

    How it works: You fine-tune a small language model (1-3 billion parameters) on your specific task using LoRA adapters. Export it as a GGUF file. Ship it with your app or download it post-install. The model runs on the device via llama.cpp, using the phone's CPU and GPU.
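    There is no single official mobile API for this step, so the binding layer is something you write or adopt from the community. The sketch below assumes a hypothetical JNI wrapper over llama.cpp; the function and library names are illustrative, not a published interface.

```kotlin
import android.content.Context
import java.io.File

// Hypothetical JNI bridge over llama.cpp. These declarations are
// illustrative: real projects expose llama.cpp's C API through their
// own native library or a community binding.
object LlamaBridge {
    init { System.loadLibrary("llama_bridge") } // assumed native lib name
    external fun loadModel(ggufPath: String): Long // returns a native handle
    external fun generate(handle: Long, prompt: String, maxTokens: Int): String
    external fun free(handle: Long)
}

fun draftReplyLocally(context: Context, message: String): String {
    // GGUF file shipped with the app or downloaded post-install
    // into app-private storage.
    val modelPath = File(context.filesDir, "assistant-3b-q4_k_m.gguf").absolutePath
    val handle = LlamaBridge.loadModel(modelPath)
    try {
        // No network, no per-request cost: inference runs on-device.
        return LlamaBridge.generate(handle, "Draft a reply to: $message", 128)
    } finally {
        LlamaBridge.free(handle)
    }
}
```

    In production you would load the model once at startup and reuse the handle rather than paying the load cost on every request.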

    What it costs: A one-time fine-tuning cost of $5-50 per training run. Model distribution via CDN (roughly $0.08 per GB, amortized across users). After that, inference is free forever. Zero per-request cost regardless of how many users you have or how often they use the feature.

    At 10,000 MAU: $0 per month for inference (versus $225-$5,625 for cloud APIs).

    When to use it: High-volume AI features (chat, search, classification). Privacy-sensitive data (health, finance, personal messages). Apps needing offline support. Domain-specific tasks where a fine-tuned 3B model outperforms generic GPT-4 prompting (94% vs 71% accuracy on domain tasks, per published benchmarks).

    When to avoid it: Tasks requiring frontier-model reasoning on novel inputs. Features needing access to real-time external data. Ultra-constrained devices with fewer than 4GB RAM.

    3. Hybrid Architecture

    Route requests based on complexity. Simple, high-volume tasks go to the on-device model. Complex, low-frequency tasks go to a cloud API.

    How it works: Your app evaluates each request and routes it to either the local model or a cloud endpoint. The routing can be rule-based (task type) or confidence-based (if the local model's confidence is below a threshold, escalate to cloud).
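    A confidence-based router can be only a few lines. This sketch assumes your local model exposes some confidence signal (for example, derived from output token probabilities) and that the 0.8 cutoff is tuned empirically; both are placeholders.

```kotlin
// Confidence-based routing sketch. `local` and `cloud` stand in for the
// two backends; the threshold is a tunable placeholder, not a standard value.
data class LocalResult(val text: String, val confidence: Float)

class AiRouter(
    private val local: (String) -> LocalResult, // free on-device inference
    private val cloud: (String) -> String,      // metered cloud API call
    private val threshold: Float = 0.8f
) {
    fun answer(prompt: String): String {
        val result = local(prompt)
        // High-confidence local answers cost nothing; only the uncertain
        // tail of requests escalates to the per-token cloud endpoint.
        return if (result.confidence >= threshold) result.text else cloud(prompt)
    }
}
```

    A rule-based variant replaces the confidence check with a switch on task type: classification and drafting stay local, open-ended reasoning goes to the cloud.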

    What it costs: On-device inference for 80-90% of requests at zero marginal cost. Cloud API charges only for the 10-20% of requests that genuinely need frontier capability.

    When to use it: Apps with a mix of simple and complex AI tasks. Gradual migration from cloud to on-device. When you need cloud as a fallback during initial on-device deployment.

    The Decision Matrix

    | Factor | Cloud API | On-Device | Hybrid |
    | --- | --- | --- | --- |
    | Setup time | Hours | Days | Days |
    | Cost at 1K MAU | $23-$563/mo | ~$0/mo | $2-$56/mo |
    | Cost at 100K MAU | $2,250-$56,250/mo | ~$0/mo | $225-$5,625/mo |
    | Latency (time to first token) | 500ms-3,000ms | 50-200ms | Varies by route |
    | Offline support | No | Yes | Partial |
    | Privacy | Data sent to third party | Data stays on-device | Partial |
    | Model quality (general tasks) | Highest | Good (fine-tuned) | Best of both |
    | Model quality (domain tasks) | Good | Highest (fine-tuned) | Highest |
    | Vendor dependency | High | None | Low |
    | Model update speed | Instant (API-side) | OTA push (hours) | Mixed |

    What Actually Runs on a Phone?

    Modern smartphones are more capable than most developers expect. An iPhone 15 Pro (A17 Pro, 8GB RAM) runs a 3 billion parameter model at 20-30 tokens per second. A Pixel 8 Pro (Tensor G3, 12GB) achieves similar performance. That is fast enough for real-time chat, instant classification, and responsive content generation.

    The key constraint is RAM. A 3B model quantized to 4 bits (Q4_K_M) is a roughly 1.7GB file and needs about 2.2GB of RAM at runtime once the inference engine and context cache are included. Most flagship phones from the last two years have 6-12GB. After the OS and other apps, there is enough headroom for a model of this size.

    For reference, here are practical model sizes at Q4 quantization:

    | Model Size | GGUF File Size (Q4) | RAM Required | Device Tier |
    | --- | --- | --- | --- |
    | 1B parameters | ~600MB | ~800MB | Mid-range (2023+) |
    | 3B parameters | ~1.7GB | ~2.2GB | Flagship (2022+) |
    | 7B parameters | ~4.0GB | ~5.0GB | High-end flagship only |

    The 1-3B range is the practical sweet spot for mobile deployment in 2026.
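    Because RAM is the gating factor, a practical pattern is to choose the model tier at first launch from the device's total memory, before downloading any GGUF file. Here is a minimal Android sketch; the thresholds are rules of thumb mirroring the table above, and the file names are placeholders.

```kotlin
import android.app.ActivityManager
import android.content.Context

// Pick a model tier from total device RAM before downloading weights.
// Thresholds are rules of thumb, not hard limits.
fun chooseModelTier(context: Context): String {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo()
    am.getMemoryInfo(info)
    val totalGb = info.totalMem / (1 shl 30).toDouble()
    return when {
        totalGb >= 8.0 -> "model-3b-q4_k_m.gguf" // ~2.2GB working set
        totalGb >= 6.0 -> "model-1b-q4_k_m.gguf" // ~800MB working set
        else -> "cloud-only"                     // not enough headroom for local
    }
}
```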

    The Cost Curve

    The economics of cloud APIs versus on-device models follow a predictable pattern. At very low volume (fewer than 100 MAU), cloud APIs are cheaper because the fine-tuning cost ($5-50) exceeds the monthly API bill. The crossover happens fast.

    With GPT-4o-mini at $0.15/$0.60 per million tokens and a typical mobile assistant pattern (3 interactions/day, 1,000 tokens each; the arithmetic is sketched after this list):

    • 100 MAU: Cloud costs $3.37/mo. On-device costs $0. Fine-tuning break-even in 2-15 months.
    • 1,000 MAU: Cloud costs $33.75/mo. Break-even in the first month.
    • 10,000 MAU: Cloud costs $337.50/mo. Fine-tuning was paid off in the first billing cycle.
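    These figures fall straight out of the token math. The sketch below reproduces them, assuming the 1,000 tokens per interaction split evenly between input and output (the split implied by the numbers above).

```kotlin
// Reproduces the bullets above. Assumes a 50/50 input/output token split,
// which is the split the figures in this post imply.
fun monthlyCloudCostUsd(
    mau: Long,
    interactionsPerDay: Int = 3,
    tokensPerInteraction: Int = 1_000,
    inputPerMillionUsd: Double = 0.15,  // GPT-4o-mini input price
    outputPerMillionUsd: Double = 0.60  // GPT-4o-mini output price
): Double {
    val tokensPerMonth = mau * interactionsPerDay * 30 * tokensPerInteraction
    val blendedPerMillion = (inputPerMillionUsd + outputPerMillionUsd) / 2
    return tokensPerMonth / 1_000_000.0 * blendedPerMillion
}

// monthlyCloudCostUsd(100)    == 3.375
// monthlyCloudCostUsd(1_000)  == 33.75
// monthlyCloudCostUsd(10_000) == 337.5
```

    Plugging in GPT-4o prices ($2.50/$10.00) shows the same pattern at a much higher level.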

    With GPT-4o, the break-even comes even faster because the monthly cost is roughly 17x higher at the listed prices.

    The key insight: cloud APIs are a variable cost that grows with every user. On-device inference is a fixed cost that does not. This fundamentally changes your unit economics.

    The Industry Direction

    The trajectory is clear. Apple invested heavily in on-device ML with CoreML and Neural Engine optimization. Google launched Gemini Nano specifically for on-device inference. Meta released Llama 3.2 with 1B and 3B models designed for mobile. Qualcomm, MediaTek, and Samsung are building dedicated NPUs into their chipsets.

    The tooling ecosystem has matured. llama.cpp provides production-grade inference across iOS and Android. GGUF has become the standard format for portable model deployment. Fine-tuning with LoRA is accessible to developers without ML backgrounds.

    The remaining barrier is the fine-tuning step itself. Preparing training data, running the fine-tuning job, and exporting to GGUF still involves multiple tools and some ML knowledge. Platforms like Ertas are closing this gap by providing a visual interface that handles the full pipeline: upload your data, fine-tune on cloud GPUs, export as GGUF, and ship in your app. No code, no ML expertise, setup in about 2 minutes.

    Where to Start

    If you are starting from zero, begin with a cloud API. It validates the feature and user demand with minimal investment. Build the feature, ship it, confirm users engage with it.

    Once you have validated the feature and have real usage data, you also have real training data. Your API logs are your fine-tuning dataset. Move to on-device when: your API costs are material, your users need offline access, or privacy requirements demand it.
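    What that extraction can look like, as a minimal sketch: assume each log line holds a tab-separated prompt and response (your logging format will differ), and the output is JSONL with prompt/completion pairs, one common fine-tuning shape.

```kotlin
import java.io.File
import org.json.JSONObject

// Turn logged prompt/response pairs into a JSONL fine-tuning dataset.
// The tab-separated log format here is an assumption; adapt the parsing
// to however your backend actually records API traffic.
fun logsToJsonl(logFile: File, outFile: File) {
    outFile.bufferedWriter().use { writer ->
        logFile.forEachLine { line ->
            val parts = line.split('\t', limit = 2)
            if (parts.size < 2) return@forEachLine // skip malformed rows
            val record = JSONObject()
                .put("prompt", parts[0])
                .put("completion", parts[1])
            writer.write(record.toString())
            writer.newLine()
        }
    }
}
```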

    The migration path is well-defined: extract training data from API logs, fine-tune a small model, integrate llama.cpp, A/B test against your cloud baseline, then migrate. Many developers report the full migration takes 2-4 weeks.

    The right approach depends on where you are. But if you are building a mobile app with AI features that users will use daily, the math points toward on-device inference for the core workload.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
