
AI in React Native: From Cloud APIs to On-Device Models
How to add AI features to React Native apps. Cloud API integration with fetch, on-device inference with llama.cpp bindings, and a practical migration path from one to the other.
React Native gives you one codebase for iOS and Android. Adding AI features should follow the same principle. But the path from "call an API" to "run inference on-device" is different in React Native than in native Swift or Kotlin. The JavaScript bridge, the native module system, and the cross-platform model delivery all require specific patterns.
This guide covers both approaches and the practical migration between them.
Cloud API Integration
The fastest way to add AI to a React Native app is calling a cloud API. React Native's fetch API works directly with OpenAI, Anthropic, Google Gemini, and other providers.
Basic Pattern
async function generateResponse(prompt: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content;
}
This works on both iOS and Android with zero platform-specific code. For streaming responses, use the EventSource pattern or a library like react-native-sse.
Streaming for Chat UIs
import EventSource from "react-native-sse";

function streamResponse(prompt: string, onToken: (token: string) => void) {
  const es = new EventSource("https://api.openai.com/v1/chat/completions", {
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    method: "POST",
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }),
  });

  es.addEventListener("message", (event) => {
    if (event.data === "[DONE]") {
      es.close();
      return;
    }
    const parsed = JSON.parse(event.data);
    const token = parsed.choices[0]?.delta?.content;
    if (token) onToken(token);
  });
}
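Wiring this into a chat screen is a matter of appending tokens to component state. A minimal sketch, assuming a React function component with a responseText state value (the handleSend name is illustrative):

import { useState } from "react";

// Inside a function component: append each streamed token to local state
const [responseText, setResponseText] = useState("");

function handleSend(prompt: string) {
  setResponseText(""); // Clear the previous answer before streaming a new one
  streamResponse(prompt, (token) => {
    setResponseText((prev) => prev + token);
  });
}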
The Cloud API Ceiling
Cloud APIs work well for prototyping and low-volume apps. The problems are the same as in native apps, but slightly worse in React Native:
- Network dependency: React Native apps are often built for cross-platform reach, including markets with unreliable connectivity
- Latency: The JS bridge adds ~5-10ms on top of the 500-3,000ms network round trip
- Cost scaling: Every user, every request, every token costs money
- Privacy: User data crosses the network on every API call
On-Device AI in React Native
Running a model locally in React Native means using a native module that wraps llama.cpp. The JavaScript side sends prompts and receives tokens. The native side handles the actual inference on the device's CPU and GPU.
llama.rn
The llama.rn package provides React Native bindings for llama.cpp. It exposes a JavaScript API that loads GGUF models and runs inference natively on both iOS (Metal) and Android (CPU/Vulkan).
import { initLlama } from "llama.rn";

// Load the model
const context = await initLlama({
  model: modelPath, // Path to .gguf file on device
  n_ctx: 2048,
  n_threads: 4,
  n_gpu_layers: 32,
});

// Generate a response
const result = await context.completion({
  prompt: "Summarize this note: ...",
  n_predict: 256,
  temperature: 0.7,
});

console.log(result.text);
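A loaded context holds the model weights in native memory, so free it when the feature is no longer in use. llama.rn exposes a release method on the context for this; the snippet below assumes the current API shape:

// Free the native memory held by the model once you are done with the context
await context.release();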
Streaming Tokens
const result = await context.completion(
  {
    prompt: userPrompt,
    n_predict: 512,
  },
  (token) => {
    // Called for each generated token
    setResponseText((prev) => prev + token.token);
  }
);
This gives you the same token-by-token streaming experience as a cloud API, but with 50-200ms time to first token instead of 500-3,000ms.
Model Delivery
Getting the GGUF model file onto the device is the main engineering challenge in React Native.
Bundled with app: Include the model in the app's assets. For iOS, add it to the Xcode project. For Android, place it in the assets directory or use Android Asset Delivery for files over 150MB. React Native's asset system can reference the file path at runtime.
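One wrinkle: llama.rn needs a real filesystem path, and Android assets are not plain files. A minimal sketch using react-native-fs, assuming the model was added to the Xcode bundle and the Android assets directory (the resolveBundledModel helper is illustrative):

import { Platform } from "react-native";
import RNFS from "react-native-fs";

// Illustrative helper: returns a path that llama.rn can load on either platform
async function resolveBundledModel(fileName: string): Promise<string> {
  if (Platform.OS === "ios") {
    // iOS bundle resources are readable directly from the main bundle
    return `${RNFS.MainBundlePath}/${fileName}`;
  }
  // Android: copy the asset into app storage once, then reuse the copy
  const dest = `${RNFS.DocumentDirectoryPath}/${fileName}`;
  if (!(await RNFS.exists(dest))) {
    await RNFS.copyFileAssets(fileName, dest);
  }
  return dest;
}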
Post-install download: Download the model on first launch. Use react-native-blob-util or expo-file-system for background downloads with progress tracking:
import * as FileSystem from "expo-file-system";

const modelUri = FileSystem.documentDirectory + "model.gguf";

const download = FileSystem.createDownloadResumable(
  MODEL_CDN_URL,
  modelUri,
  {},
  (progress) => {
    const pct = progress.totalBytesWritten / progress.totalBytesExpectedToWrite;
    setDownloadProgress(pct);
  }
);

const result = await download.downloadAsync();
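To avoid pulling a multi-gigabyte file on every launch, check whether the model already exists before starting the download. A sketch using expo-file-system's getInfoAsync (the ensureModel name is illustrative):

// Illustrative guard: only download when the model is not already on disk
async function ensureModel(): Promise<string> {
  const info = await FileSystem.getInfoAsync(modelUri);
  if (info.exists) return modelUri;
  const result = await download.downloadAsync();
  return result?.uri ?? modelUri;
}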
Performance Expectations
On-device performance in React Native is nearly identical to that of a fully native app. The llama.cpp inference runs in native code, not through the JS bridge. The bridge is only used for sending prompts and receiving tokens.
| Device | 1B Model (tok/s) | 3B Model (tok/s) |
|---|---|---|
| iPhone 15 Pro (A17) | 35-45 | 18-25 |
| iPhone 14 (A15) | 25-35 | 12-18 |
| Galaxy S24 (SD 8 Gen 3) | 35-45 | 18-25 |
| Pixel 8 (Tensor G3) | 25-35 | 12-18 |
| Mid-range Android (SD 7 Gen 3) | 18-25 | 8-12 |
The JS bridge overhead for token delivery is negligible (under 1ms per token).
Architecture: Abstracting the AI Layer
The best React Native architecture abstracts the AI provider behind a common interface. This lets you swap between cloud and on-device without changing your UI code.
interface AiProvider {
  generate(prompt: string, onToken?: (token: string) => void): Promise<string>;
  isReady(): boolean;
}

class CloudProvider implements AiProvider {
  async generate(prompt: string, onToken?: (token: string) => void) {
    // Cloud API call
  }
  isReady() { return true; } // Always ready if online
}

class OnDeviceProvider implements AiProvider {
  private modelLoaded = false; // Set to true once initLlama resolves

  async generate(prompt: string, onToken?: (token: string) => void) {
    // llama.rn inference
  }
  isReady() { return this.modelLoaded; }
}
This pattern supports a gradual migration. Start with CloudProvider for validation. Add OnDeviceProvider when you are ready. A/B test. Eventually make on-device the default with cloud as a fallback.
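The runtime switch can be as small as a selection function that prefers the on-device provider and falls back to the cloud. A sketch (the getProvider name and the provider instances are illustrative, not part of llama.rn):

// Illustrative: prefer on-device inference, fall back to the cloud provider
function getProvider(onDevice: OnDeviceProvider, cloud: CloudProvider): AiProvider {
  return onDevice.isReady() ? onDevice : cloud;
}

// Assuming onDeviceProvider and cloudProvider were constructed elsewhere
const provider = getProvider(onDeviceProvider, cloudProvider);
const answer = await provider.generate("Summarize this note: ...");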
The Cross-Platform Advantage
React Native's cross-platform model works in your favor for on-device AI. You fine-tune one model, export one GGUF file, and deploy it to both iOS and Android. The inference library handles the platform differences (Metal on iOS, CPU/Vulkan on Android).
Compare this to cloud APIs where you pay per-token on every platform. On-device, you pay once for fine-tuning and model distribution. The cost does not multiply with platforms.
Migration Path
- Start with cloud API via fetch. Validate the feature, collect usage data.
- Add the abstraction layer (AiProvider interface) early. This costs nothing and pays off later.
- Collect training data from your API logs. Every cloud API call is a potential training example.
- Fine-tune a small model on your domain data. Platforms like Ertas provide a visual pipeline: upload data, train with LoRA, export GGUF.
- Integrate llama.rn and add the on-device provider behind the same interface.
- A/B test cloud vs on-device on real users.
- Ship on-device as default with cloud fallback for unsupported devices.
The end result: one codebase, two platforms, AI that works offline, zero per-inference cost.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.