
On-Device AI in React Native with llama.rn
How to run language models directly on the user's phone in a React Native app. Setup, model loading, streaming generation, and cross-platform considerations using llama.rn.
llama.rn is a React Native library that provides JavaScript bindings to llama.cpp. It runs GGUF language models natively on both iOS (Metal) and Android (CPU/Vulkan) through the same JavaScript API.
For React Native developers, this means on-device AI with one codebase, one API, and zero cloud dependency.
Installation
npm install llama.rn
# or
yarn add llama.rn
For iOS, run pod install:
cd ios && pod install
For Android, the native library is included automatically via autolinking.
Expo
If you are using Expo, you need a development build (not Expo Go) since llama.rn includes native code:
npx expo prebuild
npx expo run:ios # or run:android
Loading a Model
import { initLlama, LlamaContext } from "llama.rn";
let context: LlamaContext | null = null;
async function loadModel(modelPath: string) {
context = await initLlama({
model: modelPath,
n_ctx: 2048, // Context window
n_threads: 4, // CPU threads
n_gpu_layers: 99, // Offload to GPU (Metal/Vulkan)
use_mlock: true, // Lock model in memory
});
console.log("Model loaded successfully");
}
Model Path
The model path must point to a local file on the device's filesystem. How you get the file there depends on your delivery strategy:
Bundled with app:
import RNFS from "react-native-fs";
import { Platform } from "react-native";
// iOS: ship the file inside the app bundle and reference it via RNFS.
// Android: copy it from assets into the files directory on first launch.
const modelPath = Platform.OS === "ios"
? `${RNFS.MainBundlePath}/model.gguf`
: `${RNFS.DocumentDirectoryPath}/model.gguf`;
Downloaded post-install:
import RNFS from "react-native-fs";
const modelUrl = "https://cdn.example.com/model.gguf";
const modelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;
const download = RNFS.downloadFile({
fromUrl: modelUrl,
toFile: modelPath,
progress: (res) => {
const percentage = (res.bytesWritten / res.contentLength) * 100;
setDownloadProgress(percentage);
},
});
await download.promise;
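One caveat in the progress callback: RNFS reports `contentLength` as `-1` (or `0`) when the server omits a Content-Length header, so the naive division can yield `NaN` or `Infinity`. A small clamping helper avoids feeding garbage into the UI; the name `downloadPercent` is our own, not part of react-native-fs:

```typescript
// Compute a 0-100 progress percentage, tolerating servers that
// omit Content-Length (RNFS then reports contentLength <= 0).
function downloadPercent(bytesWritten: number, contentLength: number): number {
  if (contentLength <= 0) return 0; // unknown total: treat as indeterminate
  const pct = (bytesWritten / contentLength) * 100;
  return Math.min(100, Math.max(0, pct));
}
```

Call it inside the `progress` callback in place of the inline division.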
Generating Text
Simple Generation
async function generate(prompt: string): Promise<string> {
if (!context) throw new Error("Model not loaded");
const result = await context.completion({
prompt: prompt,
n_predict: 256,
temperature: 0.7,
top_p: 0.9,
stop: ["</s>", "<|eot_id|>"], // Stop tokens
});
return result.text;
}
Streaming Generation
async function generateStream(
prompt: string,
onToken: (token: string) => void
): Promise<string> {
if (!context) throw new Error("Model not loaded");
const result = await context.completion(
{
prompt: prompt,
n_predict: 256,
temperature: 0.7,
stop: ["</s>", "<|eot_id|>"],
},
(data) => {
// Called for each generated token
onToken(data.token);
}
);
return result.text;
}
Chat Completion
For multi-turn conversations, format the prompt using the model's chat template:
interface Message {
role: "system" | "user" | "assistant";
content: string;
}
function formatChat(messages: Message[]): string {
// Llama 3.2 chat template
let prompt = "<|begin_of_text|>";
for (const msg of messages) {
prompt += `<|start_header_id|>${msg.role}<|end_header_id|>\n\n${msg.content}<|eot_id|>`;
}
prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n";
return prompt;
}
async function chat(messages: Message[], onToken: (token: string) => void) {
const prompt = formatChat(messages);
return generateStream(prompt, onToken);
}
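To make the template concrete, here is what the formatting produces for a short exchange (the function is reproduced inline so the snippet stands alone):

```typescript
type Role = "system" | "user" | "assistant";
interface Message { role: Role; content: string; }

// Llama 3.2 chat template, as in formatChat above
function formatChat(messages: Message[]): string {
  let prompt = "<|begin_of_text|>";
  for (const msg of messages) {
    prompt += `<|start_header_id|>${msg.role}<|end_header_id|>\n\n${msg.content}<|eot_id|>`;
  }
  return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n";
}

const prompt = formatChat([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Hi!" },
]);
// The result contains both turns, each closed with <|eot_id|>, and ends
// with an open assistant header, so the model generates the assistant reply.
```

If you switch models, swap in that model's template; mismatched templates are a common cause of rambling or malformed output.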
React Hook Pattern
import { useState, useCallback, useRef } from "react";
import { initLlama, LlamaContext } from "llama.rn";
export function useLlama(modelPath: string) {
const contextRef = useRef<LlamaContext | null>(null);
const [isLoaded, setIsLoaded] = useState(false);
const [isGenerating, setIsGenerating] = useState(false);
const [response, setResponse] = useState("");
const load = useCallback(async () => {
contextRef.current = await initLlama({
model: modelPath,
n_ctx: 2048,
n_threads: 4,
n_gpu_layers: 99,
});
setIsLoaded(true);
}, [modelPath]);
const generate = useCallback(async (prompt: string) => {
if (!contextRef.current) return;
setIsGenerating(true);
setResponse("");
await contextRef.current.completion(
{
prompt,
n_predict: 256,
temperature: 0.7,
stop: ["</s>", "<|eot_id|>"],
},
(data) => {
setResponse((prev) => prev + data.token);
}
);
setIsGenerating(false);
}, []);
const unload = useCallback(() => {
contextRef.current?.release();
contextRef.current = null;
setIsLoaded(false);
}, []);
return { load, generate, unload, isLoaded, isGenerating, response };
}
Usage in a Component
import { useState, useEffect } from "react";
import { View, ScrollView, Text, TextInput, Button } from "react-native";
function AiChat({ modelPath }: { modelPath: string }) {
const { load, generate, unload, isLoaded, isGenerating, response } =
useLlama(modelPath);
const [input, setInput] = useState("");
useEffect(() => {
load();
return () => unload();
}, []);
return (
<View style={styles.container}>
<ScrollView style={styles.responseArea}>
<Text>{response}</Text>
</ScrollView>
<View style={styles.inputRow}>
<TextInput
value={input}
onChangeText={setInput}
style={styles.input}
editable={!isGenerating}
/>
<Button
title="Send"
onPress={() => {
generate(input);
setInput("");
}}
disabled={!isLoaded || isGenerating}
/>
</View>
</View>
);
}
Cross-Platform Considerations
Performance Parity
llama.rn runs the same native llama.cpp code on both platforms. The JavaScript bridge only carries prompts in and tokens out, so inference performance is effectively the same as a native Swift/Kotlin integration:
| Device | 1B Model (tok/s) | 3B Model (tok/s) |
|---|---|---|
| iPhone 15 Pro | 35-45 | 18-25 |
| iPhone 14 | 25-32 | 14-18 |
| Galaxy S24 | 35-45 | 18-25 |
| Mid-range Android | 18-25 | 8-12 |
The JS bridge adds under 1ms of overhead per token, which is negligible at typical generation speeds.
Model File Path Differences
iOS and Android store files in different locations. Use react-native-fs to get platform-appropriate paths:
import RNFS from "react-native-fs";
import { Platform } from "react-native";
// DocumentDirectoryPath resolves to the right location on each platform:
// the app's Documents directory on iOS, the app's files directory on Android.
const modelDir = RNFS.DocumentDirectoryPath;
Memory Management
React Native does not expose direct memory APIs. Use platform-specific checks via a native module if you need to verify available RAM before loading. Alternatively, catch load failures and show an appropriate message.
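One way to act on that advice without writing a native module is to choose init parameters from total device RAM and keep the thresholds conservative. The cutoffs below are illustrative guesses, not benchmarks, and the RAM figure would come from something like react-native-device-info's getTotalMemory() or your own native check:

```typescript
// Pick conservative llama.rn init parameters from total device RAM.
// Thresholds are illustrative assumptions -- tune them per model size.
function chooseInitParams(totalRamBytes: number): { n_ctx: number; n_gpu_layers: number } {
  const gb = totalRamBytes / 1024 ** 3;
  if (gb < 4) return { n_ctx: 1024, n_gpu_layers: 0 }; // low-end: CPU only, short context
  if (gb < 6) return { n_ctx: 2048, n_gpu_layers: 99 };
  return { n_ctx: 4096, n_gpu_layers: 99 };
}
```

Whatever parameters you pick, still wrap the initLlama call in try/catch: memory pressure at load time can make it fail even on devices that pass the RAM check.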
Model Delivery in React Native
Strategy 1: Bundle for Small Models
For 1B models (~600MB), bundling with the app is feasible:
- iOS: Add to Xcode project as a resource
- Android: Use Play Asset Delivery, since model files exceed Google Play's base download size limit
Strategy 2: Download for All Models
For 3B models (~1.7GB) or to keep the initial download small:
async function ensureModelReady(): Promise<string> {
const modelPath = `${RNFS.DocumentDirectoryPath}/model.gguf`;
const exists = await RNFS.exists(modelPath);
if (exists) return modelPath;
// Download with progress
await RNFS.downloadFile({
fromUrl: MODEL_CDN_URL,
toFile: modelPath,
progress: (res) => {
updateProgress(res.bytesWritten / res.contentLength);
},
}).promise;
// Verify integrity
const hash = await RNFS.hash(modelPath, "sha256");
if (hash !== EXPECTED_HASH) {
await RNFS.unlink(modelPath);
throw new Error("Model download corrupted");
}
return modelPath;
}
Production Best Practices
- Load model lazily. Only load when the user accesses the AI feature.
- Unload on blur. Release model memory when the AI screen is not focused.
- Handle errors gracefully. Model load can fail on low-memory devices. Show a clear message.
- Verify downloads. SHA256 hash check after download. Corrupted models cause crashes.
- Buffer tokens. Batch 2-3 tokens before UI update for smoother text display.
- Cancel support. Allow users to stop generation mid-stream.
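The token-buffering point can be sketched as a small helper that batches tokens before invoking the UI callback, so React re-renders once per batch instead of once per token. The class name and default batch size are our own:

```typescript
// Accumulate streamed tokens and emit them in batches of `batchSize`,
// reducing setState churn during fast generation.
class TokenBuffer {
  private pending = "";
  private count = 0;

  constructor(
    private emit: (chunk: string) => void,
    private batchSize = 3 // tokens per UI update
  ) {}

  push(token: string): void {
    this.pending += token;
    this.count += 1;
    if (this.count >= this.batchSize) this.flush();
  }

  // Emit whatever is pending; call once more after completion resolves
  // so the tail of the response is not dropped.
  flush(): void {
    if (this.pending) this.emit(this.pending);
    this.pending = "";
    this.count = 0;
  }
}
```

Inside the completion callback you would call buffer.push(data.token) instead of updating state directly, then buffer.flush() after the completion promise resolves.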
The model quality depends on fine-tuning. A base model gives generic responses. A model fine-tuned on your domain data (via Ertas or similar platforms) gives responses tailored to your app's specific use case, running at the same speed on the same hardware.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.